How to use TruLens

Developers can use TruLens as they build their LLM applications in Python by following the steps below. The example used here is a Question-Answering (QA) application. You can also go to the TruLens for LLMs Quick Start to get started right away.


Figure 1: The workflow for integrating TruLens into LLM app development

1. Build your LLM app

When using TruLens, you build the first version of your LLM app following your standard workflow.

We built a Question Answering app named TruBot following the widely used paradigm of retrieval-augmented LLMs. This approach grounds the app's responses in a source of truth or knowledge base – TruEra’s website in this case. It involved chaining together the OpenAI LLM with the Pinecone vector database in the LangChain framework (see Appendix A for more details).
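The code in Figure 2 is not reproduced here, but the retrieval-augmented pattern it implements can be sketched in plain Python. The sketch below is a toy stand-in, with an in-memory "knowledge base", a keyword-overlap retriever in place of Pinecone, and a stubbed LLM call in place of OpenAI; the real app chains these components with LangChain, and every name below is illustrative rather than a TruLens or LangChain API.

```python
# Toy sketch of a retrieval-augmented QA app (illustrative only).
# The real TruBot app uses LangChain to chain an OpenAI LLM with a
# Pinecone vector database; here both are replaced by simple stand-ins.

KNOWLEDGE_BASE = [
    "TruEra provides AI Quality software for testing ML models.",
    "TruLens is an open-source library for evaluating LLM apps.",
    "The colossal squid has the largest eyeball of any living creature.",
]

def tokens(text: str) -> set[str]:
    """Lowercase word set, with trailing punctuation stripped."""
    return {w.strip("?.,!").lower() for w in text.split()}

def retrieve(question: str, k: int = 2) -> list[str]:
    """Stand-in for the vector database: rank chunks by keyword overlap."""
    q = tokens(question)
    ranked = sorted(KNOWLEDGE_BASE, key=lambda c: len(q & tokens(c)), reverse=True)
    return ranked[:k]

def llm_answer(question: str, context: list[str]) -> str:
    """Stub for the LLM call: just echo the top-ranked chunk."""
    return context[0]

def qa_app(question: str) -> str:
    """The 'chain': retrieve grounding chunks, then answer from them."""
    return llm_answer(question, retrieve(question))

print(qa_app("What is TruLens?"))
```

The point of the structure is the grounding step: the answer is produced from retrieved chunks rather than from the model's parametric knowledge alone.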

The code for the key steps in implementing this app is shown in Figure 2.


Figure 2: Code snippet of chained QA LLM app.

2. Connect your LLM app to TruLens and log inputs and responses

The next step is to instrument the app with TruLens to log inputs and responses from the chain. This step is straightforward: it involves wrapping the previously created chain, running it, and logging the interaction in just three lines of code.
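Figure 3 is not reproduced here. As a rough illustration of the wrap-run-log pattern it describes (and not the actual TruLens API, which provides a purpose-built wrapper for LangChain chains), a generic instrumentation wrapper might look like this:

```python
# Illustrative wrap-and-log pattern (not the actual TruLens API).
# TruLens wraps the chain for you; this sketch shows the idea with a
# hand-rolled wrapper and an in-memory list standing in for the log store.

records = []  # stand-in for the TruLens logging database

class LoggedChain:
    """Wraps any callable chain (prompt -> response) and logs each call."""

    def __init__(self, chain):
        self.chain = chain

    def __call__(self, prompt: str) -> str:
        response = self.chain(prompt)
        records.append({"prompt": prompt, "response": response})
        return response

# Usage: wrap the chain, run it, and the interaction is logged.
chain = LoggedChain(lambda p: f"Answer to: {p}")
chain("Who maintains TruLens?")
print(records[0])
```

Everything downstream (feedback functions, the dashboard) operates on these logged records.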


Figure 3: Code snippet for connecting the ConversationalRetrievalChain from Figure 2 to TruLens and logging the inputs and outputs

3. Use feedback functions to evaluate and log the quality of LLM app results

The third step is to run feedback functions on the prompt and responses from the app and to log the evaluation results. Note that as a developer you only need to add a few lines of code to start using feedback functions in your apps (see Figure 4). You can also easily add functions tailored to the needs of your application.

Our goal with feedback functions is to programmatically check the app for quality metrics.

  • The first feedback function checks for language match between the prompt and the response. This is a useful check because users naturally expect the response to be in the same language as the prompt. It is implemented with a call to a HuggingFace API.
  • The next feedback function checks how relevant the answer is to the question, using an OpenAI LLM that is prompted to produce a relevance score.
  • Finally, the third feedback function checks how relevant individual chunks retrieved from the vector database are to the question, again using an OpenAI LLM in a similar manner. This is useful because the retrieval step may return chunks that are not relevant to the question; filtering these out before producing the final response improves its quality.
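Figure 4's code is not reproduced here, and the real checks call out to a HuggingFace API and OpenAI LLMs. But the abstraction itself is simple: a feedback function maps a prompt/response record to a score between 0 and 1. The rule-based sketch below illustrates the language-match check with a crude stopword heuristic; all names are hypothetical, not the TruLens API.

```python
# Rule-based sketch of a language-match feedback function (illustrative
# only; the real check calls a HuggingFace language-detection API).
# A feedback function maps a record to a score in [0, 1].

STOPWORDS = {
    "en": {"the", "is", "what", "and", "of", "in"},
    "de": {"der", "die", "das", "ist", "was", "und"},
}

def guess_language(text: str) -> str:
    """Crude heuristic: pick the language with the most stopword hits."""
    words = set(text.lower().split())
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))

def language_match(prompt: str, response: str) -> float:
    """1.0 if prompt and response appear to be in the same language."""
    return 1.0 if guess_language(prompt) == guess_language(response) else 0.0

record = {"prompt": "Was ist TruLens?", "response": "TruLens is a library."}
print(language_match(record["prompt"], record["response"]))  # 0.0: German in, English out
```

The two relevance checks fit the same signature; they simply delegate the scoring to an LLM prompted to rate relevance instead of a hand-written rule.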

Figure 4: Code snippet for running feedback functions to check for language match & relevance

Note that feedback functions are a general abstraction. They can be implemented in a number of ways, including modern LLMs, previous-generation BERT-style models, and, in some cases, simpler rule-based systems. We refer the interested reader to our article on feedback functions for more detail.

4. Explore in dashboard

After running the feedback functions on a set of records (interactions), you can

  • see the aggregate results of the evaluation on a leaderboard (see Figure 5);
  • then drill down into an app version (or chain) and examine how it is performing on individual records (see Figure 6).

These steps can help you understand the quality of an app version and its failure modes.

For example, in Figure 5, we see that this app version (Chain 0) is not doing well on the language match score from our feedback function. Drilling down in Figure 6, we discover that when questions are asked in German it responds in English instead of German – that’s the failure mode.

The app is doing well on overall relevance of responses to questions. However, it is performing poorly on the qs_relevance feedback function. Recall that this function checks how relevant individual chunks retrieved from the vector database are to the question. See Appendix B for an explanation of this failure mode.


Figure 5: The Leaderboard shows the quality of chains across records as scored by various feedback functions.


Figure 6: The Evaluations table provides a drill down of the evaluation at the record level. Note that on the language match test, this chain does not perform well when asked questions in German (the first 3 records above where the responses are in English instead of German) whereas it correctly responds in Spanish when asked a question in Spanish (record 4 above).

5. Iterate to get to the best chain

Armed with an understanding of the failure modes of the first chain, you then proceed to iterate on your app.

To address the language match failure mode, you adjust your prompt as shown in Figure 7. Running the evaluation again on the new version of the app (Chain 1), you see an improvement on the leaderboard – the feedback function score for language_match went from 0.25 to 0.97 (see Figure 9).


Figure 7: Adjusted prompt to address language match failure mode. Note the explicit instruction to answer in the same language as the question.
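The adjusted prompt in Figure 7 is not reproduced here, but per the caption the key change is an explicit instruction to answer in the language of the question. A hypothetical template illustrating that kind of change (the exact wording in the actual app may differ):

```python
# Hypothetical prompt template illustrating the Figure 7 fix: an explicit
# instruction to answer in the same language as the question.
# The actual template used in the app is not shown in this post.

TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "Answer in the same language as the question.\n\n"
    "Context: {context}\n\n"
    "Question: {question}\n"
    "Helpful answer:"
)

prompt = TEMPLATE.format(
    context="TruLens hilft bei der Bewertung von LLM-Apps.",
    question="Was ist TruLens?",
)
print(prompt)
```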

To address the irrelevant chunks failure mode, you adjust your prompt template as shown in Figure 8. Running the evaluation again on the new version of the app (Chain 2), you see an improvement on the leaderboard – the feedback function score for qs_relevance increases significantly (see Figure 9).


Figure 8: Adjusted prompt to address individual chunk relevance failure mode.
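Figure 8's actual fix is a prompt-template change, which is not reproduced here. A complementary way to act on chunk-relevance scores, mentioned in step 3, is to filter low-relevance chunks out before they reach the LLM. The sketch below illustrates that idea with a keyword-overlap scorer standing in for the OpenAI-based qs_relevance feedback function; names and threshold are hypothetical.

```python
# Illustrative chunk filtering based on relevance scores (the blog's
# actual Figure 8 fix was a prompt-template change, not shown here).
# The scorer is a keyword-overlap stand-in for the OpenAI-based
# qs_relevance feedback function; the 0.3 threshold is arbitrary.

def relevance(question: str, chunk: str) -> float:
    """Fraction of question words that also appear in the chunk."""
    q = set(question.lower().split())
    return len(q & set(chunk.lower().split())) / len(q)

def filter_chunks(question: str, chunks: list[str], threshold: float = 0.3) -> list[str]:
    """Keep only chunks scored at or above the threshold."""
    return [c for c in chunks if relevance(question, c) >= threshold]

chunks = [
    "trulens is a library for evaluating llm apps",
    "the colossal squid has very large eyes",
]
print(filter_chunks("what is trulens", chunks))
```

Filtering before generation keeps irrelevant context out of the final prompt, which is the failure mode the qs_relevance score surfaces.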


Figure 9: The Leaderboard shows the quality of chains across records as scored by various feedback functions.

Try TruLens today

TruLens is free, available today, and ready to help you evaluate and track your LLM experiments. TruLens increases developer productivity while offering greater visibility and confidence in the quality of LLM applications. We hope that you give it a try and let us know how it’s working out for you!

Give it a spin: Get TruLens

Give us a star: TruLens on GitHub

TruLens is shepherded by TruEra

TruEra is an AI Quality software company that helps organizations better test, debug, and monitor machine learning models and applications. Although TruEra both actively oversees the distribution of TruLens and helps organize the community around it, TruLens remains an open-source community project, not a TruEra product.

About the TruEra Research Team

TruLens originally emerged from the work of the TruEra Research Team. They are passionate about the importance of testing and quality in machine learning. They continue to be involved in the development of the TruLens community.

You can learn more about TruEra Research here.

Why a colossal squid?

The colossal squid’s eyeball is about the size of a soccer ball, making it the largest eyeball of any living creature. In addition, did you know that its eyeball contains light organs? That means that colossal squids have automatic headlights when looking around. We're hoping to bring similar guidance to model developers when creating, introspecting, and debugging neural networks. Read more about the amazing eyes of the colossal squid.