Developers can use TruLens while building their LLM applications in Python by following these steps. The example used here is a Question-Answering (QA) application. You can also go to the TruLens for LLMs Quick Start to get started right away.
Figure 1: The workflow for integrating TruLens into LLM app development
With TruLens, you build the first version of your LLM app following your standard workflow.
We built a Question Answering app named TruBot following the widely used paradigm of retrieval-augmented LLMs. This approach grounds the app's responses in a source of truth or knowledge base – TruEra’s website in this case. It involved chaining together the OpenAI LLM with the Pinecone vector database in the LangChain framework (see Appendix A for more details).
The code for the key steps of implementing this app is shown in Figure 2.
Figure 2: Code snippet of chained QA LLM app.
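For readers who want to follow along in code, a minimal sketch of this kind of chain is shown below. It assumes the langchain and pinecone-client APIs available at the time of writing, and the index name, model choice, and example question are illustrative placeholders rather than the exact contents of Figure 2.

```python
import pinecone
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

# Connect to the Pinecone index that holds embeddings of the knowledge base
# (TruEra's website). The index name is an illustrative placeholder.
pinecone.init(api_key="PINECONE_API_KEY", environment="PINECONE_ENV")
docsearch = Pinecone.from_existing_index(
    index_name="truera-website", embedding=OpenAIEmbeddings()
)

# Chain the OpenAI LLM with the Pinecone retriever in the LangChain framework.
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
chain = ConversationalRetrievalChain.from_llm(
    llm=llm, retriever=docsearch.as_retriever()
)

# Ask a question (chat_history is empty for a fresh conversation).
result = chain({"question": "What does TruEra do?", "chat_history": []})
print(result["answer"])
```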
The next step is to instrument the app with TruLens to log inputs and responses from the chain. Note that this step is very easy – it involves wrapping the previously created chain, running it, and logging the interaction with just 3 lines of code.
Figure 3: Code snippet for connecting the ConversationalRetrievalChain from Figure 2 to TruLens and logging the inputs and outputs
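A sketch of that wrapping step is shown below, assuming a trulens_eval release in which TruChain takes an app_id and the wrapped chain can be called directly; the exact argument names have shifted across versions.

```python
from trulens_eval import TruChain

# Wrap the chain from Figure 2 so that its inputs and outputs are logged.
truchain = TruChain(chain, app_id="Chain 0")

# Running the wrapped chain records the interaction in the local TruLens database.
truchain({"question": "Wie kann ich TruEra kontaktieren?", "chat_history": []})
```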
The third step is to run feedback functions on the prompt and responses from the app and to log the evaluation results. Note that as a developer you only need to add a few lines of code to start using feedback functions in your apps (see Figure 4). You can also easily add functions tailored to the needs of your application.
Our goal with feedback functions is to programmatically check the app for quality metrics.
Figure 4: Code snippet for running feedback functions to check for language match & relevance
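A sketch of those few lines is shown below, assuming the Huggingface and OpenAI feedback providers shipped with trulens_eval; provider import paths and method names may differ between releases.

```python
from trulens_eval import Feedback, Huggingface, OpenAI, TruChain

# Feedback providers: a Hugging Face classifier for language detection and an
# OpenAI model for grading relevance.
hugs = Huggingface()
openai_provider = OpenAI()

# Does the language of the response match the language of the prompt?
f_lang_match = Feedback(hugs.language_match).on_input_output()

# Is the final response relevant to the question?
f_qa_relevance = Feedback(openai_provider.relevance).on_input_output()

# Re-wrap the chain with the feedback functions attached so they are evaluated
# and logged for every record.
truchain = TruChain(
    chain,
    app_id="Chain 0",
    feedbacks=[f_lang_match, f_qa_relevance],
)
```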
Note that feedback functions are a general abstraction. They can be implemented in a number of ways, including with modern LLMs, with previous-generation BERT-style models, and, in some cases, with simpler rule-based systems. We refer the interested reader to our article on feedback functions for more detail.
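To make the abstraction concrete, here is a sketch of a simple rule-based feedback function. The helper name and heuristic are purely illustrative, and depending on the trulens_eval version a custom function may need to be registered through a Provider subclass; the sketch assumes the simplest case where a plain Python function can be wrapped directly.

```python
from trulens_eval import Feedback

def no_stock_refusals(response: str) -> float:
    """Illustrative rule-based check that flags canned refusal phrases.

    Returns 1.0 if none of a few stock phrases appear in the response and
    0.0 otherwise. Real feedback functions can be arbitrarily richer.
    """
    phrases = ("i don't know", "i cannot answer", "as an ai language model")
    return 0.0 if any(p in response.lower() for p in phrases) else 1.0

# Wrap the plain Python function so it runs on the app's output like any other
# feedback function.
f_no_refusals = Feedback(no_stock_refusals).on_output()
```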
After running the feedback functions on a set of records (interactions), you can examine the aggregate scores on the leaderboard and drill down into individual records in the evaluations table. These steps can help you understand the quality of an app version and its failure modes.
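One way to do that, assuming the Tru interface in trulens_eval, is to launch the local dashboard or pull the logged records into a DataFrame:

```python
from trulens_eval import Tru

tru = Tru()

# Launch the local dashboard with the leaderboard (Figure 5) and the
# record-level evaluations table (Figure 6).
tru.run_dashboard()

# Alternatively, pull logged records and feedback scores for programmatic
# inspection; an empty app_ids list returns records for all apps.
records_df, feedback_columns = tru.get_records_and_feedback(app_ids=[])
print(records_df.head())
```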
For example, in Figure 5, we see that this app version (Chain 0) is not doing well on the language match score from our feedback function. Drilling down in Figure 6, we discover that when questions are asked in German, the app responds in English instead of German – that’s the failure mode.
The app is doing well on overall relevance of responses to questions. However, it is performing poorly on the qs_relevance feedback function. Recall that this function checks how relevant individual chunks retrieved from the vector database are to the question. See Appendix B for an explanation of this failure mode.
Figure 5: The Leaderboard shows the quality of chains across records as scored by various feedback functions.
Figure 6: The Evaluations table provides a drill-down of the evaluation at the record level. Note that on the language match test, this chain does not perform well when asked questions in German (the first three records above, where the responses are in English instead of German), whereas it correctly responds in Spanish when asked a question in Spanish (record 4 above).
Armed with an understanding of the failure modes of the first chain, you then proceed to iterate on your app.
To address the language match failure mode, you adjust your prompt as shown in Figure 7. Running the evaluation again on the new version of the app (Chain 1), you see an improvement on the leaderboard – the feedback function score for language_match went from 0.25 to 0.97 (see Figure 9).
Figure 7: Adjusted prompt to address language match failure mode. Note the explicit instruction to answer in the same language as the question.
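As an illustration of that kind of change, the sketch below rebuilds the chain with an adjusted QA prompt and registers it as a new version. The template wording is illustrative rather than the exact text in Figure 7, and the construction reuses the llm, docsearch, and feedback function names from the earlier sketches along with LangChain's combine_docs_chain_kwargs hook.

```python
from langchain.prompts import PromptTemplate

# Illustrative QA prompt with an explicit instruction to answer in the same
# language as the question; the exact wording in Figure 7 may differ.
qa_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Use the following pieces of context to answer the question.\n"
        "Answer in the same language as the question.\n\n"
        "{context}\n\nQuestion: {question}\nHelpful answer:"
    ),
)

# Rebuild the chain with the adjusted prompt and log it under a new id so it
# appears alongside Chain 0 on the leaderboard.
chain_v1 = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=docsearch.as_retriever(),
    combine_docs_chain_kwargs={"prompt": qa_prompt},
)
truchain_v1 = TruChain(
    chain_v1, app_id="Chain 1", feedbacks=[f_lang_match, f_qa_relevance]
)
```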
To address the irrelevant chunks failure mode, you adjust your prompt template as shown in Figure 8. Running the evaluation again on the new version of the app (Chain 2), you see an improvement on the leaderboard – the feedback function score for qs_relevance increases significantly (see Figure 9).
Figure 8: Adjusted prompt to address individual chunk relevance failure mode.
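To populate the comparison in Figure 9, each version is registered under its own app_id and run on the same set of questions. The sketch below assumes chain_v2 was built like chain_v1 but with the Figure 8 prompt, and that the wrapped chains can be called directly; the question list is illustrative.

```python
# Register the Figure 8 variant under its own id.
truchain_v2 = TruChain(
    chain_v2, app_id="Chain 2", feedbacks=[f_lang_match, f_qa_relevance]
)

# Illustrative evaluation questions; use your own prompt set.
questions = [
    "Wie kann ich TruEra kontaktieren?",
    "¿Qué hace TruEra?",
    "What does TruLens do?",
]

# Run every version on the same records so the leaderboard scores in Figure 9
# are directly comparable.
for wrapped_chain in (truchain, truchain_v1, truchain_v2):
    for question in questions:
        wrapped_chain({"question": question, "chat_history": []})
```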
Figure 9: The Leaderboard compares the chain versions across records as scored by the feedback functions, showing the improvements of Chains 1 and 2 over Chain 0.
TruLens is free, available today, and ready to help you evaluate and track your LLM experiments. TruLens increases developer productivity while offering greater visibility and confidence in the quality of LLM applications. We hope that you give it a try and let us know how it’s working out for you!
TruEra is an AI Quality software company that helps organizations better test, debug, and monitor machine learning models and applications. Although TruEra both actively oversees the distribution of TruLens and helps organize the community around it, TruLens remains an open-source community project, not a TruEra product.
TruLens originally emerged from the work of the TruEra Research Team. They are passionate about the importance of testing and quality in machine learning. They continue to be involved in the development of the TruLens community.
Why a colossal squid?
The colossal squid’s eyeball is about the size of a soccer ball, making it the largest eyeball of any living creature. In addition, did you know that its eyeball contains light organs? That means that colossal squids have automatic headlights when looking around. We're hoping to bring similar guidance to model developers when creating, introspecting, and debugging neural networks. Read more about the amazing eyes of the colossal squid.