📓 TruLens Quickstart¶
In this quickstart you will create a RAG from scratch and learn how to log it and get feedback on an LLM response.
For evaluation, we will leverage the RAG triad of groundedness, context relevance, and answer relevance.
# ! pip install trulens_eval chromadb openai
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
Get Data¶
In this case, we'll just initialize some simple text in the notebook.
uw_info = """
The University of Washington, founded in 1861 in Seattle, is a public research university
with over 45,000 students across three campuses in Seattle, Tacoma, and Bothell.
As the flagship institution of the six public universities in Washington state,
UW encompasses over 500 buildings and 20 million square feet of space,
including one of the largest library systems in the world.
"""
wsu_info = """
Washington State University, commonly known as WSU, founded in 1890, is a public research university in Pullman, Washington.
With multiple campuses across the state, it is the state's second largest institution of higher education.
WSU is known for its programs in veterinary medicine, agriculture, engineering, architecture, and pharmacy.
"""
seattle_info = """
Seattle, a city on Puget Sound in the Pacific Northwest, is surrounded by water, mountains and evergreen forests, and contains thousands of acres of parkland.
It's home to a large tech industry, with Microsoft and Amazon headquartered in its metropolitan area.
The futuristic Space Needle, a legacy of the 1962 World's Fair, is its most iconic landmark.
"""
starbucks_info = """
Starbucks Corporation is an American multinational chain of coffeehouses and roastery reserves headquartered in Seattle, Washington.
As the world's largest coffeehouse chain, Starbucks is seen to be the main representation of the United States' second wave of coffee culture.
"""
Create Vector Store¶
Create a chromadb vector store in memory.
import os
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
embedding_function = OpenAIEmbeddingFunction(
    api_key=os.environ.get("OPENAI_API_KEY"),
    model_name="text-embedding-ada-002",
)
chroma_client = chromadb.Client()
vector_store = chroma_client.get_or_create_collection(
    name="Washington", embedding_function=embedding_function
)
Populate the vector store.
vector_store.add("uw_info", documents=uw_info)
vector_store.add("wsu_info", documents=wsu_info)
vector_store.add("seattle_info", documents=seattle_info)
vector_store.add("starbucks_info", documents=starbucks_info)
Build RAG from scratch¶
Build a custom RAG from scratch, and add TruLens custom instrumentation.
from trulens_eval import Tru
from trulens_eval.tru_custom_app import instrument
tru = Tru()
tru.reset_database()
🦑 Tru initialized with db url sqlite:///default.sqlite . 🛑 Secret keys may be written to the database. See the `database_redact_keys` option of `Tru` to prevent this.
from openai import OpenAI
oai_client = OpenAI()
class RAG_from_scratch:
    @instrument
    def retrieve(self, query: str) -> list:
        """
        Retrieve relevant text from vector store.
        """
        results = vector_store.query(query_texts=query, n_results=4)
        # Flatten the list of lists into a single list
        return [doc for sublist in results["documents"] for doc in sublist]

    @instrument
    def generate_completion(self, query: str, context_str: list) -> str:
        """
        Generate answer from context.
        """
        completion = oai_client.chat.completions.create(
            model="gpt-3.5-turbo",
            temperature=0,
            messages=[
                {
                    "role": "user",
                    "content": f"We have provided context information below. \n"
                    f"---------------------\n"
                    f"{context_str}"
                    f"\n---------------------\n"
                    f"Given this information, please answer the question: {query}",
                }
            ],
        ).choices[0].message.content
        return completion

    @instrument
    def query(self, query: str) -> str:
        context_str = self.retrieve(query)
        completion = self.generate_completion(query, context_str)
        return completion
rag = RAG_from_scratch()
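The `retrieve` method flattens chromadb's nested query results (one inner list of documents per query text) into a single list. As a standalone sketch of that step, using hypothetical data in place of real query results:

```python
# Hypothetical shape of chromadb query results: one inner list per query text.
results = {"documents": [["doc about UW", "doc about WSU"]]}

# Flatten the list of lists into a single list of document strings.
flat = [doc for sublist in results["documents"] for doc in sublist]
print(flat)  # → ['doc about UW', 'doc about WSU']
```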
Set up feedback functions.¶
Here we'll use groundedness, answer relevance and context relevance to detect hallucination.
from trulens_eval import Feedback, Select
from trulens_eval.feedback.provider.openai import OpenAI
import numpy as np
provider = OpenAI(model_engine="gpt-4o")
# Define a groundedness feedback function
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(Select.RecordCalls.retrieve.rets.collect())
    .on_output()
)

# Question/answer relevance between overall question and answer.
f_answer_relevance = (
    Feedback(provider.relevance_with_cot_reasons, name="Answer Relevance")
    .on_input()
    .on_output()
)

# Context relevance between question and each context chunk.
f_context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()
    .on(Select.RecordCalls.retrieve.rets[:])
    .aggregate(np.mean)  # choose a different aggregation method if you wish
)
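Context relevance is scored per retrieved chunk and then aggregated; passing `np.mean` averages the per-chunk scores into one value. A quick illustration with hypothetical scores, where one relevant chunk out of four yields an aggregate of 0.25:

```python
import numpy as np

# Hypothetical per-chunk context relevance scores: one relevant chunk of four.
chunk_scores = [1.0, 0.0, 0.0, 0.0]

# np.mean is the aggregation function passed to .aggregate() above.
aggregate = float(np.mean(chunk_scores))
print(aggregate)  # → 0.25
```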
✅ In Groundedness, input source will be set to __record__.app.retrieve.rets.collect() .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Context Relevance, input question will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input context will be set to __record__.app.retrieve.rets[:] .
Construct the app¶
Wrap the custom RAG with TruCustomApp and add the list of feedback functions for evaluation.
from trulens_eval import TruCustomApp
tru_rag = TruCustomApp(
    rag,
    app_id="RAG v1",
    feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance],
)
Run the app¶
Use tru_rag as a context manager for the custom RAG-from-scratch app.
with tru_rag as recording:
    rag.query("When was the University of Washington founded?")
Check results¶
We can view results in the leaderboard.
tru.get_leaderboard()
| app_id | latency | total_cost |
|---|---|---|
| RAG v1 | 3.0 | 0.000511 |
last_record = recording.records[-1]
from trulens_eval.utils.display import get_feedback_result
get_feedback_result(last_record, "Context Relevance")
| | question | context | ret |
|---|---|---|---|
| 0 | When was the University of Washington founded? | \nThe University of Washington, founded in 186... | 1.0 |
| 1 | When was the University of Washington founded? | \nWashington State University, commonly known ... | 0.0 |
| 2 | When was the University of Washington founded? | \nSeattle, a city on Puget Sound in the Pacifi... | 0.0 |
| 3 | When was the University of Washington founded? | \nStarbucks Corporation is an American multina... | 0.0 |
Use guardrails¶
In addition to informing iteration, feedback results can be used directly as guardrails at inference time. In particular, here we show how to use the context relevance score as a guardrail to filter out irrelevant context before it is passed to the LLM. This both reduces hallucination and improves efficiency.
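Conceptually, the guardrail scores each retrieved chunk and keeps only those that clear the threshold before generation. A minimal, framework-free sketch of that logic, with a hypothetical `score_fn` standing in for the LLM-based feedback function:

```python
def filter_contexts(query, contexts, score_fn, threshold=0.75):
    """Keep only the context chunks whose relevance score clears the threshold."""
    return [c for c in contexts if score_fn(query, c) >= threshold]

def score_fn(query, context):
    # Hypothetical scorer; a real guardrail calls an LLM-based feedback function.
    return 1.0 if "University of Washington" in context else 0.0

contexts = [
    "The University of Washington, founded in 1861 in Seattle, is a public research university.",
    "Seattle, a city on Puget Sound in the Pacific Northwest, is surrounded by water.",
]
kept = filter_contexts("When was UW founded?", contexts, score_fn)
print(kept)  # only the UW chunk survives the filter
```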
To do so, we'll rebuild our RAG using the @context_filter decorator on the method we want to filter, and pass in the feedback function and threshold to use for guardrailing.
# note: feedback function used for guardrail must only return a score, not also reasons
f_context_relevance_score = Feedback(provider.context_relevance, name="Context Relevance")
from trulens_eval.guardrails.base import context_filter
class filtered_RAG_from_scratch:
    @instrument
    @context_filter(f_context_relevance_score, 0.75, keyword_for_prompt="query")
    def retrieve(self, query: str) -> list:
        """
        Retrieve relevant text from vector store.
        """
        results = vector_store.query(query_texts=query, n_results=4)
        # Flatten the list of lists into a single list
        return [doc for sublist in results["documents"] for doc in sublist]

    @instrument
    def generate_completion(self, query: str, context_str: list) -> str:
        """
        Generate answer from context.
        """
        completion = oai_client.chat.completions.create(
            model="gpt-3.5-turbo",
            temperature=0,
            messages=[
                {
                    "role": "user",
                    "content": f"We have provided context information below. \n"
                    f"---------------------\n"
                    f"{context_str}"
                    f"\n---------------------\n"
                    f"Given this information, please answer the question: {query}",
                }
            ],
        ).choices[0].message.content
        return completion

    @instrument
    def query(self, query: str) -> str:
        context_str = self.retrieve(query=query)
        completion = self.generate_completion(query=query, context_str=context_str)
        return completion
filtered_rag = filtered_RAG_from_scratch()
Record and operate as normal¶
from trulens_eval import TruCustomApp
filtered_tru_rag = TruCustomApp(
    filtered_rag,
    app_id="RAG v2",
    feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance],
)
with filtered_tru_rag as recording:
    filtered_rag.query(query="when was the university of washington founded?")
tru.get_leaderboard(app_ids=[])
| app_id | Answer Relevance | Groundedness | Context Relevance | latency | total_cost |
|---|---|---|---|---|---|
| RAG v2 | 1.0 | 1.0 | 1.00 | 1.0 | 0.000203 |
| RAG v1 | 1.0 | 1.0 | 0.25 | 1.0 | 0.000511 |
See the power of filtering!
last_record = recording.records[-1]
from trulens_eval.utils.display import get_feedback_result
get_feedback_result(last_record, "Context Relevance")
| | question | context | ret |
|---|---|---|---|
| 0 | when was the university of washington founded? | \nThe University of Washington, founded in 186... | 1.0 |
tru.run_dashboard(port=3453, force=True)
Starting dashboard ... Config file already exists. Skipping writing process. Credentials file already exists. Skipping writing process.
Dashboard started at http://192.168.4.206:3453 .