Ground truth evaluation for LlamaIndex applications¶
Ground truth evaluation is especially useful during early LLM experiments, when you have a small set of example queries that are critical to get right. It works by measuring how closely an LLM response agrees with the verified response for the same query.
This example walks through how to set up ground truth eval for a LlamaIndex app.
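To make the idea concrete, here is a minimal sketch of comparing a response against a verified answer. It is not how TruLens scores agreement: the `GroundTruthAgreement` feedback used in this example relies on an LLM provider, whereas this toy version swaps in character overlap from Python's `difflib` so that it runs offline. The helper names `lookup_expected` and `toy_agreement` are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical one-entry golden set in the {"query", "response"} shape.
golden_set = [
    {
        "query": "What company did the author start in 1995?",
        "response": "Viaweb, to make software for building online stores.",
    },
]


def lookup_expected(query, golden):
    """Return the verified response for an exact query match, or None."""
    for pair in golden:
        if pair["query"] == query:
            return pair["response"]
    return None


def toy_agreement(query, response, golden):
    """Score between 0.0 and 1.0 by character overlap (stand-in for an LLM judge)."""
    expected = lookup_expected(query, golden)
    if expected is None:
        return None  # no ground truth for this query, so nothing to score
    return SequenceMatcher(None, expected.lower(), response.lower()).ratio()


question = golden_set[0]["query"]
exact = toy_agreement(question, golden_set[0]["response"], golden_set)
off_topic = toy_agreement(question, "I do not know.", golden_set)
print(exact, off_topic)
```

Character overlap is a crude proxy: a correct answer phrased differently would score low, which is one reason to prefer an LLM-based agreement measure in practice.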
Import from TruLens and LlamaIndex¶
# !pip install trulens trulens-apps-llamaindex trulens-providers-openai llama_index==0.10.11 llama-index-readers-web
from llama_index.core import VectorStoreIndex
from llama_index.readers.web import SimpleWebPageReader
import openai
from trulens.core import Feedback
from trulens.core import TruSession
from trulens.feedback import GroundTruthAgreement
from trulens.apps.llamaindex import TruLlama
from trulens.providers.openai import OpenAI
session = TruSession()
session.reset_database()
Add API keys¶
For this quickstart, you will need an OpenAI API key.
import os
os.environ["OPENAI_API_KEY"] = "..."
openai.api_key = os.environ["OPENAI_API_KEY"]
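Pasting a key into a notebook cell makes it easy to leak. As an alternative, here is a minimal sketch that assumes the key was exported in the shell before the notebook started and fails early if it is missing. `require_env` is a hypothetical helper, not part of TruLens, LlamaIndex, or the OpenAI client.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set {name} before running this notebook")
    return value


# Usage (commented out so the sketch runs without a real key):
# openai.api_key = require_env("OPENAI_API_KEY")
```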
Create Simple LLM Application¶
This example uses LlamaIndex, which uses an OpenAI LLM internally.
documents = SimpleWebPageReader(html_to_text=True).load_data(
    ["http://paulgraham.com/worked.html"]
)
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
Initialize Feedback Function(s)¶
# Initialize OpenAI-based feedback function collection class:
openai_provider = OpenAI()
golden_set = [
    {
        "query": "What was the author's undergraduate major?",
        "response": "He didn't choose a major, and customized his courses.",
    },
    {
        "query": "What company did the author start in 1995?",
        "response": "Viaweb, to make software for building online stores.",
    },
    {
        "query": "Where did the author move in 1998 after selling Viaweb?",
        "response": "California, after Yahoo acquired Viaweb.",
    },
    {
        "query": "What did the author do after leaving Yahoo in 1999?",
        "response": "He focused on painting and tried to improve his art skills.",
    },
    {
        "query": "What program did the author start with Jessica Livingston in 2005?",
        "response": "Y Combinator, to provide seed funding for startups.",
    },
]
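Because each entry pairs a query with its verified response, a malformed or duplicated entry can quietly skew the evaluation. The following is a minimal sketch of a sanity check to run before building the feedback function. `validate_golden_set` is a hypothetical helper, not a TruLens API, and it assumes only the `query` and `response` keys shown in the golden set.

```python
def validate_golden_set(golden):
    """Return a list of problems found; an empty list means the set looks fine."""
    problems = []
    seen_queries = set()
    for i, pair in enumerate(golden):
        # Both keys must be present and hold non-empty strings.
        for key in ("query", "response"):
            value = pair.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"entry {i}: missing or empty '{key}'")
        # The same query listed twice makes the expected answer ambiguous.
        query = pair.get("query")
        if isinstance(query, str):
            if query in seen_queries:
                problems.append(f"entry {i}: duplicate query")
            seen_queries.add(query)
    return problems


# Deliberately broken sample: the second entry repeats the query and has no answer.
sample = [
    {"query": "What company did the author start in 1995?", "response": "Viaweb."},
    {"query": "What company did the author start in 1995?", "response": ""},
]
problems = validate_golden_set(sample)
print(problems)
```

Calling `validate_golden_set(golden_set)` on the set defined above should return an empty list.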
f_groundtruth = Feedback(
    GroundTruthAgreement(golden_set, provider=openai_provider).agreement_measure,
    name="Ground Truth Eval",
).on_input_output()
Instrument the application with Ground Truth Eval¶
tru_query_engine_recorder = TruLlama(
    query_engine,
    app_name="LlamaIndex_App",
    feedbacks=[f_groundtruth],
)
Run the application for all queries in the golden set¶
# Run and evaluate on groundtruth questions
for pair in golden_set:
    with tru_query_engine_recorder as recording:
        llm_response = query_engine.query(pair["query"])
        print(llm_response)
Explore with the TruLens dashboard¶
from trulens.dashboard import run_dashboard
run_dashboard(session) # open a local streamlit app to explore
# stop_dashboard(session) # stop if needed
Or view results directly in your notebook¶
records, feedback = session.get_records_and_feedback()
records.head()