📓 Ground Truth Evaluations¶
In this quickstart you will create and evaluate a simple LLM app using ground truth. Ground truth evaluation can be especially useful during early LLM experiments, when you have a small set of example queries that are critical to get right.
Ground truth evaluation works by measuring the similarity between an LLM's response and its matching verified response.
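The core idea can be sketched with plain string similarity. Note this is illustrative only: TruLens's `GroundTruthAgreement` uses an LLM-based agreement measure, not `difflib`, but the mechanics of scoring a response against a verified answer are the same.

```python
# Minimal sketch of ground truth scoring using stdlib string similarity.
# This stands in for the LLM-based agreement measure TruLens actually uses.
from difflib import SequenceMatcher

def similarity(response: str, ground_truth: str) -> float:
    """Return a 0-1 similarity score between a response and its verified answer."""
    return SequenceMatcher(None, response.lower(), ground_truth.lower()).ratio()

print(similarity("Thomas Edison", "Thomas Edison"))  # identical -> 1.0
print(similarity("The lightbulb was invented by Thomas Edison.", "Thomas Edison"))
```

A response that exactly matches the golden answer scores 1.0; a longer response containing the answer scores somewhere in between, which is why an LLM-based agreement measure that judges semantic equivalence is preferable in practice.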
Add API keys¶
For this quickstart, you will need an OpenAI API key.
In [ ]:
# ! pip install trulens_eval openai
In [2]:
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
In [3]:
from trulens_eval import Tru
tru = Tru()
Create Simple LLM Application¶
In [4]:
from openai import OpenAI

oai_client = OpenAI()

from trulens_eval.tru_custom_app import instrument

class APP:
    @instrument
    def completion(self, prompt):
        completion = oai_client.chat.completions.create(
            model="gpt-3.5-turbo",
            temperature=0,
            messages=[
                {"role": "user",
                 "content": f"Please answer the question: {prompt}"}
            ]
        ).choices[0].message.content
        return completion

llm_app = APP()
Initialize Feedback Function(s)¶
In [5]:
from trulens_eval import Feedback
from trulens_eval.feedback import GroundTruthAgreement

golden_set = [
    {"query": "who invented the lightbulb?", "response": "Thomas Edison"},
    {"query": "¿quien invento la bombilla?", "response": "Thomas Edison"}
]

f_groundtruth = Feedback(GroundTruthAgreement(golden_set).agreement_measure, name="Ground Truth").on_input_output()
✅ In Ground Truth, input prompt will be set to `__record__.main_input` or `Select.RecordInput`.
✅ In Ground Truth, input response will be set to `__record__.main_output` or `Select.RecordOutput`.
Instrument chain for logging with TruLens¶
In [6]:
# add trulens as a context manager for llm_app
from trulens_eval import TruCustomApp
tru_app = TruCustomApp(llm_app, app_id = 'LLM App v1', feedbacks = [f_groundtruth])
In [7]:
# The instrumented app can operate as a context manager:
with tru_app as recording:
    llm_app.completion("¿quien invento la bombilla?")
    llm_app.completion("who invented the lightbulb?")
See results¶
In [8]:
tru.get_leaderboard(app_ids=[tru_app.app_id])
Out[8]:
| app_id | Ground Truth | positive_sentiment | Human Feedback | latency | total_cost |
|---|---|---|---|---|---|
| LLM App v1 | 1.0 | 0.38994 | 1.0 | 1.75 | 0.000076 |