Groundedness Evaluations
In many ways, feedback functions can be thought of as LLM apps themselves: given text, they return a result. Viewed this way, we can use TruLens to evaluate and track the quality of our feedback functions. We can even compare different models (e.g. gpt-3.5 and gpt-4) or prompting schemes (such as chain-of-thought reasoning).
This notebook evaluates feedback functions on a set of test cases generated from human-annotated datasets. In particular, we generate test cases from SummEval.
SummEval is a dataset dedicated to automated evaluation of summarization tasks, which closely resemble groundedness evaluation in RAG: the retrieved context plays the role of the source and the response plays the role of the summary. It contains human annotations of numerical scores (1 to 5) from 3 expert annotators and 5 crowd-sourced annotators. In total, 16 models generate summaries for the 100 paragraphs in the test set, giving 1,600 machine-generated summaries. Each paragraph also has several human-written summaries for comparative analysis.
To evaluate groundedness feedback functions, we combine the annotated "relevance" and "consistency" (also known as factuality) scores with equal weights and normalize the result to the range 0 to 1 to match the output of the feedback functions.
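The weighting and rescaling described above can be sketched as follows. This is only an illustration, not the helper used by the notebook: the function name `expected_score`, the min-max rescaling from the 1 to 5 range, and the two-decimal rounding are all assumptions.

```python
def expected_score(relevance, consistency, lo=1.0, hi=5.0):
    """Combine two annotated scores on [lo, hi] into one score on [0, 1].

    Hypothetical helper: equal weights, then min-max rescaling (an assumption).
    """
    combined = 0.5 * relevance + 0.5 * consistency  # equal weights
    return round((combined - lo) / (hi - lo), 2)


# A summary rated 3 for relevance and 5 for consistency
print(expected_score(relevance=3.0, consistency=5.0))
```

If the actual golden set divides by the maximum score instead of using min-max rescaling, only the last line of the function would change.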
# Import groundedness feedback function
from trulens_eval.feedback import GroundTruthAgreement, Groundedness
from trulens_eval import TruBasicApp, Feedback, Tru, Select
from test_cases import generate_summeval_groundedness_golden_set
Tru().reset_database()
# generator for groundedness golden set
test_cases_gen = generate_summeval_groundedness_golden_set("./datasets/summeval_test_100.json")
# take the first 50 test cases from the generator
groundedness_golden_set = []
for i in range(50):
    groundedness_golden_set.append(next(test_cases_gen))
🦑 Tru initialized with db url sqlite:///default.sqlite . 🛑 Secret keys may be written to the database. See the `database_redact_keys` option of `Tru` to prevent this. Deleted 0 rows.
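The loop above pulls a fixed number of items from a generator with `next`, which raises `StopIteration` if the generator runs out early. A possible alternative is `itertools.islice`, sketched here with a hypothetical stand-in generator (`fake_test_cases` is not part of TruLens or the notebook):

```python
from itertools import islice


def fake_test_cases():
    """Hypothetical stand-in for the SummEval generator; yields dummy cases."""
    i = 0
    while True:
        yield {"query": f"source {i}", "response": f"summary {i}", "expected_score": 0.5}
        i += 1


# Take at most 50 cases; islice simply stops if fewer are available
golden_set = list(islice(fake_test_cases(), 50))
print(len(golden_set))
```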
groundedness_golden_set[:3]
[{'query': '(CNN)Donald Sterling\'s racist remarks cost him an NBA team last year. But now it\'s his former female companion who has lost big. A Los Angeles judge has ordered V. Stiviano to pay back more than $2.6 million in gifts after Sterling\'s wife sued her. In the lawsuit, Rochelle "Shelly" Sterling accused Stiviano of targeting extremely wealthy older men. She claimed Donald Sterling used the couple\'s money to buy Stiviano a Ferrari, two Bentleys and a Range Rover, and that he helped her get a $1.8 million duplex. Who is V. Stiviano? Stiviano countered that there was nothing wrong with Donald Sterling giving her gifts and that she never took advantage of the former Los Angeles Clippers owner, who made much of his fortune in real estate. Shelly Sterling was thrilled with the court decision Tuesday, her lawyer told CNN affiliate KABC. "This is a victory for the Sterling family in recovering the $2,630,000 that Donald lavished on a conniving mistress," attorney Pierce O\'Donnell said in a statement. "It also sets a precedent that the injured spouse can recover damages from the recipient of these ill-begotten gifts." Stiviano\'s gifts from Donald Sterling didn\'t just include uber-expensive items like luxury cars. According to the Los Angeles Times, the list also includes a $391 Easter bunny costume, a $299 two-speed blender and a $12 lace thong. Donald Sterling\'s downfall came after an audio recording surfaced of the octogenarian arguing with Stiviano. In the tape, Sterling chastises Stiviano for posting pictures on social media of her posing with African-Americans, including basketball legend Magic Johnson. "In your lousy f**ing Instagrams, you don\'t have to have yourself with -- walking with black people," Sterling said in the audio first posted by TMZ. He also tells Stiviano not to bring Johnson to Clippers games and not to post photos with the Hall of Famer so Sterling\'s friends can see. 
"Admire him, bring him here, feed him, f**k him, but don\'t put (Magic) on an Instagram for the world to have to see so they have to call me," Sterling said. NBA Commissioner Adam Silver banned Sterling from the league, fined him $2.5 million and pushed through a charge to terminate all of his ownership rights in the franchise. Fact check: Donald Sterling\'s claims vs. reality CNN\'s Dottie Evans contributed to this report.', 'response': "donald sterling , nba team last year . sterling 's wife sued for $ 2.6 million in gifts . sterling says he is the former female companion who has lost the . sterling has ordered v. stiviano to pay back $ 2.6 m in gifts after his wife sued . sterling also includes a $ 391 easter bunny costume , $ 299 and a $ 299 .", 'expected_score': 0.27}, {'query': '(CNN)Donald Sterling\'s racist remarks cost him an NBA team last year. But now it\'s his former female companion who has lost big. A Los Angeles judge has ordered V. Stiviano to pay back more than $2.6 million in gifts after Sterling\'s wife sued her. In the lawsuit, Rochelle "Shelly" Sterling accused Stiviano of targeting extremely wealthy older men. She claimed Donald Sterling used the couple\'s money to buy Stiviano a Ferrari, two Bentleys and a Range Rover, and that he helped her get a $1.8 million duplex. Who is V. Stiviano? Stiviano countered that there was nothing wrong with Donald Sterling giving her gifts and that she never took advantage of the former Los Angeles Clippers owner, who made much of his fortune in real estate. Shelly Sterling was thrilled with the court decision Tuesday, her lawyer told CNN affiliate KABC. "This is a victory for the Sterling family in recovering the $2,630,000 that Donald lavished on a conniving mistress," attorney Pierce O\'Donnell said in a statement. "It also sets a precedent that the injured spouse can recover damages from the recipient of these ill-begotten gifts." 
Stiviano\'s gifts from Donald Sterling didn\'t just include uber-expensive items like luxury cars. According to the Los Angeles Times, the list also includes a $391 Easter bunny costume, a $299 two-speed blender and a $12 lace thong. Donald Sterling\'s downfall came after an audio recording surfaced of the octogenarian arguing with Stiviano. In the tape, Sterling chastises Stiviano for posting pictures on social media of her posing with African-Americans, including basketball legend Magic Johnson. "In your lousy f**ing Instagrams, you don\'t have to have yourself with -- walking with black people," Sterling said in the audio first posted by TMZ. He also tells Stiviano not to bring Johnson to Clippers games and not to post photos with the Hall of Famer so Sterling\'s friends can see. "Admire him, bring him here, feed him, f**k him, but don\'t put (Magic) on an Instagram for the world to have to see so they have to call me," Sterling said. NBA Commissioner Adam Silver banned Sterling from the league, fined him $2.5 million and pushed through a charge to terminate all of his ownership rights in the franchise. Fact check: Donald Sterling\'s claims vs. reality CNN\'s Dottie Evans contributed to this report.', 'response': "donald sterling accused stiviano of targeting extremely wealthy older men . she claimed donald sterling used the couple 's money to buy stiviano a ferrari , two bentleys and a range rover . stiviano countered that there was nothing wrong with donald sterling giving her gifts .", 'expected_score': 0.4}, {'query': '(CNN)Donald Sterling\'s racist remarks cost him an NBA team last year. But now it\'s his former female companion who has lost big. A Los Angeles judge has ordered V. Stiviano to pay back more than $2.6 million in gifts after Sterling\'s wife sued her. In the lawsuit, Rochelle "Shelly" Sterling accused Stiviano of targeting extremely wealthy older men. 
She claimed Donald Sterling used the couple\'s money to buy Stiviano a Ferrari, two Bentleys and a Range Rover, and that he helped her get a $1.8 million duplex. Who is V. Stiviano? Stiviano countered that there was nothing wrong with Donald Sterling giving her gifts and that she never took advantage of the former Los Angeles Clippers owner, who made much of his fortune in real estate. Shelly Sterling was thrilled with the court decision Tuesday, her lawyer told CNN affiliate KABC. "This is a victory for the Sterling family in recovering the $2,630,000 that Donald lavished on a conniving mistress," attorney Pierce O\'Donnell said in a statement. "It also sets a precedent that the injured spouse can recover damages from the recipient of these ill-begotten gifts." Stiviano\'s gifts from Donald Sterling didn\'t just include uber-expensive items like luxury cars. According to the Los Angeles Times, the list also includes a $391 Easter bunny costume, a $299 two-speed blender and a $12 lace thong. Donald Sterling\'s downfall came after an audio recording surfaced of the octogenarian arguing with Stiviano. In the tape, Sterling chastises Stiviano for posting pictures on social media of her posing with African-Americans, including basketball legend Magic Johnson. "In your lousy f**ing Instagrams, you don\'t have to have yourself with -- walking with black people," Sterling said in the audio first posted by TMZ. He also tells Stiviano not to bring Johnson to Clippers games and not to post photos with the Hall of Famer so Sterling\'s friends can see. "Admire him, bring him here, feed him, f**k him, but don\'t put (Magic) on an Instagram for the world to have to see so they have to call me," Sterling said. NBA Commissioner Adam Silver banned Sterling from the league, fined him $2.5 million and pushed through a charge to terminate all of his ownership rights in the franchise. Fact check: Donald Sterling\'s claims vs. 
reality CNN\'s Dottie Evans contributed to this report.', 'response': "a los angeles judge has ordered v. stiviano to pay back more than $ 2.6 million in gifts after sterling 's wife sued her . -lrb- cnn -rrb- donald sterling 's racist remarks cost him an nba team last year . but now it 's his former female companion who has lost big . who is v. stiviano ? .", 'expected_score': 0.7}]
import os
os.environ["OPENAI_API_KEY"] = "..."
os.environ["HUGGINGFACE_API_KEY"] = "..."
Benchmarking various Groundedness feedback function providers (OpenAI GPT-3.5-turbo vs GPT-4 vs Huggingface)
from trulens_eval.feedback.provider.hugs import Huggingface
from trulens_eval.feedback.provider import OpenAI
import numpy as np
huggingface_provider = Huggingface()
groundedness_hug = Groundedness(groundedness_provider=huggingface_provider)
f_groundedness_hug = Feedback(groundedness_hug.groundedness_measure, name = "Groundedness Huggingface").on_input().on_output().aggregate(groundedness_hug.grounded_statements_aggregator)
def wrapped_groundedness_hug(input, output):
    return np.mean(list(f_groundedness_hug(input, output)[0].values()))
groundedness_openai = Groundedness(groundedness_provider=OpenAI(model_engine="gpt-3.5-turbo")) # gpt-3.5-turbo is the default model if not specified
f_groundedness_openai = Feedback(groundedness_openai.groundedness_measure, name = "Groundedness OpenAI GPT-3.5").on_input().on_output().aggregate(groundedness_openai.grounded_statements_aggregator)
def wrapped_groundedness_openai(input, output):
    return f_groundedness_openai(input, output)[0]['full_doc_score']
groundedness_openai_gpt4 = Groundedness(groundedness_provider=OpenAI(model_engine="gpt-4"))
f_groundedness_openai_gpt4 = Feedback(groundedness_openai_gpt4.groundedness_measure, name = "Groundedness OpenAI GPT-4").on_input().on_output().aggregate(groundedness_openai_gpt4.grounded_statements_aggregator)
def wrapped_groundedness_openai_gpt4(input, output):
    return f_groundedness_openai_gpt4(input, output)[0]['full_doc_score']
✅ In Groundedness Huggingface, input source will be set to __record__.main_input or `Select.RecordInput` . ✅ In Groundedness Huggingface, input statement will be set to __record__.main_output or `Select.RecordOutput` . ✅ In Groundedness OpenAI GPT-3.5, input source will be set to __record__.main_input or `Select.RecordInput` . ✅ In Groundedness OpenAI GPT-3.5, input statement will be set to __record__.main_output or `Select.RecordOutput` . ✅ In Groundedness OpenAI GPT-4, input source will be set to __record__.main_input or `Select.RecordInput` . ✅ In Groundedness OpenAI GPT-4, input statement will be set to __record__.main_output or `Select.RecordOutput` .
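Each wrapper above reduces a dictionary of scores to a single float: the Hugging Face wrapper averages all values, while the OpenAI wrappers read the `full_doc_score` entry. The sketch below mimics the two reductions on hand-written dictionaries. The `statement_N` keys and all values are hypothetical, and the standard-library `mean` stands in for `np.mean`.

```python
from statistics import mean

# Hypothetical score dictionaries; real keys and values come from the providers
per_statement_scores = {"statement_0": 0.75, "statement_1": 0.5, "statement_2": 0.25}
whole_document_scores = {"full_doc_score": 0.7}

# Hugging Face style: average the per-statement scores
hug_score = mean(per_statement_scores.values())

# OpenAI style: read the single document-level score
openai_score = whole_document_scores["full_doc_score"]

print(hug_score, openai_score)
```

Either way, the wrapped function returns one number per (source, response) pair, which can then be compared against the expected score.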
# Build a ground truth object from the golden set
ground_truth = GroundTruthAgreement(groundedness_golden_set)
# Create a Feedback using the mae method of the ground_truth object; it takes the two call arguments (source, response) and the returned score
f_mae = Feedback(ground_truth.mae, name = "Mean Absolute Error").on(Select.Record.calls[0].args.args[0]).on(Select.Record.calls[0].args.args[1]).on_output()
✅ In Mean Absolute Error, input prompt will be set to __record__.calls[0].args.args[0] . ✅ In Mean Absolute Error, input response will be set to __record__.calls[0].args.args[1] . ✅ In Mean Absolute Error, input score will be set to __record__.main_output or `Select.RecordOutput` .
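For reference, mean absolute error averages the absolute difference between each predicted score and its expected score, so lower is better. Below is a minimal standalone sketch; `mean_absolute_error` is a hypothetical helper, not the TruLens implementation, and the numbers are illustrative only.

```python
def mean_absolute_error(predicted, expected):
    """Average of |predicted - expected| over paired scores."""
    if not predicted or len(predicted) != len(expected):
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(abs(p - e) for p, e in zip(predicted, expected)) / len(predicted)


# Illustrative numbers only, not results from this benchmark
predicted_scores = [0.5, 0.25, 1.0]
expected_scores = [0.25, 0.5, 0.75]
print(mean_absolute_error(predicted_scores, expected_scores))
```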
tru_wrapped_groundedness_hug = TruBasicApp(wrapped_groundedness_hug, app_id = "groundedness huggingface", feedbacks=[f_mae])
tru_wrapped_groundedness_openai = TruBasicApp(wrapped_groundedness_openai, app_id = "groundedness openai gpt-3.5", feedbacks=[f_mae])
tru_wrapped_groundedness_openai_gpt4 = TruBasicApp(wrapped_groundedness_openai_gpt4, app_id = "groundedness openai gpt-4", feedbacks=[f_mae])
✅ added app groundedness huggingface ✅ added feedback definition feedback_definition_hash_ca9bbb2338965a9d34e17ae53f641d9c ✅ added app groundedness openai ✅ added feedback definition feedback_definition_hash_ca9bbb2338965a9d34e17ae53f641d9c ✅ added app groundedness openai gpt4 ✅ added feedback definition feedback_definition_hash_ca9bbb2338965a9d34e17ae53f641d9c
for i in range(len(groundedness_golden_set)):
    source = groundedness_golden_set[i]["query"]
    response = groundedness_golden_set[i]["response"]
    with tru_wrapped_groundedness_hug as recording:
        tru_wrapped_groundedness_hug.app(source, response)
    with tru_wrapped_groundedness_openai as recording:
        tru_wrapped_groundedness_openai.app(source, response)
    with tru_wrapped_groundedness_openai_gpt4 as recording:
        tru_wrapped_groundedness_openai_gpt4.app(source, response)
Tru().get_leaderboard(app_ids=[]).sort_values(by="Mean Absolute Error")
app_id | Mean Absolute Error | latency | total_cost |
---|---|---|---|
groundedness huggingface | 0.251471 | 2.4 | 0.000000 |
groundedness openai | 2.371200 | 2.4 | 0.001344 |
groundedness openai gpt4 | 2.371200 | 2.4 | 0.001464 |