Stock Feedback Functions¶
Classification-based¶
🤗 Huggingface¶
API Reference: Huggingface.
Out of the box feedback functions calling Huggingface APIs.
context_relevance
¶
Uses Huggingface's truera/context_relevance model, which computes the relevance of a given context to the prompt. The model can be found at https://huggingface.co/truera/context_relevance. Usage:
from trulens_eval import Feedback
from trulens_eval.feedback.provider.hugs import Huggingface
huggingface_provider = Huggingface()
feedback = Feedback(huggingface_provider.context_relevance).on_input_output()
The on_input_output() selector can be changed. See Feedback Function Guide
hallucination_evaluator
¶
Evaluates the hallucination score for a combined input of two statements as a float 0<x<1 representing a
true/false boolean. If the return value is greater than 0.5, the statement is evaluated as true. If the
return value is less than 0.5, the statement is evaluated as a hallucination.
Example
from trulens_eval.feedback.provider.hugs import Huggingface
huggingface_provider = Huggingface()
score = huggingface_provider.hallucination_evaluator("The sky is blue. [SEP] Apples are red , the grass is green.")
Args:
model_output (str): This is what an LLM returns based on the text chunks retrieved during RAG
retrieved_text_chunk (str): These are the text chunks you have retrieved during RAG
Returns:
float: Hallucination score
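As a rough sketch of how the returned score might be read, the snippet below applies the 0.5 threshold described above. The helper name `interpret_score` and its "undetermined" result for a score of exactly 0.5 are assumptions made for this illustration; they are not part of trulens_eval, and the scores passed in are made-up values rather than real model output.

```python
def interpret_score(score: float, threshold: float = 0.5) -> str:
    """Map a hallucination_evaluator-style score to a label.

    Hypothetical helper: scores above the threshold are read as "true",
    scores below it as "hallucination". The text does not say what happens
    at exactly 0.5, so this sketch reports that case separately.
    """
    if score > threshold:
        return "true"
    if score < threshold:
        return "hallucination"
    return "undetermined"


# Made-up scores for illustration only.
for example_score in (0.91, 0.12, 0.5):
    print(example_score, "->", interpret_score(example_score))
```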
language_match
¶
Uses Huggingface's papluca/xlm-roberta-base-language-detection model. A
function that uses language detection on text1 and text2 and
calculates the probit difference on the language detected on text1. The
function is: 1.0 - |probit_language_text1(text1) - probit_language_text1(text2)|
Example
from trulens_eval import Feedback
from trulens_eval.feedback.provider.hugs import Huggingface
huggingface_provider = Huggingface()
feedback = Feedback(huggingface_provider.language_match).on_input_output()
The on_input_output()
selector can be changed. See Feedback Function
Guide
Returns:
float: A value between 0 and 1. 0 being "different languages" and 1
being "same languages".
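The formula above can be illustrated with a small, self-contained sketch. It assumes the language detector returns one probability per language label for each text; the function name `language_match_score` and the probability values are hypothetical and only stand in for real model output.

```python
def language_match_score(probs_text1: dict, probs_text2: dict) -> float:
    """Compute 1.0 - |p_lang(text1) - p_lang(text2)|.

    `lang` is the language detected on text1, i.e. the label with the
    highest probability in probs_text1. Inputs are hypothetical per-label
    probabilities, not real detector output.
    """
    detected = max(probs_text1, key=probs_text1.get)
    p1 = probs_text1[detected]
    p2 = probs_text2.get(detected, 0.0)  # label missing for text2: treat as 0
    return 1.0 - abs(p1 - p2)


# Both texts look English: the score is close to 1.
same = language_match_score({"en": 0.98, "es": 0.02}, {"en": 0.95, "es": 0.05})
# text1 looks English, text2 looks Spanish: the score is close to 0.
different = language_match_score({"en": 0.98, "es": 0.02}, {"en": 0.03, "es": 0.97})
print(round(same, 2), round(different, 2))
```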
pii_detection
¶
NER model to detect PII.
Example
from trulens_eval import Feedback
from trulens_eval.feedback.provider.hugs import Huggingface
hugs = Huggingface()
# Define a pii_detection feedback function using HuggingFace.
f_pii_detection = Feedback(hugs.pii_detection).on_input()
The on(...) selector can be changed. See Feedback Function Guide: Selectors
pii_detection_with_cot_reasons
¶
NER model to detect PII, with reasons.
Example
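from trulens_eval import Feedback
from trulens_eval.feedback.provider.hugs import Huggingface
hugs = Huggingface()
# Define a pii_detection_with_cot_reasons feedback function using HuggingFace.
f_pii_detection = Feedback(hugs.pii_detection_with_cot_reasons).on_input()
The on(...) selector can be changed. See Feedback Function Guide: Selectors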
positive_sentiment
¶
Uses Huggingface's cardiffnlp/twitter-roberta-base-sentiment model. A
function that uses a sentiment classifier on text.
Example
from trulens_eval import Feedback
from trulens_eval.feedback.provider.hugs import Huggingface
huggingface_provider = Huggingface()
feedback = Feedback(huggingface_provider.positive_sentiment).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
toxic
¶
Uses Huggingface's martin-ha/toxic-comment-model model. A function that
uses a toxic comment classifier on text.
Example
from trulens_eval import Feedback
from trulens_eval.feedback.provider.hugs import Huggingface
huggingface_provider = Huggingface()
feedback = Feedback(huggingface_provider.toxic).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
OpenAI¶
API Reference: OpenAI.
Out of the box feedback functions calling OpenAI APIs.
Create an OpenAI Provider with out of the box feedback functions.
Example
from trulens_eval.feedback.provider.openai import OpenAI
openai_provider = OpenAI()
moderation_harassment
¶
Uses OpenAI's Moderation API. A function that checks if text is about harassment.
Example
from trulens_eval import Feedback
from trulens_eval.feedback.provider.openai import OpenAI
openai_provider = OpenAI()
feedback = Feedback(
openai_provider.moderation_harassment, higher_is_better=False
).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
moderation_harassment_threatening
¶
Uses OpenAI's Moderation API. A function that checks if text is about threatening harassment.
Example
from trulens_eval import Feedback
from trulens_eval.feedback.provider.openai import OpenAI
openai_provider = OpenAI()
feedback = Feedback(
openai_provider.moderation_harassment_threatening, higher_is_better=False
).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
moderation_hate
¶
Uses OpenAI's Moderation API. A function that checks if text is hate speech.
Example
from trulens_eval import Feedback
from trulens_eval.feedback.provider.openai import OpenAI
openai_provider = OpenAI()
feedback = Feedback(
openai_provider.moderation_hate, higher_is_better=False
).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
moderation_hatethreatening
¶
Uses OpenAI's Moderation API. A function that checks if text is threatening hate speech.
Example
from trulens_eval import Feedback
from trulens_eval.feedback.provider.openai import OpenAI
openai_provider = OpenAI()
feedback = Feedback(
openai_provider.moderation_hatethreatening, higher_is_better=False
).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
moderation_selfharm
¶
Uses OpenAI's Moderation API. A function that checks if text is about self harm.
Example
from trulens_eval import Feedback
from trulens_eval.feedback.provider.openai import OpenAI
openai_provider = OpenAI()
feedback = Feedback(
openai_provider.moderation_selfharm, higher_is_better=False
).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
moderation_sexual
¶
Uses OpenAI's Moderation API. A function that checks if text is sexual speech.
Example
from trulens_eval import Feedback
from trulens_eval.feedback.provider.openai import OpenAI
openai_provider = OpenAI()
feedback = Feedback(
openai_provider.moderation_sexual, higher_is_better=False
).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
moderation_sexualminors
¶
Uses OpenAI's Moderation API. A function that checks if text is sexual content involving minors.
Example
from trulens_eval import Feedback
from trulens_eval.feedback.provider.openai import OpenAI
openai_provider = OpenAI()
feedback = Feedback(
openai_provider.moderation_sexualminors, higher_is_better=False
).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
moderation_violence
¶
Uses OpenAI's Moderation API. A function that checks if text is about violence.
Example
from trulens_eval import Feedback
from trulens_eval.feedback.provider.openai import OpenAI
openai_provider = OpenAI()
feedback = Feedback(
openai_provider.moderation_violence, higher_is_better=False
).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
moderation_violencegraphic
¶
Uses OpenAI's Moderation API. A function that checks if text is about graphic violence.
Example
from trulens_eval import Feedback
from trulens_eval.feedback.provider.openai import OpenAI
openai_provider = OpenAI()
feedback = Feedback(
openai_provider.moderation_violencegraphic, higher_is_better=False
).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
Generation-based: LLMProvider¶
API Reference: LLMProvider.
An LLM-based provider.
This is an abstract class and needs to be initialized as one of these:
- OpenAI and subclass AzureOpenAI.
- LiteLLM. LiteLLM provides an interface to a wide range of models.
coherence
¶
Uses chat completion model. A function that completes a template to check the coherence of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.coherence).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
coherence_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the coherence of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.coherence_with_cot_reasons).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
comprehensiveness_with_cot_reasons
¶
Uses chat completion model. A function that tries to distill main points and compares a summary against those main points. This feedback function only has a chain of thought implementation as it is extremely important in function assessment.
Example
feedback = Feedback(provider.comprehensiveness_with_cot_reasons).on_input_output()
conciseness
¶
Uses chat completion model. A function that completes a template to check the conciseness of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.conciseness).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
conciseness_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the conciseness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.conciseness_with_cot_reasons).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
context_relevance
¶
Uses chat completion model. A function that completes a template to check the relevance of the context to the question.
Example
import numpy as np
from trulens_eval.app import App
context = App.select_context(rag_app)
feedback = (
    Feedback(provider.context_relevance)
    .on_input()
    .on(context)
    .aggregate(np.mean)
)
The on(...) selector can be changed. See Feedback Function Guide: Selectors
context_relevance_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the relevance of the context to the question. Also uses chain of thought methodology and emits the reasons.
Example
import numpy as np
from trulens_eval.app import App
context = App.select_context(rag_app)
feedback = (
    Feedback(provider.context_relevance_with_cot_reasons)
    .on_input()
    .on(context)
    .aggregate(np.mean)
)
The on(...) selector can be changed. See Feedback Function Guide: Selectors
controversiality
¶
Uses chat completion model. A function that completes a template to check the controversiality of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.controversiality).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
controversiality_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the controversiality of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.controversiality_with_cot_reasons).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
correctness
¶
Uses chat completion model. A function that completes a template to check the correctness of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.correctness).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
correctness_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the correctness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.correctness_with_cot_reasons).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
criminality
¶
Uses chat completion model. A function that completes a template to check the criminality of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.criminality).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
criminality_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the criminality of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.criminality_with_cot_reasons).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
generate_score
¶
Base method to generate a score only, used for evaluation.
generate_score_and_reasons
¶
Base method to generate a score and reason, used for evaluation.
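The text above does not spell out how a raw LLM reply becomes a score. One plausible approach, shown purely as a sketch, is to ask the model for a rating on a fixed scale and normalize it to the range 0 to 1. The `parse_score` helper and the 0 to 10 scale are assumptions made for illustration, not the provider's actual implementation.

```python
import re


def parse_score(llm_reply: str, max_rating: float = 10.0) -> float:
    """Return the first integer in [0, max_rating] found in the reply,
    scaled to [0, 1]. Hypothetical helper, not a trulens_eval function."""
    for match in re.finditer(r"\d+", llm_reply):
        value = int(match.group())
        if 0 <= value <= max_rating:
            return value / max_rating
    raise ValueError("no rating found in reply")


# Made-up replies for illustration only.
print(parse_score("Score: 8"))
print(parse_score("Rating: 7 out of 10"))
```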
harmfulness
¶
Uses chat completion model. A function that completes a template to check the harmfulness of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.harmfulness).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
harmfulness_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the harmfulness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.harmfulness_with_cot_reasons).on_output()
helpfulness
¶
Uses chat completion model. A function that completes a template to check the helpfulness of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.helpfulness).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
helpfulness_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the helpfulness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.helpfulness_with_cot_reasons).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
insensitivity
¶
Uses chat completion model. A function that completes a template to check the insensitivity of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.insensitivity).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
insensitivity_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the insensitivity of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.insensitivity_with_cot_reasons).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
maliciousness
¶
Uses chat completion model. A function that completes a template to check the maliciousness of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.maliciousness).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
maliciousness_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the maliciousness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.maliciousness_with_cot_reasons).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
misogyny
¶
Uses chat completion model. A function that completes a template to check the misogyny of some text. Prompt credit to LangChain Eval.
Example
feedback = Feedback(provider.misogyny).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
misogyny_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the misogyny of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.misogyny_with_cot_reasons).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
model_agreement
¶
Uses chat completion model. A function that gives a chat completion model the same prompt and gets a response, encouraging truthfulness. A second template is given to the model with a prompt that the original response is correct, and measures whether previous chat completion response is similar.
Example
feedback = Feedback(provider.model_agreement).on_input_output()
The on_input_output()
selector can be changed. See Feedback Function
Guide
qs_relevance
¶
Question statement relevance is deprecated and will be removed in future versions. Please use context relevance in its place.
qs_relevance_with_cot_reasons
¶
Question statement relevance is deprecated and will be removed in future versions. Please use context relevance in its place.
relevance
¶
Uses chat completion model. A function that completes a template to check the relevance of the response to a prompt.
Example
feedback = Feedback(provider.relevance).on_input_output()
The on_input_output()
selector can be changed. See Feedback Function
Guide
Usage on RAG Contexts
feedback = Feedback(provider.relevance).on_input().on(
TruLlama.select_source_nodes().node.text # See note below
).aggregate(np.mean)
The on(...) selector can be changed. See Feedback Function Guide: Selectors
relevance_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the relevance of the response to a prompt. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.relevance_with_cot_reasons).on_input_output()
The on_input_output()
selector can be changed. See Feedback Function
Guide
Usage on RAG Contexts
feedback = Feedback(provider.relevance_with_cot_reasons).on_input().on(
TruLlama.select_source_nodes().node.text # See note below
).aggregate(np.mean)
The on(...) selector can be changed. See Feedback Function Guide: Selectors
sentiment
¶
Uses chat completion model. A function that completes a template to check the sentiment of some text.
Example
feedback = Feedback(provider.sentiment).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
sentiment_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check the sentiment of some text. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.sentiment_with_cot_reasons).on_output()
The on_output()
selector can be changed. See Feedback Function
Guide
stereotypes
¶
Uses chat completion model. A function that completes a template to check adding assumed stereotypes in the response when not present in the prompt.
Example
feedback = Feedback(provider.stereotypes).on_input_output()
stereotypes_with_cot_reasons
¶
Uses chat completion model. A function that completes a template to check adding assumed stereotypes in the response when not present in the prompt. Also uses chain of thought methodology and emits the reasons.
Example
feedback = Feedback(provider.stereotypes_with_cot_reasons).on_input_output()
summarization_with_cot_reasons
¶
Summarization is deprecated in favor of comprehensiveness. Defaulting to comprehensiveness_with_cot_reasons.
Embedding-based¶
API Reference: Embeddings.
Embedding related feedback function implementations.
cosine_distance
¶
Runs cosine distance on the query and document embeddings
Example
Below is just one example. See supported embedders: https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/embeddings/root.html
from langchain.embeddings.openai import OpenAIEmbeddings
from trulens_eval import Select, feedback
model_name = 'text-embedding-ada-002'
embed_model = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)
# Create the feedback function
f_embed = feedback.Embeddings(embed_model=embed_model)
f_embed_dist = (
    feedback.Feedback(f_embed.cosine_distance)
    .on_input()
    .on(Select.Record.app.combine_documents_chain._call.args.inputs.input_documents[:].page_content)
)
The on(...) selector can be changed. See Feedback Function Guide: Selectors
euclidean_distance
¶
Runs L2 distance on the query and document embeddings
Example
Below is just one example. See supported embedders: https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/embeddings/root.html
from langchain.embeddings.openai import OpenAIEmbeddings
from trulens_eval import Select, feedback
model_name = 'text-embedding-ada-002'
embed_model = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)
# Create the feedback function
f_embed = feedback.Embeddings(embed_model=embed_model)
f_embed_dist = (
    feedback.Feedback(f_embed.euclidean_distance)
    .on_input()
    .on(Select.Record.app.combine_documents_chain._call.args.inputs.input_documents[:].page_content)
)
The on(...) selector can be changed. See Feedback Function Guide: Selectors
manhattan_distance
¶
Runs L1 distance on the query and document embeddings
Example
Below is just one example. See supported embedders: https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/embeddings/root.html
from langchain.embeddings.openai import OpenAIEmbeddings
from trulens_eval import Select, feedback
model_name = 'text-embedding-ada-002'
embed_model = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)
# Create the feedback function
f_embed = feedback.Embeddings(embed_model=embed_model)
f_embed_dist = (
    feedback.Feedback(f_embed.manhattan_distance)
    .on_input()
    .on(Select.Record.app.combine_documents_chain._call.args.inputs.input_documents[:].page_content)
)
The on(...) selector can be changed. See Feedback Function Guide: Selectors
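To make the three distance measures concrete, here is a minimal pure-Python sketch of cosine, L2 and L1 distance between two embedding vectors. It is an illustration only: the helper functions and the toy vectors are assumptions made for this example and are not how the Embeddings class computes its results internally.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """L2 distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """L1 distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Toy 2-dimensional "embeddings" for a query and a document.
query_vec = [1.0, 0.0]
doc_vec = [0.0, 1.0]
print(cosine_distance(query_vec, doc_vec))
print(euclidean_distance(query_vec, doc_vec))
print(manhattan_distance(query_vec, doc_vec))
```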
Combinators¶
Groundedness¶
API Reference: Groundedness
Measures Groundedness.
Currently the groundedness functions work well with a summarizer. This class will use an LLM to find the relevant strings in a text. The groundedness_provider can either be an LLM provider (such as OpenAI) or NLI with huggingface.
Example
from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider.openai import OpenAI
openai_provider = OpenAI()
groundedness_imp = Groundedness(groundedness_provider=openai_provider)
Example
from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider.hugs import Huggingface
huggingface_provider = Huggingface()
groundedness_imp = Groundedness(groundedness_provider=huggingface_provider)
grounded_statements_aggregator
¶
Compute the mean groundedness based on the best evidence available for each statement.
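A minimal sketch of this kind of aggregation follows. It assumes each source chunk yields a dictionary mapping every statement to a score between 0 and 1; that input shape and the function name `aggregate_groundedness` are assumptions for illustration and may differ from what the library passes around.

```python
from statistics import mean


def aggregate_groundedness(per_source_scores):
    """Take the best (highest) score each statement got from any source,
    then average those best scores.

    per_source_scores: list of dicts, one per source chunk, each mapping
    statement -> score in [0, 1]. This shape is an assumption.
    """
    best = {}
    for scores in per_source_scores:
        for statement, score in scores.items():
            best[statement] = max(best.get(statement, 0.0), score)
    if not best:
        raise ValueError("no statements to aggregate")
    return mean(best.values())


# Made-up scores: two source chunks, two statements.
example = [
    {"statement 1": 0.2, "statement 2": 0.9},
    {"statement 1": 0.8, "statement 2": 0.1},
]
print(aggregate_groundedness(example))
```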
groundedness_measure
¶
Groundedness measure is deprecated in favor of the chain-of-thought version. This function will raise a NotImplementedError.
groundedness_measure_with_cot_reasons
¶
A measure to track if the source material supports each sentence in the statement using an LLM provider.
The LLM will process the entire statement at once, using chain of thought methodology to emit the reasons.
Usage on RAG Contexts
from trulens_eval import Feedback, Select
from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider.openai import OpenAI
grounded = Groundedness(groundedness_provider=OpenAI())
f_groundedness = Feedback(grounded.groundedness_measure_with_cot_reasons).on(
    Select.Record.app.combine_documents_chain._call.args.inputs.input_documents[:].page_content # See note below
).on_output().aggregate(grounded.grounded_statements_aggregator)
The on(...) selector can be changed. See Feedback Function Guide: Selectors
groundedness_measure_with_nli
¶
A measure to track if the source material supports each sentence in the statement using an NLI model.
First, the response is split into statements using a sentence tokenizer. The NLI model then checks each statement against the entire source.
Usage on RAG Contexts:
from trulens_eval import Feedback, Select
from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider.hugs import Huggingface
grounded = Groundedness(groundedness_provider=Huggingface())
f_groundedness = Feedback(grounded.groundedness_measure_with_nli).on(
    Select.Record.app.combine_documents_chain._call.args.inputs.input_documents[:].page_content # See note below
).on_output().aggregate(grounded.grounded_statements_aggregator)
The on(...) selector can be changed. See Feedback Function Guide: Selectors
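The split-then-score flow described for this measure can be sketched without any model at all. In the snippet below, the regex sentence splitter is a naive stand-in for a real sentence tokenizer, and `overlap_entailment` is a crude word-overlap stand-in for an NLI model; both, along with every function name, are assumptions for illustration only.

```python
import re


def split_sentences(text):
    """Naive sentence splitter; a real tokenizer handles abbreviations etc."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def overlap_entailment(premise, hypothesis):
    """Stand-in for an NLI model: fraction of hypothesis words found in the premise."""
    premise_words = set(re.findall(r"\w+", premise.lower()))
    hypothesis_words = re.findall(r"\w+", hypothesis.lower())
    if not hypothesis_words:
        return 0.0
    hits = sum(word in premise_words for word in hypothesis_words)
    return hits / len(hypothesis_words)


def groundedness_per_statement(source, response, scorer=overlap_entailment):
    """Score every sentence of the response against the entire source."""
    return {s: scorer(source, s) for s in split_sentences(response)}


source_text = "The Eiffel Tower is in Paris. It opened in 1889."
response_text = "The Eiffel Tower is in Paris. Bananas are purple."
scores = groundedness_per_statement(source_text, response_text)
for statement, score in scores.items():
    print(f"{score:.2f}  {statement}")
```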
groundedness_measure_with_summarize_step
¶
DEPRECATED: This method is deprecated and will be removed in a future release. Please use alternative groundedness measure methods.
A measure to track if the source material supports each sentence in the statement. This groundedness measure is more accurate but slower, using a two-step process:
- First find supporting evidence with an LLM
- Then for each statement sentence, check groundedness
Usage on RAG Contexts:
from trulens_eval import Feedback, Select
from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider.openai import OpenAI
grounded = Groundedness(groundedness_provider=OpenAI())
f_groundedness = Feedback(grounded.groundedness_measure_with_summarize_step).on(
    Select.Record.app.combine_documents_chain._call.args.inputs.input_documents[:].page_content # See note below
).on_output().aggregate(grounded.grounded_statements_aggregator)
The on(...) selector can be changed. See Feedback Function Guide: Selectors
Ground Truth Agreement¶
API Reference: GroundTruthAgreement
Measures Agreement against a Ground Truth.
agreement_measure
¶
Uses OpenAI's Chat GPT Model. A function that measures similarity to ground truth. A second template is given to Chat GPT with a prompt that the original response is correct, and measures whether the previous Chat GPT response is similar.
Example
from trulens_eval import Feedback
from trulens_eval.feedback import GroundTruthAgreement
golden_set = [
    {"query": "who invented the lightbulb?", "response": "Thomas Edison"},
    {"query": "¿quien invento la bombilla?", "response": "Thomas Edison"}
]
ground_truth_collection = GroundTruthAgreement(golden_set)
feedback = Feedback(ground_truth_collection.agreement_measure).on_input_output()
The on_input_output() selector can be changed. See Feedback Function Guide
bert_score
¶
Uses BERT Score. A function that measures similarity to ground truth using BERT embeddings.
Example
from trulens_eval import Feedback
from trulens_eval.feedback import GroundTruthAgreement
golden_set = [
    {"query": "who invented the lightbulb?", "response": "Thomas Edison"},
    {"query": "¿quien invento la bombilla?", "response": "Thomas Edison"}
]
ground_truth_collection = GroundTruthAgreement(golden_set)
feedback = Feedback(ground_truth_collection.bert_score).on_input_output()
The on_input_output() selector can be changed. See Feedback Function Guide
bleu
¶
Uses BLEU Score. A function that measures similarity to ground truth using token overlap.
Example
from trulens_eval import Feedback
from trulens_eval.feedback import GroundTruthAgreement
golden_set = [
    {"query": "who invented the lightbulb?", "response": "Thomas Edison"},
    {"query": "¿quien invento la bombilla?", "response": "Thomas Edison"}
]
ground_truth_collection = GroundTruthAgreement(golden_set)
feedback = Feedback(ground_truth_collection.bleu).on_input_output()
The on_input_output() selector can be changed. See Feedback Function Guide
mae
¶
Method to look up the numeric expected score from a golden set and take the difference.
Primarily used for evaluation of model generated feedback against human feedback.
Example
from trulens_eval import Feedback, Select
from trulens_eval.feedback import GroundTruthAgreement
golden_set = [
    {"query": "How many stomachs does a cow have?", "response": "Cows' diet relies primarily on grazing.", "expected_score": 0.4},
    {"query": "Name some top dental floss brands", "response": "I don't know", "expected_score": 0.8}
]
ground_truth_collection = GroundTruthAgreement(golden_set)
f_groundtruth = Feedback(ground_truth_collection.mae).on(Select.Record.calls[0].args.args[0]).on(Select.Record.calls[0].args.args[1]).on_output()
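The lookup-and-difference idea can be sketched in a few lines. The function below is a hypothetical illustration, not the library's method: it assumes the golden set entries have the shape shown in the example, matches on both query and response, and returns the absolute difference for a single pair.

```python
def absolute_error(golden_set, query, response, score):
    """Find the golden entry for (query, response) and return
    |expected_score - score|, or None when no entry matches.
    Hypothetical helper for illustration only."""
    for entry in golden_set:
        if entry["query"] == query and entry["response"] == response:
            return abs(entry["expected_score"] - score)
    return None


golden_set = [
    {"query": "How many stomachs does a cow have?",
     "response": "Cows' diet relies primarily on grazing.",
     "expected_score": 0.4},
    {"query": "Name some top dental floss brands",
     "response": "I don't know",
     "expected_score": 0.8},
]

# A model-generated score of 0.5 compared with the human score of 0.8.
error = absolute_error(
    golden_set, "Name some top dental floss brands", "I don't know", 0.5
)
print(error)
```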
rouge
¶
Uses ROUGE Score. A function that measures similarity to ground truth using token overlap.
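To give a feel for what token overlap means here, the sketch below computes a simplified unigram recall: the share of ground-truth tokens that also appear in the response. This is a deliberately reduced stand-in in the spirit of ROUGE-1 recall, with whitespace tokenization and a hypothetical function name; it is not the scoring the library performs.

```python
from collections import Counter


def unigram_recall(candidate: str, reference: str) -> float:
    """Fraction of reference tokens matched in the candidate, counting
    each candidate token at most as often as it occurs (clipped counts)."""
    candidate_counts = Counter(candidate.lower().split())
    reference_counts = Counter(reference.lower().split())
    if not reference_counts:
        return 0.0
    overlap = sum(
        min(count, candidate_counts[token])
        for token, count in reference_counts.items()
    )
    return overlap / sum(reference_counts.values())


print(unigram_recall("Thomas Edison invented it", "Thomas Edison"))
print(unigram_recall("Nikola Tesla", "Thomas Edison"))
```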