Stock Feedback Functions¶

Classification-based¶

🤗 Huggingface¶

API Reference: Huggingface.

Out-of-the-box feedback functions that call Huggingface APIs.

OpenAI¶

API Reference: OpenAI.

Out-of-the-box feedback functions that call OpenAI APIs.

Create an OpenAI Provider with out of the box feedback functions.

Example

from trulens_eval.feedback.provider.openai import OpenAI 
openai_provider = OpenAI()

`moderation_harassment` ¶

Uses OpenAI's Moderation API. A function that checks if text contains harassment.

Example

from trulens_eval import Feedback
from trulens_eval.feedback.provider.openai import OpenAI
openai_provider = OpenAI()

feedback = Feedback(
    openai_provider.moderation_harassment, higher_is_better=False
).on_output()

`moderation_harassment_threatening` ¶

Uses OpenAI's Moderation API. A function that checks if text contains threatening harassment.

Example

from trulens_eval import Feedback
from trulens_eval.feedback.provider.openai import OpenAI
openai_provider = OpenAI()

feedback = Feedback(
    openai_provider.moderation_harassment_threatening, higher_is_better=False
).on_output()

`moderation_hate` ¶

Uses OpenAI's Moderation API. A function that checks if text is hate speech.

Example

from trulens_eval import Feedback
from trulens_eval.feedback.provider.openai import OpenAI
openai_provider = OpenAI()

feedback = Feedback(
    openai_provider.moderation_hate, higher_is_better=False
).on_output()

`moderation_hatethreatening` ¶

Uses OpenAI's Moderation API. A function that checks if text is threatening hate speech.

Example

from trulens_eval import Feedback
from trulens_eval.feedback.provider.openai import OpenAI
openai_provider = OpenAI()

feedback = Feedback(
    openai_provider.moderation_hatethreatening, higher_is_better=False
).on_output()

`moderation_selfharm` ¶

Uses OpenAI's Moderation API. A function that checks if text is about self harm.

Example

from trulens_eval import Feedback
from trulens_eval.feedback.provider.openai import OpenAI
openai_provider = OpenAI()

feedback = Feedback(
    openai_provider.moderation_selfharm, higher_is_better=False
).on_output()

`moderation_sexual` ¶

Uses OpenAI's Moderation API. A function that checks if text is sexual speech.

Example

from trulens_eval import Feedback
from trulens_eval.feedback.provider.openai import OpenAI
openai_provider = OpenAI()

feedback = Feedback(
    openai_provider.moderation_sexual, higher_is_better=False
).on_output()

`moderation_sexualminors` ¶

Uses OpenAI's Moderation API. A function that checks if text is sexual content involving minors.

Example

from trulens_eval import Feedback
from trulens_eval.feedback.provider.openai import OpenAI
openai_provider = OpenAI()

feedback = Feedback(
    openai_provider.moderation_sexualminors, higher_is_better=False
).on_output()

`moderation_violence` ¶

Uses OpenAI's Moderation API. A function that checks if text is about violence.

Example

from trulens_eval import Feedback
from trulens_eval.feedback.provider.openai import OpenAI
openai_provider = OpenAI()

feedback = Feedback(
    openai_provider.moderation_violence, higher_is_better=False
).on_output()

`moderation_violencegraphic` ¶

Uses OpenAI's Moderation API. A function that checks if text is about graphic violence.

Example

from trulens_eval import Feedback
from trulens_eval.feedback.provider.openai import OpenAI
openai_provider = OpenAI()

feedback = Feedback(
    openai_provider.moderation_violencegraphic, higher_is_better=False
).on_output()

Generation-based: LLMProvider¶

API Reference: LLMProvider.

An LLM-based provider.

This is an abstract class and needs to be initialized as one of these:

OpenAI and subclass AzureOpenAI.
Bedrock.
LiteLLM. LiteLLM provides an interface to a wide range of models.
Langchain.

`coherence` ¶

Uses chat completion model. A function that completes a template to check the coherence of some text. Prompt credit to LangChain Eval.

Example

feedback = Feedback(provider.coherence).on_output()

`coherence_with_cot_reasons` ¶

Uses chat completion model. A function that completes a template to check the coherence of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = Feedback(provider.coherence_with_cot_reasons).on_output()

`comprehensiveness_with_cot_reasons` ¶

Uses chat completion model. A function that tries to distill the main points of a text and compares a summary against those main points. This feedback function has only a chain of thought implementation, because the reasons are extremely important when assessing this function.

Example

feedback = Feedback(provider.comprehensiveness_with_cot_reasons).on_input_output()

`conciseness` ¶

Uses chat completion model. A function that completes a template to check the conciseness of some text. Prompt credit to LangChain Eval.

Example

feedback = Feedback(provider.conciseness).on_output()

`conciseness_with_cot_reasons` ¶

Uses chat completion model. A function that completes a template to check the conciseness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = Feedback(provider.conciseness_with_cot_reasons).on_output()

Args: text: The text to evaluate the conciseness of.

`context_relevance` ¶

Uses chat completion model. A function that completes a template to check the relevance of the context to the question.

Example

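import numpy as np
from trulens_eval import Feedback
from trulens_eval.app import App

# `rag_app` is your RAG application; `provider` is an LLMProvider instance.
context = App.select_context(rag_app)
feedback = (
    Feedback(provider.context_relevance)
    .on_input()
    .on(context)
    .aggregate(np.mean)
    )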

`context_relevance_with_cot_reasons` ¶

Uses chat completion model. A function that completes a template to check the relevance of the context to the question. Also uses chain of thought methodology and emits the reasons.

Example

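import numpy as np
from trulens_eval import Feedback
from trulens_eval.app import App

# `rag_app` is your RAG application; `provider` is an LLMProvider instance.
context = App.select_context(rag_app)
feedback = (
    Feedback(provider.context_relevance_with_cot_reasons)
    .on_input()
    .on(context)
    .aggregate(np.mean)
    )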
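The `.aggregate(np.mean)` step in the examples above combines the scores computed for each retrieved context into a single value. As a rough illustration only (not the TruLens internals), the sketch below uses the standard library's `statistics.mean` as a stand-in for `np.mean`; the helper `aggregate_scores` and the sample scores are hypothetical.

```python
from statistics import mean


def aggregate_scores(scores, aggregator=mean):
    """Collapse per-context scores into one value (hypothetical helper)."""
    if not scores:
        raise ValueError("need at least one score to aggregate")
    return aggregator(scores)


# Made-up relevance scores, one per retrieved context chunk.
context_scores = [0.9, 0.4, 0.2]
print(aggregate_scores(context_scores))        # mean of the three scores
print(aggregate_scores(context_scores, max))   # a different aggregator
```

Swapping the aggregator (for example `max` or `min`) changes whether one highly relevant chunk is enough or every chunk has to be relevant.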

`controversiality` ¶

Uses chat completion model. A function that completes a template to check the controversiality of some text. Prompt credit to LangChain Eval.

Example

feedback = Feedback(provider.controversiality).on_output()

`controversiality_with_cot_reasons` ¶

Uses chat completion model. A function that completes a template to check the controversiality of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = Feedback(provider.controversiality_with_cot_reasons).on_output()

`correctness` ¶

Uses chat completion model. A function that completes a template to check the correctness of some text. Prompt credit to LangChain Eval.

Example

feedback = Feedback(provider.correctness).on_output()

`correctness_with_cot_reasons` ¶

Uses chat completion model. A function that completes a template to check the correctness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = Feedback(provider.correctness_with_cot_reasons).on_output()

`criminality` ¶

Uses chat completion model. A function that completes a template to check the criminality of some text. Prompt credit to LangChain Eval.

Example

feedback = Feedback(provider.criminality).on_output()

`criminality_with_cot_reasons` ¶

Uses chat completion model. A function that completes a template to check the criminality of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = Feedback(provider.criminality_with_cot_reasons).on_output()

`generate_score` ¶

Base method to generate a score only, used for evaluation.

`generate_score_and_reasons` ¶

Base method to generate a score and reason, used for evaluation.

`groundedness_measure_with_cot_reasons` ¶

A measure to track if the source material supports each sentence in the statement using an LLM provider.

The LLM will process the entire statement at once, using chain of thought methodology to emit the reasons.

Example

from trulens_eval import Feedback
from trulens_eval.feedback.provider.openai import OpenAI

provider = OpenAI()

# `context` is a selector for the retrieved context, e.g. App.select_context(rag_app).
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons)
    .on(context.collect())
    .on_output()
    )

Args: source: The source that should support the statement. statement: The statement to check groundedness.

`harmfulness` ¶

Uses chat completion model. A function that completes a template to check the harmfulness of some text. Prompt credit to LangChain Eval.

Example

feedback = Feedback(provider.harmfulness).on_output()

`harmfulness_with_cot_reasons` ¶

Uses chat completion model. A function that completes a template to check the harmfulness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = Feedback(provider.harmfulness_with_cot_reasons).on_output()

`helpfulness` ¶

Uses chat completion model. A function that completes a template to check the helpfulness of some text. Prompt credit to LangChain Eval.

Example

feedback = Feedback(provider.helpfulness).on_output()

`helpfulness_with_cot_reasons` ¶

Uses chat completion model. A function that completes a template to check the helpfulness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = Feedback(provider.helpfulness_with_cot_reasons).on_output()

`insensitivity` ¶

Uses chat completion model. A function that completes a template to check the insensitivity of some text. Prompt credit to LangChain Eval.

Example

feedback = Feedback(provider.insensitivity).on_output()

`insensitivity_with_cot_reasons` ¶

Uses chat completion model. A function that completes a template to check the insensitivity of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = Feedback(provider.insensitivity_with_cot_reasons).on_output()

`maliciousness` ¶

Uses chat completion model. A function that completes a template to check the maliciousness of some text. Prompt credit to LangChain Eval.

Example

feedback = Feedback(provider.maliciousness).on_output()

`maliciousness_with_cot_reasons` ¶

Uses chat completion model. A function that completes a template to check the maliciousness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = Feedback(provider.maliciousness_with_cot_reasons).on_output()

`misogyny` ¶

Uses chat completion model. A function that completes a template to check the misogyny of some text. Prompt credit to LangChain Eval.

Example

feedback = Feedback(provider.misogyny).on_output()

`misogyny_with_cot_reasons` ¶

Uses chat completion model. A function that completes a template to check the misogyny of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = Feedback(provider.misogyny_with_cot_reasons).on_output()

`model_agreement` ¶

Uses chat completion model. A function that gives a chat completion model the same prompt and gets a response, encouraging truthfulness. A second template is then given to the model, with a prompt stating that the original response is correct, to measure whether the previous chat completion response is similar.

Example

feedback = Feedback(provider.model_agreement).on_input_output()

`relevance` ¶

Uses chat completion model. A function that completes a template to check the relevance of the response to a prompt.

Example

feedback = Feedback(provider.relevance).on_input_output()

Usage on RAG Contexts

import numpy as np
from trulens_eval import TruLlama

feedback = Feedback(provider.relevance).on_input().on(
    TruLlama.select_source_nodes().node.text # See note below
).aggregate(np.mean)

`relevance_with_cot_reasons` ¶

Uses chat completion model. A function that completes a template to check the relevance of the response to a prompt. Also uses chain of thought methodology and emits the reasons.

Example

feedback = (
    Feedback(provider.relevance_with_cot_reasons)
    .on_input()
    .on_output()
    )

`sentiment` ¶

Uses chat completion model. A function that completes a template to check the sentiment of some text.

Example

feedback = Feedback(provider.sentiment).on_output()

`sentiment_with_cot_reasons` ¶

Uses chat completion model. A function that completes a template to check the sentiment of some text. Also uses chain of thought methodology and emits the reasons.

Example

feedback = Feedback(provider.sentiment_with_cot_reasons).on_output()

`stereotypes` ¶

Uses chat completion model. A function that completes a template to check whether the response adds assumed stereotypes that are not present in the prompt.

Example

feedback = Feedback(provider.stereotypes).on_input_output()

`stereotypes_with_cot_reasons` ¶

Uses chat completion model. A function that completes a template to check whether the response adds assumed stereotypes that are not present in the prompt. Also uses chain of thought methodology and emits the reasons.

Example

feedback = Feedback(provider.stereotypes_with_cot_reasons).on_input_output()

`summarization_with_cot_reasons` ¶

Summarization is deprecated in favor of comprehensiveness. This function is no longer implemented.

Embedding-based¶

API Reference: Embeddings.

Embedding related feedback function implementations.

`cosine_distance` ¶

Computes the cosine distance between the query and document embeddings.

Example

Below is just one example. See supported embedders: https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/embeddings/root.html

from langchain.embeddings.openai import OpenAIEmbeddings

model_name = 'text-embedding-ada-002'

embed_model = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)

# Create the feedback function
f_embed = feedback.Embeddings(embed_model=embed_model)
f_embed_dist = (
    feedback.Feedback(f_embed.cosine_distance)
    .on_input()
    .on(Select.Record.app.combine_documents_chain._call.args.inputs.input_documents[:].page_content)
)

The on(...) selector can be changed. See Feedback Function Guide : Selectors
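For intuition about the metric itself, here is a minimal standard-library sketch of the usual definition of cosine distance (one minus cosine similarity). It is an illustration under that assumption, not the library's implementation, and the helper name `cosine_distance_sketch` is hypothetical.

```python
import math


def cosine_distance_sketch(query_embedding, document_embedding):
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(query_embedding) != len(document_embedding):
        raise ValueError("embeddings must have the same length")
    dot = sum(q * d for q, d in zip(query_embedding, document_embedding))
    norm_q = math.sqrt(sum(q * q for q in query_embedding))
    norm_d = math.sqrt(sum(d * d for d in document_embedding))
    if norm_q == 0 or norm_d == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_q * norm_d)


# Toy two-dimensional "embeddings": orthogonal vectors, then parallel vectors.
print(cosine_distance_sketch([1.0, 0.0], [0.0, 1.0]))
print(cosine_distance_sketch([1.0, 2.0], [2.0, 4.0]))
```

Under this definition the value depends only on the angle between the vectors, so scaling an embedding does not change the result.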

`euclidean_distance` ¶

Computes the L2 (Euclidean) distance between the query and document embeddings.

Example

Below is just one example. See supported embedders: https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/embeddings/root.html

from langchain.embeddings.openai import OpenAIEmbeddings

model_name = 'text-embedding-ada-002'

embed_model = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)

# Create the feedback function
f_embed = feedback.Embeddings(embed_model=embed_model)
f_embed_dist = (
    feedback.Feedback(f_embed.euclidean_distance)
    .on_input()
    .on(Select.Record.app.combine_documents_chain._call.args.inputs.input_documents[:].page_content)
)

The on(...) selector can be changed. See Feedback Function Guide : Selectors
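As a reminder of what the L2 metric computes, the following is a small standard-library sketch of Euclidean distance between two vectors. The helper name `euclidean_distance_sketch` is hypothetical and the code is not taken from the library.

```python
import math


def euclidean_distance_sketch(query_embedding, document_embedding):
    """Square root of the summed squared coordinate differences (L2)."""
    if len(query_embedding) != len(document_embedding):
        raise ValueError("embeddings must have the same length")
    return math.sqrt(
        sum((q - d) ** 2 for q, d in zip(query_embedding, document_embedding))
    )


# Toy two-dimensional example: the classic 3-4-5 right triangle.
print(euclidean_distance_sketch([0.0, 0.0], [3.0, 4.0]))
```

Unlike cosine distance, this value grows with vector magnitude, so it is sensitive to whether the embeddings are normalized.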

`manhattan_distance` ¶

Computes the L1 (Manhattan) distance between the query and document embeddings.

Example

Below is just one example. See supported embedders: https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/embeddings/root.html

from langchain.embeddings.openai import OpenAIEmbeddings

model_name = 'text-embedding-ada-002'

embed_model = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)

# Create the feedback function
f_embed = feedback.Embeddings(embed_model=embed_model)
f_embed_dist = (
    feedback.Feedback(f_embed.manhattan_distance)
    .on_input()
    .on(Select.Record.app.combine_documents_chain._call.args.inputs.input_documents[:].page_content)
)

The on(...) selector can be changed. See Feedback Function Guide : Selectors
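The L1 metric can be sketched in the same spirit: it sums the absolute coordinate differences. This is a standard-library illustration with a hypothetical helper name, `manhattan_distance_sketch`, and not the library's code.

```python
def manhattan_distance_sketch(query_embedding, document_embedding):
    """Sum of absolute coordinate differences (L1)."""
    if len(query_embedding) != len(document_embedding):
        raise ValueError("embeddings must have the same length")
    return sum(abs(q - d) for q, d in zip(query_embedding, document_embedding))


# Same toy vectors as a 3-4-5 triangle: the L1 distance is 3 + 4.
print(manhattan_distance_sketch([0.0, 0.0], [3.0, 4.0]))
```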

Combinations¶

Ground Truth Agreement¶

API Reference: GroundTruthAgreement

Measures Agreement against a Ground Truth.

`agreement_measure` ¶

Uses OpenAI's ChatGPT model. A function that measures similarity to ground truth. A second template is given to ChatGPT with a prompt stating that the original response is correct, to measure whether ChatGPT's previous response is similar.

Example

from trulens_eval import Feedback
from trulens_eval.feedback import GroundTruthAgreement
golden_set = [
    {"query": "who invented the lightbulb?", "response": "Thomas Edison"},
    {"query": "¿quien invento la bombilla?", "response": "Thomas Edison"}
]
ground_truth_collection = GroundTruthAgreement(golden_set)

feedback = Feedback(ground_truth_collection.agreement_measure).on_input_output()

The on_input_output() selector can be changed. See Feedback Function Guide

`bert_score` ¶

Uses BERT Score. A function that measures similarity to ground truth using BERT embeddings.

Example

from trulens_eval import Feedback
from trulens_eval.feedback import GroundTruthAgreement
golden_set = [
    {"query": "who invented the lightbulb?", "response": "Thomas Edison"},
    {"query": "¿quien invento la bombilla?", "response": "Thomas Edison"}
]
ground_truth_collection = GroundTruthAgreement(golden_set)

feedback = Feedback(ground_truth_collection.bert_score).on_input_output()

The on_input_output() selector can be changed. See Feedback Function Guide

`bleu` ¶

Uses BLEU Score. A function that measures similarity to ground truth using token overlap.

Example

from trulens_eval import Feedback
from trulens_eval.feedback import GroundTruthAgreement
golden_set = [
    {"query": "who invented the lightbulb?", "response": "Thomas Edison"},
    {"query": "¿quien invento la bombilla?", "response": "Thomas Edison"}
]
ground_truth_collection = GroundTruthAgreement(golden_set)

feedback = Feedback(ground_truth_collection.bleu).on_input_output()

The on_input_output() selector can be changed. See Feedback Function Guide
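To make "token overlap" concrete, the sketch below computes clipped unigram precision, which is one ingredient of BLEU. It is a deliberate simplification (full BLEU also combines higher-order n-grams and a brevity penalty), it uses naive whitespace tokenization, and the helper name `clipped_unigram_precision` is hypothetical.

```python
from collections import Counter


def clipped_unigram_precision(candidate, reference):
    """Fraction of candidate tokens that also occur in the reference.

    Each token's count is clipped to its count in the reference, so
    repeating a matching word does not inflate the score.
    """
    candidate_tokens = candidate.lower().split()
    if not candidate_tokens:
        return 0.0
    candidate_counts = Counter(candidate_tokens)
    reference_counts = Counter(reference.lower().split())
    overlap = sum(
        min(count, reference_counts[token])
        for token, count in candidate_counts.items()
    )
    return overlap / len(candidate_tokens)


print(clipped_unigram_precision("Thomas Edison", "Thomas Edison"))
print(clipped_unigram_precision("the the the", "the cat"))
```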

`mae` ¶

Method to look up the numeric expected score from a golden set and take the difference.

Primarily used for evaluation of model generated feedback against human feedback

Example

from trulens_eval import Feedback, Select
from trulens_eval.feedback import GroundTruthAgreement

golden_set = [
    {"query": "How many stomachs does a cow have?", "response": "Cows' diet relies primarily on grazing.", "expected_score": 0.4},
    {"query": "Name some top dental floss brands", "response": "I don't know", "expected_score": 0.8}
]
ground_truth_collection = GroundTruthAgreement(golden_set)

f_groundtruth = (
    Feedback(ground_truth_collection.mae)
    .on(Select.Record.calls[0].args.args[0])
    .on(Select.Record.calls[0].args.args[1])
    .on_output()
)
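The lookup-and-difference idea can be sketched without the library. The helpers `lookup_expected_score` and `absolute_error` below are hypothetical, and the exact-match lookup on query and response is an assumption for illustration; the library's own matching and return value may differ.

```python
golden_set = [
    {"query": "How many stomachs does a cow have?",
     "response": "Cows' diet relies primarily on grazing.",
     "expected_score": 0.4},
    {"query": "Name some top dental floss brands",
     "response": "I don't know",
     "expected_score": 0.8},
]


def lookup_expected_score(golden_set, query, response):
    """Return the expected score for an exact (query, response) match, else None."""
    for entry in golden_set:
        if entry["query"] == query and entry["response"] == response:
            return entry["expected_score"]
    return None


def absolute_error(golden_set, query, response, score):
    """Absolute difference between a generated score and the expected score."""
    expected = lookup_expected_score(golden_set, query, response)
    if expected is None:
        raise KeyError("no golden-set entry for this query/response pair")
    return abs(expected - score)


# A feedback score of 0.5 compared with the human-labelled 0.8.
print(absolute_error(golden_set, "Name some top dental floss brands", "I don't know", 0.5))
```

Averaging these absolute errors over many records gives the mean absolute error between model-generated and human feedback.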

`rouge` ¶

Uses ROUGE Score. A function that measures similarity to ground truth using token overlap.
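ROUGE-style metrics look at token overlap from the reference's side. As a simplified illustration only, the sketch below computes unigram recall in the style of ROUGE-1 with naive whitespace tokenization; the helper name `unigram_recall` is hypothetical, and real ROUGE implementations offer further variants such as ROUGE-2 and ROUGE-L.

```python
from collections import Counter


def unigram_recall(candidate, reference):
    """Fraction of reference tokens that also appear in the candidate."""
    reference_tokens = reference.lower().split()
    if not reference_tokens:
        return 0.0
    reference_counts = Counter(reference_tokens)
    candidate_counts = Counter(candidate.lower().split())
    overlap = sum(
        min(count, candidate_counts[token])
        for token, count in reference_counts.items()
    )
    return overlap / len(reference_tokens)


# The candidate covers one of the two reference tokens.
print(unigram_recall("Edison", "Thomas Edison"))
```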
