trulens.feedback.groundtruth¶
Classes¶
GroundTruthAgreement
¶
Bases: WithClassInfo, SerialModel
Measures Agreement against a Ground Truth.
Attributes¶
tru_class_info
instance-attribute
¶
tru_class_info: Class
Class information of this pydantic object for use in deserialization.
Using this odd key to not pollute attribute names in whatever class we mix this into. Should be the same as CLASS_INFO.
Functions¶
load
staticmethod
¶
load(obj, *args, **kwargs)
Deserialize/load this object using the class information in tru_class_info to look up the actual class that will do the deserialization.
model_validate
classmethod
¶
model_validate(*args, **kwargs) -> Any
Deserialize a jsonized version of the app into the instance of the class it was serialized from.
Note
This process uses extra information stored in the jsonized object and handled by WithClassInfo.
__init__
¶
__init__(
    ground_truth: Union[
        List[Dict], Callable, DataFrame, FunctionOrMethod
    ],
    provider: Optional[LLMProvider] = None,
    bert_scorer: Optional[BERTScorer] = None,
    **kwargs
)
Measures Agreement against a Ground Truth.
Usage 1:
```python
from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI

golden_set = [
    {"query": "who invented the lightbulb?", "expected_response": "Thomas Edison"},
    {"query": "¿quien invento la bombilla?", "expected_response": "Thomas Edison"}
]

ground_truth_collection = GroundTruthAgreement(golden_set, provider=OpenAI())
```
Usage 2:
```python
from trulens.core import TruSession
from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI

session = TruSession()
# Assuming a dataset "hotpotqa" has been created and persisted in the DB.
ground_truth_dataset = session.get_ground_truths_by_dataset("hotpotqa")

ground_truth_collection = GroundTruthAgreement(ground_truth_dataset, provider=OpenAI())
```
Usage 3:
```python
import os

import snowflake.connector

from trulens.feedback import GroundTruthAgreement
from trulens.providers.cortex import Cortex

# `llm_app` is your application callable; `prompt` is an example input.
ground_truth_imp = llm_app
response = llm_app(prompt)

snowflake_connection_parameters = {
    "account": os.environ["SNOWFLAKE_ACCOUNT"],
    "user": os.environ["SNOWFLAKE_USER"],
    "password": os.environ["SNOWFLAKE_USER_PASSWORD"],
    "database": os.environ["SNOWFLAKE_DATABASE"],
    "schema": os.environ["SNOWFLAKE_SCHEMA"],
    "warehouse": os.environ["SNOWFLAKE_WAREHOUSE"],
}

ground_truth_collection = GroundTruthAgreement(
    ground_truth_imp,
    provider=Cortex(
        snowflake.connector.connect(**snowflake_connection_parameters),
        model_engine="mistral-7b",
    ),
)
```
| PARAMETER | DESCRIPTION |
|---|---|
| `ground_truth` | A list of query/response pairs, a dataframe containing a ground truth dataset, or a callable that returns a ground truth string given a prompt string. TYPE: `Union[List[Dict], Callable, DataFrame, FunctionOrMethod]` |
| `provider` | The provider to use for agreement measures. TYPE: `Optional[LLMProvider]` DEFAULT: `None` |
| `bert_scorer` | Internal usage for DB serialization. TYPE: `Optional[BERTScorer]` DEFAULT: `None` |
agreement_measure
¶
Uses OpenAI's ChatGPT model. A function that measures similarity to ground truth. A second template is given to ChatGPT with a prompt stating that the original response is correct, and it measures whether ChatGPT's previous response is similar.
Example:
```python
from trulens.core import Feedback
from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI
golden_set = [
    {"query": "who invented the lightbulb?", "expected_response": "Thomas Edison"},
    {"query": "¿quien invento la bombilla?", "expected_response": "Thomas Edison"}
]
ground_truth_collection = GroundTruthAgreement(golden_set, provider=OpenAI())
feedback = Feedback(ground_truth_collection.agreement_measure).on_input_output()
```
The `on_input_output()` selector can be changed. See [Feedback Function Guide](https://www.trulens.org/trulens/feedback_function_guide/)
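For instance, a minimal sketch of selecting the same arguments explicitly with `Select` (assuming the record's main input and output are the prompt and response you want to compare):

```python
from trulens.core import Feedback, Select

# Equivalent to on_input_output(): use the record's main input as `prompt`
# and its main output as `response`.
feedback = (
    Feedback(ground_truth_collection.agreement_measure)
    .on(Select.RecordInput)
    .on(Select.RecordOutput)
)
```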
| PARAMETER | DESCRIPTION |
|---|---|
| `prompt` | A text prompt to an agent. TYPE: `str` |
| `response` | The agent's response to the prompt. TYPE: `str` |

| RETURNS | DESCRIPTION |
|---|---|
| `Union[float, Tuple[float, Dict[str, str]]]` | An agreement score between 0 and 1, optionally accompanied by a dictionary of reasons. |
absolute_error
¶
Method to look up the numeric expected score from a golden set and take the difference.
Primarily used for evaluation of model-generated feedback against human feedback.
Example:
```python
from trulens.core import Feedback, Select
from trulens.feedback import GroundTruthAgreement
from trulens.providers.bedrock import Bedrock

golden_set = [
    {"query": "How many stomachs does a cow have?", "expected_response": "Cows' diet relies primarily on grazing.", "expected_score": 0.4},
    {"query": "Name some top dental floss brands", "expected_response": "I don't know", "expected_score": 0.8}
]

bedrock = Bedrock(
    model_id="amazon.titan-text-express-v1", region_name="us-east-1"
)
ground_truth_collection = GroundTruthAgreement(golden_set, provider=bedrock)

f_groundtruth = (
    Feedback(ground_truth_collection.absolute_error)
    .on(Select.Record.calls[0].args.args[0])
    .on(Select.Record.calls[0].args.args[1])
    .on_output()
)
```
bert_score
¶
Uses BERT score. A function that measures similarity to ground truth using BERT embeddings.
Example:
```python
from trulens.core import Feedback
from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI
golden_set = [
    {"query": "who invented the lightbulb?", "expected_response": "Thomas Edison"},
    {"query": "¿quien invento la bombilla?", "expected_response": "Thomas Edison"}
]
ground_truth_collection = GroundTruthAgreement(golden_set, provider=OpenAI())
feedback = Feedback(ground_truth_collection.bert_score).on_input_output()
```
The `on_input_output()` selector can be changed. See [Feedback Function Guide](https://www.trulens.org/trulens/feedback_function_guide/)
| PARAMETER | DESCRIPTION |
|---|---|
| `prompt` | A text prompt to an agent. TYPE: `str` |
| `response` | The agent's response to the prompt. TYPE: `str` |

| RETURNS | DESCRIPTION |
|---|---|
| `Union[float, Tuple[float, Dict[str, str]]]` | A similarity score, optionally accompanied by a dictionary of details. |
bleu
¶
Uses BLEU score. A function that measures similarity to ground truth using token overlap.
Example:
```python
from trulens.core import Feedback
from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI
golden_set = [
    {"query": "who invented the lightbulb?", "expected_response": "Thomas Edison"},
    {"query": "¿quien invento la bombilla?", "expected_response": "Thomas Edison"}
]
ground_truth_collection = GroundTruthAgreement(golden_set, provider=OpenAI())
feedback = Feedback(ground_truth_collection.bleu).on_input_output()
```
The `on_input_output()` selector can be changed. See [Feedback Function Guide](https://www.trulens.org/trulens/feedback_function_guide/)
| PARAMETER | DESCRIPTION |
|---|---|
| `prompt` | A text prompt to an agent. TYPE: `str` |
| `response` | The agent's response to the prompt. TYPE: `str` |

| RETURNS | DESCRIPTION |
|---|---|
| `Union[float, Tuple[float, Dict[str, str]]]` | A similarity score, optionally accompanied by a dictionary of details. |
rouge
¶
Uses ROUGE score. A function that measures similarity to ground truth using token overlap.
| PARAMETER | DESCRIPTION |
|---|---|
| `prompt` | A text prompt to an agent. TYPE: `str` |
| `response` | The agent's response to the prompt. TYPE: `str` |

| RETURNS | DESCRIPTION |
|---|---|
| `Union[float, Tuple[float, Dict[str, str]]]` | A similarity score, optionally accompanied by a dictionary of details. |
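No example is shown above for `rouge`; a usage sketch mirroring the `bleu` and `bert_score` examples (the golden set and provider choice are illustrative):

```python
from trulens.core import Feedback
from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI

golden_set = [
    {"query": "who invented the lightbulb?", "expected_response": "Thomas Edison"},
    {"query": "¿quien invento la bombilla?", "expected_response": "Thomas Edison"}
]
ground_truth_collection = GroundTruthAgreement(golden_set, provider=OpenAI())

feedback = Feedback(ground_truth_collection.rouge).on_input_output()
```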
GroundTruthAggregator
¶
Bases: WithClassInfo, SerialModel
Aggregate benchmarking metrics for ground-truth-based evaluation on feedback functions.
Attributes¶
tru_class_info
instance-attribute
¶
tru_class_info: Class
Class information of this pydantic object for use in deserialization.
Using this odd key to not pollute attribute names in whatever class we mix this into. Should be the same as CLASS_INFO.
model_config
class-attribute
¶
Functions¶
load
staticmethod
¶
load(obj, *args, **kwargs)
Deserialize/load this object using the class information in tru_class_info to look up the actual class that will do the deserialization.
model_validate
classmethod
¶
model_validate(*args, **kwargs) -> Any
Deserialize a jsonized version of the app into the instance of the class it was serialized from.
Note
This process uses extra information stored in the jsonized object and handled by WithClassInfo.
register_custom_agg_func
¶
register_custom_agg_func(
    name: str,
    func: Callable[[List[float], GroundTruthAggregator], float],
) -> None
Register a custom aggregation function.
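A minimal sketch of registering a custom aggregation, assuming `GroundTruthAggregator` is importable from `trulens.feedback` alongside `GroundTruthAgreement` and that it is constructed with a list of ground-truth labels (both assumptions, not stated on this page):

```python
from typing import List

from trulens.feedback import GroundTruthAggregator  # assumed import path

# Hypothetical ground-truth labels for the items being scored.
aggregator = GroundTruthAggregator(true_labels=[1, 0, 1, 1])


def mean_score(scores: List[float], agg: GroundTruthAggregator) -> float:
    # Custom aggregation: simple average of the feedback scores.
    return sum(scores) / len(scores) if scores else 0.0


aggregator.register_custom_agg_func(name="mean_score", func=mean_score)
```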
ndcg_at_k
¶
precision_at_k
¶
recall_at_k
¶
ir_hit_rate
¶
Calculate the IR hit rate at top k: the proportion of queries for which at least one relevant document is retrieved in the top k results. This metric evaluates whether a relevant document is present among the top k retrieved.

| PARAMETER | DESCRIPTION |
|---|---|
| `scores` | The list of scores generated by the model. TYPE: `list` or `array` |

| RETURNS | DESCRIPTION |
|---|---|
| `float` | The hit rate at top k. Binary 0 or 1. |
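A hypothetical call, with the constructor arguments and score values assumed for illustration:

```python
# Hypothetical model scores for the top-k retrieved documents of a single query.
scores = [0.1, 0.8, 0.3]

aggregator = GroundTruthAggregator(true_labels=[0, 1, 0], k=3)  # assumed constructor
hit = aggregator.ir_hit_rate(scores)  # 1.0 if a relevant document appears in the top k, else 0.0
```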
mrr
¶
auc
¶
kendall_tau
¶
Calculate Kendall's tau. Can be used for meta-evaluation. Kendall's tau is a measure of the correspondence between two rankings. Values close to 1 indicate strong agreement; values close to -1 indicate strong disagreement. This is the tau-b version of Kendall's tau, which accounts for ties.
| PARAMETER | DESCRIPTION |
|---|---|
| `scores` | Scores returned by the feedback function. |

| RETURNS | DESCRIPTION |
|---|---|
| `float` | Kendall's tau. |
spearman_correlation
¶
Calculate the Spearman correlation. Can be used for meta-evaluation. The Spearman correlation coefficient is a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables).
| PARAMETER | DESCRIPTION |
|---|---|
| `scores` | Scores returned by the feedback function. |

| RETURNS | DESCRIPTION |
|---|---|
| `float` | Spearman correlation. |
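A short sketch using both rank-correlation metrics for meta-evaluation (aggregator construction and score values are illustrative assumptions):

```python
# Hypothetical feedback scores to compare against the aggregator's ground-truth labels.
feedback_scores = [0.2, 0.9, 0.4, 0.7]

aggregator = GroundTruthAggregator(true_labels=[0, 1, 0, 1])  # assumed constructor
tau = aggregator.kendall_tau(feedback_scores)  # close to 1 means strong agreement
rho = aggregator.spearman_correlation(feedback_scores)
```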
brier_score
¶
Assess both calibration and sharpness of the probability estimates.

| PARAMETER | DESCRIPTION |
|---|---|
| `scores` | Relevance scores returned by the feedback function. TYPE: `List[float]` |

| RETURNS | DESCRIPTION |
|---|---|
| `float` | Brier score. |
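Conceptually, the Brier score is the mean squared difference between predicted probabilities and binary outcomes; lower is better. A hypothetical call (constructor arguments assumed):

```python
# Hypothetical relevance scores interpreted as probability estimates.
scores = [0.9, 0.2, 0.7, 0.4]

aggregator = GroundTruthAggregator(true_labels=[1, 0, 1, 0])  # assumed constructor
bs = aggregator.brier_score(scores)
# Standard Brier score: mean((score_i - label_i) ** 2).
```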