trulens.feedback.groundtruth¶
Classes¶
GroundTruthAgreement
¶
Bases: WithClassInfo, SerialModel
Measures Agreement against a Ground Truth.
Attributes¶
tru_class_info
instance-attribute
¶
tru_class_info: Class
Class information of this pydantic object for use in deserialization.
Using this odd key to not pollute attribute names in whatever class we mix this into. Should be the same as CLASS_INFO.
Functions¶
load
staticmethod
¶
load(obj, *args, **kwargs)
Deserialize/load this object using the class information in tru_class_info to look up the actual class that will do the deserialization.
model_validate
classmethod
¶
model_validate(*args, **kwargs) -> Any
Deserialize a jsonized version of the app into the instance of the class it was serialized from.
Note
This process uses extra information stored in the jsonized object and handled by WithClassInfo.
__init__
¶
__init__(
    ground_truth: Union[
        List[Dict], Callable, DataFrame, FunctionOrMethod
    ],
    provider: Optional[LLMProvider] = None,
    bert_scorer: Optional[BERTScorer] = None,
    **kwargs
)
Measures Agreement against a Ground Truth.
Usage 1:
```python
from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI

golden_set = [
    {"query": "who invented the lightbulb?", "expected_response": "Thomas Edison"},
    {"query": "¿quien invento la bombilla?", "expected_response": "Thomas Edison"}
]

ground_truth_collection = GroundTruthAgreement(golden_set, provider=OpenAI())
```
Usage 2:
```python
from trulens.core import TruSession
from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI

session = TruSession()
# Assuming a dataset "hotpotqa" has been created and persisted in the DB.
ground_truth_dataset = session.get_ground_truths_by_dataset("hotpotqa")

ground_truth_collection = GroundTruthAgreement(ground_truth_dataset, provider=OpenAI())
```
Usage 3:
```python
import os

import snowflake.connector

from trulens.feedback import GroundTruthAgreement
from trulens.providers.cortex import Cortex

# `llm_app` is your application callable; `prompt` is an example input.
ground_truth_imp = llm_app
response = llm_app(prompt)

snowflake_connection_parameters = {
    "account": os.environ["SNOWFLAKE_ACCOUNT"],
    "user": os.environ["SNOWFLAKE_USER"],
    "password": os.environ["SNOWFLAKE_USER_PASSWORD"],
    "database": os.environ["SNOWFLAKE_DATABASE"],
    "schema": os.environ["SNOWFLAKE_SCHEMA"],
    "warehouse": os.environ["SNOWFLAKE_WAREHOUSE"],
}

ground_truth_collection = GroundTruthAgreement(
    ground_truth_imp,
    provider=Cortex(
        snowflake.connector.connect(**snowflake_connection_parameters),
        model_engine="mistral-7b",
    ),
)
```
| PARAMETER | DESCRIPTION |
|---|---|
| `ground_truth` | A list of query/response pairs, a dataframe containing a ground truth dataset, or a callable that returns a ground truth string given a prompt string. TYPE: `Union[List[Dict], Callable, DataFrame, FunctionOrMethod]` |
| `provider` | The provider to use for agreement measures. TYPE: `Optional[LLMProvider]` DEFAULT: `None` |
| `bert_scorer` | Internal usage for DB serialization. TYPE: `Optional[BERTScorer]` DEFAULT: `None` |
agreement_measure
¶
Uses OpenAI's ChatGPT model. A function that measures similarity to ground truth. A second template is given to ChatGPT with a prompt stating that the original response is correct, and it measures whether ChatGPT's previous response is similar.
Example:
```python
from trulens.core import Feedback
from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI
golden_set = [
    {"query": "who invented the lightbulb?", "expected_response": "Thomas Edison"},
    {"query": "¿quien invento la bombilla?", "expected_response": "Thomas Edison"}
]
ground_truth_collection = GroundTruthAgreement(golden_set, provider=OpenAI())
feedback = Feedback(ground_truth_collection.agreement_measure).on_input_output()
```
The `on_input_output()` selector can be changed. See [Feedback Function Guide](https://www.trulens.org/trulens/feedback_function_guide/)
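For instance, a minimal sketch of selecting the same arguments explicitly with `Select` (assuming the record's main input and output are the prompt and response you want to compare):

```python
from trulens.core import Feedback, Select

# Equivalent to on_input_output(): use the record's main input as `prompt`
# and its main output as `response`.
feedback = (
    Feedback(ground_truth_collection.agreement_measure)
    .on(Select.RecordInput)
    .on(Select.RecordOutput)
)
```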
| PARAMETER | DESCRIPTION |
|---|---|
| `prompt` | A text prompt to an agent. TYPE: `str` |
| `response` | The agent's response to the prompt. TYPE: `str` |

| RETURNS | DESCRIPTION |
|---|---|
| `Union[float, Tuple[float, Dict[str, str]]]` | An agreement score between 0 and 1, optionally accompanied by a dictionary of reasons. |
absolute_error
¶
Method to look up the numeric expected score from a golden set and take the difference.
Primarily used for evaluation of model-generated feedback against human feedback.
Example:
```python
from trulens.core import Feedback, Select
from trulens.feedback import GroundTruthAgreement
from trulens.providers.bedrock import Bedrock

golden_set = [
    {"query": "How many stomachs does a cow have?", "expected_response": "Cows' diet relies primarily on grazing.", "expected_score": 0.4},
    {"query": "Name some top dental floss brands", "expected_response": "I don't know", "expected_score": 0.8}
]

bedrock = Bedrock(
    model_id="amazon.titan-text-express-v1", region_name="us-east-1"
)
ground_truth_collection = GroundTruthAgreement(golden_set, provider=bedrock)

f_groundtruth = (
    Feedback(ground_truth_collection.absolute_error)
    .on(Select.Record.calls[0].args.args[0])
    .on(Select.Record.calls[0].args.args[1])
    .on_output()
)
```
bert_score
¶
Uses BERT score. A function that measures similarity to ground truth using BERT embeddings.
Example:
```python
from trulens.core import Feedback
from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI
golden_set = [
    {"query": "who invented the lightbulb?", "expected_response": "Thomas Edison"},
    {"query": "¿quien invento la bombilla?", "expected_response": "Thomas Edison"}
]
ground_truth_collection = GroundTruthAgreement(golden_set, provider=OpenAI())
feedback = Feedback(ground_truth_collection.bert_score).on_input_output()
```
The `on_input_output()` selector can be changed. See [Feedback Function Guide](https://www.trulens.org/trulens/feedback_function_guide/)
| PARAMETER | DESCRIPTION |
|---|---|
| `prompt` | A text prompt to an agent. TYPE: `str` |
| `response` | The agent's response to the prompt. TYPE: `str` |

| RETURNS | DESCRIPTION |
|---|---|
| `Union[float, Tuple[float, Dict[str, str]]]` | A similarity score, optionally accompanied by a dictionary of details. |
bleu
¶
Uses BLEU score. A function that measures similarity to ground truth using token overlap.
Example:
```python
from trulens.core import Feedback
from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI
golden_set = [
    {"query": "who invented the lightbulb?", "expected_response": "Thomas Edison"},
    {"query": "¿quien invento la bombilla?", "expected_response": "Thomas Edison"}
]
ground_truth_collection = GroundTruthAgreement(golden_set, provider=OpenAI())
feedback = Feedback(ground_truth_collection.bleu).on_input_output()
```
The `on_input_output()` selector can be changed. See [Feedback Function Guide](https://www.trulens.org/trulens/feedback_function_guide/)
| PARAMETER | DESCRIPTION |
|---|---|
| `prompt` | A text prompt to an agent. TYPE: `str` |
| `response` | The agent's response to the prompt. TYPE: `str` |

| RETURNS | DESCRIPTION |
|---|---|
| `Union[float, Tuple[float, Dict[str, str]]]` | A similarity score, optionally accompanied by a dictionary of details. |
rouge
¶
Uses ROUGE score. A function that measures similarity to ground truth using token overlap.
| PARAMETER | DESCRIPTION |
|---|---|
| `prompt` | A text prompt to an agent. TYPE: `str` |
| `response` | The agent's response to the prompt. TYPE: `str` |

| RETURNS | DESCRIPTION |
|---|---|
| `Union[float, Tuple[float, Dict[str, str]]]` | A similarity score, optionally accompanied by a dictionary of details. |
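No example is shown above for `rouge`; a usage sketch mirroring the `bleu` and `bert_score` examples (the golden set and provider choice are illustrative):

```python
from trulens.core import Feedback
from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI

golden_set = [
    {"query": "who invented the lightbulb?", "expected_response": "Thomas Edison"},
    {"query": "¿quien invento la bombilla?", "expected_response": "Thomas Edison"}
]
ground_truth_collection = GroundTruthAgreement(golden_set, provider=OpenAI())

feedback = Feedback(ground_truth_collection.rouge).on_input_output()
```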
GroundTruthAggregator
¶
Bases: WithClassInfo, SerialModel
Aggregate benchmarking metrics for ground-truth-based evaluation on feedback functions.
Attributes¶
tru_class_info
instance-attribute
¶
tru_class_info: Class
Class information of this pydantic object for use in deserialization.
Using this odd key to not pollute attribute names in whatever class we mix this into. Should be the same as CLASS_INFO.
model_config
class-attribute
¶
Functions¶
load
staticmethod
¶
load(obj, *args, **kwargs)
Deserialize/load this object using the class information in tru_class_info to look up the actual class that will do the deserialization.
model_validate
classmethod
¶
model_validate(*args, **kwargs) -> Any
Deserialize a jsonized version of the app into the instance of the class it was serialized from.
Note
This process uses extra information stored in the jsonized object and handled by WithClassInfo.
register_custom_agg_func
¶
register_custom_agg_func(
    name: str,
    func: Callable[[List[float], GroundTruthAggregator], float],
) -> None
Register a custom aggregation function.
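A minimal sketch of registering a custom aggregation, assuming `GroundTruthAggregator` is importable from `trulens.feedback` alongside `GroundTruthAgreement` and that it is constructed with a list of ground-truth labels (both assumptions, not stated on this page):

```python
from typing import List

from trulens.feedback import GroundTruthAggregator  # assumed import path

# Hypothetical ground-truth labels for the items being scored.
aggregator = GroundTruthAggregator(true_labels=[1, 0, 1, 1])


def mean_score(scores: List[float], agg: GroundTruthAggregator) -> float:
    # Custom aggregation: simple average of the feedback scores.
    return sum(scores) / len(scores) if scores else 0.0


aggregator.register_custom_agg_func(name="mean_score", func=mean_score)
```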
ndcg_at_k
¶
precision_at_k
¶
recall_at_k
¶
ir_hit_rate
¶
Calculate the IR hit rate at top k: the proportion of queries for which at least one relevant document is retrieved in the top k results. This metric evaluates whether a relevant document is present among the top k retrieved.

| PARAMETER | DESCRIPTION |
|---|---|
| `scores` | The list of scores generated by the model. TYPE: `list` or `array` |

| RETURNS | DESCRIPTION |
|---|---|
| `float` | The hit rate at top k. Binary 0 or 1. |
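A hypothetical call, with the constructor arguments and score values assumed for illustration:

```python
# Hypothetical model scores for the top-k retrieved documents of a single query.
scores = [0.1, 0.8, 0.3]

aggregator = GroundTruthAggregator(true_labels=[0, 1, 0], k=3)  # assumed constructor
hit = aggregator.ir_hit_rate(scores)  # 1.0 if a relevant document appears in the top k, else 0.0
```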
mrr
¶
auc
¶
kendall_tau
¶
Calculate Kendall's tau. Can be used for meta-evaluation. Kendall's tau is a measure of the correspondence between two rankings. Values close to 1 indicate strong agreement; values close to -1 indicate strong disagreement. This is the tau-b version of Kendall's tau, which accounts for ties.
| PARAMETER | DESCRIPTION |
|---|---|
| `scores` | Scores returned by the feedback function. |

| RETURNS | DESCRIPTION |
|---|---|
| `float` | Kendall's tau. |
spearman_correlation
¶
Calculate the Spearman correlation. Can be used for meta-evaluation. The Spearman correlation coefficient is a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables).
| PARAMETER | DESCRIPTION |
|---|---|
| `scores` | Scores returned by the feedback function. |

| RETURNS | DESCRIPTION |
|---|---|
| `float` | Spearman correlation. |
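A short sketch using both rank-correlation metrics for meta-evaluation (aggregator construction and score values are illustrative assumptions):

```python
# Hypothetical feedback scores to compare against the aggregator's ground-truth labels.
feedback_scores = [0.2, 0.9, 0.4, 0.7]

aggregator = GroundTruthAggregator(true_labels=[0, 1, 0, 1])  # assumed constructor
tau = aggregator.kendall_tau(feedback_scores)  # close to 1 means strong agreement
rho = aggregator.spearman_correlation(feedback_scores)
```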
brier_score
¶
Assess both calibration and sharpness of the probability estimates.

| PARAMETER | DESCRIPTION |
|---|---|
| `scores` | Relevance scores returned by the feedback function. TYPE: `List[float]` |

| RETURNS | DESCRIPTION |
|---|---|
| `float` | Brier score. |
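Conceptually, the Brier score is the mean squared difference between predicted probabilities and binary outcomes; lower is better. A hypothetical call (constructor arguments assumed):

```python
# Hypothetical relevance scores interpreted as probability estimates.
scores = [0.9, 0.2, 0.7, 0.4]

aggregator = GroundTruthAggregator(true_labels=[1, 0, 1, 0])  # assumed constructor
bs = aggregator.brier_score(scores)
# Standard Brier score: mean((score_i - label_i) ** 2).
```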