trulens.benchmark.benchmark_frameworks.tru_benchmark_experiment¶

Classes¶

TruBenchmarkExperiment ¶

Example

import os

from snowflake.snowpark import Session

# Cortex, GroundTruthAggregator, and BenchmarkParams come from TruLens, and
# `session` is an existing TruLens session object; both are assumed to be
# imported/created earlier.

snowflake_connection_parameters = {
    "account": os.environ["SNOWFLAKE_ACCOUNT"],
    "user": os.environ["SNOWFLAKE_USER"],
    "password": os.environ["SNOWFLAKE_USER_PASSWORD"],
    "database": os.environ["SNOWFLAKE_DATABASE"],
    "schema": os.environ["SNOWFLAKE_SCHEMA"],
    "warehouse": os.environ["SNOWFLAKE_WAREHOUSE"],
}
snowpark_session = Session.builder.configs(snowflake_connection_parameters).create()
cortex = Cortex(
    snowpark_session=snowpark_session,
    model_engine="snowflake-arctic",
)

def context_relevance_ff_to_score(input, output, temperature=0):
    # Wrap the provider call so it scores one (question, context) pair.
    return cortex.context_relevance(question=input, context=output, temperature=temperature)

# Ground truth labels collected from ground truth data collection.
true_labels = [1, 0, 0, ...]
mae_agg_func = GroundTruthAggregator(true_labels=true_labels).mae

tru_benchmark_arctic = session.BenchmarkExperiment(
    app_name="MAE",
    feedback_fn=context_relevance_ff_to_score,
    agg_funcs=[mae_agg_func],
    benchmark_params=BenchmarkParams(temperature=0.5),
)
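The example above passes an MAE aggregator bound to a fixed list of true labels. As a rough illustration of what such an aggregator computes, here is a minimal, library-free sketch. `make_mae_aggregator` is a hypothetical helper written for this page, not part of TruLens, and it assumes the aggregator receives the generated scores in the same order as the labels.

```python
from typing import Callable, Sequence


def make_mae_aggregator(
    true_labels: Sequence[float],
) -> Callable[[Sequence[float]], float]:
    """Return a function computing mean absolute error against fixed labels.

    Hypothetical stand-in for an aggregator such as
    GroundTruthAggregator(true_labels=...).mae; not the TruLens implementation.
    """
    labels = list(true_labels)

    def mae(scores: Sequence[float]) -> float:
        if len(scores) != len(labels):
            raise ValueError("scores and true labels must have the same length")
        # Mean of |score_i - label_i|; only meaningful if scores are in label order.
        return sum(abs(s - t) for s, t in zip(scores, labels)) / len(labels)

    return mae


mae_agg = make_mae_aggregator([1, 0, 0, 1])
error = mae_agg([0.5, 0.0, 0.5, 1.0])  # absolute errors: 0.5, 0.0, 0.5, 0.0
```

Binding the labels once and returning a one-argument callable mirrors how `agg_funcs` is used in the example: each aggregator only needs the list of generated scores.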

Functions¶

__init__ ¶

__init__(
    feedback_fn: Callable,
    agg_funcs: List[AggCallable],
    benchmark_params: BenchmarkParams,
)

Create a benchmark experiment that pairs a custom feedback function with aggregation functions, so that the feedback function itself can be evaluated against a ground truth dataset.

PARAMETER	DESCRIPTION
`feedback_fn`	function that takes a row of ground truth data and returns a score, typically produced by an LLM-as-judge TYPE: `Callable`
`agg_funcs`	list of aggregation functions that compute metrics on the feedback scores TYPE: `List[AggCallable]`
`benchmark_params`	benchmark configuration parameters TYPE: `BenchmarkParams`

run_score_generation_on_single_row ¶

run_score_generation_on_single_row(
    feedback_fn: Callable, feedback_args: List[Any]
) -> Union[float, Tuple[float, float]]

Generate a score for a single row by calling `feedback_fn` with `feedback_args`.

PARAMETER	DESCRIPTION
`feedback_fn`	The function used to generate feedback scores. TYPE: `Callable`
`feedback_args`	The arguments passed to `feedback_fn`, taken from a single row of the dataset. TYPE: `List[Any]`

RETURNS	DESCRIPTION
`Union[float, Tuple[float, float]]`	Feedback score (with metadata) after running the benchmark on a single entry in ground truth data.
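The signature suggests that scoring one row amounts to calling the feedback function with that row's arguments and returning either a bare score or a score paired with a second value. The sketch below illustrates that contract only. `score_single_row` and `toy_relevance` are hypothetical names, and treating the second tuple element as a numeric metadata value (for example a confidence) is an assumption, not something the reference above specifies.

```python
from typing import Any, Callable, List, Tuple, Union

Score = Union[float, Tuple[float, float]]


def score_single_row(
    feedback_fn: Callable[..., Any], feedback_args: List[Any]
) -> Score:
    """Call feedback_fn on one row's arguments and normalize the result."""
    result = feedback_fn(*feedback_args)
    if isinstance(result, tuple):
        # Assumed shape: (score, numeric metadata); keep both as floats.
        score, meta = result
        return float(score), float(meta)
    return float(result)


def toy_relevance(question: str, context: str) -> float:
    """Stand-in for an LLM judge: fraction of question words found in context."""
    q_words = set(question.lower().split())
    c_words = set(context.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


plain = score_single_row(toy_relevance, ["what is trulens", "trulens is a library"])
with_meta = score_single_row(lambda q, c: (1, 0.9), ["any question", "any context"])
```

Here `plain` is a single float and `with_meta` is a two-element tuple, matching the two branches of the documented return type.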

__call__ ¶

__call__(
    ground_truth: DataFrame,
) -> Union[
    List[float],
    List[Tuple[float]],
    Tuple[List[float], List[float]],
]

Collect the list of generated feedback scores as input to the benchmark aggregation functions. Note that the order of generated scores must be preserved to match the order of the true labels.

PARAMETER	DESCRIPTION
`ground_truth`	ground truth dataset / collection to evaluate the feedback function on TYPE: `DataFrame`

RETURNS	DESCRIPTION
`Union[List[float], List[Tuple[float]], Tuple[List[float], List[float]]]`	feedback scores after running the benchmark on all entries in ground truth data
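The ordering requirement matters because aggregators compare scores and true labels position by position. The following sketch shows the idea with a plain list of dicts standing in for the rows of the ground truth `DataFrame`. The column names `query` and `response`, the helper `collect_scores`, and the toy judge are all hypothetical and chosen only for illustration.

```python
from typing import Any, Callable, Dict, List


def collect_scores(
    rows: List[Dict[str, Any]],
    feedback_fn: Callable[..., float],
    arg_columns: List[str],
) -> List[float]:
    """Score every row in dataset order and return the scores in that order."""
    scores: List[float] = []
    for row in rows:  # iteration order is the dataset order
        args = [row[col] for col in arg_columns]
        scores.append(float(feedback_fn(*args)))
    return scores


def non_empty_judge(query: str, response: str) -> float:
    """Toy judge: 1.0 if the response has any content, else 0.0."""
    return 1.0 if response.strip() else 0.0


rows = [
    {"query": "capital of France", "response": "Paris"},
    {"query": "capital of Spain", "response": ""},
    {"query": "2 + 2", "response": "   "},
]
true_labels = [1, 0, 0]

scores = collect_scores(rows, non_empty_judge, ["query", "response"])
# In-order scores line up with the labels position by position.
mismatches_in_order = sum(1 for s, t in zip(scores, true_labels) if s != t)
# Reversing the scores breaks the alignment even though the values are unchanged.
mismatches_reversed = sum(1 for s, t in zip(reversed(scores), true_labels) if s != t)
```

The same scores produce zero mismatches in dataset order and two when reversed, which is why any parallel or batched scoring would need to restore the original row order before aggregation.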

Functions¶

create_benchmark_experiment_app ¶

create_benchmark_experiment_app(
    app_name: str,
    app_version: str,
    benchmark_experiment: TruBenchmarkExperiment,
    **kwargs
) -> TruApp

Create an app for a special use case: benchmarking feedback functions.

PARAMETER	DESCRIPTION
`app_name`	user-defined name of the experiment run. TYPE: `str`
`app_version`	user-defined version of the experiment run. TYPE: `str`
`benchmark_experiment`	the benchmark experiment to wrap, which carries the feedback function under meta-evaluation, the aggregation functions, and the benchmark parameters. TYPE: `TruBenchmarkExperiment`
`**kwargs`	additional keyword arguments.

RETURNS	DESCRIPTION
`TruApp`	Custom app wrapper for benchmarking feedback functions.