🦴 Anatomy of Metrics

The Metric class is the starting point for metric specification and evaluation.

Example

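import numpy as np
from trulens.core import Metric, Selector
from trulens.providers.openai import OpenAI

# Assumption: an OpenAI-backed provider; substitute whichever provider you use.
provider = OpenAI()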

# Context relevance between question and each context chunk.
f_context_relevance = Metric(
    implementation=provider.context_relevance_with_cot_reasons,
    name="Context Relevance",
    selectors={
        "question": Selector.select_record_input(),
        "context": Selector.select_context(collect_list=False),
    },
    agg=np.mean,
)

The components of this specification are:

Providers

The provider is the backend on which a given metric is run. Providers expose multiple underlying models, such as GPT-4 or Llama-2. In many cases, though not all, the metric implementation is shared across providers (as with LLM-based evaluations).

Read more about providers.

Metric Implementations

OpenAI.context_relevance is an example of a metric implementation.

Metric implementations are simple callables that can be run on any arguments matching their signatures. In the example, the implementation has the following signature:

Example

def context_relevance(self, question: str, context: str) -> float:

That is, context_relevance is a plain Python method that accepts a question and a context, both strings, and returns a float (assumed to be between 0.0 and 1.0). Its parameter names are the keys used in the selectors mapping of the opening example.
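Because an implementation is just a callable, any function with a compatible signature and a float result can play the same role. The following is a minimal sketch, assuming a hypothetical word-overlap heuristic named keyword_overlap_relevance. It is not part of TruLens and is far cruder than an LLM-backed provider method. Its parameter names question and context are chosen to match the selector keys in the opening example.

```python
import re


def keyword_overlap_relevance(question: str, context: str) -> float:
    """Hypothetical stand-in for a provider method.

    Returns the fraction of distinct question words that also appear in
    the context, which always lies in [0.0, 1.0].
    """
    question_words = set(re.findall(r"\w+", question.lower()))
    context_words = set(re.findall(r"\w+", context.lower()))
    if not question_words:
        return 0.0
    return len(question_words & context_words) / len(question_words)


# Called like any other function: strings in, float out.
score = keyword_overlap_relevance(
    "Where is the Eiffel Tower?",
    "The Eiffel Tower is in Paris.",
)
print(score)
```

The function needs no knowledge of records or selectors. It only sees the argument values it is handed.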

Read more about metric implementations.

Metric Constructor

The expression Metric(implementation=provider.relevance) constructs a Metric object that wraps a metric implementation.

Selectors

The selectors parameter specifies how the metric implementation's arguments are determined from an app record or app definition. Selectors map parameter names to span data using the Selector class.

Common selector methods:

  • Selector.select_record_input() - The main app input
  • Selector.select_record_output() - The main app output
  • Selector.select_context(collect_list=...) - Retrieved contexts, either collected into a single list (True) or kept as individual chunks (False)
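Conceptually, each selector resolves a parameter name to one or more values taken from the record, and those values become the keyword arguments of the implementation. The sketch below illustrates that idea only. It uses a plain dictionary as a hypothetical record and lambdas as hypothetical selectors. None of these names are TruLens APIs, and the real Selector class operates on span data rather than dictionaries. The per_chunk flag is an assumed analogue of collect_list=False.

```python
from typing import Any, Callable, Dict, List

# Hypothetical record; real records are made of spans, not a flat dict.
record = {
    "input": "Where is the Eiffel Tower?",
    "output": "It is in Paris.",
    "contexts": ["The Eiffel Tower is in Paris.", "Bananas are yellow."],
}

# Hypothetical selectors: each maps a record to the value(s) for one parameter.
selectors: Dict[str, Callable[[dict], Any]] = {
    "question": lambda r: r["input"],
    "context": lambda r: r["contexts"],
}


def resolve_calls(rec: dict, per_chunk: bool) -> List[Dict[str, Any]]:
    """Build the keyword arguments for each call to the implementation.

    per_chunk=True mimics collect_list=False: one call per context chunk.
    per_chunk=False mimics collect_list=True: one call with the whole list.
    """
    question = selectors["question"](rec)
    contexts = selectors["context"](rec)
    if per_chunk:
        return [{"question": question, "context": c} for c in contexts]
    return [{"question": question, "context": contexts}]


calls_per_chunk = resolve_calls(record, per_chunk=True)
calls_as_list = resolve_calls(record, per_chunk=False)
print(len(calls_per_chunk), len(calls_as_list))
```

In the per-chunk case the implementation would be evaluated once per chunk, producing several scores that then need to be aggregated.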

Read more about selectors.

Aggregation Specification

The agg=np.mean parameter specifies how metric outputs are aggregated. Aggregation applies only when a selector yields more than one value for an input (for example, when collect_list=False returns multiple context chunks). The function is called on the float results of the individual metric evaluations and produces a single float. The default is numpy.mean.
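The choice of aggregation function changes what the final score means. As a rough illustration, the sketch below reduces a list of made-up per-chunk scores with three different callables. The aggregate helper is hypothetical, and statistics.mean from the standard library stands in for numpy.mean so the snippet has no third-party dependencies.

```python
from statistics import mean
from typing import Callable, Sequence


def aggregate(
    scores: Sequence[float],
    agg: Callable[[Sequence[float]], float] = mean,
) -> float:
    """Reduce per-chunk scores to a single float using the given callable."""
    return float(agg(scores))


# Hypothetical scores, one per retrieved context chunk.
chunk_scores = [1.0, 0.5, 0.0]

mean_score = aggregate(chunk_scores)        # average relevance across chunks
worst_score = aggregate(chunk_scores, min)  # strictest view: the weakest chunk
best_score = aggregate(chunk_scores, max)   # most lenient view: the best chunk
print(mean_score, worst_score, best_score)
```

A mean rewards retrieval that is relevant on average, while min penalizes any irrelevant chunk and max only asks that at least one chunk be relevant.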

Read more about aggregation.