Selecting Spans for Evaluation
LLM applications come in all shapes and sizes and with a variety of different control flows. As a result, it's a challenge to consistently evaluate parts of an LLM application trace.
Therefore, we use OpenTelemetry spans to refer to parts of an execution flow when defining evaluations.
Selecting Span Attributes for Evaluation
When defining evaluations, we want to evaluate particular span attributes, such as retrieved context or an agent's plan.
This happens in two phases:
- Instrumentation is used to annotate span attributes. This is covered in detail in the instrumentation guide.
- Then, when defining the evaluation, you can refer to those span attributes using the Selector.
Let's walk through an example in which a method named query is instrumented. The span attributes set in the decorator refer to the function's query argument and to its return value.
Setting Span Attributes in Instrumentation
from trulens.core.otel.instrument import instrument
from trulens.otel.semconv.trace import SpanAttributes

@instrument(
    attributes={
        SpanAttributes.RECORD_ROOT.INPUT: "query",
        SpanAttributes.RECORD_ROOT.OUTPUT: "return",
    },
)
def query(self, query: str) -> str:
    context_str = self.retrieve(query=query)
    completion = self.generate_completion(query=query, context_str=context_str)
    return completion
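As a rough mental model of what the attributes mapping does, the sketch below uses a toy decorator that records the named argument and the return value each time the function runs. It is not the TruLens implementation: toy_instrument, the in-memory RECORDED_SPANS list, and the plain "input" and "output" keys are hypothetical names chosen only for illustration.

```python
import functools
import inspect

# Hypothetical in-memory sink standing in for a real span exporter.
RECORDED_SPANS = []


def toy_instrument(attributes):
    """Toy stand-in for an instrumentation decorator (assumption, not TruLens code)."""

    def decorator(func):
        sig = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Resolve positional and keyword arguments to parameter names.
            bound = sig.bind(*args, **kwargs)
            bound.apply_defaults()
            result = func(*args, **kwargs)
            span = {}
            for attr_name, source in attributes.items():
                # "return" refers to the return value; any other string names an argument.
                span[attr_name] = result if source == "return" else bound.arguments[source]
            RECORDED_SPANS.append(span)
            return result

        return wrapper

    return decorator


@toy_instrument(attributes={"input": "query", "output": "return"})
def query(query: str) -> str:
    return f"answer to: {query}"


query("What is a span?")
print(RECORDED_SPANS)
```

The point of the sketch is only that the dictionary values name where each attribute comes from: an argument name, or "return" for the function's result.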
Once this is done, we can map a metric's inputs to these span attributes:
Connection to Instrumentation
The span attributes used in evaluation (RECORD_ROOT.INPUT, RETRIEVAL.RETRIEVED_CONTEXTS, etc.) must first be set during instrumentation. If you're using custom attributes, make sure they are properly instrumented using the techniques described in Instrumenting Custom Attributes and Manipulating Custom Attributes.
Selecting Instrumented Span Attributes for Evaluation
from trulens.core import Metric, Selector
from trulens.otel.semconv.trace import SpanAttributes

f_answer_relevance = Metric(
    implementation=provider.relevance_with_cot_reasons,
    name="Answer Relevance",
    selectors={
        "prompt": Selector(
            span_type=SpanAttributes.SpanType.RECORD_ROOT,
            span_attribute=SpanAttributes.RECORD_ROOT.INPUT,
        ),
        "response": Selector(
            span_type=SpanAttributes.SpanType.RECORD_ROOT,
            span_attribute=SpanAttributes.RECORD_ROOT.OUTPUT,
        ),
    },
)
In the example above, you can see how a dictionary is passed to selectors that maps the metric implementation argument names to span attributes, accessed via a Selector.
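Because the dictionary keys must line up with the implementation's argument names, a misspelled key is an easy mistake to make. The helper below is a hypothetical pre-flight check written with the standard library only; check_selector_keys and toy_relevance are illustrative names and are not part of TruLens.

```python
import inspect


def check_selector_keys(implementation, selectors):
    """Compare selector keys with the implementation's parameter names.

    Returns (missing, unexpected): parameters with no selector, and
    selector keys that match no parameter. Parameters that have default
    values are also reported as missing in this simple sketch.
    """
    params = set(inspect.signature(implementation).parameters)
    keys = set(selectors)
    return sorted(params - keys), sorted(keys - params)


def toy_relevance(prompt: str, response: str) -> float:
    # Placeholder metric; a real implementation would call a provider.
    return 1.0 if prompt and response else 0.0


# The second key is deliberately misspelled.
missing, unexpected = check_selector_keys(
    toy_relevance,
    {"prompt": "selector for prompt", "respnse": "selector for response"},
)
print(missing, unexpected)
```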
Using collect_list
The Selector also accepts a collect_list argument. Setting collect_list to True concatenates the selected span attributes into a single blob for evaluation. When it is set to False, each selected span attribute is evaluated individually.
Using collect_list is particularly advantageous when working with retrieved context. When evaluating context relevance, we evaluate each context individually (by setting collect_list=False).
Using Collect List to Evaluate Individual Contexts
from trulens.core import Metric, Selector
from trulens.otel.semconv.trace import SpanAttributes
import numpy as np

f_context_relevance = Metric(
    implementation=provider.context_relevance_with_cot_reasons,
    name="Context Relevance",
    selectors={
        "question": Selector.select_record_input(),
        "context": Selector(
            span_type=SpanAttributes.SpanType.RETRIEVAL,
            span_attribute=SpanAttributes.RETRIEVAL.RETRIEVED_CONTEXTS,
            collect_list=False,
        ),
    },
    agg=np.mean,
)
Alternatively, when evaluating groundedness we assess if each LLM claim can be attributed to any evidence from the entire set of retrieved contexts (by setting collect_list=True).
Using Collect List to Evaluate All Contexts At Once
from trulens.core import Metric, Selector
from trulens.otel.semconv.trace import SpanAttributes

f_groundedness = Metric(
    implementation=provider.groundedness_measure_with_cot_reasons,
    name="Groundedness",
    selectors={
        "source": Selector(
            span_type=SpanAttributes.SpanType.RETRIEVAL,
            span_attribute=SpanAttributes.RETRIEVAL.RETRIEVED_CONTEXTS,
            collect_list=True,
        ),
        "statement": Selector.select_record_output(),
    },
)
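To make the difference between the two collect_list modes concrete, here is a simplified, library-free sketch. It assumes that with collect_list=True the metric is called once with all selected values together, and that with collect_list=False it is called once per value and the scores are aggregated. toy_evaluate and keyword_hit are hypothetical helpers and do not reflect TruLens internals.

```python
from statistics import mean


def toy_evaluate(metric_fn, question, contexts, collect_list, agg=mean):
    """Toy model of the two collect_list modes (an assumption, not TruLens code)."""
    if collect_list:
        # One call: the metric sees every selected context together.
        return metric_fn(question, contexts)
    # One call per selected context, then aggregate the per-context scores.
    return agg([metric_fn(question, [context]) for context in contexts])


def keyword_hit(question, contexts):
    # Placeholder metric: 1.0 if any supplied context mentions the question keyword.
    return 1.0 if any(question.lower() in c.lower() for c in contexts) else 0.0


contexts = ["Paris is the capital of France.", "Bananas are yellow."]

together = toy_evaluate(keyword_hit, "Paris", contexts, collect_list=True)
individually = toy_evaluate(keyword_hit, "Paris", contexts, collect_list=False)
print(together, individually)
```

Evaluated together, one supporting context is enough for a full score, which mirrors the groundedness case. Evaluated individually, the irrelevant context lowers the aggregate, which mirrors the context relevance case.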
Evaluating retrieved context from other frameworks
The Selector.select_context() shortcut can also be used for LangChain and LlamaIndex apps to refer to the retrieved contexts. Doing so does not require annotating your app with the RETRIEVAL.RETRIEVED_CONTEXTS span attribute, as that is done for you.
Selecting at the Trace Level
In addition to selecting individual spans or span attributes, you can also select and evaluate at the trace level. This is useful when you want to apply metrics to an entire trace or to all spans matching certain criteria within a trace.
Trace-Level Selection with Selector
The Selector class supports a trace_level argument. When trace_level=True, the selector will match all spans in a trace, optionally filtered by function_name, span_name, or span_type. This allows you to evaluate metrics across multiple spans in a single trace.
Each filter field (e.g., function_name) accepts a single value (not a list). Filters across fields are combined with AND logic (i.e., a span must match all specified criteria).
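The AND semantics of the filter fields can be illustrated with a small stand-alone sketch. ToySpan, matches, and the span type strings below are hypothetical stand-ins used only to show the matching rule; they are not TruLens classes or values.

```python
from dataclasses import dataclass


@dataclass
class ToySpan:
    # Hypothetical, minimal span record for illustration only.
    function_name: str
    span_name: str
    span_type: str


def matches(span, function_name=None, span_name=None, span_type=None):
    """Return True if the span satisfies every filter that was specified."""
    wanted = {
        "function_name": function_name,
        "span_name": span_name,
        "span_type": span_type,
    }
    # Unspecified filters (None) are ignored; the rest are combined with AND.
    return all(
        getattr(span, field_name) == value
        for field_name, value in wanted.items()
        if value is not None
    )


spans = [
    ToySpan("query", "app.query", "record_root"),
    ToySpan("retrieve", "app.retrieve", "retrieval"),
    ToySpan("query", "sub.query", "retrieval"),
]

by_function = [s for s in spans if matches(s, function_name="query")]
by_both = [s for s in spans if matches(s, function_name="query", span_type="retrieval")]
print(len(by_function), len(by_both))
```

With no filters specified, every span matches, which corresponds to selecting the whole trace.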
Evaluating All Spans in a Trace
from trulens.core import Metric, Selector

f_trace_level = Metric(
    implementation=provider.some_trace_level_metric,
    name="Trace Level Metric",
    selectors={
        "trace": Selector(trace_level=True),
    },
)
Example: Filtering Spans by Function Name
You can filter spans at the trace level by specifying a function name. This is useful if you want to evaluate only those spans in a trace that correspond to a particular function.
Filtering Spans by Function Name at the Trace Level
from trulens.core import Metric, Selector

# Example metric that counts the number of selected spans
def count_spans(trace):
    # trace is a ProcessedContentNode representing the filtered trace
    def count_nodes(node):
        return 1 + sum(count_nodes(child) for child in getattr(node, 'children', []))

    return count_nodes(trace)

f_filtered_trace = Metric(
    implementation=count_spans,
    name="Count Query Spans",
    selectors={
        "trace": Selector(
            trace_level=True,
            function_name="query",
        ),
    },
)
In this example, the metric count_spans will receive a tree of spans (as a ProcessedContentNode) filtered to only those with function_name="query", and will return the total count of such spans in the trace.
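The recursive count can be checked locally without running an app. The sketch below repeats count_spans and applies it to a hand-built tree; ToyNode is a hypothetical stand-in that only mimics the children attribute the function relies on and is not the real ProcessedContentNode.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyNode:
    # Hypothetical stand-in exposing only a name and a list of children.
    name: str
    children: List["ToyNode"] = field(default_factory=list)


def count_spans(trace):
    def count_nodes(node):
        # Count this node plus everything beneath it.
        return 1 + sum(count_nodes(child) for child in getattr(node, "children", []))

    return count_nodes(trace)


tree = ToyNode(
    "query",
    [
        ToyNode("retrieve"),
        ToyNode("generate", [ToyNode("llm_call")]),
    ],
)

print(count_spans(tree))  # 4: the root, its two children, and one grandchild
```

Note that the root node passed to the function is included in the count.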
When to Use Trace-Level Selection
Use trace-level selection when your metric needs to consider the relationships between multiple spans, or when you want to aggregate information across an entire trace, such as holistic trace quality.