# Batch Evaluation with Runs
Runs provide a structured way to evaluate your LLM application over a dataset. Instead of evaluating one query at a time, you define a run that invokes your app on every row of a dataset, ingests the resulting traces, and then computes metrics across the entire batch.
## Overview

The batch evaluation workflow has four steps:

1. Wrap your app with `TruApp` and connect to a session.
2. Create a `RunConfig` that points to a dataset and maps its columns to span attributes.
3. Call `run.start()`, which invokes your app for every row, exports OTEL spans, and kicks off server-side ingestion.
4. Call `run.compute_metrics()`, which computes metrics (server-side or client-side) over the ingested traces.
```mermaid
graph LR
    A[Define RunConfig] --> B[add_run]
    B --> C[run.start]
    C --> D[Poll get_status]
    D --> E[run.compute_metrics]
    E --> F[Poll get_status]
    F --> G[COMPLETED]
```
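In code, the whole loop condenses to a few calls. The sketch below is a skeleton rather than a runnable script: it assumes `tru_app` is an instrumented `TruApp` already connected to a session, and the polling placeholders stand in for the loops shown in steps 5 and 7 of the example below.

```python
from trulens.core.run import RunConfig

# Define what to evaluate and how dataset columns map to span attributes.
run = tru_app.add_run(run_config=RunConfig(
    run_name="my_run",
    dataset_name="eval_questions",
    source_type="TABLE",
    dataset_spec={"input": "INPUT"},
))

run.start()  # invoke the app on every row and start trace ingestion
# ... poll run.get_status() until ingestion completes ...

run.compute_metrics(["groundedness"])  # compute metrics over the ingested traces
# ... poll run.get_status() until the run is COMPLETED ...
```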
## Key Concepts

### RunConfig

A `RunConfig` describes what data to evaluate and how to interpret it.
| Field | Description |
|---|---|
| `run_name` | Unique name for this run. |
| `dataset_name` | Name of the source table (or a user-specified name for a dataframe). |
| `source_type` | `"TABLE"` for a database table, `"DATAFRAME"` for a user-provided dataframe. |
| `dataset_spec` | Maps reserved span attribute fields to column names in your dataset. |
| `description` | Optional description. |
| `label` | Optional text label to group related runs. |
| `mode` | `"APP_INVOCATION"` (default) invokes your app; `"LOG_INGESTION"` creates spans from existing data without invoking the app. |
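Putting the fields together, a fully specified `RunConfig` looks like the following sketch (all values are illustrative):

```python
from trulens.core.run import RunConfig

run_config = RunConfig(
    run_name="eval_run_1",              # unique name for this run
    dataset_name="eval_questions",      # table name, or a label for a dataframe
    source_type="TABLE",                # or "DATAFRAME"
    dataset_spec={"input": "QUESTION"}, # span attribute -> dataset column
    description="Nightly eval of v1",   # optional
    label="nightly",                    # optional; groups related runs
    mode="APP_INVOCATION",              # default; "LOG_INGESTION" skips invocation
)
```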
### dataset_spec

The `dataset_spec` dictionary maps span attribute paths to column names in your dataset. Common mappings:

```python
dataset_spec = {
    "input": "QUESTION_COLUMN",             # maps to the record root input
    "ground_truth_output": "EXPECTED_COL",  # optional ground truth
}
```
### RunStatus

After starting a run, poll `run.get_status()` to track progress:
| Status | Meaning |
|---|---|
| `CREATED` | Run exists but hasn't been started. |
| `INVOCATION_IN_PROGRESS` | App is being invoked on dataset rows. |
| `INVOCATION_COMPLETED` | All invocations finished; traces ingested. |
| `COMPUTATION_IN_PROGRESS` | Metrics are being computed. |
| `COMPLETED` | All metrics computed successfully. |
| `PARTIALLY_COMPLETED` | Some metrics computed; others failed. |
| `FAILED` | The run encountered an error. |
| `CANCELLED` | The run was cancelled. |
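Since both polling steps below wait for a subset of these statuses, a small helper keeps the loops in one place. This is a sketch built only on the `run.get_status()` API above; the terminal-status tuples are taken from the polling examples later on this page.

```python
import time

from trulens.core.run import RunStatus


def wait_for_status(run, terminal_statuses, poll_interval=10):
    """Poll run.get_status() until it reaches one of terminal_statuses."""
    while True:
        status = run.get_status()
        if status in terminal_statuses:
            return status
        time.sleep(poll_interval)


# After run.start():
# wait_for_status(run, (RunStatus.INVOCATION_COMPLETED, RunStatus.COMPLETED, RunStatus.FAILED))
# After run.compute_metrics():
# wait_for_status(run, (RunStatus.COMPLETED, RunStatus.PARTIALLY_COMPLETED, RunStatus.FAILED))
```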
## Example: Batch Evaluation with a Snowflake Connector

### 1. Define your instrumented app
```python
from trulens.core.otel.instrument import instrument
from trulens.otel.semconv.trace import SpanAttributes


class MyApp:
    @instrument(
        span_type=SpanAttributes.SpanType.RECORD_ROOT,
        attributes={
            SpanAttributes.RECORD_ROOT.INPUT: "query",
            SpanAttributes.RECORD_ROOT.OUTPUT: "return",
        },
    )
    def respond(self, query: str) -> str:
        # Your app logic here
        return f"Response to: {query}"
```
### 2. Connect and create the dataset

```python
from snowflake.snowpark import Session
from trulens.connectors.snowflake import SnowflakeConnector
from trulens.core.run import RunConfig, RunStatus
from trulens.core.session import TruSession

snowpark_session = Session.builder.configs({...}).create()
connector = SnowflakeConnector(snowpark_session=snowpark_session)
session = TruSession(connector)

# Create a dataset table
snowpark_session.sql("""
    CREATE OR REPLACE TABLE eval_questions (input VARCHAR)
""").collect()
snowpark_session.sql("""
    INSERT INTO eval_questions (input) VALUES
    ('What is machine learning?'),
    ('Explain gradient descent'),
    ('What is overfitting?')
""").collect()
```
### 3. Wrap the app and add a run

```python
from trulens.apps.app import TruApp

app = MyApp()
tru_app = TruApp(
    app,
    app_name="my_app",
    app_version="v1",
    connector=connector,
    main_method=app.respond,
)

run = tru_app.add_run(
    run_config=RunConfig(
        run_name="eval_run_1",
        dataset_name="eval_questions",
        source_type="TABLE",
        dataset_spec={"input": "INPUT"},
        description="Batch evaluation of v1",
    )
)
```
### 4. Start the run

```python
run.start()
```

This invokes `app.respond()` for each row in the dataset, exports the OTEL spans, and starts server-side ingestion.

> **Ingestion is asynchronous.** `run.start()` returns before ingestion completes. You must poll `run.get_status()` before calling `compute_metrics()`.
### 5. Poll for ingestion completion

```python
import time

while True:
    status = run.get_status()
    if status in (RunStatus.INVOCATION_COMPLETED,
                  RunStatus.COMPLETED,
                  RunStatus.FAILED):
        break
    time.sleep(10)
```
### 6. Compute metrics

Once ingestion is complete, compute metrics over the traces:

```python
run.compute_metrics(["groundedness"])
```

Metrics can be:

- **Server-side**: pass a string name (e.g. `"groundedness"`). These run entirely on the server.
- **Client-side**: pass a `Metric` object with an implementation and selectors.
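The call shape is the same for both kinds; only the list contents differ. The sketch below assumes `my_relevance_metric` is a client-side `Metric` object you have already constructed (as in the local DataFrame example later on). Whether a single call can mix string names and `Metric` objects may depend on your TruLens version, so issue separate calls if in doubt.

```python
# Client-side metrics: Metric objects computed locally over ingested traces.
# my_relevance_metric is assumed to be defined elsewhere (see the local
# DataFrame example below for where such objects are used).
run.compute_metrics([my_relevance_metric])
```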
### 7. Poll for metric completion

```python
while True:
    status = run.get_status()
    if status in (RunStatus.COMPLETED,
                  RunStatus.PARTIALLY_COMPLETED,
                  RunStatus.FAILED):
        break
    time.sleep(10)
```
## Example: Local Batch Evaluation with a DataFrame

You don't need a Snowflake connection to use Runs. With a local `TruSession()`, TruLens auto-wires a SQLite-backed `DefaultRunDao`, so the same Run API works out of the box.
### 1. Wrap your app and define the dataset

```python
import pandas as pd
from trulens.apps.app import TruApp
from trulens.core import TruSession
from trulens.core.run import RunConfig

session = TruSession()
app = MyApp()
tru_app = TruApp(
    app,
    app_name="my_app",
    app_version="v1",
    main_method=app.respond,
    feedbacks=[...],  # Metric objects
)

test_dataset = pd.DataFrame({"input": [
    "What is machine learning?",
    "Explain gradient descent",
    "What is overfitting?",
]})
```
### 2. Create and start a run

```python
run = tru_app.add_run(
    run_config=RunConfig(
        run_name="local_eval",
        dataset_name="test_questions",
        source_type="DATAFRAME",
        dataset_spec={"input": "input"},
    )
)

run.start(input_df=test_dataset)
```
### 3. Compute metrics and inspect results

```python
run.compute_metrics([my_groundedness_metric, my_relevance_metric])
run.get_records()
```

For a complete walkthrough, see the Batch Evaluation quickstart notebook.
## Comparing App Versions

Runs are scoped to an app version. To compare versions, create separate `TruApp` instances with different `app_version` values and run the same dataset through each:

```python
tru_v1 = TruApp(app_v1, app_name="my_app", app_version="v1",
                connector=connector, main_method=app_v1.respond)
tru_v2 = TruApp(app_v2, app_name="my_app", app_version="v2",
                connector=connector, main_method=app_v2.respond)

run_v1 = tru_v1.add_run(run_config=RunConfig(
    run_name="v1_eval", dataset_name="eval_questions",
    source_type="TABLE", dataset_spec={"input": "INPUT"},
))
run_v1.start()

run_v2 = tru_v2.add_run(run_config=RunConfig(
    run_name="v2_eval", dataset_name="eval_questions",
    source_type="TABLE", dataset_spec={"input": "INPUT"},
))
run_v2.start()
```
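Once both runs have finished ingesting (poll `get_status()` as in step 5 above), compute the same metrics on each so the comparison is apples-to-apples:

```python
# Same metric on both runs makes the versions directly comparable.
run_v1.compute_metrics(["groundedness"])
run_v2.compute_metrics(["groundedness"])
```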
## Log Ingestion Mode

If you already have app outputs and don't want to re-invoke your app, use `LOG_INGESTION` mode. This creates OTEL spans directly from your data without calling the app:

```python
run_config = RunConfig(
    run_name="log_only_run",
    dataset_name="existing_results",
    source_type="TABLE",
    dataset_spec={
        "input": "QUESTION",
        "RECORD_ROOT.OUTPUT": "ANSWER",
        "RETRIEVAL.RETRIEVED_CONTEXTS": "CONTEXTS",
    },
    mode="LOG_INGESTION",
)

run = tru_app.add_run(run_config=run_config)
run.start()  # Creates spans from the data; no app invocation
```
## Run Management

### List runs

```python
runs = tru_app.list_runs()
for r in runs:
    print(f"{r.run_name}: {r.get_status()}")
```

### Cancel a run

```python
run.cancel()
```

### Delete a run

```python
run.delete()
```

### Describe a run

```python
run.describe()  # Returns run metadata as a dict
```