TruLens 2.8: Parallel Batch Evals, Schema Validation, and a Faster Dashboard¶
TruLens 2.8: the Run API now works on any backend with parallel execution (up to 5.4x speedup), there's a new SchemaValidator for programmatic output checks, and the dashboard leaderboard is up to 5.2x faster via SQL aggregation. TruSession initialization drops from 1.6s to 0.4s and is now silent by default.
Parallel Batch Evaluation, Now on Any Backend¶
The Run API was Snowflake-only. Now run.start(), run.compute_metrics(), and run.get_records() work with any connector (SQLite, PostgreSQL, Snowflake) and run in parallel.
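With an OSS connector nothing extra is required: a default TruSession gives you local SQLite. Here is a minimal setup sketch for the `tru_app` used below; the app class is a placeholder, and the `TruApp` import path and `main_method` parameter are assumptions, so check the Run API docs for the exact signature.

```python
from trulens.core import TruSession
from trulens.apps.app import TruApp  # assumed import path for the Run API wrapper


class MyRAG:
    """Placeholder app; substitute your own application class."""

    def answer_query(self, question: str) -> str:
        return f"answer to: {question}"


session = TruSession()  # no connector argument -> local SQLite

rag = MyRAG()
tru_app = TruApp(
    rag,
    app_name="rag",
    app_version="v1",
    main_method=rag.answer_query,  # method invoked once per dataset row
)
```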
New RunConfig Parameters¶
Two knobs for concurrency control:
- `invocation_max_workers`: threads for `run.start()` (default: `min(len(input_df), 4)`)
- `metric_max_workers`: threads for `run.compute_metrics()` (default: `len(metrics)`)
```python
from trulens.core.run import RunConfig

config = RunConfig(
    run_name="batch_eval_v1",
    dataset_name="eval_questions",
    source_type="TABLE",
    dataset_spec={"input": "QUESTION"},
    invocation_max_workers=8,  # parallel app calls
    metric_max_workers=4,      # parallel metric computation
)

run = tru_app.add_run(run_config=config)
run.start()  # invokes the app on all rows in parallel
run.compute_metrics(["groundedness", my_custom_metric])
```
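Once the run completes, results come back as a DataFrame. A short sketch; the exact columns depend on your `dataset_spec` and the metrics you computed:

```python
# Fetch invocation results and metric scores for this run as a
# pandas DataFrame (column layout varies with dataset_spec/metrics).
results = run.get_records()
print(results.head())
```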
Benchmark: OSS (SQLite, Claude Sonnet, 8 questions, 4 metrics)¶
| Step | Sequential (workers=1) | Parallel (workers=4) | Speedup |
|---|---|---|---|
| `run.start()` | 15.79s | 5.36s | 2.95x |
| `run.compute_metrics()` | 384.89s | 145.95s | 2.64x |
Benchmark: Snowflake (client-side metrics, 4 LLM-as-judge evals)¶
| Step | Sequential | Parallel (workers=4) | Speedup |
|---|---|---|---|
| `run.compute_metrics()` | 417.85s | 77.83s | 5.37x |
This fixes nested recording errors (#2325), the "how do I wait for feedbacks" problem (#2335), and gives explicit rate-limit control (#687).
Docs: Batch Evaluation
SchemaValidator: Programmatic Output Validation¶
The new SchemaValidator validates output against a Pydantic model or JSON schema dict. It works like any other metric in the Metric API.
Pydantic model¶
```python
import pydantic

from trulens.feedback.schema_validator import SchemaValidator
from trulens.core.metric.metric import Metric


class ToolCall(pydantic.BaseModel):
    tool_name: str
    arguments: dict
    reasoning: str


validator = SchemaValidator(schema=ToolCall)
f_schema = Metric(validator.validate_json).on_output()
```
JSON schema dict¶
```python
from trulens.feedback.schema_validator import SchemaValidator
from trulens.core.metric.metric import Metric

schema = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["answer", "confidence"],
}

validator = SchemaValidator(schema=schema)
f_schema = Metric(validator.validate_json).on_output()
```
validate_json returns 1.0 (valid) or 0.0 (invalid), with an "explanation" entry in the metric metadata. There is also a validate_json_partial method for streaming/partial output.
Pydantic validation needs no extra deps. JSON schema dict mode requires the optional jsonschema package.
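For streaming, validate_json_partial scores output that is still being generated. A hedged sketch: the method itself is part of the release, but the call shape here is an assumption that it mirrors validate_json:

```python
# Assumption: validate_json_partial takes the same text argument as
# validate_json but tolerates truncated JSON from an in-flight stream.
chunk = '{"answer": "Paris", "confidence": 0.9'  # stream not finished yet
score = validator.validate_json_partial(chunk)
```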
Dashboard Performance: Up to 5.2x Faster¶
The leaderboard used to fetch every record with full JSON payloads, deserialize in Python, then groupby in pandas. At 10k+ records that meant 20-30s loads.
We moved the aggregation into SQL (SQLAlchemy, works across SQLite/Postgres). The leaderboard now fetches only grouped results instead of raw records.
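Conceptually, the change swaps a fetch-everything-then-pandas-groupby for a grouped query executed by the database. A simplified SQLAlchemy sketch of the shape of that query, using an illustrative schema rather than the real TruLens tables:

```python
import sqlalchemy as sa

metadata = sa.MetaData()

# Illustrative schema; the actual TruLens record tables differ.
records = sa.Table(
    "records",
    metadata,
    sa.Column("record_id", sa.String, primary_key=True),
    sa.Column("app_id", sa.String),
    sa.Column("latency", sa.Float),
)

# Aggregate per app inside the database instead of pulling raw rows
# into pandas; SQLAlchemy renders this for both SQLite and Postgres.
stmt = sa.select(
    records.c.app_id,
    sa.func.count().label("n_records"),
    sa.func.avg(records.c.latency).label("avg_latency"),
).group_by(records.c.app_id)
```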
Benchmark (10k records, SQLite, 5 runs each)¶
| Scenario | Old (Python agg) | New (SQL agg) | Speedup |
|---|---|---|---|
| All apps, 15 versions | 1.330s | 0.255s | 5.2x |
| Single app, 3 versions | 0.309s | 0.114s | 2.7x |
| Single app, limit=1000 | 0.194s | 0.118s | 1.6x |
| Single version | 0.163s | 0.090s | 1.8x |
Also:
- New indexes on `start_timestamp` and `timestamp`
- Histogram tab lazy-loads raw records only when selected
- Sort order fixed (newest first)
- EVAL_ROOT metric name parsing fix for Compare page
Fast and Quiet TruSession Startup¶
TruSession() took ~1.6s to init (eager provider imports in _track_costs()) and printed 6 lines to stdout. Both fixed.
Background cost tracking¶
_track_costs() now runs in a daemon thread. On the first span, on_start joins the thread, but it has usually finished by then.
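The pattern, sketched below in simplified form (not the actual TruLens internals): kick off the heavy imports on a daemon thread at init, and join only when the result is first needed.

```python
import threading


class LazyCostTracking:
    """Defer heavy provider imports to a background daemon thread."""

    def __init__(self):
        self._thread = threading.Thread(target=self._load, daemon=True)
        self._thread.start()  # returns immediately; init stays fast

    def _load(self):
        # Stand-in for the eager provider imports that used to run
        # synchronously at session init.
        import importlib

        self._providers = importlib.import_module("json")

    def on_start(self):
        # First span: block until the background load completes. In
        # practice the thread has usually finished long before this.
        self._thread.join()
```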
| Metric | Before | After | Improvement |
|---|---|---|---|
| Init time | 1.63s | 0.44s | 73% faster |
| Total (import + init) | 2.78s | 1.62s | 42% faster |
First-span latency vs. time since init:
| Delay after init | Extra latency |
|---|---|
| 0ms | 1.14s |
| 500ms | 0.66s |
| 1.0s | 0.22s |
| 1.5s+ | 0s |
Silent by default¶
All prints were converted to logger.info/logger.debug or suppressed. Set the logging level to INFO to see details such as the database URL:
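```python
import logging

# Opt back in to informational startup output (e.g. the database URL).
logging.basicConfig(level=logging.INFO)

from trulens.core import TruSession

session = TruSession()
```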
New Example: OpenAI Agent SDK + Snowflake Tools¶
End-to-end example showing TruLens + OpenAI Agent SDK + Snowflake Cortex:
- OpenAI Agent SDK with Cortex Analyst + Cortex Search tools
- Batch eval runner using the Run API
- FastAPI monitoring backend
- React chat UI
- Streamlit observability dashboard
Code: openai_agent_sdk_snowflake_tools
Bug Fixes¶
- Arrow-backed DataFrame fix: `run.compute_metrics` converts Arrow columns to object dtype (#2387)
- Duplicate Alembic revision: fixed conflicting revision 11 and relaxed the `importlib-resources` bound (#2429)
- CI pipeline timeouts: added pip dependency caching (#2395)
- Testset generation docs: fixed imports and install instructions for `trulens.benchmark.generate` (#2384)
Get Started¶
```bash
pip install trulens --upgrade
```
Links¶
Questions or feedback? Open an issue or discussion.