
Celebrating 3,000 Stars: Evolving TruLens for Real-World Agent Workflows

When we started building TruLens, our mission was clear: make it easier to evaluate, trace, and improve LLM-based applications so they can be trusted in production. Reaching 3,000 GitHub stars is a meaningful milestone, and we are genuinely grateful to everyone in the community who has contributed ideas, submitted issues and shared their feedback. Thank you. We would not be here without this community behind us.

This milestone also reflects why TruLens has become even more relevant in the age of AI agents. AI agents are more complex than the previous generation of AI applications, and small errors quickly compound into larger failures. As more teams adopt agents in real workflows, the demand for reliable tracing and evaluation has never been higher.

For AI agents, traces are dramatically more useful because they can uncover when agent execution diverges from expectations. As these traces grow in size and complexity, automated evaluation becomes just as critical. Instead of manually digging through a trace to find mistakes, LLM judges can quickly surface errors so you can get back to improving your agent. In short, tracing and evaluation now play a central role in getting reliable agents into production.

This post highlights how we're evolving TruLens for these real-world agent workflows, including new evaluation methods, support for reasoning models, and improved handling of MCP-based tool calls.

Measuring Goal-Plan-Action Alignment

One challenge we've observed with teams building agents is that annotating agent traces is extremely time-intensive and difficult. But without annotation, getting signals on ways to improve the agent is practically impossible.

To address this challenge, we've introduced a new framework for evaluating the alignment of an agent's goal, plan, and actions, which we've dubbed the Agent's GPA.

We benchmarked this framework on the open-source TRAIL dataset, composed of agent traces for software engineering and data tasks, where it covered 95% of the annotated agent errors. This framework provides an exciting new set of reference-free metrics that review agent traces and identify ways to improve the agent.

[Figure: TruLens Agent GPA]

Enabling reasoning models for agent evals

A major paradigm shift in the industry has been the rise of reasoning models. These powerful models often have different API shapes and output formats. We enabled support for DeepSeek models, as well as OpenAI's GPT-5 and o-series models, so they can be used as LLM judges in TruLens, enabling richer reasoning for evaluation.
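For example, a reasoning model can be plugged in as the judge through the provider interface; a minimal sketch (the specific model name here is illustrative):

from trulens.providers.openai import OpenAI

# Use a reasoning model as the LLM judge; swap in any supported model name
provider = OpenAI(model_engine="o3")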

Because agent traces are massively more complex than prior generations of LLM applications, reasoning models are particularly useful for evaluation. In internal benchmarks of agent evaluation metrics, particularly logical consistency, we saw significant improvements when using reasoning models (such as OpenAI's o-series) compared to one-shot models (such as GPT-4.1).

MCP support

Many of today's agent systems now use the Model Context Protocol (MCP) to connect tools to agents.

To ensure TruLens fits naturally into these emerging workflows, we added a new MCP span type so that tool calls can be properly annotated as MCP-based tools. This new span type allows for finer segmentation of failure modes and faster agent debugging.
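As an illustrative sketch of what this looks like in practice (the span-type constant below is a placeholder, not the published API; check the TruLens semantic conventions for the exact name), an MCP-backed tool call can be instrumented like any other method:

from trulens.core.otel.instrument import instrument
from trulens.otel.semconv.trace import SpanAttributes

# Hypothetical span-type constant; the exact name for the MCP span type
# is an assumption here, see the TruLens docs for specifics
@instrument(span_type=SpanAttributes.SpanType.MCP)
def search_docs(query: str) -> str:
    # Tool logic dispatched over the Model Context Protocol
    ...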

The road ahead

As we look forward, we will double down on improving your ability to debug and improve AI agents with tools for tracing, evaluation, and optimization. If you have ideas or feature requests, please open a GitHub discussion thread.

If you haven't joined the community yet:

โญ Star us on GitHub ๐Ÿง  Try our free course on DeepLearning.ai: Building and Evaluating Data Agents ๐Ÿ“š Get started and check out TruLens docs

Here's to the next 3,000 stars.

Telemetry for the Agentic World: TruLens + OpenTelemetry

Agents are rapidly gaining traction across AI applications. With this growth comes a new set of challenges: how do we trace, observe, and evaluate these dynamic, distributed systems? Today, we're excited to share that TruLens now supports OpenTelemetry (OTel), unlocking powerful, interoperable observability for the agentic world.


Challenges of Tracing Agents

Tracing agentic applications is fundamentally different from tracing traditional software systems:

  • Language-agnostic: Agents can be written in Python, Go, Java, and more, requiring tracing that transcends language boundaries.
  • Distributed by nature: Multi-agent systems often span multiple machines or processes.
  • Existing telemetry stacks: Many developers and enterprises already use OpenTelemetry, so tracing compatibility is essential.
  • Dynamic execution: Unlike traditional apps, agents often make decisions on the fly, with branching workflows that can't be fully known in advance.
  • Interoperability standards: As frameworks like Model Context Protocol (MCP) and Agent2Agent Protocol (A2A) emerge, tracing must support agents working across different systems.
  • Repeated tool usage: Agents may call the same function or tool multiple times in a single execution trace, requiring fine-grained visibility into span grouping to understand what's happening and why.

What is TruLens?

TruLens is an open source library for evaluating and tracing AI agents, including RAG systems and other LLM applications. It combines OpenTelemetry-based tracing with trustworthy evaluations, including both ground truth metrics and reference-free (LLM-as-a-Judge) feedback.

TruLens pioneered the RAG Triad, a structured evaluation of:

  • Context relevance
  • Groundedness
  • Answer relevance

These evaluations provide a foundation for understanding the performance of RAGs and agentic RAGs, supported by benchmarks like LLM-AggreFact, TREC-DL, and HotPotQA.
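As a sketch (assuming an OpenAI judge provider; the selector shortcuts vary slightly between TruLens versions), the triad can be expressed as three feedback functions:

import numpy as np

from trulens.core import Feedback
from trulens.providers.openai import OpenAI

provider = OpenAI()

# Groundedness: is the answer supported by the retrieved context?
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on_context(collect_list=True)
    .on_output()
)

# Answer relevance: does the answer address the user's question?
f_answer_relevance = (
    Feedback(provider.relevance_with_cot_reasons, name="Answer Relevance")
    .on_input()
    .on_output()
)

# Context relevance: is each retrieved chunk relevant to the question?
f_context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()
    .on_context(collect_list=False)
    .aggregate(np.mean)
)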

This combination of trusted evaluators and open standard tracing gives you tools to both improve your application offline and monitor it once it reaches production.


How TruLens Augments OpenTelemetry

As AI applications become increasingly agentic, TruLens' shift to OpenTelemetry enables observability that is:

  • Interoperable with existing telemetry stacks
  • Compatible across languages and frameworks
  • Capable of tracing dynamic agent workflows

TruLens now accepts any span that adheres to the OTel standard.


What is OpenTelemetry?

OpenTelemetry (OTel) is an open-source observability framework for generating, collecting, and exporting telemetry data such as traces, metrics, and logs.

In LLM and agentic contexts, OpenTelemetry enables language-agnostic, interoperable tracing for:

  • Multi-agent systems
  • Distributed environments
  • Tooling interoperability

What is a span? A span represents a single unit of work. In LLM apps, this might be: planning, routing, retrieval, tool usage, or generation.
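With the OpenTelemetry Python SDK, a span is simply a named, timed unit of work with attributes attached; a minimal sketch:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Each unit of work (planning, retrieval, tool call, generation) gets a span
with tracer.start_as_current_span("retrieval") as span:
    span.set_attribute("retrieval.query_text", "What is OpenTelemetry?")
    # ... perform the retrieval here ...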


TruLens Defines Semantic Conventions for the Agentic World

TruLens maps span attributes to common definitions using semantic conventions to ensure:

  • Cross-framework interoperability
  • Shared instrumentation for MCP and A2A
  • Consistent evaluation across implementations

Read more about TruLens Semantic Conventions.
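Concretely, these conventions are exposed as Python constants, so instrumentation written against different frameworks agrees on attribute names; for example (the printed string values are illustrative):

from trulens.otel.semconv.trace import SpanAttributes

# Shared attribute names that any OTel-compatible emitter can use and
# that TruLens evaluations can select on (exact string values may differ)
print(SpanAttributes.RETRIEVAL.QUERY_TEXT)
print(SpanAttributes.RETRIEVAL.RETRIEVED_CONTEXTS)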


Using Semantic Conventions to Compute Evaluation Metrics

TruLens can compute evaluation metrics directly from span instrumentation. For example, instrumenting a retrieval method exposes its query and retrieved contexts, which a context relevance feedback can then select on:

from trulens.core import Feedback
from trulens.core.otel.instrument import instrument
from trulens.otel.semconv.trace import SpanAttributes
import numpy as np

# provider is an LLM judge provider (e.g. trulens.providers.openai.OpenAI),
# and vector_store is assumed to be configured elsewhere.

@instrument(
    span_type=SpanAttributes.SpanType.RETRIEVAL,
    attributes={
        SpanAttributes.RETRIEVAL.QUERY_TEXT: "query",
        SpanAttributes.RETRIEVAL.RETRIEVED_CONTEXTS: "return",
    },
)
def retrieve(self, query: str) -> list:
    # Query the vector store and flatten the nested result lists
    results = vector_store.query(query_texts=query, n_results=4)
    return [doc for sublist in results["documents"] for doc in sublist]

# Score the relevance of each retrieved context against the input query
f_context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()
    .on_context(call_feedback_function_per_entry_in_list=True)
    .aggregate(np.mean)
)

Computing Metrics on Complex Execution Flows

TruLens introduces span groups to handle repeated tool calls within a trace.

from typing import List

from trulens.core.otel.instrument import instrument
from trulens.otel.semconv.trace import SpanAttributes


class App:

    @instrument(attributes={SpanAttributes.SPAN_GROUPS: "idx"})
    def clean_up_question(self, question: str, idx: str) -> str:
        ...

    @instrument(attributes={SpanAttributes.SPAN_GROUPS: "idx"})
    def clean_up_response(self, response: str, idx: str) -> str:
        ...

    @instrument()
    def combine_responses(self, cleaned_responses: List[str]) -> str:
        ...

    @instrument()
    def query(self, complex_question: str) -> str:
        # break_question_down and call_llm are assumed helpers defined elsewhere
        questions = break_question_down(complex_question)
        cleaned_responses = []
        for i, question in enumerate(questions):
            # Spans from the same iteration share the group key str(i)
            cleaned_question = self.clean_up_question(question, str(i))
            response = call_llm(cleaned_question)
            cleaned_response = self.clean_up_response(response, str(i))
            cleaned_responses.append(cleaned_response)
        return self.combine_responses(cleaned_responses)

How to Examine Execution Flows in TruLens

Run:

from trulens.dashboard import run_dashboard

run_dashboard(session)

…and visually inspect execution traces. Span types are shown directly in the dashboard to help identify branching, errors, or performance issues.

[Figure: TruLens trace view in the dashboard]


How to Get Started

Ready to get started?

Today, we are launching a pre-release of TruLens on OTel. Below is a minimal walkthrough of using TruLens with OpenTelemetry. You can also find a curated list of examples of working with TruLens and OTel in this folder, including a new LangGraph quickstart showing how to trace and evaluate a multi-agent graph.

  1. Install TruLens:
pip install trulens-core==1.5.0
  2. OpenTelemetry is enabled by default:
# OpenTelemetry is now enabled by default
# To disable it, set: os.environ["TRULENS_OTEL_TRACING"] = "0"
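The session object used later in this walkthrough is a TruSession, which stores traces and evaluation results; a minimal setup (assuming the environment variable is read when TruLens is imported):

import os

# OTel tracing is on by default in this release; set "0" to opt out
os.environ["TRULENS_OTEL_TRACING"] = "1"

from trulens.core import TruSession

session = TruSession()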
  3. Instrument Methods:
from trulens.core.otel.instrument import instrument
from trulens.otel.semconv.trace import SpanAttributes

@instrument(
    attributes={
        SpanAttributes.RECORD_ROOT.INPUT: "query",
        SpanAttributes.RECORD_ROOT.OUTPUT: "return",
    },
)
def query(self, query: str) -> str:
    context_str = self.retrieve(query=query)
    completion = self.generate_completion(query=query, context_str=context_str)
    return completion
  4. Add Evaluations:
# provider is an LLM judge provider, e.g. OpenAI() from trulens.providers.openai
f_answer_relevance = (
    Feedback(provider.relevance_with_cot_reasons, name="Answer Relevance")
    .on_input()
    .on_output()
)

Using selectors:

from trulens.core.feedback.selector import Selector

f_answer_relevance = (
    Feedback(provider.relevance_with_cot_reasons, name="Answer Relevance")
    .on({
        "prompt": Selector(
            span_type=SpanAttributes.SpanType.RECORD_ROOT,
            span_attribute=SpanAttributes.RECORD_ROOT.INPUT,
        ),
    })
    .on({
        "response": Selector(
            span_type=SpanAttributes.SpanType.RECORD_ROOT,
            span_attribute=SpanAttributes.RECORD_ROOT.OUTPUT,
        ),
    })
)
  5. Register Your App:
from trulens.apps.app import TruApp

rag = RAG(model_name="gpt-4.1-mini")

tru_rag = TruApp(
    rag,
    app_name="OTEL-RAG",
    app_version="4.1-mini",
    # f_groundedness and f_context_relevance are defined as in the RAG Triad sketch earlier
    feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance],
)
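Once registered, the wrapped app can be used as a context manager so invocations are traced and the attached feedbacks are computed (the query text is just an example):

# Record a traced, evaluated invocation of the app
with tru_rag as recording:
    rag.query("What is OpenTelemetry?")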
  6. Run the Dashboard:
from trulens.dashboard import run_dashboard

# session is the TruSession created in step 2
run_dashboard(session)

Concluding Thoughts

By building on top of OpenTelemetry, TruLens delivers a universal tracing and evaluation platform for modern AI systems. Whether your agents are built in Python, composed via MCP, or distributed across systems, TruLens provides a common observability layer for telemetry and evaluation.

Try our new TruLens-OTel quickstarts for custom Python apps, LangGraph, and LlamaIndex.

Let's build the future of trustworthy agentic AI together.

Moving to TruLens v1: Reliable and Modular Logging and Evaluation

It has always been our goal to make it easy to build trustworthy LLM applications. Since we launched last May, the package has grown up before our eyes, morphing from a hacked-together addition to an existing project (trulens-explain) to a thriving, agnostic standard for tracking and evaluating LLM apps. Along the way, we've experienced growing pains and discovered inefficiencies in the way TruLens was built. We've also heard that the reasons people use TruLens today are diverse, and many of its use cases do not require its full footprint.

Today we're announcing an extensive re-architecture of TruLens that aims to give developers a stable, modular platform for logging and evaluation they can rely on.