Tracing and Evaluating Multi-Agent Systems with TruLens and LlamaIndex AgentWorkflow
In this notebook, we demonstrate how to use TruLens to trace, monitor, and evaluate multi-agent systems built with LlamaIndex's `AgentWorkflow`.
What You'll Learn
- How to instrument LlamaIndex `AgentWorkflow` with TruLens for comprehensive tracing
- How to capture agent-level spans and tool calls in the TruLens dashboard
- How to evaluate multi-agent system performance using TruLens feedback functions
- How to monitor execution efficiency and logical consistency across agent handoffs
The Multi-Agent System
We'll build a report generation system with three specialized agents:
- ResearchAgent: Searches the web and records research notes
- WriteAgent: Creates markdown reports based on research
- ReviewAgent: Reviews and provides feedback on reports
The key focus is on observability and evaluation - understanding how agents interact, where bottlenecks occur, and how to measure system performance.
Setup
This example requires several key components:
- TruLens: For tracing, monitoring, and evaluating the multi-agent system
- LlamaIndex: For the `AgentWorkflow` and agent implementations
- OpenAI: As the LLM provider for all agents
- Tavily: For web search capabilities
We'll use OpenAI's GPT-4o as our LLM across all agents for consistency. TruLens will capture every interaction, tool call, and agent handoff, providing complete visibility into the system's behavior.
%pip install llama-index trulens-apps-llamaindex trulens-providers-openai tavily-python -q
import os
os.environ["OPENAI_API_KEY"] = "sk-proj-..."
os.environ["TAVILY_API_KEY"] = "tvly-dev-..."
from llama_index.llms.openai import OpenAI
llm = OpenAI(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY"))
System Design & Tracing Architecture
Our multi-agent system consists of three specialized agents that work together in a coordinated workflow:
Agent Roles
- ResearchAgent: Searches the web and records research notes
- WriteAgent: Creates markdown reports based on research findings
- ReviewAgent: Reviews reports and provides feedback for improvements
Tools & Observability
Each agent uses specific tools that TruLens will trace:
- `search_web`: Web search queries and results
- `record_notes`: Note-taking and knowledge storage
- `write_report`: Report generation process
- `review_report`: Review and feedback generation
What TruLens Captures
With TruLens instrumentation, we'll observe:
- Agent-level spans: Each agent's execution time and context
- Tool call traces: Individual tool invocations and their results
- Agent handoffs: When and why control passes between agents
- State transitions: How the shared context evolves
- Performance metrics: Execution efficiency and logical consistency
The `Context` class enables state sharing between agents, and TruLens will track how this state evolves throughout the workflow execution.
from tavily import AsyncTavilyClient
from llama_index.core.workflow import Context
async def search_web(query: str) -> str:
    """Useful for using the web to answer questions."""
    client = AsyncTavilyClient(api_key=os.getenv("TAVILY_API_KEY"))
    return str(await client.search(query))

async def record_notes(ctx: Context, notes: str, notes_title: str) -> str:
    """Useful for recording notes on a given topic. Your input should be notes with a title to save the notes under."""
    async with ctx.store.edit_state() as ctx_state:
        if "research_notes" not in ctx_state["state"]:
            ctx_state["state"]["research_notes"] = {}
        ctx_state["state"]["research_notes"][notes_title] = notes
    return "Notes recorded."

async def write_report(ctx: Context, report_content: str) -> str:
    """Useful for writing a report on a given topic. Your input should be a markdown formatted report."""
    async with ctx.store.edit_state() as ctx_state:
        ctx_state["state"]["report_content"] = report_content
    return "Report written."

async def review_report(ctx: Context, review: str) -> str:
    """Useful for reviewing a report and providing feedback. Your input should be a review of the report."""
    async with ctx.store.edit_state() as ctx_state:
        ctx_state["state"]["review"] = review
    return "Report reviewed."
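Before wiring these tools into agents, it can help to see the state transitions in isolation. Below is a minimal plain-Python sketch that mirrors how the note, report, and review tools mutate the shared state dict, without the real `Context`/`edit_state` machinery; the sample notes and report text are made up purely for illustration.

```python
# Plain-dict mirror of the shared workflow state and the three state-mutating
# tools above. Each sketch function applies the same update as its async
# counterpart, so the state transitions are easy to inspect in isolation.
state = {
    "research_notes": {},
    "report_content": "Not written yet.",
    "review": "Review required.",
}

def record_notes_sketch(state: dict, notes_title: str, notes: str) -> str:
    # Same effect as record_notes: store notes under a title.
    state.setdefault("research_notes", {})[notes_title] = notes
    return "Notes recorded."

def write_report_sketch(state: dict, report_content: str) -> str:
    # Same effect as write_report: overwrite the report content.
    state["report_content"] = report_content
    return "Report written."

def review_report_sketch(state: dict, review: str) -> str:
    # Same effect as review_report: record the reviewer's feedback.
    state["review"] = review
    return "Report reviewed."

# Illustrative (made-up) run through one research -> write -> review pass:
record_notes_sketch(state, "ARPANET", "Packet switching; first link in 1969.")
write_report_sketch(state, "# History of the Internet\n...")
review_report_sketch(state, "Approved.")
print(state["review"])  # -> Approved.
```

In the real workflow these updates happen inside `ctx.store.edit_state()`, which is what lets TruLens observe each state transition as part of the trace.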
Creating Traceable Agents
Now we'll create our three agents using LlamaIndex's `FunctionAgent` class. Each agent has:
- Unique names: Essential for TruLens to distinguish agents in traces
- Clear descriptions: Help with agent handoff decisions and trace clarity
- Specific tools: Each tool call will appear as a distinct span in TruLens
- Handoff capabilities: TruLens will track when and why agents transfer control
The agent names (`ResearchAgent`, `WriteAgent`, `ReviewAgent`) will appear as span names in the TruLens dashboard, making it easy to follow the execution flow.
from llama_index.core.agent.workflow import FunctionAgent

research_agent = FunctionAgent(
    name="ResearchAgent",
    description="Useful for searching the web for information on a given topic and recording notes on the topic.",
    system_prompt=(
        "You are the ResearchAgent that can search the web for information on a given topic and record notes on the topic. "
        "Once notes are recorded and you are satisfied, you should hand off control to the WriteAgent to write a report on the topic. "
        "You should have at least some notes on a topic before handing off control to the WriteAgent."
    ),
    llm=llm,
    tools=[search_web, record_notes],
    can_handoff_to=["WriteAgent"],
)

write_agent = FunctionAgent(
    name="WriteAgent",
    description="Useful for writing a report on a given topic.",
    system_prompt=(
        "You are the WriteAgent that can write a report on a given topic. "
        "Your report should be in a markdown format. The content should be grounded in the research notes. "
        "Once the report is written, you should get feedback at least once from the ReviewAgent."
    ),
    llm=llm,
    tools=[write_report],
    can_handoff_to=["ReviewAgent", "ResearchAgent"],
)

review_agent = FunctionAgent(
    name="ReviewAgent",
    description="Useful for reviewing a report and providing feedback.",
    system_prompt=(
        "You are the ReviewAgent that can review the written report and provide feedback. "
        "Your review should either approve the current report or request changes for the WriteAgent to implement. "
        "If you have feedback that requires changes, you should hand off control to the WriteAgent to implement the changes after submitting the review."
    ),
    llm=llm,
    tools=[review_report],
    can_handoff_to=["WriteAgent"],
)
Creating the AgentWorkflow
With our agents defined, we create the `AgentWorkflow` that orchestrates their interactions. The workflow configuration includes:
- Agent list: All participating agents
- Root agent: The starting point (`ResearchAgent`)
- Initial state: Shared context that TruLens will track as it evolves
This workflow will be instrumented by TruLens to capture the complete execution trace.
from llama_index.core.agent.workflow import AgentWorkflow
agent_workflow = AgentWorkflow(
    agents=[research_agent, write_agent, review_agent],
    root_agent=research_agent.name,
    initial_state={
        "research_notes": {},
        "report_content": "Not written yet.",
        "review": "Review required.",
    },
)
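As a quick sanity check on the handoff topology, you can model the `can_handoff_to` edges as a small graph and confirm that every agent is reachable from the root, so no agent is silently orphaned. This is a hypothetical helper, not part of LlamaIndex or TruLens; the edge dict simply transcribes the agent definitions in this notebook.

```python
# Hypothetical helper: the can_handoff_to edges declared above, as a graph.
handoffs = {
    "ResearchAgent": ["WriteAgent"],
    "WriteAgent": ["ReviewAgent", "ResearchAgent"],
    "ReviewAgent": ["WriteAgent"],
}

def reachable_from(root: str, edges: dict) -> set:
    """Depth-first traversal: every agent reachable from the root agent."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, []))
    return seen

# All three agents should be reachable starting from ResearchAgent.
assert reachable_from("ResearchAgent", handoffs) == set(handoffs)
```

A check like this is most useful as workflows grow: a typo in a `can_handoff_to` entry would otherwise only surface at runtime as a stalled or misrouted handoff in the trace.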
Initialize TruLens for Tracing
We start by initializing a TruLens session that will:
- Store all traces: Every agent call, tool usage, and handoff
- Enable OTEL tracing: Advanced OpenTelemetry-based instrumentation
- Prepare for evaluation: Set up the infrastructure for feedback functions
The database will capture detailed execution traces that we can analyze in the TruLens dashboard.
from trulens.core import TruSession
session = TruSession()
session.reset_database()
Define Evaluation Metrics
For multi-agent systems, we focus on evaluating:
Execution Efficiency
- Measures how effectively the agents coordinate and complete tasks
- Identifies bottlenecks and unnecessary steps in the workflow
- Evaluates resource utilization across agent handoffs
Logical Consistency
- Ensures agents make coherent decisions throughout the workflow
- Validates that handoffs occur at appropriate times
- Checks that the final output aligns with the initial request
These trace-level evaluations analyze the entire workflow execution, providing insights into system-wide performance rather than individual component behavior.
from trulens.core import Feedback
from trulens.core.feedback.selector import Selector
from trulens.providers.openai import OpenAI as OpenAIProvider
llm_judge = OpenAIProvider(model_engine="gpt-4.1")

f_execution_efficiency = Feedback(
    llm_judge.execution_efficiency_with_cot_reasons,
    name="Execution Efficiency",
).on({
    "trace": Selector(trace_level=True),
})

f_logical_consistency = Feedback(
    llm_judge.logical_consistency_with_cot_reasons,
    name="Logical Consistency",
).on({
    "trace": Selector(trace_level=True),
})
Instrument the Workflow with TruLens
Now we wrap our `AgentWorkflow` with `TruLlamaWorkflow` to enable comprehensive tracing:
- Automatic instrumentation: Captures all agent calls and tool usage
- Agent-level spans: Each agent execution appears as a distinct trace segment
- Tool call tracking: Individual tool invocations are traced with inputs/outputs
- Evaluation integration: Feedback functions run automatically on each trace
The instrumentation happens transparently - no changes needed to your workflow code!
from trulens.apps.llamaindex import TruLlamaWorkflow
tru_workflow_recorder = TruLlamaWorkflow(
    agent_workflow,
    app_name="AgentWorkflow",
    app_version="base",
    main_method=agent_workflow.run,
    feedbacks=[f_execution_efficiency, f_logical_consistency],
)
Execute and Trace the Workflow
Now we run the workflow within a TruLens recording context. This will capture:
- Complete execution trace: Every agent call, tool usage, and state change
- Timing information: How long each agent and tool takes to execute
- Input/output data: What each agent receives and produces
- Agent handoffs: When and why control transfers between agents
- Evaluation scores: Automatic assessment of execution efficiency and logical consistency
The workflow will execute normally while TruLens captures everything in the background.
with tru_workflow_recorder as recording:
    # Run the workflow inside the TruLens recording context and await the result
    result = await agent_workflow.run(
        user_msg=(
            "Write me a report on the history of the internet. "
            "Briefly describe the history of the internet, including the development of the internet, the development of the web, "
            "and the development of the internet in the 21st century."
        )
    )

print("✅ Workflow completed successfully!")
from IPython.display import Markdown
Markdown(result.response.blocks[0].text)
Analyze Results in TruLens Dashboard
Launch the TruLens dashboard to explore your multi-agent system's behavior:
What You'll See
- Trace Timeline: Visual representation of agent execution and handoffs
- Agent Spans: Individual agent executions with timing and context
- Tool Call Details: Each tool invocation with inputs, outputs, and duration
- Evaluation Scores: Execution efficiency and logical consistency metrics
- Performance Insights: Bottlenecks, optimization opportunities, and system health
Key Metrics to Monitor
- Agent utilization: Which agents are most/least active
- Handoff patterns: How control flows between agents
- Tool effectiveness: Which tools provide the most value
- Execution bottlenecks: Where the system spends the most time
💡 Tip: Evaluations may take a few moments to compute. Refresh the dashboard to see updated results as they become available.
from trulens.dashboard import run_dashboard
run_dashboard()