Tracing and Evaluating Multi-Agent Systems with TruLens and LlamaIndex AgentWorkflow
In this notebook, we demonstrate how to use TruLens to trace, monitor, and evaluate multi-agent systems built with LlamaIndex's `AgentWorkflow`.
What You'll Learn
- How to instrument LlamaIndex `AgentWorkflow` with TruLens for comprehensive tracing
- How to capture agent-level spans and tool calls in the TruLens dashboard
- How to evaluate multi-agent system performance using TruLens feedback functions
- How to monitor execution efficiency and logical consistency across agent handoffs
The Multi-Agent System
We'll build a report generation system with three specialized agents:
- ResearchAgent: Searches the web and records research notes
- WriteAgent: Creates markdown reports based on research
- ReviewAgent: Reviews and provides feedback on reports
The key focus is on observability and evaluation - understanding how agents interact, where bottlenecks occur, and how to measure system performance.
Setup
This example requires several key components:
- TruLens: For tracing, monitoring, and evaluating the multi-agent system
- LlamaIndex: For the `AgentWorkflow` and agent implementations
- OpenAI: As the LLM provider for all agents
- Tavily: For web search capabilities
We'll use OpenAI's GPT-4o as our LLM across all agents for consistency. TruLens will capture every interaction, tool call, and agent handoff, providing complete visibility into the system's behavior.
%pip install llama-index trulens-apps-llamaindex trulens-providers-openai tavily-python -q
import os
os.environ["OPENAI_API_KEY"] = "sk-proj-..."
os.environ["TAVILY_API_KEY"] = "tvly-dev-..."
from llama_index.llms.openai import OpenAI
llm = OpenAI(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY"))
System Design & Tracing Architecture
Our multi-agent system consists of three specialized agents that work together in a coordinated workflow:
Agent Roles
- ResearchAgent: Searches the web and records research notes
- WriteAgent: Creates markdown reports based on research findings
- ReviewAgent: Reviews reports and provides feedback for improvements
Tools & Observability
Each agent uses specific tools that TruLens will trace:
- `search_web`: Web search queries and results
- `record_notes`: Note-taking and knowledge storage
- `write_report`: Report generation process
- `review_report`: Review and feedback generation
What TruLens Captures
With TruLens instrumentation, we'll observe:
- Agent-level spans: Each agent's execution time and context
- Tool call traces: Individual tool invocations and their results
- Agent handoffs: When and why control passes between agents
- State transitions: How the shared context evolves
- Performance metrics: Execution efficiency and logical consistency
The `Context` class enables state sharing between agents, and TruLens will track how this state evolves throughout the workflow execution.
from tavily import AsyncTavilyClient
from llama_index.core.workflow import Context
async def search_web(query: str) -> str:
    """Useful for using the web to answer questions."""
    client = AsyncTavilyClient(api_key=os.getenv("TAVILY_API_KEY"))
    return str(await client.search(query))

async def record_notes(ctx: Context, notes: str, notes_title: str) -> str:
    """Useful for recording notes on a given topic. Your input should be notes with a title to save the notes under."""
    async with ctx.store.edit_state() as ctx_state:
        if "research_notes" not in ctx_state["state"]:
            ctx_state["state"]["research_notes"] = {}
        ctx_state["state"]["research_notes"][notes_title] = notes
    return "Notes recorded."

async def write_report(ctx: Context, report_content: str) -> str:
    """Useful for writing a report on a given topic. Your input should be a markdown formatted report."""
    async with ctx.store.edit_state() as ctx_state:
        ctx_state["state"]["report_content"] = report_content
    return "Report written."

async def review_report(ctx: Context, review: str) -> str:
    """Useful for reviewing a report and providing feedback. Your input should be a review of the report."""
    async with ctx.store.edit_state() as ctx_state:
        ctx_state["state"]["review"] = review
    return "Report reviewed."
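Before wiring these tools into agents, it can help to see the state transitions in isolation. Below is a minimal plain-Python sketch that mirrors how the note, report, and review tools mutate the shared state dict, without the real `Context`/`edit_state` machinery; the sample notes and report text are made up purely for illustration.

```python
# Plain-dict mirror of the shared workflow state and the three state-mutating
# tools above. Each sketch function applies the same update as its async
# counterpart, so the state transitions are easy to inspect in isolation.
state = {
    "research_notes": {},
    "report_content": "Not written yet.",
    "review": "Review required.",
}

def record_notes_sketch(state: dict, notes_title: str, notes: str) -> str:
    # Same effect as record_notes: store notes under a title.
    state.setdefault("research_notes", {})[notes_title] = notes
    return "Notes recorded."

def write_report_sketch(state: dict, report_content: str) -> str:
    # Same effect as write_report: overwrite the report content.
    state["report_content"] = report_content
    return "Report written."

def review_report_sketch(state: dict, review: str) -> str:
    # Same effect as review_report: record the reviewer's feedback.
    state["review"] = review
    return "Report reviewed."

# Illustrative (made-up) run through one research -> write -> review pass:
record_notes_sketch(state, "ARPANET", "Packet switching; first link in 1969.")
write_report_sketch(state, "# History of the Internet\n...")
review_report_sketch(state, "Approved.")
print(state["review"])  # -> Approved.
```

In the real workflow these updates happen inside `ctx.store.edit_state()`, which is what lets TruLens observe each state transition as part of the trace.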
Creating Traceable Agents
Now we'll create our three agents using LlamaIndex's `FunctionAgent` class. Each agent has:
- Unique names: Essential for TruLens to distinguish agents in traces
- Clear descriptions: Help with agent handoff decisions and trace clarity
- Specific tools: Each tool call will appear as a distinct span in TruLens
- Handoff capabilities: TruLens will track when and why agents transfer control
The agent names (`ResearchAgent`, `WriteAgent`, `ReviewAgent`) will appear as span names in the TruLens dashboard, making it easy to follow the execution flow.
from llama_index.core.agent.workflow import FunctionAgent

research_agent = FunctionAgent(
    name="ResearchAgent",
    description="Useful for searching the web for information on a given topic and recording notes on the topic.",
    system_prompt=(
        "You are the ResearchAgent that can search the web for information on a given topic and record notes on the topic. "
        "Once notes are recorded and you are satisfied, you should hand off control to the WriteAgent to write a report on the topic. "
        "You should have at least some notes on a topic before handing off control to the WriteAgent."
    ),
    llm=llm,
    tools=[search_web, record_notes],
    can_handoff_to=["WriteAgent"],
)

write_agent = FunctionAgent(
    name="WriteAgent",
    description="Useful for writing a report on a given topic.",
    system_prompt=(
        "You are the WriteAgent that can write a report on a given topic. "
        "Your report should be in a markdown format. The content should be grounded in the research notes. "
        "Once the report is written, you should get feedback at least once from the ReviewAgent."
    ),
    llm=llm,
    tools=[write_report],
    can_handoff_to=["ReviewAgent", "ResearchAgent"],
)

review_agent = FunctionAgent(
    name="ReviewAgent",
    description="Useful for reviewing a report and providing feedback.",
    system_prompt=(
        "You are the ReviewAgent that can review the written report and provide feedback. "
        "Your review should either approve the current report or request changes for the WriteAgent to implement. "
        "If you have feedback that requires changes, you should hand off control to the WriteAgent to implement the changes after submitting the review."
    ),
    llm=llm,
    tools=[review_report],
    can_handoff_to=["WriteAgent"],
)
Creating the AgentWorkflow
With our agents defined, we create the `AgentWorkflow` that orchestrates their interactions. The workflow configuration includes:
- Agent list: All participating agents
- Root agent: The starting point (`ResearchAgent`)
- Initial state: Shared context that TruLens will track as it evolves
This workflow will be instrumented by TruLens to capture the complete execution trace.
from llama_index.core.agent.workflow import AgentWorkflow
agent_workflow = AgentWorkflow(
    agents=[research_agent, write_agent, review_agent],
    root_agent=research_agent.name,
    initial_state={
        "research_notes": {},
        "report_content": "Not written yet.",
        "review": "Review required.",
    },
)
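As a quick sanity check on the handoff topology, you can model the `can_handoff_to` edges as a small graph and confirm that every agent is reachable from the root, so no agent is silently orphaned. This is a hypothetical helper, not part of LlamaIndex or TruLens; the edge dict simply transcribes the agent definitions in this notebook.

```python
# Hypothetical helper: the can_handoff_to edges declared above, as a graph.
handoffs = {
    "ResearchAgent": ["WriteAgent"],
    "WriteAgent": ["ReviewAgent", "ResearchAgent"],
    "ReviewAgent": ["WriteAgent"],
}

def reachable_from(root: str, edges: dict) -> set:
    """Depth-first traversal: every agent reachable from the root agent."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, []))
    return seen

# All three agents should be reachable starting from ResearchAgent.
assert reachable_from("ResearchAgent", handoffs) == set(handoffs)
```

A check like this is most useful as workflows grow: a typo in a `can_handoff_to` entry would otherwise only surface at runtime as a stalled or misrouted handoff in the trace.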
Initialize TruLens for Tracing
We start by initializing a TruLens session that will:
- Store all traces: Every agent call, tool usage, and handoff
- Enable OTEL tracing: Advanced OpenTelemetry-based instrumentation
- Prepare for evaluation: Set up the infrastructure for feedback functions
The database will capture detailed execution traces that we can analyze in the TruLens dashboard.
from trulens.core import TruSession
session = TruSession()
session.reset_database()
Define Evaluation Metrics
For multi-agent systems, we focus on evaluating:
Execution Efficiency
- Measures how effectively the agents coordinate and complete tasks
- Identifies bottlenecks and unnecessary steps in the workflow
- Evaluates resource utilization across agent handoffs
Logical Consistency
- Ensures agents make coherent decisions throughout the workflow
- Validates that handoffs occur at appropriate times
- Checks that the final output aligns with the initial request
These trace-level evaluations analyze the entire workflow execution, providing insights into system-wide performance rather than individual component behavior.
from trulens.core import Feedback
from trulens.core.feedback.selector import Selector
from trulens.providers.openai import OpenAI as OpenAIProvider
llm_judge = OpenAIProvider(model_engine="gpt-4.1")

f_execution_efficiency = Feedback(
    llm_judge.execution_efficiency_with_cot_reasons,
    name="Execution Efficiency",
).on({
    "trace": Selector(trace_level=True),
})

f_logical_consistency = Feedback(
    llm_judge.logical_consistency_with_cot_reasons,
    name="Logical Consistency",
).on({
    "trace": Selector(trace_level=True),
})
Instrument the Workflow with TruLens
Now we wrap our `AgentWorkflow` with `TruLlamaWorkflow` to enable comprehensive tracing:
- Automatic instrumentation: Captures all agent calls and tool usage
- Agent-level spans: Each agent execution appears as a distinct trace segment
- Tool call tracking: Individual tool invocations are traced with inputs/outputs
- Evaluation integration: Feedback functions run automatically on each trace
The instrumentation happens transparently - no changes needed to your workflow code!
from trulens.apps.llamaindex import TruLlamaWorkflow
tru_workflow_recorder = TruLlamaWorkflow(
    agent_workflow,
    app_name="AgentWorkflow",
    app_version="base",
    main_method=agent_workflow.run,
    feedbacks=[f_execution_efficiency, f_logical_consistency],
)
Execute and Trace the Workflow
Now we run the workflow within a TruLens recording context. This will capture:
- Complete execution trace: Every agent call, tool usage, and state change
- Timing information: How long each agent and tool takes to execute
- Input/output data: What each agent receives and produces
- Agent handoffs: When and why control transfers between agents
- Evaluation scores: Automatic assessment of execution efficiency and logical consistency
The workflow will execute normally while TruLens captures everything in the background.
with tru_workflow_recorder as recording:
    # Run the workflow inside the TruLens recording context and await the result
    result = await agent_workflow.run(
        user_msg=(
            "Write me a report on the history of the internet. "
            "Briefly describe the history of the internet, including the development of the internet, the development of the web, "
            "and the development of the internet in the 21st century."
        )
    )

print("✅ Workflow completed successfully!")
from IPython.display import Markdown
Markdown(result.response.blocks[0].text)
Analyze Results in TruLens Dashboard
Launch the TruLens dashboard to explore your multi-agent system's behavior:
What You'll See
- Trace Timeline: Visual representation of agent execution and handoffs
- Agent Spans: Individual agent executions with timing and context
- Tool Call Details: Each tool invocation with inputs, outputs, and duration
- Evaluation Scores: Execution efficiency and logical consistency metrics
- Performance Insights: Bottlenecks, optimization opportunities, and system health
Key Metrics to Monitor
- Agent utilization: Which agents are most/least active
- Handoff patterns: How control flows between agents
- Tool effectiveness: Which tools provide the most value
- Execution bottlenecks: Where the system spends the most time
💡 Tip: Evaluations may take a few moments to compute. Refresh the dashboard to see updated results as they become available.
from trulens.dashboard import run_dashboard
run_dashboard()