TruLens Scorers with MLflow¶
TruLens feedback functions are available as first-class scorers in MLflow's GenAI evaluation framework starting with MLflow 3.10.0. This integration allows you to use TruLens' powerful evaluation capabilities directly within your MLflow workflows.
In this notebook, we'll demonstrate how to:
- Use TruLens scorers for RAG evaluation (Groundedness, ContextRelevance, AnswerRelevance)
- Use output scorers (Coherence)
- Integrate with MLflow's batch evaluation via mlflow.genai.evaluate
- Use TruLens scorers with MLflow tracing
- Evaluate agentic workflows with Agent Trace Scorers (ToolSelection, ToolCalling, etc.)
# !pip install 'mlflow>=3.10.0' trulens trulens-providers-litellm openai
import os
os.environ["OPENAI_API_KEY"] = "sk-proj-..."
Available TruLens Scorers in MLflow¶
TruLens provides several categories of scorers:
RAG Evaluation Scorers:
- Groundedness - Evaluates whether the response is grounded in the provided context
- ContextRelevance - Evaluates whether the retrieved context is relevant to the query
- AnswerRelevance - Evaluates whether the response is relevant to the input query
Output Scorers:
- Coherence - Evaluates the coherence and logical flow of any LLM output
Agent Trace Scorers:
- LogicalConsistency - Evaluates logical consistency of agent decisions
- ExecutionEfficiency - Evaluates efficiency of agent execution
- PlanAdherence - Evaluates whether the agent followed its plan
- PlanQuality - Evaluates the quality of agent planning
- ToolSelection - Evaluates appropriateness of tool selection
- ToolCalling - Evaluates correctness of tool calls
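All of the scorers listed above are imported from the same module. As a quick sanity check against your MLflow installation, the following import block mirrors the class names used throughout this notebook:
from mlflow.genai.scorers.trulens import (
    AnswerRelevance,
    Coherence,
    ContextRelevance,
    ExecutionEfficiency,
    Groundedness,
    LogicalConsistency,
    PlanAdherence,
    PlanQuality,
    ToolCalling,
    ToolSelection,
)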
Basic Usage: Direct Scorer Calls¶
Let's start with a simple example of using TruLens scorers directly.
from mlflow.genai.scorers.trulens import Groundedness
# Create a Groundedness scorer
groundedness_scorer = Groundedness(model="openai:/gpt-4o-mini")
# Evaluate a response against context
feedback = groundedness_scorer(
outputs="Paris is the capital of France and is known for the Eiffel Tower.",
expectations={
"context": "France is a country in Western Europe. Its capital city is Paris. Paris is famous for the Eiffel Tower, which was built in 1889."
},
)
print(f"Groundedness: {feedback.value}") # "yes" or "no"
print(f"Score: {feedback.metadata['score']}") # 0.0 to 1.0
# Test with a hallucinated response
hallucinated_feedback = groundedness_scorer(
outputs="Paris is the capital of France and was founded by Julius Caesar in 52 BC.",
expectations={
"context": "France is a country in Western Europe. Its capital city is Paris. Paris is famous for the Eiffel Tower."
},
)
print(f"Groundedness: {hallucinated_feedback.value}")
print(f"Score: {hallucinated_feedback.metadata['score']}")
Batch Evaluation with mlflow.genai.evaluate¶
TruLens scorers integrate seamlessly with MLflow's batch evaluation framework. This is useful for evaluating multiple examples at once.
import mlflow
from mlflow.genai.scorers.trulens import (
AnswerRelevance,
ContextRelevance,
Groundedness,
)
# Prepare evaluation dataset
eval_dataset = [
{
"inputs": {"question": "What is MLflow?"},
"outputs": "MLflow is an open-source platform for managing the end-to-end machine learning lifecycle.",
"expectations": {
"context": "MLflow is an open-source platform created by Databricks for managing the end-to-end machine learning lifecycle. It includes experiment tracking, model registry, and deployment capabilities."
},
},
{
"inputs": {"question": "What is TruLens used for?"},
"outputs": "TruLens is used for evaluating and monitoring LLM applications using feedback functions.",
"expectations": {
"context": "TruLens is a library for evaluating LLM applications. It provides feedback functions like groundedness, relevance, and coherence to measure the quality of LLM outputs."
},
},
{
"inputs": {"question": "What programming language is Python?"},
"outputs": "Python is a high-level, interpreted programming language known for its simplicity.",
"expectations": {
"context": "Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation."
},
},
]
# Run batch evaluation with multiple TruLens scorers
results = mlflow.genai.evaluate(
data=eval_dataset,
scorers=[
Groundedness(model="openai:/gpt-4o-mini"),
ContextRelevance(model="openai:/gpt-4o-mini"),
AnswerRelevance(model="openai:/gpt-4o-mini"),
],
)
# View results
print("Evaluation Results:")
print(results.tables["eval_results"])
# View aggregate metrics
print("\nAggregate Metrics:")
for metric, value in results.metrics.items():
    print(f" {metric}: {value:.3f}")
Configuring Thresholds¶
TruLens scorers return a score between 0 and 1. You can configure the threshold for pass/fail decisions.
# Create scorer with custom threshold
strict_groundedness = Groundedness(model="openai:/gpt-4o-mini", threshold=0.8)
# Test with the same response
feedback = strict_groundedness(
outputs="Paris is the capital of France.",
expectations={
"context": "France is a country in Europe. Its capital is Paris."
},
)
print(f"Pass/Fail: {feedback.value}") # "yes" if score >= 0.8, else "no"
print(f"Actual Score: {feedback.metadata['score']}")
print(f"Threshold: {feedback.metadata['threshold']}")
Using Different LLM Providers¶
TruLens scorers in MLflow support multiple LLM providers through LiteLLM.
# OpenAI
openai_scorer = Groundedness(model="openai:/gpt-4o-mini")
# Anthropic (requires ANTHROPIC_API_KEY)
# anthropic_scorer = Groundedness(model="anthropic:/claude-3-5-sonnet")
# Azure OpenAI (requires Azure configuration)
# azure_scorer = Groundedness(model="azure:/my-deployment-name")
# AWS Bedrock
# bedrock_scorer = Groundedness(model="bedrock:/anthropic.claude-3-sonnet")
# Google Vertex AI
# vertex_scorer = Groundedness(model="vertex_ai:/gemini-pro")
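The commented-out scorers above need provider credentials, which LiteLLM reads from environment variables. A minimal sketch; the variable names follow LiteLLM's conventions and may differ for your setup:
# Example provider credentials (uncomment and fill in as needed)
# os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."    # Anthropic
# os.environ["AWS_ACCESS_KEY_ID"] = "..."           # AWS Bedrock
# os.environ["AWS_SECRET_ACCESS_KEY"] = "..."       # AWS Bedrock
# os.environ["AWS_REGION_NAME"] = "us-east-1"       # AWS Bedrock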
Dynamic Scorer Creation with get_scorer¶
Use get_scorer to create scorers dynamically by name. This is useful when you need to configure scorers from configuration files or user input.
from mlflow.genai.scorers.trulens import get_scorer
# Define which scorers to use (could come from config file)
scorer_names = ["Groundedness", "ContextRelevance", "AnswerRelevance"]
# Create scorers dynamically
scorers = [get_scorer(name, model="openai:/gpt-4o-mini") for name in scorer_names]
# Use in evaluation
results = mlflow.genai.evaluate(
data=eval_dataset,
scorers=scorers,
)
print(results.tables["eval_results"])
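Since get_scorer takes plain strings, the scorer setup can live entirely in a configuration file. A minimal sketch using a hypothetical inline JSON config (in practice you would json.load it from disk):
import json

# Hypothetical configuration; the schema ("model", "scorers") is illustrative only
config = json.loads('{"model": "openai:/gpt-4o-mini", "scorers": ["Groundedness", "Coherence"]}')

configured_scorers = [get_scorer(name, model=config["model"]) for name in config["scorers"]]
config_results = mlflow.genai.evaluate(data=eval_dataset, scorers=configured_scorers)
print(config_results.tables["eval_results"])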
Output Scorer: Coherence¶
The Coherence scorer evaluates the logical flow and coherence of LLM outputs, independent of any context.
from mlflow.genai.scorers.trulens import Coherence
coherence_scorer = Coherence(model="openai:/gpt-4o-mini")
# Test coherent text
coherent_feedback = coherence_scorer(
outputs="Machine learning is a subset of artificial intelligence that enables systems to learn from data. It uses algorithms to identify patterns and make decisions with minimal human intervention. Common applications include image recognition, natural language processing, and recommendation systems."
)
print(f"Coherence (good text): {coherent_feedback.value}")
print(f"Score: {coherent_feedback.metadata['score']}")
# Test incoherent text
incoherent_feedback = coherence_scorer(
outputs="Machine learning uses pizza. The sun is blue therefore cats can fly. Python programming because weather patterns indicate database normalization."
)
print(f"Coherence (poor text): {incoherent_feedback.value}")
print(f"Score: {incoherent_feedback.metadata['score']}")
Integrating with MLflow Tracing¶
TruLens scorers can be used with MLflow's tracing infrastructure to evaluate traced LLM applications.
import mlflow
from openai import OpenAI
# Enable MLflow autologging for OpenAI
mlflow.openai.autolog()
client = OpenAI()
# Define a simple RAG function with tracing
@mlflow.trace
def simple_rag(question: str, context: str) -> str:
    """A simple RAG function that answers questions based on context."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Answer the question based only on the provided context.",
            },
            {
                "role": "user",
                "content": f"Context: {context}\n\nQuestion: {question}",
            },
        ],
    )
    return response.choices[0].message.content
# Run the RAG function
context = "TruLens is an open-source library for evaluating and tracking LLM applications. It provides feedback functions for measuring groundedness, relevance, and other quality metrics."
question = "What does TruLens do?"
answer = simple_rag(question, context)
print(f"Question: {question}")
print(f"Answer: {answer}")
Building a Complete RAG Evaluation Pipeline¶
Let's put it all together with a complete example that builds a simple RAG system and evaluates it using TruLens scorers.
# Sample knowledge base
knowledge_base = {
"mlflow": "MLflow is an open-source platform for the machine learning lifecycle, including experimentation, reproducibility, deployment, and a central model registry.",
"trulens": "TruLens is a library for evaluating LLM applications using feedback functions. It supports groundedness, relevance, and coherence evaluations.",
"langchain": "LangChain is a framework for developing applications powered by language models. It provides tools for prompt management, chains, and agents.",
"llamaindex": "LlamaIndex is a data framework for LLM applications. It provides tools for ingesting, structuring, and accessing private or domain-specific data.",
}
def retrieve_context(question: str) -> str:
    """Simple keyword-based retrieval."""
    question_lower = question.lower()
    for key, value in knowledge_base.items():
        if key in question_lower:
            return value
    return "No relevant context found."

def generate_answer(question: str, context: str) -> str:
    """Generate answer using OpenAI."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Answer the question based on the provided context. If the context doesn't contain relevant information, say so.",
            },
            {
                "role": "user",
                "content": f"Context: {context}\n\nQuestion: {question}",
            },
        ],
    )
    return response.choices[0].message.content
# Generate answers for test questions
test_questions = [
"What is MLflow used for?",
"How does TruLens evaluate LLM applications?",
"What is LangChain?",
]
# Build evaluation dataset
rag_eval_dataset = []
for question in test_questions:
    context = retrieve_context(question)
    answer = generate_answer(question, context)
    rag_eval_dataset.append({
        "inputs": {"question": question},
        "outputs": answer,
        "expectations": {"context": context},
    })
    print(f"Q: {question}")
    print(f"A: {answer}")
    print(f"Context: {context[:100]}...")
    print("-" * 50)
# Evaluate the RAG system
rag_results = mlflow.genai.evaluate(
data=rag_eval_dataset,
scorers=[
Groundedness(model="openai:/gpt-4o-mini", threshold=0.7),
ContextRelevance(model="openai:/gpt-4o-mini", threshold=0.7),
AnswerRelevance(model="openai:/gpt-4o-mini", threshold=0.7),
Coherence(model="openai:/gpt-4o-mini"),
],
)
print("\nRAG Evaluation Results:")
print(rag_results.tables["eval_results"])
# Print summary metrics
print("\nSummary Metrics:")
print("=" * 40)
for metric, value in sorted(rag_results.metrics.items()):
    print(f"{metric}: {value:.3f}")
Evaluating Agents with TruLens Scorers¶
TruLens provides specialized scorers for evaluating agentic workflows. These scorers analyze agent traces to evaluate planning, tool usage, and execution quality.
Available Agent Scorers:
- ToolSelection - Evaluates whether the agent selected appropriate tools for each step
- ToolCalling - Evaluates the correctness of tool call parameters and execution
- PlanQuality - Evaluates the quality of the agent's planning
- PlanAdherence - Evaluates whether the agent followed its plan
- LogicalConsistency - Evaluates logical consistency across agent decisions
- ExecutionEfficiency - Evaluates efficiency of the agent's execution
import json
import mlflow
from mlflow.genai.scorers.trulens import ToolCalling, ToolSelection
from openai import OpenAI
# Define tools for the agent
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g., San Francisco, CA",
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "Temperature unit",
},
},
"required": ["location"],
},
},
},
{
"type": "function",
"function": {
"name": "search_web",
"description": "Search the web for information",
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "The search query",
},
},
"required": ["query"],
},
},
},
{
"type": "function",
"function": {
"name": "calculate",
"description": "Perform mathematical calculations",
"parameters": {
"type": "object",
"properties": {
"expression": {
"type": "string",
"description": "The mathematical expression to evaluate",
},
},
"required": ["expression"],
},
},
},
]
# Simulated tool execution
def execute_tool(tool_name: str, arguments: dict) -> str:
    if tool_name == "get_weather":
        return json.dumps({
            "location": arguments["location"],
            "temperature": 72,
            "unit": arguments.get("unit", "fahrenheit"),
            "conditions": "sunny",
        })
    elif tool_name == "search_web":
        return json.dumps({
            "query": arguments["query"],
            "results": [
                "Result 1: Relevant information about " + arguments["query"],
                "Result 2: More details on " + arguments["query"],
            ],
        })
    elif tool_name == "calculate":
        try:
            result = eval(arguments["expression"])  # Note: use safe eval in production
            return json.dumps({"expression": arguments["expression"], "result": result})
        except Exception:
            return json.dumps({"error": "Invalid expression"})
    return json.dumps({"error": "Unknown tool"})
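As the comment above notes, eval should not be used on untrusted input. One possible replacement (purely illustrative, not part of the TruLens or MLflow APIs) is a small AST-based evaluator that only permits numeric literals and basic arithmetic:
import ast
import operator

# Hypothetical safe replacement for eval(): walk the parsed AST and allow
# only numbers and the arithmetic operators listed here.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def safe_calculate(expression: str) -> float:
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("Unsupported expression")

    return _eval(ast.parse(expression, mode="eval"))

print(safe_calculate("25 * 4 + 10"))  # 110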
@mlflow.trace
def run_agent(user_query: str) -> str:
    """Run an agent that can use tools to answer questions."""
    client = OpenAI()
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant with access to tools. Use them when appropriate to answer the user's questions.",
        },
        {"role": "user", "content": user_query},
    ]
    # First call to get tool selection
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        tools=tools,
        tool_choice="auto",
    )
    assistant_message = response.choices[0].message
    # If the model wants to call tools
    if assistant_message.tool_calls:
        messages.append(assistant_message)
        # Execute each tool call
        for tool_call in assistant_message.tool_calls:
            tool_name = tool_call.function.name
            arguments = json.loads(tool_call.function.arguments)
            # Execute the tool
            tool_result = execute_tool(tool_name, arguments)
            # Add tool result to messages
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": tool_result,
            })
        # Get final response after tool execution
        final_response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
        )
        return final_response.choices[0].message.content
    return assistant_message.content
# Run the agent with different queries
agent_queries = [
"What's the weather like in San Francisco?",
"Calculate 25 * 4 + 10",
"Search for information about TruLens evaluation",
]
agent_results = []
for query in agent_queries:
    result = run_agent(query)
    agent_results.append({"query": query, "response": result})
    print(f"Query: {query}")
    print(f"Response: {result}")
    print("-" * 50)
# Get the traces from MLflow
# search_traces returns a DataFrame with trace info including the trace objects
traces_df = mlflow.search_traces(max_results=3)
# The 'trace' column contains the actual Trace objects
traces = traces_df["trace"].tolist()
# Evaluate agent traces with TruLens scorers
tool_selection_scorer = ToolSelection(model="openai:/gpt-4o-mini")
tool_calling_scorer = ToolCalling(model="openai:/gpt-4o-mini")
print("Agent Evaluation Results:")
print("=" * 60)
for i, trace in enumerate(traces):
    print(f"\nQuery {i+1}: {agent_queries[i]}")
    # Evaluate tool selection
    selection_feedback = tool_selection_scorer(trace=trace)
    print(f" Tool Selection: {selection_feedback.value}")
    if selection_feedback.rationale:
        print(f" {selection_feedback.rationale[:100]}...")
    # Evaluate tool calling
    calling_feedback = tool_calling_scorer(trace=trace)
    print(f" Tool Calling: {calling_feedback.value}")
    if calling_feedback.rationale:
        print(f" {calling_feedback.rationale[:100]}...")
# Agent scorers also work in batch evaluation with mlflow.genai.evaluate:
# here we pass the agent function via predict_fn so MLflow runs and traces it.
# A sketch that scores previously collected traces instead appears at the end of this section.
from mlflow.genai.scorers.trulens import (
ExecutionEfficiency,
LogicalConsistency,
PlanAdherence,
PlanQuality,
)
# Run comprehensive agent evaluation using the agent function directly
agent_eval_results = mlflow.genai.evaluate(
data=[
{"inputs": {"user_query": q}} for q in agent_queries
],
predict_fn=run_agent,
scorers=[
ToolSelection(model="openai:/gpt-4o-mini"),
ToolCalling(model="openai:/gpt-4o-mini"),
Coherence(model="openai:/gpt-4o-mini"),
],
)
print("\nComprehensive Agent Evaluation:")
print(agent_eval_results.tables["eval_results"])
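As noted above, the agent scorers can also consume previously collected traces instead of re-running the agent. A minimal sketch, assuming mlflow.genai.evaluate accepts the DataFrame returned by mlflow.search_traces as its data argument:
# Evaluate the traces collected earlier with the plan- and consistency-oriented scorers
trace_eval_results = mlflow.genai.evaluate(
    data=traces_df,
    scorers=[
        PlanQuality(model="openai:/gpt-4o-mini"),
        PlanAdherence(model="openai:/gpt-4o-mini"),
        LogicalConsistency(model="openai:/gpt-4o-mini"),
        ExecutionEfficiency(model="openai:/gpt-4o-mini"),
    ],
)
print(trace_eval_results.tables["eval_results"])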