TruLens Scorers with MLflow¶
TruLens feedback functions are available as first-class scorers in MLflow's GenAI evaluation framework starting with MLflow 3.10.0. This integration allows you to use TruLens' powerful evaluation capabilities directly within your MLflow workflows.
In this notebook, we'll demonstrate how to:
- Use TruLens scorers for RAG evaluation (Groundedness, ContextRelevance, AnswerRelevance)
- Use output scorers (Coherence)
- Integrate with MLflow's batch evaluation via mlflow.genai.evaluate
- Use TruLens scorers with MLflow tracing
- Evaluate agentic workflows with Agent Trace Scorers (ToolSelection, ToolCalling, etc.)
# !pip install 'mlflow>=3.10.0' trulens trulens-providers-litellm openai
import os
os.environ["OPENAI_API_KEY"] = "sk-proj-..."
Available TruLens Scorers in MLflow¶
TruLens provides several categories of scorers:
RAG Evaluation Scorers:
- Groundedness - Evaluates whether the response is grounded in the provided context
- ContextRelevance - Evaluates whether the retrieved context is relevant to the query
- AnswerRelevance - Evaluates whether the response is relevant to the input query
Output Scorers:
- Coherence - Evaluates the coherence and logical flow of any LLM output
Agent Trace Scorers:
- LogicalConsistency - Evaluates logical consistency of agent decisions
- ExecutionEfficiency - Evaluates efficiency of agent execution
- PlanAdherence - Evaluates whether the agent followed its plan
- PlanQuality - Evaluates the quality of agent planning
- ToolSelection - Evaluates appropriateness of tool selection
- ToolCalling - Evaluates correctness of tool calls
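All of the scorers listed above are imported from the same module. As a quick sanity check against your MLflow installation, the following import block mirrors the class names used throughout this notebook:
from mlflow.genai.scorers.trulens import (
    AnswerRelevance,
    Coherence,
    ContextRelevance,
    ExecutionEfficiency,
    Groundedness,
    LogicalConsistency,
    PlanAdherence,
    PlanQuality,
    ToolCalling,
    ToolSelection,
)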
Basic Usage: Direct Scorer Calls¶
Let's start with a simple example of using TruLens scorers directly.
from mlflow.genai.scorers.trulens import Groundedness
# Create a Groundedness scorer
groundedness_scorer = Groundedness(model="openai:/gpt-4o-mini")
# Evaluate a response against context
feedback = groundedness_scorer(
outputs="Paris is the capital of France and is known for the Eiffel Tower.",
expectations={
"context": "France is a country in Western Europe. Its capital city is Paris. Paris is famous for the Eiffel Tower, which was built in 1889."
},
)
print(f"Groundedness: {feedback.value}") # "yes" or "no"
print(f"Score: {feedback.metadata['score']}") # 0.0 to 1.0
# Test with a hallucinated response
hallucinated_feedback = groundedness_scorer(
outputs="Paris is the capital of France and was founded by Julius Caesar in 52 BC.",
expectations={
"context": "France is a country in Western Europe. Its capital city is Paris. Paris is famous for the Eiffel Tower."
},
)
print(f"Groundedness: {hallucinated_feedback.value}")
print(f"Score: {hallucinated_feedback.metadata['score']}")
Batch Evaluation with mlflow.genai.evaluate¶
TruLens scorers integrate seamlessly with MLflow's batch evaluation framework. This is useful for evaluating multiple examples at once.
import mlflow
from mlflow.genai.scorers.trulens import (
AnswerRelevance,
ContextRelevance,
Groundedness,
)
# Prepare evaluation dataset
eval_dataset = [
{
"inputs": {"question": "What is MLflow?"},
"outputs": "MLflow is an open-source platform for managing the end-to-end machine learning lifecycle.",
"expectations": {
"context": "MLflow is an open-source platform created by Databricks for managing the end-to-end machine learning lifecycle. It includes experiment tracking, model registry, and deployment capabilities."
},
},
{
"inputs": {"question": "What is TruLens used for?"},
"outputs": "TruLens is used for evaluating and monitoring LLM applications using feedback functions.",
"expectations": {
"context": "TruLens is a library for evaluating LLM applications. It provides feedback functions like groundedness, relevance, and coherence to measure the quality of LLM outputs."
},
},
{
"inputs": {"question": "What programming language is Python?"},
"outputs": "Python is a high-level, interpreted programming language known for its simplicity.",
"expectations": {
"context": "Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation."
},
},
]
# Run batch evaluation with multiple TruLens scorers
results = mlflow.genai.evaluate(
data=eval_dataset,
scorers=[
Groundedness(model="openai:/gpt-4o-mini"),
ContextRelevance(model="openai:/gpt-4o-mini"),
AnswerRelevance(model="openai:/gpt-4o-mini"),
],
)
# View results
print("Evaluation Results:")
print(results.tables["eval_results"])
# View aggregate metrics
print("\nAggregate Metrics:")
for metric, value in results.metrics.items():
    print(f" {metric}: {value:.3f}")
Configuring Thresholds¶
TruLens scorers return a score between 0 and 1. You can configure the threshold for pass/fail decisions.
# Create scorer with custom threshold
strict_groundedness = Groundedness(model="openai:/gpt-4o-mini", threshold=0.8)
# Test with the same response
feedback = strict_groundedness(
outputs="Paris is the capital of France.",
expectations={
"context": "France is a country in Europe. Its capital is Paris."
},
)
print(f"Pass/Fail: {feedback.value}") # "yes" if score >= 0.8, else "no"
print(f"Actual Score: {feedback.metadata['score']}")
print(f"Threshold: {feedback.metadata['threshold']}")
Using Different LLM Providers¶
TruLens scorers in MLflow support multiple LLM providers through LiteLLM.
# OpenAI
openai_scorer = Groundedness(model="openai:/gpt-4o-mini")
# Anthropic (requires ANTHROPIC_API_KEY)
# anthropic_scorer = Groundedness(model="anthropic:/claude-3-5-sonnet")
# Azure OpenAI (requires Azure configuration)
# azure_scorer = Groundedness(model="azure:/my-deployment-name")
# AWS Bedrock
# bedrock_scorer = Groundedness(model="bedrock:/anthropic.claude-3-sonnet")
# Google Vertex AI
# vertex_scorer = Groundedness(model="vertex_ai:/gemini-pro")
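The commented-out scorers above need provider credentials, which LiteLLM reads from environment variables. A minimal sketch; the variable names follow LiteLLM's conventions and may differ for your setup:
# Example provider credentials (uncomment and fill in as needed)
# os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."    # Anthropic
# os.environ["AWS_ACCESS_KEY_ID"] = "..."           # AWS Bedrock
# os.environ["AWS_SECRET_ACCESS_KEY"] = "..."       # AWS Bedrock
# os.environ["AWS_REGION_NAME"] = "us-east-1"       # AWS Bedrock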
Dynamic Scorer Creation with get_scorer¶
Use get_scorer to create scorers dynamically by name. This is useful when you need to configure scorers from configuration files or user input.
from mlflow.genai.scorers.trulens import get_scorer
# Define which scorers to use (could come from config file)
scorer_names = ["Groundedness", "ContextRelevance", "AnswerRelevance"]
# Create scorers dynamically
scorers = [get_scorer(name, model="openai:/gpt-4o-mini") for name in scorer_names]
# Use in evaluation
results = mlflow.genai.evaluate(
data=eval_dataset,
scorers=scorers,
)
print(results.tables["eval_results"])
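Since get_scorer takes plain strings, the scorer setup can live entirely in a configuration file. A minimal sketch using a hypothetical inline JSON config (in practice you would json.load it from disk):
import json

# Hypothetical configuration; the schema ("model", "scorers") is illustrative only
config = json.loads('{"model": "openai:/gpt-4o-mini", "scorers": ["Groundedness", "Coherence"]}')

configured_scorers = [get_scorer(name, model=config["model"]) for name in config["scorers"]]
config_results = mlflow.genai.evaluate(data=eval_dataset, scorers=configured_scorers)
print(config_results.tables["eval_results"])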
Output Scorer: Coherence¶
The Coherence scorer evaluates the logical flow and coherence of LLM outputs, independent of any context.
from mlflow.genai.scorers.trulens import Coherence
coherence_scorer = Coherence(model="openai:/gpt-4o-mini")
# Test coherent text
coherent_feedback = coherence_scorer(
outputs="Machine learning is a subset of artificial intelligence that enables systems to learn from data. It uses algorithms to identify patterns and make decisions with minimal human intervention. Common applications include image recognition, natural language processing, and recommendation systems."
)
print(f"Coherence (good text): {coherent_feedback.value}")
print(f"Score: {coherent_feedback.metadata['score']}")
# Test incoherent text
incoherent_feedback = coherence_scorer(
outputs="Machine learning uses pizza. The sun is blue therefore cats can fly. Python programming because weather patterns indicate database normalization."
)
print(f"Coherence (poor text): {incoherent_feedback.value}")
print(f"Score: {incoherent_feedback.metadata['score']}")
Integrating with MLflow Tracing¶
TruLens scorers can be used with MLflow's tracing infrastructure to evaluate traced LLM applications.
import mlflow
from openai import OpenAI
# Enable MLflow autologging for OpenAI
mlflow.openai.autolog()
client = OpenAI()
# Define a simple RAG function with tracing
@mlflow.trace
def simple_rag(question: str, context: str) -> str:
    """A simple RAG function that answers questions based on context."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Answer the question based only on the provided context.",
            },
            {
                "role": "user",
                "content": f"Context: {context}\n\nQuestion: {question}",
            },
        ],
    )
    return response.choices[0].message.content
# Run the RAG function
context = "TruLens is an open-source library for evaluating and tracking LLM applications. It provides feedback functions for measuring groundedness, relevance, and other quality metrics."
question = "What does TruLens do?"
answer = simple_rag(question, context)
print(f"Question: {question}")
print(f"Answer: {answer}")
Building a Complete RAG Evaluation Pipeline¶
Let's put it all together with a complete example that builds a simple RAG system and evaluates it using TruLens scorers.
# Sample knowledge base
knowledge_base = {
"mlflow": "MLflow is an open-source platform for the machine learning lifecycle, including experimentation, reproducibility, deployment, and a central model registry.",
"trulens": "TruLens is a library for evaluating LLM applications using feedback functions. It supports groundedness, relevance, and coherence evaluations.",
"langchain": "LangChain is a framework for developing applications powered by language models. It provides tools for prompt management, chains, and agents.",
"llamaindex": "LlamaIndex is a data framework for LLM applications. It provides tools for ingesting, structuring, and accessing private or domain-specific data.",
}
def retrieve_context(question: str) -> str:
    """Simple keyword-based retrieval."""
    question_lower = question.lower()
    for key, value in knowledge_base.items():
        if key in question_lower:
            return value
    return "No relevant context found."

def generate_answer(question: str, context: str) -> str:
    """Generate answer using OpenAI."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Answer the question based on the provided context. If the context doesn't contain relevant information, say so.",
            },
            {
                "role": "user",
                "content": f"Context: {context}\n\nQuestion: {question}",
            },
        ],
    )
    return response.choices[0].message.content
# Generate answers for test questions
test_questions = [
"What is MLflow used for?",
"How does TruLens evaluate LLM applications?",
"What is LangChain?",
]
# Build evaluation dataset
rag_eval_dataset = []
for question in test_questions:
    context = retrieve_context(question)
    answer = generate_answer(question, context)
    rag_eval_dataset.append({
        "inputs": {"question": question},
        "outputs": answer,
        "expectations": {"context": context},
    })
    print(f"Q: {question}")
    print(f"A: {answer}")
    print(f"Context: {context[:100]}...")
    print("-" * 50)
# Evaluate the RAG system
rag_results = mlflow.genai.evaluate(
data=rag_eval_dataset,
scorers=[
Groundedness(model="openai:/gpt-4o-mini", threshold=0.7),
ContextRelevance(model="openai:/gpt-4o-mini", threshold=0.7),
AnswerRelevance(model="openai:/gpt-4o-mini", threshold=0.7),
Coherence(model="openai:/gpt-4o-mini"),
],
)
print("\nRAG Evaluation Results:")
print(rag_results.tables["eval_results"])
# Print summary metrics
print("\nSummary Metrics:")
print("=" * 40)
for metric, value in sorted(rag_results.metrics.items()):
    print(f"{metric}: {value:.3f}")
Evaluating Agents with TruLens Scorers¶
TruLens provides specialized scorers for evaluating agentic workflows. These scorers analyze agent traces to evaluate planning, tool usage, and execution quality.
Available Agent Scorers:
- ToolSelection - Evaluates whether the agent selected appropriate tools for each step
- ToolCalling - Evaluates the correctness of tool call parameters and execution
- PlanQuality - Evaluates the quality of the agent's planning
- PlanAdherence - Evaluates whether the agent followed its plan
- LogicalConsistency - Evaluates logical consistency across agent decisions
- ExecutionEfficiency - Evaluates efficiency of the agent's execution
import json
import mlflow
from mlflow.genai.scorers.trulens import ToolCalling, ToolSelection
from openai import OpenAI
# Define tools for the agent
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g., San Francisco, CA",
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "Temperature unit",
},
},
"required": ["location"],
},
},
},
{
"type": "function",
"function": {
"name": "search_web",
"description": "Search the web for information",
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "The search query",
},
},
"required": ["query"],
},
},
},
{
"type": "function",
"function": {
"name": "calculate",
"description": "Perform mathematical calculations",
"parameters": {
"type": "object",
"properties": {
"expression": {
"type": "string",
"description": "The mathematical expression to evaluate",
},
},
"required": ["expression"],
},
},
},
]
# Simulated tool execution
def execute_tool(tool_name: str, arguments: dict) -> str:
    if tool_name == "get_weather":
        return json.dumps({
            "location": arguments["location"],
            "temperature": 72,
            "unit": arguments.get("unit", "fahrenheit"),
            "conditions": "sunny",
        })
    elif tool_name == "search_web":
        return json.dumps({
            "query": arguments["query"],
            "results": [
                "Result 1: Relevant information about " + arguments["query"],
                "Result 2: More details on " + arguments["query"],
            ],
        })
    elif tool_name == "calculate":
        try:
            result = eval(arguments["expression"])  # Note: use safe eval in production
            return json.dumps({"expression": arguments["expression"], "result": result})
        except Exception:
            return json.dumps({"error": "Invalid expression"})
    return json.dumps({"error": "Unknown tool"})
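As the comment above notes, eval should not be used on untrusted input. One possible replacement (purely illustrative, not part of the TruLens or MLflow APIs) is a small AST-based evaluator that only permits numeric literals and basic arithmetic:
import ast
import operator

# Hypothetical safe replacement for eval(): walk the parsed AST and allow
# only numbers and the arithmetic operators listed here.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def safe_calculate(expression: str) -> float:
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("Unsupported expression")

    return _eval(ast.parse(expression, mode="eval"))

print(safe_calculate("25 * 4 + 10"))  # 110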
@mlflow.trace
def run_agent(user_query: str) -> str:
    """Run an agent that can use tools to answer questions."""
    client = OpenAI()
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant with access to tools. Use them when appropriate to answer the user's questions.",
        },
        {"role": "user", "content": user_query},
    ]
    # First call to get tool selection
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        tools=tools,
        tool_choice="auto",
    )
    assistant_message = response.choices[0].message
    # If the model wants to call tools
    if assistant_message.tool_calls:
        messages.append(assistant_message)
        # Execute each tool call
        for tool_call in assistant_message.tool_calls:
            tool_name = tool_call.function.name
            arguments = json.loads(tool_call.function.arguments)
            # Execute the tool
            tool_result = execute_tool(tool_name, arguments)
            # Add tool result to messages
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": tool_result,
            })
        # Get final response after tool execution
        final_response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
        )
        return final_response.choices[0].message.content
    return assistant_message.content
# Run the agent with different queries
agent_queries = [
"What's the weather like in San Francisco?",
"Calculate 25 * 4 + 10",
"Search for information about TruLens evaluation",
]
agent_results = []
for query in agent_queries:
    result = run_agent(query)
    agent_results.append({"query": query, "response": result})
    print(f"Query: {query}")
    print(f"Response: {result}")
    print("-" * 50)
# Get the traces from MLflow
# search_traces returns a DataFrame with trace info including the trace objects
traces_df = mlflow.search_traces(max_results=3)
# The 'trace' column contains the actual Trace objects
traces = traces_df["trace"].tolist()
# Evaluate agent traces with TruLens scorers
tool_selection_scorer = ToolSelection(model="openai:/gpt-4o-mini")
tool_calling_scorer = ToolCalling(model="openai:/gpt-4o-mini")
print("Agent Evaluation Results:")
print("=" * 60)
for i, trace in enumerate(traces):
    print(f"\nQuery {i+1}: {agent_queries[i]}")
    # Evaluate tool selection
    selection_feedback = tool_selection_scorer(trace=trace)
    print(f" Tool Selection: {selection_feedback.value}")
    if selection_feedback.rationale:
        print(f" {selection_feedback.rationale[:100]}...")
    # Evaluate tool calling
    calling_feedback = tool_calling_scorer(trace=trace)
    print(f" Tool Calling: {calling_feedback.value}")
    if calling_feedback.rationale:
        print(f" {calling_feedback.rationale[:100]}...")
# Agent scorers also work in batch evaluation with mlflow.genai.evaluate:
# here we pass the agent function via predict_fn so MLflow runs and traces it.
# A sketch that scores previously collected traces instead appears at the end of this section.
from mlflow.genai.scorers.trulens import (
ExecutionEfficiency,
LogicalConsistency,
PlanAdherence,
PlanQuality,
)
# Run comprehensive agent evaluation using the agent function directly
agent_eval_results = mlflow.genai.evaluate(
data=[
{"inputs": {"user_query": q}} for q in agent_queries
],
predict_fn=run_agent,
scorers=[
ToolSelection(model="openai:/gpt-4o-mini"),
ToolCalling(model="openai:/gpt-4o-mini"),
Coherence(model="openai:/gpt-4o-mini"),
],
)
print("\nComprehensive Agent Evaluation:")
print(agent_eval_results.tables["eval_results"])
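As noted above, the agent scorers can also consume previously collected traces instead of re-running the agent. A minimal sketch, assuming mlflow.genai.evaluate accepts the DataFrame returned by mlflow.search_traces as its data argument:
# Evaluate the traces collected earlier with the plan- and consistency-oriented scorers
trace_eval_results = mlflow.genai.evaluate(
    data=traces_df,
    scorers=[
        PlanQuality(model="openai:/gpt-4o-mini"),
        PlanAdherence(model="openai:/gpt-4o-mini"),
        LogicalConsistency(model="openai:/gpt-4o-mini"),
        ExecutionEfficiency(model="openai:/gpt-4o-mini"),
    ],
)
print(trace_eval_results.tables["eval_results"])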