MLflow Integration

TruLens feedback functions are available as first-class scorers in MLflow's GenAI evaluation framework starting with MLflow 3.10.0.

Installation

Install MLflow with TruLens support:

pip install 'mlflow>=3.10.0' trulens trulens-providers-litellm

Available Scorers

TruLens provides three categories of scorers in MLflow:

RAG Evaluation Scorers

| Scorer | Description |
| --- | --- |
| Groundedness | Evaluates whether the response is grounded in the provided context |
| ContextRelevance | Evaluates whether the retrieved context is relevant to the query |
| AnswerRelevance | Evaluates whether the response is relevant to the input query |

Output Scorers

| Scorer | Description |
| --- | --- |
| Coherence | Evaluates the coherence and logical flow of any LLM output |

Agent Trace Scorers

For evaluating agentic workflows and tool usage:

| Scorer | Description |
| --- | --- |
| LogicalConsistency | Evaluates logical consistency of agent decisions |
| ExecutionEfficiency | Evaluates efficiency of agent execution |
| PlanAdherence | Evaluates whether the agent followed its plan |
| PlanQuality | Evaluates the quality of agent planning |
| ToolSelection | Evaluates appropriateness of tool selection |
| ToolCalling | Evaluates correctness of tool calls |
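
All of these classes are importable from mlflow.genai.scorers.trulens, as the examples below show. A minimal sketch instantiating one scorer from each category (the model URI is a placeholder; any provider URI from the Model Configuration section works):

from mlflow.genai.scorers.trulens import Coherence, Groundedness, ToolSelection

# One scorer from each category; "openai:/gpt-4o" is just a placeholder model URI
rag_scorer = Groundedness(model="openai:/gpt-4o")     # RAG evaluation
output_scorer = Coherence(model="openai:/gpt-4o")     # output quality
agent_scorer = ToolSelection(model="openai:/gpt-4o")  # agent traces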

Basic Usage

Direct Scorer Calls

from mlflow.genai.scorers.trulens import Groundedness

scorer = Groundedness(model="openai:/gpt-4o")

feedback = scorer(
    outputs="Paris is the capital of France.",
    expectations={"context": "France is a country in Europe. Its capital is Paris."},
)

print(feedback.value)  # "yes" or "no"
print(feedback.metadata["score"])  # 0.0 to 1.0

Batch Evaluation with mlflow.genai.evaluate

import mlflow
from mlflow.genai.scorers.trulens import Groundedness, ContextRelevance, AnswerRelevance

eval_dataset = [
    {
        "inputs": {"question": "What is MLflow?"},
        "outputs": "MLflow is an open-source platform for ML lifecycle management.",
        "expectations": {
            "context": "MLflow is an open-source platform for managing the end-to-end machine learning lifecycle."
        },
    },
]

results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[
        Groundedness(model="openai:/gpt-4o"),
        ContextRelevance(model="openai:/gpt-4o"),
        AnswerRelevance(model="openai:/gpt-4o"),
    ],
)

print(results.tables["eval_results"])

Model Configuration

TruLens scorers in MLflow support multiple LLM providers through LiteLLM:

OpenAI

from mlflow.genai.scorers.trulens import Groundedness

scorer = Groundedness(model="openai:/gpt-4o")

Anthropic

scorer = Groundedness(model="anthropic:/claude-3-5-sonnet")

Azure OpenAI

scorer = Groundedness(model="azure:/my-deployment-name")

Other LiteLLM Providers

# AWS Bedrock
scorer = Groundedness(model="bedrock:/anthropic.claude-3-sonnet")

# Google Vertex AI
scorer = Groundedness(model="vertex_ai:/gemini-pro")

Threshold Configuration

TruLens scorers return a score between 0 and 1. You can configure the threshold for pass/fail:

from mlflow.genai.scorers.trulens import Groundedness

# Default threshold is 0.5
scorer = Groundedness(model="openai:/gpt-4o", threshold=0.7)

feedback = scorer(outputs="...", expectations={"context": "..."})
print(feedback.value)  # "yes" if score >= 0.7, else "no"
print(feedback.metadata["score"])  # Actual score (0.0 to 1.0)
print(feedback.metadata["threshold"])  # 0.7

Dynamic Scorer Creation

Use get_scorer to create scorers dynamically by name. This is useful when you need to configure scorers from external configuration files, environment variables, or user input rather than hardcoding scorer classes in your code:

import mlflow
from mlflow.genai.scorers.trulens import get_scorer

# Load scorer names from config or user input
scorer_names = ["Groundedness", "ContextRelevance"]

scorers = [get_scorer(name, model="openai:/gpt-4o") for name in scorer_names]

results = mlflow.genai.evaluate(
    data=eval_dataset,  # reuses the dataset from the batch example above
    scorers=scorers,
)

Using with MLflow Tracing

TruLens scorers integrate with MLflow's tracing infrastructure:

import mlflow
from mlflow.genai.scorers.trulens import Groundedness

# Enable tracing
mlflow.openai.autolog()

@mlflow.trace
def my_rag_app(question: str) -> str:
    # Your RAG logic here: retrieve context and generate a response
    response = ...  # call your retriever and LLM
    return response

# Run the app to produce a trace, then fetch that trace
my_rag_app("What is MLflow?")
trace = mlflow.get_trace(mlflow.get_last_active_trace_id())

# Evaluate using the trace
scorer = Groundedness(model="openai:/gpt-4o")
feedback = scorer(trace=trace)

Agent Evaluation

Agent GPA scorers evaluate tool selection and execution in agentic workflows. These scorers require traces since they inspect tool call spans.

Batch Agent Evaluation

Use predict_fn with mlflow.genai.evaluate to trace and evaluate agent runs:

import mlflow
from openai import OpenAI
from mlflow.genai.scorers.trulens import (
    Groundedness,
    ToolSelection,
    ToolCalling,
    Coherence,
)

mlflow.openai.autolog()
client = OpenAI()


def run_agent(inputs: dict) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": inputs["user_query"]}],
        tools=[...],  # your tool definitions
    )
    # ... handle tool calls and return the final response text
    return response.choices[0].message.content


agent_queries = [
    "What's the weather in Paris?",
    "Book a flight to Tokyo for next Monday",
    "Send an email to my team about the meeting",
]

agent_eval_results = mlflow.genai.evaluate(
    data=[{"inputs": {"user_query": q}} for q in agent_queries],
    predict_fn=run_agent,
    scorers=[
        Groundedness(model="openai:/gpt-4o-mini"),
        ToolSelection(model="openai:/gpt-4o-mini"),
        ToolCalling(model="openai:/gpt-4o-mini"),
        Coherence(model="openai:/gpt-4o-mini"),
    ],
)

print(agent_eval_results.tables["eval_results"])

Evaluating Individual Agent Traces

You can also evaluate agent traces individually:

import mlflow
from mlflow.genai.scorers.trulens import ToolSelection, ToolCalling

mlflow.openai.autolog()

# Run your agent
result = run_agent({"user_query": "What's the weather in Paris?"})

# Get the trace
trace = mlflow.get_trace(mlflow.get_last_active_trace_id())

# Evaluate tool usage
tool_selection = ToolSelection(model="openai:/gpt-4o-mini")
tool_calling = ToolCalling(model="openai:/gpt-4o-mini")

selection_feedback = tool_selection(trace=trace)
calling_feedback = tool_calling(trace=trace)

print(f"Tool Selection: {selection_feedback.value}")
print(f"Tool Calling: {calling_feedback.value}")
print(f"Rationale: {selection_feedback.rationale}")

Agent vs RAG Scorers

RAG and output scorers (Groundedness, Coherence, etc.) can be called directly with data or on traces. Agent GPA scorers (ToolSelection, ToolCalling, etc.) require a trace parameter since they evaluate tool usage patterns within trace spans.
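
A minimal sketch contrasting the two call patterns (it assumes Coherence accepts a bare outputs argument, as in the direct-call example above, and that trace was obtained as in the previous section):

from mlflow.genai.scorers.trulens import Coherence, ToolSelection

# Output scorers accept plain data...
coherence = Coherence(model="openai:/gpt-4o")
output_feedback = coherence(outputs="MLflow is an open-source platform for ML lifecycle management.")

# ...while agent trace scorers only accept a trace
tool_selection = ToolSelection(model="openai:/gpt-4o")
trace_feedback = tool_selection(trace=trace)  # trace from mlflow.get_trace(...)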

Viewing Results

Results are automatically logged to MLflow:

# Access detailed results
df = results.tables["eval_results"]
print(df[["inputs", "outputs", "Groundedness", "ContextRelevance"]])

# Access aggregate metrics
print(results.metrics)
# Example: {'Groundedness/mean': 0.85, 'ContextRelevance/mean': 0.92}

Best Practices

Choose the Right Scorer

| Goal | Recommended Scorer |
| --- | --- |
| Detect hallucinations | Groundedness |
| Evaluate retrieval quality | ContextRelevance |
| Check answer relevance | AnswerRelevance |
| Assess response quality | Coherence |
| Evaluate agent behavior | Agent trace scorers |

Provide Context

For RAG evaluation scorers, always provide the retrieved context under the context key in expectations:

{
    "expectations": {
        "context": "The retrieved documents or ground truth...",
    }
}
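
A complete record in the mlflow.genai.evaluate format, with the retrieved context attached, might look like this (values are illustrative):

record = {
    "inputs": {"question": "What is the capital of France?"},
    "outputs": "The capital of France is Paris.",
    "expectations": {
        "context": "France is a country in Western Europe. Its capital is Paris.",
    },
}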

Troubleshooting

Missing Dependencies

ModuleNotFoundError: No module named 'trulens'

Install the TruLens packages:

pip install trulens trulens-providers-litellm

API Key Issues

Ensure your API key is set:

export OPENAI_API_KEY="your-key"
# or
export ANTHROPIC_API_KEY="your-key"
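
Keys can also be set from Python (for example, in a notebook) before any scorer is constructed; a minimal sketch:

import os

# Set the provider key programmatically instead of exporting it in the shell
os.environ["OPENAI_API_KEY"] = "your-key"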