GEPA + TruLens: Evolving Prompts with Feedback-Driven Fitness¶
GEPA (Genetic/Evolutionary Prompt Adaptation) optimizes prompts using evolutionary algorithms. Instead of manually tuning instructions, you define a fitness function that scores prompt variants and let the algorithm search for improvements.
TruLens feedback functions are a natural fit as fitness functions: they already score dimensions like context relevance, groundedness, and toxicity on a [0, 1] scale.
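To make the search concrete, here is a minimal sketch of a generic mutate-and-select loop in plain Python. It illustrates the idea only; it is not GEPA's actual algorithm, and the helper names are ours:

# Minimal sketch of a mutate-and-select loop; not GEPA's actual implementation.
def evolve(base, fitness, mutate, generations=5, pop_size=4):
    best, best_score = base, fitness(base)
    for _ in range(generations):
        # Mutate the current best prompt to form a candidate population.
        candidates = [mutate(best) for _ in range(pop_size)]
        for cand in candidates:
            score = fitness(cand)
            if score > best_score:
                best, best_score = cand, score
    return best, best_score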
This notebook shows how to:
- Wrap a TruLens feedback function as a GEPA fitness function using TruGEPA.
- Run evolutionary prompt optimization with run_evolution.
- Automatically log every evaluation as a TruLens virtual record for audit and dashboard visualization.
- Plot the improvement trajectory.
# !pip install trulens trulens-apps-gepa trulens-providers-openai
Setup¶
import os
# Set your OpenAI API key.
os.environ["OPENAI_API_KEY"] = "sk-..." # replace with your key
1. Start a TruLens session¶
TruGEPA logs every evaluation automatically. A TruSession must be active
before the first evaluation so records have somewhere to go.
from trulens.core import TruSession
session = TruSession()
session.reset_database()
2. Define a feedback function¶
We use context_relevance from the OpenAI provider. It scores the relevance between a question and a context on a [0, 1] scale. Because the context here is held fixed, the score measures how well each prompt variant targets it.
from trulens.providers.openai import OpenAI
provider = OpenAI()
# context_relevance expects (question, context) -> float
feedback_fn = provider.context_relevance
# Fixed reference context used to score every prompt variant.
REFERENCE_CONTEXT = (
    "TruLens is an open-source library for evaluating and tracking "
    "LLM-based applications. It supports feedback functions for quality, "
    "safety, and relevance metrics."
)
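Before wrapping it, you can call the feedback function directly to confirm the raw (question, context) -> float interface:

# Direct call; returns a relevance score in [0, 1].
raw_score = feedback_fn(question="What does TruLens do?", context=REFERENCE_CONTEXT)
print(f"Raw feedback score: {raw_score:.3f}")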
3. Wrap the feedback function as a GEPA fitness function¶
Pass both app_name and app_version to enable logging — TruGEPA creates
a TruVirtual recorder automatically and every evaluation is stored as a
TruLens record. Omit both to run without logging. Supplying only one raises
an error immediately.
from trulens.apps.gepa import TruGEPA
fitness = TruGEPA(
    feedback_fn,
    # optimize_key names the feedback arg that receives the evolving prompt.
    optimize_key="question",
    # feedback_args holds all other fixed args forwarded on every call.
    feedback_args={"context": REFERENCE_CONTEXT},
    # Supply both to enable logging; omit both to run without logging.
    app_name="gepa_prompt_optimizer",
    app_version="v1",
)
# Quick sanity check.
score = fitness("What does TruLens do?")
print(f"Test score: {score:.3f}")
4. Define a mutation function¶
A mutation function takes a prompt string and returns a modified variant. Real-world setups often use an LLM to rephrase (see the sketch after this cell); here we use simple template mutations for illustration.
import random
MUTATIONS = [
    lambda p: f"Please explain: {p}",
    lambda p: f"{p} Provide a detailed answer.",
    lambda p: f"In simple terms, {p.lower()}",
    lambda p: f"{p} Focus on key benefits.",
    lambda p: p.replace("?", ". Explain this."),
]
def mutate(prompt: str) -> str:
    return random.choice(MUTATIONS)(prompt)
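As a rough sketch of the LLM-based alternative, the templates could be replaced with a rewriter built on the openai client. The model name and rewriting instruction below are illustrative assumptions, not part of GEPA or TruLens:

from openai import OpenAI as OpenAIClient  # alias to avoid clashing with the TruLens provider

client = OpenAIClient()

def llm_mutate(prompt: str) -> str:
    # Ask a chat model for a single rephrased variant of the prompt.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice; any chat model works
        messages=[
            {
                "role": "system",
                "content": "Rephrase the user's prompt. Reply with the rephrased prompt only.",
            },
            {"role": "user", "content": prompt},
        ],
        temperature=1.0,
    )
    return response.choices[0].message.content.strip()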
5. Run evolutionary optimization¶
from trulens.apps.gepa import run_evolution
BASE_PROMPT = "What is TruLens?"
best_prompt, best_score, history = run_evolution(
    base_prompt=BASE_PROMPT,
    fitness_fn=fitness,
    mutate_fn=mutate,
    n_generations=8,    # number of evolution rounds
    population_size=5,  # candidate prompts per generation
    top_k=2,            # top-scoring candidates retained each generation
    seed=42,            # fix the RNG for reproducible runs
)
print(f"\nBest prompt : {best_prompt}")
print(f"Best score : {best_score:.3f}")
6. Visualize the improvement trajectory¶
import matplotlib.pyplot as plt
generations = list(range(1, len(history) + 1))
scores = [s for _, s in history]
plt.figure(figsize=(8, 4))
plt.plot(generations, scores, marker="o", linewidth=2)
plt.xlabel("Generation")
plt.ylabel("Best fitness score")
plt.title("GEPA Prompt Optimization — context_relevance trajectory")
plt.ylim(0, 1.05)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nGeneration-by-generation history:")
for gen, (prompt, score) in enumerate(history, 1):
    print(f" Gen {gen:2d} | score={score:.3f} | prompt='{prompt}'")
7. View results in the TruLens dashboard¶
All evaluations were logged automatically as virtual records. Launch the dashboard to explore them interactively.
records_df, _ = session.get_records_and_feedback()
print(f"Total records logged: {len(records_df)}")
records_df[["input", "output"]].head(10)
from trulens.dashboard import run_dashboard
run_dashboard(session)