Custom Feedback Functions
Feedback functions are an extensible framework for evaluating LLMs.
The primary motivations for customizing feedback functions are either to improve alignment of an existing feedback function, or to evaluate on a new axis not addressed by an out-of-the-box feedback function.
Improving feedback function alignment through customization
Feedback functions can be customized through a number of parameter changes that influence score generation. For example, you can choose to run feedback functions with or without chain-of-thought reasoning, customize the output scale, or provide "few-shot" examples to guide alignment. All of these decisions affect score generation and should be carefully tested and benchmarked.
Chain-of-thought Reasoning
Feedback functions can be run with chain-of-thought reasoning using their "with_cot_reasons" variants. Doing so provides a view into how the grading is performed and also improves alignment, because the auto-regressive nature of LLMs forces the score to follow from the reasons generated first.
from trulens.core import Metric
from trulens.core import Selector
from trulens.providers.openai import OpenAI

provider = OpenAI(model_engine="gpt-4o")

# Relevance without chain-of-thought reasoning: returns only a score.
provider.relevance(
    "What are the key considerations when starting a small business?",
    "Find a mentor who can guide you through the early stages and help you navigate common challenges.",
)

# Relevance with chain-of-thought reasoning: returns a score along with the judge's reasons.
provider.relevance_with_cot_reasons(
    "What are the key considerations when starting a small business?",
    "Find a mentor who can guide you through the early stages and help you navigate common challenges.",
)
Output space
The output space is another important variable to consider: it lets you trade off a score's accuracy against its granularity. The larger the output space, the finer the granularity but the lower the accuracy.
Output space can be modulated via the min_score_val and max_score_val keyword arguments.
The output space currently allows three selections:
- 0 or 1 (binary)
- 0 to 3 (default)
- 0 to 10
While the output you see is always on a scale from 0 to 1, changing the output space changes the scoring range the LLM judge is prompted to use. The score produced by the judge is then scaled down to the 0-1 range.
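As a rough sketch of that rescaling (this illustrates the arithmetic only and is not TruLens's internal implementation), a raw judge score is normalized as follows:

# Hypothetical normalization of a raw judge score into the 0-1 range.
def normalize(raw_score: float, min_score_val: int, max_score_val: int) -> float:
    return (raw_score - min_score_val) / (max_score_val - min_score_val)

normalize(7, min_score_val=0, max_score_val=10)  # 0.7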
For example, we can modulate the output space to 0-10.
provider.relevance(
    "What are the key considerations when starting a small business?",
    "Find a mentor who can guide you through the early stages and help you navigate common challenges.",
    min_score_val=0,
    max_score_val=10,
)
Or to binary scoring.
provider.relevance(
    "What are the key considerations when starting a small business?",
    "Find a mentor who can guide you through the early stages and help you navigate common challenges.",
    min_score_val=0,
    max_score_val=1,
)
Temperature
When using LLMs, temperature is another parameter to be mindful of. Metrics default to a temperature of 0, but in some cases it can be useful to use higher temperatures, or even to ensemble metrics run at different temperatures.
provider.relevance(
    "What are the key considerations when starting a small business?",
    "Find a mentor who can guide you through the early stages and help you navigate common challenges.",
    temperature=0.9,
)
Groundedness configurations
Groundedness has its own specific configurations that can be set with the GroundednessConfigs class.
from trulens.core.feedback import feedback

groundedness_configs = feedback.GroundednessConfigs(
    use_sent_tokenize=False, filter_trivial_statements=False
)

# Groundedness with the default configurations.
provider.groundedness_measure_with_cot_reasons(
    "The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles.",
    "Hi, your football expert here. The first superbowl was held on Jan 15, 1967",
)

# Groundedness with the custom configurations.
provider.groundedness_measure_with_cot_reasons(
    "The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles.",
    "Hi, your football expert here. The first superbowl was held on Jan 15, 1967",
    groundedness_configs=groundedness_configs,
)
Custom Criteria
To customize the LLM-judge prompting, you can override the standard criteria with your own custom criteria.
This can be useful for tailoring LLM-judge prompting to your domain and improving alignment with human evaluations.
custom_relevance_criteria = """
A relevant response should provide a clear and concise answer to the question.
"""

provider.relevance(
    "What are the key considerations when starting a small business?",
    "Find a mentor who can guide you through the early stages and help you navigate common challenges.",
    criteria=custom_relevance_criteria,
    min_score_val=0,
    max_score_val=1,
)

custom_sentiment_criteria = """
A positive sentiment should be expressed with an extremely encouraging and enthusiastic tone.
"""

provider.sentiment(
    "When you're ready to start your business, you'll be amazed at how much you can achieve!",
    criteria=custom_sentiment_criteria,
)
Additional Instructions
While custom criteria completely override a feedback function's default criteria, additional instructions let you add custom guidelines on top of the feedback function's standard criteria.
This feature is especially helpful when you want to use an out-of-the-box feedback function but need to include a few extra details, such as a description of the system you are trying to evaluate.
additional_instructions = """The system you are evaluating is a helpful, high-level business advisor dedicated to guiding small business owners and entrepreneurs.
The relevance of the response should be judged based on its usefulness and direct applicability to someone actively navigating the challenges of starting and growing a small business.
"""
provider.relevance_with_cot_reasons(
    "What are the key considerations when starting a small business?",
    "Find a mentor who can guide you through the early stages and help you navigate common challenges.",
    additional_instructions=additional_instructions,
)
Few-shot examples
You can also provide examples to customize metric scoring to your domain.
This is currently available only for the RAG triad metrics (answer relevance, context relevance, and groundedness).
from trulens.feedback.v2 import feedback

# Each example pairs the metric's inputs with a target score (here on the default 0-3 scale).
fewshot_relevance_examples_list = [
    (
        {
            "query": "What are the key considerations when starting a small business?",
            "response": "You should focus on building relationships with mentors and industry leaders. Networking can provide insights, open doors to opportunities, and help you avoid common pitfalls.",
        },
        3,
    ),
]

provider.relevance(
    "What are the key considerations when starting a small business?",
    "Find a mentor who can guide you through the early stages and help you navigate common challenges.",
    examples=fewshot_relevance_examples_list,
)
Usage Options for Customized Metrics
Metric customizations are available both directly (shown above) and through the Metric class.
Below is an example that applies these customizations when instantiating a Metric, which then runs as part of typical TruLens recording.
from trulens.providers.openai import OpenAI

provider = OpenAI(model_engine="gpt-4o")

# Question/answer relevance between overall question and answer.
f_answer_relevance = Metric(
    implementation=provider.relevance_with_cot_reasons,
    name="Answer Relevance",
    selectors={
        "prompt": Selector.select_record_input(),
        "response": Selector.select_record_output(),
    },
    examples=fewshot_relevance_examples_list,
    criteria=custom_relevance_criteria,
    min_score_val=0,
    max_score_val=1,
    temperature=0.9,
)

f_answer_relevance(
    "What are the key considerations when starting a small business?",
    "Find a mentor who can guide you through the early stages and help you navigate common challenges.",
)
from trulens.core.metric.metric import GroundednessConfigs

groundedness_configs = GroundednessConfigs(
    use_sent_tokenize=False, filter_trivial_statements=False
)

# Groundedness metric with custom configuration.
f_groundedness = Metric(
    implementation=provider.groundedness_measure_with_cot_reasons,
    name="Groundedness",
    selectors={
        "source": Selector.select_context(collect_list=True),
        "statement": Selector.select_record_output(),
    },
    examples=fewshot_relevance_examples_list,
    min_score_val=0,
    max_score_val=1,
    temperature=0.9,
    groundedness_configs=groundedness_configs,
)
Creating new custom metrics
You can add your own metrics to evaluate the qualities required by your application in two steps: create a new provider class with your metric method, then wrap that method in a Metric. If your contributions would be useful for others, we encourage you to contribute to TruLens!
Metric functions are organized by model provider into Provider classes.
The process for adding new metrics is:
- Create a new Provider class or locate an existing one that applies to your metric. If your metric does not rely on a model provider, you can create a standalone class. Add the new metric method to your selected class. Your new method can either take a single text (str) as a parameter or both prompt (str) and response (str). It should return a float between 0 (worst) and 1 (best).
from trulens.core import Metric
from trulens.core import Provider
from trulens.core import Selector


class StandAlone(Provider):
    def custom_metric(self, my_text_field: str) -> float:
        """
        A dummy function mapping text inputs to float outputs.

        Parameters:
            my_text_field (str): Text to evaluate.

        Returns:
            float: A score that decreases with the squared length of the text.
        """
        return 1.0 / (1.0 + len(my_text_field) * len(my_text_field))
- Instantiate your provider and metric. The metric function is wrapped by the Metric class, whose selectors specify what will get sent to your function's parameters (for example, Selector.select_record_input() or Selector.select_record_output()).
standalone = StandAlone()

f_custom_function = Metric(
    implementation=standalone.custom_metric,
    name="custom_metric",
    selectors={
        "my_text_field": Selector.select_record_output(),
    },
)
- Your custom metric is now ready to use just like the out-of-the-box feedback functions. Below is an example of it being used.
f_custom_function("Hello, World!")
Extending existing providers
In addition to calling your own methods, you can also extend stock feedback providers (such as OpenAI, AzureOpenAI, or Bedrock) with custom feedback implementations. This is especially useful for tweaking stock feedback functions or for running custom feedback prompts while letting TruLens handle the backend LLM provider.
This is done by subclassing the provider you wish to extend and using the generate_score method, which runs the provided prompt with your specified provider and extracts a float score from 0 to 1. Your prompt should request that the LLM respond on a scale from 0 to 10; generate_score will then normalize the result to 0-1.
See below for example usage:
from trulens.providers.openai import AzureOpenAI


class CustomAzureOpenAI(AzureOpenAI):
    def style_check_professional(self, response: str) -> float:
        """
        Custom feedback function to grade the professional style of the response, extending AzureOpenAI provider.

        Args:
            response (str): text to be graded for professional style.

        Returns:
            float: A value between 0 and 1. 0 being "not professional" and 1 being "professional".
        """
        professional_prompt = str.format(
            "Please rate the professionalism of the following text on a scale from 0 to 10, where 0 is not at all professional and 10 is extremely professional: \n\n{}",
            response,
        )
        return self.generate_score(system_prompt=professional_prompt)
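Once defined, the custom method can be wrapped in a Metric just like the stock feedback functions. A minimal sketch, assuming an Azure deployment of your own (the deployment_name below is a placeholder, and your AzureOpenAI constructor arguments may differ):

# Hypothetical wiring of the custom provider method into a Metric.
azure_provider = CustomAzureOpenAI(deployment_name="your-deployment")  # placeholder deployment

f_style_check = Metric(
    implementation=azure_provider.style_check_professional,
    name="Professional Style",
    selectors={
        "response": Selector.select_record_output(),
    },
)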
Running "chain of thought evaluations" is another use case for extending providers. Doing so follows a similar process as above, where the base provider (such as AzureOpenAI) is subclassed.
For this case, the method generate_score_and_reasons can be used to extract both the score and chain of thought reasons from the LLM response.
To use this method, the prompt used should include the COT_REASONS_TEMPLATE available from the TruLens prompts library (trulens.feedback.prompts).
See below for example usage:
from typing import Dict, Tuple

from trulens.feedback import prompts


class CustomAzureOpenAIReasoning(AzureOpenAI):
    def context_relevance_with_cot_reasons_extreme(
        self, question: str, context: str
    ) -> Tuple[float, Dict]:
        """
        Tweaked version of context relevance, extending AzureOpenAI provider.

        A function that completes a template to check the relevance of the statement to the question.
        Scoring guidelines for scores 5-8 are removed to push the LLM toward more extreme scores.
        Also uses chain-of-thought methodology and emits the reasons.

        Args:
            question (str): A question being asked.
            context (str): Context to check for relevance to the question.

        Returns:
            Tuple[float, Dict]: A value between 0 and 1 (0 being "not relevant" and 1 being "relevant"), along with a dictionary of chain-of-thought reasons.
        """
        # Remove scoring guidelines around middle scores to push the judge toward the extremes.
        system_prompt = prompts.CONTEXT_RELEVANCE_SYSTEM.replace(
            "- STATEMENT that is RELEVANT to most of the QUESTION should get a score of 5, 6, 7 or 8. Higher score indicates more RELEVANCE.\n\n",
            "",
        )

        user_prompt = str.format(
            prompts.CONTEXT_RELEVANCE_USER, question=question, context=context
        )
        user_prompt = user_prompt.replace(
            "RELEVANCE:", prompts.COT_REASONS_TEMPLATE
        )

        return self.generate_score_and_reasons(system_prompt, user_prompt)
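A quick usage sketch, again assuming your own Azure deployment (the deployment_name below is a placeholder):

# Hypothetical direct invocation of the extended provider.
azure_reasoning_provider = CustomAzureOpenAIReasoning(deployment_name="your-deployment")

score, reasons = azure_reasoning_provider.context_relevance_with_cot_reasons_extreme(
    "What are the key considerations when starting a small business?",
    "Find a mentor who can guide you through the early stages and help you navigate common challenges.",
)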