Custom Feedback Functions
Feedback functions are an extensible framework for evaluating LLMs.
The primary motivations for customizing feedback functions are either to improve alignment of an existing feedback function, or to evaluate on a new axis not addressed by an out-of-the-box feedback function.
Improving feedback function alignment through customization
Feedback functions can be customized through a number of parameter changes that influence score generation. For example, you can choose to run feedback functions with or without chain-of-thought reasoning, customize the output scale, or provide "few-shot" examples to guide alignment. All of these decisions affect score generation and should be carefully tested and benchmarked.
Chain-of-thought Reasoning
Feedback functions can be run with chain-of-thought reasoning using their "with_cot_reasons" variants. Doing so provides a view into how the grading is performed and also improves alignment, because the auto-regressive nature of LLMs forces the score to follow from the reasons generated first.
from trulens.core import Metric
from trulens.core import Selector
from trulens.providers.openai import OpenAI

provider = OpenAI(model_engine="gpt-4o")

# Relevance without chain-of-thought reasoning: returns only a score.
provider.relevance(
    "What are the key considerations when starting a small business?",
    "Find a mentor who can guide you through the early stages and help you navigate common challenges.",
)

# Relevance with chain-of-thought reasoning: returns a score along with the judge's reasons.
provider.relevance_with_cot_reasons(
    "What are the key considerations when starting a small business?",
    "Find a mentor who can guide you through the early stages and help you navigate common challenges.",
)
Output space
The output space is another important variable to consider: it lets you trade off a score's accuracy against its granularity. The larger the output space, the finer the granularity but the lower the accuracy.
Output space can be modulated via the min_score_val and max_score_val keyword arguments.
The output space currently allows three selections:
- 0 or 1 (binary)
- 0 to 3 (default)
- 0 to 10
While the output you see is always on a scale from 0 to 1, changing the output space changes the scoring range the LLM judge is prompted to use. The score produced by the judge is then scaled down to the 0-1 range.
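As a rough sketch of that rescaling (this illustrates the arithmetic only and is not TruLens's internal implementation), a raw judge score is normalized as follows:

# Hypothetical normalization of a raw judge score into the 0-1 range.
def normalize(raw_score: float, min_score_val: int, max_score_val: int) -> float:
    return (raw_score - min_score_val) / (max_score_val - min_score_val)

normalize(7, min_score_val=0, max_score_val=10)  # 0.7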
For example, we can modulate the output space to 0-10.
provider.relevance(
    "What are the key considerations when starting a small business?",
    "Find a mentor who can guide you through the early stages and help you navigate common challenges.",
    min_score_val=0,
    max_score_val=10,
)
Or to binary scoring.
provider.relevance(
    "What are the key considerations when starting a small business?",
    "Find a mentor who can guide you through the early stages and help you navigate common challenges.",
    min_score_val=0,
    max_score_val=1,
)
Temperature
When using LLMs, temperature is another parameter to be mindful of. Metrics default to a temperature of 0, but in some cases it can be useful to use higher temperatures, or even to ensemble metrics run at different temperatures.
provider.relevance(
    "What are the key considerations when starting a small business?",
    "Find a mentor who can guide you through the early stages and help you navigate common challenges.",
    temperature=0.9,
)
Groundedness configurations
Groundedness has its own specific configurations that can be set with the GroundednessConfigs class.
from trulens.core.feedback import feedback

groundedness_configs = feedback.GroundednessConfigs(
    use_sent_tokenize=False, filter_trivial_statements=False
)

# Groundedness with the default configurations.
provider.groundedness_measure_with_cot_reasons(
    "The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles.",
    "Hi, your football expert here. The first superbowl was held on Jan 15, 1967",
)

# Groundedness with the custom configurations.
provider.groundedness_measure_with_cot_reasons(
    "The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles.",
    "Hi, your football expert here. The first superbowl was held on Jan 15, 1967",
    groundedness_configs=groundedness_configs,
)
Custom Criteria
To customize the LLM-judge prompting, you can override the standard criteria with your own custom criteria.
This can be useful for tailoring LLM-judge prompting to your domain and improving alignment with human evaluations.
custom_relevance_criteria = """
A relevant response should provide a clear and concise answer to the question.
"""

provider.relevance(
    "What are the key considerations when starting a small business?",
    "Find a mentor who can guide you through the early stages and help you navigate common challenges.",
    criteria=custom_relevance_criteria,
    min_score_val=0,
    max_score_val=1,
)

custom_sentiment_criteria = """
A positive sentiment should be expressed with an extremely encouraging and enthusiastic tone.
"""

provider.sentiment(
    "When you're ready to start your business, you'll be amazed at how much you can achieve!",
    criteria=custom_sentiment_criteria,
)
Additional Instructions
While custom criteria completely override a feedback function's default criteria, additional instructions let you add custom guidelines on top of the feedback function's standard criteria.
This feature is especially helpful when you want to use an out-of-the-box feedback function but need to include a few extra details, such as a description of the system you are trying to evaluate.
additional_instructions = """The system you are evaluating is a helpful, high-level business advisor dedicated to guiding small business owners and entrepreneurs.
The relevance of the response should be judged based on its usefulness and direct applicability to someone actively navigating the challenges of starting and growing a small business.
"""
provider.relevance_with_cot_reasons(
    "What are the key considerations when starting a small business?",
    "Find a mentor who can guide you through the early stages and help you navigate common challenges.",
    additional_instructions=additional_instructions,
)
Few-shot examples
You can also provide examples to customize metric scoring to your domain.
This is currently available only for the RAG triad metrics (answer relevance, context relevance, and groundedness).
from trulens.feedback.v2 import feedback

# Each example pairs the metric's inputs with a target score (here on the default 0-3 scale).
fewshot_relevance_examples_list = [
    (
        {
            "query": "What are the key considerations when starting a small business?",
            "response": "You should focus on building relationships with mentors and industry leaders. Networking can provide insights, open doors to opportunities, and help you avoid common pitfalls.",
        },
        3,
    ),
]

provider.relevance(
    "What are the key considerations when starting a small business?",
    "Find a mentor who can guide you through the early stages and help you navigate common challenges.",
    examples=fewshot_relevance_examples_list,
)
Usage Options for Customized Metrics
Metric customizations are available both directly (shown above) and through the Metric class.
Below is an example that applies these customizations when instantiating a Metric, which then runs as part of typical TruLens recording.
from trulens.providers.openai import OpenAI

provider = OpenAI(model_engine="gpt-4o")

# Question/answer relevance between overall question and answer.
f_answer_relevance = Metric(
    implementation=provider.relevance_with_cot_reasons,
    name="Answer Relevance",
    selectors={
        "prompt": Selector.select_record_input(),
        "response": Selector.select_record_output(),
    },
    examples=fewshot_relevance_examples_list,
    criteria=custom_relevance_criteria,
    min_score_val=0,
    max_score_val=1,
    temperature=0.9,
)

f_answer_relevance(
    "What are the key considerations when starting a small business?",
    "Find a mentor who can guide you through the early stages and help you navigate common challenges.",
)
from trulens.core.metric.metric import GroundednessConfigs

groundedness_configs = GroundednessConfigs(
    use_sent_tokenize=False, filter_trivial_statements=False
)

# Groundedness metric with custom configuration.
f_groundedness = Metric(
    implementation=provider.groundedness_measure_with_cot_reasons,
    name="Groundedness",
    selectors={
        "source": Selector.select_context(collect_list=True),
        "statement": Selector.select_record_output(),
    },
    examples=fewshot_relevance_examples_list,
    min_score_val=0,
    max_score_val=1,
    temperature=0.9,
    groundedness_configs=groundedness_configs,
)
Creating new custom metrics
You can add your own metrics to evaluate the qualities required by your application in two steps: create a new provider class with your metric method, then wrap that method in a Metric. If your contributions would be useful for others, we encourage you to contribute to TruLens!
Metric functions are organized by model provider into Provider classes.
The process for adding new metrics is:
- Create a new Provider class or locate an existing one that applies to your metric. If your metric does not rely on a model provider, you can create a standalone class. Add the new metric method to your selected class. Your new method can either take a single text (str) as a parameter or both prompt (str) and response (str). It should return a float between 0 (worst) and 1 (best).
from trulens.core import Metric
from trulens.core import Provider
from trulens.core import Selector


class StandAlone(Provider):
    def custom_metric(self, my_text_field: str) -> float:
        """
        A dummy function mapping text inputs to float outputs.

        Parameters:
            my_text_field (str): Text to evaluate.

        Returns:
            float: A score that decreases with the squared length of the text.
        """
        return 1.0 / (1.0 + len(my_text_field) * len(my_text_field))
- Instantiate your provider and metric. The metric function is wrapped by the Metric class, whose selectors specify what will get sent to your function's parameters (for example, Selector.select_record_input() or Selector.select_record_output()).
standalone = StandAlone()

f_custom_function = Metric(
    implementation=standalone.custom_metric,
    name="custom_metric",
    selectors={
        "my_text_field": Selector.select_record_output(),
    },
)
- Your custom metric is now ready to use just like the out-of-the-box feedback functions. Below is an example of it being used.
f_custom_function("Hello, World!")
Extending existing providers
In addition to calling your own methods, you can also extend stock feedback providers (such as OpenAI, AzureOpenAI, or Bedrock) with custom feedback implementations. This is especially useful for tweaking stock feedback functions or for running custom feedback prompts while letting TruLens handle the backend LLM provider.
This is done by subclassing the provider you wish to extend and using the generate_score method, which runs the provided prompt with your specified provider and extracts a float score from 0 to 1. Your prompt should request that the LLM respond on a scale from 0 to 10; generate_score will then normalize the result to 0-1.
See below for example usage:
from trulens.providers.openai import AzureOpenAI


class CustomAzureOpenAI(AzureOpenAI):
    def style_check_professional(self, response: str) -> float:
        """
        Custom feedback function to grade the professional style of the response, extending AzureOpenAI provider.

        Args:
            response (str): text to be graded for professional style.

        Returns:
            float: A value between 0 and 1. 0 being "not professional" and 1 being "professional".
        """
        professional_prompt = str.format(
            "Please rate the professionalism of the following text on a scale from 0 to 10, where 0 is not at all professional and 10 is extremely professional: \n\n{}",
            response,
        )
        return self.generate_score(system_prompt=professional_prompt)
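Once defined, the custom method can be wrapped in a Metric just like the stock feedback functions. A minimal sketch, assuming an Azure deployment of your own (the deployment_name below is a placeholder, and your AzureOpenAI constructor arguments may differ):

# Hypothetical wiring of the custom provider method into a Metric.
azure_provider = CustomAzureOpenAI(deployment_name="your-deployment")  # placeholder deployment

f_style_check = Metric(
    implementation=azure_provider.style_check_professional,
    name="Professional Style",
    selectors={
        "response": Selector.select_record_output(),
    },
)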
Running "chain of thought evaluations" is another use case for extending providers. Doing so follows a similar process as above, where the base provider (such as AzureOpenAI) is subclassed.
For this case, the method generate_score_and_reasons can be used to extract both the score and chain of thought reasons from the LLM response.
To use this method, the prompt used should include the COT_REASONS_TEMPLATE available from the TruLens prompts library (trulens.feedback.prompts).
See below for example usage:
from typing import Dict, Tuple

from trulens.feedback import prompts


class CustomAzureOpenAIReasoning(AzureOpenAI):
    def context_relevance_with_cot_reasons_extreme(
        self, question: str, context: str
    ) -> Tuple[float, Dict]:
        """
        Tweaked version of context relevance, extending AzureOpenAI provider.

        A function that completes a template to check the relevance of the statement to the question.
        Scoring guidelines for scores 5-8 are removed to push the LLM toward more extreme scores.
        Also uses chain-of-thought methodology and emits the reasons.

        Args:
            question (str): A question being asked.
            context (str): Context to check for relevance to the question.

        Returns:
            Tuple[float, Dict]: A value between 0 and 1 (0 being "not relevant" and 1 being "relevant"), along with a dictionary of chain-of-thought reasons.
        """
        # Remove scoring guidelines around middle scores to push the judge toward the extremes.
        system_prompt = prompts.CONTEXT_RELEVANCE_SYSTEM.replace(
            "- STATEMENT that is RELEVANT to most of the QUESTION should get a score of 5, 6, 7 or 8. Higher score indicates more RELEVANCE.\n\n",
            "",
        )

        user_prompt = str.format(
            prompts.CONTEXT_RELEVANCE_USER, question=question, context=context
        )
        user_prompt = user_prompt.replace(
            "RELEVANCE:", prompts.COT_REASONS_TEMPLATE
        )

        return self.generate_score_and_reasons(system_prompt, user_prompt)
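A quick usage sketch, again assuming your own Azure deployment (the deployment_name below is a placeholder):

# Hypothetical direct invocation of the extended provider.
azure_reasoning_provider = CustomAzureOpenAIReasoning(deployment_name="your-deployment")

score, reasons = azure_reasoning_provider.context_relevance_with_cot_reasons_extreme(
    "What are the key considerations when starting a small business?",
    "Find a mentor who can guide you through the early stages and help you navigate common challenges.",
)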