📓 Custom Feedback Functions

Feedback functions are an extensible framework for evaluating LLMs. You can add your own feedback functions to evaluate the qualities required by your application by updating trulens_eval/feedback.py, or simply creating a new provider class and feedback function in your notebook. If your contributions would be useful for others, we encourage you to contribute to TruLens!

Feedback functions are organized by model provider into Provider classes.

The process for adding new feedback functions is:

Create a new Provider class or locate an existing one that applies to your feedback function. If your feedback function does not rely on a model provider, you can create a standalone class. Add the new feedback function method to your selected class. Your new method can either take a single text (str) as a parameter or both prompt (str) and response (str). It should return a float between 0 (worst) and 1 (best).

from trulens_eval import Provider, Feedback, Select, Tru

class StandAlone(Provider):
    def custom_feedback(self, my_text_field: str) -> float:
        """
        A dummy function of text inputs to float outputs.

        Parameters:
            my_text_field (str): Text to evaluate.

        Returns:
            float: 1 / (1 + n^2), where n is the length of the text, so shorter text scores closer to 1
        """
        return 1.0 / (1.0 + len(my_text_field) * len(my_text_field))
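
As a quick sanity check on the return-value contract, the scoring expression can be exercised outside TruLens. The sketch below copies the formula into a plain function (the name `inverse_square_length_score` is made up for illustration and is not part of trulens_eval) and confirms the result stays between 0 and 1 for a few sample strings.

```python
def inverse_square_length_score(text: str) -> float:
    """Same formula as StandAlone.custom_feedback, written as a plain function."""
    return 1.0 / (1.0 + len(text) * len(text))


samples = ["", "hi", "a longer piece of text"]
scores = {s: inverse_square_length_score(s) for s in samples}

for text, score in scores.items():
    # Every score must fall in the range a feedback function is expected to return.
    assert 0.0 <= score <= 1.0
    print(f"{len(text):>2} chars -> {score:.4f}")
```

With this formula, shorter text scores higher, and the empty string scores exactly 1.0.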

Instantiate your provider and feedback functions. The feedback function is wrapped by the trulens-eval Feedback class, which specifies what gets sent to your function's parameters (for example, Select.RecordInput or Select.RecordOutput).

standalone = StandAlone()
f_custom_function = Feedback(standalone.custom_feedback).on(
    my_text_field=Select.RecordOutput
)
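
Conceptually, the `.on(...)` call pairs each function parameter with a selector that pulls a value out of the record. The sketch below imitates that idea with plain dictionaries and lambdas. It is only a mental model, not how TruLens is implemented: `record`, `select_record_output`, and `run_with_selectors` are all hypothetical names invented for this illustration.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a record: just the app's input and output text.
record = {"input": "What is TruLens?", "output": "A library for evaluating LLM apps."}

# Hypothetical selectors: each one pulls a single field out of the record.
select_record_input: Callable[[dict], str] = lambda r: r["input"]
select_record_output: Callable[[dict], str] = lambda r: r["output"]


def custom_feedback(my_text_field: str) -> float:
    """Same dummy scoring formula as the StandAlone provider above."""
    return 1.0 / (1.0 + len(my_text_field) * len(my_text_field))


def run_with_selectors(
    fn: Callable[..., float],
    selectors: Dict[str, Callable[[dict], str]],
    rec: dict,
) -> float:
    """Resolve each selector against the record, then call fn with keyword arguments."""
    kwargs = {name: selector(rec) for name, selector in selectors.items()}
    return fn(**kwargs)


# Route the record's output text into the my_text_field parameter.
score = run_with_selectors(custom_feedback, {"my_text_field": select_record_output}, record)
print(score)
```

The keyword name passed to `.on(...)` has to match the parameter name of the feedback method, which is why the dictionary key here is `my_text_field`.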

Your feedback function is now ready to use just like the out-of-the-box feedback functions. Below is an example of it being used; it assumes record is a record already produced by running your instrumented app.

tru = Tru()
feedback_results = tru.run_feedback_functions(
    record=record,
    feedback_functions=[f_custom_function]
)
tru.add_feedbacks(feedback_results)

Extending existing providers.

In addition to calling your own methods, you can also extend stock feedback providers (such as OpenAI, AzureOpenAI, Bedrock) with custom feedback implementations. This can be especially useful for tweaking stock feedback functions, or for running custom feedback function prompts while letting TruLens handle the backend LLM provider.

This is done by subclassing the provider you wish to extend and using the generate_score method, which runs the provided prompt with your specified provider and extracts a float score between 0 and 1. Your prompt should ask the LLM to respond on a scale from 0 to 10; the generate_score method then normalizes the result to 0-1.

See below for example usage:

from trulens_eval.feedback.provider import AzureOpenAI
from trulens_eval.utils.generated import re_0_10_rating

class Custom_AzureOpenAI(AzureOpenAI):
    def style_check_professional(self, response: str) -> float:
        """
        Custom feedback function to grade the professional style of the response, extending AzureOpenAI provider.

        Args:
            response (str): text to be graded for professional style.

        Returns:
            float: A value between 0 and 1. 0 being "not professional" and 1 being "professional".
        """
        professional_prompt = str.format("Please rate the professionalism of the following text on a scale from 0 to 10, where 0 is not at all professional and 10 is extremely professional: \n\n{}", response)
        return self.generate_score(system_prompt=professional_prompt)
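
To make the 0-10 to 0-1 normalization concrete, here is a minimal sketch of how a rating could be pulled from an LLM reply and rescaled. This is not the TruLens implementation: `normalize_0_10_rating` is a hypothetical helper, and the regular expression is an assumption about what the reply looks like.

```python
import re


def normalize_0_10_rating(llm_text: str) -> float:
    """Find the first standalone integer from 0 to 10 in the text and scale it to 0-1."""
    # "10" is tried before a single digit so that "10" is not read as "1".
    match = re.search(r"\b(10|[0-9])\b", llm_text)
    if match is None:
        raise ValueError(f"No 0-10 rating found in: {llm_text!r}")
    return int(match.group(1)) / 10.0


print(normalize_0_10_rating("Rating: 8"))
print(normalize_0_10_rating("10"))
```

Raising an error when no rating is found is a design choice for this sketch; it keeps a malformed reply from silently turning into a score.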

Running "chain of thought evaluations" is another use case for extending providers. Doing so follows a similar process as above, where the base provider (such as AzureOpenAI) is subclassed.

For this case, the method generate_score_and_reasons can be used to extract both the score and chain of thought reasons from the LLM response.

To use this method, the prompt used should include the COT_REASONS_TEMPLATE available from the TruLens prompts library (trulens_eval.feedback.prompts).

See below for example usage:

from typing import Tuple, Dict
from trulens_eval.feedback import prompts

class Custom_AzureOpenAI(AzureOpenAI):
    def context_relevance_with_cot_reasons_extreme(self, question: str, context: str) -> Tuple[float, Dict]:
        """
        Tweaked version of context relevance, extending AzureOpenAI provider.
        A function that completes a template to check the relevance of the statement to the question.
        Scoring guidelines for scores 5-8 are removed to push the LLM to more extreme scores.
        Also uses chain of thought methodology and emits the reasons.

        Args:
            question (str): A question being asked. 
            context (str): A statement to the question.

        Returns:
            Tuple[float, Dict]: A score between 0 and 1 (0 being "not relevant", 1 being "relevant") and a dictionary of the chain of thought reasons.
        """

        # remove scoring guidelines around middle scores
        system_prompt = prompts.CONTEXT_RELEVANCE_SYSTEM.replace(
            "- STATEMENT that is RELEVANT to most of the QUESTION should get a score of 5, 6, 7 or 8. Higher score indicates more RELEVANCE.\n\n",
            "",
        )
        
        user_prompt = str.format(prompts.CONTEXT_RELEVANCE_USER, question = question, context = context)
        user_prompt = user_prompt.replace(
            "RELEVANCE:", prompts.COT_REASONS_TEMPLATE
        )

        return self.generate_score_and_reasons(system_prompt, user_prompt)
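
One caveat with the `str.replace` approach above: if the text being removed does not match the stock prompt exactly (for example, because the wording differs between library versions), `replace` returns the prompt unchanged and raises no error. The sketch below shows one way to guard against that. `STOCK_PROMPT` is a made-up stand-in for a real prompt string, and `remove_guideline` is a hypothetical helper, not part of trulens_eval.

```python
# Hypothetical stock prompt, standing in for a real system prompt string.
STOCK_PROMPT = (
    "Rate the RELEVANCE of the STATEMENT to the QUESTION from 0 to 10.\n"
    "- Middling statements should get a score of 5, 6, 7 or 8.\n\n"
    "Never elaborate."
)


def remove_guideline(prompt: str, guideline: str) -> str:
    """Remove a guideline from a prompt, failing loudly if it is not present."""
    if guideline not in prompt:
        raise ValueError("Guideline not found; the stock prompt may have changed.")
    return prompt.replace(guideline, "")


tweaked = remove_guideline(
    STOCK_PROMPT,
    "- Middling statements should get a score of 5, 6, 7 or 8.\n\n",
)
print(tweaked)
```

The same check could be applied before swapping in a chain of thought template, so that a changed prompt is noticed when the class is used rather than showing up later as unexpected scores.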

Multi-Output Feedback functions

TruLens also supports multi-output feedback functions. Whereas a typical feedback function outputs a single float between 0 and 1, a multi-output feedback function should output a dictionary mapping each output_key to a float between 0 and 1. The feedbacks table will display each output in a column named feedback_name:::output_key.

multi_output_feedback = Feedback(lambda input_param: {'output_key1': 0.1, 'output_key2': 0.9}, name="multi").on(
    input_param=Select.RecordOutput
)
feedback_results = tru.run_feedback_functions(
    record=record,
    feedback_functions=[multi_output_feedback]
)
tru.add_feedbacks(feedback_results)
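
The column naming described above can be pictured with a small sketch. `flatten_multi_output` is a hypothetical helper written only to illustrate the `feedback_name:::output_key` pattern; it is not a TruLens function.

```python
from typing import Dict


def flatten_multi_output(feedback_name: str, result: Dict[str, float]) -> Dict[str, float]:
    """Build one column name per output key, following the feedback_name:::output_key pattern."""
    return {f"{feedback_name}:::{key}": value for key, value in result.items()}


columns = flatten_multi_output("multi", {"output_key1": 0.1, "output_key2": 0.9})
print(columns)
```

For the feedback named "multi" above, this gives the columns `multi:::output_key1` and `multi:::output_key2`.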

# Aggregators will run on the same dict keys.
import numpy as np
multi_output_feedback = Feedback(lambda input_param: {'output_key1': 0.1, 'output_key2': 0.9}, name="multi-agg").on(
    input_param=Select.RecordOutput
).aggregate(np.mean)
feedback_results = tru.run_feedback_functions(
    record=record,
    feedback_functions=[multi_output_feedback]
)
tru.add_feedbacks(feedback_results)
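
To illustrate what running an aggregator "on the same dict keys" means, the sketch below applies a mean to each key across several multi-output results. It is an assumption-laden illustration rather than TruLens internals: `aggregate_per_key` is a hypothetical helper, and the standard library's `statistics.mean` is swapped in for `np.mean` so the snippet has no third-party dependencies.

```python
from statistics import mean
from typing import Callable, Dict, Iterable, List


def aggregate_per_key(
    results: List[Dict[str, float]],
    aggregator: Callable[[Iterable[float]], float],
) -> Dict[str, float]:
    """Apply the aggregator separately to the values collected under each key."""
    keys = results[0].keys()
    return {key: aggregator([r[key] for r in results]) for key in keys}


# Two hypothetical multi-output results, e.g. one per context chunk.
chunk_results = [
    {"output_key1": 0.1, "output_key2": 0.9},
    {"output_key1": 0.3, "output_key2": 0.5},
]
aggregated = aggregate_per_key(chunk_results, mean)
print(aggregated)
```

Each key keeps its own aggregate, so the values under `output_key1` never mix with those under `output_key2`.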

# For multi-context chunking, an aggregator can operate on a list of multi output dictionaries.
def dict_aggregator(list_dict_input):
    agg = 0
    for dict_input in list_dict_input:
        agg += dict_input['output_key1']
    return agg
multi_output_feedback = Feedback(lambda input_param: {'output_key1': 0.1, 'output_key2': 0.9}, name="multi-agg-dict").on(
    input_param=Select.RecordOutput
).aggregate(dict_aggregator)
feedback_results = tru.run_feedback_functions(
    record=record,
    feedback_functions=[multi_output_feedback]
)
tru.add_feedbacks(feedback_results)
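
Because `dict_aggregator` sums `output_key1` across the list, its result can exceed 1 when there are several chunks. The sketch below runs it on a hypothetical list of three per-chunk results and compares it with an averaging variant (`dict_mean_aggregator`, a name introduced here for illustration) that stays between 0 and 1.

```python
def dict_aggregator(list_dict_input):
    """Sum output_key1 across all multi-output dictionaries (same logic as above)."""
    agg = 0
    for dict_input in list_dict_input:
        agg += dict_input['output_key1']
    return agg


def dict_mean_aggregator(list_dict_input):
    """Hypothetical variant: average output_key1 so the result stays between 0 and 1."""
    values = [dict_input['output_key1'] for dict_input in list_dict_input]
    return sum(values) / len(values)


# Hypothetical results for three context chunks.
per_chunk = [{'output_key1': 0.6, 'output_key2': 0.9}] * 3

total = dict_aggregator(per_chunk)
average = dict_mean_aggregator(per_chunk)
print(total, average)
```

Whether a sum or a mean is appropriate depends on how the aggregated value will be read; a mean keeps the result on the same 0 to 1 scale as the individual scores.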