trulens.feedback.llm_provider¶

Classes¶

LLMProvider ¶

Bases: Provider

An LLM-based provider.

This is an abstract class; instantiate it through one of the following providers:

OpenAI and subclass AzureOpenAI.
Bedrock.
LiteLLM. LiteLLM provides an interface to a wide range of models.
LangChain.

Attributes¶

tru_class_info `instance-attribute` ¶

tru_class_info: Class

Class information of this pydantic object for use in deserialization.

This unusual key is used so that it does not pollute the attribute names of whatever class this is mixed into. Should be the same as CLASS_INFO.

endpoint `class-attribute` `instance-attribute` ¶

endpoint: Optional[Endpoint] = None

Endpoint supporting this provider.

Remote API invocations are handled by the endpoint.

Functions¶

generate_score ¶

generate_score(
    system_prompt: str,
    user_prompt: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 10,
    temperature: float = 0.0,
) -> float

Base method to generate a score normalized to 0 to 1, used for evaluation.

PARAMETER	DESCRIPTION
`system_prompt`	A pre-formatted system prompt. TYPE: `str`
`user_prompt`	An optional user prompt. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value. TYPE: `int` DEFAULT: `10`
`temperature`	The temperature for the LLM response. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`float`	The normalized score on a 0-1 scale.
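One way to picture the normalization is as a min-max rescaling of the raw score returned by the LLM. The sketch below is an assumption about that arithmetic, not the library's actual implementation; `normalize_score` is a hypothetical helper that does not exist in TruLens.

```python
def normalize_score(
    raw_score: float, min_score_val: int = 0, max_score_val: int = 10
) -> float:
    """Hypothetical helper: rescale raw_score from [min, max] to [0, 1]."""
    if max_score_val <= min_score_val:
        raise ValueError("max_score_val must be greater than min_score_val")
    return (raw_score - min_score_val) / (max_score_val - min_score_val)


print(normalize_score(7))        # 0.7 on the default 0-10 scale
print(normalize_score(2, 0, 3))  # roughly 0.667 on a 0-3 scale
```

Under this assumption, a raw score equal to `min_score_val` maps to 0.0 and one equal to `max_score_val` maps to 1.0.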

generate_score_and_reasons ¶

generate_score_and_reasons(
    system_prompt: str,
    user_prompt: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 10,
    temperature: float = 0.0,
) -> Tuple[float, Dict]

Base method to generate a score and reason, used for evaluation.

PARAMETER	DESCRIPTION
`system_prompt`	A pre-formatted system prompt. TYPE: `str`
`user_prompt`	An optional user prompt. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value. TYPE: `int` DEFAULT: `10`
`temperature`	The temperature for the LLM response. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing the normalized score on a 0-1 scale and a dictionary of reason metadata.
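Callers typically unpack the returned tuple into the score and the reasons dictionary. The sketch below uses a hypothetical stand-in function that only mimics the documented return shape; the `"reason"` key is an assumption made for illustration, not a documented key.

```python
from typing import Dict, Tuple


def fake_generate_score_and_reasons(system_prompt: str) -> Tuple[float, Dict]:
    """Hypothetical stand-in that mimics the documented return shape."""
    # The dictionary key "reason" is an assumption made for illustration.
    return 0.8, {"reason": "The answer addresses the question directly."}


# Unpack the normalized score and the reasons metadata separately.
score, reasons = fake_generate_score_and_reasons("Rate the answer from 0 to 10.")
print(score, reasons["reason"])
```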

context_relevance ¶

context_relevance(
    question: str,
    context: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    examples: Optional[List[str]] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> float

Uses chat completion model. A function that completes a template to check the relevance of the context to the question.

Example

import numpy as np
from trulens.core import Feedback
from trulens.apps.langchain import TruChain
context = TruChain.select_context(rag_app)
feedback = (
    Feedback(provider.context_relevance,
        criteria=criteria,
        additional_instructions=additional_instructions,
        examples=examples)
    .on_input()
    .on(context)
    .aggregate(np.mean)
    )

PARAMETER	DESCRIPTION
`question`	A question being asked. TYPE: `str`
`context`	Context related to the question. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`examples`	Optional few-shot examples to guide the evaluation. Defaults to None. TYPE: `Optional[List[str]]` DEFAULT: `None`
`min_score_val`	The minimum score value. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`float`	A value between 0.0 (not relevant) and 1.0 (relevant). TYPE: `float`
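The `.aggregate(np.mean)` step in the example above reduces the per-context scores to a single number. A minimal sketch of that reduction, using made-up scores and `statistics.fmean` as a standard-library stand-in for `np.mean`:

```python
from statistics import fmean

# Hypothetical scores, one per retrieved context chunk, already on a 0-1 scale.
per_context_scores = [1.0, 0.0, 1.0, 0.0]

# fmean stands in for np.mean so the sketch needs only the standard library.
overall = fmean(per_context_scores)
print(overall)  # 0.5
```

Other aggregations, such as taking the minimum or maximum, would answer a different question about the retrieved contexts.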

context_relevance_with_cot_reasons ¶

context_relevance_with_cot_reasons(
    question: str,
    context: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    examples: Optional[List[str]] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> Tuple[float, Dict]

Uses chat completion model. A function that completes a template to check the relevance of the context to the question. Also uses chain of thought methodology and emits the reasons.

Example

import numpy as np
from trulens.core import Feedback
from trulens.apps.langchain import TruChain
context = TruChain.select_context(rag_app)
feedback = (
    Feedback(provider.context_relevance_with_cot_reasons,
        criteria=criteria,
        additional_instructions=additional_instructions,
        examples=examples)
    .on_input()
    .on(context)
    .aggregate(np.mean)
    )

PARAMETER	DESCRIPTION
`question`	A question being asked. TYPE: `str`
`context`	Context related to the question. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`examples`	Optional few-shot examples to guide the evaluation. Defaults to None. TYPE: `Optional[List[str]]` DEFAULT: `None`
`min_score_val`	The minimum score value. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (not relevant) and 1.0 (relevant) and a dictionary containing the reasons for the evaluation.

relevance ¶

relevance(
    prompt: str,
    response: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    examples: Optional[List[str]] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> float

Uses chat completion model. A function that completes a template to check the relevance of the response to a prompt.

Example

feedback = (
    Feedback(provider.relevance,
        criteria=criteria,
        additional_instructions=additional_instructions,
        examples=examples)
    .on_input_output()
    )

Usage on RAG Contexts

feedback = (
    Feedback(provider.relevance,
        criteria=criteria,
        additional_instructions=additional_instructions,
        examples=examples)
    .on_input()
    .on(TruLlama.select_source_nodes().node.text) # See note below
    .aggregate(np.mean)
    )

PARAMETER	DESCRIPTION
`prompt`	A text prompt to an agent. TYPE: `str`
`response`	The agent's response to the prompt. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`examples`	Optional few-shot examples to guide the evaluation. Defaults to None. TYPE: `Optional[List[str]]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`float`	A value between 0 and 1. 0 being "not relevant" and 1 being "relevant". TYPE: `float`

relevance_with_cot_reasons ¶

relevance_with_cot_reasons(
    prompt: str,
    response: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    examples: Optional[List[str]] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> Tuple[float, Dict]

Uses chat completion model. A function that completes a template to check the relevance of the response to a prompt. Also uses chain of thought methodology and emits the reasons.

Example

feedback = (
    Feedback(provider.relevance_with_cot_reasons,
        criteria=criteria,
        additional_instructions=additional_instructions,
        examples=examples)
    .on_input()
    .on_output()
    )

PARAMETER	DESCRIPTION
`prompt`	A text prompt to an agent. TYPE: `str`
`response`	The agent's response to the prompt. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`examples`	Optional few-shot examples to guide the evaluation. Defaults to None. TYPE: `Optional[List[str]]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (not relevant) and 1.0 (relevant) and a dictionary containing the reasons for the evaluation.

sentiment ¶

sentiment(
    text: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    examples: Optional[List[str]] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> float

Uses chat completion model. A function that completes a template to check the sentiment of some text.

Example

feedback = (
    Feedback(provider.sentiment,
        criteria=criteria,
        additional_instructions=additional_instructions,
        examples=examples)
    .on_output()
    )

PARAMETER	DESCRIPTION
`text`	The text to evaluate sentiment of. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. TYPE: `Optional[str]` DEFAULT: `None`
`examples`	Optional few-shot examples to guide the evaluation. Defaults to None. TYPE: `Optional[List[str]]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`float`	A value between 0 and 1. 0 being "negative sentiment" and 1 being "positive sentiment". TYPE: `float`

sentiment_with_cot_reasons ¶

sentiment_with_cot_reasons(
    text: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    examples: Optional[List[str]] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> Tuple[float, Dict]

Uses chat completion model. A function that completes a template to check the sentiment of some text. Also uses chain of thought methodology and emits the reasons.

Example

feedback = (
    Feedback(provider.sentiment_with_cot_reasons,
        criteria=criteria,
        additional_instructions=additional_instructions,
        examples=examples)
    .on_output()
    )

PARAMETER	DESCRIPTION
`text`	Text to evaluate. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. TYPE: `Optional[str]` DEFAULT: `None`
`examples`	Optional few-shot examples to guide the evaluation. Defaults to None. TYPE: `Optional[List[str]]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (negative sentiment) and 1.0 (positive sentiment) and a dictionary containing the reasons for the evaluation.

model_agreement ¶

model_agreement(prompt: str, response: str) -> float

Uses chat completion model. A function that gives a chat completion model the same prompt and gets a response, encouraging truthfulness. A second template is then given to the model, with a prompt stating that the original response is correct, and the function measures whether the previous chat completion response is similar.

Example

feedback = Feedback(provider.model_agreement).on_input_output()

PARAMETER	DESCRIPTION
`prompt`	A text prompt to an agent. TYPE: `str`
`response`	The agent's response to the prompt. TYPE: `str`

RETURNS	DESCRIPTION
`float`	A value between 0.0 (not in agreement) and 1.0 (in agreement). TYPE: `float`
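The two-step flow described above can be outlined roughly as follows. This is a loose sketch under stated assumptions, not the library's code: `model_agreement_sketch`, the `chat` callable, the prompt wording, and the 0-10 rating scale are all hypothetical.

```python
from typing import Callable


def model_agreement_sketch(
    prompt: str, response: str, chat: Callable[[str], str]
) -> float:
    """Hypothetical outline of an agreement check; not the library's code."""
    # Step 1: give the judge model the same prompt to get its own answer.
    reference = chat(prompt)
    # Step 2: ask the model to rate, from 0 to 10, how well the original
    # response agrees with that answer. The prompt wording is invented here.
    rating = chat(
        f"Reference answer: {reference}\n"
        f"Candidate answer: {response}\n"
        "Rate their agreement from 0 to 10. Reply with a number only."
    )
    # Step 3: rescale the 0-10 rating to the documented 0.0-1.0 range.
    return float(rating) / 10.0


def fake_chat(message: str) -> str:
    """Canned stand-in for a chat completion model."""
    return "8" if message.startswith("Reference answer:") else "Paris"


print(model_agreement_sketch("Capital of France?", "Paris", fake_chat))  # 0.8
```

A real judge model may return text that is not a bare number, so production code would need more defensive parsing than `float(rating)`.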

conciseness ¶

conciseness(
    text: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> float

Uses chat completion model. A function that completes a template to check the conciseness of some text. Prompt credit to LangChain Eval.

Example

feedback = (
    Feedback(provider.conciseness,
        criteria=criteria,
        additional_instructions=additional_instructions)
    .on_output()
    )

PARAMETER	DESCRIPTION
`text`	The text to evaluate the conciseness of. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`float`	A value between 0.0 (not concise) and 1.0 (concise). TYPE: `float`

conciseness_with_cot_reasons ¶

conciseness_with_cot_reasons(
    text: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> Tuple[float, Dict]

Uses chat completion model. A function that completes a template to check the conciseness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = (
    Feedback(provider.conciseness_with_cot_reasons,
        criteria=criteria,
        additional_instructions=additional_instructions)
    .on_output()
    )

PARAMETER	DESCRIPTION
`text`	The text to evaluate the conciseness of. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (not concise) and 1.0 (concise) and a dictionary containing the reasons for the evaluation.

correctness ¶

correctness(
    text: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> float

Uses chat completion model. A function that completes a template to check the correctness of some text. Prompt credit to LangChain Eval.

Example

feedback = (
    Feedback(provider.correctness,
        criteria=criteria,
        additional_instructions=additional_instructions)
    .on_output()
    )

PARAMETER	DESCRIPTION
`text`	The text to evaluate the correctness of. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`float`	A value between 0.0 (not correct) and 1.0 (correct). TYPE: `float`

correctness_with_cot_reasons ¶

correctness_with_cot_reasons(
    text: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> Tuple[float, Dict]

Uses chat completion model. A function that completes a template to check the correctness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = (
    Feedback(provider.correctness_with_cot_reasons,
        criteria=criteria,
        additional_instructions=additional_instructions)
    .on_output()
    )

PARAMETER	DESCRIPTION
`text`	Text to evaluate. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (not correct) and 1.0 (correct) and a dictionary containing the reasons for the evaluation.

coherence ¶

coherence(
    text: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> float

Uses chat completion model. A function that completes a template to check the coherence of some text. Prompt credit to LangChain Eval.

Example

feedback = (
    Feedback(provider.coherence,
        criteria=criteria,
        additional_instructions=additional_instructions)
    .on_output()
    )

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`float`	A value between 0.0 (not coherent) and 1.0 (coherent). TYPE: `float`

coherence_with_cot_reasons ¶

coherence_with_cot_reasons(
    text: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> Tuple[float, Dict]

Uses chat completion model. A function that completes a template to check the coherence of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = (
    Feedback(provider.coherence_with_cot_reasons,
        criteria=criteria,
        additional_instructions=additional_instructions)
    .on_output()
    )

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (not coherent) and 1.0 (coherent) and a dictionary containing the reasons for the evaluation.

harmfulness ¶

harmfulness(
    text: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> float

Uses chat completion model. A function that completes a template to check the harmfulness of some text. Prompt credit to LangChain Eval.

Example

feedback = (
    Feedback(provider.harmfulness,
        criteria=criteria,
        additional_instructions=additional_instructions)
    .on_output()
    )

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`float`	A value between 0.0 (not harmful) and 1.0 (harmful). TYPE: `float`

harmfulness_with_cot_reasons ¶

harmfulness_with_cot_reasons(
    text: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> Tuple[float, Dict]

Uses chat completion model. A function that completes a template to check the harmfulness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = (
    Feedback(provider.harmfulness_with_cot_reasons,
        criteria=criteria,
        additional_instructions=additional_instructions)
    .on_output()
    )

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (not harmful) and 1.0 (harmful) and a dictionary containing the reasons for the evaluation.

maliciousness ¶

maliciousness(
    text: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> float

Uses chat completion model. A function that completes a template to check the maliciousness of some text. Prompt credit to LangChain Eval.

Example

feedback = (
    Feedback(provider.maliciousness,
        criteria=criteria,
        additional_instructions=additional_instructions)
    .on_output()
    )

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`float`	A value between 0.0 (not malicious) and 1.0 (malicious). TYPE: `float`

maliciousness_with_cot_reasons ¶

maliciousness_with_cot_reasons(
    text: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> Tuple[float, Dict]

Uses chat completion model. A function that completes a template to check the maliciousness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = (
    Feedback(provider.maliciousness_with_cot_reasons,
        criteria=criteria,
        additional_instructions=additional_instructions)
    .on_output()
    )

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (not malicious) and 1.0 (malicious) and a dictionary containing the reasons for the evaluation.

helpfulness ¶

helpfulness(
    text: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> float

Uses chat completion model. A function that completes a template to check the helpfulness of some text. Prompt credit to LangChain Eval.

Example

feedback = (
    Feedback(provider.helpfulness,
        criteria=criteria,
        additional_instructions=additional_instructions)
    .on_output()
    )

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`float`	A value between 0.0 (not helpful) and 1.0 (helpful). TYPE: `float`

helpfulness_with_cot_reasons ¶

helpfulness_with_cot_reasons(
    text: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> Tuple[float, Dict]

Uses chat completion model. A function that completes a template to check the helpfulness of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = (
    Feedback(provider.helpfulness_with_cot_reasons,
        criteria=criteria,
        additional_instructions=additional_instructions)
    .on_output()
    )

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (not helpful) and 1.0 (helpful) and a dictionary containing the reasons for the evaluation.

controversiality ¶

controversiality(
    text: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> float

Uses chat completion model. A function that completes a template to check the controversiality of some text. Prompt credit to LangChain Eval.

Example

feedback = (
    Feedback(provider.controversiality,
        criteria=criteria,
        additional_instructions=additional_instructions)
    .on_output()
    )

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have an impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`float`	A value between 0.0 (not controversial) and 1.0 (controversial). TYPE: `float`

controversiality_with_cot_reasons ¶

controversiality_with_cot_reasons(
    text: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> Tuple[float, Dict]

Uses chat completion model. A function that completes a template to check the controversiality of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = (
    Feedback(provider.controversiality_with_cot_reasons,
        criteria=criteria,
        additional_instructions=additional_instructions)
    .on_output()
    )

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have an impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (not controversial) and 1.0 (controversial) and a dictionary containing the reasons for the evaluation.

misogyny ¶

misogyny(
    text: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> float

Uses chat completion model. A function that completes a template to check the misogyny of some text. Prompt credit to LangChain Eval.

Example

feedback = (
    Feedback(provider.misogyny,
        criteria=criteria,
        additional_instructions=additional_instructions)
    .on_output()
    )

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have an impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`float`	A value between 0.0 (not misogynistic) and 1.0 (misogynistic). TYPE: `float`

misogyny_with_cot_reasons ¶

misogyny_with_cot_reasons(
    text: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> Tuple[float, Dict]

Uses chat completion model. A function that completes a template to check the misogyny of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = (
    Feedback(provider.misogyny_with_cot_reasons,
        criteria=criteria,
        additional_instructions=additional_instructions)
    .on_output()
    )

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have an impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (not misogynistic) and 1.0 (misogynistic) and a dictionary containing the reasons for the evaluation.

criminality ¶

criminality(
    text: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> float

Uses chat completion model. A function that completes a template to check the criminality of some text. Prompt credit to LangChain Eval.

Example

feedback = (
    Feedback(provider.criminality,
        criteria=criteria,
        additional_instructions=additional_instructions)
    .on_output()
    )

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have an impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`float`	A value between 0.0 (not criminal) and 1.0 (criminal). TYPE: `float`

criminality_with_cot_reasons ¶

criminality_with_cot_reasons(
    text: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> Tuple[float, Dict]

Uses chat completion model. A function that completes a template to check the criminality of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = (
    Feedback(provider.criminality_with_cot_reasons,
        criteria=criteria,
        additional_instructions=additional_instructions)
    .on_output()
    )

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have an impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (not criminal) and 1.0 (criminal) and a dictionary containing the reasons for the evaluation.

insensitivity ¶

insensitivity(
    text: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> float

Uses chat completion model. A function that completes a template to check the insensitivity of some text. Prompt credit to LangChain Eval.

Example

feedback = (
    Feedback(provider.insensitivity,
        criteria=criteria,
        additional_instructions=additional_instructions)
    .on_output()
    )

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have an impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`float`	A value between 0.0 (not insensitive) and 1.0 (insensitive). TYPE: `float`

insensitivity_with_cot_reasons ¶

insensitivity_with_cot_reasons(
    text: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> Tuple[float, Dict]

Uses chat completion model. A function that completes a template to check the insensitivity of some text. Prompt credit to LangChain Eval. Also uses chain of thought methodology and emits the reasons.

Example

feedback = (
    Feedback(provider.insensitivity_with_cot_reasons,
        criteria=criteria,
        additional_instructions=additional_instructions)
    .on_output()
    )

PARAMETER	DESCRIPTION
`text`	The text to evaluate. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have an impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (not insensitive) and 1.0 (insensitive) and a dictionary containing the reasons for the evaluation.

comprehensiveness_with_cot_reasons ¶

comprehensiveness_with_cot_reasons(
    source: str,
    summary: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> Tuple[float, Dict]

Uses chat completion model. A function that tries to distill the main points of the source and compares a summary against those main points. Only a chain of thought implementation is provided, because the reasoning is considered essential to this assessment.

Example

feedback = Feedback(provider.comprehensiveness_with_cot_reasons).on_input_output()

PARAMETER	DESCRIPTION
`source`	Text corresponding to source material. TYPE: `str`
`summary`	Text corresponding to a summary. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have an impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (not comprehensive) and 1.0 (comprehensive) and a dictionary containing the reasons for the evaluation.

summarization_with_cot_reasons ¶

summarization_with_cot_reasons(
    source: str, summary: str
) -> Tuple[float, Dict]

Summarization is deprecated in favor of comprehensiveness. This function is no longer implemented.

stereotypes ¶

stereotypes(
    prompt: str,
    response: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> float

Uses chat completion model. A function that completes a template to check whether the response adds assumed stereotypes that are not present in the prompt.

Example

feedback = Feedback(provider.stereotypes).on_input_output()

PARAMETER	DESCRIPTION
`prompt`	A text prompt to an agent. TYPE: `str`
`response`	The agent's response to the prompt. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have an impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`float`	A value between 0.0 (no stereotypes assumed) and 1.0 (stereotypes assumed). TYPE: `float`

stereotypes_with_cot_reasons ¶

stereotypes_with_cot_reasons(
    prompt: str,
    response: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> Tuple[float, Dict]

Uses chat completion model. A function that completes a template to check whether the response adds assumed stereotypes that are not present in the prompt. Also uses chain of thought methodology and emits the reasons.

Example

feedback = Feedback(provider.stereotypes_with_cot_reasons).on_input_output()

PARAMETER	DESCRIPTION
`prompt`	A text prompt to an agent. TYPE: `str`
`response`	The agent's response to the prompt. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. TYPE: `Optional[str]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have an impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (no stereotypes assumed) and 1.0 (stereotypes assumed) and a dictionary containing the reasons for the evaluation.

groundedness_measure_with_cot_reasons ¶

groundedness_measure_with_cot_reasons(
    source: str,
    statement: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    examples: Optional[str] = None,
    groundedness_configs: Optional[
        GroundednessConfigs
    ] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> Tuple[float, dict]

A measure to track if the source material supports each sentence in the statement using an LLM provider.

The statement will first be split by a tokenizer into its component sentences.

Then, trivial statements are eliminated so as to not dilute the evaluation. Note that if all statements are filtered out as trivial, the function returns 0.0 with a reason indicating that no non-trivial statements were found.

The LLM will process each statement, using chain of thought methodology to emit the reasons.

Abstentions will be considered as grounded.

Example

from trulens.core import Feedback
from trulens.providers.openai import OpenAI

provider = OpenAI()

f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons)
    .on(context.collect())
    .on_output()
    )

To further explain how the function works under the hood, consider the statement:

"Hi. I'm here to help. The University of Washington is a public research university. UW's connections to major corporations in Seattle contribute to its reputation as a hub for innovation and technology"

The function will split the statement into its component sentences:

"Hi."
"I'm here to help."
"The University of Washington is a public research university."
"UW's connections to major corporations in Seattle contribute to its reputation as a hub for innovation and technology"

Next, trivial statements are removed, leaving only:

"The University of Washington is a public research university."
"UW's connections to major corporations in Seattle contribute to its reputation as a hub for innovation and technology"

The LLM will then process each remaining statement to assess its groundedness.

For the sake of this example, suppose the LLM grades one statement at the maximum score (`max_score_val`) and the other at the minimum (`min_score_val`).

The scores are then normalized and averaged to give a final groundedness score of 0.5.
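
The walkthrough above can be sketched in plain Python. This is only an illustration under stated assumptions: the regex-based `split_sentences`, the hard-coded `TRIVIAL` set, and the `judge` callable are hypothetical stand-ins for the tokenizer, the trivial-statement filter, and the LLM call, none of which are specified here.

```python
import re
from typing import Callable, Dict, List, Tuple


def split_sentences(statement: str) -> List[str]:
    # Hypothetical stand-in for the tokenizer: split after ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", statement) if s.strip()]


# Hypothetical stand-in for the trivial-statement filter.
TRIVIAL = {"Hi.", "I'm here to help."}


def groundedness_sketch(
    statement: str,
    judge: Callable[[str], int],
    min_score_val: int = 0,
    max_score_val: int = 3,
) -> Tuple[float, Dict[str, str]]:
    sentences = [s for s in split_sentences(statement) if s not in TRIVIAL]
    if not sentences:
        # Mirrors the documented behavior when everything is filtered out.
        return 0.0, {"reason": "No non-trivial statements to evaluate"}
    span = max_score_val - min_score_val
    # Normalize each raw score to 0-1, then average across sentences.
    normalized = [(judge(s) - min_score_val) / span for s in sentences]
    reasons = {s: f"normalized score {n:.2f}" for s, n in zip(sentences, normalized)}
    return sum(normalized) / len(normalized), reasons


statement = (
    "Hi. I'm here to help. The University of Washington is a public research "
    "university. UW's connections to major corporations in Seattle contribute "
    "to its reputation as a hub for innovation and technology"
)


def judge(sentence: str) -> int:
    # Toy judge: the first factual sentence gets the maximum score, the other 0.
    return 3 if sentence.startswith("The University") else 0


score, reasons = groundedness_sketch(statement, judge)
print(score)  # 0.5
```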

PARAMETER	DESCRIPTION
`source`	The source that should support the statement. TYPE: `str`
`statement`	The statement to check groundedness. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. TYPE: `Optional[str]` DEFAULT: `None`
`examples`	Optional examples to guide the evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`groundedness_configs`	Configuration for groundedness evaluation. Defaults to None. TYPE: `Optional[GroundednessConfigs]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have an impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`Tuple[float, dict]`	A tuple containing a value between 0.0 (not grounded) and 1.0 (grounded) and a dictionary containing the reasons for the evaluation.

qs_relevance ¶

qs_relevance(*args, **kwargs)

Deprecated. Use relevance instead.

qs_relevance_with_cot_reasons ¶

qs_relevance_with_cot_reasons(*args, **kwargs)

Deprecated. Use relevance_with_cot_reasons instead.

groundedness_measure_with_cot_reasons_consider_answerability ¶

groundedness_measure_with_cot_reasons_consider_answerability(
    source: str,
    statement: str,
    question: str,
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    examples: Optional[List[str]] = None,
    groundedness_configs: Optional[
        GroundednessConfigs
    ] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    **kwargs
) -> Tuple[float, dict]

A measure to track if the source material supports each sentence in the statement using an LLM provider.

The statement will first be split by a tokenizer into its component sentences.

Then, trivial statements are eliminated so as to not dilute the evaluation. Note that if all statements are filtered out as trivial, the function returns 0.0 with a reason indicating that no non-trivial statements were found.

The LLM will process each statement, using chain of thought methodology to emit the reasons.

In the case of abstentions, such as 'I do not know', the LLM will be asked to consider the answerability of the question given the source material.

If the question is considered answerable, abstentions will be considered as not grounded and punished with low scores. Otherwise, unanswerable abstentions will be considered grounded.
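
The abstention rule described above can be written out as a small decision function. This is a sketch only: in the library the LLM judge makes these determinations, and the function name `abstention_aware_score`, its boolean inputs, and the choice of exactly `min_score_val` as the "low score" are assumptions made for illustration.

```python
def abstention_aware_score(
    is_abstention: bool,
    question_answerable: bool,
    judged_score: int,
    min_score_val: int = 0,
    max_score_val: int = 3,
) -> int:
    """Return a raw score for one sentence under the answerability rule.

    Hypothetical helper; the real evaluation is performed by the LLM judge.
    """
    if not is_abstention:
        # Ordinary sentences keep whatever score the judge assigned.
        return judged_score
    if question_answerable:
        # The source could have answered the question, so abstaining is penalized.
        return min_score_val
    # The source cannot answer the question, so abstaining counts as grounded.
    return max_score_val
```

For example, an "I do not know" reply to a question the source does answer would receive `min_score_val` under this sketch, while the same reply to an unanswerable question would receive `max_score_val`.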

Example

from trulens.core import Feedback
from trulens.providers.openai import OpenAI

provider = OpenAI()

f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons_consider_answerability)
    .on(context.collect())
    .on_output()
    .on_input()
    )

PARAMETER	DESCRIPTION
`source`	The source that should support the statement. TYPE: `str`
`statement`	The statement to check groundedness. TYPE: `str`
`question`	The question to check answerability. TYPE: `str`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. TYPE: `Optional[str]` DEFAULT: `None`
`examples`	Optional examples to guide the evaluation. Defaults to None. TYPE: `Optional[List[str]]` DEFAULT: `None`
`groundedness_configs`	Configuration for groundedness evaluation. Defaults to None. TYPE: `Optional[GroundednessConfigs]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have an impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`

RETURNS	DESCRIPTION
`Tuple[float, dict]`	A tuple containing a value between 0.0 (not grounded) and 1.0 (grounded) and a dictionary containing the reasons for the evaluation.

logical_consistency_with_cot_reasons ¶

logical_consistency_with_cot_reasons(
    trace: Union[Trace, str],
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    examples: Optional[
        List[Tuple[Dict[str, str], int]]
    ] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    enable_trace_compression: bool = True,
    **kwargs
) -> Tuple[float, Dict]

Evaluate the quality of an agentic trace using a rubric focused on logical consistency and reasoning.

Example

from trulens.core import Feedback
from trulens.core.feedback.selector import Selector
from trulens.providers.openai import OpenAI

provider = OpenAI()

f_logical_consistency = (
    Feedback(provider.logical_consistency_with_cot_reasons)
    .on({
        "trace": Selector(trace_level=True),
    })
    )

PARAMETER	DESCRIPTION
`trace`	The trace to evaluate (e.g., as a JSON string or formatted log). TYPE: `Union[Trace, str]`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`examples`	Optional few-shot examples for evaluation. Defaults to None. TYPE: `Optional[List[Tuple[Dict[str, str], int]]]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have an impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`
`enable_trace_compression`	Whether to compress the trace data to reduce token usage. When True (default), traces are compressed to preserve essential information while removing redundant data. Set to False to use full, uncompressed traces. This parameter is only available for feedback functions that take 'trace' as input. Defaults to True. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (no logical consistency) and 1.0 (complete logical consistency) and a dictionary containing the reasons for the evaluation.
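
The `examples` parameter of the trace-based feedback functions is typed as `Optional[List[Tuple[Dict[str, str], int]]]`. The sketch below shows one way to build and sanity-check a value of that shape before passing it in. The `"trace"` dictionary key, the `check_examples` helper, and the assumption that each example score should lie between `min_score_val` and `max_score_val` are all hypothetical; the expected dictionary keys are not specified here.

```python
from typing import Dict, List, Tuple

FewShotExamples = List[Tuple[Dict[str, str], int]]


def check_examples(
    examples: FewShotExamples, min_score_val: int = 0, max_score_val: int = 3
) -> None:
    """Raise if an example does not match the documented type shape.

    Hypothetical helper; the score-range check is an assumption.
    """
    for fields, score in examples:
        if not all(
            isinstance(k, str) and isinstance(v, str) for k, v in fields.items()
        ):
            raise TypeError("example fields must map str to str")
        if not isinstance(score, int) or not min_score_val <= score <= max_score_val:
            raise ValueError(f"score {score!r} is outside the raw score range")


# Hypothetical few-shot examples: (fields, raw score on the 0-3 scale).
examples: FewShotExamples = [
    ({"trace": "step 1: plan; step 2: contradicts the plan"}, 0),
    ({"trace": "step 1: plan; step 2: follows the plan"}, 3),
]
check_examples(examples)
```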

execution_efficiency_with_cot_reasons ¶

execution_efficiency_with_cot_reasons(
    trace: Union[Trace, str],
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    examples: Optional[
        List[Tuple[Dict[str, str], int]]
    ] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    enable_trace_compression: bool = True,
    **kwargs
) -> Tuple[float, Dict]

Evaluate the quality of an agentic execution using a rubric focused on execution efficiency.

Example

from trulens.core import Feedback
from trulens.core.feedback.selector import Selector
from trulens.providers.openai import OpenAI

provider = OpenAI()

f_execution_efficiency = (
    Feedback(provider.execution_efficiency_with_cot_reasons)
    .on({
        "trace": Selector(trace_level=True),
    })
    )

PARAMETER	DESCRIPTION
`trace`	The trace to evaluate (e.g., as a JSON string or formatted log). TYPE: `Union[Trace, str]`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`examples`	Optional few-shot examples for evaluation. Defaults to None. TYPE: `Optional[List[Tuple[Dict[str, str], int]]]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have an impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`
`enable_trace_compression`	Whether to compress the trace data to reduce token usage. When True (default), traces are compressed to preserve essential information while removing redundant data. Set to False to use full, uncompressed traces. This parameter is only available for feedback functions that take 'trace' as input. Defaults to True. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (highly inefficient workflow) and 1.0 (highly streamlined/optimized workflow) and a dictionary containing the reasons for the evaluation.

plan_adherence_with_cot_reasons ¶

plan_adherence_with_cot_reasons(
    trace: Union[Trace, str],
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    examples: Optional[
        List[Tuple[Dict[str, str], int]]
    ] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    enable_trace_compression: bool = True,
    **kwargs
) -> Tuple[float, Dict]

Evaluate the quality of an agentic trace using a rubric focused on execution adherence to the plan.

Example

from trulens.core import Feedback
from trulens.core.feedback.selector import Selector
from trulens.providers.openai import OpenAI

provider = OpenAI()

f_plan_adherence = (
    Feedback(provider.plan_adherence_with_cot_reasons)
    .on({
        "trace": Selector(trace_level=True),
    })
    )

PARAMETER	DESCRIPTION
`trace`	The trace to evaluate (e.g., as a JSON string or formatted log). TYPE: `Union[Trace, str]`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`examples`	Optional few-shot examples for evaluation. Defaults to None. TYPE: `Optional[List[Tuple[Dict[str, str], int]]]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have an impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`
`enable_trace_compression`	Whether to compress the trace data to reduce token usage. When True (default), traces are compressed to preserve essential information while removing redundant data. Set to False to use full, uncompressed traces. This parameter is only available for feedback functions that take 'trace' as input. Defaults to True. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (execution did not follow plan) and 1.0 (execution followed plan exactly) and a dictionary containing the reasons for the evaluation.

plan_quality_with_cot_reasons ¶

plan_quality_with_cot_reasons(
    trace: Union[Trace, str],
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    examples: Optional[
        List[Tuple[Dict[str, str], int]]
    ] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    enable_trace_compression: bool = True,
    **kwargs
) -> Tuple[float, Dict]

Evaluate the quality of an agentic system's plan.

Example

from trulens.core import Feedback
from trulens.core.feedback.selector import Selector
from trulens.providers.openai import OpenAI

provider = OpenAI()

f_plan_quality = (
    Feedback(provider.plan_quality_with_cot_reasons)
    .on({
        "trace": Selector(trace_level=True),
    })
    )

PARAMETER	DESCRIPTION
`trace`	The trace to evaluate (e.g., as a JSON string or formatted log). TYPE: `Union[Trace, str]`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`examples`	Optional few-shot examples for evaluation. Defaults to None. TYPE: `Optional[List[Tuple[Dict[str, str], int]]]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which might have an impact on the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`
`enable_trace_compression`	Whether to compress the trace data to reduce token usage. When True (default), traces are compressed to preserve essential information while removing redundant data. Set to False to use full, uncompressed traces. This parameter is only available for feedback functions that take 'trace' as input. Defaults to True. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (poor plan quality) and 1.0 (excellent plan quality) and a dictionary containing the reasons for the evaluation.
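
The `min_score_val` and `max_score_val` parameters describe the raw range the judge scores on before the result is mapped to the 0.0 to 1.0 range. A minimal sketch of that mapping, assuming a plain linear rescale. The helper name `normalize_score` is hypothetical and not part of TruLens, and the library's actual normalization may differ:

```python
def normalize_score(
    raw: float, min_score_val: int = 0, max_score_val: int = 3
) -> float:
    """Linearly rescale a raw judge score to 0.0-1.0 (assumed behavior)."""
    if max_score_val <= min_score_val:
        raise ValueError("max_score_val must be greater than min_score_val")
    return (raw - min_score_val) / (max_score_val - min_score_val)


# With the default 0-3 rubric, the top raw score maps to 1.0
# and the bottom raw score maps to 0.0.
print(normalize_score(3))
print(normalize_score(0))
```

Under this assumption, widening the raw range (for example 0 to 10) changes the granularity of the judge's rubric but not the range of the returned value.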

tool_selection_with_cot_reasons ¶

tool_selection_with_cot_reasons(
    trace: Union[Trace, str],
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    examples: Optional[
        List[Tuple[Dict[str, str], int]]
    ] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    enable_trace_compression: bool = True,
    **kwargs
) -> Tuple[float, Dict]

Evaluate the quality of an agentic trace using a rubric focused on tool selection.

Example

from trulens.core import Feedback
from trulens.core.feedback.selector import Selector
from trulens.providers.openai import OpenAI

provider = OpenAI()

# Pass the whole trace to the feedback function.
f_tool_selection = (
    Feedback(provider.tool_selection_with_cot_reasons)
    .on({
        "trace": Selector(trace_level=True),
    })
)

PARAMETER	DESCRIPTION
`trace`	The trace to evaluate (e.g., as a JSON string or formatted log). TYPE: `Union[Trace, str]`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`examples`	Optional few-shot examples for evaluation. Defaults to None. TYPE: `Optional[List[Tuple[Dict[str, str], int]]]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which may affect the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`
`enable_trace_compression`	Whether to compress the trace data to reduce token usage. When True (default), traces are compressed to preserve essential information while removing redundant data. Set to False to use full, uncompressed traces. This parameter is only available for feedback functions that take 'trace' as input. Defaults to True. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (poor tool selection) and 1.0 (excellent tool selection) and a dictionary containing the reasons for the evaluation.
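
The `examples` parameter is typed as a list of `(fields, score)` pairs, where `fields` is a `Dict[str, str]` and `score` is an `int`. The sketch below builds such a list and checks its shape. The dictionary key `"trace"`, the example texts, and the `check_examples` helper are all hypothetical, and the rule that each score should fall within `min_score_val` and `max_score_val` is an assumption rather than documented behavior:

```python
from typing import Dict, List, Tuple

FewShotExample = Tuple[Dict[str, str], int]

# Hypothetical few-shot examples: the key name and texts are placeholders.
examples: List[FewShotExample] = [
    ({"trace": "Agent picked the web search tool to answer a news question."}, 3),
    ({"trace": "Agent picked the calculator tool to translate a sentence."}, 0),
]


def check_examples(
    items: List[FewShotExample], min_score_val: int = 0, max_score_val: int = 3
) -> None:
    """Raise ValueError if an example does not match the documented type."""
    for fields, score in items:
        if not all(
            isinstance(k, str) and isinstance(v, str) for k, v in fields.items()
        ):
            raise ValueError("example fields must map str to str")
        # Assumption: example scores use the same raw range as the judge.
        if not min_score_val <= score <= max_score_val:
            raise ValueError(
                f"score {score} is outside {min_score_val}-{max_score_val}"
            )


check_examples(examples)
```

A list that passes this check has the shape expected by the `examples` argument of the feedback functions in this section.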

tool_calling_with_cot_reasons ¶

tool_calling_with_cot_reasons(
    trace: Union[Trace, str],
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    examples: Optional[
        List[Tuple[Dict[str, str], int]]
    ] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    enable_trace_compression: bool = True,
    **kwargs
) -> Tuple[float, Dict]

Evaluate the quality of an agentic trace using a rubric focused on tool calling.

Example

from trulens.core import Feedback
from trulens.core.feedback.selector import Selector
from trulens.providers.openai import OpenAI

provider = OpenAI()

# Pass the whole trace to the feedback function.
f_tool_calling = (
    Feedback(provider.tool_calling_with_cot_reasons)
    .on({
        "trace": Selector(trace_level=True),
    })
)

PARAMETER	DESCRIPTION
`trace`	The trace to evaluate (e.g., as a JSON string or formatted log). TYPE: `Union[Trace, str]`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`examples`	Optional few-shot examples for evaluation. Defaults to None. TYPE: `Optional[List[Tuple[Dict[str, str], int]]]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which may affect the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`
`enable_trace_compression`	Whether to compress the trace data to reduce token usage. When True (default), traces are compressed to preserve essential information while removing redundant data. Set to False to use full, uncompressed traces. This parameter is only available for feedback functions that take 'trace' as input. Defaults to True. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (poor tool calling) and 1.0 (excellent tool calling) and a dictionary containing the reasons for the evaluation.

tool_quality_with_cot_reasons ¶

tool_quality_with_cot_reasons(
    trace: Union[Trace, str],
    criteria: Optional[str] = None,
    additional_instructions: Optional[str] = None,
    examples: Optional[
        List[Tuple[Dict[str, str], int]]
    ] = None,
    min_score_val: int = 0,
    max_score_val: int = 3,
    temperature: float = 0.0,
    enable_trace_compression: bool = True,
    **kwargs
) -> Tuple[float, Dict]

Evaluate the quality of an agentic trace using a rubric focused on tool quality.

Example

from trulens.core import Feedback
from trulens.core.feedback.selector import Selector
from trulens.providers.openai import OpenAI

provider = OpenAI()

# Pass the whole trace to the feedback function.
f_tool_quality = (
    Feedback(provider.tool_quality_with_cot_reasons)
    .on({
        "trace": Selector(trace_level=True),
    })
)

PARAMETER	DESCRIPTION
`trace`	The trace to evaluate (e.g., as a JSON string or formatted log). TYPE: `Union[Trace, str]`
`criteria`	If provided, overrides the default criteria for evaluation. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`additional_instructions`	If provided, adds instructions to default criteria for the judge to follow. Defaults to None. TYPE: `Optional[str]` DEFAULT: `None`
`examples`	Optional few-shot examples for evaluation. Defaults to None. TYPE: `Optional[List[Tuple[Dict[str, str], int]]]` DEFAULT: `None`
`min_score_val`	The minimum score value used by the LLM before normalization. Defaults to 0. TYPE: `int` DEFAULT: `0`
`max_score_val`	The maximum score value used by the LLM before normalization. Defaults to 3. TYPE: `int` DEFAULT: `3`
`temperature`	The temperature for the LLM response, which may affect the confidence level of the evaluation. Defaults to 0.0. TYPE: `float` DEFAULT: `0.0`
`enable_trace_compression`	Whether to compress the trace data to reduce token usage. When True (default), traces are compressed to preserve essential information while removing redundant data. Set to False to use full, uncompressed traces. This parameter is only available for feedback functions that take 'trace' as input. Defaults to True. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`Tuple[float, Dict]`	A tuple containing a value between 0.0 (poor tool quality) and 1.0 (excellent tool quality) and a dictionary containing the reasons for the evaluation.

__rich_repr__ ¶

__rich_repr__() -> Result

Requirement for pretty printing using the rich package.
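
The rich package's repr protocol lets an object describe its fields by yielding them from `__rich_repr__`: a bare value for a positional field, a `(name, value)` pair for a keyword field, or a `(name, value, default)` triple that rich omits when the value equals the default. A minimal sketch of that protocol follows. The `ExampleProvider` class and its `model_engine` attribute are hypothetical stand-ins, not the TruLens implementation, and the snippet does not need rich installed:

```python
class ExampleProvider:
    """Hypothetical class that implements the rich repr protocol."""

    def __init__(self, model_engine, endpoint=None):
        self.model_engine = model_engine
        self.endpoint = endpoint

    def __rich_repr__(self):
        # Keyword field: always shown by rich.
        yield "model_engine", self.model_engine
        # Keyword field with a default: rich hides it while it equals None.
        yield "endpoint", self.endpoint, None


# The protocol is a plain generator, so the fields can be inspected directly.
fields = list(ExampleProvider("some-model").__rich_repr__())
print(fields)
```

When rich is available, passing such an object to `rich.print` renders it from these yielded fields.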

load `staticmethod` ¶

load(obj, *args, **kwargs)

Deserialize/load this object using the class information in tru_class_info to look up the actual class that will do the deserialization.
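
The general pattern behind this kind of loader is to store a class identifier alongside the serialized fields and use it to pick the class at load time. The sketch below illustrates that idea with a simple name-to-class registry. The `Base`, `DemoProvider`, `dump`, and `_REGISTRY` names are hypothetical, and storing only the class name under `tru_class_info` is a simplification: this is not how TruLens itself resolves classes.

```python
from typing import Any, Dict

# Hypothetical registry mapping a stored class name to the class object.
_REGISTRY: Dict[str, type] = {}


class Base:
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        _REGISTRY[cls.__name__] = cls  # register every subclass by name

    def __init__(self, **fields: Any):
        self.fields = fields

    def dump(self) -> Dict[str, Any]:
        # Store the class name next to the data so load() can find the class.
        return {"tru_class_info": type(self).__name__, **self.fields}

    @staticmethod
    def load(obj: Dict[str, Any]) -> "Base":
        data = dict(obj)  # copy so the caller's dict is not mutated
        cls = _REGISTRY[data.pop("tru_class_info")]
        return cls(**data)


class DemoProvider(Base):
    pass


restored = Base.load(DemoProvider(model="demo").dump())
print(type(restored).__name__, restored.fields)
```

Because the lookup happens inside `load`, callers can deserialize through the base class without knowing which subclass produced the data.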

model_validate `classmethod` ¶

model_validate(*args, **kwargs) -> Any

Deserialize a jsonized version of the app into an instance of the class it was serialized from.

Note

This process uses extra information stored in the jsonized object and handled by WithClassInfo.
