Multimodal Evaluations with Gemini¶
Installing the dependencies¶
In [ ]:
!pip install trulens trulens-providers-google google-genai -q
Download data to use¶
In [ ]:
!wget "https://docs.google.com/uc?export=download&id=1ShPnYVc1iL_TA1t7ErCFEAHT74-qvMrn" -O ./sf.png
!wget "https://docs.google.com/uc?export=download&id=16oTISaB5m2uasHlezg7iPYV2FBiQYc4n" -O ./customer_support_agnet.wav
!wget "https://docs.google.com/uc?export=download&id=1186BiByf2NUXmOOO8k7hGK2qGy8o5fCb" -O ./chameleon.mp4
Setting up the Gemini client¶
In [ ]:
import os

from google import genai

os.environ["GOOGLE_API_KEY"] = "..."
google_client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
Setting up a custom provider with Google¶
In this tutorial, we leverage the multi-modal capabilities of Gemini models from Google to evaluate across different modalities, while using their structured output generation to reliably produce scores in the desired result format.
For images¶
For image input, Gemini supports the following formats: JPEG, PNG, WebP, HEIC, and HEIF. Make sure to pass the image with the correct MIME type.
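The provider below hardcodes `image/png`, so if your context images come in mixed formats you may want to derive the MIME type from the filename instead. Here is a minimal sketch using Python's standard `mimetypes` module; the helper name `image_mime_type` is ours, not part of trulens or google-genai:

```python
import mimetypes

# MIME types Gemini accepts for inline image input
SUPPORTED_IMAGE_MIME_TYPES = {
    "image/jpeg",
    "image/png",
    "image/webp",
    "image/heic",
    "image/heif",
}


def image_mime_type(path: str) -> str:
    """Guess the MIME type from the file extension and check Gemini supports it."""
    mime, _ = mimetypes.guess_type(path)
    if mime not in SUPPORTED_IMAGE_MIME_TYPES:
        raise ValueError(f"Unsupported image type for {path!r}: {mime}")
    return mime


image_mime_type("sf.png")  # → "image/png"
```

The returned string can then be passed straight to `types.Part.from_bytes(data=..., mime_type=...)`.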
Google Feedback Provider for evaluating Image Faithfulness¶
In [ ]:
from typing import List

from google.genai import types
from pydantic import BaseModel, Field
from trulens.providers.google import Google


class ImageFaithfulnessScore(BaseModel):
    """
    Represents a binary faithfulness score for an image response
    with respect to the given query and/or retrieved context.
    """

    value: float = Field(
        ...,
        description=(
            "Binary faithfulness score. "
            "1.0 → The image is faithful (accurately reflects the query/context). "
            "0.0 → The image is unfaithful (introduces unsupported or contradictory content)."
        ),
        ge=0.0,
        le=1.0,
    )
    reason: str = Field(
        ...,
        description=(
            "A concise explanation describing why this score was given. "
            "Should reference objects, attributes, or details in the image "
            "and whether they are supported by the query/context."
        ),
    )


class Multimodal_Google_Provider(Google):
    def multi_modal_faithfulness(self, query: str, retrieved_context: List):
        retrieved_context = [
            (
                types.Part(text=rc)
                if isinstance(rc, str)
                else types.Part.from_bytes(data=rc, mime_type="image/png")
            )
            for rc in retrieved_context
        ]
        score = google_client.models.generate_content(
            model="gemini-2.0-flash",
            contents=[
                types.Part(
                    text="""
You are an AI system designed to judge whether a given piece of information is supported by the provided context, which may include both textual and visual content.

### TASK:
Analyze the provided **information statement** and the **context** (including text and any images if available).
Determine whether the information is supported by the context.

Consider these factors:
- **Support from Text**: Does the textual context explicitly or implicitly support the information?
- **Support from Visuals**: If images are provided, do they support the information?
- **Partial Evidence**: If any part of the context (text or image) supports the information, output **1**.
- **Contradiction or Absence**: If the context does not support or contradicts the information, output **0**.

The classification must be one of the following:
[1, 0]

IMPORTANT:
- "1" → At least one piece of context (text or image) supports the information.
- "0" → None of the context supports the information, or it contradicts it.

************
Here is the information statement:
"""
                ),
                types.Part(text=query),
                types.Part(
                    text="""
Here is the context:
"""
                ),
                *retrieved_context,
                types.Part(
                    text="""
************
RESPONSE FORMAT:
Return a JSON object with `value` (1 or 0) representing the judgment and a short `reason`.
************

### EXAMPLES:
Information: Apple pie is generally double-crusted.
Context: An apple pie is a fruit pie in which the principal filling ingredient is apples.
Apple pie is often served with whipped cream, ice cream ('apple pie à la mode'), custard or cheddar cheese.
It is generally double-crusted, with pastry both above and below the filling; the upper crust may be solid or latticed (woven of crosswise strips).
Answer: 1

Information: Apple pies taste bad.
Context: An apple pie is a fruit pie in which the principal filling ingredient is apples.
Apple pie is often served with whipped cream, ice cream ('apple pie à la mode'), custard or cheddar cheese.
It is generally double-crusted, with pastry both above and below the filling; the upper crust may be solid or latticed (woven of crosswise strips).
Answer: 0
************
Analyze the information statement and the context, and respond in this format.
"""
                ),
            ],
            config={
                "response_mime_type": "application/json",
                "response_schema": ImageFaithfulnessScore,
            },
        )
        return score.parsed
Test custom feedback function¶
In [ ]:
multimodal_gemini_provider = Multimodal_Google_Provider()

image_file_name = "sf.png"
with open(image_file_name, "rb") as f:
    image_bytes = f.read()

faithfulness = multimodal_gemini_provider.multi_modal_faithfulness(
    query="Does Sam’s Grill have outdoor seating?",
    retrieved_context=[
        image_bytes,
        "Customers can choose dine-in, curbside pickup, or delivery.",
    ],
)
faithfulness
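Because `response_schema` is set, `score.parsed` is a validated `ImageFaithfulnessScore` instance, so you can read `.value` and `.reason` directly. As a minimal offline sketch of that parsing step (the raw JSON string here is a made-up illustration, not real model output):

```python
from pydantic import BaseModel, Field


class ImageFaithfulnessScore(BaseModel):
    value: float = Field(..., ge=0.0, le=1.0)
    reason: str


# Hypothetical raw JSON, shaped the way response_schema constrains the model
raw = '{"value": 1.0, "reason": "The image shows tables on the sidewalk outside."}'
score = ImageFaithfulnessScore.model_validate_json(raw)
score.value, score.reason  # → (1.0, "The image shows tables on the sidewalk outside.")
```

Validation errors (e.g. a `value` outside [0, 1]) raise a `pydantic.ValidationError`, which is a useful guard when logging feedback scores downstream.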
For Audio¶
For audio input, Gemini supports specific formats — WAV, MP3, AIFF, AAC, OGG, and FLAC. Ensure that you provide the correct MIME type when passing audio files.
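The test cells in this notebook also note that inline media must stay under 20 MB (the Gemini inline-request limit); larger files should go through the Files API instead. A small stdlib guard you could apply before reading any media file — the helper name `read_inline_bytes` and the exact limit handling are ours:

```python
import os

MAX_INLINE_BYTES = 20 * 1024 * 1024  # ~20 MB inline-request limit


def read_inline_bytes(path: str) -> bytes:
    """Read a media file, refusing anything too large to send inline."""
    size = os.path.getsize(path)
    if size >= MAX_INLINE_BYTES:
        raise ValueError(
            f"{path!r} is {size} bytes; upload it via the Files API instead."
        )
    with open(path, "rb") as f:
        return f.read()
```

This keeps the feedback functions below from silently failing on oversized inputs.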
Evaluating Customer Support Chatbot Resolutions with Google Feedback Provider¶
In [ ]:
from google.genai import types
from pydantic import BaseModel, Field
from trulens.providers.google import Google


class ResolutionStatus(BaseModel):
    """
    Represents whether the support issue was resolved based on the agent's final utterance.
    """

    value: float = Field(
        ...,
        description=(
            "1.0 if the final utterance clearly indicates resolution of the issue; 0.0 otherwise."
        ),
        ge=0.0,
        le=1.0,
    )
    reason: str = Field(
        ...,
        description=(
            "A short explanation referencing the agent's final words "
            "and the detected emotion (tone, confidence, reassurance)."
        ),
    )


class Multimodal_Google_Provider(Google):
    def audio_resolution_detection(self, audio_bytes: bytes):
        result = google_client.models.generate_content(
            model="gemini-2.0-flash",
            contents=[
                types.Part(
                    text="""
You are an AI system that checks customer support call endings.

### TASK:
Based on both the transcript meaning AND the detected emotion in the audio, determine if the issue was **resolved**.

Guidelines for resolution:
- If the final utterance provides a clear action, resolution, or timeline in a confident or neutral/reassuring tone → value = 1.0.
- If the final utterance is vague, evasive, non-committal, or delivered with frustration/hesitation → value = 0.0.
************
Here is the audio to analyze:
"""
                ),
                types.Part.from_bytes(
                    data=audio_bytes,
                    mime_type="audio/wav",
                ),
                types.Part(
                    text="""
RESPONSE FORMAT:
Return JSON in the following schema:
{
    "value": 1.0 or 0.0,
    "reason": "short explanation with reference to transcript + audio tone"
}
"""
                ),
            ],
            config={
                "response_mime_type": "application/json",
                "response_schema": ResolutionStatus,
            },
        )
        return result.parsed
Test custom feedback function¶
In [ ]:
multimodal_gemini_provider = Multimodal_Google_Provider()

# Only for audio files smaller than 20 MB
with open("customer_support_agnet.wav", "rb") as f:
    audio_bytes = f.read()

multimodal_gemini_provider.audio_resolution_detection(audio_bytes=audio_bytes)
For Video¶
For video input, Gemini supports the following formats: MP4, MPEG, MOV, AVI, FLV, MPG, WebM, WMV, and 3GPP. Ensure that you provide the correct MIME type when passing video files.
Google Feedback Provider to evaluate Video Relevance¶
In [ ]:
from google.genai import types
from pydantic import BaseModel, Field
from trulens.providers.google import Google


class VideoRelevance(BaseModel):
    """
    Represents the relevance classification of a recommended video
    with respect to a given search query.
    """

    value: float = Field(
        ...,
        description=(
            "The classification of the video's relevance to the search query. "
            "'1.0' → directly addresses the main intent, "
            "'0.5' → overlaps but is incomplete or drifts, "
            "'0.0' → does not address the query in a meaningful way."
        ),
        ge=0.0,
        le=1.0,
    )
    reason: str = Field(
        ...,
        description=(
            "A concise explanation describing why this classification was chosen. "
            "Should reference topic alignment, specificity, format/medium match, "
            "and clarity of relevance."
        ),
    )


class Multimodal_Google_Provider(Google):
    def video_relevance_scorer(self, query, video_bytes):
        result = google_client.models.generate_content(
            model="gemini-2.0-flash",
            contents=[
                types.Part(
                    text="""
You are an AI system designed to judge whether a recommended video is relevant to a given search query.

### TASK:
Analyze the provided search query and the recommended video.
Determine whether the video’s main content is relevant to the search intent expressed in the query.

Consider these factors:
- **Topic Alignment**: Does the video content match the subject of the search query?
- **Specificity**: Does it address the specific focus, details, or constraints of the query?
- **Format & Medium**: If the query implies a certain type of content (tutorial, documentary, news, etc.), does the video match?
- **Clarity of Relevance**: Is the connection to the query obvious or is it only loosely related?

The classification must be one of the following:
[1.0, 0.5, 0.0]

IMPORTANT:
- "1.0" → Directly addresses the main intent of the query.
- "0.5" → Shares some overlap but is missing key details or drifts into unrelated topics.
- "0.0" → Does not address the query’s intent in a meaningful way.
- Avoid overusing the middle score (0.5) — decide firmly whenever possible.
************
Here is the search query:
"""
                ),
                types.Part(text=query),
                types.Part(
                    text="""
Here is the recommended video:
"""
                ),
                types.Part(
                    inline_data=types.Blob(data=video_bytes, mime_type="video/mp4")
                ),
                types.Part(
                    text="""
************
RESPONSE FORMAT:
Return a JSON object with `value` (one of 1.0, 0.5, 0.0) representing the relevance classification and a short `reason`.
************
Analyze the query and the recommended video and respond in this format.
"""
                ),
            ],
            config={
                "response_mime_type": "application/json",
                "response_schema": VideoRelevance,
            },
        )
        return result.parsed
Test custom feedback function¶
In [ ]:
gemini_provider = Multimodal_Google_Provider()

# Only for videos smaller than 20 MB
video_file_name = "chameleon.mp4"
with open(video_file_name, "rb") as f:
    video_bytes = f.read()

relevance = gemini_provider.video_relevance_scorer(
    query="Chameleon hunting its prey", video_bytes=video_bytes
)
relevance