OpenAI OSS Models as Judge with TruLens¶
Evaluation is a core component of TruLens, used to assess the quality of AI apps and increasingly complex AI agents.
As agents become more sophisticated, developers face competing requirements when choosing evaluation models:
- Powerful, reliable LLMs capable of assessing complex tasks.
- Cost-effective evaluation, despite the large token volumes it requires.
- Models that can run on local hardware or private deployments, rather than being locked behind API access.
To meet these requirements, we often have to choose between expensive, large, proprietary models and smaller open-source ones.
OpenAI's release of the GPT-OSS models (20B and 120B) is an important advancement in addressing these competing requirements, offering highly performant, open-weight reasoning language models at competitive cost while supporting local deployment.
TruLens provides day-0 support for these models, enabling you to evaluate your AI agents with powerful OSS models like the GPT-OSS series.
Consider a challenging groundedness evaluation¶
source_text = """
Clinical decision support (CDS) software that provides recommendations based on AI algorithms may be
considered a medical device if it is intended to inform clinical management.
However, for such software to be exempt from regulation, it must allow healthcare professionals to
independently review the basis of its recommendations. The FDA does not endorse any software that acts
as a substitute for clinical judgment or is used as the sole basis for treatment decisions.
"""
claim_hallucination = "The FDA’s 2023 guidance explicitly states that AI-generated diagnoses may be used as a sole basis for treatment decisions in clinical settings."
claim_grounded = "According to the FDA, clinical decision support software must enable healthcare professionals to independently review how recommendations are made, in order to be exempt from regulation."
Evaluate using the TruLens LiteLLM provider & Ollama¶
To get started, first download Ollama.
Then, run ollama run gpt-oss.
Once the model is pulled, you can use it in TruLens!
from trulens.providers.litellm import LiteLLM

# Point the LiteLLM provider at the local Ollama server (Ollama's default port is 11434).
ollama_provider = LiteLLM(
    model_engine="ollama/gpt-oss", api_base="http://localhost:11434"
)
ollama_provider.groundedness_measure_with_cot_reasons(source_text, claim_hallucination)
ollama_provider.groundedness_measure_with_cot_reasons(source_text, claim_grounded)
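Each groundedness call returns a (score, reasons) pair: a score in [0, 1] and the judge's chain-of-thought explanation. A minimal sketch of post-processing those results to flag ungrounded claims — using placeholder scores, since the real calls require a running Ollama server, and a helper name (flag_ungrounded) that is illustrative rather than part of TruLens:

```python
def flag_ungrounded(results, threshold=0.5):
    """Return the names of claims whose groundedness score falls below the threshold."""
    return [claim for claim, (score, _reasons) in results.items() if score < threshold]

# Placeholder (score, reasons) pairs standing in for real provider output.
results = {
    "claim_hallucination": (0.0, {"reason": "not supported by the source text"}),
    "claim_grounded": (1.0, {"reason": "directly supported by the source text"}),
}
print(flag_ungrounded(results))  # → ['claim_hallucination']
```

A thresholded check like this is a common way to turn per-claim judge scores into a pass/fail signal for an agent's outputs.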