Running with your app

The primary method for evaluating LLM apps is by running metrics with your app.

To do so, you first need to define the metric by wrapping a metric implementation with Metric and specifying selectors that define what components of your app to evaluate. Optionally, you can also specify an aggregation method.

Example

from trulens.core import Metric, Selector
import numpy as np

f_context_relevance = Metric(
    implementation=openai.context_relevance,
    selectors={
        "question": Selector.select_record_input(),
        "context": Selector.select_context(collect_list=False),
    },
    agg=np.mean,
)

# Implementation signature:
# def context_relevance(self, question: str, context: str) -> float:

Once you've defined the metrics to run with your application, you can then pass them as a list to the instrumentation class of your choice, along with the app itself. These make up the recorder.

Example

from trulens.apps.langchain import TruChain

# f_lang_match, f_qa_relevance, f_context_relevance are metrics
tru_recorder = TruChain(
    chain,
    app_name='ChatApplication',
    app_version="Chain1",
    feedbacks=[f_lang_match, f_qa_relevance, f_context_relevance],
)

Now that you've included the evaluations as a component of your recorder, they are able to be run with your application. By default, metrics will be run in the same process as the app. This is known as the feedback mode: WITH_APP_THREAD.

Example

with tru_recorder as recording:
    chain("What is langchain?")

In addition to WITH_APP_THREAD, there are a number of other manners of running metrics. These are accessed by the feedback mode and included when you construct the recorder.

Example

from trulens.core import FeedbackMode

tru_recorder = TruChain(
    chain,
    app_name='ChatApplication',
    app_version="Chain1",
    feedbacks=[f_lang_match, f_qa_relevance, f_context_relevance],
    feedback_mode=FeedbackMode.DEFERRED,
)

Here are the different feedback modes you can use:

WITH_APP_THREAD: This is the default mode. Metrics will run in the same process as the app, but only after the app has produced a record.
NONE: In this mode, no evaluation will occur, even if metrics are specified.
WITH_APP: Metrics will run immediately and before the app returns a record.
DEFERRED: Metrics will be evaluated later via the process started by tru.start_evaluator.