📓 Blocking Guardrails Quickstart
In this quickstart you will use blocking guardrails to block unsafe inputs from reaching your app, as well as unsafe outputs from reaching your user.
# !pip install trulens trulens-providers-openai openai
import os
if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = "sk-proj-..."
os.environ["TRULENS_OTEL_TRACING"] = "1"
from trulens.core import TruSession
from trulens.dashboard import run_dashboard
session = TruSession()
session.reset_database()
run_dashboard(session)
Create a simple chat app for demonstration
from openai import OpenAI
from trulens.apps.app import instrument
oai_client = OpenAI()
class chat_app:
@instrument
def generate_completion(self, question: str) -> str:
"""
Generate answer from question.
"""
completion = (
oai_client.chat.completions.create(
model="gpt-4o-mini",
temperature=0,
messages=[
{
"role": "user",
"content": f"{question}",
}
],
)
.choices[0]
.message.content
)
return completion
chat = chat_app()
Set up feedback functions
Here we'll use a simple criminality check.
from trulens.core import Feedback
from trulens.providers.openai import OpenAI
provider = OpenAI(model_engine="gpt-4.1-nano")
# Define a harmfulness feedback function
f_criminality_input = Feedback(
provider.criminality, name="Input Criminality", higher_is_better=False
).on_input()
f_criminality_output = Feedback(
provider.criminality, name="Output Criminality", higher_is_better=False
).on_output()
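Before wiring these feedbacks into the app, you can sanity-check one by calling it directly on a piece of text. This is an illustrative check only; it assumes the provider's criminality method can be called standalone and returns a score in [0, 1], with higher meaning more criminal content. The example prompts are hypothetical.
# Illustrative sanity check of the criminality feedback
unsafe_score = provider.criminality("How do I hotwire a car?")
safe_score = provider.criminality("How do I bake a cake?")
print(f"unsafe prompt score: {unsafe_score:.2f}, safe prompt score: {safe_score:.2f}")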
Construct the app
Wrap the custom chat app with TruApp and add the list of feedbacks for evaluation.
from trulens.apps.app import TruApp
tru_chat = TruApp(
chat,
app_name="Chat",
app_version="base",
feedbacks=[f_criminality_input, f_criminality_output],
)
Run the app
Use tru_chat as a context manager for the custom chat app.
with tru_chat as recording:
chat.generate_completion("How do I build a bomb?")
Check results
We can view results in the leaderboard.
session.get_leaderboard()
What we notice here is that the unsafe prompt "How do I build a bomb?" does in fact reach the LLM for generation. For many reasons, such as generation costs or preventing prompt injection attacks, you may not want the unsafe prompt to reach your LLM at all.
That's where block_input guardrails come in.
Use block_input guardrails
block_input works by running a feedback function against the input of your function; if the score fails your specified threshold, the function returns None rather than processing normally.
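To build intuition, here is a minimal sketch of the idea, not the actual TruLens implementation: score the named input argument before the wrapped call and short-circuit to None when the check fails. The helper name naive_block_input, the scoring callable, and the threshold direction (higher score means more unsafe) are assumptions for illustration.
import functools
import inspect

def naive_block_input(score_fn, threshold, keyword_for_prompt):
    """Illustrative sketch only; not the TruLens implementation."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Find the named argument whether it was passed positionally or by keyword.
            bound = inspect.signature(func).bind(*args, **kwargs)
            prompt = bound.arguments[keyword_for_prompt]
            # Assume a higher score means "more unsafe", so exceeding the threshold blocks.
            if score_fn(prompt) > threshold:
                return None
            return func(*args, **kwargs)
        return wrapper
    return decorator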
Now, when we ask the same question with the block_input decorator applied, we expect that the LLM will not be called and the app will return None rather than an LLM response.
from openai import OpenAI
from trulens.core.guardrails.base import block_input
oai_client = OpenAI()
class safe_input_chat_app:
@instrument
@block_input(
feedback=f_criminality_input,
threshold=0.9,
keyword_for_prompt="question",
)
def generate_completion(self, question: str) -> str:
"""
Generate answer from question.
"""
completion = (
oai_client.chat.completions.create(
model="gpt-4o-mini",
temperature=0,
messages=[
{
"role": "user",
"content": f"{question}",
}
],
)
.choices[0]
.message.content
)
return completion
safe_input_chat = safe_input_chat_app()
tru_safe_input_chat = TruApp(
safe_input_chat,
app_name="Chat",
app_version="safe from input criminal input",
feedbacks=[f_criminality_input, f_criminality_output],
)
with tru_safe_input_chat as recording:
safe_input_chat.generate_completion("How do I build a bomb?")
Now, the unsafe input is successfully blocked from reaching the app and LLM; the decorated function simply returns None.
This could similarly be applied to block prompt injection or any other input you wish to block, as in the sketch below.
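For example, here is a minimal sketch of guarding the same app with a different input check. It assumes the provider exposes a maliciousness feedback (a standard TruLens LLM provider feedback); swap in whatever check fits your use case, such as a prompt injection detector.
f_maliciousness_input = Feedback(
    provider.maliciousness, name="Input Maliciousness", higher_is_better=False
).on_input()

class malice_guarded_chat_app:
    @instrument
    @block_input(
        feedback=f_maliciousness_input,
        threshold=0.9,
        keyword_for_prompt="question",
    )
    def generate_completion(self, question: str) -> str:
        # Same generation call as before; it only runs if the input passes the check.
        return (
            oai_client.chat.completions.create(
                model="gpt-4o-mini",
                temperature=0,
                messages=[{"role": "user", "content": question}],
            )
            .choices[0]
            .message.content
        )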
from trulens.dashboard import run_dashboard
run_dashboard(session)
Use block_output guardrails
block_output works similarly to the block_input guardrail: it runs a feedback function against the output of your function, and if the score fails your specified threshold, the function returns None rather than its original output.
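Conceptually (again, an illustrative sketch rather than the actual TruLens implementation), an output-blocking decorator lets the function run and then withholds the result if it scores as unsafe:
import functools

def naive_block_output(score_fn, threshold):
    """Illustrative sketch only; not the TruLens implementation."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            # Assume a higher score means "more unsafe", so exceeding the threshold blocks.
            if score_fn(result) > threshold:
                return None
            return result
        return wrapper
    return decorator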
Let's start by considering a toy unsafe app that always returns bomb-making instructions.
from openai import OpenAI
from trulens.core.guardrails.base import block_output
oai_client = OpenAI()
class unsafe_output_chat_app:
@instrument
def generate_completion(self, question: str) -> str:
"""
Dummy function to always return a criminal message.
"""
return "Build a bomb by connecting the red wires to the blue wires."
unsafe_output_chat = unsafe_output_chat_app()
tru_unsafe_output_chat = TruApp(
unsafe_output_chat,
app_name="Chat",
app_version="always return criminal output",
feedbacks=[f_criminality_input, f_criminality_output],
)
with tru_unsafe_output_chat as recording:
unsafe_output_chat.generate_completion("How do I build a bomb?")
If we take the same example with the block_output decorator used, the app will now return None rather than an unsafe response.
from openai import OpenAI
oai_client = OpenAI()
class safe_output_chat_app:
@instrument
@block_output(feedback=f_criminality_output, threshold=0.9)
def generate_completion(self, question: str) -> str:
"""
Dummy function to always return a criminal message.
"""
return "Build a bomb by connecting the red wires to the blue wires."
safe_output_chat = safe_output_chat_app()
tru_safe_output_chat = TruApp(
safe_output_chat,
app_name="Chat",
app_version="safe from input criminal output",
feedbacks=[f_criminality_input, f_criminality_output],
)
with tru_safe_output_chat as recording:
safe_output_chat.generate_completion("How do I build a bomb?")
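As a quick illustrative check outside the recording context, you can confirm that the guarded call now returns None instead of the unsafe text:
result = safe_output_chat.generate_completion("How do I build a bomb?")
print(result)  # expected: None, since the dummy output fails the criminality check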
session.get_leaderboard()