Evaluating Multi-Modal RAG
In this notebook guide, we'll demonstrate how to evaluate a LlamaIndex Multi-Modal RAG system with TruLens.
# !pip install trulens trulens-apps-llamaindex trulens-providers-openai llama_index==0.10.11 ftfy regex tqdm git+https://github.com/openai/CLIP.git torch torchvision matplotlib scikit-image qdrant_client
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
Use Case: Spelling In ASL
In this demonstration, we will build a RAG application that teaches how to sign the letters of the American Sign Language (ASL) alphabet.
QUERY_STR_TEMPLATE = "How can I sign a {symbol}?."
Images
The images were taken from the ASL-Alphabet Kaggle dataset. Note that they were modified to include a label of the associated letter on each hand gesture image. These altered images are what we use as context for the user queries, and they are included in the zip download (see the cell below, which downloads the dataset directly from this notebook when download_notebook_data is True).
Text Context
For text context, we use descriptions of each hand gesture sourced from https://www.deafblind.com/asl.html. These are stored in a JSON file called asl_text_descriptions.json, which is included in the same zip download.
download_notebook_data = True
if download_notebook_data:
    !wget "https://www.dropbox.com/scl/fo/tpesl5m8ye21fqza6wq6j/h?rlkey=zknd9pf91w30m23ebfxiva9xn&dl=1" -O asl_data.zip -q
    !unzip asl_data.zip
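The `!wget` and `!unzip` lines above are notebook shell commands and will not run in a plain Python script. As a rough sketch of an equivalent that uses only the standard library, something like the following could work. The helper names `download_file` and `extract_archive` are hypothetical (they are not part of TruLens or LlamaIndex), and the URL to pass is the one shown in the cell above.

```python
import urllib.request
import zipfile
from pathlib import Path


def download_file(url: str, dest: str) -> Path:
    """Download `url` to the local path `dest` and return that path."""
    dest_path = Path(dest)
    urllib.request.urlretrieve(url, dest_path)
    return dest_path


def extract_archive(zip_path: str, dest_dir: str = ".") -> list[str]:
    """Extract every member of the zip archive into `dest_dir`.

    Returns the member names so the caller can check what was unpacked.
    """
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()
```

Calling `download_file(url, "asl_data.zip")` followed by `extract_archive("asl_data.zip")` would mirror the two shell commands, assuming the archive layout matches the paths used in the rest of this notebook.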
import json
from llama_index.core import Document
from llama_index.core import SimpleDirectoryReader
# context images
image_path = "./asl_data/images"
image_documents = SimpleDirectoryReader(image_path).load_data()
# context text
with open("asl_data/asl_text_descriptions.json") as json_file:
    asl_text_descriptions = json.load(json_file)
text_format_str = "To sign {letter} in ASL: {desc}."
text_documents = [
    Document(text=text_format_str.format(letter=k, desc=v))
    for k, v in asl_text_descriptions.items()
]
With our documents in hand, we can create our MultiModalVectorStoreIndex. To do so, we parse our Documents into nodes and then simply pass these nodes to the MultiModalVectorStoreIndex constructor.
from llama_index.core.indices.multi_modal.base import MultiModalVectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
node_parser = SentenceSplitter.from_defaults()
image_nodes = node_parser.get_nodes_from_documents(image_documents)
text_nodes = node_parser.get_nodes_from_documents(text_documents)
asl_index = MultiModalVectorStoreIndex(image_nodes + text_nodes)
#######################################################################
## Set load_previously_generated_text_descriptions to True if you ##
## would rather use previously generated gpt-4v text descriptions ##
## that are included in the .zip download ##
#######################################################################
load_previously_generated_text_descriptions = False
from llama_index.core.schema import ImageDocument
from llama_index.legacy.multi_modal_llms.openai import OpenAIMultiModal
import tqdm
if not load_previously_generated_text_descriptions:
    # define our lmm
    openai_mm_llm = OpenAIMultiModal(
        model="gpt-4-vision-preview", max_new_tokens=300
    )
    # make a new copy since we want to store text in its attribute
    image_with_text_documents = SimpleDirectoryReader(image_path).load_data()
    # get text desc and save to text attr
    for img_doc in tqdm.tqdm(image_with_text_documents):
        response = openai_mm_llm.complete(
            prompt="Describe the images as an alternative text",
            image_documents=[img_doc],
        )
        img_doc.text = response.text
    # save so don't have to incur expensive gpt-4v calls again
    desc_jsonl = [
        json.loads(img_doc.to_json()) for img_doc in image_with_text_documents
    ]
    with open("image_descriptions.json", "w") as f:
        json.dump(desc_jsonl, f)
else:
    # load up previously saved image descriptions and documents
    with open("asl_data/image_descriptions.json") as f:
        image_descriptions = json.load(f)
    image_with_text_documents = [
        ImageDocument.from_dict(el) for el in image_descriptions
    ]
# parse into nodes
image_with_text_nodes = node_parser.get_nodes_from_documents(
    image_with_text_documents
)
A keen reader will notice that we stored the text descriptions within the text field of an ImageDocument. As we did before, to create a MultiModalVectorStoreIndex, we'll need to parse the ImageDocuments as ImageNodes, and thereafter pass the nodes to the constructor.
Note that when ImageNodes with populated text fields are used to build a MultiModalVectorStoreIndex, we can choose to build the embeddings used for retrieval from this text. To do so, we set the is_image_to_text argument to True.
image_with_text_nodes = node_parser.get_nodes_from_documents(
    image_with_text_documents
)
asl_text_desc_index = MultiModalVectorStoreIndex(
    nodes=image_with_text_nodes + text_nodes, is_image_to_text=True
)
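To illustrate why populating the text field matters, here is a toy sketch of retrieving image nodes through their text descriptions instead of through image embeddings. This is not how LlamaIndex implements it: the bag-of-words `embed` function merely stands in for a real text embedding model, and the file names and descriptions are made up for the example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: stands in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical image nodes: (image file, text description stored on the node).
image_nodes = [
    ("a.jpg", "a closed fist with the thumb resting against the side"),
    ("r.jpg", "index and middle fingers crossed and pointing up"),
    ("v.jpg", "index and middle fingers spread apart in a v shape"),
]


def retrieve(query: str, top_k: int = 1) -> list[str]:
    """Rank image nodes by similarity between the query and their descriptions."""
    q = embed(query)
    ranked = sorted(
        image_nodes, key=lambda node: cosine(q, embed(node[1])), reverse=True
    )
    return [name for name, _ in ranked[:top_k]]


print(retrieve("fingers crossed"))  # ['r.jpg']
```

The point of the sketch is that the query is compared against text only, so an image can be retrieved even when no image embedding is involved, provided its description is informative.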
Build Our Multi-Modal RAG Systems
As in the text-only case, we need to "attach" a generator to our index (which can serve as a retriever) to assemble our RAG systems. In the multi-modal case, however, the generators are Multi-Modal LLMs, also often referred to as Large Multi-Modal Models (LMMs). In this notebook, we use GPT-4V as the generator for both RAG systems, so that they differ only in how they retrieve context. We can attach a generator and get a queryable interface for RAG by invoking the as_query_engine method of each index.
from llama_index.core.prompts import PromptTemplate
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
# define our QA prompt template
qa_tmpl_str = (
    "Images of hand gestures for ASL are provided.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "If the images provided cannot help in answering the query\n"
    "then respond that you are unable to answer the query. Otherwise,\n"
    "using only the context provided, and not prior knowledge,\n"
    "provide an answer to the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)
qa_tmpl = PromptTemplate(qa_tmpl_str)
# define our lmms
openai_mm_llm = OpenAIMultiModal(
    model="gpt-4-vision-preview",
    max_new_tokens=300,
)
# define our RAG query engines
rag_engines = {
    "mm_clip_gpt4v": asl_index.as_query_engine(
        multi_modal_llm=openai_mm_llm, text_qa_template=qa_tmpl
    ),
    "mm_text_desc_gpt4v": asl_text_desc_index.as_query_engine(
        multi_modal_llm=openai_mm_llm, text_qa_template=qa_tmpl
    ),
}
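To get a feel for the text the multi-modal LLM receives, the sketch below fills the same two placeholders with plain `str.format`. This only approximates what the query engine does internally; the template is a shortened stand-in for the one above, and the context chunk and query are made up.

```python
# Shortened stand-in for the QA template above; the placeholders match.
template = (
    "Images of hand gestures for ASL are provided.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Query: {query_str}\n"
    "Answer: "
)

# Hypothetical retrieved text chunk and user query.
context = "To sign R in ASL: cross your index and middle fingers."
query = "How can I sign a R?"

# Substitute both placeholders to obtain the final prompt text.
prompt = template.format(context_str=context, query_str=query)
print(prompt)
```

The retrieved images are passed to the model alongside this text, so the prompt itself carries only the text context and the query.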
Test drive our Multi-Modal RAG
Let's take one of these systems for a test drive. To display the response nicely, we use the notebook utility function display_query_and_multimodal_response.
letter = "R"
query = QUERY_STR_TEMPLATE.format(symbol=letter)
response = rag_engines["mm_text_desc_gpt4v"].query(query)
from llama_index.core.response.notebook_utils import (
    display_query_and_multimodal_response,
)
display_query_and_multimodal_response(query, response)
from trulens.core import TruSession
from trulens.dashboard import run_dashboard
session = TruSession()
session.reset_database()
run_dashboard(session)
Define the RAG Triad for evaluations
First we need to define the feedback functions to use: answer relevance, context relevance and groundedness.
import numpy as np
# Initialize provider class
from openai import OpenAI
from trulens.core import Feedback
from trulens.apps.llamaindex import TruLlama
from trulens.providers.openai import OpenAI as fOpenAI
openai_client = OpenAI()
provider = fOpenAI(client=openai_client)
# Define a groundedness feedback function
f_groundedness = (
    Feedback(
        provider.groundedness_measure_with_cot_reasons, name="Groundedness"
    )
    .on(TruLlama.select_source_nodes().node.text.collect())
    .on_output()
)
# Question/answer relevance between overall question and answer.
f_qa_relevance = Feedback(
    provider.relevance_with_cot_reasons, name="Answer Relevance"
).on_input_output()
# Question/statement relevance between question and each context chunk.
f_context_relevance = (
    Feedback(
        provider.context_relevance_with_cot_reasons, name="Context Relevance"
    )
    .on_input()
    .on(TruLlama.select_source_nodes().node.text)
    .aggregate(np.mean)
)
feedbacks = [f_groundedness, f_qa_relevance, f_context_relevance]
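The two selectors above differ in how the retrieved chunks reach the feedback function: context relevance is scored once per chunk and then averaged with `.aggregate(np.mean)`, whereas groundedness uses `.collect()` so that all chunks are passed together in one call. The sketch below mimics that difference with a hypothetical word-overlap scorer in place of the LLM provider; the question and chunks are made up, and the numbers say nothing about real TruLens scores.

```python
from statistics import mean


def toy_relevance(question: str, chunk: str) -> float:
    """Hypothetical scorer: fraction of question words found in the chunk.

    The real feedback function asks an LLM for a score instead.
    """
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


question = "sign the letter r"
# Made-up retrieved chunks standing in for source node texts.
chunks = [
    "to sign the letter r cross two fingers",
    "the letter b uses a flat hand",
]

# Per-chunk scoring followed by aggregation, as with .aggregate(np.mean).
per_chunk = [toy_relevance(question, c) for c in chunks]
context_relevance = mean(per_chunk)

# Collected scoring: all chunks joined and scored once, as with .collect().
collected_score = toy_relevance(question, " ".join(chunks))

print(per_chunk, context_relevance, collected_score)
```

Averaging penalizes irrelevant chunks in the retrieved set, while the collected form only asks whether the evidence exists somewhere in the combined context.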
Set up TruLlama to log and evaluate RAG engines
tru_text_desc_gpt4v = TruLlama(
    rag_engines["mm_text_desc_gpt4v"],
    app_name="text-desc-gpt4v",
    feedbacks=feedbacks,
)
tru_mm_clip_gpt4v = TruLlama(
    rag_engines["mm_clip_gpt4v"], app_name="mm_clip_gpt4v", feedbacks=feedbacks
)
Evaluate the performance of the RAG on each letter
letters = [
    "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M",
    "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z",
]
with tru_text_desc_gpt4v as recording:
    for letter in letters:
        query = QUERY_STR_TEMPLATE.format(symbol=letter)
        response = rag_engines["mm_text_desc_gpt4v"].query(query)
with tru_mm_clip_gpt4v as recording:
    for letter in letters:
        query = QUERY_STR_TEMPLATE.format(symbol=letter)
        response = rag_engines["mm_clip_gpt4v"].query(query)
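Each recording block above issues one query per letter, 26 per engine. As a small sketch of what those queries look like, the snippet below rebuilds the letter list from the standard library and formats every query without calling any engine. The template string is copied from the top of this notebook.

```python
import string

QUERY_STR_TEMPLATE = "How can I sign a {symbol}?."

# string.ascii_uppercase is "ABC...Z", so this matches the 26-item list above.
letters = list(string.ascii_uppercase)
queries = [QUERY_STR_TEMPLATE.format(symbol=letter) for letter in letters]

print(len(queries))
print(queries[0])
```

Previewing the queries this way can be a cheap check before spending GPT-4V calls on the full evaluation run.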
See results
session.get_leaderboard(app_ids=["text-desc-gpt4v", "mm_clip_gpt4v"])
from trulens.dashboard import run_dashboard
run_dashboard(session)