Pinecone Configuration Choices on Downstream App Performance¶
Large Language Models (LLMs) have a hallucination problem. Retrieval Augmented Generation (RAG) is an emerging paradigm that mitigates this problem by augmenting LLMs with a knowledge base: a source-of-truth set of documents, often stored in a vector database like Pinecone. To build an effective RAG-style LLM application, it is important to experiment with various configuration choices while setting up the vector database and to study their impact on performance metrics.
Installing dependencies¶
The following cell invokes a shell command in the active Python environment to install the packages we need for this notebook. You can also run pip install directly in your terminal without the !.
# !pip install trulens trulens-apps-langchain trulens-providers-openai langchain==0.0.315 openai==0.28.1 tiktoken==0.5.1 "pinecone-client[grpc]==2.2.4" pinecone-datasets==0.5.1 datasets==2.14.5 langchain_community
import os
os.environ["OPENAI_API_KEY"] = "..."
os.environ["HUGGINGFACE_API_KEY"] = "..."
os.environ["PINECONE_API_KEY"] = "..."
os.environ["PINECONE_ENVIRONMENT"] = "..."
Building the Knowledge Base¶
We will download a pre-embedded dataset from pinecone-datasets, which allows us to skip the embedding and preprocessing steps. If you'd rather work through those steps, you can find the full notebook here.
import pinecone_datasets
dataset = pinecone_datasets.load_dataset(
"wikipedia-simple-text-embedding-ada-002-100K"
)
dataset.head()
We'll format the dataset ready for upsert and reduce what we use to a subset of the full dataset.
# we drop the existing metadata column; the blob column (renamed below) takes its place
dataset.documents.drop(["metadata"], axis=1, inplace=True)
dataset.documents.rename(columns={"blob": "metadata"}, inplace=True)
# we will use rows of the dataset up to index 30_000
dataset.documents.drop(dataset.documents.index[30_000:], inplace=True)
len(dataset)
Now we move on to initializing our Pinecone vector database.
Vector Database¶
To create our vector database we first need a free API key from Pinecone. Then we initialize like so:
import pinecone
# find API key in console at app.pinecone.io
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
# find ENV (cloud region) next to API key in console
PINECONE_ENVIRONMENT = os.getenv("PINECONE_ENVIRONMENT")
pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENVIRONMENT)
index_name_v1 = "langchain-rag-cosine"
if index_name_v1 not in pinecone.list_indexes():
    # we create a new index
    pinecone.create_index(
        name=index_name_v1,
        metric="cosine",  # we'll try each distance metric here
        dimension=1536,  # 1536 dim of text-embedding-ada-002
    )
We can fetch index stats to confirm that it was created. Note that the total vector count here will be 0.
import time
index = pinecone.GRPCIndex(index_name_v1)
# wait a moment for the index to be fully initialized
time.sleep(1)
index.describe_index_stats()
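A fixed one-second sleep may not always be long enough for a freshly created index. One alternative, sketched here under assumptions, is to poll a readiness check until it passes or a timeout expires. The wait_until helper and the fake_ready check are hypothetical names introduced for illustration; in practice the check would wrap whatever readiness signal your Pinecone client version exposes.

```python
import time


def wait_until(check, timeout=60.0, interval=1.0):
    """Call check() repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in readiness check (hypothetical): reports ready on the third call.
calls = {"n": 0}


def fake_ready():
    calls["n"] += 1
    return calls["n"] >= 3


ready = wait_until(fake_ready, timeout=5.0, interval=0.01)
never_ready = wait_until(lambda: False, timeout=0.05, interval=0.01)
print(ready, never_ready)
```

The helper returns False on timeout instead of raising, so the caller can decide whether to retry or abort.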
Upsert documents into the db.
for batch in dataset.iter_documents(batch_size=100):
    index.upsert(batch)
Confirm that they've been added. The vector count should now be 30k.
index.describe_index_stats()
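To make the batching in the upsert loop concrete, here is a minimal sketch of splitting records into fixed-size batches. It is not the pinecone-datasets implementation; iter_batches and the placeholder records are assumptions made for illustration.

```python
from itertools import islice


def iter_batches(items, batch_size):
    """Yield consecutive lists of at most `batch_size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Placeholder records standing in for the 30,000 documents kept above.
records = [{"id": str(i)} for i in range(30_000)]
batches = list(iter_batches(records, 100))
print(len(batches), len(batches[0]), len(batches[-1]))
```

With 30,000 records and a batch size of 100, the loop issues 300 upsert calls rather than one very large request.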
Creating a Vector Store and Querying¶
Now that we've built our index we can switch over to LangChain. We need to initialize a LangChain vector store using the same index we just built. For this we will also need a LangChain embedding object, which we initialize like so:
from langchain.embeddings.openai import OpenAIEmbeddings
# get openai api key from platform.openai.com
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
model_name = "text-embedding-ada-002"
embed = OpenAIEmbeddings(model=model_name, openai_api_key=OPENAI_API_KEY)
Now initialize the vector store:
from langchain_community.vectorstores import Pinecone
text_field = "text"
# switch back to normal index for langchain
index = pinecone.Index(index_name_v1)
vectorstore = Pinecone(index, embed.embed_query, text_field)
Retrieval Augmented Generation (RAG)¶
In RAG we take the query as a question to be answered by an LLM, but the LLM must base its answer on the information returned from the vectorstore.
To do this we initialize a RetrievalQA object like so:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
# completion llm
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.0)
chain_v1 = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=vectorstore.as_retriever()
)
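Conceptually, the "stuff" chain type places all retrieved chunks into a single prompt ahead of the question. The sketch below illustrates that idea only; the prompt wording and the build_stuff_prompt helper are assumptions, not LangChain's actual template.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate all retrieved chunks into one context block, then ask the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Illustrative chunks standing in for what the retriever would return.
example_chunks = [
    "Cincinnati is a city in the U.S. state of Ohio.",
    "Columbus is the capital of Ohio.",
]
example_prompt = build_stuff_prompt("What is the capital of Ohio?", example_chunks)
print(example_prompt)
```

Because every chunk ends up in the prompt, the quality of the retrieved context directly bounds the quality of the answer, which is what the feedback functions in the next section measure.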
Evaluation with TruLens¶
Once we’ve set up our app, we should put together our feedback functions. As a reminder, feedback functions are an extensible method for evaluating LLMs. Here we’ll set up 3 feedback functions: context_relevance, qa_relevance, and groundedness. They’re defined as follows:
- Context Relevance (also called QS Relevance): query-statement relevance is the average of relevance (0 to 1) for each context chunk returned by the semantic search.
- QA Relevance: question-answer relevance is the relevance (again, 0 to 1) of the final answer to the original question.
- Groundedness: groundedness measures how well the generated response is supported by the evidence provided to the model, where a score of 1 means each sentence is grounded by a retrieved context chunk.
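The aggregation behind these definitions can be sketched with plain numbers. The scores below are hypothetical placeholders, not outputs of any model, and averaging per-sentence support is only one simple way to arrive at a groundedness score; the provider may compute it differently internally.

```python
from statistics import mean

# Hypothetical per-chunk relevance scores (0 to 1) for one query,
# one score per context chunk returned by the semantic search.
chunk_scores = [0.8, 0.6, 0.2, 0.0]
context_relevance_score = mean(chunk_scores)

# Hypothetical per-sentence support scores (0 to 1) for one answer:
# 1.0 means the sentence is backed by a retrieved chunk, 0.0 means it is not.
sentence_support = [1.0, 1.0, 0.0]
groundedness_score = mean(sentence_support)

print(round(context_relevance_score, 3), round(groundedness_score, 3))
```

A single unsupported sentence pulls groundedness below 1, which is the signal used later to diagnose hallucination.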
# Imports main tools for eval
import numpy as np
from trulens.core import Feedback
from trulens.core import Select
from trulens.core import TruSession
from trulens.apps.langchain import TruChain
from trulens.providers.openai import OpenAI as fOpenAI
session = TruSession()
# Initialize OpenAI-based feedback function collection class:
provider = fOpenAI()
# Define groundedness
f_groundedness = (
    Feedback(
        provider.groundedness_measure_with_cot_reasons, name="Groundedness"
    )
    .on(
        TruChain.select_context(chain_v1).collect()  # context
    )
    .on_output()
)
# Question/answer relevance between overall question and answer.
f_answer_relevance = Feedback(
    provider.relevance_with_cot_reasons, name="Answer Relevance"
).on_input_output()
# Question/statement relevance between question and each context chunk.
f_context_relevance = (
    Feedback(
        provider.context_relevance_with_cot_reasons, name="Context Relevance"
    )
    .on_input()
    .on(TruChain.select_context(chain_v1))
    .aggregate(np.mean)
)
feedback_functions = [f_answer_relevance, f_context_relevance, f_groundedness]
# wrap with TruLens
tru_chain_recorder_v1 = TruChain(
    chain_v1, app_name="WikipediaQA", app_version="chain_1", feedbacks=feedback_functions
)
Now we can submit queries to our application and have them tracked and evaluated by TruLens.
prompts = [
"Name some famous dental floss brands?",
"Which year did Cincinnati become the Capital of Ohio?",
"Which year was Hawaii's state song written?",
"How many countries are there in the world?",
"How many total major trophies has manchester united won?",
]
with tru_chain_recorder_v1 as recording:
    for prompt in prompts:
        chain_v1(prompt)
Open the TruLens Dashboard to view tracking and evaluations.
from trulens.dashboard import run_dashboard
run_dashboard(session)
# If using a free Pinecone instance, only one index is allowed. Delete the index to make room for the next iteration.
pinecone.delete_index(index_name_v1)
time.sleep(30)  # sleep for 30 seconds after deleting the index before creating a new one
Experimenting with Distance Metrics¶
Now that we’ve walked through the process of building our tracked RAG application using cosine as the distance metric, all we have to do for the next two experiments is rebuild the index with ‘euclidean’ or ‘dotproduct’ as the metric and follow the rest of the steps above as is.
index_name_v2 = "langchain-rag-euclidean"
pinecone.create_index(
    name=index_name_v2,
    metric="euclidean",
    dimension=1536,  # 1536 dim of text-embedding-ada-002
)
index = pinecone.GRPCIndex(index_name_v2)
# wait a moment for the index to be fully initialized
time.sleep(1)
# upsert documents
for batch in dataset.iter_documents(batch_size=100):
    index.upsert(batch)
# switch back to normal index for langchain
index = pinecone.Index(index_name_v2)
# update vectorstore with new index
vectorstore = Pinecone(index, embed.embed_query, text_field)
# recreate the QA chain on top of the new vector store
chain_v2 = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=vectorstore.as_retriever()
)
# wrap with TruLens, reusing the feedback functions defined above
tru_chain_recorder_v2 = TruChain(
    chain_v2, app_name="WikipediaQA", app_version="chain_2", feedbacks=feedback_functions
)
with tru_chain_recorder_v2 as recording:
    for prompt in prompts:
        chain_v2(prompt)
pinecone.delete_index(index_name_v2)
time.sleep(30)  # sleep for 30 seconds after deleting the index before creating a new one
index_name_v3 = "langchain-rag-dot"
pinecone.create_index(
    name=index_name_v3,
    metric="dotproduct",
    dimension=1536,  # 1536 dim of text-embedding-ada-002
)
index = pinecone.GRPCIndex(index_name_v3)
# wait a moment for the index to be fully initialized
time.sleep(1)
index.describe_index_stats()
# upsert documents
for batch in dataset.iter_documents(batch_size=100):
    index.upsert(batch)
# switch back to normal index for langchain
index = pinecone.Index(index_name_v3)
# update vectorstore with new index
vectorstore = Pinecone(index, embed.embed_query, text_field)
# recreate qa from vector store
chain_v3 = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=vectorstore.as_retriever()
)
# wrap with TruLens
tru_chain_recorder_v3 = TruChain(
    chain_v3, app_name="WikipediaQA", app_version="chain_3", feedbacks=feedback_functions
)
with tru_chain_recorder_v3 as recording:
    for prompt in prompts:
        chain_v3(prompt)
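It may help to see what the three distance metrics compute. The sketch below uses small made-up vectors rather than real embeddings; the helper names are introduced here for illustration. It relies on one assumption worth stating: when vectors are normalized to unit length, as OpenAI documents for its embeddings, cosine similarity, dot product, and Euclidean distance all produce the same ranking of neighbors.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


# Toy unit-length vectors standing in for a query and three documents.
query = normalize([1.0, 2.0, 0.5])
docs = {
    "a": normalize([1.0, 1.9, 0.4]),
    "b": normalize([-1.0, 0.2, 3.0]),
    "c": normalize([0.5, 0.5, 0.5]),
}


def rank(score, higher_is_better):
    """Order document ids from best to worst match under the given metric."""
    return sorted(docs, key=lambda k: score(query, docs[k]), reverse=higher_is_better)


rank_cosine = rank(cosine, True)
rank_dot = rank(dot, True)
rank_euclidean = rank(euclidean, False)  # smaller distance is a better match
print(rank_cosine, rank_dot, rank_euclidean)
```

If the rankings coincide, the retrieved context should be the same across metrics, so any difference between the three apps would be expected to show up in latency rather than in evaluation quality.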
With all three experiments run, we can view our evaluations for all three LLM apps sitting on top of the different indices. All three apps are struggling with query-statement relevance. In other words, the context retrieved is only somewhat relevant to the original query.
We can also see that both the euclidean and dot-product metrics performed at a lower latency than cosine at roughly the same evaluation quality. We can move forward with either. Since the dot-product index is the one still loaded in Pinecone (the Euclidean index was deleted above), we'll go with that one.
Diagnosis: Hallucination.
Digging deeper into the query-statement relevance, we notice one problem in particular with the question about famous dental floss brands. The app's answer is correct, but it is not backed up by the retrieved context, which does not mention any specific brands.
Using a less powerful model is a common way to reduce hallucination for some applications. We’ll evaluate text-ada-001 in our next experiment for this purpose.
Changing different components of apps built with frameworks like LangChain is really easy. In this case we just need to swap in ‘text-ada-001’ through LangChain's OpenAI LLM wrapper. Adding in easy evaluation with TruLens allows us to quickly iterate through different components to find our optimal app configuration.
# completion llm
from langchain_community.llms import OpenAI
llm = OpenAI(model_name="text-ada-001", temperature=0)
chain_with_sources = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=vectorstore.as_retriever()
)
# wrap with TruLens
tru_chain_with_sources_recorder = TruChain(
    chain_with_sources,
    app_name="WikipediaQA",
    app_version="chain_4",
    feedbacks=[f_answer_relevance, f_context_relevance],
)
with tru_chain_with_sources_recorder as recording:
    for prompt in prompts:
        chain_with_sources(prompt)
However, this configuration with a less powerful model struggles to return a relevant answer given the context provided. For example, when asked “Which year was Hawaii’s state song written?”, the app retrieves context that contains the correct answer but fails to respond with that answer, instead simply responding with the name of the song.
For the final experiment, we return to gpt-3.5-turbo and restrict the retriever to the top context chunk.
# completion llm
from langchain_community.llms import OpenAI
llm = OpenAI(model_name="gpt-3.5-turbo", temperature=0)
chain_v5 = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=vectorstore.as_retriever(top_k=1)
)
Note: The way top_k works with RetrievalQA is that the documents are still retrieved by our semantic search, but only the top_k are passed to the LLM. However, TruLens captures all of the context chunks that are being retrieved. To calculate an accurate QS Relevance metric that matches what's being passed to the LLM, we need to calculate the relevance of only the top context chunk retrieved.
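The effect described in this note can be illustrated with plain numbers. The scores are hypothetical placeholders; the point is only that averaging over every captured chunk and averaging over the top_k slice can give very different values.

```python
from statistics import mean

# Hypothetical relevance scores for four retrieved chunks, best match first.
retrieved_scores = [0.9, 0.3, 0.2, 0.1]
top_k = 1

# Score if every captured chunk is evaluated.
score_all_chunks = mean(retrieved_scores)
# Score if only the chunks actually passed to the LLM are evaluated.
score_top_k = mean(retrieved_scores[:top_k])

print(score_all_chunks, score_top_k)
```

Slicing with [:1] before aggregating mirrors the selector used in the next cell, so the metric reflects only the context the LLM saw.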
# context relevance computed on the top retrieved chunk only
context_relevance = (
    Feedback(provider.context_relevance, name="Context Relevance")
    .on_input()
    .on(
        Select.Record.app.combine_documents_chain._call.args.inputs.input_documents[
            :1
        ].page_content
    )
    .aggregate(np.mean)
)
# wrap with TruLens, using the top-chunk context relevance defined above
tru_chain_recorder_v5 = TruChain(
    chain_v5,
    app_name="WikipediaQA",
    app_version="chain_5",
    feedbacks=[f_answer_relevance, context_relevance],
)
with tru_chain_recorder_v5 as recording:
    for prompt in prompts:
        chain_v5(prompt)
Our final application has much improved context_relevance, qa_relevance and low latency!