There are many different approaches to chatting with your own data. One is to upload your PDF to OpenAI's ChatGPT cloud, but this comes with several limitations, such as your data being deleted or the risk of ChatGPT leaking it.
You can also download a model and run it locally, but you need to find a good one, and it will probably not be as good as ChatGPT. To combine the strengths of ChatGPT with your own local data, LangChain presents an interesting option. We do, however, need a vector store.
What is a vector store?
A vector store keeps data as numerical encodings called embeddings. This lets us plug our own data into ChatGPT and augment its contextual awareness. If you ask about “Mauritius tourist activities”, it might give a generic response; give it more information and the response becomes more detailed. To do this, you convert your data into vector embeddings and store them in a vector database. Redis, Pinecone (a SaaS offering) and ChromaDB, among others, can serve as vector databases. Fortunately, LangChain does most of the work of connecting a vector store to ChatGPT.
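To make this concrete, here is a minimal sketch, separate from the chat script below, showing how text becomes embeddings and how a query finds the closest match. It uses the same sentence-transformers model and Chroma store as the script; the sample texts are purely illustrative:

from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma

embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# Each text is converted into a vector; Chroma stores the vectors for similarity search
texts = [
    "Mauritius is known for beaches, hiking in Black River Gorges and snorkelling.",
    "The login form crashes when the password field is left empty.",
]
db = Chroma.from_texts(texts, embedding_function)

# The query is embedded the same way and the closest stored text is returned
docs = db.similarity_search("Mauritius tourist activities", k=1)
print(docs[0].page_content)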
To understand more about vector databases, read this first: Understand langChain to Quickly Get Started with Production Codebases
The code
You need to install these packages (sentence-transformers is required by the SentenceTransformerEmbeddings used in the script):
pip install langchain openai chromadb==0.4.15 tiktoken unstructured sentence-transformers
Create a directory called data/ and add your files to it. Then create a file called chat.py.
Your directory should look like this:
./
  data/
    friends.txt
    bugs.txt
  chat.py
You can add whatever information you want to the data folder, PDFs included.
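For example, a hypothetical data/friends.txt could be nothing more than plain notes (the content here is purely illustrative):

Alice: met at university, lives in Curepipe, loves hiking.
Bob: colleague from work, prefers email over calls.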
Then paste the following into chat.py:
import os
import sys
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import DirectoryLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.vectorstores import Chroma
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
os.environ["OPENAI_API_KEY"] = "sk-mykeyhere"
# Enable to save to disk & reuse the model (for repeated queries on the same data)
PERSIST = False
query = None
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
if PERSIST and os.path.exists("persist"):
    print("Reusing index...\n")
    vectorstore = Chroma(
        persist_directory="persist", embedding_function=embedding_function
    )
    index = VectorStoreIndexWrapper(vectorstore=vectorstore)
else:
    # loader = TextLoader("data/data.txt")  # Use this (and import TextLoader) if you only need data.txt
    loader = DirectoryLoader("data/")
    if PERSIST:
        # Use the same embedding function as when reloading from disk above
        index = VectorstoreIndexCreator(
            embedding=embedding_function,
            vectorstore_kwargs={"persist_directory": "persist"},
        ).from_loaders([loader])
    else:
        index = VectorstoreIndexCreator(embedding=embedding_function).from_loaders([loader])
chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    retriever=index.vectorstore.as_retriever(search_kwargs={"k": 1}),
)
chat_history = []
while True:
    if not query:
        query = input("Prompt > ")
    if query in ["quit", "q", "exit"]:
        sys.exit()
    result = chain({"question": query, "chat_history": chat_history})
    print(result["answer"], end="\n\n")
    chat_history.append((query, result["answer"]))
    query = None
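Note that search_kwargs={"k": 1} means only the single most relevant chunk is retrieved and passed to the model with each question. If the answers feel thin, you can raise it, for example index.vectorstore.as_retriever(search_kwargs={"k": 4}), so the model sees more context.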
The PERSIST flag is there so you don't have to re-embed and reload your data into the database on every run; set it to True and the index is saved to the persist/ directory and reused.
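Run it with python chat.py, type your questions at the Prompt > line, and type q, quit or exit to leave. What the model answers depends entirely on what you put in the data folder.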
Notes
If you get an error like:
Traceback (most recent call last):
File "", line 34, in <module>
index = VectorstoreIndexCreator().from_loaders([loader])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "", line 82, in from_loaders
return self.from_documents(docs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "", line 87, in from_documents
vectorstore = self.vectorstore_cls.from_documents(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "site-packages/langchain/vectorstores/chroma.py", line 684, in from_documents
return cls.from_texts(
^^^^^^^^^^^^^^^
File "site-packages/langchain/vectorstores/chroma.py", line 620, in from_texts
chroma_collection = cls(
^^^^
File "site-packages/langchain/vectorstores/chroma.py", line 125, in __init__
self._collection = self._client.get_or_create_collection(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "site-packages/chromadb/api/client.py", line 226, in get_or_create_collection
return self._server.get_or_create_collection(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 127, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "site-packages/chromadb/api/segment.py", line 216, in get_or_create_collection
return self.create_collection( # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "site-packages/chromadb/telemetry/opentelemetry/__init__.py", line 127, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "site-packages/chromadb/api/segment.py", line 190, in create_collection
return Collection(
^^^^^^^^^^^
File "site-packages/chromadb/api/models/Collection.py", line 85, in __init__
validate_embedding_function(embedding_function)
File "site-packages/chromadb/api/types.py", line 210, in validate_embedding_function
raise ValueError(
ValueError: Expected EmbeddingFunction.__call__ to have the following signature:
odict_keys(['self', 'input']), got odict_keys(['self', 'args', 'kwargs'])
Please see https://docs.trychroma.com/embeddings for details of the EmbeddingFunction interface.
Please note the recent change to the EmbeddingFunction interface:
https://docs.trychroma.com/migration#migration-to-0416---november-7-2023
The solution is to pin an older version of chromadb (0.4.15), as we did in the pip install command above.