Introduction
What’s up everyone! This blog is a tutorial on how to integrate the pgvector docker image with a langchain project to use it as a vector database. For this tutorial I use Google’s embedding model to embed the data and the Gemini-1.5-flash model to generate responses. This blog will guide you through all the important files required for this purpose.
Step 1: Set up pgvector’s docker image
Create a docker-compose.yml file to list the pgvector docker image and pass all the necessary parameters to set it up.
services:
  db:
    image: pgvector/pgvector:pg16
    restart: always
    env_file:
      - pgvector.env
    ports:
      - "5432:5432"
    volumes:
      - pg_data:/var/lib/postgresql/data

volumes:
  pg_data:
By default, Postgres serves on port 5432 inside the container, and here it is mapped to the same port on the local machine; this can be updated as needed. Likewise, the name of the volume can be changed as needed.
Additionally, create a pgvector.env file to list all the environment variables required by the docker image.
POSTGRES_USER=pgvector_user
POSTGRES_PASSWORD=pgvector_passwd
POSTGRES_DB=pgvector_db
Again, you can assign any value to these variables. FYI: once the volume is created, the database can only be accessed using these values, unless you delete the existing volume and create a new one.
This brings us to the end of the first step: setting up pgvector’s docker image.
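With both files in place, you can start the database container by running docker compose up -d from the same directory; the first run pulls the pgvector/pgvector:pg16 image, and the data persists in the pg_data volume across restarts.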
Step 2: Create a function that returns the vector DB instance
Create a db.py file. This will contain a function that declares a pgvector db instance that can be used to work with the database.
Below are the required dependencies, which can be installed using pip install [dependency]:
python-dotenv
langchain_google_genai
langchain_postgres
psycopg
psycopg[binary]
After installing the dependencies, create the db.py file as follows.
import os
from dotenv import load_dotenv
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_postgres import PGVector

load_dotenv()

def get_vector_store():
    # Get Gemini API key from environment variables.
    gemini_api_key = os.getenv("GEMINI_API_KEY")

    # Get DB credentials from environment variables.
    postgres_collection = os.getenv("POSTGRES_COLLECTION")
    postgres_connection_string = os.getenv("POSTGRES_CONNECTION_STRING")

    # Initiate Gemini's embedding model.
    embedding_model = GoogleGenerativeAIEmbeddings(
        model="models/embedding-001",
        google_api_key=gemini_api_key
    )

    # Initiate pgvector by passing the connection details and embedding model.
    vector_store = PGVector(
        embeddings=embedding_model,
        collection_name=postgres_collection,
        connection=postgres_connection_string,
        use_jsonb=True,
    )
    return vector_store
Before running this file, we need to add one more file, .env, to store the project’s environment variables. Therefore, create the .env file as follows.
GEMINI_API_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXXX
POSTGRES_COLLECTION=pgvector_documents
POSTGRES_CONNECTION_STRING=postgresql+psycopg://pgvector_user:pgvector_passwd@localhost:5432/pgvector_db
This file contains your LLM’s API key. Furthermore, for POSTGRES_COLLECTION you can again give any value. However, POSTGRES_CONNECTION_STRING consists of the credentials listed in pgvector.env. Its structure is as follows: postgresql+psycopg://{POSTGRES_USER}:{POSTGRES_PASSWORD}@{URL on which DB is serving}/{POSTGRES_DB}. For the URL we use localhost:5432 because we are running the docker image and have mapped the local machine’s port 5432 to the container’s port 5432.
Once the .env file is set up, let me walk you through the logic of the get_vector_store function in db.py. First, we read the values of the declared variables from the .env file. Second, we declare an instance of the embedding model; here I’m using Google’s embedding model, but you can use any. Finally, we declare the PGVector instance by passing it the embedding model instance, the collection name, and the connection string, and we return this PGVector instance. This brings us to the end of step 2.
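As a quick sanity check, and purely as a minimal sketch assuming the container from step 1 is running and the .env values match it, you can import the function in a throwaway script; PGVector typically connects during initialization, so a wrong connection string or wrong credentials usually fail right here.
from db import get_vector_store

vector_store = get_vector_store()
# If get_vector_store() returned without raising, the connection and credentials are fine.
print(vector_store)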
Step 3: The main file
In this step we will create the main file, app.py. It will read the content from a file and store it in the vector database. Furthermore, when the user passes a query, it will fetch the relevant data chunks from the vector database, pass the query along with those chunks to the LLM, and print the resulting response. Below are the required imports for this file.
# app.py
import PyPDF2
from langchain.schema import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from db import get_vector_store
from langchain.prompts import PromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI
import os
from dotenv import load_dotenv
load_dotenv()
Below are the dependencies required by this file, in addition to those required by db.py:
PyPDF2  # Only if you plan to extract data from a pdf.
langchain
langchain_text_splitters
langchain_google_genai
This file has 4 functions, 2 of which are the important functions of this tutorial, namely store_data and get_relevant_chunk. So here is a detailed description of these two, followed by a brief description of the other 2 functions.
Store data
# app.py
def store_data(data):
    # Step 1: Convert the data into Document type.
    document_data = Document(page_content=data)

    # Step 2: Split the data into chunks.
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    documents = text_splitter.split_documents([document_data])

    # Step 3: Get the instance of vector db and store the data.
    vector_db = get_vector_store()
    if not vector_db:
        return
    vector_db.add_documents(documents)
Purpose
The purpose of this function is to store the received data into a vector database. Let us understand the steps involved in doing so.
Logic
- Step 1. The data it receives is of string type, so it is first converted to Document type. This Document is imported from langchain.schema.
- Step 2. Once the data has been converted into the required type, the next step is to split the data into chunks. To split the data we use RecursiveCharacterTextSplitter from langchain_text_splitters. chunk_size is set to 1000, which means each chunk will contain 1000 characters. Also, chunk_overlap set to 100 means that the last 100 characters of one chunk will also be the first 100 characters of the next chunk. This ensures that each chunk keeps enough context when passed to the LLM. Using this splitter, the data is split into chunks (a small standalone snippet after this list shows the overlap in action).
- Step 3. The next step is to get an instance of the vector database using the get_vector_store function from db.py. Finally, the vector database’s add_documents method is called with the split data, which stores it in the vector database.
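To see the chunking behaviour in isolation, here is a minimal standalone sketch, separate from app.py; the sample text and sizes are made up for illustration and kept small so the overlap is easy to spot.
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Illustrative text only; real documents will be much longer.
sample_text = "pgvector stores embeddings inside Postgres. " * 10
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
chunks = splitter.split_text(sample_text)
for i, chunk in enumerate(chunks):
    # Each chunk is at most 100 characters, and neighbouring chunks share up to 20 characters.
    print(f"Chunk {i} ({len(chunk)} chars): {chunk!r}")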
Get relevant chunk
def get_relevant_chunk(user_query):
    # Step 1: Get the instance of vector db.
    vector_db = get_vector_store()
    if not vector_db:
        return

    # Step 2: Get the relevant chunk of data.
    documents = vector_db.similarity_search(user_query, k=2)

    # Step 3: Convert the data from list type to string type and return it.
    relevant_chunk = " ".join([d.page_content for d in documents])
    return relevant_chunk
Purpose
The purpose of this function is to retrieve the relevant chunks of stored data from the vector database using the user’s query.
Logic
- Step 1. Get an instance of the vector database using the get_vector_store function from db.py.
- Step 2. Call the vector db’s similarity_search method, passing the user_query and setting k. k represents the required number of chunks; if set to 2, it will return the 2 chunks most relevant to the user’s query. It can be set according to project requirements (a scored variant is sketched after this list).
- Step 3. The relevant chunks received from the vector store come back as a list, so Python’s .join method is used to turn them into a single string. Finally, the relevant data is returned.
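If you also want to see how close each match is, PGVector, like most LangChain vector stores, also provides similarity_search_with_score. A minimal sketch, assuming some documents have already been stored, might look like this; by default the score is a distance, so lower values mean closer matches.
from db import get_vector_store

vector_db = get_vector_store()
# Returns (Document, score) pairs instead of bare Documents.
results = vector_db.similarity_search_with_score("Where does Dev currently work?", k=2)
for doc, score in results:
    print(score, doc.page_content[:80])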
Perfect. These are the two important functions of this file for this tutorial. Below are the two remaining functions; however, since this tutorial is about pgvector and the vector db, I will only quickly walk you through their logic.
def get_document_content(document_path):
    pdf_text = ""
    # Load the document.
    with open(document_path, "rb") as file:
        pdf_reader = PyPDF2.PdfReader(file)
        # Read and return the document content.
        for page_num in range(len(pdf_reader.pages)):
            page = pdf_reader.pages[page_num]
            pdf_text += page.extract_text()
    return pdf_text
def prompt_llm(user_query, relevant_data):
    # Initiate a prompt template.
    prompt_template = PromptTemplate(
        input_variables=["user_query", "relevant_data"],
        template="""
        You are a knowledgeable assistant trained to answer questions based on specific content provided to you. Below is the content you should use to respond, followed by a user's question. Do not include information outside the given content. If the question cannot be answered based on the provided content, respond with "I am not trained to answer this."

        Content: {relevant_data}

        User's Question: {user_query}
        """
    )

    # Initiate LLM instance.
    gemini_api_key = os.getenv("GEMINI_API_KEY")
    llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", google_api_key=gemini_api_key)

    # Chain the template and LLM.
    chain = prompt_template | llm

    # Invoke the chain by passing the input variables of the prompt template.
    response = chain.invoke({
        "user_query": user_query,
        "relevant_data": relevant_data
    })

    # Return the generated response.
    return response.content
The first function, get_document_content, basically takes the path to a pdf, opens it, reads it, and returns its contents as a string.
The second function, prompt_llm, accepts the user’s query and the relevant data chunks. It declares a prompt template that lists the instructions for the LLM and takes the user’s query and the relevant data chunks as input variables. It then initiates an instance of the LLM by passing the required parameters, chains the LLM with the prompt template, invokes the chain by passing values for the prompt template’s input variables, and finally returns the generated response.
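As a quick illustration of what prompt_llm returns, here is a minimal standalone sketch with a hard-coded, made-up chunk of relevant data; since the function returns response.content, the caller gets a plain string it can print directly.
# Hypothetical standalone test of prompt_llm; the relevant_data below is made up for illustration.
answer = prompt_llm(
    user_query="What is pgvector?",
    relevant_data="pgvector is a Postgres extension for storing and querying vector embeddings."
)
print(answer)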
Finally, once these 4 utility functions are declared, we declare the main block of the file, which calls these utility functions to perform the required operations.
if __name__ == "__main__":
    # Get document content.
    document_content = get_document_content("resume.pdf")
    # Store the data in vector db.
    store_data(document_content)
    # Declare a variable holding the user's query.
    user_query = "Where does Dev currently work?"
    # Get relevant chunk of data for solving the query.
    relevant_chunk = get_relevant_chunk(user_query)
    # Prompt LLM to generate the response.
    generated_response = prompt_llm(user_query, relevant_chunk)
    # Print the generated response.
    print(generated_response)
In this tutorial, I use a resume PDF as the data. The main function first obtains the PDF’s content as a string, stores that data in the vector db, declares a variable containing the user’s query, fetches the relevant chunks by passing that query, generates a response by passing the query and the relevant chunks to the LLM, and finally prints the response.
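Assuming resume.pdf sits next to app.py and the pgvector container from step 1 is running, the whole pipeline can be run with python app.py: the document is embedded and stored, and the answer to the sample query is printed.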
For a more detailed walkthrough of this tutorial, watch this video.
Last words
This was a tutorial on how to integrate pgvector’s docker image with a langchain project to use it as a vector database. If you have any feedback or questions, please let me know. I’ll be happy to answer.