How to integrate pgvector’s Docker image with Langchain?
December 21, 2024



Introduction

What’s up everyone! This blog is a tutorial on how to integrate the docker image of pgvector with a LangChain project to use it as a vector database. For this tutorial I use Google’s embedding model to embed data and the Gemini-1.5-flash model to generate responses. This blog will guide you through all the important files required for this purpose.


Step 1: Set up pgvector’s docker image

Create a docker-compose.yml file to list the pgvector docker image and pass all the necessary parameters to set it up.

services:
  db:
    image: pgvector/pgvector:pg16
    restart: always
    env_file:
      - pgvector.env
    ports:
      - "5432:5432"
    volumes:
      - pg_data:/var/lib/postgresql/data

volumes:
  pg_data:

By default, pgvector serves on port 5432 inside the container, and here that port is mapped to the same port 5432 on the local machine. This mapping can be updated as needed. Likewise, the name of the volume can be changed as needed.

Additionally, create a pgvector.env file to list all the environment variables required by the docker image.

POSTGRES_USER=pgvector_user
POSTGRES_PASSWORD=pgvector_passwd
POSTGRES_DB=pgvector_db

Again, you can assign any value to these variables. FYI: once the volume is created, the database can only be accessed using these values, unless you delete the existing volume (for example with docker compose down -v) and create a new one.

This completes the first step of setting up pgvector’s docker image. You can now start the container by running docker compose up -d from the directory containing docker-compose.yml.


Step 2: Function to get the vector DB instance

Create a db.py file. This will contain a function that declares a pgvector DB instance, which can then be used to work with the database.

Below are the required dependencies, which can be installed using pip install [dependency]:

python-dotenv
langchain_google_genai
langchain_postgres
psycopg
psycopg[binary]

After installing the dependencies, create the db.py file as follows.

import os
from dotenv import load_dotenv
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_postgres import PGVector

load_dotenv()

def get_vector_store():
    # Get Gemini API Keys from environment variables.
    gemini_api_key = os.getenv("GEMINI_API_KEY")
    # Get DB credentials from environment variables.
    postgres_collection = os.getenv("POSTGRES_COLLECTION")
    postgres_connection_string = os.getenv("POSTGRES_CONNECTION_STRING")

    # Initiate Gemini's embedding model.
    embedding_model = GoogleGenerativeAIEmbeddings(
                        model="models/embedding-001", 
                        google_api_key=gemini_api_key
                    )

    # Initiate pgvector by passing the environment variables and embeddings model.
    vector_store = PGVector(
                    embeddings=embedding_model,
                    collection_name=postgres_collection,
                    connection=postgres_connection_string,
                    use_jsonb=True,
                )

    return vector_store

Before running this file, we need to add one more file, a .env file, to store the project’s environment variables. Therefore, create a .env file:

GEMINI_API_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXXX

POSTGRES_COLLECTION=pgvector_documents
POSTGRES_CONNECTION_STRING=postgresql+psycopg://pgvector_user:pgvector_passwd@localhost:5432/pgvector_db

This file contains your LLM’s API key. For POSTGRES_COLLECTION, you can again use any value. However, POSTGRES_CONNECTION_STRING is built from the credentials listed in pgvector.env. Its structure is as follows: postgresql+psycopg://{POSTGRES_USER}:{POSTGRES_PASSWORD}@{URL on which DB is serving}/{POSTGRES_DB}. For the URL we use localhost:5432, because we are running the docker image with the local machine’s port 5432 mapped to the container’s port 5432.
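If you would like to confirm that the database is reachable before wiring up LangChain, the minimal check below with plain psycopg is one way to do it. This is an optional sketch, assuming the credentials from pgvector.env above; note that the +psycopg suffix is a SQLAlchemy driver marker used by langchain_postgres, while raw psycopg expects the bare postgresql:// form.

# check_db.py -- optional sketch to confirm the Postgres container is reachable.
import psycopg

# Plain psycopg uses the bare postgresql:// scheme (no "+psycopg" driver marker).
conn_info = "postgresql://pgvector_user:pgvector_passwd@localhost:5432/pgvector_db"

with psycopg.connect(conn_info) as conn:
    # Confirm the server responds and print its version.
    version = conn.execute("SELECT version();").fetchone()[0]
    print("Connected:", version)

    # Ensure the pgvector extension can be enabled in this database.
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    print("pgvector extension is ready.")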

Once the .env file is set up, let me walk you through the logic of db.py’s get_vector_store function. First, we read the values of the variables declared in the .env file. Second, we declare an instance of the embedding model; here I’m using Google’s embedding model, but you can use any. Finally, we declare a PGVector instance by passing the embedding model instance, the collection name, and the connection string, and we return this PGVector instance. This completes step 2.
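Optionally, before moving on to step 3, you can smoke-test this function with a couple of throwaway texts. This is a minimal sketch, assuming your GEMINI_API_KEY and connection string are valid:

# smoke_test.py -- optional check that get_vector_store() can embed, store, and search.
from db import get_vector_store

store = get_vector_store()

# Embed and store two tiny texts, then search for the closest one.
store.add_texts([
    "pgvector stores embeddings inside Postgres.",
    "LangChain talks to it through the PGVector class.",
])

results = store.similarity_search("Where are embeddings stored?", k=1)
print(results[0].page_content)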


Step 3: Main file

In this step we will create the main file, app.py. It will read the content from a file and store it in the vector database. Then, when the user passes a query, it will fetch the relevant data chunks from the vector database, pass the query along with those chunks to the LLM, and print the resulting response. Below are the required imports for this file.

# app.py
import PyPDF2
from langchain.schema import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from db import get_vector_store
from langchain.prompts import PromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI
import os
from dotenv import load_dotenv

load_dotenv()

In addition to the dependencies required by db.py, this file needs the following dependencies.

PyPDF2  # Only if you plan to extract data from a PDF.
langchain
langchain_text_splitters
langchain_google_genai

This file has 4 functions, 2 of which are the important ones for this tutorial, namely store_data and get_relevant_chunk. Here is a detailed description of these two functions, followed by a brief description of the other 2.


Store data

# app.py
def store_data(data):
    # Step 1: Converting the data into Document type.
    document_data = Document(page_content=data)

    # Step 2: Splitting the data into chunks.
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

    documents = text_splitter.split_documents([document_data])

    # Step 3: Get the instance of vector db and store the data.
    vector_db = get_vector_store()

    if not vector_db: return

    vector_db.add_documents(documents)

Purpose
The purpose of this function is to store the received data into a vector database. Let us understand the steps involved in doing so.

Logic

  • Step 1. The data this function receives is of string type, so it is converted to the Document type. Document is imported from langchain.schema.

  • Step 2. Once the data has been converted into the required type, the next step is to split it into chunks. To split the data we use RecursiveCharacterTextSplitter from langchain_text_splitters. The splitter’s chunk_size is set to 1000, which means each chunk will contain up to 1000 characters. Also, a chunk_overlap of 100 means the last 100 characters of one chunk will also be the first 100 characters of the next chunk. This ensures each chunk carries the right context when passed to the LLM. Using this splitter, the data is split into chunks (see the standalone sketch after this list).

  • Step 3. The next step is to get an instance of the vector database using the get_vector_store function from db.py. Finally, the vector database’s add_documents method is called with the split documents to store them in the vector database.
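If you want to see the splitter’s behaviour in isolation, here is a small standalone sketch (independent of the vector store, using dummy text) that splits a string and prints the chunk sizes:

# splitter_demo.py -- illustrates chunk_size / chunk_overlap on dummy text.
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = "word " * 500  # roughly 2500 characters of dummy text

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(text)

# Consecutive chunks share roughly 100 characters of overlapping text.
for i, chunk in enumerate(chunks):
    print(f"chunk {i}: {len(chunk)} characters")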


Get relevant chunk

def get_relevant_chunk(user_query):
    # Step 1: Get the instance of vector db.
    vector_db = get_vector_store()

    # Step 2: Get the relevant chunk of data.
    if not vector_db: return

    documents = vector_db.similarity_search(user_query, k=2)

    # Step 3: Convert the data from array type to string type and return it.
    relevant_chunk = " ".join([d.page_content for d in documents])

    return relevant_chunk

Purpose
The purpose of this function is to retrieve the relevant chunks of data from the data stored in the vector database, based on the user’s query.

Logic

  • Step 1. Get an instance of the vector database using the get_vector_store function from db.py.

  • Step 2. Call the vector DB’s similarity_search method by passing the user_query and setting k. k represents the required number of chunks; if set to 2, it returns the 2 most relevant chunks for the user’s query. It can be set according to project requirements (a scored variation is sketched after this list).

  • Step 3. The relevant chunks returned by the vector store come back as a list, so Python’s .join method is used to turn them into a single string. Finally, the relevant data is returned.
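If you also want to see how close each chunk is to the query, PGVector exposes similarity_search_with_score, which returns (Document, score) pairs. The helper below is a hypothetical variation of get_relevant_chunk, not part of the tutorial’s code, shown only to illustrate the idea:

# Hypothetical variation of get_relevant_chunk that also surfaces distance scores.
from db import get_vector_store

def get_relevant_chunk_with_scores(user_query, k=2):
    vector_db = get_vector_store()

    # Each result is a (Document, score) tuple; lower scores mean closer matches.
    results = vector_db.similarity_search_with_score(user_query, k=k)

    for document, score in results:
        print(f"score={score:.4f} | {document.page_content[:80]}")

    return " ".join(document.page_content for document, _ in results)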

Perfect. Those are the two functions of this file that matter most for this tutorial. Below are two more functions from this file; however, since this tutorial is about pgvector and the vector DB, I will only quickly walk you through their logic.

def get_document_content(document_path):
    pdf_text = ""

    # Load the document.
    with open(document_path, "rb") as file:
        pdf_reader = PyPDF2.PdfReader(file)

        # Read and return the document content.
        for page_num in range(len(pdf_reader.pages)):
            page = pdf_reader.pages[page_num]
            pdf_text += page.extract_text()

    return pdf_text

def prompt_llm(user_query, relevant_data):
    # Initiate a prompt template.
    prompt_template = PromptTemplate(
        input_variables=["user_query", "relevant_data"], 
        template= """
        You are a knowledgeable assistant trained to answer questions based on specific content provided to you. Below is the content you should use to respond, followed by a user's question. Do not include information outside the given content. If the question cannot be answered based on the provided content, respond with "I am not trained to answer this."

        Content: {relevant_data}

        User's Question: {user_query}
        """
        )

    # Initiate LLM instance.
    gemini_api_key = os.getenv("GEMINI_API_KEY")

    llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", google_api_key=gemini_api_key)

    # Chain the template and LLM
    chain = prompt_template | llm

    # Invoke the chain by passing the input variables of prompt template.
    response = chain.invoke({
        "user_query":user_query,
        "relevant_data": relevant_data
        })

    # Return the generated response.
    return response.content

The first function, get_document_content, takes the path to the PDF, opens it, reads it, and returns its contents.

The second function, prompt_llm, accepts the user’s query and the relevant data chunks. It initializes a prompt template that holds the instructions for the LLM and takes the user’s query and the relevant data chunks as input variables. It then initializes an instance of the LLM model by passing the required parameters, chains the prompt template with the LLM, invokes the chain by passing the values of the prompt template’s input variables, and finally returns the generated response.
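To see exactly what text the LLM receives, you can render the template on its own with .format before invoking the chain. This is purely for inspection and uses made-up inputs:

# prompt_preview.py -- render the prompt template on its own to inspect the final text.
from langchain.prompts import PromptTemplate

prompt_template = PromptTemplate(
    input_variables=["user_query", "relevant_data"],
    template="Content: {relevant_data}\n\nUser's Question: {user_query}",
)

preview = prompt_template.format(
    user_query="Where does Dev currently work?",  # made-up query
    relevant_data="Dev works at Example Corp as a software engineer.",  # made-up content
)
print(preview)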

Finally, once these 4 utility functions are declared, we declare the main block of the file, which calls these utility functions to perform the required operations.

if __name__ == "__main__":
    # Get document content.
    document_content = get_document_content("resume.pdf")

    # Store the data in vector db.
    store_data(document_content)

    # Declare a variable having user's query.
    user_query = "Where does Dev currently works at?"

    # Get relevant chunk of data for solving the query.
    relevant_chunk = get_relevant_chunk(user_query)

    # Prompt LLM to generate the response.
    generated_response = prompt_llm(user_query, relevant_chunk)

    # Print the generated response.
    print(generated_response)

In this tutorial, I use a resume PDF as the data. The main block first obtains the PDF’s content as a string, stores that data in the vector DB, declares a variable containing the user’s query, fetches the relevant data chunks by passing the user’s query, generates a response by passing the query and the relevant chunks to the LLM, and finally prints the LLM’s response.

For a more detailed walkthrough of this tutorial, watch this video.



Final words

This was a tutorial on how to integrate pgvector’s docker image with a LangChain project to use it as a vector database. If you have any feedback on this or have any questions, please let me know. I’ll be happy to answer.

