Large language models are powerful, but they have a fundamental limitation: their knowledge is frozen at training time, and they can hallucinate facts. Retrieval-Augmented Generation (RAG) mitigates both problems by giving the model access to your own documents at query time. Instead of relying solely on memorized knowledge, RAG retrieves relevant context from your data and feeds it to the model along with the question. In this tutorial, we will build a complete RAG application using LangChain, covering document loading, text splitting, embeddings, vector storage, and the retrieval chain that ties everything together.
Understanding the RAG Architecture
A RAG pipeline has two main phases. The ingestion phase processes your documents into searchable chunks: load documents, split them into manageable pieces, generate vector embeddings for each chunk, and store those embeddings in a vector database. The query phase takes a user question, converts it to an embedding, finds the most similar document chunks, and passes those chunks as context to the LLM to generate an answer grounded in your data.
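The data flow of the two phases can be sketched end to end with a toy stand-in for embeddings. The code below uses simple word overlap instead of real vectors, purely to show the shape of the pipeline; the real implementation with OpenAI embeddings and ChromaDB follows in the rest of the tutorial.

```python
def ingest(documents: list[str]) -> list[set]:
    """Ingestion phase (simplified): represent each chunk as its set of words."""
    return [set(doc.lower().split()) for doc in documents]

def retrieve(index: list[set], documents: list[str], question: str, k: int = 1) -> list[str]:
    """Query phase (simplified): rank chunks by word overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(range(len(documents)),
                    key=lambda i: len(index[i] & q_words),
                    reverse=True)
    return [documents[i] for i in ranked[:k]]

docs = [
    "Refunds are issued within 30 days of purchase.",
    "We ship worldwide within 5 business days.",
]
index = ingest(docs)
# The refund question matches the refund chunk, not the shipping chunk
assert retrieve(index, docs, "how do refunds work") == [docs[0]]
```

A real pipeline replaces the word sets with dense embedding vectors and the overlap count with vector similarity, but the two-phase structure is identical.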
Install the required packages to get started:
pip install langchain langchain-openai langchain-community \
    chromadb tiktoken pypdf python-dotenv
import os
from dotenv import load_dotenv
load_dotenv()
# Verify API key is available
assert os.getenv("OPENAI_API_KEY"), "Set OPENAI_API_KEY in your .env file"
Document Loading and Text Splitting
LangChain provides loaders for dozens of document formats. The key challenge after loading is splitting documents into chunks that are small enough for accurate retrieval but large enough to preserve context:
from langchain_community.document_loaders import (
    PyPDFLoader,
    TextLoader,
    DirectoryLoader,
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
def load_documents(source_dir: str) -> list:
    """Load documents from a directory, supporting multiple formats."""
    # Load PDFs
    pdf_loader = DirectoryLoader(
        source_dir,
        glob="**/*.pdf",
        loader_cls=PyPDFLoader,
    )
    pdf_docs = pdf_loader.load()

    # Load text files
    txt_loader = DirectoryLoader(
        source_dir,
        glob="**/*.txt",
        loader_cls=TextLoader,
    )
    txt_docs = txt_loader.load()

    all_docs = pdf_docs + txt_docs
    print(f"Loaded {len(all_docs)} document pages/sections")
    return all_docs
def split_documents(documents: list) -> list:
    """Split documents into chunks optimized for retrieval."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,    # target characters per chunk
        chunk_overlap=200,  # overlap between adjacent chunks
        length_function=len,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    chunks = splitter.split_documents(documents)
    print(f"Split into {len(chunks)} chunks")
    print(f"Average chunk size: {sum(len(c.page_content) for c in chunks) // len(chunks)} chars")
    return chunks
# Load and split
docs = load_documents("./documents")
chunks = split_documents(docs)
# Inspect a chunk
print(f"\nSample chunk:\n{chunks[0].page_content[:200]}...")
print(f"Metadata: {chunks[0].metadata}")
The RecursiveCharacterTextSplitter tries to split on the most natural boundaries first (double newlines, then single newlines, then sentences). The 200-character overlap ensures that context is not lost at chunk boundaries: text near a boundary appears in both adjacent chunks, so a sentence that would otherwise be cut in half survives intact in at least one of them.
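The effect of overlap is easiest to see with a deliberately naive fixed-window splitter. This is not LangChain's actual algorithm (which prefers natural boundaries, as described above), just a sketch of the overlap idea:

```python
def sliding_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Naive fixed-window chunking: each new chunk starts `overlap`
    characters before the previous one ended."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "A" * 50 + " the key sentence " + "B" * 50
chunks = sliding_chunks(text, chunk_size=60, overlap=20)

# The last 20 characters of each chunk reappear at the start of the next,
# so text near a boundary is never seen out of context
assert chunks[0][-20:] == chunks[1][:20]
```

With chunk_size=1000 and chunk_overlap=200 as configured above, each chunk shares roughly its last 200 characters with the next one.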
Creating Embeddings and the Vector Store
Embeddings convert text into numerical vectors that capture semantic meaning. Similar texts produce similar vectors, enabling semantic search. We use OpenAI embeddings and ChromaDB as our vector store:
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
def create_vector_store(chunks: list, persist_dir: str = "./chroma_db"):
    """Create a persistent vector store from document chunks."""
    embeddings = OpenAIEmbeddings(
        model="text-embedding-3-small"  # cost-effective, high quality
    )
    # Create and persist the vector store
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=persist_dir,
        collection_name="rag_documents",
    )
    print(f"Vector store created with {vectorstore._collection.count()} vectors")
    print(f"Persisted to: {persist_dir}")
    return vectorstore
def load_vector_store(persist_dir: str = "./chroma_db"):
    """Load an existing vector store from disk."""
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vectorstore = Chroma(
        persist_directory=persist_dir,
        embedding_function=embeddings,
        collection_name="rag_documents",
    )
    print(f"Loaded vector store with {vectorstore._collection.count()} vectors")
    return vectorstore
# Create the vector store (run once)
vectorstore = create_vector_store(chunks)
# Test similarity search
results = vectorstore.similarity_search("What is the return policy?", k=3)
for i, doc in enumerate(results):
    print(f"\nResult {i+1} (source: {doc.metadata.get('source', 'unknown')}):")
    print(f"  {doc.page_content[:150]}...")
ChromaDB stores vectors locally on disk, which is perfect for development and small-to-medium applications. For production with millions of documents, consider Pinecone, Weaviate, or pgvector. The k=3 parameter in similarity search returns the three most relevant chunks.
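"Most relevant" here means closest in vector space, usually measured with cosine similarity. A minimal illustration with made-up three-dimensional vectors (real text-embedding-3-small vectors have 1,536 dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = identical direction,
    0.0 = unrelated, -1.0 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors standing in for real embeddings
query = [0.9, 0.1, 0.0]           # "What is the return policy?"
chunk_returns = [0.8, 0.3, 0.1]   # a chunk about refunds
chunk_shipping = [0.1, 0.2, 0.9]  # a chunk about shipping times

# The refund chunk points in nearly the same direction as the query
assert cosine_similarity(query, chunk_returns) > cosine_similarity(query, chunk_shipping)
```

The vector store performs exactly this kind of comparison against every stored chunk (using optimized index structures rather than a linear scan) and returns the k closest matches.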
Building the Retrieval Chain
Now we connect the vector store to an LLM through a retrieval chain. When a user asks a question, the chain retrieves relevant context and generates an answer grounded in that context:
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
def build_rag_chain(vectorstore):
    """Build a RAG chain with custom prompting."""
    llm = ChatOpenAI(
        model="gpt-4o",
        temperature=0.0,  # deterministic for factual Q&A
    )
    # Custom prompt that instructs the model to use only retrieved context
    prompt_template = """Use the following context to answer the question.
If the context does not contain enough information to answer,
say "I don't have enough information to answer that question."
Do not make up information that is not in the context.

Context:
{context}

Question: {question}

Answer:"""
    prompt = PromptTemplate(
        template=prompt_template,
        input_variables=["context", "question"],
    )
    # Build the chain
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 4},  # retrieve top 4 chunks
    )
    chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",  # concatenate all retrieved docs into the prompt
        retriever=retriever,
        chain_type_kwargs={"prompt": prompt},
        return_source_documents=True,
    )
    return chain
# Build and test the chain
rag_chain = build_rag_chain(vectorstore)
def ask(question: str) -> str:
    """Ask a question and display the answer with sources."""
    result = rag_chain.invoke({"query": question})
    print(f"Question: {question}")
    print(f"Answer: {result['result']}")
    print(f"\nSources ({len(result['source_documents'])}):")
    for doc in result["source_documents"]:
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "n/a")
        print(f"  - {source} (page {page})")
    return result["result"]
ask("What are the main features of the product?")
ask("How do I request a refund?")
The custom prompt is critical. Without explicit instructions to only use the provided context, the model may blend its training knowledge with retrieved documents, which defeats the purpose of RAG. Setting temperature=0.0 further reduces hallucination for factual question-answering.
Adding Conversation Memory
For a chatbot experience, add conversation memory so users can ask follow-up questions that reference earlier parts of the conversation:
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferWindowMemory
def build_conversational_rag(vectorstore):
    """Build a RAG chain with conversation memory."""
    llm = ChatOpenAI(model="gpt-4o", temperature=0.0)
    memory = ConversationBufferWindowMemory(
        memory_key="chat_history",
        return_messages=True,
        output_key="answer",
        k=5,  # remember the last 5 exchanges
    )
    retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
    chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=retriever,
        memory=memory,
        return_source_documents=True,
        verbose=False,
    )
    return chain
# Interactive conversation loop
conv_chain = build_conversational_rag(vectorstore)
def chat(question: str) -> str:
    """Send a conversational query."""
    result = conv_chain.invoke({"question": question})
    print(f"User: {question}")
    print(f"Assistant: {result['answer']}\n")
    return result["answer"]
# Multi-turn conversation example
chat("What products do you offer?")
chat("Tell me more about the first one you mentioned.")
chat("What is the pricing for that?")
The ConversationBufferWindowMemory keeps the last five exchanges in memory. The chain automatically reformulates follow-up questions using conversation history, so “Tell me more about the first one” gets expanded into a self-contained query before retrieval.
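Conceptually, the reformulation step amounts to formatting the chat history and the follow-up into a prompt and asking the LLM for a standalone question. The sketch below is a simplified illustration of that idea, not the chain's actual internals (ConversationalRetrievalChain uses its own built-in condense-question prompt):

```python
CONDENSE_TEMPLATE = (
    "Given the conversation below, rewrite the follow-up question "
    "as a standalone question.\n\n"
    "Chat history:\n{history}\n\n"
    "Follow-up question: {question}\n"
    "Standalone question:"
)

def build_condense_prompt(history: list[tuple[str, str]], question: str) -> str:
    """Format chat history and a follow-up question into the prompt that
    would be sent to the LLM before retrieval."""
    lines = [f"User: {q}\nAssistant: {a}" for q, a in history]
    return CONDENSE_TEMPLATE.format(history="\n".join(lines), question=question)

prompt = build_condense_prompt(
    [("What products do you offer?", "We offer the Alpha and Beta plans.")],
    "Tell me more about the first one.",
)
# Given this prompt, the LLM would produce something self-contained,
# e.g. a question about the Alpha plan, which is then used for retrieval
```

(The plan names above are made up for illustration.) Because retrieval runs on the rewritten question rather than the literal follow-up, the vector search can find relevant chunks even when the user's message contains only pronouns and references.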
Conclusion
We have built a complete RAG application from scratch: loading documents in multiple formats, splitting them into retrieval-optimized chunks, generating embeddings, storing them in a vector database, building a retrieval chain with a grounded prompt, and adding conversation memory for multi-turn interactions. RAG is the most practical pattern for building AI applications that need access to private, up-to-date, or domain-specific information. From here, you can improve the pipeline by experimenting with different chunk sizes, adding metadata filtering to narrow search scope, implementing hybrid search combining keyword and semantic retrieval, or using reranking models to improve the relevance of retrieved documents.
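As a taste of one of those extensions, metadata filtering is conceptually just a predicate applied alongside similarity ranking. The sketch below is pure Python over hypothetical chunk dictionaries; with the Chroma retriever above, the equivalent is passing a filter in search_kwargs, e.g. search_kwargs={"k": 4, "filter": {"source": "policy.pdf"}} (the filename is made up for illustration).

```python
def filter_by_metadata(chunks: list[dict], **criteria) -> list[dict]:
    """Keep only chunks whose metadata matches every given key/value pair,
    narrowing the search scope before similarity ranking."""
    return [
        c for c in chunks
        if all(c["metadata"].get(k) == v for k, v in criteria.items())
    ]

# Hypothetical chunks with the kind of metadata the loaders above attach
chunks = [
    {"text": "Refunds within 30 days", "metadata": {"source": "policy.pdf", "page": 2}},
    {"text": "We ship in 5 days", "metadata": {"source": "shipping.pdf", "page": 1}},
]
assert filter_by_metadata(chunks, source="policy.pdf")[0]["text"] == "Refunds within 30 days"
```

Restricting search to the right subset of documents often improves answer quality more cheaply than any change to the model itself.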