LLMs Know Everything About 2024. They Know Nothing About Your Company.
GPT-4 can tell you the GDP of Moldova. It can write a sonnet about quantum entanglement. Ask it who won the 2023 Cricket World Cup, and it’ll probably nail it. But ask it what your company’s refund policy says about returns after 30 days? Blank stare. Ask it which internal document covers the onboarding process for new hires in your Pune office? Total silence.
Here’s the uncomfortable truth about large language models in mid-2025: they’re simultaneously brilliant and useless. Brilliant at general knowledge. Useless at your knowledge. Your internal wikis, product manuals, HR policies, customer support logs — none of that made it into their training data. And even if it somehow did, it’d be months or years stale by the time you’re querying the model.
So what do you do? You could fine-tune a model on your data. Expensive. Slow. Requires ML expertise most teams don’t have. And every time your documents change, you’d need to retrain. Or you could do something cleverer: give the LLM a reading list at query time. Fetch the relevant pages, hand them over, and let the model answer based on what it just read. Not what it memorized during training. What it found in your documents, right now, for this specific question.
That approach has a name. Retrieval-Augmented Generation — RAG. And honestly? It’s become the single most practical pattern for building AI apps that need private, current, or domain-specific knowledge. I’ve seen teams blow weeks fine-tuning models when a RAG pipeline would’ve solved the problem in an afternoon. Not always, but more often than the fine-tuning evangelists want to admit.
In this tutorial, we’ll build one from scratch. Documents in, intelligent answers out. LangChain handles the orchestration, OpenAI provides the embeddings and LLM, and ChromaDB stores our vectors locally. By the end, you’ll have a working system that can answer questions about your own files — PDFs, text documents, whatever you throw at it — with source citations and conversation memory.
Let’s get into it.
The Architecture: What Actually Happens Inside a RAG Pipeline
Before writing any code, you should understand what we’re building. A RAG pipeline operates in two distinct phases, and confusing them causes most beginner mistakes.
Phase 1: Ingestion (done once, or when documents change). Your raw documents — PDFs, text files, Notion exports, whatever — get loaded, chopped into overlapping chunks, converted into numerical vectors (embeddings), and stored in a vector database. Think of it as building a highly specialized search index, except instead of matching keywords, it understands meaning.
Phase 2: Query (every time a user asks something). The user’s question gets converted to an embedding using the same model. The vector database finds the chunks whose embeddings are closest to the question’s embedding. Those chunks — maybe three, maybe five — get stuffed into the LLM’s prompt as context. The LLM reads them and generates an answer grounded in your actual documents.
The beauty of this architecture? Your LLM never needs retraining. Update a document, re-run the ingestion, and the next query automatically picks up the changes. No GPU clusters. No training runs. Just updated embeddings in your vector store.
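To make the two phases concrete before any real libraries show up, here’s a minimal, dependency-free sketch. It’s a toy under stated assumptions: `embed` is a hypothetical stand-in that turns text into a set of words instead of calling an embedding model, and `ingest` and `query` are illustrative names, not LangChain APIs.

```python
def embed(text: str) -> set[str]:
    """Toy stand-in for an embedding model: the set of lowercase words."""
    return set(text.lower().replace(".", "").replace("?", "").split())

def similarity(a: set[str], b: set[str]) -> float:
    """Jaccard overlap between word sets; a real system compares dense vectors."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def ingest(chunks: list[str]) -> list[tuple[str, set[str]]]:
    """Phase 1: embed every chunk once and keep (text, vector) pairs as the index."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def query(index: list[tuple[str, set[str]]], question: str, k: int = 2) -> list[str]:
    """Phase 2: embed the question the same way and return the k closest chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: similarity(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Phase 1 runs once; phase 2 runs for every question
index = ingest([
    "Refunds are accepted within 30 days of purchase.",
    "New hires in the Pune office complete onboarding in week one.",
    "The cafeteria is open from 9 to 5.",
])
top = query(index, "how do refunds work after purchase", k=1)
print(top)
```

Swapping a document means calling `ingest` again on the new text. Nothing about `query` changes, which is the whole appeal.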
Setting Up: Dependencies and API Keys
We need LangChain for the pipeline glue, OpenAI for embeddings and the LLM, ChromaDB as our vector database, and a few utilities for document parsing. Install everything in one shot:
pip install langchain langchain-openai langchain-community \
chromadb tiktoken pypdf python-dotenv
Quick note on cost. OpenAI’s text-embedding-3-small model costs roughly $0.02 per million tokens as of early 2025. Embedding a 100-page PDF probably runs you less than a cup of chai. The gpt-4o calls during query time are pricier, but for internal tools, you’re still looking at pennies per question unless you’re hammering it with thousands of queries daily.
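If you want to sanity-check that claim for your own documents, the arithmetic is short. The sketch below assumes about 500 words per page and about 0.75 words per token. Both are rough rules of thumb I’m plugging in for illustration, not measured values, so treat the result as an order-of-magnitude estimate.

```python
PRICE_PER_MILLION_TOKENS = 0.02  # USD, the text-embedding-3-small rate quoted above
WORDS_PER_PAGE = 500             # assumption: a fairly dense page
WORDS_PER_TOKEN = 0.75           # assumption: common rule of thumb for English text

def embedding_cost_usd(pages: int) -> float:
    """Estimate the one-time cost of embedding a document with `pages` pages."""
    tokens = pages * WORDS_PER_PAGE / WORDS_PER_TOKEN
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

cost = embedding_cost_usd(100)
print(f"~${cost:.4f} to embed 100 pages")
```

Under those assumptions a 100-page PDF lands well under one cent. Chunk overlap means some text gets embedded twice, so the real figure will be somewhat higher, but not enough to change the conclusion.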
Create a .env file in your project root with your OpenAI API key, then load it:
import os
from dotenv import load_dotenv
load_dotenv()
# Verify API key is available
assert os.getenv("OPENAI_API_KEY"), "Set OPENAI_API_KEY in your .env file"
If you’re working at a company with data sensitivity concerns (and most companies should have some), consider using an Azure OpenAI deployment instead. Same models, but the data stays within your Azure tenant. LangChain supports both with a simple config swap.
Loading Documents and Splitting Them Into Chunks
Here’s where most RAG tutorials gloss over the hard part: chunk strategy. Sure, loading files is mechanical. LangChain has loaders for PDFs, text, CSVs, Notion exports, HTML, Markdown — dozens of formats. The interesting decision is how you break those documents apart.
Why split at all? Because embedding models and LLM context windows have limits. You can’t embed an entire 200-page HR manual as one vector. Even if you could, the retrieval would be useless — searching across entire documents gives you vague matches instead of precise ones. You need granular chunks: small enough that each one covers a single concept, large enough that it isn’t a meaningless sentence fragment.
from langchain_community.document_loaders import (
    PyPDFLoader,
    TextLoader,
    DirectoryLoader
)
from langchain.text_splitter import RecursiveCharacterTextSplitter

def load_documents(source_dir: str) -> list:
    """Load documents from a directory supporting multiple formats."""
    # Load PDFs
    pdf_loader = DirectoryLoader(
        source_dir,
        glob="**/*.pdf",
        loader_cls=PyPDFLoader
    )
    pdf_docs = pdf_loader.load()
    # Load text files
    txt_loader = DirectoryLoader(
        source_dir,
        glob="**/*.txt",
        loader_cls=TextLoader
    )
    txt_docs = txt_loader.load()
    all_docs = pdf_docs + txt_docs
    print(f"Loaded {len(all_docs)} document pages/sections")
    return all_docs

def split_documents(documents: list) -> list:
    """Split documents into chunks optimized for retrieval."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,     # target characters per chunk
        chunk_overlap=200,   # overlap between adjacent chunks
        length_function=len,
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    chunks = splitter.split_documents(documents)
    print(f"Split into {len(chunks)} chunks")
    print(f"Average chunk size: {sum(len(c.page_content) for c in chunks) // len(chunks)} chars")
    return chunks

# Load and split
docs = load_documents("./documents")
chunks = split_documents(docs)

# Inspect a chunk
print(f"\nSample chunk:\n{chunks[0].page_content[:200]}...")
print(f"Metadata: {chunks[0].metadata}")
Pay attention to that separators list. The RecursiveCharacterTextSplitter tries each separator in order — double newlines first, then single newlines, then periods, then spaces, and finally individual characters as a last resort. So paragraphs stay intact when possible. Sentences don’t get sliced mid-thought unless absolutely necessary.
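The fallback order is easier to see in a stripped-down version. What follows is my own simplified sketch of the idea, not LangChain’s actual implementation: it ignores overlap and drops the separators it splits on, and `recursive_split` is a hypothetical name.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the coarsest separator that works, falling back to finer ones."""
    if len(text) <= chunk_size:
        return [text]
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        if len(part) > chunk_size:
            # This piece alone is too big: retry it with the next separator
            chunks.extend(recursive_split(part, finer, chunk_size))
            current = ""
        else:
            current = part
    if current:
        chunks.append(current)
    return chunks

text = "Para one is short.\n\nPara two is a bit longer than the limit allows."
pieces = recursive_split(text, ["\n\n", "\n", ". ", " ", ""], chunk_size=30)
for piece in pieces:
    print(repr(piece))
```

The first paragraph fits, so it stays whole. The second doesn’t, so the function works down the list until splitting on spaces produces pieces under the limit.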
And that 200-character overlap? It’s not waste. It’s insurance. Suppose a critical sentence about your return policy straddles a chunk boundary. Without overlap, one chunk gets the setup, the other gets the punchline, and neither retrieves properly. With overlap, the tail of one chunk is repeated at the head of the next, so a sentence shorter than the overlap will usually survive intact in at least one of them. Longer passages can still get split, which is one reason to tune both numbers against your own documents.
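You can watch the overlap do its job with a plain sliding window. This is a deliberately naive chunker (fixed character windows, no separator logic), and `sliding_chunks` plus the sample policy text are made up for the demonstration.

```python
def sliding_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Fixed-size character windows; each window re-reads the last `overlap` characters."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

policy = (
    "Items may be returned within 30 days. "
    "Refunds after 30 days need manager approval. "
    "Gift cards are final sale."
)
key = "Refunds after 30 days need manager approval."

without = sliding_chunks(policy, chunk_size=80, overlap=0)
with_overlap = sliding_chunks(policy, chunk_size=80, overlap=45)

print("intact without overlap:", any(key in c for c in without))
print("intact with overlap:   ", any(key in c for c in with_overlap))
```

The overlap here is set just longer than the key sentence on purpose. That’s the condition that guarantees the sentence lands whole in some window, and it’s why very long sentences can still get cut with a 200-character overlap.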
Embeddings and the Vector Store: Where the Magic Lives
Alright, we have our chunks. Now we need to turn each chunk from a blob of text into a numerical vector — a list of 1536 floating-point numbers (for text-embedding-3-small) that captures the semantic meaning of that text. Similar ideas produce similar vectors. “How do I return a product?” and “What’s your refund policy?” would sit close together in vector space, even though they share almost no words.
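“Close together” has a precise meaning: the usual measure is cosine similarity, the cosine of the angle between two vectors. Here’s a small sketch with made-up three-dimensional vectors standing in for real embeddings. The numbers are invented for illustration, not produced by any model.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hypothetical 3-dimensional "embeddings"; real ones here would have 1536 dimensions
return_question = [0.9, 0.1, 0.0]   # "How do I return a product?"
refund_question = [0.8, 0.2, 0.1]   # "What's your refund policy?"
cafeteria_hours = [0.0, 0.1, 0.9]   # "When does the cafeteria open?"

print(cosine_similarity(return_question, refund_question))
print(cosine_similarity(return_question, cafeteria_hours))
```

The two refund-flavored vectors score close to 1, and the unrelated one scores close to 0. The vector store runs this kind of comparison for you across every stored chunk.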
ChromaDB handles storage and similarity search locally. No cloud service needed, no API costs for retrieval — it’s all on your machine. Perfect for development, prototyping, and honestly a lot of production use cases too. You’d only need to upgrade to Pinecone, Weaviate, or pgvector when you’re dealing with millions of documents or need distributed infrastructure.
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

def create_vector_store(chunks: list, persist_dir: str = "./chroma_db"):
    """Create a persistent vector store from document chunks."""
    embeddings = OpenAIEmbeddings(
        model="text-embedding-3-small"  # cost-effective, high quality
    )
    # Create and persist the vector store
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=persist_dir,
        collection_name="rag_documents"
    )
    print(f"Vector store created with {vectorstore._collection.count()} vectors")
    print(f"Persisted to: {persist_dir}")
    return vectorstore

def load_vector_store(persist_dir: str = "./chroma_db"):
    """Load an existing vector store from disk."""
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vectorstore = Chroma(
        persist_directory=persist_dir,
        embedding_function=embeddings,
        collection_name="rag_documents"
    )
    print(f"Loaded vector store with {vectorstore._collection.count()} vectors")
    return vectorstore

# Create the vector store (run once)
vectorstore = create_vector_store(chunks)

# Test similarity search
results = vectorstore.similarity_search("What is the return policy?", k=3)
for i, doc in enumerate(results):
    print(f"\nResult {i+1} (source: {doc.metadata.get('source', 'unknown')}):")
    print(f"  {doc.page_content[:150]}...")
Run that similarity_search test and actually look at the results. Seriously. Before building anything else, confirm that the right chunks are coming back. If your retrieval is garbage, no amount of prompt engineering or model sophistication will save you. Retrieval quality is the bottleneck in every RAG system I’ve worked with — not the LLM, not the prompt, not the chain framework. The retrieval.
The k=3 parameter controls how many chunks come back. Three is conservative. Four or five gives the LLM more context to work with, but increases token usage and can sometimes introduce noise — irrelevant chunks that confuse the model. Start with 3 or 4, test with your actual questions, and adjust.
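Here’s what the trade-off looks like in miniature. The chunk names and scores below are invented, and `top_k` is a hypothetical helper that mimics what the vector store does after scoring.

```python
import heapq

def top_k(scored_chunks: dict[str, float], k: int) -> list[str]:
    """Return the k chunk ids with the highest similarity score, best first."""
    return heapq.nlargest(k, scored_chunks, key=scored_chunks.get)

# Hypothetical similarity scores for the question "What is the return policy?"
scores = {
    "refund-policy": 0.91,
    "returns-faq": 0.84,
    "shipping": 0.52,
    "cafeteria-menu": 0.11,
}

print(top_k(scores, 2))  # only the strong matches
print(top_k(scores, 4))  # the weak matches come along too
```

A larger k never removes the good chunks. It just appends progressively weaker ones, and at some point those stop helping and start distracting.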
The Retrieval Chain: Connecting Search to Generation
Now the interesting part. We’ve got a vector database full of our document chunks. We need to wire it to an LLM so that when a user asks a question, the system automatically retrieves relevant context and generates a grounded answer.
LangChain’s RetrievalQA chain handles this orchestration. But the piece you should care about most isn’t the chain — it’s the prompt template. Without explicit instructions telling the model to stick to the provided context, it’ll happily mix its training data with your documents. You’ll get answers that sound confident but contain hallucinated details your documents never mentioned. Dangerous, especially for anything customer-facing or compliance-related.
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

def build_rag_chain(vectorstore):
    """Build a RAG chain with custom prompting."""
    llm = ChatOpenAI(
        model="gpt-4o",
        temperature=0.0  # deterministic for factual Q&A
    )
    # Custom prompt that instructs the model to use only retrieved context
    prompt_template = """Use the following context to answer the question.
If the context does not contain enough information to answer,
say "I don't have enough information to answer that question."
Do not make up information that is not in the context.
Context:
{context}
Question: {question}
Answer:"""
    prompt = PromptTemplate(
        template=prompt_template,
        input_variables=["context", "question"]
    )
    # Build the chain
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 4}  # retrieve top 4 chunks
    )
    chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",  # concatenate all retrieved docs into prompt
        retriever=retriever,
        chain_type_kwargs={"prompt": prompt},
        return_source_documents=True
    )
    return chain

# Build and test the chain
rag_chain = build_rag_chain(vectorstore)

def ask(question: str) -> str:
    """Ask a question and display the answer with sources."""
    result = rag_chain.invoke({"query": question})
    print(f"Question: {question}")
    print(f"Answer: {result['result']}")
    print(f"\nSources ({len(result['source_documents'])}):")
    for doc in result["source_documents"]:
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "n/a")
        print(f"  - {source} (page {page})")
    return result["result"]

ask("What are the main features of the product?")
ask("How do I request a refund?")
A few things worth unpacking here.
temperature=0.0 isn’t just a good default for factual Q&A. It’s borderline mandatory. Cranking temperature up introduces randomness, which is great for creative writing, terrible for “tell me what page 47 of the employee handbook says about vacation carry-over.” You want repeatability: same question, same retrieved context, and as close to the same answer as the API will give you. (Even at zero, hosted models don’t guarantee bit-identical output on every call, but the variation is small.)
chain_type="stuff" means all retrieved chunks get concatenated and dumped into the prompt in one go. Simple, effective, and works beautifully as long as your total context (question + retrieved chunks + prompt template) fits within the model’s context window. For GPT-4o’s 128k window, you’d need to retrieve an absurd number of chunks before hitting limits. Probably not something to worry about unless your chunks are unusually large.
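To put a number on “absurd,” here’s a rough sketch of what stuffing does and how much of the window it uses. It’s not the chain’s internal code. `stuff_prompt` and `estimated_tokens` are hypothetical helpers, the template is shortened, and the 4-characters-per-token figure is a crude heuristic rather than a real tokenizer.

```python
PROMPT_TEMPLATE = (
    "Use the following context to answer the question.\n"
    "Context:\n{context}\n"
    "Question: {question}\nAnswer:"
)
CONTEXT_WINDOW_TOKENS = 128_000  # GPT-4o window cited above
CHARS_PER_TOKEN = 4              # assumption: rough heuristic for English text

def stuff_prompt(chunks: list[str], question: str) -> str:
    """Join every retrieved chunk into one context string and fill the template."""
    context = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)

def estimated_tokens(text: str) -> int:
    """Very rough token estimate; use a real tokenizer for anything that matters."""
    return len(text) // CHARS_PER_TOKEN

retrieved = ["x" * 1000] * 4  # four 1000-character chunks, matching the settings above
prompt = stuff_prompt(retrieved, "How do I request a refund?")
used = estimated_tokens(prompt)
print(f"~{used} tokens, {used / CONTEXT_WINDOW_TOKENS:.1%} of the window")
```

With four 1000-character chunks the estimate comes out around a thousand tokens, under one percent of the window. The limit only becomes a concern if you raise k or the chunk size by a couple of orders of magnitude.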
Notice the “I don’t have enough information” fallback in the prompt. Without it, models get creative. They’ll extrapolate, infer, or outright invent answers. Better to have the system say “I don’t know” than to confidently present something your documents never said. Your users will trust the system more when it’s honest about its limits.
Conversation Memory: Making It Actually Feel Like a Chat
A single-turn Q&A system works for simple lookups. But real users don’t ask isolated questions. They ask follow-ups. “What products do you offer?” then “Tell me more about the first one.” then “What’s the pricing for that?” Without memory, the second and third questions are meaningless — the model doesn’t know what “the first one” or “that” refers to.
LangChain’s ConversationalRetrievalChain handles this, paired with a memory object that keeps a sliding window of recent exchanges. When a follow-up comes in, the chain uses the conversation history to reformulate the question into a standalone query before hitting the vector store. “Tell me more about the first one you mentioned” becomes something like “Tell me more about Product X”, a self-contained question that actually retrieves useful chunks.
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferWindowMemory

def build_conversational_rag(vectorstore):
    """Build a RAG chain with conversation memory."""
    llm = ChatOpenAI(model="gpt-4o", temperature=0.0)
    memory = ConversationBufferWindowMemory(
        memory_key="chat_history",
        return_messages=True,
        output_key="answer",
        k=5  # remember last 5 exchanges
    )
    retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
    chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=retriever,
        memory=memory,
        return_source_documents=True,
        verbose=False
    )
    return chain

# Interactive conversation loop
conv_chain = build_conversational_rag(vectorstore)

def chat(question: str) -> str:
    """Send a conversational query."""
    result = conv_chain.invoke({"question": question})
    print(f"User: {question}")
    print(f"Assistant: {result['answer']}\n")
    return result["answer"]

# Multi-turn conversation example
chat("What products do you offer?")
chat("Tell me more about the first one you mentioned.")
chat("What is the pricing for that?")
That k=5 in the memory config deserves a comment. Keeping the last five exchanges strikes a decent balance between context awareness and token efficiency. Bump it higher and you’re spending more tokens on history in every query. Drop it to 2 and the model forgets the beginning of a longer conversation. Five works for most internal tool use cases. For a customer-facing chatbot where conversations run longer, you might want 8 or 10, but watch your costs.
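The windowing behavior itself is simple enough to sketch with the standard library. `WindowMemory` below is a hypothetical class that only illustrates the idea, not LangChain’s implementation, and the answers are placeholder strings.

```python
from collections import deque

class WindowMemory:
    """Keeps only the most recent k (question, answer) exchanges."""

    def __init__(self, k: int):
        # A deque with maxlen silently drops the oldest item when full
        self.exchanges: deque[tuple[str, str]] = deque(maxlen=k)

    def add(self, question: str, answer: str) -> None:
        self.exchanges.append((question, answer))

    def history(self) -> list[tuple[str, str]]:
        return list(self.exchanges)

memory = WindowMemory(k=2)
memory.add("What products do you offer?", "(placeholder answer 1)")
memory.add("Tell me more about the first one.", "(placeholder answer 2)")
memory.add("What is the pricing for that?", "(placeholder answer 3)")

# With k=2 the first exchange has already been forgotten
for question, _ in memory.history():
    print(question)
```

That forgotten first exchange is the cost of a small k. If the user’s next message refers back to the product list, there’s nothing left in memory for it to resolve against.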
Something subtle happens behind the scenes here. The chain doesn’t just append chat history to the retrieval query. It uses the LLM itself to rephrase the follow-up question into a standalone form, incorporating context from previous turns. So the retrieval step always gets a clean, self-contained query. Smart. Also means you’re paying for an extra LLM call per question, which is the trade-off. Worth it for usability, in my experience.
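In outline, the rewrite step looks something like this. It’s a sketch of the pattern, not the chain’s real prompt or code. `CONDENSE_TEMPLATE` is my own wording, and `fake_llm` is a canned stand-in so the example runs offline.

```python
from typing import Callable

CONDENSE_TEMPLATE = (
    "Given the conversation below, rewrite the follow-up as a standalone question.\n\n"
    "{history}\n\nFollow-up: {question}\nStandalone question:"
)

def condense_question(
    history: list[tuple[str, str]],
    question: str,
    llm: Callable[[str], str],
) -> str:
    """Turn a follow-up into a self-contained query using prior turns."""
    if not history:
        return question  # first turn: nothing to resolve, so no extra LLM call
    transcript = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in history)
    prompt = CONDENSE_TEMPLATE.format(history=transcript, question=question)
    return llm(prompt).strip()

def fake_llm(prompt: str) -> str:
    """Canned stand-in; a real model would generate this from the prompt."""
    return "Tell me more about Product X."

history = [("What products do you offer?", "We offer Product X and Product Y.")]
standalone = condense_question(history, "Tell me more about the first one.", fake_llm)
print(standalone)  # this string, not the raw follow-up, goes to the retriever
```

The retriever only ever sees the output of `condense_question`, which is why the rewrite quality matters as much as the retrieval itself.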
RAG vs. Fine-Tuning: When to Use Which
I’ve got strong opinions on this, and I know not everyone agrees. But here’s how I see it after building both kinds of systems over the past year and a half.
Use RAG when:
- Your knowledge base changes. Weekly, daily, hourly — doesn’t matter. RAG handles updates by re-embedding documents. No retraining.
- You need source attribution. RAG naturally returns which documents it pulled from. Fine-tuned models give you answers with zero traceability.
- Your data is sensitive and you don’t want it baked into model weights. RAG keeps your data in your vector store, under your control.
- You need it working this week, not this quarter. A RAG pipeline can go from zero to deployed in a day or two.
Use fine-tuning when:
- You need the model to adopt a specific tone, style, or persona that prompting alone can’t achieve.
- Your task requires specialized reasoning patterns the base model struggles with — think domain-specific classification or structured extraction.
- Latency matters more than freshness. Fine-tuned models don’t need retrieval, so they’re faster per query.
Use both when: You’ve got a domain-specific style AND a large, changing knowledge base. Fine-tune for the style, RAG for the knowledge. Rare in practice, but powerful when it fits.
Improving Your Pipeline: What to Try Next
What we built works. It’s a solid starting point. But production RAG systems need more polish. Here’s where to invest your time once the basics are running.
Metadata filtering. Right now, similarity search considers all chunks equally. In reality, you might want to restrict search to specific document categories. “Only search the HR policy docs.” “Only look at documents from 2024 onward.” ChromaDB supports metadata filtering natively — tag your chunks during ingestion and filter at query time.
Hybrid search. Pure semantic search misses when the user asks about a specific product name, error code, or employee ID — things where exact keyword matching beats semantic similarity. Combining BM25 (keyword search) with vector search gives you the best of both approaches. LangChain’s EnsembleRetriever handles this.
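One common way to merge a keyword ranking with a vector ranking is reciprocal rank fusion. I’m naming that technique myself here as an illustration. As far as I know EnsembleRetriever uses a weighted variant of it, but treat that as an assumption. The document ids are invented.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], c: int = 60) -> list[str]:
    """Merge ranked lists: each appearance contributes 1 / (c + rank) to a doc's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results for the query "error 4012"
keyword_ranking = ["err-4012-guide", "billing-faq", "setup-guide"]       # BM25-style
vector_ranking = ["troubleshooting", "err-4012-guide", "setup-guide"]    # embedding-style

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

Documents that both retrievers agree on float to the top, even if neither ranked them first. The constant `c` dampens the advantage of a single first-place finish, and 60 is a conventional default rather than a tuned value.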
Reranking. Your initial retrieval pulls back, say, 10 chunks. A reranker (like Cohere’s or a cross-encoder model) then rescores those chunks against the original question and returns the top 4. More compute per query, but meaningfully better relevance. Worth it when answer quality matters more than latency.
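The two-stage shape is simple even though real rerankers aren’t. In the sketch below, `word_overlap_score` is a crude hypothetical stand-in for a cross-encoder, used only so the example runs without a model. The candidates are invented.

```python
from typing import Callable

def rerank(question: str, candidates: list[str],
           score: Callable[[str, str], float], top_n: int = 4) -> list[str]:
    """Second stage: rescore each candidate against the question and keep the best top_n."""
    ranked = sorted(candidates, key=lambda chunk: score(question, chunk), reverse=True)
    return ranked[:top_n]

def word_overlap_score(question: str, chunk: str) -> float:
    """Stand-in scorer: fraction of question words that appear in the chunk."""
    q_words = set(question.lower().split())
    return len(q_words & set(chunk.lower().split())) / len(q_words)

# Pretend these came back from the first-stage vector search, in that order
candidates = [
    "Shipping takes 5 days",
    "A refund after 30 days needs approval",
    "Refund requests go to support",
]
reranked = rerank("refund after 30 days", candidates, word_overlap_score, top_n=2)
print(reranked)
```

The first stage is cheap and broad, the second is expensive and precise. Because the scorer is passed in as a function, swapping the stand-in for a real reranking model shouldn’t require touching `rerank` itself.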
Evaluation. You can’t improve what you can’t measure. Build a test set of 50-100 question-answer pairs grounded in your documents. Run your pipeline against them. Track hit rate (did the correct chunk get retrieved?), answer correctness, and faithfulness (did the answer stay within the provided context?). Frameworks like RAGAS make this structured evaluation easier.
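Hit rate is the easiest of those metrics to compute yourself, so it’s a good place to start. The sketch below assumes each test case records the source file that should be retrieved. The `expected_source` field, the file names, and the canned `fake_retrieve` function are all hypothetical stand-ins for your real test set and retriever.

```python
from typing import Callable

def hit_rate(test_set: list[dict], retrieve: Callable[[str, int], list[str]],
             k: int = 4) -> float:
    """Fraction of questions whose expected source shows up in the top-k results."""
    hits = sum(
        1 for case in test_set
        if case["expected_source"] in retrieve(case["question"], k)
    )
    return hits / len(test_set)

# Canned retrieval results so the sketch runs without a vector store
CANNED = {
    "How do I request a refund?": ["refund-policy.pdf", "faq.txt"],
    "How many vacation days carry over?": ["hr-handbook.pdf"],
    "What is the onboarding schedule?": ["faq.txt", "onboarding.pdf"],
    "Who approves travel expenses?": ["faq.txt"],
}

def fake_retrieve(question: str, k: int) -> list[str]:
    return CANNED.get(question, [])[:k]

test_set = [
    {"question": "How do I request a refund?", "expected_source": "refund-policy.pdf"},
    {"question": "How many vacation days carry over?", "expected_source": "hr-handbook.pdf"},
    {"question": "What is the onboarding schedule?", "expected_source": "onboarding.pdf"},
    {"question": "Who approves travel expenses?", "expected_source": "travel-policy.pdf"},
]

print(f"hit rate @4: {hit_rate(test_set, fake_retrieve, k=4):.2f}")
print(f"hit rate @1: {hit_rate(test_set, fake_retrieve, k=1):.2f}")
```

Running the same test set at different k values shows directly what a larger retrieval window buys you, which beats guessing.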
Back to Your Company’s Documents
Remember where we started? LLMs know everything about 2024. They know nothing about your company.
But now you’ve got the pieces to change that. Those 47 PDFs your support team keeps emailing around? They’re chunks waiting to be embedded. The product docs living in a shared Google Drive folder that nobody reads because they’re 200 pages long? They’re a vector store that can answer questions in seconds. The onboarding guide your HR team wrote last quarter — the one new hires are supposed to read but never do? It’s a conversational RAG bot that actually gets used.
I built something almost identical to what we walked through today for an internal project last year. Took a Friday afternoon to get the first version running. By Monday, three teams were using it. Not because it was perfect — the chunk sizes were probably wrong, the prompt could’ve been tighter, and we hadn’t added any reranking yet. But it answered questions about our internal docs, with sources, and it didn’t hallucinate. That alone made it more useful than asking the LLM directly.
Your company’s knowledge doesn’t need to be locked in documents nobody reads. It doesn’t need a fine-tuned model to be accessible. It needs a pipeline — load, split, embed, retrieve, generate. That’s RAG. And now you know how to build one.
The LLM still knows nothing about your company. But with a vector store between you and it? Doesn’t need to.