Building a RAG Application with LangChain

LLMs Know Everything About 2024. They Know Nothing About Your Company.

GPT-4 can tell you the GDP of Moldova. It can write a sonnet about quantum entanglement. Ask it who won the 2023 Cricket World Cup, and it’ll probably nail it. But ask it what your company’s refund policy says about returns after 30 days? Blank stare. Ask it which internal document covers the onboarding process for new hires in your Pune office? Total silence.

Here’s the uncomfortable truth about large language models in mid-2025: they’re simultaneously brilliant and useless. Brilliant at general knowledge. Useless at your knowledge. Your internal wikis, product manuals, HR policies, customer support logs — none of that made it into their training data. And even if it somehow did, it’d be months or years stale by the time you’re querying the model.

So what do you do? You could fine-tune a model on your data. Expensive. Slow. Requires ML expertise most teams don’t have. And every time your documents change, you’d need to retrain. Or you could do something cleverer: give the LLM a reading list at query time. Fetch the relevant pages, hand them over, and let the model answer based on what it just read. Not what it memorized during training. What it found in your documents, right now, for this specific question.

That approach has a name. Retrieval-Augmented Generation — RAG. And honestly? It’s become the single most practical pattern for building AI apps that need private, current, or domain-specific knowledge. I’ve seen teams blow weeks fine-tuning models when a RAG pipeline would’ve solved the problem in an afternoon. Not always, but more often than the fine-tuning evangelists want to admit.

In this tutorial, we’ll build one from scratch. Documents in, intelligent answers out. LangChain handles the orchestration, OpenAI provides the embeddings and LLM, and ChromaDB stores our vectors locally. By the end, you’ll have a working system that can answer questions about your own files — PDFs, text documents, whatever you throw at it — with source citations and conversation memory.

Let’s get into it.

The Architecture: What Actually Happens Inside a RAG Pipeline

Before writing any code, you should understand what we’re building. A RAG pipeline operates in two distinct phases, and confusing them causes most beginner mistakes.

Phase 1: Ingestion (done once, or when documents change). Your raw documents — PDFs, text files, Notion exports, whatever — get loaded, chopped into overlapping chunks, converted into numerical vectors (embeddings), and stored in a vector database. Think of it as building a highly specialized search index, except instead of matching keywords, it understands meaning.

Phase 2: Query (every time a user asks something). The user’s question gets converted to an embedding using the same model. The vector database finds the chunks whose embeddings are closest to the question’s embedding. Those chunks — maybe three, maybe five — get stuffed into the LLM’s prompt as context. The LLM reads them and generates an answer grounded in your actual documents.
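To make the two phases concrete before any real libraries show up, here’s a toy sketch in plain Python. The word-count "embedding", the sample chunks, and the helper names (embed, cosine, retrieve, build_prompt) are all stand-ins I made up for illustration; a real pipeline swaps in a learned embedding model and a vector database, but the flow is the same.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts. A real pipeline uses a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Phase 1: ingestion. Embed every chunk once and keep (vector, text) pairs.
toy_chunks = [
    "Refunds are available within 30 days of purchase with a receipt.",
    "New hires in the Pune office complete onboarding in their first week.",
    "The cafeteria serves lunch from noon until two.",
]
index = [(embed(c), c) for c in toy_chunks]

# Phase 2: query. Embed the question the same way, rank chunks, build the prompt.
def retrieve(question: str, k: int = 2) -> list[str]:
    q = embed(question)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(question: str, k: int = 2) -> str:
    context = "\n".join(retrieve(question, k))
    return f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"

print(build_prompt("Are refunds available after 30 days?"))
```

Word counts only match on shared vocabulary, which is exactly the weakness real embeddings fix. The point here is the shape: index once, then embed, rank, and stuff on every query.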

Opinionated take: RAG is overkill for static FAQ bots. If your knowledge base is under 50 questions, a simple lookup table with fuzzy matching will be faster, cheaper, and more reliable. RAG shines when your corpus is large, changes frequently, or when users ask unpredictable questions that don’t map neatly to pre-written answers. Don’t reach for RAG because it sounds cool. Reach for it because simpler approaches failed.

The beauty of this architecture? Your LLM never needs retraining. Update a document, re-run the ingestion, and the next query automatically picks up the changes. No GPU clusters. No training runs. Just updated embeddings in your vector store.

Setting Up: Dependencies and API Keys

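We need LangChain for the pipeline glue, OpenAI for embeddings and the LLM, ChromaDB as our vector database, and a few utilities for document parsing. The imports in this tutorial follow the LangChain 0.2/0.3 package layout; newer major releases have moved some of the legacy chain classes, so pin langchain to a 0.3.x release if an import fails. Install everything in one shot: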

pip install langchain langchain-openai langchain-community \
    chromadb tiktoken pypdf python-dotenv

Quick note on cost. OpenAI’s text-embedding-3-small model costs roughly $0.02 per million tokens as of early 2025. Embedding a 100-page PDF probably runs you less than a cup of chai. The gpt-4o calls during query time are pricier, but for internal tools, you’re still looking at pennies per question unless you’re hammering it with thousands of queries daily.
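If you want to sanity-check that claim for your own corpus, the arithmetic is short. This is a back-of-the-envelope sketch: the 500 tokens per page figure is my assumption (dense pages run higher), and the price constant is just the rate quoted above, so check current pricing before budgeting.

```python
PRICE_PER_MILLION_TOKENS = 0.02  # USD, the text-embedding-3-small rate quoted above

def embedding_cost_usd(num_pages: int, tokens_per_page: int = 500) -> float:
    """Estimate the one-time cost of embedding a document.

    tokens_per_page is an assumption; dense pages can run higher.
    """
    total_tokens = num_pages * tokens_per_page
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

cost = embedding_cost_usd(100)
print(f"100 pages ~ {100 * 500:,} tokens ~ ${cost:.4f}")
```

Under those assumptions a 100-page PDF is about 50,000 tokens, or roughly a tenth of a cent. Chunk overlap means some text gets embedded twice, so the real figure lands a little higher, but it stays negligible.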

Create a .env file in your project root with your OpenAI API key, then load it:

import os
from dotenv import load_dotenv

load_dotenv()

# Verify API key is available
assert os.getenv("OPENAI_API_KEY"), "Set OPENAI_API_KEY in your .env file"

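If you’re working at a company with data sensitivity concerns (and you probably are), consider using Azure OpenAI instead. You get the same model families, deployed inside your own Azure subscription. LangChain supports both: langchain-openai ships AzureChatOpenAI and AzureOpenAIEmbeddings classes that stand in for ChatOpenAI and OpenAIEmbeddings, so the rest of the pipeline stays the same.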

Loading Documents and Splitting Them Into Chunks

Here’s where most RAG tutorials gloss over the hard part: chunk strategy. Sure, loading files is mechanical. LangChain has loaders for PDFs, text, CSVs, Notion exports, HTML, Markdown — dozens of formats. The interesting decision is how you break those documents apart.

Why split at all? Because embedding models and LLM context windows have limits. You can’t embed an entire 200-page HR manual as one vector. Even if you could, the retrieval would be useless — searching across entire documents gives you vague matches instead of precise ones. You need granular chunks: small enough that each one covers a single concept, large enough that it isn’t a meaningless sentence fragment.

from langchain_community.document_loaders import (
    PyPDFLoader,
    TextLoader,
    DirectoryLoader
)
from langchain.text_splitter import RecursiveCharacterTextSplitter

def load_documents(source_dir: str) -> list:
    """Load documents from a directory supporting multiple formats."""
    # Load PDFs
    pdf_loader = DirectoryLoader(
        source_dir,
        glob="**/*.pdf",
        loader_cls=PyPDFLoader
    )
    pdf_docs = pdf_loader.load()

    # Load text files
    txt_loader = DirectoryLoader(
        source_dir,
        glob="**/*.txt",
        loader_cls=TextLoader
    )
    txt_docs = txt_loader.load()

    all_docs = pdf_docs + txt_docs
    print(f"Loaded {len(all_docs)} document pages/sections")
    return all_docs


def split_documents(documents: list) -> list:
    """Split documents into chunks optimized for retrieval."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,         # target characters per chunk
        chunk_overlap=200,       # overlap between adjacent chunks
        length_function=len,
        separators=["\n\n", "\n", ". ", " ", ""]
    )

    chunks = splitter.split_documents(documents)
    print(f"Split into {len(chunks)} chunks")
    if chunks:  # avoid ZeroDivisionError when no documents were loaded
        avg = sum(len(c.page_content) for c in chunks) // len(chunks)
        print(f"Average chunk size: {avg} chars")
    return chunks


# Load and split
docs = load_documents("./documents")
chunks = split_documents(docs)

# Inspect a chunk
print(f"\nSample chunk:\n{chunks[0].page_content[:200]}...")
print(f"Metadata: {chunks[0].metadata}")

Pay attention to that separators list. The RecursiveCharacterTextSplitter tries each separator in order — double newlines first, then single newlines, then periods, then spaces, and finally individual characters as a last resort. So paragraphs stay intact when possible. Sentences don’t get sliced mid-thought unless absolutely necessary.
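The fallback order is easier to see in a stripped-down version. This is a simplified sketch of the idea, not LangChain’s implementation: it drops the separators, never merges small pieces back together, and ignores overlap. The function name recursive_split and the sample text are mine.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Simplified recursive splitting: try the coarsest separator first,
    and only fall back to finer ones for pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, chunk_size, finer))
    return pieces

sample = (
    "Refund policy.\n\n"
    "Items can be returned within 30 days. A receipt is required.\n\n"
    "Shipping is free."
)
pieces = recursive_split(sample, chunk_size=40, separators=["\n\n", "\n", ". ", " ", ""])
for p in pieces:
    print(repr(p))
```

The short paragraphs survive whole. Only the middle paragraph, which is longer than 40 characters, gets broken down further, and it splits at the sentence boundary rather than mid-word.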

And that 200-character overlap? It’s not waste. It’s insurance. Suppose a critical sentence about your return policy straddles two chunks. Without overlap, one chunk gets the setup, the other gets the punchline, and neither retrieves properly. With overlap, a sentence shorter than the overlap shows up whole in at least one of the two chunks, so retrieval can still catch it. Sentences longer than the overlap can still get cut, which is one reason not to set the overlap too low.
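Here’s a small demonstration of the straddling problem using plain fixed-size character windows. It’s a sketch under simplified assumptions: window_chunks is a hypothetical helper, not the LangChain splitter, and the filler text exists only to push the key sentence across a chunk boundary.

```python
def window_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Fixed-size character windows; each starts (chunk_size - overlap) after the last."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

key = "Returns after 30 days need manager approval."
# Filler pushes the key sentence across the 100-character boundary.
doc = "x" * 80 + " " + key + " " + "y" * 80

no_overlap = window_chunks(doc, chunk_size=100, overlap=0)
with_overlap = window_chunks(doc, chunk_size=100, overlap=50)

print(any(key in c for c in no_overlap))    # False: the sentence is cut in two
print(any(key in c for c in with_overlap))  # True: one window holds it whole
```

The guarantee only holds because the 44-character sentence is shorter than the 50-character overlap. Shrink the overlap below the sentence length and the sentence can be split again.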

Chunk size matters more than you’d think. I’ve seen teams default to 1000 characters and never question it. For dense legal documents, 500 works better — smaller chunks mean more precise retrieval. For conversational support docs, 1500 might be fine because the language is looser. Experiment. Run the same 20 questions against different chunk sizes and compare answer quality. There’s no universal right answer here.

Embeddings and the Vector Store: Where the Magic Lives

Alright, we have our chunks. Now we need to turn each chunk from a blob of text into a numerical vector — a list of 1536 floating-point numbers (for text-embedding-3-small) that captures the semantic meaning of that text. Similar ideas produce similar vectors. “How do I return a product?” and “What’s your refund policy?” would sit close together in vector space, even though they share almost no words.

ChromaDB handles storage and similarity search locally. No cloud service needed for the index, and the search itself runs on your machine. The one remote call at query time is embedding the user’s question with OpenAI, which costs a tiny fraction of a cent. Perfect for development, prototyping, and honestly a lot of production use cases too. You’d only need to upgrade to Pinecone, Weaviate, or pgvector when you’re dealing with millions of documents or need distributed infrastructure.

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

def create_vector_store(chunks: list, persist_dir: str = "./chroma_db"):
    """Create a persistent vector store from document chunks."""
    embeddings = OpenAIEmbeddings(
        model="text-embedding-3-small"  # cost-effective, high quality
    )

    # Create and persist the vector store
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=persist_dir,
        collection_name="rag_documents"
    )

    print(f"Vector store created with {vectorstore._collection.count()} vectors")
    print(f"Persisted to: {persist_dir}")

    return vectorstore


def load_vector_store(persist_dir: str = "./chroma_db"):
    """Load an existing vector store from disk."""
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

    vectorstore = Chroma(
        persist_directory=persist_dir,
        embedding_function=embeddings,
        collection_name="rag_documents"
    )

    print(f"Loaded vector store with {vectorstore._collection.count()} vectors")
    return vectorstore


# Create the vector store (run once)
vectorstore = create_vector_store(chunks)

# Test similarity search
results = vectorstore.similarity_search("What is the return policy?", k=3)
for i, doc in enumerate(results):
    print(f"\nResult {i+1} (source: {doc.metadata.get('source', 'unknown')}):")
    print(f"  {doc.page_content[:150]}...")

Run that similarity_search test and actually look at the results. Seriously. Before building anything else, confirm that the right chunks are coming back. If your retrieval is garbage, no amount of prompt engineering or model sophistication will save you. Retrieval quality is the bottleneck in every RAG system I’ve worked with — not the LLM, not the prompt, not the chain framework. The retrieval.

The k=3 parameter controls how many chunks come back. Three is conservative. Four or five gives the LLM more context to work with, but increases token usage and can sometimes introduce noise — irrelevant chunks that confuse the model. Start with 3 or 4, test with your actual questions, and adjust.

The Retrieval Chain: Connecting Search to Generation

Now the interesting part. We’ve got a vector database full of our document chunks. We need to wire it to an LLM so that when a user asks a question, the system automatically retrieves relevant context and generates a grounded answer.

LangChain’s RetrievalQA chain handles this orchestration. But the piece you should care about most isn’t the chain — it’s the prompt template. Without explicit instructions telling the model to stick to the provided context, it’ll happily mix its training data with your documents. You’ll get answers that sound confident but contain hallucinated details your documents never mentioned. Dangerous, especially for anything customer-facing or compliance-related.

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

def build_rag_chain(vectorstore):
    """Build a RAG chain with custom prompting."""
    llm = ChatOpenAI(
        model="gpt-4o",
        temperature=0.0  # minimize randomness for factual Q&A
    )

    # Custom prompt that instructs the model to use only retrieved context
    prompt_template = """Use the following context to answer the question.
If the context does not contain enough information to answer,
say "I don't have enough information to answer that question."
Do not make up information that is not in the context.

Context:
{context}

Question: {question}

Answer:"""

    prompt = PromptTemplate(
        template=prompt_template,
        input_variables=["context", "question"]
    )

    # Build the chain
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 4}  # retrieve top 4 chunks
    )

    chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",  # concatenate all retrieved docs into prompt
        retriever=retriever,
        chain_type_kwargs={"prompt": prompt},
        return_source_documents=True
    )

    return chain


# Build and test the chain
rag_chain = build_rag_chain(vectorstore)

def ask(question: str) -> str:
    """Ask a question and display the answer with sources."""
    result = rag_chain.invoke({"query": question})

    print(f"Question: {question}")
    print(f"Answer: {result['result']}")
    print(f"\nSources ({len(result['source_documents'])}):")
    for doc in result["source_documents"]:
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "n/a")
        print(f"  - {source} (page {page})")

    return result["result"]


ask("What are the main features of the product?")
ask("How do I request a refund?")

A few things worth unpacking here.

temperature=0.0 isn’t just a good default for factual Q&A. It’s borderline mandatory. Cranking temperature up introduces randomness, which is great for creative writing, terrible for “tell me what page 47 of the employee handbook says about vacation carry-over.” You want answers that are as repeatable as possible. One caveat: temperature 0 minimizes sampling randomness but doesn’t strictly guarantee identical output on every call, so treat it as near-deterministic rather than a hard promise.

chain_type="stuff" means all retrieved chunks get concatenated and dumped into the prompt in one go. Simple, effective, and works beautifully as long as your total context (question + retrieved chunks + prompt template) fits within the model’s context window. For GPT-4o’s 128k window, you’d need to retrieve an absurd number of chunks before hitting limits. Probably not something to worry about unless your chunks are unusually large.
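To put rough numbers on "absurd", here’s a quick budget estimate. It’s a sketch with assumptions baked in: about 4 characters per token is a rule of thumb for English rather than an exact figure, and the 200-token overhead for the template and question is a guess.

```python
def stuffed_prompt_tokens(k: int, chunk_chars: int = 1000,
                          overhead_tokens: int = 200, chars_per_token: int = 4) -> int:
    """Rough token count for a 'stuff' prompt: k chunks plus template and question.

    chars_per_token=4 is a rule of thumb for English, not an exact figure.
    """
    return k * chunk_chars // chars_per_token + overhead_tokens

CONTEXT_WINDOW = 128_000  # GPT-4o window cited above

for k in (4, 20, 100):
    used = stuffed_prompt_tokens(k)
    print(f"k={k:>3}: ~{used:,} tokens ({used / CONTEXT_WINDOW:.1%} of the window)")
```

With 1000-character chunks, k=4 lands around 1,200 tokens, about 1% of the window. Cost and noise will bite long before the context limit does.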

Source documents are non-negotiable. Always return them. Always show them to the user. RAG without source attribution is just a chatbot that might be lying. When someone asks about your refund policy and gets an answer, they should be able to click through to the exact PDF page that backs it up. Trust requires traceability.

Notice the “I don’t have enough information” fallback in the prompt. Without it, models get creative. They’ll extrapolate, infer, or outright invent answers. Better to have the system say “I don’t know” than to confidently present something your documents never said. Your users will trust the system more when it’s honest about its limits.

Conversation Memory: Making It Actually Feel Like a Chat

A single-turn Q&A system works for simple lookups. But real users don’t ask isolated questions. They ask follow-ups. “What products do you offer?” then “Tell me more about the first one.” then “What’s the pricing for that?” Without memory, the second and third questions are meaningless — the model doesn’t know what “the first one” or “that” refers to.

LangChain’s ConversationalRetrievalChain handles this, paired with a memory object that keeps a sliding window of recent exchanges. When a follow-up comes in, the chain uses the conversation history to reformulate the question into a standalone query before hitting the vector store. “Tell me more about the first one you mentioned” becomes something like “Tell me more about Product X”: a self-contained question that actually retrieves useful chunks.

from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferWindowMemory

def build_conversational_rag(vectorstore):
    """Build a RAG chain with conversation memory."""
    llm = ChatOpenAI(model="gpt-4o", temperature=0.0)

    memory = ConversationBufferWindowMemory(
        memory_key="chat_history",
        return_messages=True,
        output_key="answer",
        k=5  # remember last 5 exchanges
    )

    retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

    chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=retriever,
        memory=memory,
        return_source_documents=True,
        verbose=False
    )

    return chain


# Interactive conversation loop
conv_chain = build_conversational_rag(vectorstore)

def chat(question: str) -> str:
    """Send a conversational query."""
    result = conv_chain.invoke({"question": question})
    print(f"User: {question}")
    print(f"Assistant: {result['answer']}\n")
    return result["answer"]


# Multi-turn conversation example
chat("What products do you offer?")
chat("Tell me more about the first one you mentioned.")
chat("What is the pricing for that?")

That k=5 in the memory config deserves a comment. Keeping the last five exchanges strikes a decent balance between context awareness and token efficiency. Bump it higher and you’re spending more tokens on history in every query. Drop it to 2 and the model forgets the beginning of a longer conversation. Five works for most internal tool use cases. For a customer-facing chatbot where conversations run longer, you might want 8 or 10, but watch your costs.
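The windowing behavior itself is simple enough to sketch in a few lines. This is an illustration of the idea, not LangChain’s ConversationBufferWindowMemory; the WindowMemory class and its method names are hypothetical.

```python
from collections import deque

class WindowMemory:
    """Keeps only the most recent k (question, answer) exchanges."""

    def __init__(self, k: int = 5):
        # A bounded deque drops the oldest exchange automatically once full.
        self.exchanges = deque(maxlen=k)

    def add(self, question: str, answer: str) -> None:
        self.exchanges.append((question, answer))

    def as_text(self) -> str:
        """Render the window as the history text that gets sent with each query."""
        return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in self.exchanges)

memory_sketch = WindowMemory(k=2)
for n in range(1, 5):
    memory_sketch.add(f"question {n}", f"answer {n}")

print(memory_sketch.as_text())
```

After four exchanges with k=2, only exchanges 3 and 4 remain. Everything in that rendered text is billed as input tokens on each call, which is where the cost trade-off comes from.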

Something subtle happens behind the scenes here. The chain doesn’t just append chat history to the retrieval query. It uses the LLM itself to rephrase the follow-up question into a standalone form, incorporating context from previous turns. So the retrieval step always gets a clean, self-contained query. Smart. It also means you’re paying for an extra LLM call on every follow-up once there’s history to condense, which is the trade-off. Worth it for usability, in my experience.
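Roughly, that extra call receives a prompt shaped like the one below. Treat it as a sketch: the template wording is illustrative rather than copied from the library, and build_condense_prompt and the product names are hypothetical.

```python
CONDENSE_TEMPLATE = (
    "Given the following conversation and a follow-up question, "
    "rephrase the follow-up question to be a standalone question.\n\n"
    "Chat History:\n{chat_history}\n"
    "Follow-Up Input: {question}\n"
    "Standalone question:"
)

def build_condense_prompt(history: list[tuple[str, str]], follow_up: str) -> str:
    """Format the prompt for the extra 'rephrase the question' LLM call."""
    lines = [f"Human: {q}\nAssistant: {a}" for q, a in history]
    return CONDENSE_TEMPLATE.format(chat_history="\n".join(lines), question=follow_up)

history = [("What products do you offer?", "We offer Widget Pro and Widget Lite.")]
condense_prompt = build_condense_prompt(history, "Tell me more about the first one.")
print(condense_prompt)
```

The LLM’s reply to this prompt (ideally something like "Tell me more about Widget Pro") is what gets embedded and sent to the vector store, not the user’s original wording.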

RAG vs. Fine-Tuning: When to Use Which

I’ve got strong opinions on this, and I know not everyone agrees. But here’s how I see it after building both kinds of systems over the past year and a half.

Use RAG when:

Your knowledge base changes. Weekly, daily, hourly — doesn’t matter. RAG handles updates by re-embedding documents. No retraining.
You need source attribution. RAG naturally returns which documents it pulled from. Fine-tuned models give you answers with zero traceability.
Your data is sensitive and you don’t want it baked into model weights. RAG keeps your data in your vector store, under your control.
You need it working this week, not this quarter. A RAG pipeline can go from zero to deployed in a day or two.

Use fine-tuning when:

You need the model to adopt a specific tone, style, or persona that prompting alone can’t achieve.
Your task requires specialized reasoning patterns the base model struggles with — think domain-specific classification or structured extraction.
Latency matters more than freshness. Fine-tuned models don’t need retrieval, so they’re faster per query.

Use both when: You’ve got a domain-specific style AND a large, changing knowledge base. Fine-tune for the style, RAG for the knowledge. Rare in practice, but powerful when it fits.

Hot take: Most teams that jump to fine-tuning haven’t tried prompting hard enough. Few-shot examples in the system prompt, combined with RAG for knowledge, cover 90% of enterprise use cases I’ve encountered. Fine-tuning is a scalpel. RAG is a Swiss army knife. Grab the knife first.

Improving Your Pipeline: What to Try Next

What we built works. It’s a solid starting point. But production RAG systems need more polish. Here’s where to invest your time once the basics are running.

Metadata filtering. Right now, similarity search considers all chunks equally. In reality, you might want to restrict search to specific document categories. “Only search the HR policy docs.” “Only look at documents from 2024 onward.” ChromaDB supports metadata filtering natively — tag your chunks during ingestion and filter at query time.
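The filtering logic can be sketched in plain Python. The record layout, tag names, and filter_by_metadata helper are all hypothetical. In LangChain’s Chroma wrapper the equivalent is, as far as I know, a filter dict passed through search_kwargs, but confirm the exact syntax in the docs for your version.

```python
def filter_by_metadata(records: list[dict], where: dict) -> list[dict]:
    """Keep records whose metadata matches every key/value pair in `where`."""
    return [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in where.items())
    ]

# Chunks tagged at ingestion time (hypothetical tags).
tagged_chunks = [
    {"text": "Leave carries over up to 10 days.",
     "metadata": {"category": "hr", "year": 2024}},
    {"text": "API rate limit is 100 requests/min.",
     "metadata": {"category": "engineering", "year": 2024}},
    {"text": "Old leave policy.",
     "metadata": {"category": "hr", "year": 2021}},
]

hr_2024 = filter_by_metadata(tagged_chunks, {"category": "hr", "year": 2024})
print([r["text"] for r in hr_2024])
```

Similarity search then runs only over the survivors, so an engineering doc that happens to mention "leave" can’t crowd out the HR policy.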

Hybrid search. Pure semantic search misses when the user asks about a specific product name, error code, or employee ID — things where exact keyword matching beats semantic similarity. Combining BM25 (keyword search) with vector search gives you the best of both approaches. LangChain’s EnsembleRetriever handles this.
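One common way to merge a keyword ranking with a vector ranking is Reciprocal Rank Fusion, and my understanding is that EnsembleRetriever uses a weighted variant of it. Here’s a minimal unweighted sketch; the document IDs are made up, and c=60 is a commonly used constant rather than anything tuned.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], c: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one.

    Each document scores sum(1 / (c + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["err-4012-guide", "billing-faq", "setup-guide"]     # e.g. from BM25
vector_hits = ["troubleshooting", "err-4012-guide", "billing-faq"]  # e.g. from embeddings

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that rank well in both lists float to the top, which is the behavior you want when a query mixes an exact identifier with a fuzzy description.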

Reranking. Your initial retrieval pulls back, say, 10 chunks. A reranker (like Cohere’s or a cross-encoder model) then rescores those chunks against the original question and returns the top 4. More compute per query, but meaningfully better relevance. Worth it when answer quality matters more than latency.
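The retrieve-then-rerank flow looks like this in miniature. The scorer here is a deliberately crude stand-in (word overlap) so the example runs without any model; in practice you’d plug a cross-encoder or a hosted rerank API into the score parameter. All function names are hypothetical.

```python
import re

def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(question: str, passage: str) -> float:
    """Stand-in scorer: fraction of question words found in the passage.

    A real reranker would call a cross-encoder or a rerank API here.
    """
    q = tokenize(question)
    return len(q & tokenize(passage)) / len(q) if q else 0.0

def rerank(question: str, candidates: list[str], top_n: int = 4,
           score=overlap_score) -> list[str]:
    """Rescore first-pass candidates against the question and keep the best top_n."""
    return sorted(candidates, key=lambda p: score(question, p), reverse=True)[:top_n]

candidates = [
    "Our office hours are nine to five.",
    "To request a refund, email support with your order number.",
    "Refund timelines vary by payment method.",
]
best = rerank("How do I request a refund?", candidates, top_n=1)
print(best)
```

The design point is the two-stage shape: a cheap, wide first pass followed by a slower, more precise second pass over a short list.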

Evaluation. You can’t improve what you can’t measure. Build a test set of 50-100 question-answer pairs grounded in your documents. Run your pipeline against them. Track hit rate (did the correct chunk get retrieved?), answer correctness, and faithfulness (did the answer stay within the provided context?). Frameworks like RAGAS make this structured evaluation easier.
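Hit rate is the easiest of those metrics to compute yourself. A minimal sketch, assuming each test case records the source file that should be retrieved; the test cases, file names, and canned retriever below are invented for the example and stand in for your real pipeline.

```python
def hit_rate(test_set: list[dict], retrieve, k: int = 4) -> float:
    """Fraction of questions whose expected source appears in the top-k results.

    `retrieve(question, k)` must return a list of source identifiers.
    """
    if not test_set:
        return 0.0
    hits = sum(
        1 for case in test_set
        if case["expected_source"] in retrieve(case["question"], k)
    )
    return hits / len(test_set)

# Hypothetical retriever with canned results, standing in for a real vector search.
canned_results = {
    "How do I request a refund?": ["refunds.pdf", "faq.txt"],
    "What is the leave policy?": ["faq.txt", "handbook.pdf"],
    "Who approves travel?": ["faq.txt"],
}

def fake_retrieve(question: str, k: int) -> list[str]:
    return canned_results.get(question, [])[:k]

eval_set = [
    {"question": "How do I request a refund?", "expected_source": "refunds.pdf"},
    {"question": "What is the leave policy?", "expected_source": "handbook.pdf"},
    {"question": "Who approves travel?", "expected_source": "travel.pdf"},
]

print(f"hit rate @4: {hit_rate(eval_set, fake_retrieve, k=4):.2f}")
```

Rerun the same test set after every change to chunk size, k, or the retriever, and you have a regression check for retrieval quality.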

Back to Your Company’s Documents

Remember where we started? LLMs know everything about 2024. They know nothing about your company.

But now you’ve got the pieces to change that. Those 47 PDFs your support team keeps emailing around? They’re chunks waiting to be embedded. The product docs living in a shared Google Drive folder that nobody reads because they’re 200 pages long? They’re a vector store that can answer questions in seconds. The onboarding guide your HR team wrote last quarter — the one new hires are supposed to read but never do? It’s a conversational RAG bot that actually gets used.

I built something almost identical to what we walked through today for an internal project last year. Took a Friday afternoon to get the first version running. By Monday, three teams were using it. Not because it was perfect — the chunk sizes were probably wrong, the prompt could’ve been tighter, and we hadn’t added any reranking yet. But it answered questions about our internal docs, with sources, and it didn’t hallucinate. That alone made it more useful than asking the LLM directly.

Your company’s knowledge doesn’t need to be locked in documents nobody reads. It doesn’t need a fine-tuned model to be accessible. It needs a pipeline — load, split, embed, retrieve, generate. That’s RAG. And now you know how to build one.

The LLM still knows nothing about your company. But with a vector store between you and it? Doesn’t need to.
