Build a RAG Chatbot: Complete Step-by-Step Guide (2026)
Learn how to build a RAG chatbot from scratch with Python and LangChain. Step-by-step tutorial with code examples, best practices, and production tips.
The first time I watched a RAG chatbot answer questions about my messy, 200-page internal documentation, I genuinely felt like I’d discovered a superpower. No more Ctrl+F hunting through endless PDFs. No more “I think it’s somewhere in section 7…” moments. Just ask a question, get an answer grounded in my actual data.
Here’s the thing about regular chatbots and LLMs: they’re brilliant, but they’ve got a glaring problem. They can only work with what they were trained on. Ask ChatGPT about your company’s internal processes? It’ll confidently make something up. That’s where RAG comes in—it’s the bridge between your private knowledge and the intelligence of modern language models.
In this tutorial, I’ll walk you through building a RAG chatbot from scratch using Python and LangChain. By the end, you’ll have a working system that can answer questions about your own documents. No PhD required—just some Python basics and a willingness to experiment.
Let’s build something that actually knows your stuff.
What Is a RAG Chatbot (And Why Should You Care)?
A RAG (Retrieval-Augmented Generation) chatbot is an AI system that combines a retrieval mechanism with a large language model. When you ask a question, it first searches a knowledge base for relevant information, then uses an LLM to generate a response grounded in that retrieved context—dramatically reducing hallucinations and providing answers based on actual data.
Think of it like the difference between a closed-book exam and an open-book exam. A regular LLM takes a closed-book test—it can only use what’s in its memory (training data). A RAG chatbot takes an open-book test—it can look up information before answering. Which would you rather have answering questions about your company’s policies?
The adoption numbers tell the story. According to 2024 industry research, 51% of large enterprises have now adopted RAG, up from just 31% the previous year. The RAG market is projected to hit $2.76 billion in 2026, with studies showing RAG systems reduce hallucinations by 70-90% compared to standard LLMs. This isn’t hype—it’s becoming essential infrastructure for AI agents and intelligent applications.
RAG vs Fine-Tuning vs Prompting
| Approach | How It Works | Best For | Cost |
|---|---|---|---|
| Prompting | Include info directly in the prompt | Small, static data | Very Low |
| Fine-Tuning | Train model on your data | Changing model behavior | High ($$$) |
| RAG | Retrieve relevant docs at query time | Large, dynamic knowledge bases | Medium |
Here’s my honest take: for 90% of “I want the AI to know about my stuff” use cases, RAG is the answer. Fine-tuning is overkill (and expensive), and prompting hits context limits fast. RAG hits the sweet spot.
Understanding RAG Architecture (The Big Picture)
Before we write any code, let’s understand what we’re building. A RAG system has five core components that work together in a pipeline.
The Five Core Components
- Document Loader - Ingests your raw data (PDFs, text files, web pages, etc.)
- Text Splitter - Breaks documents into smaller, manageable chunks
- Embedding Model - Converts text chunks into numerical vectors that capture meaning
- Vector Database - Stores embeddings and enables fast similarity search
- LLM (Large Language Model) - Generates human-readable answers from retrieved context
Think of it as an assembly line. Documents come in one end, get processed through each station, and intelligent answers come out the other end.
How the Query Flow Works
When a user asks a question, here’s what happens behind the scenes:
- The question gets converted into an embedding (same process as the documents)
- That embedding is used to search the vector database for similar chunks
- The most relevant chunks are retrieved and combined with the question
- Everything goes to the LLM, which generates a grounded response
This is what makes RAG powerful—the LLM doesn’t have to remember everything. It just needs to be good at synthesizing and explaining information that gets handed to it. If you’ve explored LangChain and other frameworks, you’ll recognize this as a common pattern.
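If it helps to see that flow as code before we build the real thing, here's a rough sketch — the `embed`, `search`, and `llm` pieces below are placeholders, not a real API; we'll wire up the actual versions step by step in this tutorial.

```python
# Conceptual sketch only — these helpers are placeholders, not a real API.
# Steps 4-7 below build the real embedding model, vector store, and LLM chain.
def answer_question(question, vector_db, embed, llm):
    query_vector = embed(question)                    # 1. embed the question
    top_chunks = vector_db.search(query_vector, k=4)  # 2. find similar chunks
    context = "\n\n".join(top_chunks)                 # 3. combine them with the question
    prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
    return llm(prompt)                                # 4. generate a grounded response
```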
Prerequisites: What You’ll Need
Before we dive into code, let’s make sure you’ve got everything ready:
Technical Requirements:
- Python 3.9 or higher (check with `python --version`)
- pip package manager
- A code editor (VS Code, PyCharm, whatever you’re comfortable with)
- Terminal/command line access
Accounts & API Keys:
- OpenAI API key (get one at platform.openai.com)
- We’ll use GPT-5.2 for generation and text-embedding-3-large for embeddings
Your Knowledge Base:
- Some documents to chat with! PDFs, text files, markdown—bring what you have
- For this tutorial, I’d recommend starting with 2-5 documents (we’ll discuss scaling later)
If you’re new to the OpenAI API, check out our OpenAI API basics guide first. It’ll take 30 minutes and make this tutorial much smoother.
Step 1: Setting Up Your Environment
Let’s create a clean environment for our RAG project.
First, create a project directory and virtual environment:
mkdir rag-chatbot
cd rag-chatbot
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
Now install the required packages:
pip install langchain langchain-openai langchain-community chromadb pypdf python-dotenv
Here’s what each package does:
- `langchain` - The orchestration framework (see LangChain documentation for the full picture)
- `langchain-openai` - OpenAI integrations
- `langchain-community` - Community integrations (document loaders, etc.)
- `chromadb` - Our vector database (local, free, perfect for learning)
- `pypdf` - For loading PDF documents
- `python-dotenv` - For managing environment variables securely
Create a .env file to store your API key:
# .env
OPENAI_API_KEY=sk-your-key-here
Important: Add .env to your .gitignore so you don’t accidentally commit your API key! I’ve seen that happen more times than I’d like to admit.
Let’s verify everything works:
# test_setup.py
from dotenv import load_dotenv
import os
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")
if api_key and api_key.startswith("sk-"):
    print("✅ OpenAI API key loaded successfully!")
else:
    print("❌ API key not found or invalid. Check your .env file.")
Run it with python test_setup.py. You should see that green checkmark.
Step 2: Loading Your Documents
Now let’s get your documents into the system. LangChain provides loaders for almost every format you can imagine.
Supported File Types
| Format | Loader | Notes |
|---|---|---|
| PDF | PyPDFLoader | Most common for business docs |
| Text | TextLoader | Plain .txt files |
| Markdown | UnstructuredMarkdownLoader | Great for technical docs |
| Web Pages | WebBaseLoader | Scrape web content |
| CSV | CSVLoader | Structured data |
Loading PDF Documents
For this tutorial, create a documents/ folder and add some PDFs. Here’s how to load them:
# load_documents.py
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
# Load a single PDF
loader = PyPDFLoader("documents/my_document.pdf")
documents = loader.load()
print(f"Loaded {len(documents)} pages")
print(f"First page preview: {documents[0].page_content[:200]}...")
For loading an entire directory of PDFs:
# Load all PDFs from a directory
loader = DirectoryLoader(
"documents/",
glob="**/*.pdf",
loader_cls=PyPDFLoader
)
documents = loader.load()
print(f"Total documents loaded: {len(documents)}")
for doc in documents[:3]:
    print(f"- Source: {doc.metadata['source']}, Page: {doc.metadata.get('page', 'N/A')}")
Each document object contains:
- `page_content` - The actual text
- `metadata` - Info like source file, page number, etc.
Step 3: Chunking Your Documents (This Is Where Most People Go Wrong)
I’m going to be honest—I spent a week debugging why my RAG couldn’t answer questions about tables in my documents. Turns out, my chunking strategy was splitting tables mid-row, making them completely useless. Chunking matters more than most tutorials admit.
Why Chunking Matters
Here’s the problem: LLMs have context limits. You can’t dump a 200-page PDF into a prompt. So we split documents into smaller chunks that can be:
- Embedded independently
- Retrieved selectively
- Combined as needed
But here’s the catch:
- Chunks too big = You retrieve irrelevant content along with relevant content
- Chunks too small = You lose context and meaning
Choosing the Right Chunk Size
After building several RAG systems, here’s what’s worked for me:
| Document Type | Chunk Size | Overlap | Why |
|---|---|---|---|
| Dense technical docs | 500 tokens | 100 tokens | Preserves smaller concepts |
| Narrative content | 800 tokens | 150 tokens | Keeps context intact |
| FAQs / Q&A | 300 tokens | 50 tokens | Each item is self-contained |
I’ll be honest—the “best” chunk size is still debated in the community. I’ve seen successful systems with 200 tokens and others with 1000+. Start with 500-800 tokens and 10-20% overlap, then iterate based on your results.
Implementing the Splitter
# chunk_documents.py
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Create the splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,       # Target chunk size in characters (~200 tokens)
    chunk_overlap=100,    # Overlap between chunks
    length_function=len,
    separators=["\n\n", "\n", " ", ""]  # Split priority
)
# Split the documents
chunks = text_splitter.split_documents(documents)
print(f"Original documents: {len(documents)}")
print(f"After chunking: {len(chunks)}")
print(f"\nSample chunk:\n{chunks[0].page_content[:300]}...")
The RecursiveCharacterTextSplitter is smart—it tries to split on paragraph breaks first, then sentences, then words. This preserves natural boundaries in your text.
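One caveat: `chunk_size` here is measured in characters, not tokens (roughly four characters per token for English text), which is why 800 characters works out to about 200 tokens. If you'd rather think in tokens to match the table above, the splitter can measure length with a tiktoken encoder instead — a quick sketch, assuming you've also run `pip install tiktoken`:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Here chunk_size and chunk_overlap are counted in tokens, because lengths
# are measured with the tiktoken encoding (requires `pip install tiktoken`).
token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=500,      # ~500 tokens per chunk
    chunk_overlap=100    # ~100 tokens of overlap
)
token_chunks = token_splitter.split_documents(documents)
print(f"Token-based chunking produced {len(token_chunks)} chunks")
```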
Pro tip: Look at your chunks! Seriously. Print a few and make sure they contain coherent information. If you see tables split mid-row or sentences cut in half, adjust your parameters.
Step 4: Creating Embeddings
Embeddings are the magic that makes semantic search possible. Instead of matching keywords, embeddings match meaning.
What Are Embeddings?
An embedding is a numerical representation of text—a list of numbers (3,072 dimensions for OpenAI’s text-embedding-3-large, the model we’re using) that captures the semantic meaning. Similar concepts end up as similar vectors.
For example:
- “dog” and “puppy” → similar embeddings
- “dog” and “refrigerator” → different embeddings
- “I love pizza” and “Pizza is my favorite food” → similar embeddings
This is why RAG can answer “what’s your refund policy?” even if the document says “returns and exchanges”—the embeddings are similar enough to retrieve the right content.
Generating Embeddings with OpenAI
LangChain makes this almost too easy. We’ll use OpenAI’s text-embedding-3-large model—it’s the best available as of January 2026. Check the OpenAI embeddings documentation for the latest pricing and capabilities.
# create_embeddings.py
from langchain_openai import OpenAIEmbeddings
from dotenv import load_dotenv
load_dotenv()
# Initialize the embedding model
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-large"
)
# Test with a single text
test_embedding = embeddings.embed_query("What is a RAG chatbot?")
print(f"Embedding dimension: {len(test_embedding)}")
print(f"First 5 values: {test_embedding[:5]}")
Cost note: OpenAI’s embeddings are cheap but not free—about $0.00013 per 1K tokens for text-embedding-3-large. For most projects, this is negligible. I’ve embedded millions of tokens for under $10.
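Want to see the “matching meaning” idea for yourself? Here’s a small side experiment (not part of the pipeline) that scores the example pairs above with cosine similarity, reusing the `embeddings` object we just created:

```python
# similarity_demo.py — quick illustration of semantic similarity.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

dog = embeddings.embed_query("dog")
puppy = embeddings.embed_query("puppy")
fridge = embeddings.embed_query("refrigerator")

print(f"dog vs puppy:        {cosine_similarity(dog, puppy):.3f}")   # noticeably higher
print(f"dog vs refrigerator: {cosine_similarity(dog, fridge):.3f}")  # noticeably lower
```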
Step 5: Setting Up Your Vector Database
Now we need somewhere to store these embeddings and search through them efficiently. That’s the vector database.
ChromaDB vs Pinecone vs Others
I’ve used most of the popular options, and here’s my honest assessment:
| Database | Best For | Pros | Cons |
|---|---|---|---|
| ChromaDB | Learning, small projects | Free, local, easy setup | Doesn’t scale well |
| Pinecone | Production | Managed, scalable, fast | Costs money at scale |
| Weaviate | Production | Open source, feature-rich | More complex setup |
| Qdrant | Production | Great performance, Rust-based | Newer, smaller community |
My recommendation: Start with ChromaDB locally, migrate to Pinecone when you need scale. You can swap them with minimal code changes. For code examples of working with different databases, see our vector database snippets.
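To show what “minimal code changes” actually means, here’s roughly what the Pinecone version of the next step looks like — a sketch, assuming you’ve run `pip install langchain-pinecone`, set a `PINECONE_API_KEY` environment variable, and created an index whose dimension matches the embedding model (3,072 for text-embedding-3-large). The index name below is just an example.

```python
# Sketch of the Pinecone swap — assumes langchain-pinecone is installed,
# PINECONE_API_KEY is set, and an index with dimension 3072 already exists.
from langchain_pinecone import PineconeVectorStore
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

vectorstore = PineconeVectorStore.from_documents(
    documents=chunks,           # the same chunks from Step 3
    embedding=embeddings,
    index_name="rag-chatbot"    # example index name
)

# Everything downstream (retriever, chain) stays the same:
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
```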
Creating Your Vector Store with ChromaDB
Here’s where everything comes together. We’ll create embeddings for all our chunks and store them in ChromaDB:
# create_vectorstore.py
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from dotenv import load_dotenv
load_dotenv()
# 1. Load documents
print("Loading documents...")
loader = DirectoryLoader("documents/", glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()
print(f"Loaded {len(documents)} pages")
# 2. Chunk documents
print("Chunking documents...")
text_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")
# 3. Create embeddings and store in ChromaDB
print("Creating embeddings and storing in vector database...")
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"  # Save to disk
)
print(f"✅ Vector store created with {vectorstore._collection.count()} documents")
This does three things in one step:
- Generates embeddings for all chunks
- Stores them in ChromaDB
- Persists to disk so you don’t re-embed every time
Test that it works:
# Quick search test
results = vectorstore.similarity_search("What is the main topic?", k=3)
for i, doc in enumerate(results):
    print(f"\nResult {i+1}:")
    print(doc.page_content[:200])
If you’re building AI agents, RAG is often a core component. The vector store becomes the agent’s long-term memory.
Step 6: Building the Retriever
The retriever is the component that fetches relevant documents from your vector store. It’s simpler than it sounds.
Understanding Similarity Search
When you make a query, the retriever:
- Converts your query to an embedding
- Compares it against all stored embeddings using cosine similarity
- Returns the top k most similar chunks
The k parameter is important—it determines how many chunks you retrieve:
- k too low (1-2) = Might miss relevant information
- k too high (10+) = Includes irrelevant content, confuses the LLM
I typically start with k=4 and adjust based on results.
Creating the Retriever
# create_retriever.py
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from dotenv import load_dotenv
load_dotenv()
# Load existing vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)
# Create retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}  # Return top 4 matches
)
# Test it
query = "What is the refund policy?"
docs = retriever.invoke(query)
print(f"Found {len(docs)} relevant documents:")
for i, doc in enumerate(docs):
    print(f"\n--- Document {i+1} ---")
    print(f"Source: {doc.metadata.get('source', 'Unknown')}")
    print(f"Content: {doc.page_content[:200]}...")
If the retrieved documents look relevant to your query, you’re on the right track. If not, revisit your chunking strategy—that’s usually the culprit.
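One more knob worth knowing about: if the top results keep coming back as near-duplicates of each other, you can ask the retriever for more diverse results with MMR (maximal marginal relevance). A quick sketch using the same vector store:

```python
# Optional: MMR re-ranks candidates for diversity, which helps when the
# top similarity hits are near-copies of the same passage.
mmr_retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20}  # fetch 20 candidates, keep 4 diverse ones
)
docs = mmr_retriever.invoke("What is the refund policy?")
```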
Step 7: Connecting the LLM
Now for the exciting part—connecting GPT-5.2 to generate answers using our retrieved context.
Crafting the Prompt Template
The prompt is crucial. We need to tell the LLM to:
- Only use the provided context
- Admit when it doesn’t know
- Answer naturally, not robotically
from langchain.prompts import ChatPromptTemplate
template = """You are a helpful assistant that answers questions based on the provided context.
Rules:
- Only answer based on the context below. Do not use outside knowledge.
- If the context doesn't contain the answer, say "I don't have that information in my knowledge base."
- Be concise but complete.
- If relevant, mention which part of the documentation you're referencing.
Context:
{context}
Question: {question}
Answer:"""
prompt = ChatPromptTemplate.from_template(template)
Creating the RAG Chain
LangChain’s Expression Language (LCEL) makes this clean:
# create_rag_chain.py
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from dotenv import load_dotenv
load_dotenv()
# Setup components
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
llm = ChatOpenAI(
    model="gpt-5.2",
    temperature=0.1  # Low temperature for factual answers
)
prompt = ChatPromptTemplate.from_template("""You are a helpful assistant that answers questions based on the provided context.
Rules:
- Only answer based on the context below. Do not use outside knowledge.
- If the context doesn't contain the answer, say "I don't have that information in my knowledge base."
- Be concise but complete.
Context:
{context}
Question: {question}
Answer:""")
# Format retrieved documents into a single string
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
# Create the chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
# Test it!
response = rag_chain.invoke("What are the main features of the product?")
print(response)
Run this and ask your first question. The moment it answers correctly based on YOUR documents… that’s the magic of RAG.
Step 8: Adding Conversational Memory
A single Q&A is great, but real chatbots need memory. “What about the second one?” only makes sense if the bot remembers what you were talking about.
LangChain provides memory components to track conversation history:
# rag_with_memory.py
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from dotenv import load_dotenv
load_dotenv()
# Setup
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
llm = ChatOpenAI(model="gpt-5.2", temperature=0.1)
# Set up memory
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
    output_key="answer"
)
# Create conversational chain
qa_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory,
    return_source_documents=True,
    verbose=True
)
# Have a conversation!
def chat(question):
    result = qa_chain.invoke({"question": question})
    return result["answer"]
# Test multi-turn conversation
print("Bot:", chat("What features does the product have?"))
print("\n")
print("Bot:", chat("Tell me more about the first one")) # Uses memory!
print("\n")
print("Bot:", chat("How much does that cost?")) # Continues context
The ConversationBufferMemory keeps the full chat history. For longer conversations, consider ConversationSummaryMemory which summarizes older messages to save tokens.
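If you want to try the summary approach, the swap is small — a sketch, with the caveat that building the running summary costs an extra LLM call per exchange:

```python
# Optional: summary-based memory condenses older turns instead of storing
# them verbatim. Note the extra LLM call per exchange to update the summary.
from langchain.memory import ConversationSummaryMemory

memory = ConversationSummaryMemory(
    llm=llm,                    # used to write the running summary
    memory_key="chat_history",
    return_messages=True,
    output_key="answer"
)
```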
Step 9: Testing and Evaluation
Your RAG chatbot is working—but is it working well? Let’s verify.
Quick Smoke Tests
Run these tests before deploying anything:
| Test Type | What to Ask | Expected Behavior |
|---|---|---|
| Happy path | Question with clear answer in docs | Accurate, sourced response |
| Multi-chunk | Question requiring info from multiple places | Synthesized answer |
| Not in docs | Question on topic NOT covered | “I don’t have that information” |
| Edge case | Very specific or unusual phrasing | Should still retrieve correctly |
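You can run these by hand, but a tiny harness makes them repeatable — a sketch that reuses the `rag_chain` from Step 7 (the questions are placeholders; swap in ones that fit your documents):

```python
# smoke_test.py — minimal manual-review harness. Replace these placeholder
# questions with ones that match your own knowledge base.
test_questions = [
    "What is the refund policy?",          # happy path
    "Compare the basic and pro plans",     # multi-chunk
    "Who won the 1998 World Cup?",         # not in docs
    "refund policy???",                    # unusual phrasing
]

for question in test_questions:
    print(f"\nQ: {question}")
    print(f"A: {rag_chain.invoke(question)}")
    # Review each answer by hand: is it grounded in the retrieved context,
    # and does it admit when the information isn't there?
```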
Evaluation Approaches
For basic projects, manual review works fine. For production, consider:
Retrieval metrics:
- Precision@K - Of the k documents retrieved, how many were relevant?
- Recall - Of all relevant documents, how many did we find?
Generation metrics:
- Groundedness - Is the answer supported by retrieved context?
- Relevance - Does the answer address the question?
- Faithfulness - Did the LLM stay faithful to the sources?
Tools like Ragas and TruLens can automate this evaluation. For now, just manually review 20-30 question-answer pairs with different types.
Common Issues and Debugging
| Symptom | Likely Cause | Fix |
|---|---|---|
| Wrong documents retrieved | Chunk size issue | Try smaller chunks, more overlap |
| Hallucinating despite good retrieval | Prompt issue | Strengthen “only use context” instruction |
| Slow responses | Too many chunks retrieved | Reduce k, or optimize embedding |
| Answers are too generic | Retrieved chunks too broad | Smaller chunks, better query |
Production Considerations
Your local prototype works. Now let’s talk about what changes for production.
Scaling Your RAG System
Here’s a story: my first RAG system worked beautifully with 5 PDFs. Then I added 50 and wondered why retrieval quality tanked. More documents = more noise = harder retrieval.
Scaling strategies:
- Migrate to managed vector DB - Pinecone, Weaviate, or Qdrant handle scale much better than ChromaDB
- Batch embedding - Don’t embed one document at a time; batch process
- Metadata filtering - Add metadata (date, category, source) to filter before similarity search (see the sketch after this list)
- Hybrid search - Combine semantic + keyword search for better precision
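Here’s what the metadata-filtering idea looks like with the ChromaDB setup from earlier — a sketch, assuming your chunks were stored with a `category` field in their metadata (add one at load time if they weren’t):

```python
# Sketch: restrict retrieval to chunks whose metadata matches a filter.
# Assumes the chunks carry a "category" metadata field; the value is an example.
filtered_retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 4,
        "filter": {"category": "billing"}  # example category value
    }
)
docs = filtered_retriever.invoke("What is the refund policy?")
```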
Monitoring and Maintenance
In production, you need visibility:
- Log queries and responses - For debugging and improvement
- Track retrieval quality - Are retrieved docs getting less relevant over time?
- Monitor latency - End-to-end response time
- Update knowledge base - Documents change; re-embed when they do
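On that last point, adding new material doesn’t mean rebuilding everything — a sketch of appending to the existing Chroma store, reusing the loader and splitter from Steps 2-3 (the file name is just an example; for documents that changed, you’d delete the old entries first so you don’t retrieve stale copies):

```python
# Sketch: append newly added documents to the existing vector store.
# The file name is an example; for changed documents, remove the old
# entries by ID before re-adding, or you'll retrieve stale and fresh copies.
new_loader = PyPDFLoader("documents/new_policy.pdf")
new_chunks = text_splitter.split_documents(new_loader.load())
vectorstore.add_documents(new_chunks)
```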
Security Considerations
A few things that are easy to overlook:
- Don’t expose raw chunks - Sanitize what goes into responses
- Input validation - Prompt injection is real; sanitize user input
- Rate limiting - Protect against abuse
- Access control - Not everyone should query every document
Troubleshooting Common Problems
A quick reference for when things go wrong:
“My chatbot says it doesn’t know, but the answer IS in the docs”
- Check retrieval first—are the right chunks being retrieved?
- If not: chunk size may be too large, or embeddings might not capture the meaning well
- If yes: prompt might be too restrictive, or answer is spread across chunks that aren’t retrieved together
“The chatbot makes stuff up”
- Strengthen the “only use context” prompt instruction
- Reduce temperature to 0.0 or 0.1
- Check if retrieved context is actually relevant
- Add explicit “if unsure, say so” instruction
“Responses are too slow”
- Reduce k (fewer documents to process)
- Use a faster model for simple queries
- Cache frequent questions (see the sketch below)
- Check if embedding is the bottleneck
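A cache doesn’t need to be fancy — an in-memory dictionary handles the “same question asked over and over” case. A sketch (for production you’d likely want Redis or similar, with an expiry):

```python
# Minimal in-memory cache for repeated questions. Fine for a single process;
# use something like Redis with a TTL for real deployments.
answer_cache = {}

def cached_ask(question):
    key = question.strip().lower()
    if key not in answer_cache:
        answer_cache[key] = rag_chain.invoke(question)
    return answer_cache[key]
```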
“Memory seems to not work”
- Verify memory is initialized before the chain
- Check that `memory_key` matches what the chain expects
- Print `memory.load_memory_variables({})` to debug
Frequently Asked Questions
What is the difference between RAG and fine-tuning?
RAG retrieves external data at query time and uses it as context for the LLM. Fine-tuning actually changes the model’s weights by training on your data. RAG is cheaper, easier to update (just change your documents), and better for large, dynamic knowledge bases. Fine-tuning is better for changing the model’s behavior or tone.
What vector database should I use for RAG?
For learning and small projects, ChromaDB is perfect—it’s free, local, and requires no external setup. For production, I’d recommend Pinecone (managed, easiest to scale) or Weaviate (open source, more control). The good news: LangChain makes switching between them relatively painless.
How do I reduce hallucinations in my RAG chatbot?
Focus on three areas: 1) Better chunking that preserves context, 2) Retrieving truly relevant documents (experiment with k and chunk size), and 3) Prompt engineering that explicitly instructs the LLM to only use provided context and admit when it doesn’t know. Some teams add a post-processing step that verifies claims against retrieved sources.
What’s the best chunk size for RAG?
There’s no universal answer (I know, frustrating). 500-800 tokens with 100-200 overlap is a solid starting point for most document types. But it depends on your content—dense technical docs might need smaller chunks, while narrative content can use larger ones. The real answer: test with your actual documents and evaluate retrieval quality.
Can I build RAG without LangChain?
Absolutely. LangChain is a convenience layer, not a requirement. You can use the OpenAI SDK directly for embeddings and chat, combined with ChromaDB or Pinecone’s native Python clients. It’s more code, but you get more control. LangChain is great for prototyping; some teams strip it out for production simplicity.
How much does it cost to run a RAG chatbot?
The main costs are embeddings and LLM calls. OpenAI’s text-embedding-3-large costs about $0.00013/1K tokens—pennies for most datasets. GPT-5.2 costs approximately $0.01-0.03/1K tokens depending on input/output. For a small project with a few hundred documents and moderate usage, expect a few dollars per month. It scales linearly with usage.
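As a rough back-of-envelope check, using the prices quoted above (illustrative assumptions only — plug in your own document count, query volume, and current pricing):

```python
# Illustrative estimate only — every number here is an assumption.
embed_tokens = 400 * 500                                  # ~400 pages x ~500 tokens/page
embed_cost = embed_tokens / 1000 * 0.00013                # one-time: ~$0.03

queries_per_month = 300
tokens_per_query = 1500                                   # context + question + answer
llm_cost = queries_per_month * tokens_per_query / 1000 * 0.01  # ~$4.50/month

print(f"One-time embedding cost: ${embed_cost:.2f}")
print(f"Monthly LLM cost:        ${llm_cost:.2f}")
```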
Conclusion
You just built a RAG chatbot. Let that sink in for a moment.
You took documents—your messy, unstructured, nobody-reads-them documents—and turned them into an intelligent assistant that can actually answer questions. That’s not trivial.
Here’s what I hope you take away:
- Chunking is where most people go wrong. Don’t skip this step. Look at your chunks. Adjust until they make sense as standalone units.
- Start simple, then iterate. ChromaDB + basic retrieval + minimal prompt is enough to prove the concept. Don’t over-engineer on day one.
- RAG is accessible. You don’t need a PhD or a million-dollar budget. You just proved that with a handful of Python scripts.
So what’s next? Go build something useful with YOUR data. Whether it’s internal documentation, customer support articles, or research papers—you now have the skills to make that knowledge conversational. For ready-to-copy code, check our RAG code snippets. And if you want to build a web interface for your chatbot, our Streamlit AI tutorial shows you how.
And when you’re ready to go deeper, look into AI agents that use RAG as their knowledge layer, or explore advanced patterns like agentic RAG and graph RAG. But that’s for another tutorial.
Happy building. 🚀