What is RAG (Retrieval-Augmented Generation)?
You ask ChatGPT a question about your company's internal HR policy and it confidently makes up an answer. Or you ask Claude about a bug in a library released last month and it cites APIs that don't exist. The model isn't "lying" — it just doesn't have that information in its weights. RAG is the most common fix.
The core idea in one paragraph
RAG stands for Retrieval-Augmented Generation. Instead of relying only on what an LLM memorized during training, you do two things at query time: (1) retrieve the most relevant chunks of text from an external knowledge source — usually a vector database, sometimes a regular keyword index or SQL store — and (2) stuff those chunks into the prompt before asking the model to generate an answer. The model then answers based on the retrieved context, not its parametric memory.
So a typical RAG prompt looks like: "Here are 5 passages from our internal docs. Using only this information, answer the user's question: …". That's it. The cleverness is not in the generation step — it's in retrieving the right 5 passages out of potentially millions.
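In code, the generation half really is that thin. A minimal sketch of the prompt assembly, where retrieve() is a hypothetical stand-in for whatever search backend you use:

```python
# Sketch of the generation step. retrieve() is hypothetical: it returns the
# top-k passages from whatever index you built. All the leverage is upstream.
def build_prompt(question: str, retrieve, k: int = 5) -> str:
    passages = retrieve(question, k=k)  # list[str], best match first
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        f"Here are {k} passages from our internal docs.\n\n"
        f"{context}\n\n"
        f"Using only this information, answer the user's question: {question}"
    )
```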
Why RAG exists: the three problems it solves
LLMs have three well-known limitations that RAG directly addresses.
Knowledge cutoffs. GPT-4o, Claude Sonnet, Gemini — they all stop knowing the world at some training date. Anything after that, including your company's docs that they've never seen at all, is invisible to them. RAG injects fresh or private information at inference time, so you don't need to retrain the model every time a Confluence page changes.
Hallucinations. When a model doesn't know, it tends to generate plausible-sounding nonsense. Grounding it in retrieved passages reduces (but does not eliminate) hallucination, because the model has actual text to copy from rather than guessing. Even better, you can show citations — "this answer came from page 14 of the employee handbook" — which makes the system auditable.
Context window economics. Even with Gemini's 2M-token window or Claude's 200K, you can't paste your entire 50GB document corpus into every prompt. And you wouldn't want to — it's slow and expensive. RAG selects only the relevant slice. A good retrieval system might pull 8 chunks of 500 tokens each — 4K tokens of context — instead of dumping everything.
How a RAG system is actually built
A production RAG pipeline has two phases that run at different times.
Indexing (offline, runs when documents change):
- Load documents from your sources — Notion, Google Drive, GitHub, PDFs, a database.
- Chunk them into smaller pieces. Naive splitting by 500 tokens works but is rough; semantic chunking (split by section, paragraph, or markdown header) usually retrieves better.
- Embed each chunk into a vector using a model like OpenAI's text-embedding-3-small, Cohere's embed-v3, or the open-source bge-m3. Each chunk becomes a vector of roughly 768 to 1536 dimensions, depending on the model.
- Store the vectors in a vector database — Pinecone, Weaviate, Qdrant, pgvector, or Chroma for prototypes. (Both steps are sketched below.)
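Stitched together, the indexing phase can be a short script. A minimal sketch using OpenAI embeddings and an in-memory Chroma collection; the character-based chunker and its sizes are crude placeholders, not recommendations:

```python
import chromadb
from openai import OpenAI

oai = OpenAI()  # assumes OPENAI_API_KEY is set
db = chromadb.Client()  # in-memory; use a persistent client for real data
collection = db.create_collection("docs")

def chunk(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    # Crude fixed-size windows with overlap (~500 tokens at ~4 chars/token).
    # Semantic chunking (by section or header) usually retrieves better.
    step = size - overlap
    return [text[i : i + size] for i in range(0, len(text), step)]

def index_document(doc_id: str, text: str) -> None:
    chunks = chunk(text)
    vectors = oai.embeddings.create(
        model="text-embedding-3-small", input=chunks
    ).data
    collection.add(
        ids=[f"{doc_id}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=[v.embedding for v in vectors],
        metadatas=[{"source": doc_id}] * len(chunks),
    )
```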
Query time (online, runs per user request):
- Embed the user's question with the same embedding model.
- Run a similarity search (cosine or dot product) to find the top-k most similar chunks.
- Optionally rerank with a cross-encoder like Cohere Rerank or bge-reranker — this matters more than people think.
- Compose a prompt with the retrieved chunks + question and send it to the LLM.
- Return the answer, ideally with citations linking back to the source chunks.
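The query path, continuing the indexing sketch above (the model name is an assumption, and the reranking step is omitted for brevity):

```python
def ask(question: str, k: int = 5) -> str:
    # 1. Embed the question with the SAME model used at indexing time.
    q_vec = oai.embeddings.create(
        model="text-embedding-3-small", input=[question]
    ).data[0].embedding
    # 2. Top-k similarity search against the collection.
    hits = collection.query(query_embeddings=[q_vec], n_results=k)
    context = "\n\n".join(
        f"[{meta['source']}] {doc}"
        for doc, meta in zip(hits["documents"][0], hits["metadatas"][0])
    )
    # 3. Ground the model in the retrieved chunks, and ask for citations.
    resp = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Answer using ONLY the provided context. "
                           "Cite the [source] of each claim.",
            },
            {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```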
The whole thing can be 200 lines of Python with LangChain or LlamaIndex. The hard part is not the code — it's tuning chunk size, retrieval quality, and prompt structure for your data.
What RAG looks like in real products
Cursor and GitHub Copilot do RAG over your codebase: when you ask "where do we handle Stripe webhooks?", they don't fine-tune on your repo, they retrieve the relevant files. Perplexity does RAG over the live web — that's basically its whole product. Notion AI's "Q&A" retrieves over your workspace. ChatGPT's "Custom GPT with knowledge" is a managed RAG service. Claude Projects is the same idea: upload files, ask questions, retrieval happens behind the scenes.
Inside enterprises, the most common RAG use cases are: customer support bots over a help-center corpus, internal knowledge assistants over Confluence/SharePoint, legal/contract search, and developer assistants over private code.
When NOT to use RAG
RAG is not always the answer, and the AI consulting industry has a bad habit of selling it for everything.
Skip RAG when the answer fits in the context window. If your knowledge base is one 30-page PDF, just paste it in. Long-context models like Gemini 1.5 Pro or Claude with prompt caching make this cheap and avoid an entire infrastructure layer. You only need RAG when the corpus is bigger than what fits comfortably in a prompt.
Skip RAG when you need behavior change, not knowledge. If you want the model to write in your brand voice, follow a specific output format, or learn a new task, RAG won't help — that's what fine-tuning is for. RAG adds facts; fine-tuning adds skills.
Skip RAG for math, reasoning, or aggregation. "How many of our 10,000 customer tickets mention shipping delays?" is not a retrieval problem; it's an analytics problem. Use SQL or a structured pipeline. RAG retrieves passages, it doesn't count things.
Be skeptical when the data is highly structured. If your "knowledge" is actually a product catalog, an inventory, or a CRM, a regular database query usually beats vector search on accuracy and latency. Hybrid setups (function calling → SQL) often work better than pure RAG.
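For a concrete flavor of that hybrid, here is a sketch using OpenAI-style function calling against a hypothetical SQLite product catalog. The table, schema, and tool definition are invented for illustration, and in production you would validate or sandbox any model-written SQL:

```python
import json
import sqlite3

from openai import OpenAI

oai = OpenAI()
conn = sqlite3.connect("catalog.db")  # hypothetical products database

tools = [{
    "type": "function",
    "function": {
        "name": "run_sql",
        "description": "Run a read-only SQL query against the products "
                       "table (columns: sku, name, price, stock).",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

resp = oai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "How many SKUs are out of stock?"}],
    tools=tools,
)
call = resp.choices[0].message.tool_calls[0]  # check for None in real code
sql = json.loads(call.function.arguments)["query"]
rows = conn.execute(sql).fetchall()  # never run unvalidated SQL in production
```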
The problems nobody tells you about
Demos look magical. Production is messy. The real failure modes:
- Retrieval is the bottleneck, not generation. If the right chunk isn't in the top-k, the model can't answer correctly no matter how smart it is. Most "RAG doesn't work" complaints are actually "our retrieval recall is 40%."
- Chunking destroys context. A passage that says "this approach is dangerous" is meaningless without knowing which approach. Good chunking preserves enough surrounding context, often via overlapping windows or parent-document retrieval.
- Embeddings have blind spots. They handle paraphrase well but struggle with rare entities, acronyms, and numbers. Hybrid search (BM25 + vector) almost always beats pure vector search; see the fusion sketch after this list.
- Stale indexes. When documents change, your index needs to update. This sounds obvious until you have a pipeline that silently breaks for three weeks.
- Evaluation is hard. "Did the answer use the right source?" is harder to measure than "did the model say something plausible?" Tools like Ragas, TruLens, and LangSmith help, but you'll still need a human-labeled eval set; a minimal recall check is sketched after this list.
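Two of those fixes fit in a few lines each. First, the usual way to merge a BM25 ranking with a vector ranking is reciprocal rank fusion; k=60 is the constant from the original RRF paper and a common default:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each ranking is a list of doc ids, best first. A document scores
    # 1/(k + rank) in every list it appears in; its fused score is the sum.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = rrf([bm25_ids, vector_ids])
```

Second, even a small hand-labeled set buys you the metric that matters most: retrieval recall@k. A minimal sketch, where retrieve() returns chunk ids and each example pairs a question with the chunk ids that actually contain the answer:

```python
def recall_at_k(examples, retrieve, k: int = 8) -> float:
    # examples: list of (question, relevant_chunk_ids) pairs.
    hits = sum(
        1
        for question, relevant in examples
        if set(retrieve(question, k=k)) & set(relevant)
    )
    return hits / len(examples)
```

If that number comes back at 40%, no amount of prompt engineering will fix your answers.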
Next steps
If you're learning RAG, the concepts worth following up on are: embeddings (how text becomes vectors), vector databases (where those vectors live), chunking strategies, hybrid search and reranking, fine-tuning vs RAG (when each makes sense), long context windows (the alternative for small corpora), and agentic RAG (where the model decides what and when to retrieve, instead of always retrieving once).
Build a toy version first — 100 documents, Chroma, OpenAI embeddings, 50 lines of code. Then break it on purpose with hard questions, watch where retrieval fails, and you'll learn more in an afternoon than from any tutorial.