
Technique

Retrieval-Augmented Generation (RAG)

A technique that lets an LLM look up relevant documents at query time and use them to ground its answer, reducing hallucinations.

Retrieval-Augmented Generation (RAG) is a technique that combines a search system with a large language model. Before the model answers a question, a retriever first pulls relevant chunks of text from an external knowledge base (typically a vector database of document embeddings) and inserts them into the prompt as context. The LLM then generates its answer grounded in those retrieved passages.

RAG matters because LLMs by themselves know only what was in their training data, and they tend to hallucinate when asked about niche topics, recent events, or private company data. RAG sidesteps both problems: you can plug in your own up-to-date documents (product docs, internal wikis, legal contracts) without retraining the model, and the answer cites real sources you can verify.

A typical example is a customer support chatbot. When a user asks "how do I cancel my subscription?", the system embeds the question, searches a vector store of help-center articles, retrieves the top 3-5 most relevant passages, and feeds them to Claude or GPT alongside the question. The model writes a natural-language answer using only what is in those passages, and can quote them. A minimal code sketch of this flow appears at the end of this entry.

RAG is now the default architecture for "chat with your documents" products, enterprise knowledge assistants, and many coding agents that need to look up API docs. Quality depends heavily on the retrieval step: bad search means bad answers, no matter how strong the LLM is.

Related concepts to explore next: vector database, embeddings, chunking, hybrid search, reranking, and context window.
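To make the pipeline concrete, here is a minimal sketch in Python. It is illustrative, not any particular library's API: embed() is a toy bag-of-words stand-in for a real embedding model, the docs list stands in for an indexed help center, and the final LLM call is left as a placeholder. The names retrieve and build_prompt are hypothetical helpers introduced for this example.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for a real embedding model. This toy version hashes
    words into a fixed-size bag-of-words vector and normalizes it."""
    vec = np.zeros(256)
    for word in text.lower().split():
        vec[hash(word) % 256] += 1.0  # hash is stable within one process run
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def retrieve(query: str, docs: list[str], doc_vecs: np.ndarray, k: int = 3) -> list[str]:
    """Embed the query and return the k passages with the highest cosine similarity."""
    q = embed(query)
    scores = doc_vecs @ q  # vectors are unit-length, so dot product = cosine
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]

def build_prompt(query: str, passages: list[str]) -> str:
    """Insert the retrieved passages into the prompt so the LLM answers from them."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the passages below. "
        "Cite passage numbers.\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )

# Index the knowledge base once (help-center articles, wiki pages, ...),
# then run retrieve + generate per query.
docs = [
    "To cancel your subscription, open Settings > Billing and click Cancel.",
    "Refunds are processed within 5 business days of cancellation.",
    "You can upgrade your plan at any time from the Billing page.",
]
doc_vecs = np.stack([embed(d) for d in docs])

question = "How do I cancel my subscription?"
passages = retrieve(question, docs, doc_vecs, k=2)
prompt = build_prompt(question, passages)
print(prompt)  # in a real system, this prompt is sent to the LLM API
```

A production system would swap embed() for a hosted embedding model, the docs list for a vector database, and the final print for an actual LLM call, but the shape of the pipeline (embed, search, assemble prompt, generate) stays the same.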

Last updated: 2026-04-29
