
How to build your first RAG stack in 2026: a practical pick guide

Embedding model, vector DB, retriever, reranker, generator — five layers, dozens of options. Here's the no-overengineering path.

Most first-time RAG projects fail not because the technology is hard, but because the team picks the trendiest tool at each of the five layers and ends up with a Frankenstein. Build the simplest version that could possibly work first. You can always swap out one piece later.

The five layers, in order

A basic RAG stack has five components. Skipping or simplifying any of them is fine — sometimes preferable — but you need to know they exist:

  1. Document loader / chunker — turns your source material into pieces small enough to embed.
  2. Embedding model — turns each chunk into a vector.
  3. Vector store — stores the vectors with their text and metadata.
  4. Retriever — given a user query, finds the most relevant chunks.
  5. Generator (LLM) — given the chunks plus the query, writes the answer.

The 80/20 rule: chunking and retrieval matter more than the embedding model or the vector DB. Most teams obsess over the wrong layer.

Layer 1: chunking

Default: split into ~500-token chunks with 50-token overlap, on natural boundaries (paragraph breaks, then sentences). LangChain's RecursiveCharacterTextSplitter or LlamaIndex's SentenceSplitter are the standard tools. Use them.
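If you want to see it concretely, here is a minimal sketch of that default with LangChain's splitter, counting lengths in tokens via tiktoken. The file path is a placeholder.

```python
# Default chunking: ~500-token chunks, 50-token overlap, split on natural boundaries.
# pip install langchain-text-splitters tiktoken
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",            # tokenizer used only for length counting
    chunk_size=500,                          # target tokens per chunk
    chunk_overlap=50,                        # tokens shared between neighboring chunks
    separators=["\n\n", "\n", ". ", " "],    # paragraphs first, then sentences, then words
)

text = open("docs/handbook.txt", encoding="utf-8").read()  # placeholder path
chunks = splitter.split_text(text)
print(len(chunks), "chunks")
```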

When the default fails: documents with strong structure (legal contracts, code, scientific papers with sections). For these, use semantic chunking or section-aware chunking — chunk by the document's natural units. Tools like Unstructured.io or Docling handle PDFs, HTML, and DOCX with structure preservation.

The trap most beginners fall into: making chunks too big ("more context per chunk = better retrieval, right?"). No. Big chunks dilute the retrieval signal. The relevant sentence gets averaged out by 2 paragraphs of unrelated text. Stay small. If you need more context, retrieve more chunks and let the LLM synthesize.

Layer 2: embedding model

For English-only: OpenAI text-embedding-3-small ($0.02 per 1M tokens) is the default. It's fast, cheap, good enough for almost everything. text-embedding-3-large is marginally better, 4× the price, rarely worth it.
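The embedding call itself is a few lines. A sketch, reusing the chunks from the splitter above and assuming your OpenAI API key is in the environment:

```python
# Embed the chunks with text-embedding-3-small (1536-dimensional vectors).
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads the API key from the environment
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=chunks[:2048],  # the API caps batch size; loop in batches for larger corpora
)
vectors = [item.embedding for item in resp.data]
print(len(vectors), "vectors of dimension", len(vectors[0]))
```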

For multilingual or specifically Chinese content: BGE M3 (open-weights, self-host or via Together / Replicate) and Cohere Embed v4 are the two strong choices. BGE M3 is free if you self-host; Cohere is hosted and reliable. OpenAI's embeddings handle Chinese decently but lose to specialist models.

For specialized domains (medical, legal, code): consider domain-fine-tuned embeddings if you have evaluation data showing it helps. Most teams should skip this.

Don't fine-tune embeddings as a first move. Almost always you'll get more lift from better chunking or a reranker.

Layer 3: vector store

For your first project: pgvector if you already have Postgres. Qdrant if you don't. Both are free, both scale to millions of chunks comfortably, both have great DX.
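Here is what the Qdrant path looks like, as a sketch against a local instance, reusing the vectors and chunks from the previous steps:

```python
# Store the chunks and their vectors in a local Qdrant instance.
# pip install qdrant-client   (and e.g. docker run -p 6333:6333 qdrant/qdrant)
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

qdrant = QdrantClient(url="http://localhost:6333")
qdrant.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)
qdrant.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=i, vector=vec, payload={"text": chunk})
        for i, (vec, chunk) in enumerate(zip(vectors, chunks))
    ],
)
```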

When to use Pinecone or other managed services: when ops capacity is the bottleneck, not money. Pinecone is more expensive but you stop thinking about the database. For startups with one engineer, this is often the right trade.

When to use Weaviate, Milvus, Chroma, LanceDB, etc.: rarely on the first build. Each has interesting features (Weaviate's hybrid search, LanceDB's S3-native architecture) but the first version usually doesn't need them. Pick one when you have a specific reason.

Don't pick a vector DB based on benchmark scaling charts. Your first project will have 10k chunks, not 100M. Every option handles 10k chunks instantly.

Layer 4: retrieval

Default: top-k semantic search (k=5 to 10). Embed the query, find the k closest chunks by cosine similarity. This is what every tutorial shows.
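As a sketch, continuing with the Qdrant collection and OpenAI client from above (the question is a placeholder):

```python
# Top-k semantic search: embed the query, find the 10 nearest chunks.
query = "What is the refund window for annual plans?"  # placeholder question

q_vec = client.embeddings.create(
    model="text-embedding-3-small", input=[query]
).data[0].embedding

hits = qdrant.search(collection_name="docs", query_vector=q_vec, limit=10)
candidates = [hit.payload["text"] for hit in hits]
```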

Upgrade #1: hybrid search — combine semantic search with keyword search (BM25). Many queries ("what was the price of the Q3 contract") need keyword matching to nail the right chunk. Hybrid almost always beats pure semantic. Qdrant, Weaviate, and pgvector + tsvector all support this.
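If your store doesn't do hybrid natively, you can fuse the two rankings yourself. A hand-rolled sketch using the rank-bm25 package and reciprocal rank fusion (RRF), purely for illustration; the built-in hybrid modes do the same job server-side:

```python
# Hand-rolled hybrid search: fuse BM25 and vector rankings with RRF.
# pip install rank-bm25
from rank_bm25 import BM25Okapi

bm25 = BM25Okapi([c.lower().split() for c in chunks])
keyword_ranked = bm25.get_top_n(query.lower().split(), chunks, n=20)

def rrf(rankings, k=60):
    # Score each chunk by the sum of 1 / (k + rank) across the ranked lists.
    scores = {}
    for ranked in rankings:
        for rank, chunk in enumerate(ranked):
            scores[chunk] = scores.get(chunk, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf([candidates, keyword_ranked])[:10]
```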

Upgrade #2: reranker — after retrieving 20-30 candidates with vector search, rerank them with a cross-encoder model. Cohere Rerank v3 is the gold standard ($0.001 per query). Self-hosted options like BGE reranker are competitive. A reranker often gives a bigger quality lift than upgrading the embedding model.
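A sketch of the rerank step with Cohere's SDK, applied to the fused candidates from above (the exact model id may differ from what your account exposes):

```python
# Rerank the retrieved candidates with a cross-encoder, keep the best 5.
# pip install cohere
import cohere

co = cohere.Client()  # picks up the API key from the environment
reranked = co.rerank(
    model="rerank-english-v3.0",  # assumed model id; check your account
    query=query,
    documents=fused,
    top_n=5,
)
top_chunks = [fused[r.index] for r in reranked.results]
```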

Upgrade #3: query expansion / hypothetical document embedding (HyDE) — let the LLM rewrite the query before searching. Reasonable lift on ambiguous queries. Add when retrieval is the clear bottleneck.
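A minimal HyDE sketch, reusing the OpenAI client from earlier. The draft-generation model is a placeholder; any cheap model works:

```python
# HyDE: have an LLM draft a hypothetical answer, then search with that text
# instead of the raw query.
draft = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use whatever cheap model you have
    messages=[{
        "role": "user",
        "content": f"Write a short passage that would answer: {query}",
    }],
).choices[0].message.content

hyde_vec = client.embeddings.create(
    model="text-embedding-3-small", input=[draft]
).data[0].embedding
hits = qdrant.search(collection_name="docs", query_vector=hyde_vec, limit=10)
```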

Layer 5: the generator

Claude 4.5 Sonnet is the default RAG generator in 2026. It follows the "only use the provided context" instruction better than GPT-5 and Gemini 2.5 Pro, hallucinates less when context is missing, and produces cleaner citations.

GPT-5 is faster and cheaper at similar quality for simple Q&A. Use it for high-volume customer-facing RAG where latency dominates.

Gemini 2.5 Pro shines when context is enormous (1M+ tokens) — for those cases you can sometimes skip retrieval entirely and dump the whole corpus into context. This works for medium-sized document sets (10-50 PDFs) but doesn't scale to enterprise knowledge bases.

DeepSeek and open-source models are options when you're cost-constrained or need self-hosting. Quality is good but instruction-following on "don't hallucinate, cite the chunk" is weaker.
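Whichever generator you pick, the final step looks roughly the same. A sketch with the Anthropic SDK, using the reranked chunks from earlier; the model id is a placeholder for whatever you actually deploy:

```python
# Final generation step with a strict "answer only from context" system prompt.
# pip install anthropic
import anthropic

context = "\n\n---\n\n".join(top_chunks)
answer = anthropic.Anthropic().messages.create(
    model="claude-sonnet-4-5",  # placeholder id; swap in your chosen generator
    max_tokens=1024,
    system=(
        "Answer using ONLY the provided context. "
        "If the context does not contain the answer, say you don't know. "
        "Cite the chunk you used."
    ),
    messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}],
).content[0].text
print(answer)
```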

When NOT to build RAG

If your knowledge base is under 100k tokens total, just paste it all into the LLM context and skip retrieval. Modern long-context models handle this fine, the latency is acceptable, and you eliminate an entire class of retrieval failures.

If your data changes every few seconds (live transactions, real-time analytics), RAG is the wrong pattern. Use SQL or your existing query layer; just give the LLM tool-use access to query it.

If you don't have evaluation data, you don't have RAG — you have wishful thinking with retrieval. Spend a day producing 50-100 question/expected-answer pairs before tuning anything.
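A day-one version of that eval can be as small as a retrieval hit-rate check. A sketch, with hypothetical question/phrase pairs standing in for your real eval set:

```python
# Minimal retrieval sanity check: does the expected phrase show up in the top-k?
# eval_set is your hand-written list of (question, phrase_that_must_appear) pairs.
eval_set = [
    ("What is the refund window?", "30 days"),            # hypothetical examples
    ("Who approves travel expenses?", "line manager"),
]

hitcount = 0
for question, must_appear in eval_set:
    q_vec = client.embeddings.create(
        model="text-embedding-3-small", input=[question]
    ).data[0].embedding
    hits = qdrant.search(collection_name="docs", query_vector=q_vec, limit=10)
    if any(must_appear.lower() in h.payload["text"].lower() for h in hits):
        hitcount += 1

print(f"retrieval hit rate: {hitcount / len(eval_set):.0%}")
```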

A minimum viable stack

For a real first project:

  • Loader: LlamaIndex SimpleDirectoryReader or LangChain document loaders
  • Chunker: RecursiveCharacterTextSplitter, 500 tokens, 50 overlap
  • Embedder: OpenAI text-embedding-3-small
  • Vector DB: pgvector or Qdrant
  • Retriever: top-10 semantic search + Cohere Rerank v3 → top-5
  • Generator: Claude 4.5 Sonnet with strict "answer only from context" system prompt

That stack costs almost nothing to run, takes about a day to set up, and is competitive with anything more complex on most use cases.

Next steps

  • Read about RAG evaluation: how to measure if your RAG is actually working
  • Look into hybrid search if you have keyword-heavy queries
  • Explore agentic RAG patterns: letting the LLM iterate on retrieval
  • Read about chunk size experiments — your data probably wants a specific number

Last updated: 2026-04-29
