An embedding is a list of numbers — typically 256 to 3,072 of them — that represents the meaning of a piece of text. The magic isn't any single number; it's that texts with similar meanings produce similar lists. Once you can compare meanings as numbers, you can build semantic search, RAG, recommendations, deduplication, and clustering — all the AI features that don't quite need an LLM.
Concretely, what's an embedding
Feed a sentence into an embedding model and it returns a vector. For example, OpenAI's text-embedding-3-small returns 1,536 numbers like:
[0.012, -0.041, 0.087, -0.003, ..., 0.022]
The specific numbers are meaningless to humans. What matters: if you embed two sentences with similar meaning, the resulting vectors point in similar directions. "How do I reset my password?" and "I forgot my login" will have vectors with high cosine similarity (close to 1.0). "How do I reset my password?" and "What's the weather in Tokyo?" will have low similarity (close to 0).
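To make that concrete, here is a minimal sketch using the OpenAI Python SDK and numpy. The helper names (embed, cosine_similarity) are just for illustration, and the exact scores you get will vary by model.

```python
# Sketch: embed two sentences and compare them with cosine similarity.
# Assumes the OpenAI Python SDK is installed and OPENAI_API_KEY is set.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = embed("How do I reset my password?")
related = embed("I forgot my login")
unrelated = embed("What's the weather in Tokyo?")

print(cosine_similarity(query, related))    # noticeably higher
print(cosine_similarity(query, unrelated))  # noticeably lower
```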
This works because the embedding model was trained — on huge text datasets — to map similar meanings to similar regions of a high-dimensional space. The training objective specifically pushes related texts together and unrelated texts apart.
What you actually do with embeddings
Semantic search. Take all your documents, split into chunks, embed each chunk, store in a vector database. When a user searches, embed their query and find the chunks with the closest vectors. This finds relevant docs even when the query uses different words. The core engine of RAG.
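A toy version of that pipeline, searching a handful of hard-coded chunks in memory (reusing the embed helper from the sketch above; the chunk texts are made up):

```python
# Sketch: in-memory semantic search over pre-chunked documents.
import numpy as np

chunks = [
    "To reset your password, open Settings and choose 'Forgot password'.",
    "Our refund policy allows returns within 30 days.",
    "The API rate limit is 100 requests per minute.",
]

# Embed and L2-normalize once at index time; then dot product == cosine similarity.
chunk_vectors = np.array([embed(c) for c in chunks])
chunk_vectors /= np.linalg.norm(chunk_vectors, axis=1, keepdims=True)

def search(query: str, top_k: int = 2) -> list[str]:
    q = embed(query)
    q /= np.linalg.norm(q)
    scores = chunk_vectors @ q                  # cosine similarity per chunk
    best = np.argsort(scores)[::-1][:top_k]     # highest scores first
    return [chunks[i] for i in best]

print(search("I forgot my login"))
```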
Recommendations. Embed user behavior (articles read, products viewed). Find users or items with similar embeddings. Lightweight "more like this" without complex ML.
Deduplication. Embed every record in a database. Find pairs with similarity > 0.95. They're probably duplicates. Useful for cleaning customer lists, support tickets, listings.
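A rough sketch of the pairwise check, assuming you already have a matrix of normalized embeddings (the 0.95 threshold is a starting point, not a law):

```python
# Sketch: flag likely duplicate records by pairwise cosine similarity.
# `vectors` is an (n, d) array of L2-normalized embeddings, one row per record.
import numpy as np

def find_duplicates(vectors: np.ndarray, threshold: float = 0.95) -> list[tuple[int, int]]:
    sims = vectors @ vectors.T   # all-pairs cosine similarity; O(n^2), fine for modest n
    pairs = []
    n = len(vectors)
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] > threshold:
                pairs.append((i, j))
    return pairs
```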
Classification. Embed labeled examples of each category, then classify new items by which category's embeddings they're closest to. Cheap alternative to fine-tuning a classifier.
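One simple way to do this is nearest-centroid classification. A sketch, reusing the embed helper from earlier, with made-up categories and examples:

```python
# Sketch: classify new text by which labeled category's centroid it is closest to.
import numpy as np

examples = {
    "billing": ["I was charged twice", "Where is my invoice?"],
    "technical": ["The app crashes on startup", "I can't upload files"],
}

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

centroids = {
    label: normalize(np.mean([embed(t) for t in texts], axis=0))
    for label, texts in examples.items()
}

def classify(text: str) -> str:
    q = normalize(embed(text))
    return max(centroids, key=lambda label: float(q @ centroids[label]))

print(classify("My payment didn't go through"))  # likely "billing"
```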
Clustering. Run k-means or DBSCAN on a batch of embeddings to group related items. Useful for support ticket triage, content theme discovery.
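A sketch with scikit-learn's KMeans, again reusing the embed helper; the number of clusters is a guess you tune by inspecting the output:

```python
# Sketch: group support tickets into themes with k-means.
import numpy as np
from sklearn.cluster import KMeans

tickets = [
    "I was charged twice this month",
    "Refund still hasn't arrived",
    "The app crashes when I open settings",
    "Crash on startup after the latest update",
]

vectors = np.array([embed(t) for t in tickets])
labels = KMeans(n_clusters=2, n_init="auto", random_state=0).fit_predict(vectors)
for ticket, label in zip(tickets, labels):
    print(label, ticket)   # review a few items per cluster to name the themes
```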
Which embedding model to use
In 2026 the practical choices:
- OpenAI text-embedding-3-small — cheap (~$0.02 per million tokens), 1,536 dimensions, decent quality, English-strong. Good default.
- OpenAI text-embedding-3-large — better quality, 3,072 dimensions, ~3× the cost. Use when accuracy matters.
- Voyage AI voyage-3 / voyage-3-large — frontier-level quality, slightly more expensive than OpenAI, often best for retrieval.
- Cohere embed-english-v3 / embed-multilingual-v3 — strong multilingual; pick the multilingual variant for non-English content.
- BGE (Beijing Academy of Artificial Intelligence) BGE-M3 / BGE-large — open-weight, free to self-host, top-tier quality. Best multilingual options for Chinese.
- gte-large, e5-large — open-weight alternatives, lighter than BGE.
For Chinese-heavy content specifically: BGE-M3 and Cohere multilingual v3 outperform OpenAI noticeably. Don't blindly default to OpenAI for non-English.
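If you go the open-weight route, a typical way to run BGE-M3 locally is through the sentence-transformers library. A sketch, where the Hugging Face model id BAAI/bge-m3 is the usual one (adjust for other BGE variants):

```python
# Sketch: self-hosted embeddings with an open-weight model via sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")
vectors = model.encode(
    ["How do I reset my password?", "我忘记了登录密码"],
    normalize_embeddings=True,  # unit-length vectors, so dot product == cosine similarity
)
print(vectors.shape)  # one row per input sentence
```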
How embeddings get stored and searched
A vector with 1,536 dimensions stored as float32 takes ~6KB (1,536 × 4 bytes). With a million chunks, you have ~6GB of vectors plus overhead. Brute-forcing similarity (O(n) per query) gets too slow above ~100K vectors.
This is what vector databases solve: indexing structures (HNSW, IVF) that find approximate nearest neighbors in roughly O(log n). In 2026 your options:
- pgvector — Postgres extension, works in regular Postgres including Supabase. Best default for most apps because you keep your data in one DB.
- Pinecone — managed, scales easily, slightly pricier. Pick when you need very high throughput or zero ops.
- Qdrant — open-source, Rust, good performance, can self-host or use cloud.
- Weaviate, Chroma, Milvus — others in the same space.
- DuckDB / SQLite vector extensions — for small data (<10M rows).
For 90% of apps starting out, pgvector is the right answer.
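A rough sketch of pgvector usage from Python, via psycopg and the pgvector helper package. The connection string, table name, and column size are placeholders, and embed() is the helper from earlier; adapt all of it to your schema.

```python
# Sketch: store and query embeddings in Postgres with pgvector.
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("postgresql://localhost/mydb", autocommit=True)
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)  # lets psycopg send/receive numpy arrays as pgvector values

conn.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id bigserial PRIMARY KEY,
        content text,
        embedding vector(1536)
    )
""")

# Insert a chunk along with its embedding.
text = "To reset your password, open Settings and choose 'Forgot password'."
conn.execute("INSERT INTO chunks (content, embedding) VALUES (%s, %s)", (text, embed(text)))

# <=> is pgvector's cosine-distance operator; smaller means more similar.
query_vec = embed("I forgot my login")
rows = conn.execute(
    "SELECT content FROM chunks ORDER BY embedding <=> %s LIMIT 5",
    (query_vec,),
).fetchall()
```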
Common mistakes
Embedding the wrong unit. Embedding entire 50-page documents gives you mush — the meaning is too dense to compress. Chunk first (250-500 tokens per chunk usually), embed each chunk. Retrieval is per-chunk, not per-document.
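A minimal token-based chunker using tiktoken; 400 tokens with a 50-token overlap is a reasonable starting point, not a rule:

```python
# Sketch: split text into overlapping token windows before embedding.
import tiktoken

def chunk_text(text: str, chunk_tokens: int = 400, overlap: int = 50) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + chunk_tokens]))
    return chunks
```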
Mixing embedding models. All embeddings in your store must come from the same model. Switching models requires re-embedding everything. Plan for this.
Forgetting to normalize. Many similarity metrics assume unit-length vectors. Most modern APIs return normalized embeddings; if not, normalize before storing.
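Normalization is one line; a sketch with numpy:

```python
# Sketch: L2-normalize an embedding before storing it,
# so dot product equals cosine similarity.
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```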
Treating similarity score as accuracy. A 0.85 cosine similarity doesn't mean "85% relevant." Set thresholds empirically by reviewing results, not by gut.
When NOT to use embeddings
- Exact-match queries. "Find rows where product_id = 12345" should be a SQL WHERE, not a vector search. Embeddings are for fuzzy semantic matching.
- Tiny corpora. If you have 50 documents, just put them all in a context window and let the LLM read them. Embeddings have a setup cost (chunking, embedding, indexing) that doesn't pay off below ~1,000 chunks.
- Highly structured data. SQL or JSON queries beat embeddings on data with clear schemas (financial records, inventory, user metadata).
- Hybrid keyword + semantic. Pure vector search can miss exact-keyword matches that matter (product codes, names). Use hybrid search (BM25 + vector) for best results.
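One common way to combine the two rankings is reciprocal rank fusion (RRF); a sketch, where the input lists of document ids and the k constant are conventional placeholders:

```python
# Sketch: merge a BM25 ranking and a vector ranking with reciprocal rank fusion.
# Each input is a list of document ids, best first; k=60 is the usual constant.
def rrf_merge(bm25_ranked: list[str], vector_ranked: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (bm25_ranked, vector_ranked):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```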
Further reading
- What is RAG (Retrieval-Augmented Generation)
- What is a vector database, and do you need one
- How to pick a vector database (Pinecone vs pgvector vs Qdrant)
- Hybrid search (BM25 + vector) for RAG systems
- Build a personal RAG over your notes in an afternoon