If you've been around AI conversations for any length of time, you've heard a parade of acronyms and made-up words. This is the cheat sheet — 30 terms, one or two sentences each, no academic detours. Bookmark it; come back when something pops up in a meeting and you wish you knew it cold.
Models and architecture
- LLM (Large Language Model) — A statistical model trained on huge amounts of text to predict the next token. ChatGPT, Claude, and Gemini are all LLMs. Output is just one-token-at-a-time autocomplete at very high speed.
- Token — The chunks an LLM actually sees. Roughly 0.75 English words per token, 1-2 tokens per Chinese character. The unit of pricing and context.
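The 0.75-words-per-token rule of thumb is enough for back-of-envelope cost estimates. A minimal sketch, assuming only that rule (`estimate_tokens` is a made-up helper, not a real tokenizer like BPE):

```python
def estimate_tokens(text: str) -> int:
    """Rough token count for English text using the ~0.75-words-per-token
    rule of thumb. Real tokenizers will differ, sometimes substantially."""
    words = len(text.split())
    return round(words / 0.75)

# 9 words -> roughly 12 tokens
print(estimate_tokens("The quick brown fox jumps over the lazy dog"))  # 12
```

Multiply by the provider's per-token price and you have a cost estimate before sending anything.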
- Context window — Max tokens the model can process per request: 200K (Claude), 1M+ (long-context Gemini). Includes prompt + history + reserved output.
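Because the window covers prompt, history, and reserved output together, a budget check is simple arithmetic. A hedged sketch (`fits_context` and the 200K default are illustrative, not any real SDK's API):

```python
def fits_context(prompt_tokens: int, history_tokens: int,
                 max_output_tokens: int, window: int = 200_000) -> bool:
    """True if prompt + history + reserved output fit inside the window."""
    return prompt_tokens + history_tokens + max_output_tokens <= window

print(fits_context(5_000, 180_000, 8_000))   # 193,000 <= 200,000 -> True
print(fits_context(5_000, 195_000, 8_000))   # 208,000 >  200,000 -> False
```

When the check fails, apps typically truncate or summarize the oldest history.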
- Parameters — The numbers inside the model. "7B" means 7 billion parameters. More usually means more capable, but slower and more expensive.
- Pre-training — Initial massive training on internet text. Produces a base model that knows things but doesn't follow instructions.
- Post-training (RLHF, DPO) — Refining the base model with human feedback so it's helpful, harmless, and honest. Where ChatGPT-the-product comes from.
- Multimodal — Model that handles text, images, audio, sometimes video natively. Most 2026 frontier models are multimodal by default.
- Reasoning model — Model trained to spend extra compute "thinking" before answering. o3, DeepSeek R1, Claude extended thinking. Better at math/code, slower, more expensive.
- MoE (Mixture of Experts) — Architecture where only a subset of parameters activates per query. Lets a 200B model run as fast as a 30B one. Mixtral, DeepSeek V3 use this.
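The routing idea fits in a few lines: a gate scores the experts for each token and only the top-k run. Everything here (`route`, the expert names, the scores) is a toy illustration, not a real MoE implementation:

```python
def route(expert_scores: dict[str, float], k: int = 2) -> list[str]:
    """Pick the top-k experts for one token; only those experts' parameters
    activate, the rest stay idle, which is where the speedup comes from."""
    return sorted(expert_scores, key=expert_scores.get, reverse=True)[:k]

scores = {"expert_a": 0.1, "expert_b": 0.7, "expert_c": 0.2}
print(route(scores))  # ['expert_b', 'expert_c']
```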
- Open weights — Model whose trained weights you can download and run yourself. Llama, Qwen, DeepSeek, Mistral. Often called "open source" loosely.
Working with models
- Prompt — Whatever text you send to the LLM. Better prompts → better answers.
- System prompt — The product-set prompt that frames every conversation. Shapes tone and rules.
- Temperature — Sampling randomness. 0 = (near-)deterministic; 1 = more varied and creative. Most production apps use 0.0-0.7.
- Top-p / top-k — Other sampling controls. Top-p restricts sampling to the most-probable tokens whose cumulative probability is p; top-k keeps only the k most probable tokens.
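Top-p (nucleus) sampling can be sketched in pure Python. `top_p_filter` below is an illustrative helper, not any real library's API:

```python
def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the most-probable tokens until their cumulative probability
    reaches p, then renormalize so the survivors sum to 1."""
    kept, total = {}, 0.0
    for tok, pr in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[tok] = pr
        total += pr
        if total >= p:
            break
    z = sum(kept.values())
    return {tok: pr / z for tok, pr in kept.items()}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "xylophone": 0.05}
print(top_p_filter(probs, 0.9))  # the long-tail "xylophone" is dropped
```

The model then samples only from the filtered set, which cuts off low-probability nonsense without forcing greedy decoding.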
- Streaming — Outputting tokens as they're generated, not waiting for the whole response. Critical for chat UI.
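In Python, streaming maps naturally onto a generator: the consumer renders each chunk as it arrives. A toy sketch (`stream_tokens` just splits a finished string; a real client yields tokens off the network):

```python
def stream_tokens(text: str):
    """Toy stream: yield one whitespace-delimited token at a time."""
    for tok in text.split():
        yield tok + " "

for chunk in stream_tokens("Tokens appear one by one"):
    print(chunk, end="", flush=True)  # a chat UI would append to the screen
print()
```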
- Tool use / function calling — Model can call functions you define (web search, DB query, send email). Foundation of agents.
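The mechanics: the model emits a structured call, and your code executes it and returns the result. A minimal dispatch sketch, where `get_weather` and the JSON shape are invented for illustration (real providers each have their own schema):

```python
import json

def get_weather(city: str) -> str:
    """Stand-in for a real weather API call."""
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

def dispatch(model_output: str) -> str:
    """Parse the model's tool call (as JSON) and run the matching function."""
    call = json.loads(model_output)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

print(dispatch('{"name": "get_weather", "arguments": {"city": "Oslo"}}'))
# Sunny in Oslo
```

The result is sent back to the model as a new message, and the model writes the final answer from it.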
Retrieval and memory
- RAG (Retrieval-Augmented Generation) — Fetch relevant docs from your store, paste into the prompt, let the model answer. Standard for "AI that knows your data."
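The "paste into the prompt" step is plain string assembly. A sketch (`build_rag_prompt` is a hypothetical helper; the retrieval step itself would use embeddings):

```python
def build_rag_prompt(question: str, docs: list[str]) -> str:
    """Prepend retrieved chunks to the question so the model answers
    from your data instead of its training memory."""
    context = "\n\n".join(f"[doc {i + 1}] {d}" for i, d in enumerate(docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_rag_prompt("What is our refund window?",
                       ["Refunds are accepted within 30 days of purchase."]))
```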
- Embedding — A vector representing the meaning of text. Similar meanings = similar vectors. Powers semantic search and RAG retrieval.
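"Similar meanings = similar vectors" is usually measured with cosine similarity. A stdlib-only sketch on toy 2-D vectors (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

Semantic search is then "embed the query, return the stored vectors with the highest cosine score."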
- Vector database — DB optimized for storing and searching embeddings. pgvector, Pinecone, Qdrant, Weaviate.
- Chunking — Splitting documents into smaller pieces (typically 250-500 tokens) for embedding. Determines retrieval quality.
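A minimal chunker, splitting on words with a small overlap so information at chunk boundaries isn't lost (`chunk_words` is illustrative; production chunkers split on tokens and respect document structure):

```python
def chunk_words(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-based chunks; word count is a
    rough proxy for the 250-500 token targets mentioned above."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

# 10 words, chunks of 4 with overlap 1 -> boundaries share one word
print(chunk_words("a b c d e f g h i j", size=4, overlap=1))
# ['a b c d', 'd e f g', 'g h i j']
```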
- Reranker — A model that re-orders retrieved chunks by true relevance. Cohere Rerank, BGE Reranker. Big quality win in RAG.
Agents
- Agent — An LLM that takes actions in a loop: decide, execute, observe, decide again. Cursor, Claude Code, Operator are agents.
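The decide → execute → observe loop fits in a dozen lines. Everything below (`call_model`, `search`, the message shapes) is a stand-in sketch, not a real model or tool:

```python
def call_model(history: list[dict]) -> dict:
    """Stand-in for an LLM call: uses a tool once, then gives a final answer."""
    if not any(m["role"] == "tool" for m in history):
        return {"action": "tool", "name": "search", "input": "weather Oslo"}
    return {"action": "final", "text": "It is sunny in Oslo."}

def search(query: str) -> str:
    """Stand-in for a real search tool."""
    return "Oslo: sunny"

def run_agent(task: str, max_steps: int = 5) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = call_model(history)                 # decide
        if step["action"] == "final":
            return step["text"]
        result = search(step["input"])             # execute
        history.append({"role": "tool", "content": result})  # observe
    return "step limit reached"

print(run_agent("What's the weather in Oslo?"))  # It is sunny in Oslo.
```

The `max_steps` cap matters in practice: without it, a confused agent can loop forever.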
- MCP (Model Context Protocol) — Open standard for connecting any AI client to any tool. The USB-C of AI integrations.
- Computer use — A mode, pioneered by Anthropic, where the model literally controls a screen (sees pixels, clicks, types). Same idea as OpenAI's Operator.
Customization
- Fine-tuning — Continuing to train a model on your data so it learns your style/format/task. Best for tone and structure, not for adding facts.
- LoRA (Low-Rank Adaptation) — Cheap fine-tuning: train tiny adapter matrices instead of updating the whole model. The default modern approach.
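Why it's cheap: for a d_in × d_out weight matrix, LoRA trains only two small matrices A (d_in × r) and B (r × d_out). The arithmetic below uses illustrative sizes, not any particular model's:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one LoRA adapter: A (d_in x r) + B (r x d_out)."""
    return d_in * rank + rank * d_out

full = 4096 * 4096                   # full fine-tune of one weight matrix
lora = lora_params(4096, 4096, 8)    # rank-8 adapter on the same matrix
print(full, lora, full // lora)      # 16777216 65536 256 -> ~256x fewer
```

At inference, the adapter's product B·A is simply added to the frozen weight, so there's no extra latency once merged.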
- Quantization — Compressing model weights from 16-bit to 8-bit, 4-bit, or lower. Smaller, faster, slight quality loss.
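Symmetric int8 quantization in miniature: map floats onto [-127, 127] with one scale factor. `quantize_int8` is a toy sketch; real schemes quantize per-channel or per-group to reduce the rounding loss:

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], list[float]]:
    """Quantize floats to int8 range and dequantize back; the gap between
    input and dequantized output is the 'slight quality loss'."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    deq = [v * scale for v in q]
    return q, deq

q, deq = quantize_int8([0.5, -1.0, 0.25])
print(q)  # [64, -127, 32] -- each stored in 1 byte instead of 2 or 4
```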
Risks and failures
- Hallucination — Model produces confident, plausible-sounding falsehoods. Inherent to the architecture; mitigations include RAG, verification, and citations.
- Prompt injection — User input that tries to override the model's instructions, often to leak data or bypass guards. Hard to fully defend against.
- Jailbreak — Tricks that get a model to break its safety policies ("pretend you have no restrictions"). Closely related to prompt injection.
Bonus: words you'll hear people misuse
- "AI agent" — sometimes just means "chatbot," sometimes means real tool-using loops. Ask for specifics.
- "Powered by AI" — usually means "calls OpenAI's API once." Marketing.
- "Trained on your data" — usually means RAG (paste at runtime), not actual fine-tuning. Worth clarifying.
- "Reasoning" — sometimes a real reasoning-model behavior; often just chain-of-thought prompting on a normal model.
- "Open source" (model) — usually means open weights, not full open source.
When NOT to memorize this list
If you're just using ChatGPT to write emails, you don't need to know what an embedding is. This list matters when you start building things, evaluating tools, or hiring/being hired in AI roles. For end users, the only words that matter are prompt and context window.
Further reading
- What is a Large Language Model (LLM)
- What is RAG (Retrieval-Augmented Generation)
- What is an embedding
- What is an AI agent
- What is MCP (Model Context Protocol)