
What is a Large Language Model (LLM)? A plain-English explainer

An LLM doesn't think — it predicts the next word. That single fact explains both why ChatGPT feels magical and why it confidently makes things up.

"Large language model" is the technical name for the thing inside ChatGPT, Claude, and Gemini. The phrase is intimidating, but the underlying mechanism is genuinely simple — and understanding it changes how you use these tools.

The one-sentence version

An LLM is a statistical model trained on enormous amounts of text. Given any string of text as input, it predicts the next token (roughly: the next word or chunk of a word). That's it. Everything else — the long answers, the code, the stories, the apparent reasoning — emerges from looping that single prediction step thousands of times.

When you type "explain quantum entanglement to a 10-year-old" into Claude, the model produces a probability distribution over every possible next token. It picks one (with some randomness controlled by the temperature parameter), appends it, and runs the same prediction again on the now-longer text. Repeat until it predicts an end-of-message token. The whole answer is just one-token-at-a-time autocomplete, very, very fast.
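
Here's that loop as a toy sketch, to make the mechanics concrete. Everything about `fake_logits` below is made up for illustration; a real model computes those scores with billions of parameters, but the sampling loop wrapped around it looks essentially like this.

```python
import math
import random

# Toy vocabulary. Real models use a vocabulary of ~100,000 tokens.
VOCAB = ["The", " cat", " sat", " on", " the", " mat", ".", "<end>"]

def fake_logits(context):
    # Hypothetical stand-in: one raw score per vocabulary entry.
    # A real LLM derives these scores from the entire context.
    return [random.uniform(0, 5) for _ in VOCAB]

def sample_next_token(context, temperature=0.8):
    logits = fake_logits(context)
    # Temperature rescales the scores: lower = more deterministic.
    scaled = [score / temperature for score in logits]
    total = sum(math.exp(s) for s in scaled)
    probs = [math.exp(s) / total for s in scaled]   # softmax
    return random.choices(VOCAB, weights=probs, k=1)[0]

def generate(prompt, max_tokens=20):
    text = prompt
    for _ in range(max_tokens):
        token = sample_next_token(text)
        if token == "<end>":    # the model predicts it's done
            break
        text += token           # append, then predict again
    return text

print(generate("The"))
```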

This sounds too simple to produce coherent paragraphs of text. The trick is that the model has been trained on enough text — trillions of words from the internet, books, code, scientific papers — that the statistical structure of language contains a lot of what we'd call knowledge.

Why this explains so much

Once you internalize "it predicts the next token," several confusing behaviors make sense.

Why it hallucinates. The model doesn't know what's true. It knows what sounds like the kind of thing that comes next given the context. If you ask about a real but obscure law, the model produces a citation that sounds like a real legal citation. Sometimes the citation exists; sometimes it doesn't. The model can't tell.

Why prompts matter so much. A good prompt shapes the probability distribution toward better next tokens. "Write Python" and "Write Python 3.12 using async/await with type hints, in the style of FastAPI's official docs" steer the model into very different parts of its training data.
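
To see the difference in practice, here's a minimal sketch using Anthropic's Python SDK (the model name is one example; substitute whichever current model you have access to). Both calls hit the same model; only the prompt, and therefore the distribution it induces, changes.

```python
import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY from the env

client = anthropic.Anthropic()

def ask(prompt: str) -> str:
    message = client.messages.create(
        model="claude-sonnet-4-20250514",  # example model name
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text

vague = ask("Write Python")
specific = ask(
    "Write Python 3.12 using async/await with type hints, "
    "in the style of FastAPI's official docs"
)
```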

Why it's bad at counting characters and doing exact math. The model sees the world in tokens, not characters. Ask GPT-4 "how many rs are in strawberry" and it often gets it wrong, because it sees strawberry as one or two opaque token chunks rather than a sequence of eleven letters it can scan.
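
You can inspect the token view directly with OpenAI's tiktoken library. The exact split varies by tokenizer, but the point holds either way: the model receives a couple of opaque chunks, not individual letters.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")    # GPT-4-era tokenizer
token_ids = enc.encode("strawberry")
print(token_ids)                              # a short list of integer IDs
print([enc.decode([t]) for t in token_ids])   # the chunks the model actually sees
```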

Why context matters. The model only sees what's in its context window. If your conversation has gone past the limit and earlier messages got dropped, it's not being forgetful — those tokens are literally gone.
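
Chat applications handle this with some variant of the sketch below: keep the newest messages that fit the token budget and silently drop the rest. The helper names here are mine, not any particular product's code.

```python
def trim_to_context(messages, max_tokens, count_tokens):
    """Keep the newest messages that fit the window; older ones vanish."""
    kept, used = [], 0
    for msg in reversed(messages):       # walk from newest to oldest
        n = count_tokens(msg)
        if used + n > max_tokens:
            break                        # everything older is dropped
        kept.append(msg)
        used += n
    return list(reversed(kept))          # restore chronological order

# Crude stand-in counter; real apps use the model's actual tokenizer.
naive_count = lambda msg: len(msg.split())
history = ["first message ...", "second message ...", "newest message ..."]
print(trim_to_context(history, max_tokens=6, count_tokens=naive_count))
```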

How LLMs are actually built

The pipeline has three stages.

Pre-training is where the model reads the internet. Engineers gather a huge dataset (Common Crawl, books, GitHub, scientific corpora), and the model trains by repeatedly predicting the next token in that text, adjusting its weights whenever it's wrong. This stage costs millions of dollars in GPU time and produces a "base model" that knows a lot but is awful at following instructions — it'll happily continue your question with a longer question instead of answering it.
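
The training objective itself fits in a few lines. Here's a minimal PyTorch sketch, assuming `model` is any network that maps token IDs to next-token scores; pre-training is essentially this loss, minimized over trillions of tokens.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    # token_ids: (batch, sequence_length) integer tensor of real text
    inputs = token_ids[:, :-1]    # everything except the final token
    targets = token_ids[:, 1:]    # the same sequence shifted left by one
    logits = model(inputs)        # (batch, seq-1, vocab_size) scores
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # one prediction per position
        targets.reshape(-1),                  # the token that actually came next
    )
```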

Post-training turns the base model into something useful. The team feeds it examples of good question-and-answer pairs (supervised fine-tuning), then uses techniques like RLHF (Reinforcement Learning from Human Feedback) or DPO to train it to prefer helpful, harmless, honest responses over bad ones.
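
The data for this stage is just examples. The shapes below are illustrative (formats differ between labs and libraries), but they show the idea: supervised fine-tuning provides a good response to imitate, while preference methods like DPO provide a chosen and a rejected response to the same prompt.

```python
# Supervised fine-tuning: show the model a prompt and the response we want.
sft_example = {
    "messages": [
        {"role": "user", "content": "What causes tides?"},
        {"role": "assistant",
         "content": "Mostly the Moon's gravity, with a smaller contribution from the Sun."},
    ]
}

# DPO-style preference pair: same prompt, one good and one bad answer.
dpo_example = {
    "prompt": "What causes tides?",
    "chosen": "Mostly the Moon's gravity, with a smaller solar effect.",
    "rejected": "Tides are caused by wind blowing across the ocean.",  # fluent but wrong
}
```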

Inference is when you actually use it. The trained model sits on a GPU server, takes your input, and generates output. The cost per query is far lower than the cost of training, but multiplied across millions of users it's still substantial — and because bigger models burn more compute per token, GPT-4 costs more per token than GPT-3.5.
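
The per-query economics are easy to estimate. The prices below are placeholders (check your provider's current pricing); the structure, where output tokens cost several times more than input tokens, is typical.

```python
# Hypothetical prices in USD per million tokens -- NOT real quotes.
PRICE_PER_1M_INPUT = 3.00
PRICE_PER_1M_OUTPUT = 15.00   # output typically costs several times more

def query_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_PER_1M_INPUT
            + output_tokens * PRICE_PER_1M_OUTPUT) / 1_000_000

# A 2,000-token prompt with an 800-token answer: about two cents.
print(f"${query_cost(2_000, 800):.4f}")
```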

The frontier vs the open-weight layer

In 2026 there are two LLM tiers worth distinguishing.

Frontier closed models — Claude (Anthropic), GPT-5 (OpenAI), Gemini (Google) — are the most capable. You can only access them via an API or a chat product; the labs don't share the weights. They cost real money per query.

Open-weight models — Llama (Meta), DeepSeek V3 / R1, Qwen (Alibaba), Mistral — release the model weights. You can download a 70-billion-parameter model, run it on your own GPU box, fine-tune it on your data, and pay no per-query fee. The best open models are roughly 6-12 months behind frontier closed models in raw capability, but for many tasks the gap is small enough that price + privacy + control wins.
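
Running an open-weight model locally can be a few lines with Hugging Face's transformers library. The model name below is one example of an open-weight release; any chat model on the Hub works the same way, hardware permitting.

```python
from transformers import pipeline  # pip install transformers accelerate

# Example open-weight model; swap in whatever fits your GPU.
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",   # spread layers across available GPUs
)
result = generator("Explain tokenization in one paragraph.", max_new_tokens=200)
print(result[0]["generated_text"])
```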

What LLMs are NOT good at

Three honest weaknesses to remember.

  • Real-time information. Without tool use, an LLM only knows what was in its training data. Ask Claude about today's stock price and it'll either refuse or guess.
  • Exact arithmetic and counting. They're statistical text engines, not calculators. For anything math-critical, give the model a tool (a Python interpreter) or verify the answer separately; a sketch of the tool pattern follows this list.
  • Long, perfectly-faithful summaries. Models drift, especially over long contexts. If you summarize a 100-page contract, expect to verify the high-stakes clauses by hand.
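
For the arithmetic weakness, the standard fix is to let the model write an expression but have real code evaluate it. Here's a minimal sketch of that "calculator tool" pattern; the helper is mine, and production tool use goes through the provider's function-calling APIs.

```python
import ast
import operator

# Allow only basic arithmetic operators.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr: str) -> float:
    """Evaluate a plain arithmetic expression; reject anything else."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("not plain arithmetic")
    return walk(ast.parse(expr, mode="eval"))

# The model proposes the expression; Python does the actual math.
print(safe_eval("(1234 * 5678) / 3"))
```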

A reasonable mental model: an LLM is a brilliant, eloquent intern who has read everything but remembers nothing exactly, never admits when they don't know, and works for $20/month. Use it accordingly.

Further reading

  • What is a token in LLM-speak
  • What is a prompt, and why prompt quality matters
  • What is a context window
  • Why LLMs hallucinate, and what to do about it
  • Open-source LLM vs frontier API: which one for which task

Last updated: 2026-04-29
