
What are reasoning models? o3, DeepSeek R1, and the 'think before you speak' shift

Reasoning models pause and 'think' before answering — sometimes for minutes. They're better at math and code, worse for casual chat, and they cost more. Use them where it counts.

Reasoning models — OpenAI's o3 / o4 series, DeepSeek R1 / R2, Claude's extended-thinking modes, Gemini's thinking variants — are LLMs trained to spend more compute before producing a final answer. Instead of immediately generating output, they generate a chain of internal "thinking" tokens (often hidden from the user), then produce the response. The reasoning era started in late 2024 with OpenAI's o1 and is now a category of its own.

What "thinking" actually means

A standard LLM does roughly the same compute per token regardless of question difficulty. "What's 2+2?" and "Prove the Riemann hypothesis" both get one forward pass per generated token. That's why standard models are bad at hard problems — they have no way to try harder.

Reasoning models add a step. When given a hard problem, the model first emits a long sequence of intermediate thoughts: try an approach, check, backtrack, try another, verify, conclude. Only then does it produce the final answer. The intermediate thoughts are typically:

  • Hidden in chat products (you see a "Thinking…" indicator)
  • Returned in some APIs as a separate field, so you can inspect them
  • Charged for as output tokens — they're real generation, just labeled differently

A hard math or code problem might use 20,000 thinking tokens before producing a 200-token answer. That's roughly 100× the generated tokens, and therefore roughly 100× the compute, that a non-reasoning model spends on the same query.
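You can see both points (thinking as a separate field, billed as output) directly in APIs that expose it. Here's a hedged sketch using Anthropic's extended-thinking parameter; the model ID and token budgets are assumptions, not recommendations, so check current docs:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",   # assumed model ID
    max_tokens=16_000,                  # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8_000},
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)

# The thinking comes back as separate content blocks you can inspect...
for block in response.content:
    if block.type == "thinking":
        print("THINKING:", block.thinking[:200], "...")
    elif block.type == "text":
        print("ANSWER:", block.text)

# ...but it's billed like any other generation: output_tokens includes it.
print("output tokens (thinking + answer):", response.usage.output_tokens)
```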

What reasoning models are actually better at

The gains aren't uniform. Reasoning helps the most on:

  • Math. Olympiad-style problems, complex algebra, proofs.
  • Coding (the hard parts). Algorithm design, debugging tricky logic bugs, code review for subtle issues.
  • Multi-step planning. Breaking goals into ordered substeps, especially under constraints.
  • Logic puzzles, scientific reasoning. Anything where you need to consider cases.
  • Strategic decisions. Trade-off analyses with multiple variables.

They're roughly the same as standard models on:

  • Casual conversation. No deep thought needed.
  • Creative writing. Reasoning doesn't help (and may hurt) the wandering quality of good prose.
  • Translation, summarization. Mostly pattern-matching tasks.
  • Simple coding ("write me a CRUD endpoint"). Standard models are already great at this.

What they're worse at

Three honest weaknesses.

Latency. A non-reasoning model answers in 2-5 seconds. A reasoning model on a hard problem can take 30 seconds to 5 minutes. Useless for chat UI; mandatory for some agents and analysis pipelines.

Cost. Reasoning tokens are output tokens, and they multiply. A query that would have cost $0.01 on a standard model can cost $0.30+ on a reasoning model. For high-volume apps, you can't afford to use reasoning everywhere.
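The multiplier is almost entirely thinking tokens. A back-of-envelope sketch with made-up prices (real rates vary by provider; input tokens are ignored for simplicity):

```python
# All numbers below are illustrative assumptions, not real price sheets.
STANDARD_PRICE = 1.00    # $ per 1M output tokens, cheap standard model
REASONING_PRICE = 15.00  # $ per 1M output tokens, premium reasoning model

answer_tokens = 200
thinking_tokens = 20_000  # billed as output, even though the user never sees it

cost_standard = answer_tokens / 1e6 * STANDARD_PRICE
cost_reasoning = (answer_tokens + thinking_tokens) / 1e6 * REASONING_PRICE

print(f"standard query:   ${cost_standard:.4f}")   # ~$0.0002
print(f"reasoning query:  ${cost_reasoning:.2f}")  # ~$0.30
print(f"1M queries/month: ${cost_reasoning * 1_000_000:,.0f}")  # ~$303,000
```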

Style and warmth. Reasoning models tend to produce more terse, sometimes mechanical output. They're great at correctness, less great at sounding like a human.

When to actually reach for reasoning models

A practical decision rule:

  • Easy task, fast UI? Standard model (Claude Sonnet, GPT-5 Standard, Gemini Flash).
  • Hard task, batch processing OK? Reasoning model (o3, DeepSeek R1, Claude with extended thinking).
  • Easy task, but verification critical? Standard model + an eval step (or a second reasoning-model pass on outputs).
  • You don't know how hard the task is? Try a standard model first. Escalate to reasoning if quality is low.
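That last row is worth making concrete. A minimal sketch of try-cheap-first escalation, where call_model() and quality_check() are hypothetical stubs you'd wire to your own stack, not any real SDK:

```python
def call_model(model: str, prompt: str) -> str:
    """Hypothetical stub: send prompt to the named model, return its text."""
    raise NotImplementedError

def quality_check(prompt: str, answer: str) -> bool:
    """Hypothetical stub: cheap heuristic or LLM-judge pass on the answer."""
    raise NotImplementedError

def answer_with_escalation(prompt: str) -> str:
    draft = call_model("fast-standard-model", prompt)  # cheap first attempt
    if quality_check(prompt, draft):
        return draft
    # Quality looked low: pay for reasoning on this query only.
    return call_model("reasoning-model", prompt)
```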

One pattern that's caught on: router-based architectures. A cheap router model classifies the query, then forwards to either a fast standard model (for easy queries) or a reasoning model (for hard ones). This gets you reasoning quality where it matters and standard speed/cost everywhere else.
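In code, the shape is the same hypothetical call_model() stub with a classification step in front (model names are placeholders):

```python
def call_model(model: str, prompt: str) -> str:
    """Hypothetical stub: send prompt to the named model, return its text."""
    raise NotImplementedError

ROUTER_PROMPT = (
    "Classify this query as EASY or HARD for an AI assistant. "
    "Reply with exactly one word.\n\nQuery: {query}"
)

def route(query: str) -> str:
    label = call_model("cheap-router-model", ROUTER_PROMPT.format(query=query))
    if "HARD" in label.upper():
        return call_model("reasoning-model", query)   # slow, costly, careful
    return call_model("fast-standard-model", query)   # fast and cheap
```

Note that the router itself can be tiny and cheap: deciding "easy vs. hard" is a much easier problem than actually answering the query.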

Real reasoning models in 2026

The lineup that actually ships:

  • OpenAI o3 / o4 series — the original reasoning model line. Strong at code, science. Premium pricing.
  • DeepSeek R1 / R2 — open-weight (yes, full reasoning model, weights public), competitive quality, dramatically cheaper. Game-changer in 2025.
  • Claude Sonnet / Opus with extended thinking — Anthropic's approach: same model, dial up thinking time as a parameter.
  • Gemini 2.5 Pro thinking — Google's variant with strong long-context reasoning.
  • Qwen QwQ, others — open-weight Chinese reasoning models, especially strong for Chinese-language reasoning.

The DeepSeek R1 release in early 2025 was significant because it proved frontier-level reasoning could be done with open weights, putting pricing pressure on closed labs and opening reasoning to anyone with GPUs.

Common misuses

Three patterns that waste money and time.

Using reasoning for chat. Burning $0.30 to answer "hi how are you" is silly. Most chat traffic should hit a standard model.

Using reasoning when you don't need correctness. Marketing copy, casual emails, brainstorming — reasoning's strengths don't apply. Use a standard model.

Not budgeting the latency. Building a UI that calls a reasoning model and shows a spinner for 90 seconds is bad UX. If you must use reasoning live, communicate the wait clearly ("Analyzing... this may take up to 2 minutes") and consider streaming partial thinking.
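If you do have to run a reasoning call in a live flow, surface the wait rather than hiding it. A minimal sketch of that pattern, where run_reasoning_query() is a hypothetical blocking call to your reasoning model:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_reasoning_query(prompt: str) -> str:
    """Hypothetical stub: blocking call to a reasoning model; may take minutes."""
    raise NotImplementedError

def answer_with_progress(prompt: str) -> str:
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(run_reasoning_query, prompt)
        waited = 0
        while not future.done():
            # Tell the user what's happening instead of showing a bare spinner.
            print(f"Analyzing... this may take up to 2 minutes ({waited}s elapsed)")
            time.sleep(5)
            waited += 5
        return future.result()
```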

When NOT to use reasoning models

  • Real-time chat. Latency too high.
  • Bulk batch jobs where cost matters. Standard models are 5-10× cheaper.
  • Creative writing where voice matters. Output gets stiff.
  • Tasks where standard models already work 95%+ of the time. Marginal gain not worth marginal cost.

Further reading

  • What is a Large Language Model (LLM)
  • How to pick the right LLM for your use case
  • LLM routing: route easy queries to cheap models
  • Open-source LLM vs frontier API: which one for which task
  • How to cut your LLM API bill in half (without dropping quality)

Last updated: 2026-04-29
