When DeepSeek V3 launched in late 2024 with 671 billion parameters, the headlines focused on the size. What was actually impressive was the cost: per token, it runs with less compute than a 70B dense model. The trick is Mixture of Experts. By 2026, MoE is the dominant architecture for frontier open-weights models — and increasingly the closed ones too. Here's what's going on.
The dense baseline
A standard 'dense' transformer like the original Llama or GPT-3 looks like a stack of layers. For every token you process, the model runs every parameter in every layer. If your model has 70B parameters, you're doing math involving all 70B for each token of input or output.
This is wasteful. Different parts of language and reasoning probably benefit from different specialized circuits, but a dense model uses everything for everything. It's like consulting all 70B 'employees' on every question, even questions only a few of them are good at.
What MoE changes
A Mixture of Experts model replaces the feed-forward network in each transformer layer with a set of 'experts' (anywhere from 8 to a few hundred of them) plus a small 'router' network. For each token, the router picks the top-K experts (often 2; designs with many small experts route to 8 or more) and only those experts process the token. The rest are dormant for that token.
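To make the routing concrete, here's a minimal sketch of an MoE feed-forward layer in PyTorch. It's illustrative only, not any particular model's implementation: real systems add load-balancing losses, capacity limits, shared experts, and expert parallelism across GPUs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy MoE layer: a router picks the top-k experts for each token."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # scores every expert for each token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model)
        scores = self.router(x)                               # (n_tokens, n_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)     # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)                  # normalize the kept scores
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e                   # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out                                            # all other experts stayed dormant
```

Every expert's weights exist in the layer, but each token only pays compute for the ones it was routed to.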
This means an MoE model has two parameter counts that matter:
- Total parameters (the headline number, e.g. DeepSeek V3's 671B). All these parameters exist in memory.
- Active parameters per token (e.g. DeepSeek V3's 37B). This is what actually does math for any given token.
The model has 671B parameters of capacity but only does ~37B parameters of work per token. So it's much faster and cheaper to run than a 671B dense model would be — closer in inference cost to a 37B dense model. But because different tokens use different experts, the model collectively benefits from all 671B of learned capacity.
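The cost claim is easy to sanity-check with the standard rule of thumb that a transformer forward pass costs roughly 2 FLOPs per active parameter per token (a rough approximation that ignores attention, routing, and memory-bandwidth effects):

```python
# Back-of-the-envelope compute per token: ~2 FLOPs per *active* parameter.
def flops_per_token(active_params: float) -> float:
    return 2 * active_params

dense_671b = flops_per_token(671e9)  # hypothetical 671B dense model
moe_v3     = flops_per_token(37e9)   # DeepSeek V3: 37B active out of 671B total
dense_70b  = flops_per_token(70e9)   # Llama-style 70B dense model

print(f"671B dense : {dense_671b:.1e} FLOPs/token")
print(f"DeepSeek V3: {moe_v3:.1e} FLOPs/token (~{dense_671b / moe_v3:.0f}x cheaper than 671B dense)")
print(f"70B dense  : {dense_70b:.1e} FLOPs/token")
```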
It's like having a large company with 100 specialist employees, but each customer's question only routes to the 2 most relevant ones. Capacity is total expertise; cost per question is just those 2.
Why this is suddenly everywhere
MoE has been around since the 1990s but only became practical at scale around 2020-2021 (GShard and the Switch Transformer from Google) and went mainstream in 2024. The reason it's everywhere now:
- Inference cost matters more than training cost at scale. If you're going to serve a model billions of times, saving 50% on each inference is worth a lot more than saving on training.
- VRAM is cheap at scale; compute is expensive. A big provider amortizes the weights sitting in memory across a large batch of concurrent requests, but pays compute for every token generated. MoE trades 'more parameters in memory' for 'less compute per token', which is the right side of the trade.
- Quality scales well. Empirically, an MoE with N total params and K active lands well above a dense model of size K (and closer to one of size N), while costing like a dense model of size K. Not quite a free lunch, but close.
Notable MoE models in 2025-2026:
- DeepSeek V3 (671B total / 37B active) — open weights, currently best-in-class for the price.
- DeepSeek R1 — reasoning model built on the same V3 MoE base.
- Mixtral 8x7B and 8x22B — Mistral's MoE family, smaller but very accessible to self-hosters.
- Llama 4 (Scout 109B / Maverick 400B) — Meta's first MoE flagship.
- Qwen 3 MoE variants — Alibaba's open-weights MoE.
- GPT-5 / Claude / Gemini — almost certainly MoE internally though labs don't fully disclose architecture.
What MoE feels like to use vs. dense
For 99% of users, MoE is invisible — it's the same chat interface, the same API. But there are downstream effects:
Inference cost is lower for the same quality, which is why DeepSeek V3 can be priced so aggressively (often 1/10th of GPT-5 for similar tasks).
Self-hosting is harder. A 671B-parameter MoE needs to fit in VRAM, which is expensive (typically multi-GPU H100 setups). The 'cheap to run' part only kicks in if you can afford the VRAM in the first place. Mixtral 8x7B is the sweet spot for hobbyists: ~47B total parameters (roughly 95GB at 16-bit, far less once quantized) with the per-token cost of a ~13B dense model.
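A quick way to estimate the weight-memory floor is parameters times bytes per parameter; the counts below are the ones quoted in this post, and activations, KV cache, and serving overhead all come on top:

```python
# Memory floor for holding the weights only. Every parameter must be resident,
# even though only the active ones do work for any given token.
def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
    return total_params * bytes_per_param / 1e9

print(weight_memory_gb(46.7e9, 2.0))  # Mixtral 8x7B at 16-bit -> ~93 GB
print(weight_memory_gb(46.7e9, 0.5))  # Mixtral 8x7B at 4-bit  -> ~23 GB
print(weight_memory_gb(671e9, 1.0))   # DeepSeek V3 at 8-bit   -> ~671 GB
```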
Latency for the first token can be slightly higher because routing adds overhead. Throughput once generating is excellent.
Quantization is trickier. Different experts can have different distributions, so naive 4-bit quantization sometimes hurts MoE more than dense. The community has developed MoE-specific quantization techniques.
Fine-tuning is more complex. You need to decide whether to update all experts, only some, or just the router. Most consumer fine-tuning frameworks (axolotl, unsloth) handle this now but it took a while.
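As a rough sketch of what those three choices look like in plain PyTorch, reusing the toy MoELayer from earlier (real frameworks mostly do this through LoRA adapters rather than toggling requires_grad by hand):

```python
# Freeze/unfreeze patterns for the three MoE fine-tuning choices.
# Sketch only, on the toy MoELayer above; production tools typically use LoRA.
def set_trainable(layer: MoELayer, mode: str) -> None:
    for p in layer.parameters():
        p.requires_grad = False                   # start fully frozen
    if mode == "router_only":                     # cheapest: only re-learn the dispatch
        for p in layer.router.parameters():
            p.requires_grad = True
    elif mode == "some_experts":                  # e.g. router plus experts 0 and 1
        for p in layer.router.parameters():
            p.requires_grad = True
        for e in (0, 1):
            for p in layer.experts[e].parameters():
                p.requires_grad = True
    elif mode == "all":                           # full fine-tune: everything trainable
        for p in layer.parameters():
            p.requires_grad = True
    else:
        raise ValueError(f"unknown mode: {mode}")
```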
Common misconceptions
'A 671B MoE is as good as a 671B dense model.' No. It's somewhere between a dense model of its active size and a dense model of its total size. Empirically MoE quality scales well but not 1:1 with total params.
'MoE is the same as ensemble.' No. An ensemble runs N separate models and combines outputs. MoE has a single model where different parts activate per token. The router is trained jointly with the experts, end-to-end.
'Each expert is good at a specific topic.' Mostly no. The experts don't cleanly specialize into 'math expert' / 'code expert' / 'French expert' — they specialize in much more abstract patterns the router learns to dispatch to. Sometimes there's interpretability in expert assignments but not always.
'MoE is just a way to fake parameter counts.' Yes and no. It's not 'fake' in that those parameters do contribute to model quality. But comparing a 671B MoE to a 671B dense model on parameter count is misleading — compare on quality benchmarks instead.
When it matters for users
If you're picking between models in 2026, MoE explains why some of the most capable open-weights options (DeepSeek V3, Llama 4 Maverick, Qwen 3 235B) are also among the cheapest per-token. For self-hosting, MoE shifts the bottleneck from compute to VRAM — plan accordingly.
If you're not training or self-hosting, MoE doesn't change how you build with the API. You still pick a model based on quality/cost trade-off; the underlying architecture is an implementation detail.
When NOT to overthink it
For application development, treat 'is it MoE' as a curiosity, not a decision factor. The right question is 'does this model perform well on my task at a cost I can absorb', not 'is the architecture MoE'. Some of the best models for specific use cases are dense (Mistral Small 3, Qwen 32B Instruct), some are MoE.
Further reading
- Quantization explained — interacts with MoE in interesting ways for self-hosters
- How to pick a self-host stack — practical realities of running MoE locally
- Open source vs frontier LLM — MoE is mostly why open-weights have closed the gap to frontier