Mixture of Experts (MoE)

A neural network architecture that splits the model into many specialized "expert" sub-networks and routes each input to only a few of them, giving huge parameter counts at a fraction of the compute cost.

Mixture of Experts (MoE) is a neural network architecture in which a layer contains many parallel sub-networks called "experts," plus a small "router" (or gating network) that decides which experts should handle each token. Instead of every parameter doing work on every input, only the top-k experts (often 2 out of 8, 16, or 64) are activated per token.

This matters because it decouples total parameter count from inference cost. A dense 70B model uses all 70B parameters on every token; an MoE model with 8 experts of 70B each (~560B total) might activate only ~140B parameters per token, the two experts the router selects. That makes it much cheaper to run while still benefiting from the broader knowledge stored across all experts. Mixtral 8x7B, DeepSeek-V3, GPT-4 (widely believed to be MoE), and Qwen's MoE variants all use this design.

A useful analogy: a dense model is like a single generalist doctor who has read every textbook and answers every question personally. An MoE model is like a hospital with many specialists and a triage nurse. The nurse (router) glances at your symptoms and sends you to two specialists, who together write the answer. The hospital "knows" much more in total, but any single visit is fast.

The trade-offs: MoE models need more memory (you have to load all experts even if you only use a few), training is trickier because the router can collapse onto a few favorite experts, and serving them efficiently requires expert-parallel infrastructure.

Related concepts to explore next: sparse activation, router/gating network, dense model, Mixtral, DeepSeek-V3, expert parallelism.
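To make the routing mechanism concrete, here is a minimal sketch of a top-k MoE layer in PyTorch. It is illustrative only: the class name, layer sizes, and expert count are hypothetical, not taken from any particular model, and it omits the load-balancing losses and expert-parallel dispatch that production systems rely on.

```python
# Minimal sketch of a top-k MoE layer (hypothetical sizes, not a real model's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router (gating network): a single linear layer scoring each expert per token.
        self.router = nn.Linear(d_model, n_experts)
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                           # x: (n_tokens, d_model)
        scores = self.router(x)                     # (n_tokens, n_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)        # normalize over the chosen experts only
        out = torch.zeros_like(x)
        # Each token is processed by just its top-k experts; their outputs are
        # combined with the router's weights. Unchosen experts do no work.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

# Example: 4 tokens, each routed to 2 of 8 experts.
layer = MoELayer()
tokens = torch.randn(4, 512)
print(layer(tokens).shape)  # torch.Size([4, 512])
```

In real systems the Python loop over experts is replaced by batched, expert-parallel dispatch across devices, and an auxiliary load-balancing loss is added during training so the router does not collapse onto a few favorite experts.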

Last updated: 2026-04-29
