Decoder

The part of a neural network that generates output tokens one at a time, used in most modern LLMs like GPT and Claude.

A decoder is the half of a Transformer architecture responsible for producing output — generating text one token at a time, where each new token is conditioned on everything generated so far. It uses **masked self-attention**, meaning each position can only attend to earlier positions, which is what makes left-to-right generation possible.

The term matters because almost every modern large language model — GPT-4, Claude, Llama, Gemini — is a "decoder-only" model. They dropped the encoder half of the original 2017 Transformer and kept stacking decoder blocks, because it turns out that next-token prediction alone is enough to learn a remarkable amount about language, code, and reasoning.

A useful analogy: imagine writing a sentence where you can only look back at the words you've already written, never peek ahead. That constraint is exactly what masked attention enforces. The model reads the prompt, then predicts the next word, appends it, and repeats — like autocomplete on steroids.

This is different from **encoder-only** models like BERT (which read the whole input at once and are used for classification or embeddings) and **encoder-decoder** models like T5 or the original Transformer (used for translation and seq2seq tasks where input and output are clearly separated).

Related concepts to explore: Transformer, encoder, attention, autoregressive, causal mask, next-token prediction.
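The two mechanics above — a causal mask that blocks attention to future positions, and an append-and-repeat generation loop — can be sketched in a few lines of numpy. This is a minimal illustration, not any real model's implementation; the function names, shapes, and the toy `logits_fn` are all invented for the example.

```python
import numpy as np

def causal_mask(n):
    # Lower-triangular boolean mask: position i may attend only to positions <= i.
    return np.tril(np.ones((n, n), dtype=bool))

def masked_self_attention(x, wq, wk, wv):
    """Single-head self-attention with a causal mask.

    x: (seq_len, d_model) token representations.
    wq, wk, wv: (d_model, d_head) projection matrices.
    Returns (output, attention_weights).
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Future positions get -inf, so softmax gives them exactly zero weight.
    scores = np.where(causal_mask(len(x)), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

def greedy_decode(logits_fn, prompt, n_new):
    """The read-predict-append loop: pick the most likely next token, repeat.

    logits_fn is a stand-in for a full decoder stack; it maps the token
    sequence so far to a logit per vocabulary entry.
    """
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(int(np.argmax(logits_fn(tokens))))
    return tokens
```

Running `masked_self_attention` on random inputs, every attention weight above the diagonal comes out zero — each position genuinely cannot "peek ahead" — while each row still sums to 1 over the allowed past positions.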

Last updated: 2026-04-29
