Transformer

A neural network architecture introduced by Google in 2017 that uses self-attention to process sequences in parallel — the foundation of modern LLMs like GPT and Claude.

The Transformer is a neural network architecture introduced by Google researchers in the 2017 paper *Attention Is All You Need*. Its core idea is **self-attention**: instead of reading a sequence word by word like the older recurrent networks (RNNs) it replaced, every token can directly "look at" every other token in the input and decide which ones matter for understanding it. Because self-attention processes all tokens at once, it is highly parallelizable on GPUs, which made it possible to scale models to billions of parameters and train them on huge text corpora. Almost every modern large language model, including GPT, Claude, Gemini, and LLaMA, is a Transformer or a close variant. The architecture also powers image models (Vision Transformer), audio models (Whisper), and code models.

A rough analogy: imagine reading a long sentence in which every word can simultaneously glance at every other word and weigh how relevant each one is. In the sentence "The trophy didn't fit in the suitcase because it was too big", self-attention helps the model work out that "it" refers to the trophy, not the suitcase, by attending strongly to "trophy".

The original architecture has an encoder (which reads the input) and a decoder (which generates the output), but most modern LLMs are **decoder-only** Transformers that simply predict the next token. The architecture's main weakness is that the cost of attention grows quadratically with sequence length, which is why long context windows are expensive and why research into more efficient attention variants is ongoing. The sketch below makes both the mechanism and the quadratic cost concrete.

Related concepts: self-attention, multi-head attention, positional encoding, encoder-decoder, decoder-only, GPT, BERT.
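A minimal NumPy sketch of single-head scaled dot-product self-attention, following the paper's softmax(QKᵀ / √d_k) · V formulation. The function names, shapes, and toy dimensions here are illustrative assumptions, not taken from any particular library:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv, causal=False):
    """Single-head scaled dot-product self-attention (illustrative sketch).

    X          : (seq_len, d_model) token embeddings.
    Wq, Wk, Wv : (d_model, d_k) learned projection matrices.
    causal     : if True, mask future positions, as decoder-only LLMs do
                 when predicting the next token.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    # Every token scores every other token. This (seq_len, seq_len) matrix
    # is exactly where the quadratic cost in sequence length comes from.
    scores = Q @ K.T / np.sqrt(d_k)
    if causal:
        # Forbid attending to positions after the current one.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)  # each row sums to 1: how much to look at each token
    return weights @ V                  # weighted mix of the value vectors

# Toy usage with random weights.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv, causal=True)
print(out.shape)  # (5, 8)
```

Real Transformers run many such attention heads in parallel (multi-head attention) and add positional encodings, since the attention operation itself has no built-in notion of token order.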

Last updated: 2026-04-29
