Architecture

Attention

A mechanism that lets a model decide which other tokens in the input matter most when processing each token.

Attention is the core mechanism inside Transformer models that lets every token "look at" every other token in the input and decide which ones are relevant. For each token, the model computes a set of weights over all the other tokens, then builds a new representation as a weighted mix of them. Tokens that matter get high weights; irrelevant ones get near-zero weights.

It matters because attention is what made modern LLMs possible. Before attention, models like RNNs processed text strictly left-to-right and struggled to connect words far apart in a sentence. Attention removes that bottleneck: any token can directly reference any other, no matter the distance. The 2017 paper "Attention Is All You Need" showed that you don't even need recurrence; attention alone is enough.

A simple analogy: when you read the sentence "The trophy didn't fit in the suitcase because it was too big," your brain has to decide whether "it" refers to the trophy or the suitcase. Attention is the model doing the same thing, looking back at earlier words and weighting them to resolve the reference.

In practice, models use *multi-head attention*, running many such lookups in parallel so different "heads" can track different kinds of relationships (syntax, coreference, topic, etc.). Self-attention (tokens attending to other tokens in the same sequence) is what powers GPT, Claude, and Gemini. Cross-attention (one sequence attending to another) is used in encoder-decoder models and multimodal systems.

Related concepts to explore next: Transformer, self-attention, multi-head attention, KV cache, context window, positional encoding.
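To make the "weighted mix" concrete, here is a minimal sketch of single-head scaled dot-product self-attention in NumPy. The shapes, variable names, and toy inputs are illustrative assumptions rather than any library's actual API; multi-head attention simply runs several copies of this computation in parallel and concatenates the results.

```python
# Minimal sketch of scaled dot-product self-attention (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a sequence of token vectors.

    X             : (seq_len, d_model) input token representations
    W_q, W_k, W_v : (d_model, d_head) learned projection matrices
    """
    Q = X @ W_q  # queries: what each token is looking for
    K = X @ W_k  # keys: what each token offers
    V = X @ W_v  # values: the content that gets mixed together
    d_head = Q.shape[-1]
    # Every token scores every other token; dividing by sqrt(d_head)
    # keeps the dot products from growing with dimension.
    scores = Q @ K.T / np.sqrt(d_head)   # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    # New representation: a weighted mix of the value vectors.
    return weights @ V                   # (seq_len, d_head)

# Toy usage: 4 tokens, model dimension 8, head dimension 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 4)
```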

Last updated: 2026-04-29
