Self-Attention

A mechanism that lets each token in a sequence attend to every token (including itself) and decide which ones matter most: the core operation inside Transformers.

Self-attention is the operation that lets a model weigh the relationships between all positions in a sequence at once. For every token, the model computes three vectors: a query, a key, and a value. The dot product of a token's query with every key, scaled by the square root of the key dimension, scores how strongly that token should "pay attention to" each position; a softmax turns the scores into weights, and the weighted sum of the value vectors becomes the token's new, context-aware representation.

It matters because it is the engine behind every modern LLM. Before self-attention, RNNs and LSTMs processed words one at a time and struggled with long-range dependencies. Self-attention reads the whole sentence in parallel, which is both faster on GPUs and far better at connecting distant words. That combination is why "Attention Is All You Need" (2017) reshaped the field.

Concrete example: in the sentence "The trophy didn't fit in the suitcase because it was too big," self-attention is what helps the model work out that "it" refers to the trophy, not the suitcase. Each word's representation is updated by mixing in information from the words it attends to most strongly.

In practice, models use *multi-head* self-attention, running several attention operations in parallel so that different heads can specialize (one tracking syntax, another tracking coreference, and so on). Causal or "masked" self-attention, used in GPT-style decoders, prevents a token from attending to future positions during training. Both variants are sketched in the code below.

Related concepts: Transformer, multi-head attention, query/key/value, positional encoding, KV cache, cross-attention.
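To make the query/key/value mechanics concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention with an optional causal mask. The function names, toy dimensions, and random weights are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v, causal=False):
    """Single-head self-attention over one sequence (illustrative sketch).

    X:             (seq_len, d_model) token representations
    W_q, W_k, W_v: (d_model, d_k) learned projection matrices
    """
    Q = X @ W_q  # queries: what each token is looking for
    K = X @ W_k  # keys: what each token offers to others
    V = X @ W_v  # values: the content that gets mixed together
    d_k = Q.shape[-1]
    # Dot-product scores, scaled by sqrt(d_k) to keep softmax inputs moderate.
    scores = Q @ K.T / np.sqrt(d_k)  # (seq_len, seq_len)
    if causal:
        # Mask out future positions (GPT-style decoder attention).
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # context-aware representations

# Toy usage with random, untrained weights (shape-checking only).
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v, causal=True)
print(out.shape)  # (5, 8)
```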
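And a sketch of the multi-head variant, which splits the model dimension across several heads, attends within each head independently, then concatenates the heads and remixes them with an output projection. Again, all shapes and names here are illustrative assumptions.

```python
import numpy as np

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Multi-head self-attention (illustrative sketch).

    X: (seq_len, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)
    """
    seq_len, d_model = X.shape
    d_head = d_model // n_heads

    def split(M):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    # Per-head scaled dot-product scores: (n_heads, seq_len, seq_len).
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)  # softmax over each row
    heads = weights @ V                        # (n_heads, seq_len, d_head)
    # Concatenate heads back to (seq_len, d_model), then mix with W_o.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 32))
Ws = [rng.normal(size=(32, 32)) for _ in range(4)]
print(multi_head_self_attention(X, *Ws, n_heads=4).shape)  # (6, 32)
```

Keeping d_head = d_model / n_heads means the multi-head layer costs roughly the same as one full-width head; the benefit comes from letting each head learn its own attention pattern.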

Last updated: 2026-04-29
