Multi-head attention

A Transformer mechanism that runs several attention operations in parallel, letting the model focus on different relationships in the input at the same time.

Multi-head attention is the core building block of the Transformer. Instead of computing a single attention pattern over the input tokens, the model splits its representations into several "heads," runs attention on each in parallel, and concatenates the results. Each head learns to focus on a different kind of relationship (a minimal sketch of the computation appears below).

This matters because language carries many overlapping structures at once: syntax, coreference, topic, tone. A single attention map can't capture all of them cleanly. By giving the model multiple heads, each with its own learned query, key, and value projections, the network can attend to short-range word order in one head, long-range subject–verb agreement in another, and semantic similarity in a third. This is what gives models like GPT, Claude, and BERT their flexibility.

A rough analogy: imagine reading a sentence with several highlighters. One highlighter marks pronouns and what they refer to, another marks verbs and their objects, another marks emotional words. Multi-head attention is the model doing all of those passes simultaneously and then combining the notes.

Empirically, researchers have found that different heads in trained Transformers really do specialize: some track positional patterns, others track specific grammatical roles. The original "Attention Is All You Need" paper used 8 heads; modern large models use anywhere from 16 to 128.

Variants like multi-query attention (MQA) and grouped-query attention (GQA) reduce the memory cost by sharing keys and values across heads, which is now standard for efficient inference in models like Llama (see the second sketch below).

Related: self-attention, Transformer, query/key/value, GQA, attention heads.
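To make the mechanism concrete, here is a minimal PyTorch sketch of the computation described above: learned query/key/value projections split across heads, scaled dot-product attention run in parallel per head, and the results concatenated back together. The class name, dimension arguments, and the omission of masking and dropout are illustrative assumptions, not any particular model's implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn

class MultiHeadAttention(nn.Module):
    """Minimal multi-head self-attention sketch (no masking or dropout)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must divide evenly across heads"
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # One learned projection each for queries, keys, and values,
        # plus an output projection applied after the heads are recombined.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, d_model = x.shape

        # Project, then split the model dimension into (n_heads, d_head).
        def split(t: torch.Tensor) -> torch.Tensor:
            return t.view(batch, seq_len, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))

        # Scaled dot-product attention, computed independently for each head.
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        weights = F.softmax(scores, dim=-1)  # one attention map per head
        out = weights @ v                    # (batch, n_heads, seq_len, d_head)

        # Concatenate the heads back into the model dimension.
        out = out.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.w_o(out)

# Example (hypothetical sizes): an input of shape (2, 10, 512) with 8 heads
# comes back with the same shape, (2, 10, 512).
# MultiHeadAttention(d_model=512, n_heads=8)(torch.randn(2, 10, 512))
```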

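The key/value sharing behind MQA and GQA can also be sketched in a few lines. The function below assumes queries, keys, and values have already been projected and split into heads; the function name and tensor shapes are illustrative, not a specific library's API.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                            n_heads: int, n_kv_heads: int) -> torch.Tensor:
    """Sketch of GQA: q is (batch, n_heads, seq, d_head);
    k and v are (batch, n_kv_heads, seq, d_head) with n_kv_heads <= n_heads.

    Each group of (n_heads // n_kv_heads) query heads shares one key/value
    head, shrinking the KV cache by that factor. n_kv_heads == 1 recovers
    multi-query attention (MQA); n_kv_heads == n_heads recovers standard
    multi-head attention.
    """
    group_size = n_heads // n_kv_heads
    # Repeat each shared K/V head so it lines up with its group of query heads.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v
```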
Last updated: 2026-04-29
