The Distributed Memory of Attention: How Transformers Encode and Exploit Word‑to‑Word Relationships


Introduction – From Static Linguistic Tables to Dynamic Geometries

Classical NLP systems—ranging from early rule‑based grammars to the n‑gram language models that once powered web search—treated the connections between words as static artifacts. Whether one stored them in a grammar, a phrase‑table, or a sparse Markov matrix, the relationship was essentially looked up. The 2017 Transformer paper shattered that paradigm. Instead of archiving pairwise links, it learns a geometry—a high‑dimensional energy landscape in which every token is a point whose coordinates continuously shift under the forces of self‑attention. Relationships are therefore re‑computed every time a new sentence arrives. This essay develops that idea in depth, offering a tour of the architecture, the learning dynamics, and the modern research ecosystem that has grown around them.

We begin by situating Transformers historically, then dive from embeddings and positional encodings through multi‑head attention and deep residual stacks, finally emerging in today’s frontier of efficient attention, cross‑modal extensions, and interpretability. Along the way we emphasize the central claim: a Transformer’s “memory” of language is distributed across its parameters and activated on demand by the attention mechanism.


1 Historical Context: From Symbolic Rules to End‑to‑End Representation Learning

Early symbolic approaches (e.g., context‑free grammars) relied on human‑curated tables of syntactic relations. Statistical methods such as IBM Models 1‑5 or phrase‑based SMT replaced hand‑written rules with count‑based associations, yet still stored explicit co‑occurrence statistics. Even recurrent neural networks, despite learning continuous embeddings, preserved sequence order only by serially propagating hidden states. Transformers broke free of both lookup tables and serial processing by (a) embedding all tokens in parallel, and (b) letting them attend to one another through content‑addressable queries and keys. This shift from sequential to set‑like processing enables vastly greater hardware utilization—GPUs thrive on matrix multiplications—and unlocks the architectural flexibility that now spans text, vision, audio and biology.


2 Embeddings: The Semantic Foundation

2.1 Tokenisation and Sub‑word Units

Modern LLMs seldom operate directly on human‑defined “words.” Instead they rely on Byte‑Pair Encoding (BPE), SentencePiece, or similar algorithms that learn a vocabulary of frequent character sequences. A string such as “¿De quién es?” might be segmented into tokens like ["¿", "De", "Ġqui", "én", "Ġes", "?"]. Each token index $i$ maps to a learned embedding vector $E_i \in \mathbb{R}^{d_\text{model}}$. Training forces semantically or morphologically related tokens to occupy neighboring regions of this vector space; for instance, the pieces of “quién” (“who”) gravitate toward other interrogative pronouns in Spanish as well as their English counterparts “who,” “whom,” “whose.”
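As a toy sketch of this lookup step, the snippet below maps token strings to integer indices and then to rows of an embedding matrix. The hand‑made vocabulary, the tiny d_model, and the random matrix are illustrative stand‑ins, not a real tokenizer or trained embeddings.

```python
import numpy as np

# Illustrative only: a real system would use a trained BPE/SentencePiece vocabulary.
vocab = {"¿": 0, "De": 1, "Ġqui": 2, "én": 3, "Ġes": 4, "?": 5}
d_model = 8                                   # tiny dimension for readability
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), d_model))    # embedding matrix (random here, learned in practice)

tokens = ["¿", "De", "Ġqui", "én", "Ġes", "?"]
ids = [vocab[t] for t in tokens]              # token strings -> integer indices
X = E[ids]                                    # look up one d_model-vector per token
print(X.shape)                                # (6, 8): sequence length x d_model
```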

2.2 Geometric Semantics

Because distances in this space correspond to abstract similarity, embeddings serve as proto‑concepts on which attention can operate. They also carry surprisingly rich syntactic information: linear offsets capture relationships such as gender, tense, or pluralization (the famous “king – man + woman ≈ queen” analogue). These properties arise purely from back‑propagated gradients; no explicit table of relations is ever stored.
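A tiny sketch of the offset‑arithmetic idea follows; the vectors are random stand‑ins, so only the mechanics (vector arithmetic plus cosine similarity) carry over to real trained embeddings, where the nearest neighbour of the query would indeed tend to be “queen.”

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings; real ones come from a trained model, not from this script.
rng = np.random.default_rng(1)
emb = {w: rng.normal(size=16) for w in ["king", "man", "woman", "queen"]}

query = emb["king"] - emb["man"] + emb["woman"]
# With trained embeddings, "queen" tends to be the closest vocabulary item to `query`.
print({w: round(cosine(query, v), 3) for w, v in emb.items()})
```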


3 Positional Information: From Sinusoids to Rotary Codes

Self‑attention treats the input as an unordered set unless we inject positional encodings. The original Transformer used fixed sinusoidal functions:

$$\text{PE}(p,\,2i) = \sin\bigl(p / 10000^{2i/d_\text{model}}\bigr), \qquad \text{PE}(p,\,2i+1) = \cos\bigl(p / 10000^{2i/d_\text{model}}\bigr).$$

These waveforms let the model derive both absolute and relative offsets algebraically. Variants soon flourished—learned absolute embeddings (BERT), relative shift (Transformer‑XL), and Rotary Position Embeddings (RoPE), which rotate queries and keys by position‑dependent angles so that their dot products depend only on the relative offset between tokens. RoPE is now the default in many large‑scale LLMs because it encodes relative position cleanly and exhibits a natural decay of attention with distance.

Whichever scheme is chosen, the positional vector $\text{PE}_p$ is simply added to the semantic embedding $E_i$:

$$x_p = E_i + \text{PE}_p.$$

Thus the representation fuses “what” and “where” information before any attention occurs.
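To make the sinusoidal scheme and the additive fusion concrete, here is a minimal NumPy sketch. The dimensions, the random “semantic” embeddings, and the helper name sinusoidal_pe are illustrative, not a reference implementation.

```python
import numpy as np

def sinusoidal_pe(n_positions, d_model):
    """Fixed sinusoidal positional encodings: one d_model-vector per position."""
    positions = np.arange(n_positions)[:, None]              # (n, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # the even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)   # p / 10000^(2i/d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

d_model, n = 8, 6
rng = np.random.default_rng(2)
E_tokens = rng.normal(size=(n, d_model))    # semantic embeddings (random stand-ins)
X = E_tokens + sinusoidal_pe(n, d_model)    # fuse "what" and "where" before attention
```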


4 Self‑Attention: Dynamic Relational Reasoning

Self‑attention computes how much each token should consult every other token. For a length‑$n$ sequence, the model projects inputs $X \in \mathbb{R}^{n \times d}$ into queries $Q = XW_Q$, keys $K = XW_K$, and values $V = XW_V$. It then forms the scaled dot‑product matrix

$$\alpha = \text{softmax}\bigl(QK^\top / \sqrt{d_k}\bigr),$$

where $\alpha_{ij}$ is the attention weight from token $i$ to token $j$. Multiplying by $V$ yields context vectors

$$Z = \alpha V,$$

which flow onward through the residual path in place of the raw embeddings. Crucially, the weights $W_Q, W_K, W_V$ are shared across positions; relational patterns are therefore not memorised for a specific word pair but generalise to any pair whose embeddings project to similar queries and keys.
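The equations above translate almost line for line into NumPy. The single‑head sketch below uses random inputs and projections purely for shape‑checking; it is a toy illustration of scaled dot‑product attention, not production code.

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)    # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product attention: alpha = softmax(Q K^T / sqrt(d_k))."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    alpha = softmax(Q @ K.T / np.sqrt(d_k))    # (n, n) attention weights
    return alpha @ V, alpha                     # context vectors Z and the weights

n, d, d_k = 6, 8, 4
rng = np.random.default_rng(3)
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))
Z, alpha = self_attention(X, W_Q, W_K, W_V)
assert np.allclose(alpha.sum(axis=-1), 1.0)    # each row is a distribution over tokens
```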

4.1 Multi‑Head Factorisation

Rather than one massive attention map, the Transformer computes $h$ smaller heads, with $d = h \cdot d_k$. Each head can specialise: empirical probing finds heads that resolve coreference, others that mark punctuation boundaries, even heads that align subject–verb number. The concatenation is linearly transformed by $W_O$ and passed onward.
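As a rough sketch of this factorisation (again with random weights and toy dimensions), one can project once, reshape the channels into h heads, attend within each head, and re‑mix the concatenated outputs with W_O:

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Split d = h * d_k channels into h heads, attend independently, then re-mix."""
    n, d = X.shape
    d_k = d // h
    def split(M):                                 # (n, d) -> (h, n, d_k)
        return (X @ M).reshape(n, h, d_k).transpose(1, 0, 2)
    Q, K, V = split(W_Q), split(W_K), split(W_V)
    alpha = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))   # (h, n, n)
    heads = alpha @ V                                          # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d)            # back to (n, d)
    return concat @ W_O

n, d, h = 6, 8, 2
rng = np.random.default_rng(4)
X = rng.normal(size=(n, d))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d, d)) for _ in range(4))
Z = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)             # (6, 8)
```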

4.2 Interpretability Myths and Realities

Attention weights provide a tempting “explanation,” yet they are neither sparse nor faithful by default; later feed‑forward layers remix information non‑linearly. Nevertheless, visualising $\alpha$ matrices often reveals linguistically plausible patterns—e.g., Spanish interrogative inversions or English determiner‑noun links—confirming that self‑attention captures relationships even if it does not faithfully expose them.


5 Deep Stacks and Residual Pathways

A single attention layer models roughly first‑order interactions. Stacking $N$ identical encoder blocks (attention + two‑layer feed‑forward + LayerNorm + residual) lets higher layers integrate wider contexts and compose hierarchical abstractions. Early layers might bind “quién ↔ es,” middle layers capture clause boundaries, and top layers encode discourse‑level pragmatics. Residual shortcuts preserve gradient flow, while LayerNorm stabilises activations. Because every block re‑uses the same positional scheme, the model can in principle learn arbitrary‑length dependencies; empirical studies show effective receptive fields spanning thousands of tokens when compute permits.
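A compressed sketch of one such block follows, assuming a single attention head, a ReLU feed‑forward layer, and post‑norm placement for brevity; real implementations add multi‑head attention, learned LayerNorm gains and biases, and dropout.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)          # gain/bias omitted for brevity

def encoder_block(X, p):
    """One post-norm encoder block: attention + residual, then 2-layer FFN + residual."""
    Q, K, V = X @ p["W_Q"], X @ p["W_K"], X @ p["W_V"]
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V
    X = layer_norm(X + attn)                                   # residual connection 1
    ffn = np.maximum(0, X @ p["W_1"]) @ p["W_2"]               # ReLU feed-forward
    return layer_norm(X + ffn)                                 # residual connection 2

n, d, d_ff = 6, 8, 32
rng = np.random.default_rng(5)
p = {name: rng.normal(size=shape) for name, shape in
     [("W_Q", (d, d)), ("W_K", (d, d)), ("W_V", (d, d)),
      ("W_1", (d, d_ff)), ("W_2", (d_ff, d))]}
X = rng.normal(size=(n, d))
for _ in range(4):    # stacking N identical blocks (N = 4 here; weights shared for brevity)
    X = encoder_block(X, p)
```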


6 The Learned Parameter Space: Where Relationships Actually Live

After training on billions of tokens, the Transformer’s parameters (often billions themselves) implicitly encode the statistics of language. Embedding matrices capture lexical semantics; attention matrices encode how content and position interact; feed‑forward weights implement non‑linear mixers that integrate across channels. None of these tensors contains an explicit row labelled “quién ↔ es = 0.92.” Instead, distributed signatures generate that high attention score on demand whenever input context grounds it.

Because gradients propagate model‑wide, a given parameter participates in many linguistic facts simultaneously (superposition). This distribution is what allows robust generalisation: if the model learns that “es” attends strongly to preceding wh‑pronouns in Spanish, it can apply the same rule when it encounters rare forms such as “¿De cuáles es?”.


7 Scaling Challenges and Efficient Attention

Vanilla self‑attention is $O(n^2)$ in time and memory. At a sequence length of 32 k tokens, the attention matrices alone can overwhelm the memory of a single accelerator. Research since 2019 tackles this head‑on:

  • Sparse Patterns — Longformer uses sliding windows plus a handful of global tokens (a minimal mask of this kind is sketched after this list).
  • Low‑Rank / Kernel Methods — Performer approximates softmax attention with random features for linear complexity.
  • Memory‑Efficient Kernels — FlashAttention 1–2 fuse the attention computation into GPU kernels that keep intermediates in on‑chip SRAM, avoiding materialisation of the full $n \times n$ matrix and delivering 2–4× speedups while remaining exact.
  • State‑Space Alternatives — S4 and Mamba reformulate sequence modelling with linear state‑space recurrences (computable as convolutions), achieving sub‑quadratic scaling, with S4 surpassing Transformer baselines on the Long‑Range Arena benchmark.
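To illustrate the sparse‑pattern idea from the first bullet, here is a toy mask in the spirit of Longformer’s window‑plus‑global scheme. It is only the masking logic, not the library’s implementation; window size and global positions are arbitrary.

```python
import numpy as np

def sliding_window_mask(n, window, global_positions=()):
    """Boolean (n, n) mask: True where attention is allowed."""
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window   # each token sees +/- window neighbours
    for g in global_positions:                              # global tokens see and are seen by all
        mask[g, :] = True
        mask[:, g] = True
    return mask

mask = sliding_window_mask(n=10, window=2, global_positions=(0,))
scores = np.random.default_rng(6).normal(size=(10, 10))
scores = np.where(mask, scores, -np.inf)                    # disallowed pairs get -inf before softmax
```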

These advances matter because relationship computation, not storage, dominates cost. By accelerating attention, we unlock ever‑longer contexts and richer relational reasoning.


8 Cross‑Modal and Domain‑Specific Extensions

8.1 Vision Transformers (ViT)

By splitting an image into 16 × 16 pixel patches and treating each patch as a “token,” ViTs leverage the same attention machinery to model spatial relations. Pre‑trained ViTs now match or exceed convolutional networks on ImageNet and perform strongly on downstream tasks such as segmentation and visual question answering.
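A toy sketch of the patch‑tokenisation step follows; the patch size, image size, and projection width are illustrative, and the random projection stands in for the learned patch embedding.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into non-overlapping patch 'tokens' of length patch*patch*C."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    patches = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return patches                                  # (num_patches, patch*patch*C)

rng = np.random.default_rng(7)
img = rng.random((224, 224, 3))
tokens = patchify(img)                              # (196, 768) for 16x16 RGB patches
W_embed = rng.normal(size=(tokens.shape[1], 512))   # linear projection to an illustrative d_model
X = tokens @ W_embed                                # patch embeddings, ready for attention
```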

8.2 Protein Folding (AlphaFold 2)

AlphaFold 2 replaces textual tokens with amino‑acid residues and pairwise features, building geometric constraints into the attention stack. The model predicts 3‑D coordinates directly, achieving near‑atomic accuracy long thought out of reach, and its open database initially released more than 350 000 predicted structures. Relationships here are chemical: residue–residue couplings emerge from attention maps, demonstrating the architecture’s domain‑agnostic relational power.

8.3 Multimodal Foundation Models

CLIP, Flamingo, and Gemini interleave text and image tokens, learning a shared representational space in which “quién” can attend to both Spanish captions and the region of an image depicting a person. Again, no explicit table encodes these links; they crystallise inside weight tensors during joint training.
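A hedged sketch of the shared‑space idea: encode a batch of images and captions, normalise, and compare every pair with a similarity matrix whose diagonal holds the matching pairs. The random features below stand in for real vision and text encoders, and the scale factor and loss are simplified from CLIP‑style training.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(8)
batch = 4
# Stand-ins for real encoders: in CLIP these are a vision Transformer and a text Transformer.
image_features = l2_normalize(rng.normal(size=(batch, 64)))
text_features = l2_normalize(rng.normal(size=(batch, 64)))

logits = image_features @ text_features.T * 100.0   # similarity of every image with every caption
# Training pushes the diagonal (matching pairs) up and off-diagonal entries down.
labels = np.arange(batch)
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
loss_image_to_text = -np.log(probs[labels, labels]).mean()
```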


9 Fine‑Tuning, Adaptation, and Memory Augmentation

Large models are expensive to retrain. Parameter‑efficient fine‑tuning (LoRA, adapters, IA³) freezes most weights and learns small Δ‑matrices that modulate attention or feed‑forward layers. These deltas inject task‑specific relationships (e.g., legal jargon) into an otherwise general memory. Retrieval‑augmented Transformers (RAG, RETRO) move some relational burden outside the network entirely: queries retrieve nearest‑neighbour documents whose embeddings seed extra attention keys. This hybrid approach blends stored and computed knowledge without exploding parameters.
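A minimal sketch of the low‑rank idea behind LoRA, with illustrative shapes: the pretrained weight W stays frozen while a rank‑r product of small matrices is trained as an additive correction. B starts at zero, so training begins from the original behaviour.

```python
import numpy as np

d, r = 512, 8                        # full width vs. low rank (r << d)
rng = np.random.default_rng(9)

W = rng.normal(size=(d, d))          # frozen pretrained weight (e.g., a query projection)
A = rng.normal(size=(r, d)) * 0.01   # trainable rank-r factor
B = np.zeros((d, r))                 # trainable, initialised to zero so the delta starts as a no-op

def lora_forward(x, scale=1.0):
    """y = x W + scale * x (B A): the frozen path plus a learned low-rank correction."""
    return x @ W + scale * (x @ B) @ A

x = rng.normal(size=(1, d))
y = lora_forward(x)                  # identical to x @ W until A and B are trained
```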


10 Interpretability and Probing: Can We Read the Memory?

Researchers probe attention heads, neuron activations, and MLP weight matrices to uncover latent structure. Techniques include:

  • Attention Roll‑out: Graph‑theoretic accumulation of attention paths to trace influence (a minimal version is sketched after this list).
  • Direct Logit Attribution: Linear decomposition of next‑token logits into contributions from individual components.
  • Feature‑Lens: Measuring neuron selectivity for concepts such as “is‑plural” or “French city.”
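Here is a minimal sketch of attention roll‑out as it is commonly described: average the heads, add the identity matrix to account for the residual path, renormalise, and multiply the per‑layer matrices. The attention tensors below are random placeholders rather than real model outputs.

```python
import numpy as np

def attention_rollout(attentions):
    """attentions: list of per-layer (heads, n, n) attention weights."""
    n = attentions[0].shape[-1]
    rollout = np.eye(n)
    for layer_attn in attentions:
        A = layer_attn.mean(axis=0)                # average over heads
        A = A + np.eye(n)                          # add identity for the residual path
        A = A / A.sum(axis=-1, keepdims=True)      # renormalise each row
        rollout = A @ rollout                      # compose influence across layers
    return rollout                                 # rollout[i, j] ~ influence of token j on token i

rng = np.random.default_rng(10)
layers, heads, n = 4, 2, 6
raw = rng.random((layers, heads, n, n))
raw = raw / raw.sum(axis=-1, keepdims=True)        # each row becomes a valid attention distribution
influence = attention_rollout(list(raw))           # (6, 6) aggregate influence map
```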

Findings suggest a progressive disentanglement: lower layers align lexical categories; middle layers store syntax; upper layers host semantics and world knowledge. Yet the mapping is not one‑to‑one. Relationships remain entangled—a necessary compromise for parameter efficiency.


11 Future Directions: Beyond Quadratic Attention

State‑space models (SSMs), dynamic linear attention, routing networks, and mixture‑of‑experts (MoE) layers seek to generalise the Transformer principle—content‑addressable interaction—while sidestepping its bottlenecks. Whether these architectures supplant or augment Transformers, the core insight endures: rich relationships need not be listed; they can be embedded into a learnable dynamical system and reinstantiated whenever new context arrives.


Conclusion – The Transformer as a Relational Engine

A Transformer is neither a phrase‑table nor a recurrent memory. It is a function $f_\theta$ whose parameters $\theta$ carve a manifold where context vectors evolve under attention. Word‑to‑word relationships are latent fields on that manifold, activated when queries and keys collide. Through embeddings, positional encodings, multi‑head attention, deep residual composition, and ever‑scaling compute, the model learns to generate those fields so reliably that we humans read coherent translations, answers, and even protein structures.

As we push toward longer contexts and richer modalities, engineering breakthroughs like FlashAttention and RoPE ensure that computing these relationships remains tractable, while interpretability research steadily illuminates the geometry hidden inside $f_\theta$. The paradigm shift that began with a simple question—“What if we attend to everything at once?”—continues to redefine our understanding of how machines can model language, vision, and the very physics of life. Transformers do not store relationships; they become them. And in that metamorphosis lies the secret of their power.

