Below is the end-to-end path a modern decoder-only Transformer (the common LLM) takes at inference time—i.e., with fixed, trained weights—from your characters to its next token, and then it repeats the last steps to write a whole response.
0) Setup (one-time per session)
- Load weights: The model’s learned parameters (matrices and biases) are loaded to GPU/TPU memory (often quantized to 8/4-bit to fit).
- Disable training-only features: No gradient tracking, no weight updates, dropout off.
- Allocate the KV cache: Memory to store Keys and Values for attention at every layer as tokens are generated. This makes subsequent tokens fast.
1) Text → tokens (tokenization)
- Your string prompt is split into tokens by a tokenizer (usually a Byte-Pair Encoding/Unigram model).
- Example: "Transformers are great." → ["Transform", "ers", " are", " great", "."] → IDs like [10213, 421, 504, 778, 13].
- Shape after tokenization: a 1D integer vector of length T (sequence length).
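To make the idea concrete, here is a minimal greedy longest-match tokenizer. This is an illustrative toy, not a real BPE/Unigram implementation (those learn merges/vocabulary from data and operate on bytes); the vocabulary here is invented to reproduce the example split:

```python
def tokenize(text, vocab):
    """Toy greedy longest-match tokenizer (illustration only)."""
    tokens = []
    i = 0
    while i < len(text):
        # Take the longest vocab entry that matches at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character: fall back to a single char
            i += 1
    return tokens

vocab = {"Transform", "ers", " are", " great", "."}
print(tokenize("Transformers are great.", vocab))
# -> ['Transform', 'ers', ' are', ' great', '.']
```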
2) Tokens → embeddings (lookup)
- The model has an embedding table E of shape (V, d_model), where V = vocabulary size (e.g., 50k) and d_model = hidden width (e.g., 3,072).
- Each token id t selects row E[t]. Stack them: X ∈ R^{T × d_model}.
- Intuition: this maps discrete symbols to points in a semantic geometry where distances/angles carry meaning.
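The lookup is literally a row gather. A NumPy sketch with toy sizes (real models use V ≈ 50k, d_model ≈ 3072):

```python
import numpy as np

V, d_model = 100, 8                      # toy sizes for illustration
rng = np.random.default_rng(0)
E = rng.standard_normal((V, d_model))    # (V, d_model) embedding table

ids = np.array([10, 42, 5, 7, 13])       # token ids, length T = 5
X = E[ids]                               # row gather -> (T, d_model)
assert X.shape == (5, 8)
assert np.allclose(X[0], E[10])          # row 0 is the embedding of id 10
```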
3) Add positional information
Transformers are permutation-invariant, so we inject order:
- Absolute: add a learned P[pos] to each token embedding.
- Sinusoidal: add fixed sines/cosines.
- RoPE (rotary): apply a complex-like rotation to Q/K vectors so attention becomes position-aware.
After this step you still have X ∈ R^{T × d_model}, now encoding what and where.
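The sinusoidal variant can be sketched directly (this follows the original "Attention Is All You Need" formulation: even dimensions get sines, odd dimensions get cosines, at geometrically spaced frequencies):

```python
import numpy as np

def sinusoidal_positions(T, d_model):
    """Fixed sinusoidal positional encodings, shape (T, d_model)."""
    pos = np.arange(T)[:, None]                    # (T, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2)
    freqs = pos / (10000 ** (2 * i / d_model))     # (T, d_model/2)
    P = np.zeros((T, d_model))
    P[:, 0::2] = np.sin(freqs)                     # even dims: sin
    P[:, 1::2] = np.cos(freqs)                     # odd dims: cos
    return P

T, d_model = 6, 8
X = np.zeros((T, d_model))
X = X + sinusoidal_positions(T, d_model)           # shape is preserved
assert X.shape == (T, d_model)
assert np.isclose(X[0, 1], 1.0)                    # cos(0) = 1 at position 0
```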
4) Pass through L stacked Transformer blocks
Each block updates the sequence representation via two sublayers:
- Self-Attention sublayer (many heads in parallel)
- Feed-Forward Network (FFN/MLP) sublayer
Residual connections and layer norms keep signals stable.
Typical block order (pre-norm style):
H0 = X
for ℓ in 1..L:
A = H_{ℓ-1} + Attention( LayerNorm(H_{ℓ-1}) )
Hℓ = A + FFN( LayerNorm(A) )
Everything below details those two sublayers.
5) The attention math (what “pays attention to what”)
5a) Make Q, K, V (linear projections)
For each token vector H ∈ R^{T × d_model}:
Q = H W_Q # (T × d_model) · (d_model × d_k·h) → (T × d_k·h)
K = H W_K
V = H W_V
- Split into h heads: reshape (T × d_k·h) → (h × T × d_k), where d_k = d_model / h.
5b) Scaled dot-product attention per head
For each head:
scores = (Q K^T) / sqrt(d_k) # (T × d_k) · (d_k × T) → (T × T)
scores += mask # causal mask: forbid looking to future tokens
weights = softmax(scores, axis=last) # row-wise softmax over T keys
head_out = weights V # (T × T) · (T × d_k) → (T × d_k)
- Causal mask enforces left-to-right generation.
- Softmax turns similarity into attention weights.
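The three lines above translate almost directly into NumPy. A minimal single-head sketch (with the usual max-subtraction trick for a numerically stable softmax, which the pseudocode omits):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.
    Q, K, V: (T, d_k) arrays; returns (T, d_k)."""
    T, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                  # (T, T) similarities
    scores += np.triu(np.full((T, T), -1e9), k=1)    # forbid future positions
    scores -= scores.max(axis=-1, keepdims=True)     # stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # rows sum to 1
    return weights @ V                               # blend value vectors

rng = np.random.default_rng(0)
T, d_k = 5, 4
Q, K, V = rng.standard_normal((3, T, d_k))
out = causal_attention(Q, K, V)
assert out.shape == (T, d_k)
assert np.allclose(out[0], V[0])   # token 0 can only attend to itself
```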
5c) Combine heads and project back
- Concatenate heads: (h × T × d_k) → (T × (h·d_k)) = (T × d_model).
- Final linear mix: AttnOut = Concat(heads) W_O, shape (T × d_model).
Interpretation: Each token becomes a weighted blend of other tokens’ value vectors, where weights depend on query-key similarity and position. This is the model’s “where to look” mechanism.
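The split/merge bookkeeping is pure reshaping, no arithmetic. A small demonstration with toy sizes:

```python
import numpy as np

T, h, d_k = 5, 3, 4
QKV = np.arange(T * h * d_k).reshape(T, h * d_k)       # pretend projection output

heads = QKV.reshape(T, h, d_k).transpose(1, 0, 2)      # (T, h*d_k) -> (h, T, d_k)
assert heads.shape == (h, T, d_k)

merged = heads.transpose(1, 0, 2).reshape(T, h * d_k)  # back to (T, h*d_k)
assert np.array_equal(merged, QKV)                     # round trip is exact
```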
6) The FFN (nonlinear mixing per position)
Each position (token) goes through an MLP independently:
U = LayerNorm(A) # (T × d_model)
Z = U W1 + b1 # (T × d_ff)
Z = act(Z) # GeLU/Swish; with SwiGLU: split & gate
Y = Z W2 + b2 # (T × d_model)
H = A + Y # residual
- d_ff is larger than d_model (e.g., 4×–8×), giving capacity to re-express features.
- Many modern models use SwiGLU/GeGLU for better expressivity.
- In MoE models, W1/W2 are replaced by routed expert MLPs; at inference only a few experts are active per token.
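A SwiGLU-style FFN can be sketched as follows (a simplification: real implementations fuse the gate/up projections and omit biases in various ways, and exact conventions differ by model):

```python
import numpy as np

def swiglu_ffn(U, W_gate, W_up, W_down):
    """SwiGLU-style FFN sketch: SiLU-gated branch times a linear branch,
    projected back down. U: (T, d_model) -> (T, d_model)."""
    gate = U @ W_gate                              # (T, d_ff)
    silu = gate * (1.0 / (1.0 + np.exp(-gate)))    # SiLU(x) = x * sigmoid(x)
    up = U @ W_up                                  # (T, d_ff)
    return (silu * up) @ W_down                    # gate, then project down

rng = np.random.default_rng(0)
T, d_model, d_ff = 4, 8, 32
U = rng.standard_normal((T, d_model))
W_gate, W_up = rng.standard_normal((2, d_model, d_ff))
W_down = rng.standard_normal((d_ff, d_model))
Y = swiglu_ffn(U, W_gate, W_up, W_down)
assert Y.shape == (T, d_model)
```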
7) Stack repeats & build deep features
- Repeating attention + FFN across L layers progressively sculpts the representation:
- Lower layers: local patterns (morphology, short-range syntax).
- Middle layers: syntax/semantics, entity linking, coreference.
- Higher layers: long-range dependencies, world knowledge, planning patterns.
- The residual stream (the thing being added to at every step) carries forward an ever-richer vector at each token position—your running semantic state.
8) Language Modeling Head → logits
After the final block, take the last token’s hidden state (for next-token prediction) or all states (to score every position). Project to vocabulary size:
- Weight tying is common: reuse the embedding matrix E (transposed) as the output head.
logits = H E^T # (T × d_model) · (d_model × V) → (T × V)
- The last row (position T) holds scores for the next token across all V tokens.
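With tied weights, the head is one matrix multiply against the embedding table's transpose (toy sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_model, T = 50, 8, 5
E = rng.standard_normal((V, d_model))     # shared embedding table
H = rng.standard_normal((T, d_model))     # final-layer hidden states

logits = H @ E.T                          # (T, d_model)·(d_model, V) -> (T, V)
assert logits.shape == (T, V)
next_token_scores = logits[-1]            # last row: scores for the next token
assert next_token_scores.shape == (V,)
```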
9) Turn logits into a token choice (sampling)
- Convert to probabilities: p = softmax(logits_T).
- Apply a decoding policy:
- Greedy: pick argmax.
- Temperature τ: divide logits by τ; τ < 1 sharpens, τ > 1 diversifies.
- Top-k / Top-p (nucleus): restrict to the k most likely tokens, or to the smallest set whose probabilities sum to p (e.g., 0.9), then sample.
- Repetition penalties: adjust logits to avoid loops.
This yields the next token id.
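Temperature and nucleus sampling combine naturally into one small function. A sketch (one common formulation; libraries differ in details such as tie-breaking and where the cutoff token falls):

```python
import numpy as np

def sample_top_p(logits, temperature=0.8, top_p=0.9, rng=None):
    """Temperature + nucleus (top-p) sampling over a 1D logit vector."""
    rng = rng or np.random.default_rng()
    z = logits / temperature
    p = np.exp(z - z.max())
    p /= p.sum()                                     # softmax
    order = np.argsort(p)[::-1]                      # most likely first
    keep = np.cumsum(p[order]) - p[order] < top_p    # smallest set covering top_p
    candidates = order[keep]
    q = p[candidates] / p[candidates].sum()          # renormalize over the nucleus
    return int(rng.choice(candidates, p=q))

logits = np.array([2.0, 1.0, 0.5, -3.0])
token = sample_top_p(logits, rng=np.random.default_rng(0))
assert 0 <= token < len(logits)
```

Setting `temperature` very low and `top_p` near 0 degenerates to greedy argmax; raising both diversifies the output.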
10) Autoregressive loop (generate more)
- Append the new token to the sequence.
- KV cache update: For that token, compute its Q/K/V in each layer; store K and V to the cache so the model doesn’t recompute them for older tokens next step.
- For the next token:
- Only compute Q freshly for the last position; K/V for all previous positions are read from the cache.
- Attention then uses [K_cached; K_new] and [V_cached; V_new].
- Repeat steps 5–9 until end conditions (EOS token, length, or stop sequence).
This turns O(T²) recompute into O(T) incremental cost after the first pass.
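The incremental step can be sketched for a single head: per generated token, project only the newest position, append its K/V to the cache, and attend over the whole cache (no causal mask needed, since the cache holds only the past):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 8
Wq, Wk, Wv = rng.standard_normal((3, d_model, d_k))

K_cache = np.zeros((0, d_k))            # grows by one row per generated token
V_cache = np.zeros((0, d_k))

for step in range(4):
    x = rng.standard_normal(d_model)            # newest token's hidden vector
    q, k, v = x @ Wq, x @ Wk, x @ Wv            # O(1) projection work per step
    K_cache = np.vstack([K_cache, k])           # [K_cached; K_new]
    V_cache = np.vstack([V_cache, v])           # [V_cached; V_new]
    scores = (K_cache @ q) / np.sqrt(d_k)       # (step+1,) scores over the past
    w = np.exp(scores - scores.max())
    w /= w.sum()                                # softmax over cached positions
    out = w @ V_cache                           # (d_k,) attention output

assert K_cache.shape == (4, d_k)
assert out.shape == (d_k,)
```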
11) Where “matrix MATH” happens (shapes and flow)
To make the linear algebra concrete, consider a single layer with:
T = sequence length, d_model = 3072, h = 24 heads, d_k = d_model / h = 128, d_ff = 12288.
Key matrix multiplications at inference (per step):
- Q/K/V projections: (T × d_model) · (d_model × d_k·h) → (T × d_k·h) (done three times)
- Attention scores (per head): (T × d_k) · (d_k × T) → (T × T), then softmax + mask
- Weighted sum: (T × T) · (T × d_k) → (T × d_k)
- Heads combine: concat heads → (T × d_model); then (T × d_model) · (d_model × d_model) → (T × d_model)
- FFN: (T × d_model) · (d_model × d_ff) → (T × d_ff) → nonlinearity → (T × d_ff) · (d_ff × d_model) → (T × d_model)
- LM head: (T × d_model) · (d_model × V) → (T × V)
All parameters are fixed at inference. The model is just executing this computational graph.
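These shapes can be checked mechanically. A quick NumPy run that keeps the same ratios (h = 24 heads, d_ff = 4·d_model) but scales the sizes down so it executes instantly:

```python
import numpy as np

T, d_model, h = 8, 96, 24                 # scaled-down stand-ins for 3072/24
d_k, d_ff, V = d_model // h, 4 * d_model, 1000
rng = np.random.default_rng(0)
H = rng.standard_normal((T, d_model))

Q = H @ rng.standard_normal((d_model, d_k * h))
assert Q.shape == (T, d_k * h)                     # Q/K/V projection
heads = Q.reshape(T, h, d_k).transpose(1, 0, 2)
assert heads.shape == (h, T, d_k)                  # head split
scores = heads[0] @ heads[0].T / np.sqrt(d_k)
assert scores.shape == (T, T)                      # attention scores, one head
ffn_in = H @ rng.standard_normal((d_model, d_ff))
assert ffn_in.shape == (T, d_ff)                   # FFN up-projection
logits = H @ rng.standard_normal((d_model, V))
assert logits.shape == (T, V)                      # LM head
```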
12) Why the output is “rational” (without hidden hand-waving)
- During training, the model minimized next-token loss over trillions of tokens.
- The learned weights encode statistical regularities: grammar, facts, chains of implication, formats, plans—compressed into the geometry of the embedding and hidden spaces.
- At inference, attention routes information along the graph that solved those training tasks best. When your prompt resembles patterns seen during training, the model follows those circuits to produce high-probability continuations—which we experience as reasoning, explanation, or planning.
- System prompts, few-shot exemplars, and tool-augmented prompting nudge the routing and the location in the semantic manifold the model moves through while decoding.
13) Practical extras you often see in modern LLMs
- Grouped/Multi-Query Attention: share K/V across heads to reduce memory bandwidth while keeping multiple Q projections.
- Flash/Scaled Attention kernels: fused GPU kernels that compute softmax attention with less memory IO.
- Rotary Position Embeddings (RoPE) scaling: tricks to extend context length (e.g., NTK-aware, YaRN).
- Speculative decoding: a small draft model proposes tokens, big model verifies, improving throughput.
- Vision tokens / Function tokens: extra modalities or tool calls are serialized as special tokens that join the same pipeline.
- Logit bias/ban lists: small adjustments to steer style or prevent certain outputs.
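To illustrate the grouped-query idea from the list above: h query heads share h_kv < h key/value heads, shrinking the KV cache by a factor of h / h_kv. A shapes-only sketch with toy sizes:

```python
import numpy as np

T, d_k = 6, 4
h, h_kv = 8, 2                         # 4 query heads share each K/V head
rng = np.random.default_rng(0)
Q = rng.standard_normal((h, T, d_k))   # one Q per query head
K = rng.standard_normal((h_kv, T, d_k))  # only h_kv K/V heads are cached
V = rng.standard_normal((h_kv, T, d_k))

group = h // h_kv
outs = []
for i in range(h):
    k, v = K[i // group], V[i // group]          # head i reuses its group's K/V
    s = Q[i] @ k.T / np.sqrt(d_k)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    outs.append(w @ v)
out = np.stack(outs)                             # (h, T, d_k) as in full MHA
assert out.shape == (h, T, d_k)
```

With h_kv = 1 this is multi-query attention; with h_kv = h it reduces to standard multi-head attention.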
14) A compact pseudocode of the whole loop
ids = tokenize(prompt) # [T]
H = E[ids] # (T, d_model)
H = add_positions(H) # abs/sin/RoPE
for ℓ in range(L):
U = layernorm[ℓ](H)
Q = U @ WQ[ℓ]; K = U @ WK[ℓ]; V = U @ WV[ℓ] # split heads
attn = softmax((Q @ K.T) / sqrt(d_k) + causal_mask)
A = (attn @ V) @ WO[ℓ] # merge heads
H = H + A # residual
U = layernorm2[ℓ](H)
H = H + ffn[ℓ](U) # MLP
H = final_layernorm(H) # pre-norm models normalize once more before the head
logits = H[-1] @ E.T # tie weights
next_id = sample_from(logits, strategy="top_p", p=0.9, temp=0.8)
append(next_id); update_kv_cache(); repeat...
15) Mental model (why “semantic geometry” helps)
- Tokenization picks coordinates for known symbols.
- Embeddings place those symbols in a vector space where meaning is partially linearized.
- Attention is content-addressable lookup: “find the relevant stuff in context and mix it in.”
- FFN is nonlinear re-expression: “compress, expand, and recombine features.”
- Depth (stacked blocks) composes these operations into higher-order abstractions.
- The LM head projects the final point in this space back onto the vocabulary axes to pick the next symbol.
The “reasoning” you see is the trajectory the last-token vector takes through this manifold as the loop unrolls.
16) TL;DR
- Tokenize text → ids.
- Lookup embeddings and add position.
- Run L times: LayerNorm → Multi-Head Self-Attention (+ residual) → LayerNorm → FFN (+ residual).
- Project last state through LM head to get logits over the vocab.
- Sample a token.
- Cache K/V, append token, and repeat until done.
That’s the entire inference phase: pure, deterministic matrix math (plus a stochastic sampler if you enable it) flowing through a fixed artificial neural network that was shaped during training to make the “rational” continuation the most probable one.