Introduction
When you type a sentence into a large language model (LLM), the model doesn’t “understand” words the way we do. Instead, it treats each word or part of a word—called a token—as a point in a vast numerical landscape. Behind the scenes, every decision the model makes about which word comes next is guided by statistics and probability.
In this essay, we’ll follow a single token—imagine the word “apple”—as it moves through two major phases of an LLM’s life: training, where the model learns from huge amounts of example text, and inference, where it uses what it learned to generate new text. Along the way, we’ll focus on the statistical and probabilistic machinery that lets these models turn raw text into predictions with surprising fluency.
From Tokens to Number Vectors: The Embedding
- Breaking text into tokens
- The very first step is tokenization: chopping text into pieces the model can handle. A token might be a whole word (“apple”), part of a word (“appl” + “e”), or even punctuation.
- Once the text is broken up, each token needs a way to be processed by math, because computers deal in numbers, not letters.
- The embedding table: a lookup of number lists
- Imagine a giant spreadsheet where each row corresponds to one token (say, “apple”) and each column is a numerical feature (an abstract quality like “fruitiness” or “color association”).
- This spreadsheet is called the embedding table. For a modern LLM, it might have hundreds of thousands of rows (one per token) and thousands of columns (dimensions)—for example, 50,000 tokens × 1,024 dimensions.
- Statistical initialization
- At the start of training, each embedding vector is filled with random numbers, typically drawn from a distribution centered at zero (like a bell curve).
- Those random numbers mean that initially there is no information about which tokens relate to which. “Apple” might be just as close to “banana” as it is to “justice.”
- Why random?
- Random initialization breaks symmetry: if every token started with the same vector, learning couldn’t differentiate them.
- The randomness also spreads tokens throughout the embedding space, giving the model room to pull related tokens closer together during learning.
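The lookup-and-random-initialization idea above can be sketched in a few lines of Python. The vocabulary, dimensions, and scale below are toy values chosen for illustration; a real LLM would use tens of thousands of tokens and a thousand or more dimensions.

```python
import numpy as np

# Toy vocabulary and embedding size (real LLMs: ~50,000 tokens x 1,024+ dims).
vocab = ["apple", "banana", "justice", "."]
dim = 8

rng = np.random.default_rng(0)
# Random initialization: small values drawn from a bell curve centered at zero,
# so no token starts out "knowing" anything about any other token.
embedding_table = rng.normal(loc=0.0, scale=0.02, size=(len(vocab), dim))

token_to_id = {tok: i for i, tok in enumerate(vocab)}

def embed(token: str) -> np.ndarray:
    """Look up a token's row in the embedding table."""
    return embedding_table[token_to_id[token]]

vec = embed("apple")
print(vec.shape)  # (8,)
```

Because the rows are independent random draws, "apple" starts out no closer to "banana" than to "justice"; training is what pulls related rows together.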
Turning Scores into Chances: Probability Distributions
- Forward pass computes scores
- Once tokens are converted to embeddings, they pass through layers of the neural network (self-attention layers, feed-forward layers, etc.).
- At the final layer, the model produces a raw score for each possible next token—these are just numbers on the real line, which could be negative, positive, large, or small.
- Softmax: from scores to probabilities
- To interpret those raw scores as probabilities, the model uses a function called softmax.
- Softmax turns a list of arbitrary scores (s₁, s₂, …, sₙ) into nonnegative numbers (p₁, p₂, …, pₙ) that add up to 1.
- Formally, pᵢ = e^(sᵢ) / (e^(s₁) + e^(s₂) + … + e^(sₙ)), but in plain English, it’s like giving each token a “weight” (by exponentiating its score) and then dividing by the total weight so everything becomes a proper slice of the probability pie.
- Interpreting the probabilities
- If “banana” gets probability 0.30 and “apple” 0.05, you can think:
- “Given this context, there’s a 30% chance the next token is ‘banana,’ and a 5% chance it’s ‘apple.’”
- Those numbers are statistical guesses based on patterns the model saw during training.
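A minimal softmax matching the formula above, applied to some hypothetical raw scores:

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    # Subtracting the max before exponentiating is a standard numerical-stability
    # trick; it leaves the resulting probabilities unchanged.
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()

# Hypothetical raw scores for four candidate next tokens.
scores = np.array([2.0, 0.5, -1.0, 0.0])
probs = softmax(scores)  # nonnegative, sums to 1; highest score -> highest probability
```

Note that softmax preserves the ordering of the scores: the token with the largest raw score always gets the largest slice of the probability pie.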
Learning by Error: Training with Statistical Loss
- Objective: maximize the chance of the correct token
- During training, the model reads billions of real sentences. For each position in each sentence, it tries to predict the actual next token.
- From a statistical viewpoint, the model is performing maximum likelihood estimation: it adjusts its internal numbers so that the probability it assigns to the true next token becomes as large as possible.
- Measuring surprise: cross-entropy loss
- To quantify “how wrong” the model is, we use a measure called cross-entropy loss. If the model assigns a low probability to the true token, the loss is high; if it assigns a high probability, the loss is low.
- In everyday terms: cross-entropy tells you how surprised the model would be to see the correct word. If the model gives “apple” only a 5% chance but the correct next word really is “apple,” that’s a big surprise (high loss).
- Statistics of the loss
- Over a massive dataset, we compute the average cross-entropy across every token prediction.
- Minimizing this average is akin to finding the set of model parameters that best explain the data under a probabilistic model.
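The “surprise” framing above translates directly into code: cross-entropy is just the negative log of the probability the model gave the true token. The probabilities below are made-up values echoing the earlier 5% “apple” example.

```python
import numpy as np

def cross_entropy(probs: np.ndarray, true_index: int) -> float:
    """Negative log-probability the model assigned to the true next token."""
    return float(-np.log(probs[true_index]))

# Hypothetical distribution over three candidate tokens.
probs = np.array([0.05, 0.30, 0.65])

# If "apple" (index 0) was correct but got only a 5% chance: big surprise.
loss_wrong = cross_entropy(probs, true_index=0)  # -ln(0.05) ~ 3.0
# If the 65% favorite was correct: small surprise.
loss_right = cross_entropy(probs, true_index=2)  # -ln(0.65) ~ 0.43
print(loss_wrong > loss_right)  # True
```

Averaging this quantity over every prediction in the dataset gives the training loss that gradient descent minimizes.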
Nudging Numbers: Back-Propagation and Gradient Descent
- How do we tweak hundreds of billions of numbers?
- The model’s parameters include every entry in the embedding table plus every weight and bias in every neural layer. Each of those is a single number.
- Computing gradients: the chain rule in action
- When the model makes a prediction and computes its loss, we can ask:
- “If I change this number slightly, how would the loss change?”
- That sensitivity is called a gradient.
- Back-propagation efficiently computes the gradient of the loss with respect to every parameter by applying the chain rule of calculus through the network’s layers.
- Stochastic gradient descent: small steps toward lower loss
- We don’t compute the gradient over all billions of tokens at once—that would be far too slow and memory-hungry. Instead, we pick a small batch (say, 512 sequences), compute the average gradient on that batch, and take a small step in the opposite direction of the gradient.
- Those repeated steps are collectively called stochastic gradient descent (SGD) or one of its more advanced variants (Adam, RMSProp, etc.).
- Statistical perspective
- Each batch gives a noisy estimate of the true gradient (hence “stochastic”).
- Over time, the random noise averages out, and the model’s parameters converge toward values that minimize overall surprise on the entire training set.
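Here is a minimal sketch of the “small steps against the gradient” loop on a single toy example. It leans on a well-known calculus result: for softmax followed by cross-entropy, the gradient of the loss with respect to the raw scores is simply (probabilities minus a one-hot vector for the true token)—exactly the kind of expression back-propagation derives layer by layer. Real SGD would average such gradients over random minibatches of data rather than repeat one example.

```python
import numpy as np

def softmax(s):
    w = np.exp(s - s.max())
    return w / w.sum()

# Toy setup: treat the three raw scores themselves as the parameters we tune.
scores = np.array([0.0, 0.0, 0.0])  # model starts out indifferent
true_index = 1                      # the correct next token
lr = 0.5                            # learning rate (step size)

for _ in range(100):
    probs = softmax(scores)
    # Gradient of cross-entropy loss w.r.t. the scores: probs - one_hot(true).
    grad = probs.copy()
    grad[true_index] -= 1.0
    scores -= lr * grad  # step opposite the gradient, shrinking the loss

print(softmax(scores)[true_index])  # close to 1 after training
```

After enough steps, nearly all the probability mass has shifted onto the true token—the loop has minimized the model’s surprise on this one example.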
Generating with Controlled Randomness: Inference and Sampling
- Inference is “forward-only”
- Once training is done, we freeze the parameters. There’s no more back-propagation—only forward passes to compute probabilities for next tokens.
- Greedy vs. sampling
- Greedy decoding picks the single token with highest probability at each step. This can lead to repetitive or bland text.
- Sampling treats the probabilities like a lottery and picks tokens at random according to those chances. This injects diversity.
- Temperature: dialing randomness up or down
- Before sampling, we can divide the scores by a temperature parameter T.
- If T < 1, the distribution sharpens—high-probability tokens get relatively more weight, making the model more conservative.
- If T > 1, the distribution flattens—rare tokens gain weight, making the output more creative (and riskier).
- From a statistical standpoint, temperature rescales the “confidence” of the model’s probability estimates.
- Top-k and nucleus (top-p) sampling
- Top-k: ignore all but the top k tokens by probability, then renormalize and sample.
- Nucleus (top-p): find the smallest set of tokens whose cumulative probability ≥ p (e.g. 90%), then sample from that set.
- These strategies limit the “lottery” to a reasonable pool, balancing safety and creativity.
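Temperature and nucleus sampling compose naturally into one sampling function. This is a sketch under toy assumptions—the scores and parameter values are illustrative, not from any real model.

```python
import numpy as np

def softmax(s):
    w = np.exp(s - s.max())
    return w / w.sum()

def sample_next(scores, temperature=1.0, top_p=1.0, rng=None):
    """Temperature plus nucleus (top-p) sampling over raw next-token scores."""
    if rng is None:
        rng = np.random.default_rng()
    # Temperature: divide scores by T before softmax (T<1 sharpens, T>1 flattens).
    probs = softmax(np.asarray(scores) / temperature)
    # Nucleus: keep the smallest set of tokens whose cumulative probability
    # reaches top_p, then renormalize and run the "lottery" over that set.
    order = np.argsort(probs)[::-1]           # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    kept_probs = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept_probs))

scores = [2.0, 1.0, 0.1, -1.0]
token_id = sample_next(scores, temperature=0.7, top_p=0.9,
                       rng=np.random.default_rng(0))
```

With these particular values, the nucleus keeps only the top two tokens, so the sampled id is always 0 or 1; at a very low temperature the distribution collapses toward greedy decoding.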
Measuring Confidence: Perplexity and Uncertainty
- What is perplexity?
- Perplexity is a statistical metric that measures how well a probability model predicts a sample. It’s defined as the exponential of the average cross-entropy loss.
- In plain language: perplexity is the effective number of options the model is choosing among. A perplexity of 50 means that, on average, the model is as uncertain as if it were choosing uniformly among 50 options.
- Why is perplexity useful?
- When comparing models, lower perplexity on the same test set generally indicates better predictive performance.
- It gives a single-number summary of how “surprised” the model still is when seeing real text.
- Uncertainty quantification
- The entire softmax probability distribution at each step reflects the model’s uncertainty.
- A sharp peak means high confidence in one token; a flat curve means the model is unsure and many tokens seem plausible.
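The definition above—exponential of the average cross-entropy—is a one-liner. The sanity check below uses a made-up model that always gives the true token a 1-in-50 chance, which should yield a perplexity of exactly 50.

```python
import numpy as np

def perplexity(probs_of_true_tokens) -> float:
    """exp(average cross-entropy): the model's effective number of choices."""
    losses = -np.log(np.asarray(probs_of_true_tokens))  # per-token surprise
    return float(np.exp(losses.mean()))

# A model that always assigns the true token probability 1/50 is exactly
# as uncertain as a uniform choice among 50 options.
print(perplexity([1 / 50] * 10))  # ~50
```

Lower values mean the model is, on average, less surprised by real text—which is why perplexity is a standard yardstick for comparing language models on a shared test set.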
Conclusion
At its core, a large language model is a statistical machine: it converts words into numbers, uses statistical learning to tune those numbers so that they assign high probabilities to real-world text, and then uses those learned probabilities to generate new text.
- Embeddings start as random scatterings of points in a high-dimensional space.
- Through maximum likelihood training and gradient-based optimization, embeddings—and all other parameters—are nudged to reflect real co-occurrence patterns in language.
- At inference time, the model treats its computed scores as chances and “draws” the next token from a probability distribution that can be tuned for creativity or precision.
Every word it writes is ultimately a choice made by crunching numbers and weighing probabilities. The magic of LLMs is the sheer scale—and statistical subtlety—of those processes, trained on oceans of text to build a surprisingly fluent mirror of human language.