Introduction
Large Language Models (LLMs) like GPT-4, Claude, and Gemini can produce text that sounds like it came from a human. But under the hood, they’re not “thinking” in words — they’re manipulating numbers. These numbers live in a statistical geometry of meaning, built up during training from billions of examples.
In this guide, we’ll explore — step-by-step — how an LLM takes in text, transforms it into a multidimensional map of meaning, processes it with an artificial neural network, and finally turns it into probabilities for the next word. We’ll also peel back the curtain on how that map is learned in the first place.
Part 1 — A Brief History of Embeddings
Before GPT-style transformers, the big breakthrough in representing meaning came from word embeddings in the early 2010s.
Word2Vec
Google’s word2vec (2013) showed that training a simple model to predict words from their neighbors produced dense, meaningful vectors. Suddenly, you could do:
vector("king") - vector("man") + vector("woman") ≈ vector("queen")
The relationships were stored in the geometry of the vectors, not in explicit rules.
GloVe
GloVe built embeddings from co-occurrence statistics across large corpora, still producing static vectors — meaning each word had one representation, no matter the sentence.
The Context Problem
Static embeddings failed for polysemy:
- “Bank” in “river bank” ≠ “bank” in “open an account.”
Transformers and Contextual Embeddings
With models like BERT and GPT, embeddings became contextual. The representation for “bank” now depends on the surrounding words. In GPT-style models, these representations are updated at every layer as the model processes a sequence.
Part 2 — Step 1: Tokenization
An LLM starts by breaking text into tokens — the atomic pieces it understands.
- A token might be a whole word: "cat".
- Or part of a word: "un", "believ", "able".
- Or punctuation: ".".
Example:
"The cat sat on the mat."
might tokenize to:
["The", " cat", " sat", " on", " the", " mat", "."]
Each token gets a token ID — an integer index.
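Real LLMs use learned subword schemes such as byte-pair encoding, but the core idea can be sketched with a toy longest-match tokenizer over a hypothetical vocabulary (the tokens and IDs below are invented for illustration):

```python
# Toy lookup-based tokenizer. Real models learn their vocabulary with
# byte-pair encoding; this hypothetical table just mirrors the example above.
vocab = {"The": 0, " cat": 1, " sat": 2, " on": 3, " the": 4, " mat": 5, ".": 6}

def tokenize(text, vocab):
    """Greedily match the longest known token at each position."""
    ids = []
    while text:
        for token in sorted(vocab, key=len, reverse=True):
            if text.startswith(token):
                ids.append(vocab[token])
                text = text[len(token):]
                break
        else:
            raise ValueError(f"No token matches: {text!r}")
    return ids

print(tokenize("The cat sat on the mat.", vocab))  # [0, 1, 2, 3, 4, 5, 6]
```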
Part 3 — Step 2: The Embedding Layer
The token IDs are looked up in an embedding matrix — a giant table with:
- Rows = tokens in the vocabulary (e.g., 50,000+).
- Columns = embedding dimensions (e.g., 4,096).
The output is a vector for each token.
If “cat” has ID 456, its embedding might be:
[0.12, -0.43, 0.91, ..., 0.05] (4,096 numbers)
Worked Tiny Example
If we shrink to 3 dimensions:
| Token | ID | Embedding (x, y, z) |
|---|---|---|
| “cat” | 1 | [0.8, 0.1, 0.4] |
| “dog” | 2 | [0.9, 0.2, 0.5] |
| “mat” | 3 | [0.1, 0.7, 0.3] |
Already, “cat” and “dog” are near each other — they have similar coordinates in this space.
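That closeness can be measured directly. A minimal sketch using the toy 3-D table above (Euclidean distance, chosen purely for illustration):

```python
import math

# The toy 3-D embeddings from the table above.
embeddings = {
    "cat": [0.8, 0.1, 0.4],
    "dog": [0.9, 0.2, 0.5],
    "mat": [0.1, 0.7, 0.3],
}

def distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(distance(embeddings["cat"], embeddings["dog"]))  # small: ~0.17
print(distance(embeddings["cat"], embeddings["mat"]))  # large: ~0.93
```

"cat" and "dog" sit far closer to each other than either does to "mat", exactly as the table suggests.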
Part 4 — Step 3: The Multidimensional Statistical Map
The embedding matrix forms a map of meaning. Tokens are points in a multidimensional space. Closeness = similar usage patterns in training data.
Why multidimensional?
Meaning has many independent aspects:
- Topic
- Part of speech
- Formality
- Sentiment
- Domain
A 2D map couldn’t capture all that — we need thousands of axes.
How it’s learned: The positions are discovered automatically during training by adjusting embeddings so that prediction accuracy improves.
Part 5 — Step 4: Transformer Layers and Context
The embedding is the starting point. Transformers update each token’s vector using self-attention and feed-forward networks.
Self-Attention
Every token:
- Makes a query vector (“What am I looking for?”).
- Makes a key vector (“What do I have?”).
- Makes a value vector (the information it carries).
The dot product of one token’s query with another token’s key gives a similarity score, and a softmax over the scores gives attention weights.
Tokens then combine value vectors from others, weighted by these scores. This lets context flow between words.
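The mechanism above can be sketched in a few lines — a single attention head over invented 2-D vectors, omitting the learned projection matrices and the usual scaling factor that real transformers include:

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Dot-product attention: each output is a weighted mix of value vectors."""
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]  # similarities
        weights = softmax(scores)                                      # attention weights
        out = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Two toy tokens; all numbers are invented for illustration.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
print(attention(q, k, v))  # each row is a blend of both value vectors
```

Each token ends up mostly attending to itself (its query aligns with its own key) but still blends in some of the other token's value — context flowing between words.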
Part 6 — Step 5: Weights and Biases
Inside every transformation are:
- Weights: Matrices that rotate, stretch, and mix vector dimensions.
- Biases: Shifts applied after weighting.
These are learned numbers, tuned during training to minimize prediction error.
Part 7 — Step 6: Output Logits
After many layers, each token position has a final hidden vector.
To predict the next token:
- Multiply this vector by the output matrix (often the transpose of the embedding matrix).
- Add biases.
- Result = logits — one score per vocabulary token.
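A minimal sketch of that projection, using a made-up 3-token vocabulary and hidden size 2 (with tied weights, so the output matrix is just the embedding matrix):

```python
# Output projection sketch: one logit per vocabulary token,
# computed as a dot product with that token's embedding row.
embedding_matrix = [  # hypothetical, one row per vocabulary token
    [0.8, 0.1],  # "cat"
    [0.9, 0.2],  # "dog"
    [0.1, 0.7],  # "mat"
]
bias = [0.0, 0.0, 0.0]

def logits(hidden, E, b):
    """Score every vocabulary token against the final hidden vector."""
    return [sum(h * e for h, e in zip(hidden, row)) + bi
            for row, bi in zip(E, b)]

print(logits([0.75, 0.15], embedding_matrix, bias))  # ≈ [0.615, 0.705, 0.18]
```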
Part 8 — Step 7: Softmax to Probabilities
Softmax turns logits into probabilities:
- Exponentiate each logit.
- Divide by the sum.
- Now they’re between 0 and 1 and sum to 1.
Example:
logits: mat=4.0, floor=2.0, roof=0.5
softmax: mat≈0.858, floor≈0.116, roof≈0.026
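The arithmetic can be checked with a short softmax sketch:

```python
import math

def softmax(logits):
    """Exponentiate each logit, then divide by the sum."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([4.0, 2.0, 0.5])  # mat, floor, roof
print([round(p, 3) for p in probs])  # [0.858, 0.116, 0.026]
```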
Part 9 — Step 8: Choosing the Next Token
- Greedy: pick the highest probability.
- Sampling: pick randomly, weighted by probability.
- Top-k / Nucleus: limit the candidate pool.
Then append the chosen token and repeat.
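The strategies above can be sketched over a hypothetical probability table (the numbers are invented):

```python
import random

def choose(probs, strategy="greedy", k=2):
    """Toy decoding strategies over a {token: probability} dict."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if strategy == "greedy":
        return ranked[0][0]          # always pick the top token
    if strategy == "top-k":
        ranked = ranked[:k]          # restrict the candidate pool first
    tokens = [t for t, _ in ranked]
    weights = [p for _, p in ranked]
    return random.choices(tokens, weights=weights)[0]  # weighted sampling

probs = {"mat": 0.7, "floor": 0.2, "roof": 0.1}  # hypothetical distribution
print(choose(probs))                  # "mat" every time
print(choose(probs, "top-k", k=2))    # "mat" or "floor", never "roof"
```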
Part 10 — Why It’s Statistical, Not “Knowing”
LLMs don’t look up facts — they match patterns. The geometry of the embedding space reflects co-occurrence in training data, not truth tables.
That’s why they can “hallucinate” when prompted with rare or ambiguous input.
Part 11 — Visual Metaphors
Galaxy Map: Each token is a star; clusters = related meanings.
Recipe Card: Each dimension is an ingredient; a token’s vector is the recipe.
Part 12 — Worked Numerical Example
Mini-model: vocab = “cat”, “dog”, “mat”, embed size = 2.
Embedding matrix:
cat: [0.8, 0.1]
dog: [0.9, 0.2]
mat: [0.1, 0.7]
Context vector: [0.75, 0.15]
Dot products:
- cat = 0.615
- dog = 0.705
- mat = 0.18
Softmax:
- cat: 0.365
- dog: 0.399
- mat: 0.236
Most likely = “dog”.
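These numbers can be verified in a few lines:

```python
import math

# The mini-model from above: 2-D embeddings and a context vector.
embeddings = {"cat": [0.8, 0.1], "dog": [0.9, 0.2], "mat": [0.1, 0.7]}
context = [0.75, 0.15]

# Dot product of the context vector with each token's embedding.
scores = {t: sum(c * e for c, e in zip(context, v))
          for t, v in embeddings.items()}

# Softmax over the scores.
total = sum(math.exp(s) for s in scores.values())
probs = {t: math.exp(s) / total for t, s in scores.items()}

print({t: round(s, 3) for t, s in scores.items()})
print({t: round(p, 3) for t, p in probs.items()})
print(max(probs, key=probs.get))  # dog
```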
Part 13 — Appendix: How the Embedding Map Is Learned
Training Objective
Predict next token → compare with truth → adjust parameters to improve.
Loss Function
Cross-entropy:
Loss = -log(P(correct_token))
Lower is better.
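A one-line sketch of this loss, with hypothetical probabilities:

```python
import math

def cross_entropy(prob_correct):
    """Loss is low when the model assigns high probability to the truth."""
    return -math.log(prob_correct)

print(round(cross_entropy(0.9), 3))   # 0.105 -- confident and right: small loss
print(round(cross_entropy(0.01), 3))  # 4.605 -- confident and wrong: large loss
```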
Backpropagation
Computes gradients — nudges for each parameter.
If “cat” is wrong, its embedding moves toward better context alignment.
Attention Weights
Learned so queries align with keys that improve prediction.
Tied Embeddings
Often, input embeddings = output projection matrix, to unify understanding and generation.
Gradient Descent
Like rolling a marble down a hill — the “height” is error, the position is all parameters.
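A toy one-parameter version of that marble roll, minimizing the made-up loss (w - 3)^2 whose gradient is 2(w - 3):

```python
def gradient_descent(w, lr=0.1, steps=50):
    """Repeatedly step downhill along the gradient of (w - 3)^2."""
    for _ in range(steps):
        w -= lr * 2 * (w - 3)  # parameter nudged opposite the gradient
    return w

print(gradient_descent(0.0))  # converges toward 3.0, the loss minimum
```

Real training does the same thing simultaneously across billions of parameters, with the loss surface defined by prediction error over the training data.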
Statistical Nature
Positions in space come entirely from training statistics — no human labeling of “meaning”.
Emergent Geometry
Clustering, analogies, and semantic structure emerge naturally from optimizing prediction accuracy.
Conclusion
Every time an LLM answers a question, it:
- Turns words into points in a learned multidimensional space.
- Reshapes that space with attention and neural layers.
- Scores every possible next token with dot products.
- Converts scores into probabilities.
- Samples the next token.
- Repeats.
The magic isn’t in one step — it’s in the fact that all of this is learned automatically from data.
What you’re seeing in an LLM’s output is the emergent behavior of an enormous statistical machine, built to navigate a landscape of meaning it shaped for itself.