From Tokens to Predictions: A Plain-English Deep Dive into How Large Language Models Embed, Process, and Predict Language



Introduction

Large Language Models (LLMs) like GPT-4, Claude, and Gemini can produce text that sounds like it came from a human. But under the hood, they’re not “thinking” in words — they’re manipulating numbers. These numbers live in a statistical geometry of meaning, built up during training from billions of examples.

In this guide, we’ll explore — step-by-step — how an LLM takes in text, transforms it into a multidimensional map of meaning, processes it with an artificial neural network, and finally turns it into probabilities for the next word. We’ll also peel back the curtain on how that map is learned in the first place.


Part 1 — A Brief History of Embeddings

Before GPT-style transformers, the big breakthrough in representing meaning came from word embeddings in the early 2010s.

Word2Vec

Google’s word2vec (2013) showed that training a simple model to predict words from their neighbors produced dense, meaningful vectors. Suddenly, you could do:

vector("king") - vector("man") + vector("woman") ≈ vector("queen")

The relationships were stored in the geometry of the vectors, not in explicit rules.
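
You can see the arithmetic with toy two-dimensional vectors (invented here purely for illustration; real word2vec vectors are learned and have hundreds of dimensions):

import numpy as np

# Toy vectors: dimension 0 loosely means "royalty", dimension 1 loosely means "gender".
vecs = {
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
}

result = vecs["king"] - vecs["man"] + vecs["woman"]
closest = min(vecs, key=lambda w: np.linalg.norm(vecs[w] - result))
print(closest)   # queen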

GloVe

GloVe built embeddings from co-occurrence statistics across large corpora, still producing static vectors — meaning each word had one representation, no matter the sentence.

The Context Problem

Static embeddings failed for polysemy:

  • “Bank” in “river bank” ≠ “bank” in “open an account.”

Transformers and Contextual Embeddings

With models like BERT and GPT, embeddings became contextual. The representation for “bank” now depends on the surrounding words. In GPT-style models, these representations are updated at every layer as the model processes a sequence.


Part 2 — Step 1: Tokenization

An LLM starts by breaking text into tokens — the atomic pieces it understands.

  • A token might be a whole word: "cat".
  • Or part of a word: "un", "believ", "able".
  • Or punctuation: ".".

Example:

"The cat sat on the mat."
might tokenize to:

["The", " cat", " sat", " on", " the", " mat", "."]

Each token gets a token ID — an integer index.
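
A small sketch of this step, assuming the open-source tiktoken library is installed (the exact splits and IDs differ from model to model):

# pip install tiktoken   (byte-pair-encoding tokenizer used by OpenAI models)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("The cat sat on the mat.")
print(ids)                               # a list of integer token IDs, one per token
print([enc.decode([i]) for i in ids])    # the corresponding token strings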


Part 3 — Step 2: The Embedding Layer

The token IDs are looked up in an embedding matrix — a giant table with:

  • Rows = tokens in the vocabulary (e.g., 50,000+).
  • Columns = embedding dimensions (e.g., 4,096).

The output is a vector for each token.
If “cat” has ID 456, its embedding might be:

[0.12, -0.43, 0.91, ..., 0.05]   (4,096 numbers)

Worked Tiny Example

If we shrink to 3 dimensions:

Token    ID    Embedding (x, y, z)
“cat”    1     [0.8, 0.1, 0.4]
“dog”    2     [0.9, 0.2, 0.5]
“mat”    3     [0.1, 0.7, 0.3]

Already, “cat” and “dog” are near each other — they have similar coordinates in this space.
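
In code, the embedding layer is nothing more than row lookup in that table. A minimal numpy sketch using the toy values above:

import numpy as np

# Toy embedding matrix: one row per vocabulary token (IDs from the table above).
embedding_matrix = np.array([
    [0.0, 0.0, 0.0],   # ID 0: unused in this toy example
    [0.8, 0.1, 0.4],   # ID 1: "cat"
    [0.9, 0.2, 0.5],   # ID 2: "dog"
    [0.1, 0.7, 0.3],   # ID 3: "mat"
])

token_ids = [1, 3]                        # "cat", "mat"
vectors = embedding_matrix[token_ids]     # the "lookup" is plain row indexing
print(vectors)                            # [[0.8 0.1 0.4]
                                          #  [0.1 0.7 0.3]]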


Part 4 — Step 3: The Multidimensional Statistical Map

The embedding matrix forms a map of meaning. Tokens are points in a multidimensional space. Closeness = similar usage patterns in training data.

Why multidimensional?
Meaning has many independent aspects:

  • Topic
  • Part of speech
  • Formality
  • Sentiment
  • Domain

A 2D map couldn’t capture all that — we need thousands of axes.

How it’s learned: The positions are discovered automatically during training by adjusting embeddings so that prediction accuracy improves.
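
“Closeness” is usually measured with cosine similarity, the angle between two vectors. A quick check with the toy vectors from Part 3:

import numpy as np

def cosine(a, b):
    # Cosine similarity: 1.0 = same direction, 0.0 = unrelated directions.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cat = np.array([0.8, 0.1, 0.4])
dog = np.array([0.9, 0.2, 0.5])
mat = np.array([0.1, 0.7, 0.3])

print(round(cosine(cat, dog), 2))   # ~1.00: "cat" and "dog" point in nearly the same direction
print(round(cosine(cat, mat), 2))   # ~0.39: "cat" and "mat" are much farther apart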


Part 5 — Step 4: Transformer Layers and Context

The embedding is the starting point. Transformers update each token’s vector using self-attention and feed-forward networks.

Self-Attention

Every token:

  1. Makes a query vector (“What am I looking for?”).
  2. Every other token makes a key vector (“What do I have?”).
  3. Dot product of query and key = similarity score.
  4. Softmax over scores = attention weights.

Tokens then combine value vectors from others, weighted by these scores. This lets context flow between words.
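
A minimal single-head self-attention sketch in numpy, following the four steps above. The weight matrices are random stand-ins; real models learn them, use many heads, and (in GPT-style decoders) also mask out future positions:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8           # 4 tokens, 8-dimensional vectors (toy sizes)
X = rng.normal(size=(seq_len, d_model))      # token vectors entering the layer

W_q = rng.normal(size=(d_model, d_head))     # learned projection matrices (random here)
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

Q, K, V = X @ W_q, X @ W_k, X @ W_v          # steps 1 and 2: queries, keys (and values)
scores = Q @ K.T / np.sqrt(d_head)           # step 3: query/key dot products (scaled)
weights = softmax(scores, axis=-1)           # step 4: attention weights, each row sums to 1
output = weights @ V                         # each token's new vector mixes others' values
print(weights.round(2))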


Part 6 — Step 5: Weights and Biases

Inside every transformation are:

  • Weights: Matrices that rotate, stretch, and mix vector dimensions.
  • Biases: Shifts applied after weighting.

These are learned numbers, tuned during training to minimize prediction error.
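
In code, each such transformation is a matrix multiply followed by a bias addition. A tiny sketch with made-up sizes:

import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 3
W = rng.normal(size=(d_in, d_out))    # weights: a learned matrix that mixes dimensions
b = rng.normal(size=(d_out,))         # biases: a learned shift applied after the multiply

x = np.array([0.8, 0.1, 0.4, 0.2])    # an incoming token vector
y = x @ W + b                         # the basic building block inside every layer
print(y.round(3))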


Part 7 — Step 6: Output Logits

After many layers, each token position has a final hidden vector.

To predict the next token:

  1. Multiply this vector by the output matrix (often the transpose of the embedding matrix).
  2. Add biases.
  3. Result = logits — one score per vocabulary token.
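
A minimal sketch of that projection, reusing a tiny embedding matrix as the output matrix and omitting the bias term for brevity:

import numpy as np

vocab = ["cat", "dog", "mat"]
embedding_matrix = np.array([[0.8, 0.1],    # "cat"
                             [0.9, 0.2],    # "dog"
                             [0.1, 0.7]])   # "mat"

hidden = np.array([0.75, 0.15])             # final hidden vector for the last position
logits = embedding_matrix @ hidden          # one dot-product score per vocabulary token
print(dict(zip(vocab, logits.round(3))))    # {'cat': 0.615, 'dog': 0.705, 'mat': 0.18}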

Part 8 — Step 7: Softmax to Probabilities

Softmax turns logits into probabilities:

  1. Exponentiate each logit.
  2. Divide by the sum.
  3. Now they’re between 0 and 1 and sum to 1.

Example:

logits: mat=4.0, floor=2.0, roof=0.5
softmax: mat=0.858, floor=0.116, roof=0.026
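
The same three steps in numpy (subtracting the largest logit first is a standard numerical-stability trick that doesn't change the result):

import numpy as np

logits = np.array([4.0, 2.0, 0.5])         # mat, floor, roof
probs = np.exp(logits - logits.max())      # step 1: exponentiate (shifted by the max)
probs /= probs.sum()                       # step 2: divide by the sum
print(probs.round(3))                      # [0.858 0.116 0.026] -- step 3: they sum to 1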

Part 9 — Step 8: Choosing the Next Token

  • Greedy: pick the highest probability.
  • Sampling: pick randomly, weighted by probability.
  • Top-k / Nucleus: limit the candidate pool.

Then append the chosen token and repeat.
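
A small sketch of all three strategies over the probabilities from the previous step (k = 2 is an arbitrary illustrative cutoff):

import numpy as np

rng = np.random.default_rng(0)
vocab = np.array(["mat", "floor", "roof"])
probs = np.array([0.858, 0.116, 0.026])

greedy = vocab[int(np.argmax(probs))]      # greedy: always "mat"
sampled = rng.choice(vocab, p=probs)       # sampling: usually "mat", occasionally not

k = 2                                      # top-k: keep the k most likely, renormalize, sample
top = np.argsort(probs)[-k:]
top_probs = probs[top] / probs[top].sum()
top_k_choice = rng.choice(vocab[top], p=top_probs)

print(greedy, sampled, top_k_choice)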


Part 10 — Why It’s Statistical, Not “Knowing”

LLMs don’t look up facts — they match patterns. The geometry of the embedding space reflects co-occurrence in training data, not truth tables.

That’s why they can “hallucinate” when prompted with rare or ambiguous input.


Part 11 — Visual Metaphors

Galaxy Map: Each token is a star; clusters = related meanings.
Recipe Card: Each dimension is an ingredient; a token’s vector is the recipe.


Part 12 — Worked Numerical Example

Mini-model: vocab = “cat”, “dog”, “mat”, embed size = 2.

Embedding matrix:

cat: [0.8, 0.1]
dog: [0.9, 0.2]
mat: [0.1, 0.7]

Context vector: [0.75, 0.15]

Dot products:

  • cat = 0.615
  • dog = 0.705
  • mat = 0.18

Softmax:

  • cat: 0.365
  • dog: 0.399
  • mat: 0.236

Most likely = “dog”.
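
The whole mini-model fits in a few lines of numpy, so you can reproduce the numbers above:

import numpy as np

vocab = ["cat", "dog", "mat"]
E = np.array([[0.8, 0.1],    # cat
              [0.9, 0.2],    # dog
              [0.1, 0.7]])   # mat
context = np.array([0.75, 0.15])

logits = E @ context                            # dot products: [0.615, 0.705, 0.18]
probs = np.exp(logits) / np.exp(logits).sum()   # softmax
print(dict(zip(vocab, probs.round(3))))         # {'cat': 0.365, 'dog': 0.399, 'mat': 0.236}
print(vocab[int(np.argmax(probs))])             # dog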


Part 13 — Appendix: How the Embedding Map Is Learned

Training Objective

Predict next token → compare with truth → adjust parameters to improve.

Loss Function

Cross-entropy:

Loss = -log(P(correct_token))

Lower is better.
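
With the probabilities from Part 12, if the true next token were “dog”, the loss would be the negative log of the probability the model gave “dog”:

import numpy as np

probs = {"cat": 0.365, "dog": 0.399, "mat": 0.236}   # from Part 12
loss = -np.log(probs["dog"])                         # cross-entropy for the correct token
print(round(float(loss), 3))                         # ~0.919; a perfect prediction (P = 1.0) would give 0.0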

Backpropagation

Backpropagation computes gradients: a small, directed nudge for every parameter. If the model should have predicted “cat” but gave it too little probability, those nudges move “cat”’s embedding (and the other parameters involved) so that it lines up better with contexts like this one.

Attention Weights

Learned so queries align with keys that improve prediction.

Tied Embeddings

Often, the input embedding matrix is reused as the output projection matrix (weight tying), so the same vectors serve both for reading tokens in and for scoring candidates on the way out.

Gradient Descent

Like rolling a marble down a hill — the “height” is error, the position is all parameters.
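
A toy version of one training loop on the Part 12 mini-model: compute the loss, backpropagate it to the embedding rows (the gradient of cross-entropy through softmax is the probability vector minus a one-hot vector), and take a small downhill step. This is a sketch of the principle, not how production training code is written:

import numpy as np

E = np.array([[0.8, 0.1],    # "cat"
              [0.9, 0.2],    # "dog"
              [0.1, 0.7]])   # "mat"
context = np.array([0.75, 0.15])
target = 1                   # suppose the true next token is "dog"
lr = 0.5                     # learning rate: how big a step to take downhill

for step in range(3):
    logits = E @ context
    probs = np.exp(logits) / np.exp(logits).sum()
    loss = -np.log(probs[target])                         # cross-entropy from above
    grad_E = np.outer(probs - np.eye(3)[target], context) # gradient w.r.t. each embedding row
    E -= lr * grad_E                                      # one gradient-descent step
    print(step, round(float(loss), 3))                    # loss shrinks: 0.919, 0.766, 0.646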

Statistical Nature

Positions in space come entirely from training statistics — no human labeling of “meaning”.

Emergent Geometry

Clustering, analogies, and semantic structure emerge naturally from optimizing prediction accuracy.


Conclusion

Every time an LLM answers a question, it:

  1. Turns words into points in a learned multidimensional space.
  2. Reshapes that space with attention and neural layers.
  3. Scores every possible next token with dot products.
  4. Converts scores into probabilities.
  5. Samples the next token.
  6. Repeats.

The magic isn’t in one step — it’s in the fact that all of this is learned automatically from data.
What you’re seeing in an LLM’s output is the emergent behavior of an enormous statistical machine, built to navigate a landscape of meaning it shaped for itself.



