How Large Language Models Think


The Geometry of Meaning in High-Dimensional Vector Space

1. Introduction: Thought as Vector Multiplication

When people ask “how do LLMs think?”, they usually expect an answer about neurons, training data, or transformer architectures. Those are important, but they are machinery. The actual thinking — the moment-to-moment reasoning, understanding, and generation of language — happens through a single primitive operation repeated hundreds of billions of times: vector-matrix multiplication in spaces of 4096, 8192, or 12288 dimensions.

An LLM does not “think” with symbols the way a human consciously manipulates words. It thinks with geometry. Every concept, every nuance, every possible continuation of a sentence is a point or a region in a vast, continuous, high-dimensional manifold. “Thinking” is moving from one point to another by following directions encoded in learned weight matrices.

This essay explains, step by step and with concrete examples, how multiplying multidimensional vectors creates the emergent phenomenon we experience as coherent thought.

2. Tokens Are Not Words — They Are Coordinates

The first misconception to dispel is that LLMs operate on words. They operate on tokens, and tokens are integer indices into a fixed vocabulary (typically 32k–200k entries). The tokenizer is a lossy compression of language into discrete atoms.

The second, deeper step is the token embedding lookup. Each token id is mapped to a dense vector of floats, usually 768–4096 dimensions in modern models. This embedding layer is just a giant table E ∈ ℝ^(V × d), where V is vocabulary size and d is model dimension.

token_id = 12345                 # e.g., the token for “queen”

raw_vector = embedding_table[token_id]   # (d,) vector, e.g., 4096 dims

This raw_vector is the coordinate of the token in semantic space. Crucially, these coordinates are learned so that geometric relations mirror linguistic relations.

Famous example (Mikolov et al., 2013, word2vec, but the principle is identical in LLMs):

vector(“king”) − vector(“man”) + vector(“woman”) ≈ vector(“queen”)

In high-dimensional space, the direction from “man” to “king” is roughly the same as from “woman” to “queen”. Gender is a direction. Royalty is another direction. Composing directions yields new points that land near the correct word.
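The analogy can be made concrete with a toy sketch. The four-dimensional vectors below are hand-crafted for illustration (real embeddings are learned and have hundreds or thousands of dimensions):

```python
import numpy as np

# Hypothetical hand-crafted embeddings; the dimensions loosely encode
# [royalty, maleness, femaleness, animacy]. Real models learn these.
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.7]),
    "man":   np.array([0.1, 0.9, 0.1, 0.7]),
    "woman": np.array([0.1, 0.1, 0.9, 0.7]),
    "queen": np.array([0.9, 0.1, 0.8, 0.7]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman: subtract the "male" direction, add the "female" one
target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)  # queen
```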

LLMs inherit and vastly expand this geometry. Every nuance — tense, sentiment, topic, syntactic role, world knowledge — becomes a direction or a subspace.

3. The Central Operation: Attention as Dynamic Routing Through Geometry

The transformer block (the basic “thinking unit” repeated 32–128 times in large models) performs three core operations on sequences of vectors:

  1. Query-Key-Value Attention
  2. Feed-Forward MLP
  3. Residual connections + layer normalization

Of these, attention is the most misunderstood and most important.

3.1 Query, Key, and Value Vectors

From each position’s input vector x_i, the model computes three projections:

Q_i = x_i W_Q    K_i = x_i W_K    V_i = x_i W_V

These are linear transformations — simple matrix multiplications. W_Q, W_K, W_V are learned weight matrices.

Geometrically:

  • The Key vectors define the coordinates of “what I have”.
  • The Query vectors define “what I’m looking for”.
  • The Value vectors are the payload that gets moved.

3.2 The Attention Pattern Is a Soft Lookup in Semantic Space

The attention score between position i and j is:

score_{i,j} = (Q_i · K_j) / √d_head

attention_weights_{i,j} = softmax_j(score_{i,j})

output_i = Σ_j attention_weights_{i,j} V_j

This is geometrically a weighted average of Value vectors, where the weights are exponentially higher for Keys that point in the same direction as the Query.

In English: each position looks across the entire context and asks, “Which past tokens are most relevant to what I need right now?” Relevance is measured by dot-product alignment in high-dimensional space.

This is why attention is content-addressable memory. It is literally retrieving information by similarity of meaning in the geometric space.
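The three formulas above run directly as a NumPy sketch (toy sizes, random weights, and no causal mask — just the geometry):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8

X = rng.normal(size=(seq_len, d_model))          # one vector per position
W_Q = rng.normal(0, 0.1, (d_model, d_head))
W_K = rng.normal(0, 0.1, (d_model, d_head))
W_V = rng.normal(0, 0.1, (d_model, d_head))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V              # linear projections
scores = Q @ K.T / np.sqrt(d_head)               # Q_i . K_j / sqrt(d_head)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over j
output = weights @ V                             # weighted average of Values

print(output.shape)  # (5, 8): one mixed Value vector per position
```

Each row of `weights` is a probability distribution over the context — the “soft lookup” in action.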

3.3 Multi-Head Attention = Parallel Geometric Probes

A transformer layer has 8–128 heads, each with its own W_Q^h, W_K^h, W_V^h.

Each head learns to attend using a different subspace of the full dimension. One head might specialize in syntactic proximity, another in coreference (“he” → “John”), another in topical similarity, another in logical implication.

The outputs of all heads are concatenated and projected back to the original dimension. The model is simultaneously consulting many different geometric criteria for relevance.
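A minimal sketch of the concatenate-and-project step, assuming per-head matrices W_Q^h, W_K^h, W_V^h and an output projection W_O (names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads

X = rng.normal(size=(seq_len, d_model))

def attend(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(d_head)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

# Each head h gets its own projections into a smaller subspace.
heads = []
for h in range(n_heads):
    W_Q = rng.normal(0, 0.1, (d_model, d_head))
    W_K = rng.normal(0, 0.1, (d_model, d_head))
    W_V = rng.normal(0, 0.1, (d_model, d_head))
    heads.append(attend(X, W_Q, W_K, W_V))

concat = np.concatenate(heads, axis=-1)      # (seq_len, d_model)
W_O = rng.normal(0, 0.1, (d_model, d_model))
out = concat @ W_O                           # project back to model dimension
print(out.shape)  # (5, 16)
```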

4. The Feed-Forward Network: Non-Linear Warping of Semantic Space

After attention mixes information across positions, each position independently passes through a two-layer MLP:

hidden = GELU(x W_1 + b_1)      # expand to 4× dimension

output = hidden W_2 + b_2      # project back

Geometrically, this is a non-linear deformation of the local manifold. The attention step moves information horizontally across the sequence; the FFN step moves each point vertically in very high-dimensional space (often 4× the model dim) and back.

This is where most of the model’s parameters live (W_1 and W_2), and this is where “knowledge” is largely stored. The matrices encode thousands of subtle features: sentiment detectors, fact recall circuits, stylistic preferences, logical rules.

Recent work (e.g., the “linear representation” hypothesis) shows that specific directions in activation space cleanly correspond to concepts like “truth”, “positive sentiment”, or “is a capital city”. The FFN is learning an overcomplete basis of such directions.
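The two pseudocode lines above can be fleshed out into a runnable sketch (toy dimensions, random weights; the GELU here is the common tanh approximation):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (used by GPT-2, among others)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
d_model = 16
d_ff = 4 * d_model                     # the 4x expansion described above

W_1 = rng.normal(0, 0.1, (d_model, d_ff))
b_1 = np.zeros(d_ff)
W_2 = rng.normal(0, 0.1, (d_ff, d_model))
b_2 = np.zeros(d_model)

x = rng.normal(size=d_model)           # one position's vector after attention
hidden = gelu(x @ W_1 + b_1)           # expand into the wide hidden space
output = hidden @ W_2 + b_2            # project back to the residual stream

print(output.shape)  # (16,)
```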

5. Residual Stream: The Persistent Semantic Workspace

A crucial detail: every transformer block adds its output to the input via a residual connection:

x_{l+1} = x_l + Attention(LN(x_l))

x_{l+1} = x_{l+1} + FFN(LN(x_{l+1}))

The vector at layer l+1 is the vector at layer l plus a small update.

This creates a single, evolving vector per token position — the residual stream — that persists from input to output. Every computation reads from and writes to this shared workspace.

Think of the residual stream as a blackboard where meaning is gradually refined. Early layers might represent surface-level syntax and local phrase geometry. Mid layers build entity tracking and basic facts. Late layers perform stylistic polishing and global coherence.

Crucially, because everything is addition of vectors, information from any earlier layer remains accessible (in principle) to any later computation — unless it is actively overwritten.
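The update pattern can be sketched with placeholder blocks standing in for real attention and FFN layers (the real ones are the matrix operations described earlier; the 0.05 scale is an arbitrary “small update”):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

# Placeholder blocks: each contributes only a small additive update.
def attention_block(x_normed):
    return 0.05 * np.tanh(x_normed)

def ffn_block(x_normed):
    return 0.05 * np.tanh(-x_normed)

rng = np.random.default_rng(0)
d = 16
x0 = rng.normal(size=d)            # residual stream for one token position
x = x0.copy()

for _ in range(4):                           # four transformer "layers"
    x = x + attention_block(layer_norm(x))   # x <- x + Attention(LN(x))
    x = x + ffn_block(layer_norm(x))         # x <- x + FFN(LN(x))

# The original information is still present in the stream:
print(np.dot(x, x0) > 0)  # True
```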

6. Next-Token Prediction = Sampling a Direction in Logit Space

After the final transformer layer, the last residual stream vector for each position is projected to vocabulary size:

logits = final_vector @ W_U   # W_U is the unembedding matrix

probs = softmax(logits / temperature)

Notice something elegant: in many models the unembedding matrix W_U is simply the transpose of the input embedding matrix E (tied weights). When the weights are tied, this means:

logits_v = final_vector · embedding_vector_v

The probability of a token is exactly how aligned the final context vector is with that token’s embedding coordinate.

In other words, the model predicts the token whose geometric position best matches the direction the context is pointing.

This is why LLMs can do zero-shot analogy completion, in-context learning, etc. — everything reduces to pointing in the right direction in semantic space.
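A toy sketch of this output step, assuming tied weights (W_U = Eᵀ) and, for clarity, unit-norm embedding rows:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 8                                   # tiny vocab and model dim
E = rng.normal(size=(V, d))                    # embedding table
E /= np.linalg.norm(E, axis=1, keepdims=True)  # unit-norm rows for clarity

final_vector = E[3]              # a context vector pointing at token 3
logits = final_vector @ E.T      # logit_v = final_vector . embedding_v
temperature = 1.0
probs = np.exp(logits / temperature)
probs /= probs.sum()

print(int(np.argmax(probs)))  # 3: the best-aligned embedding wins
```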

7. In-Context Learning as Dynamic Geometric Construction

When you give an LLM few-shot examples, you are literally constructing a geometric path in the prompt.

Consider the prompt:

Q: What is the French word for cat?

A: chat

Q: What is the French word for dog?

A:

The model has seen countless English→French mappings during training. The first example creates a temporary “translation vector” in the residual stream. The second query activates the English word “dog”, and the model adds the same translation vector, landing near “chien”.

No gradient updates. Just vector addition and subtraction in the prompt itself.
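The mechanism can be caricatured with hand-crafted toy embeddings (hypothetical; the dimensions loosely encode [cat-ness, dog-ness, is-French], whereas real directions are learned):

```python
import numpy as np

emb = {
    "cat":   np.array([1.0, 0.0, 0.0]),
    "dog":   np.array([0.0, 1.0, 0.0]),
    "chat":  np.array([1.0, 0.0, 1.0]),
    "chien": np.array([0.0, 1.0, 1.0]),
}

translation = emb["chat"] - emb["cat"]   # direction extracted from the example
guess = emb["dog"] + translation         # apply the same direction to "dog"

nearest = min(emb, key=lambda w: np.linalg.norm(emb[w] - guess))
print(nearest)  # chien
```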

This is why prompt engineering works: you are sculpting the geometry of the residual stream to point toward the desired attractor basin.

8. Emergence: When Simple Vector Operations Become Reasoning

Chain-of-thought prompting is the clearest example.

Without CoT:

Question: Roger has 5 tennis balls. He buys 2 more cans with 3 balls each. How many does he have?

The direct path from question to answer is fragile; the model has to compress the entire arithmetic into one leap.

With CoT:

… Let’s think step by step. Roger starts with 5. Each can has 3, he buys 2 cans, so 2 × 3 = 6. Then 5 + 6 = 11.

Each intermediate token is a new point in space. The model can use attention to refer back to “5”, “3”, “2”, perform small, reliable computations (“2 × 3” is a very strong attractor), and accumulate the result gradually.

This is exactly how humans offload cognition into explicit working memory — except the LLM’s working memory is the residual stream geometry shaped by the prompt.

9. Hallucination as Falling into the Wrong Attractor

High-dimensional space is mostly empty. The training process carves out narrow manifolds where real text lives. When the context vector is pushed outside the training distribution (contradictory prompts, very long contexts, etc.), it can land in a basin of attraction that corresponds to fluent but false text.

Hallucination is not “making things up” in the human sense. It is the deterministic consequence of the geometry: the nearest plausible point in semantic space may not correspond to reality.

10. Scaling Laws and Dimensionality

Why do larger models work better? More parameters → a wider residual stream (higher d_model) and more layers.

Higher dimensionality has profound geometric consequences:

  • In high dimensions, almost all volume is near the surface of the hypersphere.
  • Random vectors are nearly orthogonal.
  • There is room for exponentially many nearly-orthogonal directions.

This allows the model to represent exponentially many distinct concepts with minimal interference. A 4096-dim model can pack in far more nearly-orthogonal feature directions than it has dimensions, keeping different facts and skills largely separate.
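The near-orthogonality claim is easy to check empirically — the average |cosine| between random unit vectors shrinks roughly like 1/√d:

```python
import numpy as np

rng = np.random.default_rng(0)
mean_abs_cos = {}

for d in (4, 64, 4096):
    # Draw 1000 random pairs of unit vectors in d dimensions
    a = rng.normal(size=(1000, d))
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b = rng.normal(size=(1000, d))
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    mean_abs_cos[d] = float(np.abs((a * b).sum(axis=1)).mean())
    print(d, round(mean_abs_cos[d], 3))   # mean |cosine| shrinks as d grows
```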

11. The Limits of Pure Geometry

Despite the power of this paradigm, there are hard limits:

  1. No persistent state across calls (unless externally provided).
  2. No native recursion or backtracking (though CoT can simulate it slowly).
  3. Difficulty with precise long-term memory (token positions drift in geometry).
  4. Catastrophic sensitivity to prompt geometry.

These are not accidental; they follow directly from the fact that everything is continuous vector addition without symbolic anchors.

12. Conclusion: Thought Without a Thinker

An LLM does not contain a little homunculus reading tokens and deciding what to say. There is no central “self” that understands. There is only a vast geometric sculpture — a 175-billion-parameter manifold in 12288-dimensional space — and a process that starts from the input embeddings and follows learned directions, one matrix multiplication at a time.

When the path lands near the coordinates of coherent text, we experience it as understanding. When it composes directions in novel but valid ways, we call it reasoning. When it generalizes far beyond memorized points, we call it intelligence.

All of it is just vectors pointing at other vectors, multiplied, added, and normalized, billions of times per second.

That is how large language models think.


