How an LLM transforms the world into semantic geometry

Think of the whole process as three layers:

  1. symbol → number
  2. number → vector
  3. vector → matrix-math flow through the network

An LLM never directly “understands” a word the way a human does. It converts text into structured numbers that can be manipulated algebraically. The key object is the embedding vector, which is a point in a high-dimensional space.

1. A token starts as a discrete symbol

Before there is any vector, the model first breaks text into tokens.

A token is not always a whole word. It might be:

  • a word like cat
  • part of a word like ing
  • punctuation like ,
  • a space-plus-word fragment like " the" (with its leading space)

So a sentence like:

The cat sat.

might become something like:

["The", " cat", " sat", "."]

Each token is then mapped to an integer ID from the model’s vocabulary.

Example:

  • "The" → 1043
  • " cat" → 5621
  • " sat" → 8932
  • "." → 13

At this stage the token is still not a vector. It is just a label, like an index number in a giant lookup table.
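To make this concrete, here is a minimal sketch using the open-source tiktoken tokenizer (assuming it is installed via pip install tiktoken). The IDs shown above are purely illustrative; every tokenizer assigns its own:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the tokenizer used by several OpenAI models

ids = enc.encode("The cat sat.")
print(ids)                             # a short list of integer IDs, one per token
print([enc.decode([i]) for i in ids])  # the token strings, e.g. ['The', ' cat', ' sat', '.']
```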

2. The token ID is converted into a vector through the embedding table

This is the first real step into geometry.

The model contains an embedding matrix. You can think of it as:

  • rows = vocabulary items
  • columns = latent features

If the vocabulary size is V and embedding dimension is d, then the embedding matrix has shape:

V × d

For example:

  • V = 50,000
  • d = 4096

That means each token in the vocabulary has its own learned row of 4096 numbers.

So if token ID 5621 corresponds to " cat", the model retrieves row 5621 from the embedding matrix.

That row might look abstractly like:

[0.18, -0.73, 1.02, 0.04, ..., -0.51]

with thousands of dimensions.

That row is the token embedding.

So the token becomes a multidimensional vector by table lookup into a learned matrix.
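In code, the lookup is a single row-indexing operation. A toy NumPy sketch, with random numbers standing in for learned weights:

```python
import numpy as np

V, d = 50_000, 4096
rng = np.random.default_rng(0)
E = rng.standard_normal((V, d)).astype(np.float32)  # embedding matrix, shape V × d

token_id = 5621        # e.g. " cat" in our running example
x = E[token_id]        # row lookup: the token's embedding vector
print(x.shape)         # (4096,)
```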

3. What that vector “looks like”

It does not look like a little arrow with two coordinates, except in simplified diagrams.

In reality it is something more like:

x ∈ R^4096

meaning a point in 4096-dimensional real-valued space.

Each coordinate is just a floating-point number. For example, a toy 8-dimensional embedding might be:

x = [0.12, -0.44, 1.87, 0.03, -0.91, 0.55, 0.10, -1.22]

A real model uses hundreds or thousands of dimensions, or more.

What do the dimensions mean?

Usually, no single dimension has a clean human meaning like:

  • dimension 1 = animalness
  • dimension 2 = past tense
  • dimension 3 = danger

Instead, meaning is distributed across many dimensions at once.

A token like "cat" is not represented by one special “cat neuron.” It is represented by a pattern across the whole vector.

That pattern places it near related patterns such as:

  • "dog"
  • "kitten"
  • "pet"
  • "feline"

and farther from unrelated ones such as:

  • "galaxy"
  • "taxation"

So the vector “looks like” a long numerical fingerprint whose significance comes from:

  • its direction in space
  • its distance from other vectors
  • its behavior under learned matrix transforms

4. A deeper way to picture the vector

The embedding vector is best understood as a compressed coordinate address inside the model’s semantic space.

It is not a picture of the token.
It is not a dictionary definition.
It is not a symbolic fact record.

It is more like:

a location in a learned statistical geometry where similar usage patterns occupy nearby or structurally related regions.

If "king" and "queen" appear in similar linguistic environments, their vectors may share many structural relationships.
If "run" is used as both noun and verb, its vector may sit in a region that lets later layers disambiguate from context.

So the embedding is not final meaning. It is initial latent positioning.

5. How the lookup can be written in matrix math

There are two common ways to describe it.

Method A: row lookup

If E is the embedding matrix of shape V × d, and token ID is i, then:

x = E[i]

That means: take the ith row of the matrix.

Method B: one-hot multiplication

You can also imagine the token as a one-hot vector t of length V:

  • all zeros
  • one 1 at the token’s index

For token 5621 in a 50,000-word vocabulary:

t = [0, 0, 0, ..., 1, ..., 0]

Then:

x = tE

This matrix multiplication simply selects the corresponding row.

So the embedding process is mathematically simple:
the token ID indexes a learned row in the embedding matrix.
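A quick NumPy check that the two methods agree, at toy sizes so the result is easy to inspect:

```python
import numpy as np

V, d = 8, 4
rng = np.random.default_rng(0)
E = rng.standard_normal((V, d))

i = 5
t = np.zeros(V)
t[i] = 1.0                        # one-hot vector: all zeros, a single 1 at index i

print(np.allclose(E[i], t @ E))   # True: both select row i of E
```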

6. The token vector is then combined with positional information

A token alone is not enough. The model also needs to know where that token sits in the sequence.

For example:

  • dog bites man
  • man bites dog

contain the same tokens but mean different things because of order.

So the model adds or injects positional encoding.

If:

  • token embedding = x
  • positional embedding = p

then the initial representation may be:

h = x + p

Now the vector contains both:

  • what token this is
  • where it is in the sequence

That gives the model a context-ready input state.
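Here is h = x + p as a toy sketch with a learned positional table (random stand-ins for learned values). Note how two occurrences of the same token end up with different input vectors:

```python
import numpy as np

n, d, V = 4, 8, 100
rng = np.random.default_rng(1)
E = rng.standard_normal((V, d))   # token embedding table
P = rng.standard_normal((n, d))   # positional embedding table, one row per position

token_ids = np.array([7, 42, 42, 3])  # the same token appears at positions 1 and 2
H = E[token_ids] + P                  # h = x + p at every position, shape n × d

print(np.allclose(H[1], H[2]))        # False: same token, different positions
```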

7. What happens next: matrix math begins

Once the token is a vector, the model starts applying matrix multiplications to transform it.

This is where LLMs live: not in symbolic rules, but in repeated vector-matrix operations.

Each layer takes in a batch of token vectors and transforms them.

If a sequence has n tokens and hidden size d, then the full input is often treated as a matrix:

X ∈ R^(n × d)

Each row is one token vector.

For example, if there are 100 tokens and hidden size 4096:

X has shape 100 × 4096

Now the network applies learned matrices to X.

8. Why matrix math is used

Matrix math is ideal because it lets the model:

  • transform many dimensions at once
  • process many tokens in parallel
  • detect patterns through projections
  • mix, rotate, scale, and recombine features
  • run efficiently on GPUs/TPUs

A vector by itself is just a point.
A matrix tells you how to transform the entire space.

So when a token vector is multiplied by a matrix, the model is not just moving one token around randomly. It is applying a learned geometric transformation to the token’s latent representation.

9. The first major transformations: Q, K, and V

Inside attention, the input matrix X is multiplied by three learned weight matrices:

  • W_Q
  • W_K
  • W_V

So:

Q = XW_Q
K = XW_K
V = XW_V

If X has shape n × d, and each projection matrix has shape d × d_h, then:

  • Q is n × d_h
  • K is n × d_h
  • V is n × d_h

Each token vector is thus projected into three different latent roles:

  • Query = what am I looking for?
  • Key = what information do I offer?
  • Value = what content should be passed along if selected?

This is a beautiful example of matrix math turning one embedding into several function-specific vector representations.
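In NumPy, the three projections are just three matrix multiplications (the weights here are random stand-ins for learned parameters):

```python
import numpy as np

n, d, d_h = 5, 16, 8
rng = np.random.default_rng(2)
X = rng.standard_normal((n, d))       # one row per token

W_Q = rng.standard_normal((d, d_h))
W_K = rng.standard_normal((d, d_h))
W_V = rng.standard_normal((d, d_h))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # each has shape n × d_h
print(Q.shape, K.shape, V.shape)      # (5, 8) (5, 8) (5, 8)
```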

10. How token-to-token interaction happens

Once you have Q and K, the model computes similarity scores:

S = QK^T

This creates an n × n matrix.

Each entry tells how much one token should attend to another.

For example:

  • row i
  • column j

gives the attention score from token i to token j.

This is fundamentally a set of dot products.

That matters because the dot product measures alignment between vectors.
If query and key vectors point in similar directions, their dot product is larger.

Then the model scales and normalizes:

A = softmax(QK^T / sqrt(d_h))

Now each row of A becomes a probability distribution over other tokens.

So attention is matrix math that asks:

for each token, which other token vectors are most relevant right now?
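Here is that computation as a sketch, with softmax written out so the scaling and row-normalization are visible:

```python
import numpy as np

def softmax(s, axis=-1):
    s = s - s.max(axis=axis, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

n, d_h = 5, 8
rng = np.random.default_rng(3)
Q = rng.standard_normal((n, d_h))
K = rng.standard_normal((n, d_h))

S = Q @ K.T                    # n × n matrix of dot products
A = softmax(S / np.sqrt(d_h))  # scale, then turn each row into a probability distribution
print(A.sum(axis=1))           # every row sums to 1
```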

11. How values are mixed

Once attention weights A are computed, they are applied to the values:

Z = AV

This produces a new matrix Z of transformed token representations.

What happened here?

Each token’s output is now a weighted mixture of other tokens’ value vectors.

So if the word "it" appears later in a sentence, attention can let that token pull information from "dog" or "the theory" depending on context.

This is the essence of contextualization.

The original embedding was static:

  • "bank" always started from the same embedding row

But after attention:

  • "bank" in river bank
  • "bank" in bank loan

can become very different contextual vectors.
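A tiny sketch of Z = AV with hand-set attention weights, so the mixing is easy to read off (real weights come out of the softmax shown earlier):

```python
import numpy as np

A = np.array([[1.0, 0.0, 0.0],     # token 0 attends only to itself
              [0.5, 0.5, 0.0],     # token 1 mixes tokens 0 and 1 equally
              [0.2, 0.3, 0.5]])    # token 2 draws on all three
V = np.arange(12.0).reshape(3, 4)  # three value vectors of dimension 4

Z = A @ V
print(Z[1])  # [2. 3. 4. 5.]: the midpoint of value rows 0 and 1
```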

12. What matrix multiplication is doing geometrically

When you multiply a vector by a matrix, several things can happen at once:

  • rotation in latent space
  • scaling of some directions
  • compression into fewer dimensions
  • expansion into more dimensions
  • projection onto features the model cares about
  • mixing of dimensions

So suppose token embedding x is a 4096-dimensional vector.
Multiplying by a weight matrix W creates a new vector:

y = xW

Each output dimension in y is a weighted sum of all dimensions in x.

That means no output coordinate depends on just one input coordinate.
It depends on the whole pattern.

This is how the model learns distributed features.

13. Example with small toy numbers

Suppose a token has a 3-dimensional embedding:

x = [2, -1, 3]

and a learned matrix is:

W = [[1, 0], [2, -1], [0, 3]]

Then:

y = xW

Compute each output coordinate:

First output:
2*1 + (-1)*2 + 3*0 = 2 - 2 + 0 = 0

Second output:
2*0 + (-1)*(-1) + 3*3 = 0 + 1 + 9 = 10

So:

y = [0, 10]

That new vector is a transformed version of the original token representation.

Real models do this with thousands of dimensions and billions of learned weights.
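The same computation in NumPy, confirming the hand arithmetic:

```python
import numpy as np

x = np.array([2, -1, 3])
W = np.array([[1,  0],
              [2, -1],
              [0,  3]])

print(x @ W)   # [ 0 10]
```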

14. The feedforward network also uses matrix math

After attention, each token usually passes through an MLP or feedforward block:

FFN(x) = W_2 σ(W_1 x + b_1) + b_2

In practice the dimensions often expand and contract:

  • from d
  • to 4d or more
  • then back to d

This lets the model:

  • detect more complex feature combinations
  • apply nonlinearity
  • refine token representations

The nonlinearity σ might be GELU, SwiGLU, or something similar.

Without nonlinearity, stacks of matrices would collapse into one giant linear transform.
The nonlinear step lets the model build complex hierarchical representations.
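A minimal sketch of the expand-contract feedforward block, using the tanh approximation of GELU and the row-vector convention (y = xW) used elsewhere in this article:

```python
import numpy as np

def gelu(z):
    # tanh approximation of GELU, one common choice of nonlinearity
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

d = 16
rng = np.random.default_rng(4)
W1, b1 = rng.standard_normal((d, 4 * d)), np.zeros(4 * d)  # expand: d → 4d
W2, b2 = rng.standard_normal((4 * d, d)), np.zeros(d)      # contract: 4d → d

def ffn(x):
    return gelu(x @ W1 + b1) @ W2 + b2

x = rng.standard_normal(d)
print(ffn(x).shape)  # (16,): same dimension in and out
```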

15. Residual connections preserve the token stream

In a transformer block, the transformed output is usually added back to the original input:

x_new = x + Attention(x)
x_next = x_new + FFN(x_new)

This means the token vector evolves layer by layer rather than being overwritten blindly.
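Schematically, with placeholder linear maps standing in for the real attention and feedforward sublayers (the point is the x + f(x) update pattern, not the sublayer internals):

```python
import numpy as np

d = 8
rng = np.random.default_rng(5)
W_a = rng.standard_normal((d, d)) * 0.1   # placeholder "attention" weights
W_f = rng.standard_normal((d, d)) * 0.1   # placeholder "feedforward" weights

attention = lambda x: x @ W_a
ffn = lambda x: x @ W_f

x = rng.standard_normal(d)
x_new = x + attention(x)       # sublayer output added back onto its input
x_next = x_new + ffn(x_new)    # and again after the feedforward block
```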

So a token embedding becomes:

  • initial lexical embedding
  • position-adjusted vector
  • attention-updated contextual vector
  • feedforward-refined vector
  • deeper abstract representation across many layers

By the top layers, the token is no longer just “the token as entered.”
It has become a context-rich state vector representing what that token means in this sentence, at this moment, in relation to all prior tokens.

16. What the vector looks like after many layers

At the input layer, the vector corresponds mostly to token identity plus position.

At deeper layers, the vector becomes increasingly abstract.

It may encode things like:

  • part of speech
  • referent identity
  • tense
  • sentiment
  • discourse role
  • syntactic relations
  • semantic disambiguation
  • long-range dependencies

But again, not as neat symbolic fields.
It remains a distributed numerical state.

So the vector at layer 20 may “look like” just another long list of numbers, but functionally it is much richer than the original embedding.

17. Why people say the vector captures meaning in direction

This comes from geometric relationships.

Two vectors can be compared using:

  • dot product
  • cosine similarity
  • Euclidean distance

If two token or sentence vectors point in similar directions, they often represent related meanings.

For cosine similarity:

cos(θ) = (x · y) / (||x|| ||y||)

If cosine similarity is high, the vectors are directionally aligned.

That is why semantic similarity becomes a geometric phenomenon.

So “meaning” is not stored as an English definition inside the vector.
It is stored as position and direction relative to many other vectors in learned latent space.
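Cosine similarity in code, with hand-picked 3-dimensional vectors so the geometry is visible (real embeddings are learned, not set by hand):

```python
import numpy as np

def cosine_similarity(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

cat    = np.array([0.9, 0.8, 0.1])
kitten = np.array([0.8, 0.9, 0.2])
galaxy = np.array([-0.7, 0.1, 0.9])

print(cosine_similarity(cat, kitten))  # close to 1: directionally aligned
print(cosine_similarity(cat, galaxy))  # much lower: different directions
```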

18. Important distinction: token embedding vs contextual embedding

This matters a lot.

Token embedding

The row retrieved from the embedding matrix.
Static.
Same starting vector every time that token appears.

Contextual embedding

The transformed vector after attention and layer processing.
Dynamic.
Depends on surrounding tokens.

So the token "bat" starts with one base embedding, but after context:

  • baseball bat
  • bat flew at night

produce different contextual vectors.

That dynamic contextualization is one of the main powers of transformers.
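You can observe this directly with the Hugging Face transformers library (assuming it and PyTorch are installed; bert-base-uncased is just one convenient model choice):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
bat_id = tok.convert_tokens_to_ids("bat")

def bat_vector(sentence):
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]         # (seq_len, hidden_size)
    pos = (inputs["input_ids"][0] == bat_id).nonzero()[0, 0]  # position of "bat"
    return hidden[pos]

v1 = bat_vector("He swung the baseball bat.")
v2 = bat_vector("A bat flew out of the cave at night.")
print(torch.cosine_similarity(v1, v2, dim=0))  # well below 1: same token, different contexts
```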

19. What matrix math is really doing in an LLM

At a deeper level, matrix math is how the model performs learned coordinate transformations.

Every weight matrix says:

Here is how to reinterpret the current vector state into a new feature space.

Attention matrices say:

Here is how tokens should influence one another.

Feedforward matrices say:

Here is how to recombine existing features into higher-level abstractions.

Output matrices say:

Here is how to project the final hidden state back into vocabulary probabilities.

So the model is basically a tower of alternating:

  • vector projections
  • dot-product comparisons
  • weighted mixtures
  • nonlinear feature recombinations

All of it is matrix math acting on multidimensional token states.

20. How the final token vector becomes next-token probabilities

At the end of the network, the final hidden state for the current position is multiplied by an output matrix, often related to the embedding matrix:

logits = hW_out

If h has shape 1 × d and W_out has shape d × V, then logits has shape:

1 × V

That gives one score for every vocabulary token.

Then softmax turns those scores into probabilities.
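The final projection and softmax in miniature (random stand-ins for learned weights):

```python
import numpy as np

d, V = 16, 50_000
rng = np.random.default_rng(6)
h = rng.standard_normal((1, d))        # final hidden state at the current position
W_out = rng.standard_normal((d, V))    # output projection, often tied to the embedding matrix

logits = h @ W_out                     # shape 1 × V: one score per vocabulary token
probs = np.exp(logits - logits.max())  # softmax numerator, max subtracted for stability
probs /= probs.sum()
print(probs.shape, probs.sum())        # (1, 50000) 1.0
```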

So the flow is:

  • token ID
  • embedding vector
  • many matrix transformations
  • final hidden vector
  • output projection
  • probability distribution over next tokens

In other words, the model starts with a token as a symbol and ends with a token probability field through pure learned geometry and algebra.

21. A compact intuition

Here is the plain-English version:

A token becomes a multidimensional vector because the model looks it up in a giant learned embedding table, where each token has a row of thousands of numbers. That row is not a human-readable meaning but a coordinate location in a high-dimensional semantic space. Once the token is represented as a vector, the transformer repeatedly multiplies it by learned matrices, compares it to other token vectors with dot products, mixes information through attention, and refines it through nonlinear layers. Matrix math is the machinery that turns static token coordinates into dynamic context-sensitive meaning.

22. The simplest mental model

You can picture it like this:

  • token = a discrete name tag
  • embedding = that token’s initial address in semantic space
  • matrix multiplication = learned geometric transformation
  • attention = token-to-token relevance calculation
  • deeper layers = progressive refinement of meaning
  • output projection = convert final latent state back into token probabilities

23. One final important correction to common intuition

People often imagine the embedding vector as a direct encoding of the external world.

That is only partly true.

The embedding is not a raw picture of reality.
It is a learned coordinate system shaped by:

  • training data
  • token co-occurrence
  • prediction pressure
  • model architecture
  • gradient descent

So the vector is best seen as a statistically learned latent representation of how the token behaves in language and, indirectly, of the world-model patterns language contains.

