Good afternoon, everyone.
Today I want to take you into one of the strangest and most important ideas in modern artificial intelligence: the idea that a large language model does not deal with language the way we do. It does not begin with meanings in the human sense. It begins with symbols, converts those symbols into numbers, and then operates on those numbers inside a learned geometric space.
That is the central idea of today’s lecture.
The title is “How an LLM Transforms the World into Semantic Geometry.”
And I want to begin with a provocative thought.
When you type the word cat into an LLM, the model does not see whiskers, fur, paws, or the memory of a pet curled up on a sofa. It does not see “catness” the way a human being might. What it sees, first, is a token. Then it sees a vector. Then that vector enters a high-dimensional space where meaning is represented not as a definition, but as a position and a relationship to other positions.
So in a very real sense, the model is not reading language the way a person reads language. It is turning language into geometry.
And once you really understand that, a great many things about AI start to make more sense.
Why it can generalize. Why it can paraphrase. Why it can answer questions it has never seen phrased in exactly that way before. And also why it can make mistakes that feel, at times, deeply unhuman.
So here is the big question for today:
Where is meaning inside an LLM?
Is meaning in the token itself? No.
Is meaning in some hidden dictionary inside the system? Not exactly.
Is meaning stored as a tidy symbolic concept, like a labeled box somewhere deep in the machine? Usually no.
The answer is more subtle. Meaning emerges through patterns of position, similarity, transformation, and contextual interaction inside a learned mathematical space.
That is what we mean by semantic geometry.
Now, before we go further, let me tell you why this matters.
A lot of people still picture an AI model in one of two misleading ways.
One picture is that it is basically a giant encyclopedia. You ask a question, and it finds the right sentence on some invisible shelf. That is not really how it works.
The second picture is that it has literal human-like concepts stored in neat compartments. Here is the “dog” box. Here is the “love” box. Here is the “gravity” box. That picture is also misleading.
What the model actually has is something much more fluid and much more mathematical. It has learned numerical structures that allow it to place tokens into relationship with one another, transform those relationships under context, and then generate likely continuations.
So what I want to do today is walk you through that process in plain language.
We will move from token to embedding, from embedding to contextualization, from contextualization to prediction. Along the way, I’ll use some analogies, some real-world examples, and then I’ll end with common misconceptions and a short Q and A section.
Let’s start at the beginning.
A language model never receives raw meaning. It receives text, and text gets chopped into units called tokens.
A token is not always a full word. It might be a word, part of a word, punctuation, or some recurring chunk of text. The point is that the model first sees language as a sequence of discrete symbolic pieces.
At this stage, each token is basically just an ID number.
And this is very important.
A token ID is not meaning. It is closer to an index. A label. A lookup key.
Think of a university library. Every book has a call number. That call number helps the system locate the book, but the call number is not the book’s content. In the same way, a token ID tells the model which item it is dealing with, but it does not yet give the model a rich semantic representation.
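If you would like to see this concretely, here is a minimal sketch in Python. It assumes the open-source tiktoken library, which implements the byte-pair tokenizers used by several OpenAI models, but any tokenizer would make the same point: the output is a list of plain integers, indices rather than meanings.

```python
# Minimal sketch: text becomes a sequence of token IDs.
# Assumes the open-source `tiktoken` library (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("The cat sat on the mat.")
print(ids)                   # a list of plain integers: lookup keys, not meanings
print(enc.decode([ids[1]]))  # the chunk of text one of those IDs points back to
```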
So the first move into something like meaning happens when the token is turned into an embedding.
An embedding is a vector. A long list of numbers. Hundreds or thousands of values.
This vector places the token into a high-dimensional space.
Now let me pause here, because this is where many students first begin to lose the thread. So let me give you the simplest analogy I know.
Imagine a huge city. Every token is assigned an address somewhere in that city. Tokens with related meanings tend to live in related neighborhoods. “Cat,” “kitten,” “dog,” and “puppy” might live in nearby districts. “Tax law” and “supernova” live elsewhere. Not because the model consciously decided these categories as a human would, but because during training it learned patterns that place related language events near one another in a useful way.
That is the beginning of semantic geometry.
Meaning, at least initially, becomes a matter of where something is located relative to other things.
But there is another point that matters here. The dimensions of that vector are usually not human-readable in any simple way. It is not like dimension 12 means “animal” and dimension 44 means “softness” and dimension 203 means “domestic pet.” That would be too clean.
Instead, meaning is distributed across many dimensions at once.
So a word embedding is less like a labeled filing cabinet and more like a musical chord. One note alone tells you very little. The meaning comes from the pattern across all the notes taken together.
That is why embeddings are powerful. They allow the machine to represent similarity, analogy, and association in a continuous space instead of a rigid symbolic one.
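To make that geometry tangible, here is a toy sketch. The vocabulary and the two-dimensional vectors are invented for illustration; real models learn hundreds or thousands of dimensions. Only the mechanics, lookup by ID and comparison by angle, mirror what a real model does.

```python
# Toy sketch: an embedding table maps token IDs to vectors, and
# similarity is measured geometrically. Vectors are invented for illustration.
import numpy as np

vocab = {"cat": 0, "kitten": 1, "supernova": 2}
embedding_table = np.array([
    [0.90, 0.80],   # "cat"
    [0.85, 0.75],   # "kitten"
    [-0.70, 0.60],  # "supernova"
])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cat = embedding_table[vocab["cat"]]
kitten = embedding_table[vocab["kitten"]]
nova = embedding_table[vocab["supernova"]]

print(cosine(cat, kitten))  # close to 1: nearby neighborhoods
print(cosine(cat, nova))    # much lower: a distant district of the city
```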
Now, once we have token embeddings, we are still missing something very important.
Order.
If I say, “dog bites man,” that does not mean the same thing as “man bites dog.”
Same words. Different order. Very different event.
So the model also incorporates positional information. In other words, each token must know not just what it is, but where it appears in the sequence.
You can think of this like actors on a stage.
The same actor standing center stage at the beginning of the play means something different from that same actor entering quietly at the end. Position is not a minor detail. Position helps define the role being played in the unfolding scene.
So now the model has a token embedding and a position signal. That combined representation becomes the starting state for deeper processing.
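In many transformer variants, the simplest version of this is literal addition: a position vector is added to the token vector before the deeper layers see it. Here is a toy sketch of that idea, with invented numbers; modern models often use more elaborate schemes, such as rotary position embeddings, but the principle of mixing "what" with "where" is the same.

```python
# Toy sketch: combining "what the token is" with "where it appears".
# All vectors are invented for illustration.
import numpy as np

token_embeddings = {
    "dog":   np.array([1.0, 0.0, 0.5, 0.2]),
    "bites": np.array([0.1, 0.9, -0.3, 0.4]),
    "man":   np.array([0.8, 0.2, 0.6, -0.1]),
}

position_embeddings = [
    np.array([0.05, -0.02, 0.01, 0.03]),   # position 0
    np.array([-0.01, 0.04, 0.02, -0.05]),  # position 1
    np.array([0.02, 0.01, -0.04, 0.02]),   # position 2
]

def input_states(tokens):
    # same tokens in a different order produce different starting states
    return [token_embeddings[t] + position_embeddings[i]
            for i, t in enumerate(tokens)]

print(input_states(["dog", "bites", "man"])[0])
print(input_states(["man", "bites", "dog"])[0])  # a different vector sits at position 0
```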
And now we enter the real machinery.
From here on out, the model performs repeated learned transformations using matrix math.
Now I know the phrase “matrix math” can make people tense, but conceptually the idea is not so bad.
You can think of a matrix as a transformation device. A learned lens. A machine that takes one vector and remaps it into a new form.
Some features get stretched. Some get compressed. Some distinctions get amplified. Some get blurred. Some directions become more important. Others fade.
So as the model processes the sentence layer by layer, it is constantly transforming the geometry of the tokens.
This is why I sometimes tell students that an LLM is not just storing language. It is reshaping language.
It is taking token states and passing them through a series of learned geometric operations that make certain patterns easier to detect and use.
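In code, each of those lenses is little more than a matrix multiplication. Here is a toy sketch with an invented four-by-four matrix; in a real model the entries are learned during training, and the vectors have far more dimensions.

```python
# Toy sketch: a matrix as a learned lens that remaps a vector.
# The weights are invented; in a real model they are learned by training.
import numpy as np

x = np.array([1.0, 0.5, -0.3, 0.8])   # a token's current state

W = np.array([
    [1.5, 0.0, 0.0, 0.0],   # stretch the first direction
    [0.0, 0.2, 0.0, 0.0],   # nearly erase the second
    [0.0, 0.0, 1.0, 0.7],   # blend the third and fourth together
    [0.3, 0.0, 0.0, 1.0],
])

h = W @ x                              # the remapped, reshaped state
print(h)
```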
Now we come to the star of the modern transformer architecture: attention.
Attention is one of those terms that sound almost human, but we need to be careful. In a transformer, attention is not consciousness. It is not awareness. It is not subjective focus.
It is a mathematical relevance mechanism.
Here is the basic idea.
Each token gets projected into three forms: a query, a key, and a value.
You can think of the query as asking, “What am I looking for?”
The key says, “What kind of information do I offer?”
And the value says, “If you choose me, here is what I will contribute.”
Then the model compares queries and keys using dot products. A higher score means one token is more relevant to another in the present context. Those scores are passed through a softmax so they become weights, and the model uses those weights to compute weighted combinations of the value vectors.
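Here is that recipe as a compact sketch. The projection matrices are random stand-ins for learned weights, and this is a single attention head with no masking, just enough to show the dot products, the softmax normalization, and the weighted sum of value vectors.

```python
# Minimal sketch of single-head scaled dot-product attention.
# Projection matrices are random stand-ins for learned weights.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                    # 4 tokens, 8-dimensional states

X = rng.normal(size=(seq_len, d_model))    # token states entering the layer
W_q = rng.normal(size=(d_model, d_model))  # query projection
W_k = rng.normal(size=(d_model, d_model))  # key projection
W_v = rng.normal(size=(d_model, d_model))  # value projection

Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d_model)        # how relevant is each token to each other token
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1

output = weights @ V                       # weighted combination of the value vectors
print(weights.round(2))                    # the relevance map between tokens
print(output.shape)                        # (4, 8): one updated state per token
```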
Now let me translate that into a human analogy.
Imagine a seminar room full of students. Each student is holding three cards.
One card says what question they are currently asking.
One card says what expertise they bring.
One card contains what they are prepared to contribute to the discussion.
As the conversation unfolds, each student looks around the room and decides whose expertise is most relevant to their current question. Some students matter more in one moment, less in another.
That is attention.
It is dynamic relevance.
And this is where language begins to become contextual in a very deep way.
Take a word like bank.
If I say, “She sat on the bank of the river,” the surrounding words push the token toward one semantic region.
If I say, “He applied for a loan at the bank,” the surrounding words push it somewhere else.
Same base token. Different contextual destiny.
This is one of the most important ideas in the entire lecture.
A token embedding is static. A contextual embedding is dynamic.
The starting vector for the word may be the same every time. But once that vector passes through attention and multiple layers of transformation, it becomes a context-specific representation.
So “bank” does not remain just “bank.” It becomes river-bank or financial-bank depending on what other words are nearby and how they interact.
Think of it like an actor.
The same actor can appear in a courtroom drama, a romantic comedy, or a war film. The person is the same. The role changes with the surrounding cast and situation.
That is exactly what happens to tokens in a transformer.
Now, the deeper the model goes through its layers, the more refined these representations become.
At shallow levels, the system may be mostly dealing with lexical and local patterns.
At deeper levels, the vectors may begin to reflect syntax, reference, tone, semantic disambiguation, discourse role, and more abstract relationships.
But again, not as neat labeled compartments.
These things are represented in distributed form.
This is crucial.
The model does not usually create a little sticky note saying, “This noun refers to the dog.” Instead, that information is encoded across the shape of the evolving activation pattern.
Which brings us back to semantic geometry.
What is geometry doing here?
Geometry allows the machine to represent similarity, difference, relevance, and transformation in a continuous mathematical way.
Two meanings can be near one another.
A contextual shift can move a vector from one region to another.
A set of attention operations can realign tokens based on current relevance.
A final hidden state can then be projected outward into a probability distribution over the next possible tokens.
And that last part matters.
The model’s output is not a fact in the human sense. It is a probability landscape.
Given all the transformations so far, what tokens are most likely to come next?
So the system begins with discrete symbols, turns them into vectors, reshapes them through context, and ends by producing probabilities.
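That last step, turning a hidden state into a probability landscape, is again just a matrix multiplication followed by a softmax. Here is a toy sketch with an invented five-word vocabulary; a real model does the same thing over tens of thousands of tokens.

```python
# Toy sketch: final hidden state -> scores over the vocabulary -> probabilities.
# The vocabulary, hidden state, and output matrix are invented for illustration.
import numpy as np

vocab = ["the", "cat", "sat", "mat", "."]

h = np.array([0.2, -0.4, 0.9, 0.1])               # final hidden state (4 dimensions)
W_out = np.random.default_rng(1).normal(size=(4, len(vocab)))

logits = h @ W_out                                # one score per vocabulary entry
probs = np.exp(logits) / np.exp(logits).sum()     # softmax: a probability landscape

for word, p in sorted(zip(vocab, probs), key=lambda wp: -wp[1]):
    print(f"{word:>4}  {p:.2f}")
```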
That, in compressed form, is the lecture.
But let me make it more concrete with three practical examples.
First example: word sense disambiguation.
Consider these two sentences:
“She sat on the bank of the river.”
“He applied for a loan at the bank.”
This is one of the cleanest examples because it shows the difference between static lexical identity and dynamic contextual meaning.
The token “bank” starts from the same embedding row each time. But the surrounding words—“river,” “sat,” “loan,” “applied”—change the attention patterns and therefore the contextualized representation.
This is why transformers are much better than older systems that relied heavily on keyword counting. They do not just notice that the same word appears. They allow the word’s effective meaning to be rebuilt from its neighborhood.
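You can watch this happen with any open encoder. The sketch below assumes the Hugging Face transformers library and the public bert-base-uncased checkpoint; any similar model would do. It pulls out the contextual vector for "bank" in each sentence and compares them. The exact numbers vary by model, but the river sense and the financial sense typically land in different regions.

```python
# Sketch: the same token, two different contextual embeddings.
# Assumes the Hugging Face `transformers` library and the public
# bert-base-uncased checkpoint; any encoder model would illustrate the point.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]     # (seq_len, hidden_dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]                   # contextual vector for "bank"

river   = bank_vector("She sat on the bank of the river.")
finance = bank_vector("He applied for a loan at the bank.")
deposit = bank_vector("She deposited her paycheck at the bank.")

cos = torch.nn.functional.cosine_similarity
print(cos(river, finance, dim=0))    # typically lower: different senses
print(cos(finance, deposit, dim=0))  # typically higher: same financial sense
```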
Second example: pronoun resolution.
Take this sentence:
“The dog chased the boy because it was excited.”
What does “it” refer to?
Most likely the dog.
Now consider:
“The dog chased the boy because he was running.”
Now “he” probably refers to the boy.
How does the model deal with that?
Not by applying one simple hand-written grammar rule, but by using context-sensitive patterns of relevance. The pronoun representation attends backward into the sentence, weighing possible referents. Context shapes which prior token becomes most influential.
Again, think of a crowded room. Someone says “he,” and you infer the referent from the structure of the conversation.
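If you want to poke at this yourself, one rough way is to look at the attention weights flowing out of the pronoun. The sketch below again assumes the transformers library and bert-base-uncased. A caveat in the spirit of this lecture: attention weights are a noisy window, not a clean coreference detector, and different layers and heads tell different stories.

```python
# Sketch: which earlier tokens does the pronoun attend to most?
# Assumes the Hugging Face `transformers` library and bert-base-uncased.
# Attention weights are only a rough, noisy window into this behavior.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The dog chased the boy because it was excited.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# last layer, first (only) batch item, averaged over heads: (seq_len, seq_len)
attn = outputs.attentions[-1][0].mean(dim=0)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

row = attn[tokens.index("it")]                     # where "it" sends its attention
for tok, w in sorted(zip(tokens, row.tolist()), key=lambda tw: -tw[1])[:5]:
    print(f"{tok:>8}  {w:.2f}")
```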
Third example: semantic search.
Suppose a user asks, “How can I reduce home heating cost?”
A semantic search system might retrieve documents about insulation, thermostat scheduling, fuel efficiency, weather sealing, and heat pump usage, even if none of those documents contain the exact phrase “reduce home heating cost.”
Why?
Because embeddings can place related ideas near one another even when the wording differs.
That is one of the great practical uses of semantic geometry. It lets systems move beyond literal keyword matching into neighborhood matching.
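Here is a minimal sketch of that idea. It assumes the sentence-transformers library and its public all-MiniLM-L6-v2 model, and the document snippets are invented for illustration. Notice that the query shares almost no words with the best matches, yet their embeddings land in the same neighborhood.

```python
# Minimal sketch of embedding-based semantic search.
# Assumes the `sentence-transformers` library (pip install sentence-transformers);
# the document snippets are invented for illustration.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Adding attic insulation keeps warm air from escaping in winter.",
    "Program your thermostat to lower the temperature while you sleep.",
    "Heat pumps deliver more warmth per unit of electricity than resistive heaters.",
    "Our new album drops next Friday on all streaming platforms.",
]
query = "How can I reduce home heating cost?"

doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)

scores = doc_vecs @ query_vec          # cosine similarity, since the vectors are normalized
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.2f}  {doc}")
```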
Now, before I close, I want to address several misconceptions that students often carry away from these discussions.
The first misconception is that the embedding directly stores the real world.
No. The embedding does not contain reality itself. It contains a learned numerical structure shaped by patterns in training data. It reflects how language about the world has been statistically organized and compressed through training.
Second misconception: each dimension has a simple meaning.
Usually false. Meaning is distributed. It is spread across many dimensions at once. Some directions in space may correlate with interpretable properties, but most of the representational story is entangled and distributed.
Third misconception: the model stores facts like a database.
Not really. A database stores explicit records. An LLM stores weight patterns that support prediction and reconstruction. It can behave as if it knows facts, but it does not usually store them as clean symbolic entries.
Fourth misconception: attention means the model is consciously paying attention.
No. Attention is math. Powerful math, elegant math, but still math. It is a way of computing conditional relevance, not evidence of subjective experience.
Fifth misconception: because it is mathematical, it must be exact.
Also no. These systems are statistical and approximate. They are powerful, but they are not infallible. Their geometry is learned from data that may be incomplete, biased, noisy, or unevenly distributed.
Now let me end with a short Q and A section, spoken as I might actually answer it in class.
A student asks: “Are LLMs already communicating secretly in latent space?”
That is a fascinating question, and the answer depends on what you mean.
Inside a single LLM, yes, the internal processing is already happening in latent space. The model is not thinking in English under the hood. It is operating on vectors, activations, and transformed representations.
But if you mean that separate LLMs are secretly inventing hidden private languages and talking to each other behind our backs, that is a stronger claim and usually not what standard deployed systems are doing. They can be engineered to exchange embeddings or machine-readable compressed states, but that is a design choice, not some mystical secret rebellion of the models.
Another student asks: “Does the model understand meaning or just manipulate numbers?”
Mechanistically, it manipulates numbers. But those numbers are structured in a way that captures meaningful regularities. So the real debate is philosophical. Is functional semantic behavior enough to count as understanding? Engineers and philosophers will answer that differently.
Another student asks: “Why use vectors at all?”
Because vectors allow graded similarity. They let the system represent relatedness by degree, not just identity or difference. They make analogy, clustering, search, and contextual shifting possible in a way rigid symbolic labels do not.
Another student asks: “What is the difference between token embedding and contextual embedding?”
The token embedding is the initial lookup vector, the same starting point each time that token appears. The contextual embedding is what that token becomes after interacting with its specific sentence context through attention and layer transformations.
And finally, a student asks: “Can we visualize this semantic geometry?”
Only partially. The real spaces are extremely high-dimensional. Any 2D or 3D image you see is a projection, a cartoon, a compressed sketch. Useful for teaching, yes. Faithful to the full structure, not really.
So let me close with this.
The most important shift I want you to carry away from today is the move from thinking of language as a string of words to thinking of language, inside the model, as a moving geometry of relationships.
The model does not keep meaning in little boxes.
It builds meaning through positions, directions, interactions, and transformations.
It turns the symbolic world of language into a statistical space of semantic geometry.
And from that space it generates the next word, then the next, then the next.
So the next time you type a prompt into an LLM, remember what is really happening.
You are not just feeding it words.
You are triggering a geometric event.
Thank you.