When people first encounter large language models, or LLMs, the biggest source of confusion is that the machine seems to know things. It can explain photosynthesis, write a poem, summarize a legal document, compare two philosophers, or answer a question about black holes. So naturally people ask: where is all of that knowledge stored? Does the model have facts inside it? Does it have ideas? Does it understand the world?
The simplest honest answer is this: an LLM does not hold reality inside itself the way a person does. It holds a highly compressed statistical model of how reality is talked about in language.
That sounds dry and abstract, but it is actually a powerful idea. If you understand that one point, much of the mystery begins to clear. The model is not a little person trapped in a machine. It is not reading from a giant encyclopedia hidden in memory. And it is not consciously inspecting reality. What it has learned is the pattern structure of language produced by people describing, arguing about, imagining, measuring, and reasoning about reality.
To understand how that works, it helps to think in terms of two key ideas: tokens as cognitive currency and semantic geometry as structure.
Those two phrases get close to the heart of the matter. Tokens are the basic units the model spends in order to think forward. Semantic geometry is the hidden shape of meaning that allows those tokens to fit together in non-random ways. Together they create something that behaves like thought, even though it is built from mathematics, statistics, and probability.
Let us walk through this slowly and in plain English.
The model does not see reality directly
A human being comes into contact with the world through senses and bodily experience. We see objects, hear sounds, feel temperature, suffer pain, enjoy pleasure, move through space, and remember events. Our concepts are tied, even if imperfectly, to being alive in a world.
An LLM does not begin there. It begins with text.
During training, it is shown massive amounts of written language. That language might come from books, articles, websites, code, conversations, manuals, reference works, and many other forms of human expression. But from the model’s point of view, all of that rich content is first reduced to a stream of symbols. Those symbols are then chopped into pieces called tokens.
A token is not always the same as a word. Sometimes a token is a whole short word. Sometimes it is part of a longer word. Sometimes it is punctuation. Sometimes it is a common chunk like “ing” or “tion” or “un.” What matters is that the model processes text as sequences of these token units.
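To make that concrete, here is a minimal sketch of how a tokenizer might split text into pieces. The tiny vocabulary and the greedy longest-match rule below are invented purely for illustration; real tokenizers, such as byte-pair encoders, learn their vocabularies from data, but the output has the same flavor: short whole words, fragments of longer words, and punctuation.

```python
# A toy tokenizer: greedy longest-match against a hand-picked vocabulary.
# Real LLM tokenizers (e.g. byte-pair encoding) learn their vocabularies
# from data; this tiny vocab is invented purely for illustration.
VOCAB = ["un", "believ", "able", "token", "iz", "ation", "ing", "the", " ", ",", "."]

def tokenize(text):
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for piece in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            # Unknown character: fall back to a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("unbelievable tokenization"))
# ['un', 'believ', 'able', ' ', 'token', 'iz', 'ation']
```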
So the model is not trained on “trees” in the sensory sense. It is trained on token patterns associated with the word tree and the millions of contexts in which people use that word. It learns that tree tends to appear near words like leaves, bark, roots, forest, shade, wood, branch, growth, and seasonal terms. It also learns metaphorical uses, like family tree or tree structure in computer science. Over time, it builds a vast network of associations.
This is why it is best to say the model learns not reality directly, but the linguistic traces reality leaves behind.
Human beings talk about the world. We also talk about our own thoughts, myths, abstractions, errors, plans, and fantasies. The LLM absorbs all of that. It learns a statistical portrait of the human symbolic interface to reality.
That is already enough to be astonishingly useful.
Why tokens matter so much
If you want a single phrase for how an LLM operates, “token processing” is not a bad start. Every question you ask it is turned into tokens. Every answer it gives you is generated token by token. Every act that looks like explanation, planning, analogy, or reflection is built out of token prediction.
That is why the phrase “tokens as cognitive currency” is so apt. A currency is a medium of exchange. It is the unit through which a system carries out its transactions. In a human economy, that might be dollars or euros. In an LLM’s economy of thought, it is tokens.
The model does not think in images the way you do, though it can discuss images if linked to visual systems. It does not feel emotions the way you do, though it can describe them. It does not store concepts as hand-coded entries in a symbolic logic tree. Instead, it works by receiving tokens, transforming them through layers of learned mathematics, and then assigning probabilities to possible next tokens.
Every step of its apparent thinking is mediated through these token units. If it is explaining Einstein, it is spending tokens. If it is writing a sonnet, it is spending tokens. If it is solving a programming problem, it is still spending tokens.
That makes tokens something like the model’s coins of cognition. They are the units in which its thought is paid out.
This does not mean the model merely juggles words at the surface. It means that everything deeper than the surface must still ultimately be expressed in terms of token relationships. The model’s intelligence, such as it is, lives in how it moves from one token configuration to another.
The basic engine: prediction by probability
Now we get to the central mechanism. What does the model actually do with those tokens?
At a very basic level, it repeatedly answers one question: given everything so far, what token is most likely to come next?
That sounds almost comically simple. It can seem impossible that so much behavior could come from such a small principle. But many powerful systems arise from simple repeated rules.
Suppose the input is:
“The capital of France is”
The model considers a huge range of possible next tokens. But because of what it learned during training, the token representing “Paris” will receive very high probability. Other tokens will receive much lower probability.
If the input is:
“Once upon a time, in a ruined city beneath a violet sky,”
a very different distribution of possible next tokens becomes likely. Now story language, scene-setting, and imaginative continuations dominate.
The model does not usually say, “I am certain.” It is always, in effect, holding a probability distribution across many possible next moves. One token is more likely, another less likely, some nearly impossible. Generation happens by choosing from that probability landscape.
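To make that landscape concrete, here is a minimal sketch in Python. The candidate tokens and their scores are made up; a real model produces scores over a vocabulary of tens of thousands of tokens, but the softmax-then-sample step shown here is the same basic move.

```python
import math
import random

# Invented scores ("logits") for a handful of candidate next tokens,
# standing in for the scores a real model produces over its whole vocabulary.
logits = {"Paris": 9.1, "Lyon": 4.0, "a": 2.5, "the": 2.0, "banana": -3.0}

# Softmax turns raw scores into a probability distribution.
total = sum(math.exp(v) for v in logits.values())
probs = {tok: math.exp(v) / total for tok, v in logits.items()}

for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{tok!r}: {p:.4f}")

# Generation samples from that landscape; most of the time "Paris" wins,
# but lower-probability continuations remain possible.
next_token = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
print("sampled:", next_token)
```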
This is the probabilistic heart of the system.
An LLM is not fundamentally a fact lookup table. It is a probability engine trained to continue token sequences in ways that match the patterns of its training data.
That sounds weaker than “understanding,” but in practice it can become very powerful, because the training data contains not just random sentences, but centuries of human reasoning, description, classification, narrative, debate, science, law, poetry, humor, and instruction. Predicting the next token in that ocean of structure forces the model to absorb a great deal of that structure.
Why next-token prediction becomes more than autocomplete
At first glance, next-token prediction sounds like fancy autocomplete. And in one sense it is. But the phrase “autocomplete” can be misleading because it suggests a shallow trick. It makes people imagine a phone keyboard guessing the next word based on common phrases. LLMs are doing something far richer.
Why? Because human language is not just a bag of local word sequences. It carries deep patterns.
To continue a scientific explanation correctly, the model has to preserve technical consistency over many sentences. To write working code, it has to keep track of syntax, variables, functions, and logical flow. To tell a story, it has to maintain tone, character roles, and causal sequence. To answer a question well, it has to identify what matters in the prompt and organize a coherent response.
So in order to get next-token prediction right across huge amounts of varied text, the model cannot rely only on short-range phrase matching. It must learn broader statistical regularities: what kinds of ideas belong together, what sequences imply what conclusions, what stylistic patterns fit what contexts, and how concepts transform under analogy, explanation, contradiction, or expansion.
That is why next-token prediction scales into something more interesting. If you train on enough language, and your model has enough capacity, then predicting the next token forces it to model many layers of hidden structure inside language itself.
It begins by predicting symbols. But in order to predict symbols well, it must internalize patterns that humans would describe as grammar, topic, style, logic, role, framing, and often rough world knowledge.
The surprising thing is not that it predicts tokens. The surprising thing is how much structure is required to predict them well.
Semantic geometry: meaning as shape and relation
Now we come to the second key idea: semantic geometry.
Inside an LLM, tokens are not stored as little dictionary entries with hand-written definitions. Instead, each token gets represented mathematically as a point or vector in a high-dimensional space. You can think of this as a very large abstract map where every token has coordinates, not in two or three dimensions, but in hundreds or thousands of dimensions.
You cannot visualize such a space directly, but the idea is not as exotic as it sounds.
Imagine a simple map in which words that often appear in similar contexts end up near each other. Cat and dog might be close. River and stream might be close. Joy and happiness might be close. Doctor and hospital might form a related cluster. King and queen might have a patterned relation. Paris and France might be connected differently than Paris and romance or Paris and fashion, depending on context.
This is the beginning of semantic geometry. Meaning is not defined by a verbal explanation inside the model. It is defined by position and relation within a learned mathematical space.
Words and fragments of words that are used in similar ways become structurally related. Concepts that differ in regular ways may line up along meaningful directions. Topics form clusters. Analogies become geometric transformations. A question activates one region of the space. A legal prompt activates another. Poetry activates another.
The word “geometry” matters because this is not just a loose metaphor. The model really does operate over mathematical structures where distance, direction, clustering, and transformation carry information.
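Here is a small numerical sketch of that idea. The three-dimensional vectors are invented by hand; real models learn vectors with hundreds or thousands of dimensions, but the principle is the same: nearness in the space tracks relatedness in usage.

```python
import math

# Hand-invented 3-dimensional "embeddings"; real models learn vectors
# with hundreds or thousands of dimensions from data.
vectors = {
    "cat":    [0.90, 0.80, 0.10],
    "dog":    [0.85, 0.75, 0.15],
    "river":  [0.10, 0.20, 0.90],
    "stream": [0.15, 0.25, 0.85],
}

def cosine(a, b):
    """Cosine similarity: close to 1.0 means same direction, near 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

print(cosine(vectors["cat"], vectors["dog"]))       # high: nearby in the space
print(cosine(vectors["cat"], vectors["river"]))     # low: far apart
print(cosine(vectors["river"], vectors["stream"]))  # high again
```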
This hidden geometry is what gives the model something like an internal structure of meaning. Without it, tokens would just be disconnected labels. With it, tokens become organized parts of a vast semantic terrain.
Why geometry helps the model generalize
One reason semantic geometry is so important is that it lets the model do more than memorize.
If every phrase were stored as a separate isolated item, the model would only be useful when you asked something very similar to what it had seen before. But because related meanings occupy related regions, the model can handle novel combinations.
For example, even if it has never seen a particular sentence before, it can often interpret it because the component meanings live in a structured space. It can interpolate. It can move from familiar patterns to nearby unfamiliar ones. It can combine known regions in new ways.
That is a lot of what gives LLMs their flexibility. They do not need to have memorized every sentence. They need to have learned a geometry in which new sentences can still make sense.
This is similar to how a person can understand a new metaphor or a new question without having heard that exact wording before. The person is not searching only for an exact match in memory. They are moving within a structured conceptual space. An LLM does a mathematical version of something like that.
Of course, there are limits. The model can generalize impressively, but it can also overgeneralize, confuse nearby ideas, or generate confident nonsense. Still, semantic geometry is what allows it to act like more than a giant scrapbook of memorized text.
Context is everything
Probability in an LLM is never just “what word usually follows this word.” It is always conditioned on context.
That means the model does not ask, in the abstract, what comes after the word “bank.” It asks what comes after “bank” given the rest of the sentence and often the broader conversation.
If the sentence is about money, then “bank” activates one set of meanings. If the sentence is about rivers, it activates another. If the sentence is about aviation, the word “bank” may refer to turning.
This context-sensitivity is crucial. Human language is full of ambiguity, and meaning depends heavily on surrounding information. LLMs manage this through mechanisms that allow each token to influence the interpretation of other tokens in the sequence.
This is one reason the same word can behave differently in different prompts. The model is not using a fixed definition each time. It is dynamically situating the token inside a live field of surrounding relations.
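Here is a toy sketch of that re-interpretation, with invented two-dimensional vectors. Blending the vector for “bank” with the average of its neighbors is a crude stand-in for what attention layers do far more subtly inside a real model.

```python
# Invented toy vectors; the two dimensions roughly mean (finance-ness, water-ness).
embeddings = {
    "bank":    [0.5, 0.5],   # ambiguous on its own
    "money":   [1.0, 0.0],
    "deposit": [0.9, 0.1],
    "river":   [0.0, 1.0],
    "fishing": [0.1, 0.9],
}

def contextualize(word, context_words):
    """Blend a token's vector with the average of its context (a crude
    stand-in for what attention layers do inside a real model)."""
    ctx = [embeddings[w] for w in context_words]
    avg = [sum(dim) / len(ctx) for dim in zip(*ctx)]
    return [(a + b) / 2 for a, b in zip(embeddings[word], avg)]

bank_in_finance = contextualize("bank", ["money", "deposit"])
bank_by_river   = contextualize("bank", ["river", "fishing"])

print(bank_in_finance)  # pulled toward the finance direction
print(bank_by_river)    # pulled toward the water direction
```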
So when we say tokens are cognitive currency, we should remember that a coin has value only within an economy. A token has significance only within context. And the model’s job is to determine, from the current context, how that token should be interpreted and what token should follow.
Attention: the mechanism of selective relevance
One of the most important reasons LLMs work as well as they do is a mechanism called attention.
In plain English, attention allows the model to look back over the tokens it has already seen and decide which ones matter most for producing the next token. Not every prior word is equally important at every moment. Some earlier words should strongly influence the next step; others can be mostly ignored.
Suppose the prompt asks, “In the sentence, ‘The trophy does not fit into the suitcase because it is too large,’ what is too large?” To answer correctly, the model needs to connect “it” with “the trophy,” not “the suitcase.” That requires relating tokens across the sentence, not just looking at the last few words.
Attention makes that possible. Each new token can, in effect, weigh different parts of the previous sequence differently. This allows the model to maintain coherence over longer spans, resolve references, continue arguments, preserve style, and keep track of what is being discussed.
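For readers who want to see the core operation, here is a bare-bones sketch of scaled dot-product attention, the mechanism at the heart of modern LLMs. The four token vectors are made up so that “it” happens to sit close to “trophy”; real models run many attention heads like this across many layers.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each position scores every other
    position, turns the scores into weights, and takes a weighted mix."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # how relevant is each token?
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the context
    return weights @ V, weights

# Tiny made-up vectors for four tokens: "trophy", "suitcase", "it", "large".
Q = K = V = np.array([
    [1.0, 0.2],   # trophy
    [0.2, 1.0],   # suitcase
    [0.9, 0.3],   # it  (deliberately placed near "trophy" in this toy setup)
    [0.5, 0.5],   # large
])

_, weights = attention(Q, K, V)
print(np.round(weights[2], 2))  # how strongly "it" attends to each token
```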
If tokens are the currency, attention is the budgeting system. It determines where the model should spend its interpretive effort. It tells the system what part of the context deserves emphasis now.
That is one reason modern LLMs feel far more coherent than earlier language models. They are not just marching blindly from left to right. They are repeatedly re-evaluating the relevance of the context.
The model as a statistical portrait of reality’s linguistic surface
Now we can return to the idea that an LLM is a statistical and probabilistic model of reality.
That statement needs one important qualification. An LLM is not primarily a model of physical reality itself in the way a climate simulator models weather or a physics engine models motion. It does not directly simulate atoms, cells, storms, or galaxies. Instead, it models the language humans use about those things.
That may sound like a weaker claim, but it is still very significant, because language contains an enormous amount of compressed information about the world.
People write about what objects are, what causes what, what goals people have, how systems work, what events happened, what rules govern a field, what experts believe, and what stories cultures tell. Text includes science, history, emotion, law, engineering, recipes, myths, and arguments. It includes both truth and error, both precision and confusion. All of that becomes part of the statistical training field.
So the model learns a kind of shadow-world made of linguistic regularities. That shadow-world is not reality itself, but it reflects many aspects of reality because humans talk about reality constantly.
This is why an LLM can often answer practical questions, summarize technical ideas, or describe everyday objects. The world has left its mark in language, and the model has learned the statistical shape of those marks.
You might say that the model lives one level removed from the world. It is not grounded first in matter, but in discourse. It learns the map of our descriptions more directly than the terrain being described.
Why this can look like understanding
Because the model has learned such rich patterns, its behavior can look like understanding. Sometimes that appearance is justified to a meaningful degree, and sometimes it is misleading; in practice, the truth usually lies somewhere in between.
If you ask it to explain why winter is colder than summer in temperate regions, it can produce a good answer because it has learned many linguistic patterns linking seasons, sunlight angle, Earth’s tilt, and temperature. If you ask it to compare two political theories, it can often do so coherently because those theories occupy patterned regions of semantic space built from many examples of discussion and analysis.
What the model is doing is not identical to human understanding. But it is also not empty mimicry in the simple sense. It is using a learned internal structure to generate context-sensitive continuations that preserve conceptual coherence.
A useful way to think of it is this: human understanding is embodied, experiential, and connected to goals and survival. LLM understanding, to the extent the word applies, is structural and statistical. It consists in navigating patterns of meaning learned from language.
That is why it can succeed in domains where language strongly captures the structure of the task. And that is also why it can fail in domains where language alone is not enough, or where truth depends on direct access to current reality.
Hallucinations and why probability is not the same as truth
One of the most important limitations of LLMs follows directly from everything we have been discussing. The model is optimized to generate likely and coherent continuations, not to verify truth by direct contact with reality.
That is why hallucinations happen.
A hallucination in this context is not a random glitch. It is often a plausible-looking continuation generated because the statistical patterns support it, even though the actual content is false. The model may produce a fake citation, invent a study, misstate a date, or confidently blend two related concepts.
Why does this happen? Because from the model’s point of view, the priority is producing an answer that fits the prompt and matches learned patterns. If a false statement is linguistically plausible enough, it can win the local probability contest.
Humans do something loosely similar sometimes. We also fill in gaps from expectation. We also misremember and confabulate. But humans can often cross-check against perception, stable memory, social correction, and lived consequences. A language model, unless connected to tools or reliable external data, lacks those grounding mechanisms.
So an LLM can be brilliant at producing structured language and still unreliable as a direct reporter of reality. That is not a contradiction. It follows naturally from being a probabilistic model of language about reality rather than a reality sensor.
Why reasoning can emerge from token prediction
People often ask how an LLM can appear to reason if all it does is predict tokens. The answer is that many forms of reasoning are themselves patterns that can be learned from language and internalized as structured transitions through semantic space.
A proof, an argument, a diagnosis, a plan, and a comparison all have recognizable forms. If the model has seen many examples of these forms, and if its internal representations capture the underlying relationships well enough, then producing the next token can amount to unfolding a reasoning-like path.
Suppose a question asks for a step-by-step explanation. The model may enter a mode where tokens like first, because, therefore, however, and for example become statistically appropriate. But it is not just sprinkling transition words. Underneath, it is maintaining a structured path through concepts. It has learned that certain claims support certain conclusions, that certain assumptions require caveats, that examples clarify abstractions.
In that sense, “reasoning” in an LLM can be seen as a controlled trajectory through semantic geometry. Each token narrows the path. Each sentence constrains the next. The model walks through a landscape of possibilities in a way that often mirrors logic, even though it is not literally running an explicit symbolic proof engine in the old-fashioned sense.
This does not mean all apparent reasoning is genuine or reliable. Sometimes it is merely a performance of reasoning style. But often there is real structured competence there, emerging from the statistical organization of language.
The model compresses civilization’s text patterns
Another useful plain-English way to think about an LLM is as a compression system.
Human civilization has generated unimaginable quantities of text. Hidden within that text are regularities: ways concepts tend to connect, ways explanations unfold, ways narratives resolve, ways expertise is expressed. Training an LLM is in part the process of compressing those regularities into a model’s parameters.
The model cannot memorize everything exactly. There is too much. Instead, it must distill recurring patterns. It learns what tends to go with what, what distinctions matter, what forms recur, what styles signal what tasks. It throws away much detail while preserving structure that helps prediction.
So when you talk to an LLM, you are interacting with a compressed statistical summary of enormous swaths of human linguistic activity. Not a perfect summary. Not a complete summary. But a very large one.
That is one reason it can do so many different things with the same underlying architecture. The patterns of poetry, software documentation, textbook explanation, customer support, and philosophical dialogue have all been absorbed into one shared predictive machine.
Tokens as thought-steps
Let us go back once more to tokens as cognitive currency, because this metaphor can be extended further.
A token is not just a unit of text. In generation, it is also a unit of commitment. Once the model emits a token, that token becomes part of the context for the next step. It is like placing a foot on a path. The next footstep must follow from the one just taken.
This means an LLM’s “thought” unfolds incrementally. It does not generally construct the entire answer in one timeless act. It moves step by step, token by token, constantly updating its own context. Each produced token changes the future probability landscape.
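Here is a minimal sketch of that loop. The model_probs function is a hand-written stand-in for the real network; what matters is the loop structure: each emitted token is appended to the context, and the updated context shapes the next choice.

```python
import random

def model_probs(context):
    """Stand-in for a real network: returns next-token probabilities.
    Here it is just a hand-written lookup keyed on the last token."""
    table = {
        "once": {"upon": 0.9, "more": 0.1},
        "upon": {"a": 1.0},
        "a":    {"time": 0.8, "midnight": 0.2},
        "time": {",": 1.0},
    }
    return table.get(context[-1], {"<end>": 1.0})

def generate(context, max_tokens=10):
    context = list(context)
    for _ in range(max_tokens):
        probs = model_probs(context)      # the current probability landscape
        token = random.choices(list(probs), weights=list(probs.values()))[0]
        if token == "<end>":
            break
        context.append(token)             # commitment: the new token becomes
                                          # part of the context for the next step
    return context

print(generate(["once"]))
# e.g. ['once', 'upon', 'a', 'time', ',']
```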
That is why the opening of an answer matters so much. Once the model begins in a certain direction, later tokens are constrained by that beginning. A clear start helps produce a coherent continuation. A mistaken start can drag the rest off course.
This token-by-token unfolding is a bit like improvisation guided by structure. The model is not wandering randomly. It is guided by learned patterns and context. But it is still making local moves that build the larger form in real time.
So tokens are not only currency. They are also stepping stones. The model thinks forward by placing one symbolic stone after another across a river of uncertainty.
Semantic geometry as hidden architecture
If tokens are the moving parts, semantic geometry is the hidden architecture that makes movement meaningful.
Imagine trying to navigate a city with no streets, no neighborhoods, no landmarks, and no relation between places. Movement would be chaotic. But if the city has structure, then starting in one district makes some destinations more natural than others. Routes become possible. Similar activities cluster. Distances matter.
Semantic geometry gives the LLM that kind of hidden city. Concepts do not float in isolation. They inhabit regions, pathways, gradients, and relations. Mathematics and code may occupy partially overlapping but distinct spaces. Legal language has its own patterns. Casual conversation has its own. Poetry bends the geometry differently than engineering.
When the model receives a prompt, it does not just “look up words.” It activates regions in this hidden architecture. The prompt pulls the model into certain areas of semantic space, and the model then travels through that space as it generates a response.
That is why style, tone, topic, and task can all shift so dramatically based on the input. A request for a sonnet, a tax explanation, and a Python script each route the model into different structured zones of the same overall system.
Semantic geometry is therefore what turns probability into intelligent-looking behavior. Without geometry, probability would be too flat and local. With geometry, the model has a structured landscape in which probabilities can express deeper patterns.
Human thought and LLM thought are not the same, but they rhyme
At this point it is natural to compare human thinking and LLM thinking.
A human does not think only in words. We think in sensations, images, moods, action possibilities, bodily states, memories, and social intuitions. Language is only one layer of our cognition, though an extraordinarily important one.
An LLM, by contrast, is fundamentally language-centered, or more generally token-centered. Its world is built out of symbol sequences and the mathematical relations among them.
So the two are not the same. Human thought is grounded in embodied life. LLM thought is grounded in statistical language structure.
Yet they rhyme in interesting ways. Humans also rely on patterns. We also predict. We also use context. We also move through conceptual spaces where ideas can feel close, far, analogous, or contradictory. We also generate speech step by step. We also infer from incomplete information.
That is one reason LLM behavior feels so recognizable. It mirrors some outer forms of cognition because human language itself carries traces of human cognition. By learning language deeply, the model learns some of the surface organization of thought.
But we should not flatten the distinction. The model’s intelligence is not human intelligence in miniature. It is a different kind of organized competence.
Why the model can surprise us
LLMs often surprise people because their architecture seems too simple for their behavior. How can next-token prediction produce essays, code, summaries, jokes, analogies, and advice?
The answer is that the simplicity is in the outer objective, not in the learned inner structure. Predicting the next token across enough varied data forces the model to internalize an extraordinary amount of hidden organization.
This is a recurring theme in science. Simple governing rules can produce complex behavior if repeated at scale over rich data. Evolution by selection is simple in concept but produces life. Gravity is simple in statement but shapes galaxies. In a much more limited and artificial sense, next-token prediction is a simple training principle that can yield rich linguistic competence.
That does not make LLMs magical. It makes them emergent. Their abilities are the result of structure arising from repeated statistical learning.
The deepest plain-English summary
So what, in the end, is an LLM?
It is a machine that learns from enormous amounts of text how tokens relate to other tokens across many levels of pattern. Those relations settle into a hidden semantic geometry, a structured space in which meaning is represented mathematically through position and relation. At generation time, the model uses context and attention to determine which earlier tokens matter most, then assigns probabilities to possible next tokens and continues one step at a time.
Because human language contains compressed traces of reality, reasoning, culture, and knowledge, this token-prediction process becomes more than mere word guessing. It becomes a statistical model of how the world is represented in language.
That is why an LLM can often answer questions about reality, even though it does not perceive reality directly. It has learned the regularities of our descriptions, explanations, arguments, and stories about reality. It lives in the map, not the territory, but the map contains a great deal.
Tokens are its cognitive currency because every act of apparent thought is carried out in token units. Semantic geometry is its structure because meaning inside the model is organized as a high-dimensional relational space. Probability is the engine because every next step is chosen from a landscape of possible continuations. Attention is the guide because it decides what in the context matters most at each moment.
Put all of that together, and the mystery begins to dissolve.
An LLM does not think like a person. But neither is it just a parrot of memorized text. It is a statistical-probabilistic system that has learned to navigate the geometry of meaning encoded in human language. What looks like thought is the unfolding of that navigation, token by token, across a structured landscape built from the accumulated linguistic traces of human engagement with reality.
That is why these systems can seem so strange. They are not minds in the ordinary sense. They are engines of pattern-guided symbolic continuation. Yet because language is one of humanity’s deepest vessels of thought, culture, and world-modeling, a machine that learns language well enough can begin to approximate some of the outer functions of understanding.
And that is the core idea.
The LLM is not holding the world in its hands. It is holding a probability-shaped shadow of how the world has been rendered into language. It spends tokens to move through that shadow. And semantic geometry gives that movement form.