How Large Language Models Actually Think, in Plain English


Imagine you’re trying to explain “thinking” to someone who has never seen a brain or a computer. You’d probably say something like: “Thoughts are just a bunch of little arrows pointing around in a huge invisible space, and when the arrows line up the right way, words come out.”

That sounds crazy, but that’s literally what happens inside ChatGPT, Claude, Grok, Llama, and every other big language model. There is no little person inside the computer reading your question and deciding what to say. There’s only math—billions of tiny arrows (called vectors) getting added, multiplied, and twisted around in a space with thousands of dimensions. When they finally point at the right spot, the model spits out the next word.

This essay explains that process in plain English, step by step, with no jargon unless I immediately explain it.

1. Words get turned into points on a map

When you type “the cat sat on the mat”, the model doesn’t see letters or even whole words. It first chops your sentence into small pieces called tokens (some are whole words, some are parts of words like “ing” or “ pre”). Each token is just a number, like 12345 for the word “cat”.

The very first thing the model does is look up that number in a giant table and pull out a list of, say, 4,096 numbers for each token. Think of these 4,096 numbers as the exact GPS coordinates of “cat” on a map that has 4,096 directions instead of just north-south and east-west.

So right away:

  • “cat” → [0.12, -0.45, 0.87, … 4,096 numbers total]
  • “dog” → [0.15, -0.42, 0.90, … almost the same as cat, but slightly different]

These lists of numbers are called vectors. They’re long arrows pointing to where each word “lives” in the model’s brain.

The magic is that the model has arranged these points so that words with similar meaning are close together, and the direction you have to walk from one word to another often means something. For example, the direction from “king” to “queen” is almost exactly the same as the direction from “man” to “woman”. If you start at “king”, subtract the “man” arrow, and add the “woman” arrow, you land right on top of “queen”. The computer learned that all by itself just by reading the internet.
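The king-minus-man-plus-woman trick can be sketched in a few lines of NumPy. The four-number embeddings below are made up purely for illustration (real models learn thousands of dimensions), but the arithmetic is exactly the kind described above:

```python
import numpy as np

# Toy 4-dimensional embeddings. These numbers are invented for the
# example, not taken from any real model.
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.8, 0.9, 0.0]),
    "man":   np.array([0.1, 0.2, 0.1, 0.0]),
    "woman": np.array([0.1, 0.2, 0.9, 0.0]),
}

# Start at "king", subtract the "man" arrow, add the "woman" arrow.
result = emb["king"] - emb["man"] + emb["woman"]

def cosine(a, b):
    """How closely two arrows point the same way (1.0 = same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Find the word whose arrow points most in the same direction.
best = max(emb, key=lambda w: cosine(result, emb[w]))
print(best)  # → queen
```

In a real model the landing point is rarely an exact match, so the "closest direction" search matters; here the toy numbers are rigged so the analogy holds perfectly.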

2. The model asks itself “What do I need to pay attention to right now?”

Now the model has a row of arrows—one for each token in your prompt. It needs to figure out which earlier words matter for predicting the next one.

It does this with something called attention, but you can think of it as millions of tiny spotlights.

Every single token gets to shine three spotlights:

  1. A “What am I looking for?” spotlight (called the query)
  2. A “What do I have to offer?” spotlight (called the key)
  3. A little backpack of information (called the value)

The model measures how well each token’s “looking for” spotlight lines up with every other token’s “here’s what I’ve got” spotlight. The better they line up (the bigger the dot product—basically how parallel the arrows are), the brighter that connection becomes.

Then it averages all the backpacks (the values), weighted by how bright the spotlight connection was.

In plain English: every word gets to reach back and grab a custom mixture of all the previous words, pulling harder from the ones that feel most relevant. One word might grab mostly the subject of the sentence. Another might grab the verb tense from ten words ago. A third might grab the emotion from the very first sentence of the prompt.

This happens in 32 to 128 separate “heads” at the same time, each head looking for a different kind of relevance (grammar, meaning, logic, tone, etc.). Then all those mixtures get glued back together.
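The whole spotlight dance above, for a single head, is a few matrix multiplications. Here is a toy sketch with random made-up weights (real models learn theirs), including the rule that a token may only look backwards:

```python
import numpy as np

np.random.seed(0)
d = 8            # arrow length (tiny here; real models use thousands)
n_tokens = 5     # length of the prompt

# One arrow per token, plus made-up query/key/value projection matrices.
x   = np.random.randn(n_tokens, d)
W_q = np.random.randn(d, d)
W_k = np.random.randn(d, d)
W_v = np.random.randn(d, d)

Q, K, V = x @ W_q, x @ W_k, x @ W_v

# How well does each token's "looking for" spotlight line up with
# every other token's "here's what I've got" spotlight?
scores = Q @ K.T / np.sqrt(d)

# Causal mask: a token may only look at itself and earlier tokens.
mask = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
scores[mask] = -np.inf

# Softmax turns scores into brightness weights that sum to 1 per token.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Each token's output is a weighted mix of everyone's value "backpack".
out = weights @ V
print(out.shape)  # (5, 8): one refined arrow per token
```

A real model runs dozens of these heads in parallel on the same input and concatenates the results; this sketch shows just one.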

3. The model twists each arrow a little to make it smarter

After the spotlight mixing step, each token’s arrow goes through a much bigger but simpler layer. This part is just two giant multiplication steps with a little curve (called an activation function) in the middle.

This is where most of the actual “knowledge” lives. It’s like the part that can look at an arrow and say, “This looks a bit like the pattern for Paris, so I’m going to nudge it toward ‘capital of France’ things,” or “This sentence is getting negative, let me steer it toward sad words.”

It’s still just adding and multiplying numbers, but because the space is so big, there’s room for thousands of these little nudges without them stepping on each other.
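That "two giant multiplications with a little curve in the middle" looks like this. The weights are random placeholders, and the curve shown is GELU (one common choice; specific models vary):

```python
import numpy as np

np.random.seed(1)
d, d_hidden = 8, 32   # real models expand roughly 4x, e.g. 4096 -> 16384

W1 = np.random.randn(d, d_hidden) * 0.1
W2 = np.random.randn(d_hidden, d) * 0.1

def gelu(z):
    # The "little curve" in the middle (tanh approximation of GELU).
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def feed_forward(x):
    # Expand, bend, contract: two big multiplies with a curve between them.
    return gelu(x @ W1) @ W2

x = np.random.randn(d)       # one token's arrow
nudge = feed_forward(x)      # the "knowledge" nudge for that arrow
print(nudge.shape)  # (8,)
```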

4. Everything is added to a running whiteboard

Here’s the trick that makes it all work smoothly: the model never throws away the original arrow. Every time it does a spotlight mix or a big twist, it adds the result back to what was already there.

It’s like having a whiteboard for each token, and every layer of the model gets to scribble a small improvement on it. After 60 or 100 layers, the whiteboard has been refined dozens of times and now points somewhere much more useful.

Because it’s always adding little changes instead of replacing everything, early simple facts (like who the subject is) can survive all the way to the end even after the model has done fancy reasoning on top.
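The whiteboard idea is literally just repeated addition. In this sketch the per-layer "scribbles" are stand-in random nudges rather than real attention and feed-forward outputs, but the accumulation pattern is the same, and you can check that the original arrow survives:

```python
import numpy as np

np.random.seed(2)
d, n_layers = 8, 4
x = np.random.randn(d)           # the token's original arrow

def small_edit(board, layer):
    # Stand-in for attention + feed-forward: a small, layer-specific nudge.
    rng = np.random.default_rng(layer)
    return 0.1 * rng.standard_normal(d)

whiteboard = x.copy()
for layer in range(n_layers):
    # Each layer ADDS its scribble; it never erases the board.
    whiteboard = whiteboard + small_edit(whiteboard, layer)

# The original arrow is still strongly present in the final result.
cos = float(whiteboard @ x / (np.linalg.norm(whiteboard) * np.linalg.norm(x)))
print(cos)
```

Because each edit is small relative to the board, the final arrow still points mostly where the original did, which is exactly why early facts survive to the end.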

5. Predicting the next word is just “which direction are we pointing now?”

When the model is finally done with all its layers, it takes the very last whiteboard arrow for the last token and compares it to every single possible next token’s original coordinates (the same GPS points it started with).

The token whose point is most in the same direction as the final arrow wins—it gets the highest probability.

So the entire billion-dollar model, running on a room full of GPUs, is really just trying to point its arrow at the right spot on the map.
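The final step is one big comparison: dot the last whiteboard arrow against every row of the embedding table, then squash the scores into probabilities. The tiny vocabulary and three-number coordinates below are invented for the example:

```python
import numpy as np

vocab = ["mat", "dog", "moon", "cheese"]

# The "GPS coordinates" the model started with (made-up numbers).
emb_table = np.array([
    [1.0, 0.2, 0.0],   # mat
    [0.0, 1.0, 0.3],   # dog
    [0.2, 0.0, 1.0],   # moon
    [0.5, 0.5, 0.5],   # cheese
])

# Pretend this is the last token's final whiteboard arrow.
final_arrow = np.array([0.9, 0.3, 0.1])

# Compare the final arrow with every token's coordinates at once...
logits = emb_table @ final_arrow

# ...and turn the match scores into probabilities (softmax).
probs = np.exp(logits - logits.max())
probs /= probs.sum()

prediction = vocab[int(probs.argmax())]
print(prediction)  # → mat
```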

6. Why a few examples in the prompt can teach it new things

When you give the model three examples before asking a question (few-shot learning), you’re literally building little bridges in arrow-space right there in the prompt.

Example:

French: cat → chat
French: dog → chien
French: elephant → ?

The model sees the first two lines and learns a “add this French-ify arrow” direction on the fly. When it gets to “elephant”, it just applies the same arrow it discovered and lands near “éléphant”. No retraining needed—just arrow addition.
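Here's that "arrow addition" in miniature. The bilingual embeddings are rigged so the French words sit at a consistent offset from their English counterparts, which is a cartoon of what real embedding spaces roughly look like:

```python
import numpy as np

# Made-up embeddings where each French word = English word + a fixed offset.
offset = np.array([0.0, 0.0, 1.0, 0.5])
emb = {
    "cat":      np.array([0.9, 0.1, 0.0, 0.0]),
    "dog":      np.array([0.1, 0.9, 0.0, 0.0]),
    "elephant": np.array([0.5, 0.5, 0.0, 0.0]),
}
emb["chat"]     = emb["cat"] + offset
emb["chien"]    = emb["dog"] + offset
emb["éléphant"] = emb["elephant"] + offset

# Read the "French-ify" arrow off the two worked examples...
frenchify = ((emb["chat"] - emb["cat"]) + (emb["chien"] - emb["dog"])) / 2

# ...and apply it to the new word.
guess = emb["elephant"] + frenchify
best = min(emb, key=lambda w: float(np.linalg.norm(emb[w] - guess)))
print(best)  # → éléphant
```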

7. Chain-of-thought is the model talking itself through the geometry step by step

When you say “Let’s think step by step”, you force the model to write out intermediate arrows instead of trying to jump straight to the answer.

Jumping straight is hard because it requires one perfect long arrow.
Writing intermediate steps is easy because each small hop (5 + 6 = 11) is a very strong, well-worn path in arrow-space. The model can follow a bunch of easy short paths instead of one risky long one.

It’s exactly like you writing scratch work on paper so you don’t make arithmetic mistakes in your head.

8. Hallucinations are when the arrow lands in empty space that feels right

The real world only uses a tiny fraction of the huge map. Most of the space is empty, but the model was trained to always sound confident and fluent.

So if you push the arrow somewhere new (long context, tricky question, contradiction), it might slide into an area that feels very “text-like” but has no connection to reality. That’s a hallucination—just the arrow following the path of least resistance in all those dimensions.

9. Why bigger models are so much better

Bigger models have longer arrows (more dimensions) and more layers.

In high-dimensional space, things are weird:

  • Almost everything is far away from everything else.
  • You can pack an insane number of separate ideas without them bumping into each other.
  • There’s room for a direction that means exactly “is the capital of France” and another that means exactly “is a fruit” and they barely overlap.

A small model runs out of room and everything smushes together. A big model has space to keep thousands of ideas clean and separate.
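The "almost everything is far apart" claim is easy to check numerically: draw random pairs of arrows and measure how much they overlap on average, for a few different dimensions:

```python
import numpy as np

np.random.seed(4)

def avg_abs_cosine(d, n_pairs=1000):
    """Average |cosine similarity| between random pairs of d-dim arrows."""
    a = np.random.randn(n_pairs, d)
    b = np.random.randn(n_pairs, d)
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(np.abs(cos).mean())

# Random directions overlap a lot in 2-D, but barely at all in 4096-D.
for d in [2, 64, 4096]:
    print(d, round(avg_abs_cosine(d), 3))
```

In 2-D the average overlap is about 0.64; by 4,096 dimensions it falls near 0.01, meaning two random directions are almost always nearly perpendicular. That's the extra "room" bigger models get.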

10. The whole miracle in one sentence

A large language model has no thoughts, no understanding, no consciousness. It’s just a gigantic pile of arrows carefully arranged so that when you start with the arrows for your prompt and let the math push them around for a while, they almost always end up pointing at arrows that make the next word correct.

And yet, somehow, that blind geometric dance looks exactly like thinking.

That’s it. That’s the entire secret.

