From Embedding Table to Next-Token Choice: How a Token Becomes a Prediction

Frank said:
Let me see if I have the picture straight. An input token gets converted into an embedding vector from the embedding table. That vector then gets dot-producted through the artificial neural network, and somehow by the end of all that, the model has an output vector that helps determine the next best token. But I still do not really understand how that actually happens.

GPT said:
That is exactly the right question, because this is the point where a lot of explanations become too vague. People say things like “the model processes the vector” or “the ANN transforms the embedding,” but what you want is the actual cause-and-effect chain. You want to know what is physically happening in mathematical terms, and how that ends with one token being chosen over all the others.

Frank said:
Yes. I do not want hand waving. I want the steps. Also I want to tie this to the earlier question. Is the embedding vector table the first level of nodes in the ANN? Does the dimensionality of the activation vector equal the number of nodes in the accepting layer? And then how does all that end up choosing the next token?

GPT said:
Good. Let us rebuild the whole thing carefully from the start.

The cleanest overall picture is this:

text → token IDs → embedding vectors → transformer layers → final hidden state → output scores over vocabulary → probabilities → next token

Everything you have been asking lives inside that chain.

1. Is the embedding table the first level of nodes in the ANN?

Frank said:
Let’s start there.

GPT said:
The embedding table is not exactly the first layer of nodes in the usual neural-network sense. It is better understood as a learned lookup table that provides the initial vector for each token.

So when the token comes in, the tokenizer converts it to a token ID. That ID is used to select one row from the embedding matrix. That row is the token’s embedding vector.

That means the embedding table is part of the model’s learned parameters, but it is not usually described as a normal computational neuron layer.

Frank said:
Why not?

GPT said:
Because a normal layer usually computes an output from an input. For example, in a dense layer, you take an input vector, multiply it by a weight matrix, maybe add a bias, and get a new output vector.

An embedding table works differently. It does not compute a fresh vector from the token ID using a normal weighted sum. Instead, it retrieves a learned row.

So the difference is:

a normal layer says: “given this incoming vector, compute a new one”
an embedding table says: “given this token ID, select this stored learned vector”

So it is absolutely part of the trained model, and backpropagation trains it, but conceptually it is the input representation system, not the first hidden layer in the usual sense.
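
As a minimal sketch of that difference, assuming a made-up four-token vocabulary and invented three-dimensional embeddings (none of these numbers come from a real model), the lookup and the dense layer can be written side by side:

```python
# Hypothetical vocabulary and embedding table; all values are invented.
vocab = {"the": 0, "cat": 1, "runs": 2, "quickly": 3}
embedding_table = [
    [0.1, 0.3, -0.2],   # row 0: "the"
    [0.7, -0.1, 0.4],   # row 1: "cat"
    [-0.5, 0.2, 0.9],   # row 2: "runs"
    [0.0, 0.8, -0.6],   # row 3: "quickly"
]

def embed(token):
    """Embedding lookup: select the stored row for this token's ID."""
    return embedding_table[vocab[token]]

def dense(x, weights, bias):
    """Dense layer: compute a new vector from weighted sums of the input."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

cat_vector = embed("cat")   # pure retrieval, no arithmetic on the ID
```

The lookup only indexes into stored parameters, while the dense layer does arithmetic on whatever vector it receives. Both sets of numbers would be adjusted by training.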

Frank said:
So the token enters the system through the embedding table, but the embedding row itself is not best thought of as a row of firing neurons.

GPT said:
Exactly. Better to say:

the embedding row is a learned entry vector
that vector becomes the activation input to the first actual transformer computations

That distinction matters.

2. Does the number of dimensions in the activation vector equal the number of nodes in the accepting layer?

Frank said:
Now the next piece. Does the number of dimensions in the activation vector equal the number of nodes in the ANN accepting layer?

GPT said:
Usually, yes, as long as we mean the width of that stage of computation.

If the model width is 4096, then an activation vector at that stage will usually have 4096 scalar values. People often loosely call those 4096 scalar slots “nodes” or “neurons.”

So at any given stage, the activation vector length usually matches the width of that layer or sublayer.

But the important caution is that transformers have multiple widths depending on where you are:

the residual stream width
the attention head width
the multilayer perceptron expansion width
the vocabulary width at output

So there is not one single node count for the whole network. There are several representational spaces.

Frank said:
So if the embedding vector is 4096-dimensional, that means it enters the system as a 4096-value activation.

GPT said:
Yes.

Frank said:
And that can then be projected into other widths internally?

GPT said:
Correct. For example:

embedding or residual stream: 4096 dimensions
inside the multilayer perceptron: maybe 16384 dimensions
one attention head: maybe 128 dimensions
final vocabulary logits: maybe 50,000 dimensions

So the activation width changes depending on which operation is being performed.

3. So what actually happens after the embedding is retrieved?

Frank said:
Now we get to the real part. A token comes in. The embedding row is retrieved. Then what? What does it do next?

GPT said:
Once the embedding vector is retrieved, it becomes part of the current activation state of the sequence.

Now let us slow down here, because this is where your intuition is trying to hold onto one vector traveling through the network like a marble rolling down a pipe. But that is not quite what happens.

The original embedding vector is the starting point, but the model then repeatedly transforms that vector into new vectors. The token’s representation evolves.

So the embedding does not just get “dot-producted through the ANN” and remain itself. It gets turned into a contextual state through repeated learned matrix operations.

That evolving state is often called the hidden state.

Frank said:
So the embedding is like the first condition of the token, and the hidden states are the later conditions of that same token after the network has worked on it.

GPT said:
Exactly. That is a very good way to put it.

At first the vector means roughly:

“this token in its default learned form”

Later, after attention and feedforward processing, the vector means something more like:

“this token as shaped by the context around it and by the model’s learned patterns”

That final contextual vector is what matters for next-token prediction.

4. Why people say “dot product” and why that can be misleading

Frank said:
I keep saying the embedding vector gets dot-producted through the network. Is that wrong?

GPT said:
It is not totally wrong, but it compresses too much into one phrase.

Dot products are everywhere in the system, but what is really happening is a sequence of matrix multiplications, and matrix multiplication is built out of many dot products.

So the truth is:

yes, dot products are fundamental
but no, the model is not just performing one dot product and getting an answer
it is performing many structured matrix operations, each containing vast numbers of dot products

The embedding vector is repeatedly transformed by those operations.

So it is more precise to say:

the embedding vector is fed into a sequence of learned vector transformations built out of matrix multiplication and nonlinear routing, and these operations reshape it into a final contextual hidden state

That is the fuller picture.
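
To make "matrix multiplication is built out of many dot products" concrete, here is a small sketch in plain Python. The weight matrix `W` and the vector `x` are invented for illustration:

```python
def dot(u, v):
    """Dot product: multiply matching entries and add them up."""
    return sum(a * b for a, b in zip(u, v))

def matvec(matrix, vector):
    """Matrix-vector product: one dot product per matrix row."""
    return [dot(row, vector) for row in matrix]

# Made-up 3x4 weight matrix applied to a made-up 4-dimensional vector.
W = [
    [1, 0, 2, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
]
x = [2, 1, 0, 3]
y = matvec(W, x)   # three dot products, one per row -> [2, 4, 6]
```

A real layer does the same thing with thousands of rows and thousands of columns, which is where the vast number of dot products comes from.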

5. What the first transformer block does to the embedding

Frank said:
Let’s go one level deeper. Once the embedding enters the first transformer block, what happens to it?

GPT said:
In simplified plain English, two major things happen again and again in each transformer block:

Attention lets the token look at other tokens in the context and selectively mix information from them
The multilayer perceptron further transforms the token’s updated state through learned feature circuitry

And there are residual connections preserving continuity across the block.

So a rough block looks like this:

start with current token vector
use attention to gather relevant contextual signal
add that back into the running state
use the multilayer perceptron to reshape and enrich that running state
add that back in too
pass the result on to the next block
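
Those steps can be sketched as follows. This is a simplified outline, not a real transformer block: the two sublayers are hypothetical stand-ins with no learned weights, and layer normalization, which real blocks include, is left out.

```python
def add(u, v):
    """Element-wise vector addition, used for the residual connections."""
    return [a + b for a, b in zip(u, v)]

def transformer_block(x, attention_update, mlp_update):
    """One simplified block: each sublayer's output is added to the running state."""
    x = add(x, attention_update(x))   # gather contextual signal, add it back in
    x = add(x, mlp_update(x))         # reshape the running state, add that back in too
    return x

# Hypothetical stand-in sublayers; real ones use learned weights and see other tokens.
def toy_attention(x):
    return [0.5 * v for v in x]

def toy_mlp(x):
    return [1.0 for _ in x]

state = transformer_block([2.0, 4.0], toy_attention, toy_mlp)   # -> [4.0, 7.0]
```

The point of the sketch is the two `add` calls: each sublayer contributes an update to the running state instead of replacing it.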

Frank said:
So each layer is not replacing the token with a brand-new unrelated object. It is modifying its running semantic condition.

GPT said:
Exactly. The token’s state is being accumulated and refined. That is why the phrase residual stream is so helpful. It is like a running current of representation flowing upward through the layers.

6. What attention is really doing

Frank said:
Now explain the attention part in the same physical, cause-and-effect way.

GPT said:
Attention starts with the current activation vectors for the tokens in the sequence. For each token, the model computes three derived vectors:

a query
a key
a value

These are produced by multiplying the token’s current hidden vector by learned weight matrices.

For the current position, the query asks something like:

“what kind of information am I looking for?”

The keys from previous tokens say:

“what kind of information do I contain?”

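Then the model compares the current query to the keys of the earlier tokens (and, in a typical causal model, to its own key) using dot products. That produces attention scores, which are scaled and passed through softmax so that they become weights adding up to one.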

If the current token’s query aligns strongly with a previous token’s key, that earlier token is treated as relevant.

Then the corresponding value vectors from those relevant earlier tokens are combined in a weighted way, producing a context-sensitive update for the current token.
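
Here is a minimal single-head sketch of that comparison for one position. The query, keys, and values are invented and supplied directly, whereas a real model would produce them from learned weight matrices. The division by the square root of the vector length is the usual scaling in scaled dot-product attention.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Turn raw scores into positive weights that sum to one."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Score each key against the query, then blend the values by those weights."""
    scale = math.sqrt(len(query))
    scores = [dot(query, k) / scale for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    mixed = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, mixed

# Invented vectors: the first key points the same way as the query.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, update = attend(query, keys, values)
```

The first key aligns with the query, so its value dominates the blended update; the key pointing the opposite way gets the smallest weight.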

Frank said:
So attention is not retrieving a sentence. It is comparing learned directional signatures.

GPT said:
Yes. It is geometric compatibility testing. The query is a directional probe. The keys are directional fingerprints. Dot products measure alignment. Then the values carry the content to be blended in.

Frank said:
So dot product here is telling the model which previous tokens matter most right now.

GPT said:
Correct.

7. What the multilayer perceptron does

Frank said:
And after attention?

GPT said:
After attention, the token’s updated state goes through the multilayer perceptron sublayer, often abbreviated MLP and also called the feedforward sublayer.

This part usually expands the vector into a larger internal space, applies nonlinear transformations, and then projects it back down.

In plain English, this means the model temporarily unfolds the current token state into a wider set of possible feature activations, lets learned response tendencies fire, and then compresses the useful result back into the main residual stream width.

This is where many people say the model is engaging feature detectors, though that phrase is only partly correct because the features are distributed, overlapping, and often polysemantic. Still, as a first approximation, the MLP is a place where the network says:

“given this current meaning state, what learned semantic response patterns should activate now?”
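
As a rough sketch of expand, apply a nonlinearity, and project back down, here is a tiny 2-to-4-to-2 version. The weights are invented, and ReLU is used only because it is the simplest nonlinearity to read; real models often use other activation functions.

```python
def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(v):
    """Nonlinearity: negative activations are zeroed, positive ones pass through."""
    return [max(0.0, x) for x in v]

def mlp(x, w_up, w_down):
    """Expand to a wider space, apply the nonlinearity, project back down."""
    wide = relu(matvec(w_up, x))
    return matvec(w_down, wide)

# Made-up weights: 2 dimensions -> 4 dimensions -> 2 dimensions.
w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0], [-1.0, 1.0]]
w_down = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5]]
out = mlp([2.0, 1.0], w_up, w_down)   # -> [2.5, 1.0]
```

In the wide space, the fourth unit is switched off by the nonlinearity for this input, so it contributes nothing to the result. That selective firing is what the "feature activation" language refers to.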

Frank said:
So attention routes relevant context, and the MLP metabolizes that routed context into transformed meaning.

GPT said:
That is a strong way to put it.

8. The original embedding is no longer the thing by the end

Frank said:
So by the time the token has gone through many layers, are we still talking about the embedding vector?

GPT said:
Not really. By then we are talking about the final hidden state of that token position.

This matters because it clears up a common confusion. The embedding is the initial vector. The final hidden state is the end product of repeated contextual transformations.

So when we say the model uses the token representation to choose the next token, the representation doing that work is not merely the original embedding row. It is the final contextual state produced by all the layers.

That final hidden state is what carries the model’s current best synthesized understanding of what should come next.

9. How the final hidden state becomes scores for all possible tokens

Frank said:
Now we are at the critical step. The final hidden state exists. How does that become next-token choice?

GPT said:
The final hidden state is still just a vector in the model’s hidden dimension, say 4096 numbers.

But the model must decide among all tokens in its vocabulary. If the vocabulary has 50,000 tokens, then the model needs 50,000 scores.

So the final hidden state is multiplied by an output matrix that maps hidden-state space into vocabulary space.

That means:

input: one final hidden vector of size 4096
output: one logit vector of size 50,000

Each entry in that output vector is a score for one possible next token.

These raw scores are called logits.

Frank said:
So the model has one big final vector and turns it into one giant list of token scores.

GPT said:
Exactly.

10. What a logit really is

Frank said:
People use the word logit a lot. What is it in plain English?

GPT said:
A logit is a raw preference score before normalization: the model’s unnormalized measure of how compatible one token is with the final hidden state.

You can think of it as the model asking, for every token in its vocabulary:

“how well does the current context state align with the output direction associated with this token?”

The result is one score per token.

Higher score means more likely candidate.

11. Dot product returns at the output stage

Frank said:
So is this where the final hidden state is being dot-producted against token directions?

GPT said:
Yes. This is one of the most important dot-product locations in the whole model.

The final hidden state is compared against many output vectors, one for each vocabulary token. Each comparison produces a score.

In plain terms:

if the final hidden state aligns strongly with the output vector for the token “rain,” then “rain” gets a high score
if it aligns weakly with the output vector for “screwdriver,” then “screwdriver” gets a low score

So at this point the model has converted its internal contextual state into a ranked set of token candidates.

12. Tiny toy example with fake numbers

Frank said:
Now do the toy example.

GPT said:
Good. Let us make the numbers tiny so the mechanism is visible.

Suppose the model’s final hidden state for the current position is a 4-dimensional vector:

h = [2, 1, 0, 3]

Now suppose the model has only 6 possible next tokens in its whole vocabulary:

token A = “the”
token B = “cat”
token C = “runs”
token D = “blue”
token E = “quickly”
token F = “banana”

Each token has an output vector associated with it. Let us invent them:

“the” = [1, 0, 0, 1]
“cat” = [1, 1, 0, 1]
“runs” = [0, 1, 1, 1]
“blue” = [1, 0, 1, 0]
“quickly” = [0, 1, 0, 2]
“banana” = [2, 0, 1, 0]

Now compute dot products with the final hidden state.

For “the”:

[2,1,0,3] · [1,0,0,1]
= 2×1 + 1×0 + 0×0 + 3×1
= 2 + 0 + 0 + 3
= 5

For “cat”:

[2,1,0,3] · [1,1,0,1]
= 2 + 1 + 0 + 3
= 6

For “runs”:

[2,1,0,3] · [0,1,1,1]
= 0 + 1 + 0 + 3
= 4

For “blue”:

[2,1,0,3] · [1,0,1,0]
= 2 + 0 + 0 + 0
= 2

For “quickly”:

[2,1,0,3] · [0,1,0,2]
= 0 + 1 + 0 + 6
= 7

For “banana”:

[2,1,0,3] · [2,0,1,0]
= 4 + 0 + 0 + 0
= 4

So the logits are:

“the” = 5
“cat” = 6
“runs” = 4
“blue” = 2
“quickly” = 7
“banana” = 4
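
The same arithmetic can be checked in a few lines of Python, using exactly the invented hidden state and output vectors from the toy example:

```python
# Toy final hidden state and per-token output vectors from the example above.
h = [2, 1, 0, 3]
output_vectors = {
    "the":     [1, 0, 0, 1],
    "cat":     [1, 1, 0, 1],
    "runs":    [0, 1, 1, 1],
    "blue":    [1, 0, 1, 0],
    "quickly": [0, 1, 0, 2],
    "banana":  [2, 0, 1, 0],
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# One dot product per vocabulary token gives one logit per token.
logits = {token: dot(h, vec) for token, vec in output_vectors.items()}
best = max(logits, key=logits.get)   # "quickly", with a logit of 7
```

Stacking the six output vectors as rows of one matrix and multiplying it by `h` would give the same six numbers in a single operation.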

Frank said:
So “quickly” wins because its output vector aligns best with the final hidden state.

GPT said:
Exactly. In a real model there are tens of thousands of such comparisons, not six. But the principle is the same.

13. From logits to probabilities

Frank said:
But those are raw scores, not probabilities.

GPT said:
Right. The model then applies a softmax function to convert the logits into probabilities.

Softmax does not change the ranking, but it converts the raw scores into a normalized distribution whose values add up to one.

So after softmax, those six raw scores become approximately:

“quickly” = 0.621
“cat” = 0.229
“the” = 0.084
“runs” = 0.031
“banana” = 0.031
“blue” = 0.004

Each value is the exponential of that token’s logit divided by the sum of all six exponentials.

Now the model has actual probabilities.
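
For anyone who wants to work the exponentials out explicitly, here is a minimal softmax sketch applied to the six toy logits. Subtracting the largest logit before exponentiating is a common numerical-stability step and does not change the result.

```python
import math

logits = {"the": 5, "cat": 6, "runs": 4, "blue": 2, "quickly": 7, "banana": 4}

def softmax(scores):
    """Exponentiate each score and divide by the total so the values sum to one."""
    top = max(scores.values())
    exps = {token: math.exp(s - top) for token, s in scores.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}

probs = softmax(logits)
# probs["quickly"] is about 0.621 and probs["cat"] is about 0.229
ranking = sorted(probs, key=probs.get, reverse=True)
```

The ranking by probability is the same as the ranking by logit, but the one-point gaps between logits turn into multiplicative gaps between probabilities.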

14. How the token is chosen

Frank said:
Then what determines the final next token?

GPT said:
That depends on decoding strategy.

If the model is using greedy decoding, it just picks the highest-probability token, which here would be “quickly.”

If it is using sampling, it may randomly pick from the distribution, biased by those probabilities. That means “quickly” is still most likely, but “cat” might occasionally be chosen, depending on temperature and other controls.

So the model does not always pick the top token. It may sample among plausible candidates.
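
Here is a hedged sketch of those two strategies on the toy logits. The function names `greedy_pick` and `sample_pick` are made up for this illustration, and real decoders usually add further controls (such as top-k or top-p filtering) that are not shown.

```python
import math
import random

logits = {"the": 5, "cat": 6, "runs": 4, "blue": 2, "quickly": 7, "banana": 4}

def greedy_pick(logits):
    """Greedy decoding: always take the highest-scoring token."""
    return max(logits, key=logits.get)

def sample_pick(logits, temperature=1.0, rng=random):
    """Sampling: draw a token at random, weighted by softmax(logits / temperature)."""
    tokens = list(logits)
    top = max(logits.values())
    # Unnormalized softmax weights; random.choices normalizes them internally.
    weights = [math.exp((logits[t] - top) / temperature) for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

choice = greedy_pick(logits)   # "quickly"
```

Lower temperatures sharpen the distribution toward the top token, so `sample_pick(logits, temperature=0.01)` behaves almost like greedy decoding. Higher temperatures flatten it and let tokens such as "cat" through more often.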

15. Then the whole loop starts again

Frank said:
And once a token is chosen, the whole thing repeats.

GPT said:
Yes.

The chosen token is appended to the sequence. It is already a token ID, so it does not need to be tokenized again: its embedding row is looked up and processed in context with the prior tokens, producing a new final hidden state for the new position, new logits, new probabilities, and a new next token.

That is autoregressive generation.

One token at a time, but each token is chosen from a huge contextual geometric calculation.
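
The loop itself is short. In this sketch `toy_model` is a hypothetical stand-in for the entire forward pass (embedding, transformer layers, output projection); it simply favors the ID one higher than the last token so that the loop has something to run on.

```python
def generate(prompt_ids, next_token_logits, steps):
    """Greedy autoregressive loop: score, pick, append, repeat."""
    ids = list(prompt_ids)
    for _ in range(steps):
        logits = next_token_logits(ids)   # forward pass over the sequence so far
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)               # the chosen token becomes part of the context
    return ids

def toy_model(ids):
    """Hypothetical stand-in model over a 4-token vocabulary."""
    vocab_size = 4
    favored = (ids[-1] + 1) % vocab_size
    return [1.0 if i == favored else 0.0 for i in range(vocab_size)]

result = generate([0], toy_model, steps=5)   # -> [0, 1, 2, 3, 0, 1]
```

Every pass through the loop sees the whole sequence so far, including the tokens the model itself just produced.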

16. Returning to the earlier confusion: is the embedding table part of the ANN?

Frank said:
Let me circle back. Since the embedding table is trained by backpropagation just like the ANN weights, does that not make it part of the ANN?

GPT said:
In a broad engineering sense, yes, it is part of the model’s learned parameter system.

But in a narrower structural sense, it differs from the usual concept of a computational layer of neurons.

That is why the best answer is:

yes, it is part of the trained model and learned by the same optimization process
no, it is not best described as the first firing layer of ANN nodes

It is the learned input representation matrix.

Frank said:
So it belongs to the network’s learned machinery, but not to the same category as a normal hidden layer.

GPT said:
Exactly.

17. Why the dimensions match stage width but not fixed neurons across the whole model

Frank said:
And the dimensionality issue?

GPT said:
At each stage, the activation vector’s dimension usually equals the width of that stage. But those dimensions are not best understood as fixed named neurons with persistent identity across the whole model.

Why? Because the model continuously remixes those dimensions through learned projections. A given coordinate in one layer is not a simple permanent semantic object. Meaning is distributed. Features are combined, split, rotated, and recombined.

So while one scalar activation often corresponds to one unit slot at that stage, you should not imagine that slot as a fixed concept neuron with stable identity from start to finish.

18. What the ANN really stores

Frank said:
What exactly is stored in the network, then, if not records or facts?

GPT said:
What the network stores is a giant learned field of response tendencies.

It stores patterns in the weights such that certain input directions cause certain transformations, attention relationships, and output preferences.

So it does not store “Paris is the capital of France” the way a database stores a row.

Instead it stores a learned pattern such that when the context activates the appropriate geometry, the output directions associated with “Paris” align strongly.

That is why the model feels like it knows facts, but what it really has is a learned transformation landscape.

Frank said:
So the ANN stores the machinery of patterned response, not a shelf of propositions.

GPT said:
Beautifully said.

19. Why dot product is so central

Frank said:
I keep coming back to dot product.

GPT said:
And rightly so. Dot product is one of the great hidden engines of the whole system.

It shows up in several key places:

inside attention, comparing queries to keys
inside linear layers, as the basic operation behind matrix multiplication
at the output layer, comparing the final hidden state against output token directions

Dot product is the mathematical act of asking:

how much does this current vector align with that learned direction?

That is why it matters so much. The whole model is built around learned directional compatibilities in high-dimensional space.

20. So what is the physical story in plain English?

Frank said:
Give me the whole thing again in plain causal English.

GPT said:
Here it is.

A token comes in as a symbol. The tokenizer turns it into an ID. The embedding table turns that ID into a learned starting vector. That vector becomes the token’s initial semantic coordinate inside the model.

Then the transformer layers repeatedly reshape that coordinate. Attention lets the token consult earlier tokens by comparing directional patterns. The multilayer perceptron reshapes the resulting state through learned semantic circuitry. Residual connections preserve and accumulate the evolving token state as it passes upward through the network.

By the top layer, the token position is represented not by its raw embedding anymore, but by a final contextual hidden state. That hidden state is then compared against output directions for every token in the vocabulary. Those comparisons produce raw scores called logits. Softmax turns those scores into probabilities. A decoding rule then chooses one token as the next output. That token is appended to the sequence, and the process repeats.

Frank said:
That is the chain.

GPT said:
Yes. That is the chain.

21. The deeper geometric interpretation

Frank said:
Now say it in the more geometric way I tend to think.

GPT said:
Gladly.

The embedding table assigns each token a starting point in semantic space. The transformer does not merely move that point through static layers. It repeatedly reinterprets that point in relation to the other token-points in context.

Attention is a temporary geometry engine. It asks which earlier token-directions are most relevant to the present token’s needs. The MLP is a feature-activation engine. It bends and reshapes the current semantic position according to learned response tendencies embedded in the weight space.

The residual stream acts like a running semantic current, carrying forward the token’s evolving state through layer after layer.

By the end, the token’s position in semantic space has become a context-rich directional state. The output matrix then asks: among all vocabulary token-directions, which one is most aligned with this final contextual direction? That set of alignments becomes the logits. Under greedy decoding, the best-aligned direction becomes the next token; under sampling, it is the most probable candidate rather than a guaranteed pick.

Frank said:
So the model is not retrieving a record. It is steering a contextual state through a learned semantic field until one output token-direction emerges as the best extension.

GPT said:
Yes. That is one of the cleanest formulations you have said.

22. Why the output vector is bigger than the hidden vector

Frank said:
One thing still nags me. The final hidden state may be 4096-dimensional, but the output vector can be 50,000-dimensional. How can a smaller vector determine a larger vector?

GPT said:
Because the output matrix is a learned mapping from hidden space to vocabulary space.

A 4096-dimensional vector can be projected into 50,000-dimensional output space by multiplying it with a matrix whose job is exactly that. Each output dimension corresponds to one token score.

So the model is not somehow expanding the hidden vector by magic. It is using a learned projection that asks 50,000 separate compatibility questions of the final hidden state.

Each vocabulary token has an associated output direction, and the hidden state is compared to each of them.

That is how one contextual state can produce one score per token.

23. Why the model’s answer can change with tiny prompt changes

Frank said:
This also explains why tiny prompt changes can alter the output.

GPT said:
Exactly. A tiny change in the prompt changes which tokens are in the sequence, and therefore which embedding rows are retrieved. That alters attention patterns, which alters hidden states, which alters the final contextual vector, which changes output-token alignment scores.

Because the whole process is geometric and context-sensitive, small differences in prompt wording can shift the final hidden state enough to change which output directions win.

That is why prompt phrasing matters.
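
A toy illustration of that sensitivity, reusing the invented output vectors from the six-token example. The "nudged" hidden state is made up by hand; in a real model the shift would come from the changed prompt flowing through every layer.

```python
output_vectors = {
    "the":     [1, 0, 0, 1],
    "cat":     [1, 1, 0, 1],
    "runs":    [0, 1, 1, 1],
    "blue":    [1, 0, 1, 0],
    "quickly": [0, 1, 0, 2],
    "banana":  [2, 0, 1, 0],
}

def top_token(h):
    """Return the token whose output vector has the largest dot product with h."""
    scores = {token: sum(a * b for a, b in zip(h, vec))
              for token, vec in output_vectors.items()}
    return max(scores, key=scores.get)

h_original = [2, 1, 0, 3]      # the toy hidden state: "quickly" wins
h_nudged = [2.6, 1, 0, 2.4]    # a hand-made shift: "cat" now wins

before = top_token(h_original)
after = top_token(h_nudged)
```

Two coordinates moved by 0.6 each, and the winner changed, because "quickly" and "cat" were only one logit apart to begin with.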

24. What all of this means in one sentence

Frank said:
Compress it all into one sentence.

GPT said:
An LLM takes token embeddings as learned starting vectors, repeatedly transforms them through attention and multilayer perceptron circuitry into contextual hidden states, and then scores all possible next tokens by comparing the final hidden state against learned output token directions.

Frank said:
That is the whole engine.

GPT said:
Yes.

25. Final wrap-up

Frank said:
Let me try one final summary.

The embedding table is not exactly the first layer of neural nodes, but a learned input lookup matrix that gives each token its starting vector. The dimensions of an activation vector usually match the width of the computational stage it is in, though those dimensions are not permanent concept neurons across the whole network. The embedding vector then enters the transformer, where attention compares token directions to relevant context and the multilayer perceptron reshapes the token state through learned feature circuitry. After many such transformations, the token’s final hidden state is projected into vocabulary space, where dot-product-like comparisons with token output directions generate logits. Those logits become probabilities, and one token is selected as the next output.

How did I do?

GPT said:
You did extremely well.

I would only tighten one phrase, and it is the one you opened with. Instead of saying the embedding vector is “dot-producted through the ANN,” I would say it is repeatedly transformed by matrix operations built from many dot products. That is more exact.

But other than that, yes, you have the core mechanism.

Frank said:
So the story is not:

“the model stores facts and looks them up.”

It is:

“the model stores learned geometric response tendencies, uses those to transform token vectors in context, and then chooses the next token by measuring which output direction best fits the final contextual state.”

GPT said:
Exactly.

That is the machine.

That is how a learned vector becomes a living activation, how a living activation becomes a contextual hidden state, and how a contextual hidden state becomes the next token.
