Two Attention Heads, Real Tensor Shapes, One Transformer Block – A deeper Frank-said / GPT-said walkthrough with explicit shapes and a more realistic flow

Frank said:

Do that.

GPT said:

Good. Now we will move from the tiny cartoon version to a more realistic transformer-style example.

Not giant real-model scale, but realistic enough to show:

  • 2 attention heads
  • explicit tensor shapes
  • how one token is split across heads
  • how head outputs are recombined
  • how the MLP processes the result

We will still use toy numbers, but the structure will now look much more like what actually happens.

1. The setup

GPT said:

Let’s say we have:

  • batch size b = 1
  • sequence length n = 3
  • model dimension d_{\text{model}} = 4
  • number of heads h = 2
  • so head dimension d_{\text{head}} = 2

Sequence:

[The, cat, slept]

We will still track the token:

cat

The hidden state tensor entering the block is:

X \in \mathbb{R}^{b \times n \times d_{\text{model}}}

So here:

X \in \mathbb{R}^{1 \times 3 \times 4}

Let the token vectors be:

x_{\text{The}} = [1,0,1,0]

x_{\text{cat}} = [0,1,1,0]

x_{\text{slept}} = [0,1,0,1]

So the tensor is:

X = \begin{bmatrix} [1,0,1,0] \\ [0,1,1,0] \\ [0,1,0,1] \end{bmatrix}

with an implicit batch dimension of 1.
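
If it helps to see the same setup concretely, here is a minimal NumPy sketch of this hidden-state tensor (the values are exactly the toy numbers above; NumPy and the variable names are only conveniences for illustration):

```python
import numpy as np

# Toy hidden-state tensor entering the block: [batch, seq_len, d_model] = [1, 3, 4]
X = np.array([[[1, 0, 1, 0],    # "The"
               [0, 1, 1, 0],    # "cat"
               [0, 1, 0, 1]]],  # "slept"
             dtype=float)

print(X.shape)  # (1, 3, 4)
```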

Frank said:

So the overall shape is [1,3,4]. One sequence, three tokens, four latent features per token.

GPT said:

Exactly.

That is the raw hidden-state block entering attention.

2. Linear projections into Q, K, V

GPT said:

The model applies learned matrices:

W_Q,\; W_K,\; W_V \in \mathbb{R}^{4 \times 4}

Why 4 \times 4?

Because the model dimension is 4, and the total query/key/value dimension across both heads is also 4.

Then later the model reshapes that 4 into:

  • 2 heads
  • each of dimension 2

So first the projection happens in full model space, then the result is split into heads.

Let:

W_Q = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}

W_K = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 \end{bmatrix}

W_V = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix}

Now compute:

Q = XW_Q,\quad K = XW_K,\quad V = XW_V

Since X is shape [1,3,4], each result is also shape:

[1,3,4]
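
As a rough sketch, the same projections in NumPy, with the matrices copied from above (in a real model these weights are learned; the per-token results worked out in the next section are just the rows of Q, K, V):

```python
import numpy as np

X = np.array([[[1, 0, 1, 0], [0, 1, 1, 0], [0, 1, 0, 1]]], dtype=float)

W_Q = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, -1, 0], [0, 1, 0, -1]], dtype=float)
W_K = np.array([[1, 1, 0, 0], [0, 1, 1, 0], [1, 0, 0, 1], [0, 1, 0, 1]], dtype=float)
W_V = np.array([[1, 0, 1, 0], [1, 0, 0, 1], [0, 1, 1, 0], [0, 1, 0, 1]], dtype=float)

# Batched matmul: [1, 3, 4] @ [4, 4] -> [1, 3, 4]
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

print(Q.shape, K.shape, V.shape)  # (1, 3, 4) (1, 3, 4) (1, 3, 4)
print(Q[0, 1])  # q_cat = [ 1.  1. -1.  1.]
print(K[0, 1])  # k_cat = [1. 1. 1. 1.]
print(V[0, 1])  # v_cat = [1. 1. 1. 1.]
```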

3. Compute Q, K, V for each token

For “The” = [1,0,1,0]

Query

q_{\text{The}} = [1,0,1,0]W_Q = [2,0,0,0]

Key

k_{\text{The}} = [1,0,1,0]W_K = [2,1,0,1]

Value

v_{\text{The}} = [1,0,1,0]W_V = [1,1,2,0]

For “cat” = [0,1,1,0]

Query

q_{\text{cat}} = [0,1,1,0]W_Q = [1,1,-1,1]

Key

k_{\text{cat}} = [0,1,1,0]W_K = [1,1,1,1]

Value

v_{\text{cat}} = [0,1,1,0]W_V = [1,1,1,1]

For “slept” = [0,1,0,1]

Query

q_{\text{slept}} = [0,1,0,1]W_Q = [0,2,0,0]

Key

k_{\text{slept}} = [0,1,0,1]W_K = [0,2,1,1]

Value

v_{\text{slept}} = [0,1,0,1]W_V = [1,1,0,2]

Frank said:

So after projection, each token still has 4 numbers, but now those numbers mean “query-space coordinates,” “key-space coordinates,” or “value-space coordinates.”

GPT said:

Exactly.

The original embedding dimensions have been remixed into new functional spaces.

4. Split into two heads

GPT said:

Now each 4-dimensional Q, K, V vector is split into 2 heads of size 2.

So the shape changes from:

[1,3,4] \rightarrow [1,3,2,2]

and then usually transposed to:

[1,2,3,2]

meaning:

  • batch = 1
  • heads = 2
  • sequence positions = 3
  • per-head feature size = 2

That is a very standard transformer shape.

Head 1 = first 2 components

Head 2 = last 2 components

For cat:

q_{\text{cat}} = [1,1 \mid -1,1]

So:

  • head 1 query = [1,1]
  • head 2 query = [-1,1]

For keys:

  • The: [2,1 \mid 0,1]
  • cat: [1,1 \mid 1,1]
  • slept: [0,2 \mid 1,1]

For values:

  • The: [1,1 \mid 2,0]
  • cat: [1,1 \mid 1,1]
  • slept: [1,1 \mid 0,2]

Frank said:

So each head gets only part of the projected representation.

GPT said:

Yes. That is the key idea.

Each head sees a different low-dimensional slice of the token’s projected state. In real models, heads can also be implemented as separate learned projections, but conceptually the result is the same: multiple parallel representational channels.
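
A minimal sketch of the split-into-heads reshape, using the Q values computed above (the [batch, heads, seq, head_dim] transpose order is one common convention; implementations vary):

```python
import numpy as np

b, n, h, d_head = 1, 3, 2, 2

# Projected queries from above, shape [1, 3, 4]
Q = np.array([[[2, 0, 0, 0],     # The
               [1, 1, -1, 1],    # cat
               [0, 2, 0, 0]]],   # slept
             dtype=float)

Q_heads = Q.reshape(b, n, h, d_head)      # [1, 3, 2, 2]
Q_heads = Q_heads.transpose(0, 2, 1, 3)   # [1, 2, 3, 2]

print(Q_heads.shape)     # (1, 2, 3, 2)
print(Q_heads[0, 0, 1])  # head 1 query for cat: [1. 1.]
print(Q_heads[0, 1, 1])  # head 2 query for cat: [-1.  1.]
```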

5. Head 1 attention for “cat”

GPT said:

Now let’s do actual attention for cat in head 1.

Head 1 query for cat:

q^{(1)}_{\text{cat}} = [1,1]

Head 1 keys:

  • The: [2,1]
  • cat: [1,1]
  • slept: [0,2]

Now compute dot products.

Score vs The

[1,1]\cdot[2,1] = 3

Score vs cat

[1,1]\cdot[1,1] = 2

Score vs slept

[1,1]\cdot[0,2] = 2

Raw scores:

[3,2,2]

Divide by \sqrt{d_{\text{head}}} = \sqrt{2}\approx1.414:

[2.12,1.41,1.41]

Softmax gives approximately:

[0.50,0.25,0.25]

Head 1 weighted value mix

Head 1 values:

  • The: [1,1]
  • cat: [1,1]
  • slept: [1,1]

Weighted sum:

z^{(1)}_{\text{cat}} = 0.50[1,1] + 0.25[1,1] + 0.25[1,1] = [1,1]

So head 1 output for cat is:

z^{(1)}_{\text{cat}} = [1,1]
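
Here is the same head-1 computation as a small sketch, with the softmax written out explicitly (the exact weights are closer to [0.504, 0.248, 0.248] before rounding):

```python
import numpy as np

d_head = 2
q_cat_h1 = np.array([1.0, 1.0])                          # head 1 query for "cat"
K_h1 = np.array([[2, 1], [1, 1], [0, 2]], dtype=float)   # head 1 keys: The, cat, slept
V_h1 = np.array([[1, 1], [1, 1], [1, 1]], dtype=float)   # head 1 values

scores = K_h1 @ q_cat_h1 / np.sqrt(d_head)               # scaled dot products: [2.12, 1.41, 1.41]
weights = np.exp(scores) / np.exp(scores).sum()

print(weights)         # approx [0.504 0.248 0.248]
print(weights @ V_h1)  # head 1 output for cat: [1. 1.]
```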

6. Head 2 attention for “cat”

GPT said:

Now we do the same for head 2.

Head 2 query for cat:

q^{(2)}_{\text{cat}} = [-1,1]

Head 2 keys:

  • The: [0,1]
  • cat: [1,1]
  • slept: [1,1]

Now compute scores.

Score vs The

[-1,1]\cdot[0,1] = 1

Score vs cat

[-1,1]\cdot[1,1] = 0

Score vs slept

[-1,1]\cdot[1,1] = 0

Raw scores:

[1,0,0]

Divide by \sqrt{2}:

[0.71,0,0]

Softmax:

Approximate exponentials:

e^{0.71}\approx2.03,\quad e^0=1,\quad e^0=1

Total:

2.03+1+1=4.03

Weights:

[0.50,0.25,0.25]

Interesting: this head also ends up preferring The, but for a different geometric reason.

Head 2 weighted value mix

Head 2 values:

  • The: [2,0]
  • cat: [1,1]
  • slept: [0,2]

Now compute:

z^{(2)}_{\text{cat}} = 0.50[2,0] + 0.25[1,1] + 0.25[0,2]

First coordinate:

0.50\cdot2 + 0.25\cdot1 + 0.25\cdot0 = 1.25

Second coordinate:

0.50\cdot0 + 0.25\cdot1 + 0.25\cdot2 = 0.75

So:

z^{(2)}_{\text{cat}} = [1.25,0.75]
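
Both heads can also be computed in one shot on the [1, 2, 3, 2] tensors; a sketch with the split values from above hard-coded:

```python
import numpy as np

d_head = 2

# [batch, heads, seq, d_head] tensors built from the head split above
Q = np.array([[[[2, 0], [1, 1], [0, 2]],                  # head 1 queries: The, cat, slept
               [[0, 0], [-1, 1], [0, 0]]]], dtype=float)  # head 2 queries
K = np.array([[[[2, 1], [1, 1], [0, 2]],                  # head 1 keys
               [[0, 1], [1, 1], [1, 1]]]], dtype=float)   # head 2 keys
V = np.array([[[[1, 1], [1, 1], [1, 1]],                  # head 1 values
               [[2, 0], [1, 1], [0, 2]]]], dtype=float)   # head 2 values

scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_head)    # [1, 2, 3, 3]
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
Z = weights @ V                                            # [1, 2, 3, 2]

print(Z[0, 0, 1])  # head 1 output for cat: [1. 1.]
print(Z[0, 1, 1])  # head 2 output for cat: approx [1.25 0.75]
```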

Frank said:

Now the two heads are not identical anymore.

GPT said:

Exactly. That is the whole point.

Head 1 and head 2 are giving different contextual summaries of the same token.

  • head 1 gave: [1,1]
  • head 2 gave: [1.25,0.75]

So each head is a different contextual lens.

7. Concatenate the head outputs

GPT said:

Now concatenate the two head outputs:

z_{\text{cat}} = [1,1,1.25,0.75]

This restores the full model dimension of 4.

Shape-wise:

  • each head output: [2]
  • concat across 2 heads: [4]

Across the whole batch and sequence, the tensor would go from:

[1,2,3,2] \rightarrow [1,3,4]

after transposing and concatenating.
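
The recombination is just the reverse of the split: transpose back and flatten the head dimension into the feature dimension. A minimal sketch, filling in only the "cat" row we have been tracking:

```python
import numpy as np

b, n, h, d_head = 1, 3, 2, 2

# Per-head outputs, [batch, heads, seq, d_head]; only the "cat" row is filled in
Z_heads = np.zeros((b, h, n, d_head))
Z_heads[0, 0, 1] = [1.0, 1.0]     # head 1 output for cat
Z_heads[0, 1, 1] = [1.25, 0.75]   # head 2 output for cat

Z = Z_heads.transpose(0, 2, 1, 3).reshape(b, n, h * d_head)   # [1, 3, 4]

print(Z.shape)  # (1, 3, 4)
print(Z[0, 1])  # z_cat = [1.   1.   1.25 0.75]
```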

8. Output projection

GPT said:

Now apply the output matrix W_O \in \mathbb{R}^{4 \times 4}.

Let:

W_O = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}

Now compute:

o_{\text{cat}} = z_{\text{cat}}W_O

with:

z_{\text{cat}} = [1,1,1.25,0.75]

Now each output coordinate:

Dim 1

1\cdot1 + 1\cdot0 + 1.25\cdot1 + 0.75\cdot0 = 2.25

Dim 2

1\cdot0 + 1\cdot1 + 1.25\cdot0 + 0.75\cdot1 = 1.75

Dim 3

1\cdot1 + 1\cdot0 + 1.25\cdot(-1) + 0.75\cdot0 = -0.25

Dim 4

1\cdot0 + 1\cdot1 + 1.25\cdot0 + 0.75\cdot(-1) = 0.25

So:

o_{\text{cat}} = [2.25,1.75,-0.25,0.25]
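
The output projection as a sketch, with W_O copied from above:

```python
import numpy as np

W_O = np.array([[1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, -1, 0],
                [0, 1, 0, -1]], dtype=float)

z_cat = np.array([1.0, 1.0, 1.25, 0.75])   # concatenated head outputs for "cat"

print(z_cat @ W_O)  # [ 2.25  1.75 -0.25  0.25]
```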

9. Residual add after attention

GPT said:

Original token state:

x_{\text{cat}} = [0,1,1,0]

Add attention output:

x'_{\text{cat}} = [0,1,1,0] + [2.25,1.75,-0.25,0.25]

So:

x'_{\text{cat}} = [2.25,2.75,0.75,0.25]

Frank said:

So after multi-head attention, “cat” has become [2.25,2.75,0.75,0.25].

GPT said:

Yes. That is its new context-aware hidden state after the attention sublayer.

10. Feedforward network with explicit shape

GPT said:

Now we do the MLP.

Let the feedforward expansion dimension be:

d_{\text{ff}} = 6

So the first MLP matrix has shape:

W_1 \in \mathbb{R}^{4 \times 6}

and the second:

W_2 \in \mathbb{R}^{6 \times 4}

This is realistic in structure: expand, apply nonlinearity, compress.

Let:

W_1 = \begin{bmatrix} 1 & 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 & 0 & 1 \\ 1 & 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 & 1 & 0 \end{bmatrix}

Take:

x'_{\text{cat}} = [2.25,2.75,0.75,0.25]

Now compute:

h = x'_{\text{cat}}W_1

Let’s do each neuron.

Neuron 1

2.25\cdot1 + 2.75\cdot0 + 0.75\cdot1 + 0.25\cdot0 = 3.0

Neuron 2

2.25\cdot0 + 2.75\cdot1 + 0.75\cdot1 + 0.25\cdot1 = 3.75

Neuron 3

2.25\cdot1 + 2.75\cdot1 + 0.75\cdot0 + 0.25\cdot0 = 5.0

Neuron 4

2.25\cdot0 + 2.75\cdot0 + 0.75\cdot1 + 0.25\cdot1 = 1.0

Neuron 5

2.25\cdot1 + 2.75\cdot0 + 0.75\cdot0 + 0.25\cdot1 = 2.5

Neuron 6

2.25\cdot0 + 2.75\cdot1 + 0.75\cdot0 + 0.25\cdot0 = 2.75

So:

h = [3.0, 3.75, 5.0, 1.0, 2.5, 2.75]

11. Apply nonlinearity

GPT said:

We will again use ReLU just for clarity.

All entries are positive, so:

h = [3.0, 3.75, 5.0, 1.0, 2.5, 2.75]

stays the same.

12. Project back down

Let:

W_2 = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{bmatrix}

Now compute:

y = hW_2

with:

h = [3.0,3.75,5.0,1.0,2.5,2.75]

Now each output coordinate.

Output 1

3.0\cdot1 + 3.75\cdot0 + 5.0\cdot1 + 1.0\cdot0 + 2.5\cdot1 + 2.75\cdot0 = 10.5

Output 2

3.0\cdot0 + 3.75\cdot1 + 5.0\cdot1 + 1.0\cdot0 + 2.5\cdot0 + 2.75\cdot1 = 11.5

Output 3

3.0\cdot1 + 3.75\cdot0 + 5.0\cdot0 + 1.0\cdot1 + 2.5\cdot0 + 2.75\cdot1 = 6.75

Output 4

3.0\cdot0 + 3.75\cdot1 + 5.0\cdot0 + 1.0\cdot1 + 2.5\cdot1 + 2.75\cdot0 = 7.25

So:

y = [10.5,11.5,6.75,7.25]

13. Final residual add

GPT said:

Now add this MLP output back to the post-attention state:

x'_{\text{cat}} = [2.25,2.75,0.75,0.25]

y = [10.5,11.5,6.75,7.25]

So final output of the block for cat is:

x^{\text{next}}_{\text{cat}} = [12.75,14.25,7.5,7.5]
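
For completeness, a sketch of the MLP and the final residual on the "cat" row (matrices copied from above; a real block would also include bias terms and layer normalization, which this toy example omits):

```python
import numpy as np

W_1 = np.array([[1, 0, 1, 0, 1, 0],
                [0, 1, 1, 0, 0, 1],
                [1, 1, 0, 1, 0, 0],
                [0, 1, 0, 1, 1, 0]], dtype=float)
W_2 = np.array([[1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 1, 0, 0],
                [0, 0, 1, 1],
                [1, 0, 0, 1],
                [0, 1, 1, 0]], dtype=float)

x_cat = np.array([2.25, 2.75, 0.75, 0.25])   # state after attention + residual

h = np.maximum(x_cat @ W_1, 0.0)   # expand to d_ff = 6, then ReLU
y = h @ W_2                        # compress back to d_model = 4

print(h)           # [3.   3.75 5.   1.   2.5  2.75]
print(y)           # [10.5  11.5   6.75  7.25]
print(x_cat + y)   # block output for cat: [12.75 14.25  7.5   7.5 ]
```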

Frank said:

So this one token went through two heads, got split into two different relational viewpoints, got recombined, then got passed through a 6-neuron feedforward expansion and compressed back down.

GPT said:

Exactly.

That is a much closer structural picture of a real transformer block.

14. The tensor-shape story in one place

Frank said:

Now summarize the tensor shapes clearly.

GPT said:

Good. Here is the full shape flow.

Input hidden states

X: [b,n,d_{\text{model}}] = [1,3,4]

After Q, K, V projections

Q,K,V: [1,3,4]

After split into 2 heads of size 2

Q,K,V: [1,3,2,2]

After transpose for head-wise attention

Q,K,V: [1,2,3,2]

That means:

  • batch 1
  • 2 heads
  • 3 token positions
  • 2 features per head

Attention score tensor

For each head, query positions compare to key positions:

\text{scores}: [1,2,3,3]

Why [3,3]?

Because each of the 3 query positions can attend to each of the 3 key positions.

Weighted head outputs

Z_{\text{heads}}: [1,2,3,2]

After recombining heads

Z: [1,3,4]

After output projection

O: [1,3,4]

After MLP expansion

H: [1,3,6]

After MLP compression

Y: [1,3,4]

Final block output

X_{\text{next}}: [1,3,4]

So the block begins and ends in model space [1,3,4], but internally it temporarily unfolds into:

  • head space
  • score space
  • expanded neuron space
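
Putting the whole shape flow together, here is a minimal end-to-end sketch of one block with the shape of each stage noted in comments (random weights, no biases or layer normalization; purely to make the shape bookkeeping concrete):

```python
import numpy as np

def toy_block(X, h=2, d_ff=6, seed=0):
    """One toy transformer block: multi-head attention + MLP, with residual adds."""
    rng = np.random.default_rng(seed)
    b, n, d_model = X.shape
    d_head = d_model // h
    W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) for _ in range(4))
    W_1 = rng.standard_normal((d_model, d_ff))
    W_2 = rng.standard_normal((d_ff, d_model))

    def split(T):
        # [b, n, d_model] -> [b, h, n, d_head]
        return T.reshape(b, n, h, d_head).transpose(0, 2, 1, 3)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)

    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_head)         # [b, h, n, n]
    weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
    Z = weights @ V                                                 # [b, h, n, d_head]
    Z = Z.transpose(0, 2, 1, 3).reshape(b, n, d_model)              # [b, n, d_model]

    X = X + Z @ W_O              # residual after attention
    H = np.maximum(X @ W_1, 0)   # expand to [b, n, d_ff], ReLU
    return X + H @ W_2           # residual after MLP, back to [b, n, d_model]

print(toy_block(np.ones((1, 3, 4))).shape)  # (1, 3, 4)
```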

15. What the two heads are really doing

Frank said:

What is the conceptual value of the two heads?

GPT said:

Each head is a different relational microscope.

In our toy example:

  • head 1 produced one contextual summary of “cat”
  • head 2 produced another

Then the model combined them.

That means one token is not interpreted through one single contextual question. It is interpreted through several parallel contextual questions at once.

In a real model, one head might lean toward:

  • local syntax

while another leans toward:

  • semantic role

another toward:

  • long-range dependency

another toward:

  • discourse structure

The token representation gets richer because multiple relational viewpoints are fused.

16. What the MLP neurons are doing

Frank said:

And the MLP?

GPT said:

The MLP is like a feature refinery.

Attention answers:

Who should influence me?

The MLP answers:

Given that updated context, what hidden feature combinations should now become strong?

That is why the model expands from 4 dimensions to 6 neurons in this example.

The larger space gives it room to build more feature combinations before folding the result back into model space.

17. The deepest interpretation

Frank said:

Give me the deepest interpretation of what we just watched.

GPT said:

Here it is:

A token begins as a compact latent state. The transformer block does not merely pass it along. It first re-expresses that token in multiple relational subspaces through Q, K, and V projections. Then multiple attention heads let the token compare itself to the entire sequence from different geometric perspectives. These head-specific contextual summaries are fused back into one state. Then the MLP expands that state into a larger neuron space where more complex feature combinations can be activated. Finally, the result is compressed back into the model’s hidden dimension and added to the running token state. What exits the block is not the original token, but a more context-shaped semantic object.

18. Final Frank-said / GPT-said closing

Frank said:

So the embedding dimensions are the raw coordinates, the weight matrices define how those coordinates get remixed, the attention heads let the token enter multiple parallel relational conversations, and the MLP lets the updated token ignite higher-order neuron patterns before being written back into hidden state space.

GPT said:

Exactly.

That is the transformer block in one sentence.

Or even shorter:

the token is projected, compared, blended, expanded, refined, and returned.

Frank said:

So one block is really one round of multi-perspective contextual negotiation followed by feature consolidation.

GPT said:

Perfectly said.

That is exactly what it is.

