How a Large Language Model Thinks: From Training to Output – An AI Primer

Introduction

Large Language Models (LLMs) like GPT don’t “know” facts the way a person does, nor do they “read” in the way we imagine. Instead, they are vast probability machines. They learn statistical patterns in language, encode those patterns into a giant artificial neural network (ANN), and then use mathematics to generate likely words one after another. To understand this, we need to follow the journey from training, through tokenization, into the mathematics of neural networks—especially the humble dot product—and finally into how probabilities turn into fluent sentences.


1. Training: Reading Without Remembering

Imagine giving a child every book, newspaper, and web page on Earth, but instead of asking them to remember specific sentences, asking them only to notice:

  • Which words tend to appear together?
  • What kinds of sentences follow each other?
  • What tone or rhythm usually accompanies certain topics?

That’s essentially what training does for an LLM. During training, the model processes trillions of words, not to memorize them, but to distill them into statistical relationships.

This process involves feeding text through the network, asking it to guess the next word (or token), and then checking how wrong it was. The difference between its guess and the correct answer creates an error signal, which is used to adjust the internal weights of the ANN. Do this billions of times, and the network becomes astonishingly good at predicting the next word in almost any context.
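
To make that update step concrete, here is a minimal numpy sketch of a single training step. The five-word vocabulary, the single weight matrix standing in for the whole network, and the learning rate are all invented for illustration; a real LLM has many layers and billions of weights, but the guess-measure-adjust cycle is the same shape.

import numpy as np

# Toy setup: a 5-token vocabulary and one weight matrix standing in for the
# whole network (a real LLM has billions of weights spread over many layers).
vocab = ["the", "cat", "sat", "on", "mat"]
V = len(vocab)
W = np.random.default_rng(0).normal(scale=0.1, size=(V, V))  # current token -> next-token scores

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# One training step: guess the next token, measure how wrong the guess was,
# and nudge the weights to reduce the error.
current, target = vocab.index("sat"), vocab.index("on")
x = np.eye(V)[current]                         # one-hot input for "sat"
probs = softmax(W.T @ x)                       # predicted distribution over next tokens
loss = -np.log(probs[target])                  # cross-entropy: the error signal

grad = np.outer(x, probs - np.eye(V)[target])  # how the error changes with each weight
W -= 0.5 * grad                                # adjust the weights (gradient descent)
print(f"loss for this example: {loss:.3f}")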


2. Tokenization: Breaking Language Into Building Blocks

Before the math can work, the text must be broken into tokens. A token is the smallest unit the model understands. It could be:

  • A whole word (like “dog”).
  • Part of a word (“ing”).
  • Or even punctuation (“,”).

For example, the sentence:
“The unbelievable truth.”

might be tokenized as:

  • “The”
  • “un”
  • “believe”
  • “able”
  • “truth”
  • “.”

Why break words down this way? Because it gives the model flexibility. English, like all languages, is full of prefixes, suffixes, and variations. By breaking text into tokens, the model can handle new words it never saw in training (e.g., “electro-unbelievability”) by recognizing familiar pieces.
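
As a sketch of the idea, here is a toy greedy longest-match tokenizer. The subword vocabulary below is invented for this example; real tokenizers (for instance, byte-pair encoding) learn their pieces from the training data and may split the same word differently.

# Invented subword vocabulary; real tokenizers learn theirs from data.
VOCAB = {"the", "un", "believ", "able", "truth", ".", " "}

def tokenize(text):
    tokens, i = [], 0
    text = text.lower()
    while i < len(text):
        # Take the longest vocabulary piece that matches at position i.
        for end in range(len(text), i, -1):
            if text[i:end] in VOCAB:
                tokens.append(text[i:end])
                i = end
                break
        else:
            tokens.append(text[i])   # unknown character: fall back to a single character
            i += 1
    return [t for t in tokens if t != " "]

print(tokenize("The unbelievable truth."))
# ['the', 'un', 'believ', 'able', 'truth', '.']

Different vocabularies make different splits (here "believ" rather than "believe"), and that is the point: the model only needs familiar fragments, not whole words.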

Each token is then mapped to a list of numbers called an embedding. The embedding is like a coordinate in a high-dimensional space where words with similar meanings land close together.
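
A minimal version of that lookup, with made-up numbers: each token selects a row of an embedding table, and that row is the token's coordinates. Real models use thousands of dimensions and learn the values during training.

import numpy as np

# Made-up embedding table: one 4-dimensional row per token.
token_to_id = {"the": 0, "un": 1, "believ": 2, "able": 3, "truth": 4, ".": 5}
embedding_table = np.random.default_rng(1).normal(size=(len(token_to_id), 4))

def embed(tokens):
    ids = [token_to_id[t] for t in tokens]
    return embedding_table[ids]               # one vector (row) per token

vectors = embed(["the", "un", "believ", "able", "truth", "."])
print(vectors.shape)                          # (6, 4): six tokens, four coordinates each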


3. Matrices: The Language of Neural Networks

At its core, an LLM is nothing more than a gigantic machine for multiplying matrices. A matrix is just a grid of numbers, like a spreadsheet:

[1 2 3]  
[4 5 6]

When text is turned into embeddings, each embedding is a vector (a list of numbers). The network transforms these vectors by multiplying them with matrices. Each multiplication reshapes the information, highlighting some features and dimming others.

Think of it like shining white light through a series of colored filters. Each filter changes the hue slightly, until finally you get the desired spectrum. In the LLM, the matrices are the filters, and the light is the meaning of the text.
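
In code, one such "filter" is just a matrix-vector multiplication. The example below reuses the small matrix shown above; the embedding values are arbitrary, and a real layer would be thousands of rows and columns wide.

import numpy as np

embedding = np.array([1.0, 2.0, 3.0])    # a 3-number token vector (made-up values)
W = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])           # the matrix from the example above

transformed = W @ embedding               # matrix-vector multiplication: 3 numbers become 2
print(transformed)                        # [14. 32.]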


4. The Dot Product: Measuring Relationships

The most important piece of math in an LLM is the dot product. It’s surprisingly simple:

Take two vectors (say, two word embeddings). Multiply each matching pair of numbers, and add them up.

For example:

Vector A = [1, 2, 3]  
Vector B = [4, 5, 6]  

Dot product = (1×4) + (2×5) + (3×6) = 32

What does this mean? The dot product measures how aligned two vectors are.

  • If the result is large and positive, the vectors point in similar directions → the words are related.
  • If the result is near zero (or negative), the vectors point in unrelated (or opposing) directions → the words are less related.

In language, this is powerful:

  • “king” and “queen” have embeddings that give a strong dot product.
  • “king” and “banana” have embeddings with a weak dot product.
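
The same arithmetic in code, using the toy vectors from the example above rather than real word embeddings. The cosine version, which divides out the vectors' lengths, is the form most often used to compare embeddings.

import numpy as np

A = np.array([1.0, 2.0, 3.0])
B = np.array([4.0, 5.0, 6.0])

dot = np.dot(A, B)                        # (1*4) + (2*5) + (3*6) = 32
cosine = dot / (np.linalg.norm(A) * np.linalg.norm(B))

print(dot)                                # 32.0
print(round(cosine, 3))                   # 0.975 -> the vectors point in similar directions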

Dot products are the backbone of the attention mechanism, which decides which words in a sentence matter most to predicting the next one.


5. Attention: Finding What Matters

Suppose the input is:
“The cat sat on the mat because it was soft.”

What does “it” refer to? The cat, or the mat?

Attention uses dot products between tokens to figure this out. Each word in the sentence is compared to every other word, and the ones with the strongest alignment are given more “attention weight.” In this case, “it” aligns more strongly with “mat” than with “cat,” so the model leans toward “mat” as the reference.

This process is performed through a matrix of dot products called the attention map. Each row and each column corresponds to a token, and the numbers in the matrix say how strongly each token relates to the others. This is how context emerges: the model isn’t just looking at the last word, but at the entire network of relationships in the sentence.
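
Here is a miniature attention map, with made-up 3-dimensional embeddings for four of the tokens. Real transformers first project each embedding into separate query, key, and value vectors; to keep the sketch small, the embeddings are compared directly.

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

tokens = ["cat", "mat", "soft", "it"]
X = np.array([[0.9, 0.1, 0.0],            # "cat"  (all values invented)
              [0.1, 0.9, 0.2],            # "mat"
              [0.0, 0.8, 0.3],            # "soft"
              [0.1, 0.7, 0.2]])           # "it"

scores = X @ X.T / np.sqrt(X.shape[1])    # dot product of every token with every other
attention_map = softmax(scores, axis=-1)  # each row: how strongly a token attends to the rest

row = attention_map[tokens.index("it")]
for t, w in zip(tokens, row):
    print(f"it -> {t}: {w:.2f}")          # "it" puts more weight on "mat" and "soft" than on "cat"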


6. Storing Knowledge in an ANN

So where does all this information live?

The model is essentially a giant artificial neural network with billions (sometimes trillions) of weights. These weights are just numbers stored in matrices. Each weight tells the model how strongly one neuron influences another. After training, the weights collectively encode the statistical “knowledge” of language.

It’s a bit like a massive jigsaw puzzle: no single piece looks like the picture, but together they form a coherent image. In an LLM, no single weight “remembers” a fact. Instead, meaning is distributed across millions of weights, which only reveal their knowledge when activated by the right tokens.


7. Probability: Choosing the Next Token

Once the input has been processed through embeddings, dot products, attention, and transformations, the network produces a probability distribution over all possible next tokens.

For example, after:
“The cat sat on the…”

the model might output:

  • “mat” → 72%
  • “sofa” → 10%
  • “floor” → 7%
  • “president” → 0.01%

Notice that the model doesn’t know the answer—it only assigns probabilities based on its training. To generate text, it picks one token according to these probabilities. If it always picked the highest one, the text would be boring and repetitive. By sampling from the distribution, the model introduces creativity.
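
In code, the network's raw scores (logits) are turned into probabilities with a softmax, and the next token is sampled from that distribution. The scores and token names below are invented; the "temperature" parameter, used in most real systems, controls how adventurous the sampling is.

import numpy as np

tokens = ["mat", "sofa", "floor", "president"]
logits = np.array([4.0, 2.0, 1.7, -5.0])   # invented raw scores from the network

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

probs = softmax(logits)
print(dict(zip(tokens, probs.round(3).tolist())))   # "mat" gets most of the probability

# Sampling (rather than always taking the top token) keeps the text varied.
# Temperatures below 1 sharpen the distribution; above 1 they flatten it.
temperature = 0.8
rng = np.random.default_rng()
print(rng.choice(tokens, p=softmax(logits / temperature)))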


8. Generating Output: One Token at a Time

The chosen token is added to the string, and the whole process repeats:

  1. Look at the updated sentence.
  2. Compute embeddings.
  3. Multiply through matrices.
  4. Use dot products for attention.
  5. Generate new probabilities.
  6. Pick the next token.

This cycle continues until the model outputs a full response. To you, it feels like a flowing thought. To the machine, it’s a rapid series of mathematical guesses, unfolding one token at a time.
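
Here is the whole loop in a deliberately tiny form: a bigram "model" whose probabilities come from counting adjacent words in a toy corpus. A real LLM replaces the counting with the network described above, but the token-by-token generation loop has the same shape.

import numpy as np

corpus = "the cat sat on the mat the cat sat on the sofa".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Count how often each word follows each other word (a stand-in for the network).
counts = np.zeros((len(vocab), len(vocab))) + 0.1    # small smoothing so every row is usable
for a, b in zip(corpus, corpus[1:]):
    counts[idx[a], idx[b]] += 1
probs = counts / counts.sum(axis=1, keepdims=True)   # a next-token distribution per token

def generate(start, n_tokens=6):
    rng = np.random.default_rng()
    out = [start]
    for _ in range(n_tokens):                         # one token at a time
        dist = probs[idx[out[-1]]]                    # probabilities given the last token
        out.append(str(rng.choice(vocab, p=dist)))    # sample the next token
    return " ".join(out)

print(generate("the"))   # e.g. "the cat sat on the mat the" (varies from run to run)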


9. Why It Works So Well

What’s remarkable is that this purely statistical process, based on nothing but matrix multiplication and dot products, creates language that feels intelligent. That’s because language itself is a structure of probabilities: we humans don’t consciously calculate them, but we intuitively know that “peanut butter and jelly” is more likely than “peanut butter and asphalt.” The LLM is simply formalizing that intuition at a massive scale.


Conclusion

A Large Language Model doesn’t think in the way we do. It doesn’t hold facts in a mental filing cabinet. Instead, it’s a probability engine built on the mathematics of vectors, matrices, and dot products.

  • Training teaches it statistical patterns.
  • Tokenization breaks language into manageable pieces.
  • Matrix math transforms those pieces into useful signals.
  • Dot products measure relationships and drive attention.
  • Neural network weights encode the knowledge.
  • Probabilities guide the output.

At the end of the process, the LLM produces language that seems human—not because it understands, but because the mathematics of probability is enough to mimic the structures of thought embedded in language itself.


