LLM training vs. inference

The big picture

Think of a large language model (LLM) as a huge spreadsheet full of dials (its billions of weights).
Two very different activities happen with that spreadsheet:

| Activity | Everyday analogy |
| --- | --- |
| Training | Teaching a class of thousands of students all day, taking notes on everything they say, then rewriting the textbook each night. |
| Inference (using the model) | Asking one of those students a single question and getting a quick answer. |

Because the goals are different, the amount of math the computer must push through is wildly different.


1. What the computer does during training

  1. Reads a mountain of text at once – millions of words bundled into big batches.
  2. Runs the lesson forward – the model tries to predict the next word everywhere in that mountain.
  3. Checks its work – it compares every prediction with the real next word and figures out where it was wrong.
  4. Runs everything backward – it traces each mistake back through every layer to see which dials should move.
  5. Updates the dials – adjusts the spreadsheet so tomorrow it will do a little better.

Doing steps 2-4 means multiplying gigantic grids of numbers together three separate times (forward, backward-through-activations, backward-through-weights). And because we’re feeding in thousands of sentences at once, those grids are enormous.
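The three grid multiplications can be sketched with a toy single-layer model in NumPy. The sizes, learning rate, and squared-error loss below are illustrative assumptions, not the real recipe, but the three matrix multiplies and the dial update are exactly the ones described above:

```python
import numpy as np

rng = np.random.default_rng(0)

batch, d_in, d_out = 64, 512, 512              # toy sizes; real batches are far larger
X = rng.standard_normal((batch, d_in))         # a batch of inputs (the "mountain of text")
W = rng.standard_normal((d_in, d_out)) * 0.01  # the dials (weights)
target = rng.standard_normal((batch, d_out))   # the "real next words"

# Step 2: run the lesson forward — one big matrix multiply
Y = X @ W

# Step 3: check the work — how far off was every prediction?
grad_Y = 2 * (Y - target) / batch              # gradient of a mean-squared loss

# Step 4: run everything backward — TWO more big matrix multiplies
grad_X = grad_Y @ W.T                          # backward through activations
grad_W = X.T @ grad_Y                          # backward through weights

# Step 5: update the dials so tomorrow it does a little better
lr = 1e-3
W -= lr * grad_W
```

Note that the forward pass alone is one grid multiply, while the backward pass needs two, which is where the "three separate times" comes from.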

Rough scale:
One training step for a modern 7-billion-parameter model works out to quadrillions of tiny “multiply-and-add” operations—enough arithmetic to keep hundreds of top-end GPUs busy for minutes.
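That "quadrillions" figure can be checked with back-of-envelope arithmetic, using the common rule of thumb of roughly 6 floating-point operations per parameter per training token (2 forward + 4 backward); the 4-million-token batch size here is an assumption for illustration:

```python
# Back-of-envelope FLOP count for one training step of a 7B-parameter model.
params = 7e9             # 7-billion-parameter model
tokens_per_step = 4e6    # e.g. a 4M-token batch (assumed)
flops_per_step = 6 * params * tokens_per_step   # ~6 FLOPs/param/token rule of thumb
print(f"{flops_per_step:.1e} FLOPs per step")   # ≈ 1.7e17 — hundreds of quadrillions
```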


2. What the computer does during inference

  1. Starts with a prompt – often just a handful of sentences.
  2. Generates one new token at a time.
    • For that single new token it only needs to multiply a single row of numbers by the spreadsheet.
    • It reuses (“caches”) all the hefty calculations it already did for the earlier tokens.
  3. Never goes backward – no error checking, no dial-turning, just forward math once.
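The loop above can be sketched in a highly simplified form. This is not real attention: the cache here just keeps past token states around, and "attending" is faked with a mean, but it shows the key shape of inference — one row × grid multiply per new token, forward only, with earlier work reused:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 256, 1000                  # toy sizes (assumed)
W_out = rng.standard_normal((d_model, vocab)) * 0.02

def generate(prompt_states, n_new):
    """Greedy generation, one token at a time: one row × grid multiply per step."""
    cache = list(prompt_states)             # stand-in for the KV cache of earlier work
    out = []
    for _ in range(n_new):
        h = np.mean(cache, axis=0)          # toy stand-in for attending over the cache
        logits = h @ W_out                  # ONE forward multiply: row (d_model) × grid
        out.append(int(np.argmax(logits)))  # pick the most likely next token
        cache.append(rng.standard_normal(d_model))  # pretend-embed the new token
        # no backward pass, no dial-turning — inference only goes forward
    return out

prompt = [rng.standard_normal(d_model) for _ in range(5)]
tokens = generate(prompt, 3)
```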

Rough scale:
The model only does billions (not quadrillions) of multiply-and-adds per generated token. A single modern GPU can finish that in a few milliseconds.
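The same back-of-envelope style checks out here, using the common estimate of roughly 2 FLOPs per parameter per generated token (one multiply + one add per weight); the GPU throughput figure is an assumed round number:

```python
# Per-token inference cost vs. a single GPU's compute budget.
params = 7e9
flops_per_token = 2 * params               # ≈ 1.4e10: billions, not quadrillions
gpu_flops_per_s = 100e12                   # ~100 TFLOP/s modern GPU (assumed)
print(f"{flops_per_token / gpu_flops_per_s * 1e3:.2f} ms per token (compute only)")
```

In practice the real per-token latency is higher than this compute-only estimate, because (as the table below notes) inference is usually limited by memory bandwidth, not arithmetic.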


3. Why training is so much heavier

| Factor | Training | Inference |
| --- | --- | --- |
| How much text processed at once? | Thousands of tokens in big batches | One new token |
| Math passes per token | Three (forward + two kinds of backward) | One (forward only) |
| Matrix size | Giant grids (because of the big batch) | Skinny row × grid (like a quick lookup) |
| Total arithmetic per second | Millions of times larger | Relatively small |
| Bottleneck | Raw floating-point horsepower | Memory bandwidth (grabbing weights & cached data fast enough) |

So even though the types of math are the same (matrix multiplies and dot products), the scale is not. Training hurls an ocean of numbers at the GPU and asks it to splash through three times; inference sips a cup.
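The "millions of times" gap can be made concrete: per token, training is only about 3× the work (three passes vs. one), but it handles millions of tokens per step while generation handles one. The batch size below is an illustrative assumption:

```python
# Why the gap is "millions of times", per step, not just 3x.
params = 7e9
train_flops_per_step = 6 * params * 4e6    # three passes × a 4M-token batch (assumed)
infer_flops_per_step = 2 * params * 1      # one pass × one new token
print(f"{train_flops_per_step / infer_flops_per_step:,.0f}x more arithmetic per step")
```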


4. A one-sentence takeaway

Training an LLM is like rebuilding an entire library every night, while using the trained LLM is like asking the librarian to fetch you one quote—same skillset, but the first job is millions of times more labor-intensive.

