How an AI Chooses Its Next Word: The Heat and Dice of Probability



When a large language model (LLM) like ChatGPT writes a sentence, it isn’t thinking in words the way we do. It’s doing math.
Every time it produces a new word, it’s standing at a crossroads of tens of thousands of possible continuations, each one with its own probability.

Imagine a massive roulette wheel, with every possible word as a slice on that wheel — “the,” “cat,” “quantum,” “melancholy,” “therefore.”
The size of each slice reflects how likely that word is to come next, given everything that’s been said so far.
That’s what an LLM does at every step: it builds this probability landscape, spins the wheel, and lets the math decide which word lands next.


1. The Invisible Map of Probabilities

Under the hood, the model assigns a score to every possible next token (a token might be a word, a piece of a word, or even punctuation).
These scores are called logits — raw, unnormalized numbers representing how well each token fits the context.

Then, it turns those scores into a set of probabilities using a softmax function.
If “the” fits best, it might get 60% probability. “cat” could get 25%, “dog” 10%, “quantum” 5%.
Together, these add up to 100%, forming a kind of map of linguistic possibilities.

At this moment, the model hasn’t chosen anything yet — it just knows how likely each next word could be.
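The logits-to-probabilities step can be sketched in a few lines of Python. This is a minimal softmax, with made-up logit values for the four example tokens above — real models do this over the whole vocabulary at once:

```python
import math

def softmax(logits):
    """Convert raw logit scores into probabilities that sum to 1."""
    # Subtracting the max logit first avoids overflow in exp() — it
    # doesn't change the result, since softmax is shift-invariant.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate next tokens.
logits = {"the": 4.0, "cat": 3.1, "dog": 2.2, "quantum": 1.5}
probs = dict(zip(logits, softmax(list(logits.values()))))
# "the" gets the biggest slice of the wheel; all slices sum to 1.
```

Note that softmax is relative: only the *differences* between logits matter, which is exactly why rescaling them (as temperature does, below) reshapes the whole distribution.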


2. Temperature: Turning Up or Down the Creative Heat

Now comes the concept of temperature — the model’s “creativity thermostat.”

Temperature tells the AI how adventurous to be when sampling from its probability map.

  • When the temperature is low (like 0.2), the model becomes cold and conservative.
    It only picks the most probable words. You’ll get safe, predictable answers — like a cautious bureaucrat sticking to the script.
  • When the temperature is high (like 1.2), the model warms up.
    It’s more willing to take risks, choosing unusual or surprising words.
    The results may be poetic or chaotic — depending on your point of view.

A low temperature makes the probability peaks taller and sharper — the wheel’s biggest slices get even bigger.
A high temperature flattens them, giving smaller slices more chance to win.

In effect, temperature controls the width of the imagination.
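Mechanically, temperature is just a divisor applied to the logits before softmax. A small sketch, reusing the hypothetical logit values from earlier:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/temperature before softmax.

    T < 1 exaggerates the gaps between logits (sharper peaks);
    T > 1 shrinks the gaps (flatter distribution).
    """
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 3.1, 2.2, 1.5]  # hypothetical scores for four tokens
cold = softmax_with_temperature(logits, temperature=0.2)
hot = softmax_with_temperature(logits, temperature=1.2)
# cold: nearly all probability piles onto the top token
# hot: probability spreads more evenly across all four
```

At T = 0.2 the top token ends up with almost 99% of the mass; at T = 1.2 it keeps only a little over half, so the smaller slices get a real chance.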


3. Top-k: Trimming the Vocabulary Tree

But even with temperature, there’s still a long tail of unlikely words.
If the model considered every single possible token, it might occasionally pick something weird just because randomness says so.

That’s where top-k sampling steps in.

Imagine you tell the model:

“Forget everything except your top 50 most likely choices.”

That’s what top-k = 50 means.
The model looks only at the 50 most probable next tokens, discards the rest, renormalizes what’s left, and picks randomly among those 50, weighted by their probabilities.

This keeps the model grounded — it won’t suddenly blurt out “rhinoceros” when the topic is “weather forecasts.”
But if you make k too small, say 1 or 2, it can get stuck repeating the same safe words over and over.
So top-k is like a vocabulary gatekeeper — letting through only the most reasonable words, but still leaving some room for variation.
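The gatekeeping itself is simple to express in code. A minimal sketch, using a tiny hypothetical four-token distribution and k = 2 so the effect is easy to see:

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens and renormalize to sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = ranked[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Hypothetical next-token probabilities.
probs = {"the": 0.60, "cat": 0.25, "dog": 0.10, "quantum": 0.05}
filtered = top_k_filter(probs, k=2)
# Only "the" and "cat" survive; their probabilities are rescaled
# so the two remaining slices again cover the whole wheel.
```

The renormalization step matters: after trimming, the surviving tokens must still sum to 100% so the wheel can be spun as usual.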


4. Top-p: The Smarter Gate

While top-k keeps a fixed number of words, top-p (also called nucleus sampling) works by cumulative probability instead.

It says:

“Keep picking words until the total probability reaches p.”

If p = 0.9, the model keeps the smallest set of tokens whose combined probability adds up to 90%.

Sometimes that might be 3 words, sometimes 10, depending on how uncertain the model feels.
It’s a more flexible and adaptive version of top-k.
Top-p focuses the model’s attention on its “core of confidence” — the words that make up the bulk of what it really believes could come next.
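Nucleus sampling differs from top-k only in its stopping rule: instead of counting tokens, it accumulates probability. A sketch with the same hypothetical four-token distribution:

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability
    reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break  # the nucleus is complete
    total = sum(q for _, q in kept)
    return {token: q / total for token, q in kept}

probs = {"the": 0.60, "cat": 0.25, "dog": 0.10, "quantum": 0.05}
nucleus = top_p_filter(probs, p=0.9)
# 0.60 + 0.25 = 0.85 < 0.9, so "dog" is pulled in too (cumulative 0.95);
# "quantum" is cut. The nucleus size adapts to the shape of the distribution.
```

On a confident, peaked distribution the nucleus might be a single token; on an uncertain, flat one it grows automatically — which is exactly the adaptivity top-k lacks.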


5. Spinning the Wheel

Once temperature and top-k or top-p have shaped the probability map, the model samples from it.

This means it doesn’t just take the top choice — it literally performs a weighted random draw.
If “the” has 60% chance and “cat” has 25%, the model picks “the” roughly 60% of the time and “cat” 25% — just as a fair roulette wheel would.

That chosen token becomes part of the text, and the process starts again —
new context, new probabilities, new spin.

Word by word, the text unfolds like a guided game of chance — a dance between structure and randomness.
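The final spin is an ordinary weighted random draw. A minimal sketch using Python’s standard library, with the same hypothetical probabilities as before and a fixed seed for reproducibility:

```python
import random

def sample(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

probs = {"the": 0.60, "cat": 0.25, "dog": 0.10, "quantum": 0.05}
rng = random.Random(42)
counts = {t: 0 for t in probs}
for _ in range(10_000):
    counts[sample(probs, rng)] += 1
# Over many spins, the draw frequencies track the slice sizes:
# "the" lands roughly 60% of the time, "cat" about 25%, and so on.
```

In a real decoding loop, this draw happens once per token: the chosen token is appended to the context, the model computes a fresh distribution, and the wheel spins again.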


6. Balancing Order and Chaos

Why go through all this trouble?
Because pure determinism would make the model dull and robotic.
If it always picked the highest-probability token, it would repeat phrases endlessly, like a broken record.
But pure randomness would make it nonsensical.

By mixing probabilities with adjustable sampling (temperature, top-k, top-p), we get writing that feels alive — not too stiff, not too wild.
This is why ChatGPT can write essays, poems, or explanations that sound both coherent and human.


7. The Art Behind the Numbers

Each time you ask a model to “be creative,” “summarize clearly,” or “stay factual,” you’re really adjusting this invisible probability landscape.
You’re not telling it which words to use — you’re setting the temperature of its imagination and the boundaries of its vocabulary horizon.

It’s a mathematical orchestra:

  • Temperature sets the mood.
  • Top-k defines which instruments can play.
  • Top-p decides how loud the ensemble can get before it feels complete.

From this probabilistic music comes something astonishing — coherent sentences, stories, explanations, and even emotion.

All from a machine that never “knows” words, only the geometry of likelihood.


In Short

  • LLMs don’t choose words; they sample them from a probability distribution.
  • Temperature controls creativity by flattening or sharpening that distribution.
  • Top-k limits choices to the k most likely options.
  • Top-p limits them to the most probable cluster whose total chance meets a threshold.
  • The final word is drawn randomly from this shaped distribution — a roll of the weighted dice.

Every sentence you see is the sum of countless such probabilistic spins — guided by math, shaped by training, and tuned by temperature.


