Inside an LLM: From Prompt to Prediction


Course Description

This course explains what happens inside a large language model after a user enters a prompt. Students move step by step through the inference pipeline: from raw text to tokens, embeddings, positional structure, self-attention, multilayer perceptrons, residual accumulation, final hidden states, logits, probabilities, and next-token generation. The course treats the LLM not as a magical chatbot but as a structured computational system that transforms language into geometry, compares activations to learned weight patterns, and recursively produces text one token at a time. This framing directly follows the conceptual structure of the source post. (LF Yadda – A Blog About Life)


Core Learning Goals

By the end of the course, students should be able to:

  1. Explain the difference between raw text, tokens, token IDs, embeddings, and hidden states.
  2. Describe why position must be added to token embeddings.
  3. Explain self-attention in plain English and in simplified mathematical form.
  4. Distinguish between attention as relevance finding and the MLP as feature transformation.
  5. Explain the role of residual connections in preserving and accumulating meaning.
  6. Describe how final hidden states become logits, probabilities, and selected output tokens.
  7. Explain the difference between comparisons against stored learned weights and live prompt-to-prompt comparisons inside self-attention.
  8. Build intuitive mental models for how an LLM “thinks” without anthropomorphizing it too much.

Big Course Theme

A good unifying sentence for the whole class is this:

An LLM does not retrieve a finished answer from memory; it repeatedly transforms a live context into a probability field over possible next tokens. This is exactly the logic of the article’s later blocks on final hidden state, logits, probabilities, and token selection. (LF Yadda – A Blog About Life)


Unit Structure

Unit 1 — Entering the Machine

Covers Blocks 1–4.

Essential Question

How does ordinary human text become machine-processable structure?

Key Ideas

  • Raw text is not yet machine meaning.
  • Tokenization is segmentation, not semantic understanding.
  • Token IDs are discrete addresses.
  • Embeddings convert symbolic IDs into vectors.
  • Positional information turns isolated token vectors into sequence-aware hidden states. (LF Yadda – A Blog About Life)
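For classes using the optional Python notebooks, the key ideas above can be sketched as a toy tokenizer. The vocabulary and the whitespace-plus-period splitting rule are invented for illustration; real tokenizers use learned subword schemes such as BPE.

```python
# Toy tokenization and ID lookup. The vocabulary is invented for
# demonstration; real tokenizers use learned subword segmentation.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, ".": 5}

def tokenize(text):
    """Split on whitespace and peel off trailing periods: a crude
    stand-in for real subword segmentation, not semantic analysis."""
    tokens = []
    for word in text.lower().split():
        if word.endswith("."):
            tokens.extend([word[:-1], "."])
        else:
            tokens.append(word)
    return tokens

tokens = tokenize("The cat sat on the mat.")
ids = [vocab[t] for t in tokens]    # token IDs are discrete addresses
print(tokens)  # ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']
print(ids)     # [0, 1, 2, 3, 0, 4, 5]
```

Note that both occurrences of "the" map to the same ID, which is exactly why position must be added later: IDs alone carry no ordering.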

Sample Lesson Activities

  • Break a sentence into possible token fragments by hand.
  • Compare “dog bites man” vs “man bites dog” to show why order matters.
  • Use colored index cards to represent tokens, IDs, embeddings, and positional additions.

Sample Homework

Take three short sentences and describe:

  • what the raw text is,
  • what tokenization would do,
  • why token IDs alone are not enough,
  • why positional information must be added.

Unit 2 — The Transformer as a Meaning Engine

Covers Blocks 5–11:

  • Transformer Block
  • Query/Key/Value Projections
  • Self-Attention Score Computation
  • Softmax Attention Weights
  • Value Aggregation
  • Attention Output Projection
  • Residual Add (LF Yadda – A Blog About Life)

Essential Question

How does the model decide what parts of the prompt matter to what other parts?

Key Ideas

  • Transformer layers are repeated refinement modules.
  • Q/K/V are learned projections of the same hidden state into different roles.
  • Self-attention computes relevance between token positions.
  • Softmax turns raw relevance into normalized influence.
  • Value aggregation gathers context.
  • Residual addition preserves earlier representation while adding new context. (LF Yadda – A Blog About Life)
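The key ideas above can be simulated in a few lines of plain Python. This is a minimal single-head attention sketch with invented 2-d hidden states and identity-like projection weights, chosen only for readability:

```python
import math

def softmax(xs):
    """Turn raw relevance scores into normalized influence weights."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Three 2-d hidden states (one per token position) and made-up 2x2
# projection weights for Query, Key, and Value.
H  = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Wq = [[1.0, 0.0], [0.0, 1.0]]   # identity weights, for readability
Wk = [[1.0, 0.0], [0.0, 1.0]]
Wv = [[0.5, 0.0], [0.0, 0.5]]

Q = [matvec(Wq, h) for h in H]
K = [matvec(Wk, h) for h in H]
V = [matvec(Wv, h) for h in H]

d = len(H[0])
outputs = []
for q in Q:
    scores = [dot(q, k) / math.sqrt(d) for k in K]   # relevance between positions
    weights = softmax(scores)                        # normalized influence
    out = [sum(w * v[i] for w, v in zip(weights, V)) for i in range(d)]
    outputs.append(out)                              # context gathered per position
```

In a real model each `out` would then pass through an output projection and be residual-added to the original hidden state rather than replacing it.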

Sample Lesson Activities

  • Role-play attention:
    • one student is Query,
    • several are Keys,
    • several carry Values,
    • the class computes “who matters most.”
  • Have students assign mock attention scores to words in a sentence.
  • Draw the residual stream as a river receiving tributaries.

Sample Homework

Write a one-page explanation of attention without using the words “magic,” “understanding,” or “memory retrieval.”


Unit 3 — The MLP and Semantic Circuitry

Covers Blocks 12–15.

Essential Question

If attention gathers context, what does the MLP do with it?

Key Ideas

  • The MLP is not mainly about token-to-token comparison.
  • It compares live hidden states against learned neuron directions.
  • Activation functions decide which feature responses become strong or weak.
  • The MLP output is added back into the residual stream.
  • Repeated layers progressively refine context into deeper semantic structure. (LF Yadda – A Blog About Life)
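A minimal MLP-sublayer sketch makes these ideas concrete. All weights below are invented for illustration; the rows of the up-projection play the role of learned neuron directions, and ReLU stands in for the activation function:

```python
# MLP sublayer sketch: compare the live hidden state against learned
# neuron directions, gate with an activation, project back, residual-add.
def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

hidden = [0.5, -0.2, 0.8]        # live 3-d hidden state (mock values)

W_up = [                          # 4 neuron "directions" (rows, invented)
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
W_down = [                        # project 4 activations back to 3-d
    [0.5, 0.0, 0.0, 0.1],
    [0.0, 0.5, 0.0, 0.1],
    [0.0, 0.0, 0.5, 0.1],
]

pre_acts = matvec(W_up, hidden)              # dot products vs. directions
acts = [max(0.0, a) for a in pre_acts]       # ReLU: weak responses go to 0
mlp_out = matvec(W_down, acts)
new_hidden = [h + m for h, m in zip(hidden, mlp_out)]  # residual add
```

Notice that the second neuron's negative response is gated to zero: the activation function decides which feature detectors fire, and the residual add preserves the original state underneath the new features.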

Sample Lesson Activities

  • Treat neurons as “feature detectors.”
  • Give students hidden-state cards with features like “plural,” “question,” “animal,” “past tense,” and simulate which detectors fire.
  • Build a “semantic circuitry board” metaphor.

Sample Homework

Describe the difference between:

  • attention asking “where should I look?”
  • MLP asking “what internal features should now activate?”

Unit 4 — From Internal State to Output

Covers Blocks 16–19.

Essential Question

How does the model move from hidden computation to visible text?

Key Ideas

  • The final hidden state is a context-rich summary at the last position.
  • Unembedding compares that state against output token directions to produce logits.
  • Softmax converts logits into probabilities.
  • Decoding chooses one token.
  • The selected token becomes both output and new input for the next cycle. (LF Yadda – A Blog About Life)
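The final stage can be sketched with a tiny mock vocabulary. The logit values are invented; the point is the three distinct objects, raw scores, normalized probabilities, and one chosen token:

```python
import math

# Mock final step: logits over a tiny vocabulary, converted to
# probabilities, then greedy decoding. All values are invented.
vocab = ["cat", "mat", "sat", "the", "."]
logits = [2.0, 0.5, 0.1, 1.0, -1.0]          # raw, unnormalized scores

m = max(logits)
exps = [math.exp(l - m) for l in logits]      # subtract max for stability
probs = [e / sum(exps) for e in exps]         # softmax: now sums to 1

next_token = vocab[probs.index(max(probs))]   # greedy: pick the argmax
print(next_token)  # "cat" — the highest logit wins under greedy decoding
```

A sampling decoder would instead draw from `probs`, which is why the same prompt can yield different continuations; either way, the chosen token is appended to the context and the whole pipeline runs again.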

Sample Lesson Activities

  • Give students a small fake vocabulary and a made-up logit table.
  • Let them compute rough probabilities and simulate greedy vs sampling output.
  • Show how one selected token changes the next step.

Sample Homework

Explain why logits are not yet probabilities, and why probabilities are not yet the final output.


Unit 5 — The Master Distinction

Covers the ending conceptual synthesis.

Essential Question

What kinds of “comparison” actually happen inside an LLM?

Key Ideas

The source post’s most important conceptual distinction is that there is not just one kind of comparison inside the model. Sometimes live hidden states are projected against learned parameter directions. Other times prompt-derived vectors compare directly with other prompt-derived vectors in self-attention. Understanding that distinction makes the whole pipeline much clearer. (LF Yadda – A Blog About Life)
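The two kinds of comparison can be shown side by side in code. Both are dot products, but the operands differ in kind; all vectors below are invented mock values:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Comparison type 1: a live hidden state against a FROZEN learned
# direction (as in MLP neurons or unembedding rows). The weight row
# is fixed at inference time; only the hidden state varies per prompt.
learned_direction = [0.2, -0.5, 0.9]    # fixed parameter (mock values)
hidden_state = [1.0, 0.0, 0.5]          # changes with every prompt
score_vs_weights = dot(hidden_state, learned_direction)

# Comparison type 2: a prompt-derived vector against another
# prompt-derived vector (as in self-attention). BOTH operands
# depend on the live context.
query_vec = [0.4, 0.4, 0.1]             # derived from this prompt
key_vec = [0.8, 0.1, 0.1]               # also derived from this prompt
score_vs_prompt = dot(query_vec, key_vec)
```

The arithmetic is identical; the conceptual difference is whether one side of the comparison is frozen training-time structure or live context.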

Sample Lesson Activities

  • Sort operations into two bins:
    • against stored learned structure
    • against live prompt-derived context
  • Debate which is more important for “intelligence”: stored geometry or dynamic routing.

Final Writing Prompt

“Why an LLM is neither a database nor a simple Markov chain.”


Sample Weekly Schedule

Week 1

Introduction to LLM inference
Read Blocks 1–4
Class discussion: “Where does meaning begin?”

Week 2

Tokenization and embeddings
Lab: build a toy tokenizer
Mini-quiz

Week 3

Positional information and hidden states
Diagramming workshop

Week 4

Transformer blocks and Q/K/V
Attention simulation lab

Week 5

Self-attention, softmax, and value aggregation
Short written reflection

Week 6

Residual streams and cumulative meaning
Midterm concept map

Week 7

MLP and feature detectors
Neuron activation exercise

Week 8

Layer repetition and progressive contextualization
Compare early-layer vs late-layer roles

Week 9

Final hidden states and unembedding
Vocabulary competition exercise

Week 10

Probabilities and decoding
Sampling vs greedy lab

Week 11

Two kinds of comparison
Synthesis seminar

Week 12

Student presentations and final project


Sample Course Materials

What follows is sample course material itself, not just the outline.


Sample Lecture 1 Handout

Handout Title

What Happens the Moment You Enter a Prompt?

Opening Idea

When a person reads a sentence, it already feels meaningful. When a language model receives that same sentence, it begins with raw symbols, not meaning. The first phase is not understanding. It is conversion. The post explicitly frames the user prompt as the boundary between the outside symbolic world and the inside computational world. (LF Yadda – A Blog About Life)

Vocabulary

  • Prompt: the text supplied by the user
  • Tokenization: breaking text into model-recognizable pieces
  • Token ID: the numeric ID assigned to each token
  • Embedding: a learned vector retrieved for a token ID
  • Position encoding: additional information that tells the model where each token sits in the sequence (LF Yadda – A Blog About Life)

Mini Example

Sentence:
“The cat sat on the mat.”

Possible simplified flow:

  1. Raw text arrives.
  2. Tokenizer splits it into pieces.
  3. Each piece gets a token ID.
  4. Each token ID retrieves an embedding vector.
  5. Position information is added.
  6. The sequence is now ready to enter transformer layers. (LF Yadda – A Blog About Life)
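The six steps above can be sketched for this exact sentence. The 3-d embedding vectors and the positional rule are invented for illustration; real models use high-dimensional learned tables:

```python
# Illustrative flow for "The cat sat on the mat." with made-up 3-d vectors.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, ".": 5}
ids = [0, 1, 2, 3, 0, 4, 5]          # steps 1-3: text -> tokens -> IDs

embedding_table = {                   # step 4: learned vectors (mock values)
    0: [0.1, 0.0, 0.2], 1: [0.9, 0.3, 0.1], 2: [0.2, 0.8, 0.4],
    3: [0.0, 0.5, 0.7], 4: [0.6, 0.1, 0.9], 5: [0.3, 0.3, 0.3],
}

def positional_vector(pos, dim=3):
    """Mock positional signal: a small, position-dependent offset."""
    return [0.01 * pos * (d + 1) for d in range(dim)]

hidden_states = []                    # step 5: add position to each embedding
for pos, tok_id in enumerate(ids):
    emb = embedding_table[tok_id]
    offset = positional_vector(pos)
    hidden_states.append([e + o for e, o in zip(emb, offset)])
# Step 6: hidden_states is now a sequence ready for transformer layers.
```

Both occurrences of "the" (positions 0 and 4) start from the same embedding but end up different, because position was added; that is precisely what makes the vectors sequence-aware.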

Key Takeaway

The model does not begin with semantic understanding. It begins with structured conversion into vectors. (LF Yadda – A Blog About Life)


Sample Lecture 2 Mini-Script

Topic

Attention: How tokens decide which other tokens matter

Imagine every token in a sentence asking a question: “Who in the sentence matters to me right now?” That is what queries and keys help the model do. The hidden state at a token position is projected into three views: Query, Key, and Value. The query is what the token is looking for, the key is how another token advertises its relevance, and the value is the content it can contribute if chosen. The post explains that the Q/K/V stage prepares the ingredients for attention, and the self-attention stage then performs prompt-to-prompt comparison by dot-producting queries with keys. (LF Yadda – A Blog About Life)

The raw attention scores are then normalized by softmax, which turns relevance evidence into influence weights. Those weights are used to blend value vectors into a new contextual representation. That result is projected back into the residual stream and added to the earlier state rather than replacing it outright. (LF Yadda – A Blog About Life)

So attention is not “the model looking up the answer.” Attention is the model reorganizing the prompt internally according to relevance. (LF Yadda – A Blog About Life)
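The "three views" idea from the script can be shown directly: one hidden state, three learned projections. The weight matrices below are invented; the point is that Query, Key, and Value all come from the same vector:

```python
# One hidden state, three role-specific projections. Weights are invented.
def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

hidden = [1.0, 0.5]                    # hidden state at one token position

W_q = [[0.0, 1.0], [1.0, 0.0]]         # "what am I looking for?"
W_k = [[1.0, 1.0], [0.0, 1.0]]         # "how do I advertise my relevance?"
W_v = [[0.5, 0.0], [0.0, 0.5]]         # "what content can I contribute?"

q = matvec(W_q, hidden)
k = matvec(W_k, hidden)
v = matvec(W_v, hidden)
print(q, k, v)  # three different views of the same underlying state
```

Queries from one position are then dot-producted with keys from every position, which is the prompt-to-prompt comparison the script describes.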


Sample Student Worksheet

Worksheet: Mapping the Inference Pipeline

Part A — Put These in Order

Number the following from earliest to latest:

  • logits
  • tokenization
  • token selection
  • embedding lookup
  • residual add
  • next-token probabilities
  • self-attention score computation
  • positional information added

Part B — Short Answer

  1. Why is tokenization not the same thing as semantic understanding?
  2. Why does the model need positional information?
  3. What is the difference between a query and a value?
  4. Why are residual connections important?
  5. Why is the final hidden state not the same as the token embedding?

Part C — Reflection

Complete this sentence:

“The model’s output is not retrieved from storage all at once; it is generated by…”

Expected answer direction:
“…repeatedly turning a contextual hidden state into logits, probabilities, and then a selected next token.” (LF Yadda – A Blog About Life)


Sample Quiz

Quiz 1: Inside an LLM

Multiple Choice

1. Tokenization is primarily:
A. a semantic search
B. a deterministic segmentation and ID assignment step
C. a probability calculation
D. an attention score
Answer: B (LF Yadda – A Blog About Life)

2. Embedding lookup does what?
A. Selects the most similar sentence from training data
B. Converts raw logits into probabilities
C. retrieves a learned vector for each token ID
D. chooses the next token
Answer: C (LF Yadda – A Blog About Life)

3. Self-attention score computation compares:
A. hidden states only to MLP neurons
B. queries to keys across token positions
C. logits to probabilities
D. residual streams to token IDs
Answer: B (LF Yadda – A Blog About Life)

4. Residual addition is important because it:
A. deletes prior information
B. replaces old meaning entirely
C. lets the model preserve prior state while adding new context
D. converts text into token IDs
Answer: C (LF Yadda – A Blog About Life)

5. Logits are:
A. already normalized probabilities
B. raw scores over vocabulary items
C. token IDs
D. positional vectors
Answer: B (LF Yadda – A Blog About Life)

Short Response

In 3–5 sentences, explain the difference between logits and next-token probabilities. (LF Yadda – A Blog About Life)


Sample In-Class Lab

Lab Title

Be the Transformer

Goal

Students physically simulate one attention pass.

Materials

  • index cards
  • markers
  • board space

Setup

Use the sentence:
“The animal didn’t cross the street because it was tired.”

Assign students roles:

  • Token cards: The / animal / didn’t / cross / the / street / because / it / was / tired
  • One student plays the current token: “it”
  • Other students hold simplified key/value labels such as:
    • animal → possible antecedent
    • street → another noun
    • tired → descriptive clue

Procedure

  1. “It” forms a Query: “I need my referent.”
  2. Candidate nouns present Keys.
  3. Students assign rough relevance scores.
  4. Softmax is approximated by converting those scores into rough weights.
  5. Values are blended.
  6. The class decides whether “it” most likely refers to “animal” or “street.”

Learning Point

This demonstrates how attention can help resolve relationships by comparing current-token needs to earlier-token relevance signals. That directly reflects the article’s explanation of self-attention, softmax weighting, and value aggregation. (LF Yadda – A Blog About Life)
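For a follow-up notebook exercise, the lab can be replayed numerically. The relevance scores below are the kind of rough numbers a class might assign; they are invented, not model outputs:

```python
import math

# Numeric version of the lab: "it" scores each candidate, softmax turns
# scores into weights, and values are blended. All numbers are invented.
candidates = ["animal", "street", "tired"]
scores = [3.0, 0.5, 1.5]              # class-assigned relevance for "it"

m = max(scores)
exps = [math.exp(s - m) for s in scores]
weights = [e / sum(exps) for e in exps]   # softmax: rough scores -> weights

values = {"animal": [1.0, 0.0], "street": [0.0, 1.0], "tired": [0.5, 0.5]}
blended = [sum(w * values[c][i] for w, c in zip(weights, candidates))
           for i in range(2)]             # blend value vectors by weight

referent = candidates[weights.index(max(weights))]
print(referent)   # "animal" dominates the blend
```

Students can change the scores and watch the blended vector shift, which makes the softmax-then-aggregate step tangible.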


Sample Discussion Questions

  1. At what exact point does “meaning” begin inside the model?
  2. Is an embedding “meaning,” or only the beginning of machine-usable meaning?
  3. Why is attention not enough by itself?
  4. Why is the MLP often underappreciated in popular explanations?
  5. Why does repeated layering matter?
  6. Does the model ever “know” a whole sentence in advance, or only produce one token at a time? (LF Yadda – A Blog About Life)

Sample Midterm Assignment

Prompt

Write a 1200–1800 word essay explaining the full LLM inference pipeline to an intelligent nontechnical reader. Your essay must include:

  • tokenization
  • embeddings
  • positional information
  • self-attention
  • MLP
  • residual connections
  • final hidden state
  • logits
  • probabilities
  • token selection

You must also explain the difference between comparisons against stored learned structure and comparisons among live prompt-derived representations.

Grading Criteria

  • accuracy
  • clarity
  • organization
  • ability to use analogy without losing correctness
  • proper distinction of pipeline stages

Sample Final Project Options

Option 1

Create a visual wall chart of the 19-block pipeline.

Option 2

Write a teacher’s guide titled:
“How to Explain an LLM Without Saying It Just Predicts the Next Word.”

Option 3

Build a toy spreadsheet model showing:

  • tokens
  • token IDs
  • mock embeddings
  • mock attention scores
  • weighted values
  • mock logits

Option 4

Produce a “Frank-said / GPT-said” classroom dialogue version of the course as an educational script.


Sample Teacher Notes

A strong teaching move is to repeat this contrast all semester:

  • symbolic stage: raw text, token pieces, token IDs
  • geometric stage: embeddings, hidden states, projections, attention space
  • decision stage: logits, probabilities, token selection

That three-part framing is faithful to the post’s overall structure from prompt entry, through layered semantic processing, to output competition and decoding. (LF Yadda – A Blog About Life)

Another strong teaching move is to keep returning to the article’s final distinction: not all “comparison” inside an LLM is the same. Some comparison is against frozen learned structure. Some is live token-to-token interaction within the prompt. That distinction is probably the single best conceptual spine for the whole course. (LF Yadda – A Blog About Life)


Suggested Required Materials

  • your original LFYadda post as the anchor text
  • whiteboard or slide deck
  • printed worksheets
  • colored token cards
  • optional simple Python notebooks for vector demos

Recommended Assessments

A balanced version could be:

  • 15% quizzes
  • 20% worksheets and labs
  • 20% discussion participation
  • 20% midterm essay
  • 25% final project

One-Page Course Summary for Students

This course teaches what happens inside a large language model after a prompt is entered.
You will learn how text is tokenized, turned into vectors, combined with positional information, processed through transformer layers, routed through self-attention, transformed by feedforward circuitry, compressed into a final hidden state, compared against vocabulary output directions, converted into probabilities, and finally decoded into a new token. The central lesson is that an LLM is a structured inference machine built from both frozen learned geometry and live context-sensitive interaction. (LF Yadda – A Blog About Life)


