Course Description
This course explains what happens inside a large language model after a user enters a prompt. Students move step by step through the inference pipeline: from raw text to tokens, embeddings, positional structure, self-attention, multilayer perceptrons, residual accumulation, final hidden states, logits, probabilities, and next-token generation. The course treats the LLM not as a magical chatbot but as a structured computational system that transforms language into geometry, compares activations to learned weight patterns, and recursively produces text one token at a time. This framing directly follows the conceptual structure of the source post. (LF Yadda – A Blog About Life)
Core Learning Goals
By the end of the course, students should be able to:
- Explain the difference between raw text, tokens, token IDs, embeddings, and hidden states.
- Describe why position must be added to token embeddings.
- Explain self-attention in plain English and in simplified mathematical form.
- Distinguish between attention as relevance finding and the MLP as feature transformation.
- Explain the role of residual connections in preserving and accumulating meaning.
- Describe how final hidden states become logits, probabilities, and selected output tokens.
- Explain the difference between:
- prompt-to-weight comparison, and
- prompt-to-prompt comparison. (LF Yadda – A Blog About Life)
- Build intuitive mental models for how an LLM “thinks” without anthropomorphizing it too much.
Big Course Theme
A good unifying sentence for the whole class is this:
An LLM does not retrieve a finished answer from memory; it repeatedly transforms a live context into a probability field over possible next tokens. This is exactly the logic of the article’s later blocks on final hidden state, logits, probabilities, and token selection. (LF Yadda – A Blog About Life)
Unit Structure
Unit 1 — Entering the Machine
Covers Blocks 1–4:
- User Prompt
- Tokenization
- Embedding Lookup
- Positional Information Added (LF Yadda – A Blog About Life)
Essential Question
How does ordinary human text become machine-processable structure?
Key Ideas
- Raw text is not yet machine meaning.
- Tokenization is segmentation, not semantic understanding.
- Token IDs are discrete addresses.
- Embeddings convert symbolic IDs into vectors.
- Positional information turns isolated token vectors into sequence-aware hidden states. (LF Yadda – A Blog About Life)
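To make the last two ideas concrete, here is a minimal Python sketch (my own, not from the post) showing why a bag of token embeddings is ambiguous until position is added. The three-word vocabulary, the two-dimensional embeddings, and the helpers `toy_tokenize` and `embed` are all invented for the demo; real models use learned subword tokenizers and much richer position encodings.

```python
# Toy demo: without positions, "dog bites man" and "man bites dog"
# produce the same *set* of vectors. Everything here is made up for teaching.
import numpy as np

vocab = {"dog": 0, "bites": 1, "man": 2}       # token -> token ID
emb = np.array([[1.0, 0.0],                    # embedding for "dog"
                [0.0, 1.0],                    # embedding for "bites"
                [1.0, 1.0]])                   # embedding for "man"

def toy_tokenize(text):
    """Whitespace splitting standing in for a real subword tokenizer."""
    return [vocab[w] for w in text.split()]

def embed(ids, add_positions=True):
    """Look up embeddings; optionally add a crude stand-in for position encoding."""
    x = emb[ids]                               # (seq_len, 2)
    if add_positions:
        x = x + 0.1 * np.arange(len(ids))[:, None]
    return x

a = embed(toy_tokenize("dog bites man"), add_positions=False)
b = embed(toy_tokenize("man bites dog"), add_positions=False)
print(sorted(map(tuple, a)) == sorted(map(tuple, b)))   # True: identical bags of vectors

a = embed(toy_tokenize("dog bites man"))
b = embed(toy_tokenize("man bites dog"))
print(bool((a == b).all()))                             # False: positions break the tie
```

This pairs directly with the “dog bites man” vs “man bites dog” activity below.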
Sample Lesson Activities
- Break a sentence into possible token fragments by hand.
- Compare “dog bites man” vs “man bites dog” to show why order matters.
- Use colored index cards to represent tokens, IDs, embeddings, and positional additions.
Sample Homework
Take three short sentences and describe:
- what the raw text is,
- what tokenization would do,
- why token IDs alone are not enough,
- why positional information must be added.
Unit 2 — The Transformer as a Meaning Engine
Covers Blocks 5–11:
- Transformer Block
- Query/Key/Value Projections
- Self-Attention Score Computation
- Softmax Attention Weights
- Value Aggregation
- Attention Output Projection
- Residual Add (LF Yadda – A Blog About Life)
Essential Question
How does the model decide what parts of the prompt matter to what other parts?
Key Ideas
- Transformer layers are repeated refinement modules.
- Q/K/V are learned projections of the same hidden state into different roles.
- Self-attention computes relevance between token positions.
- Softmax turns raw relevance into normalized influence.
- Value aggregation gathers context.
- Residual addition preserves earlier representation while adding new context. (LF Yadda – A Blog About Life)
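The list above fits in a few lines of linear algebra. Below is a hedged sketch of one single-head attention pass with random matrices standing in for learned weights; multiple heads, causal masking, and layer normalization are deliberately omitted for clarity.

```python
# Minimal single-head self-attention pass, tracing Blocks 5-11.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8

x  = rng.normal(size=(seq_len, d_model))   # hidden states entering the block
Wq = rng.normal(size=(d_model, d_model))   # learned projections: live state
Wk = rng.normal(size=(d_model, d_model))   #   compared against stored weights
Wv = rng.normal(size=(d_model, d_model))
Wo = rng.normal(size=(d_model, d_model))   # attention output projection

Q, K, V = x @ Wq, x @ Wk, x @ Wv           # Query/Key/Value projections

# Self-attention scores: prompt-derived queries vs prompt-derived keys,
# scaled by sqrt(d_k) as in the standard transformer.
scores = Q @ K.T / np.sqrt(d_model)

# Softmax turns raw relevance into normalized influence weights per row.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

context = weights @ V                      # value aggregation: blend what matters
out = context @ Wo                         # project back to residual width

x = x + out                                # residual add: keep old state, add context
print(x.shape)                             # (4, 8): same shape, richer content
```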
Sample Lesson Activities
- Role-play attention:
- one student is Query,
- several are Keys,
- several carry Values,
- the class computes “who matters most.”
- Have students assign mock attention scores to words in a sentence.
- Draw the residual stream as a river receiving tributaries.
Sample Homework
Write a one-page explanation of attention without using the words “magic,” “understanding,” or “memory retrieval.”
Unit 3 — The MLP and Semantic Circuitry
Covers Blocks 12–15:
- MLP / Feedforward Layer
- MLP Activation
- MLP Output Returns to Residual Stream
- Repeat Across Many Layers (LF Yadda – A Blog About Life)
Essential Question
If attention gathers context, what does the MLP do with it?
Key Ideas
- The MLP is not mainly about token-to-token comparison.
- It compares live hidden states against learned neuron directions.
- Activation functions decide which feature responses become strong or weak.
- The MLP output is added back into the residual stream.
- Repeated layers progressively refine context into deeper semantic structure. (LF Yadda – A Blog About Life)
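A minimal sketch of that contrast, again with random placeholder weights: each hidden state is dotted against every neuron's learned direction, the activation gates the responses, and the result rejoins the residual stream.

```python
# Sketch of Blocks 12-14: hidden state vs learned neuron directions.
import numpy as np

rng = np.random.default_rng(1)
d_model, d_hidden = 8, 32                     # real models use d_hidden ~ 4 * d_model

x     = rng.normal(size=(d_model,))           # hidden state at one position
W_in  = rng.normal(size=(d_model, d_hidden))  # each column is a "feature direction"
W_out = rng.normal(size=(d_hidden, d_model))

pre = x @ W_in                                # dot product with every neuron direction
act = np.maximum(pre, 0.0)                    # ReLU stand-in for the activation:
                                              # decides which detectors "fire"
mlp_out = act @ W_out                         # recombine the active features

x = x + mlp_out                               # MLP output returns to the residual stream
print(f"{(act > 0).sum()} of {d_hidden} feature detectors fired")
```

Note that there is no token-to-token comparison anywhere in this block — exactly the contrast the homework below asks students to articulate.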
Sample Lesson Activities
- Treat neurons as “feature detectors.”
- Give students hidden-state cards with features like “plural,” “question,” “animal,” “past tense,” and simulate which detectors fire.
- Build a “semantic circuitry board” metaphor.
Sample Homework
Describe the difference between:
- attention asking “where should I look?”, and
- the MLP asking “what internal features should now activate?”
Unit 4 — From Internal State to Output
Covers Blocks 16–19:
- Final Hidden State for Last Token
- Output Logits / Unembedding
- Next Token Probabilities
- Token Selection (LF Yadda – A Blog About Life)
Essential Question
How does the model move from hidden computation to visible text?
Key Ideas
- The final hidden state is a context-rich summary at the last position.
- Unembedding compares that state against output token directions to produce logits.
- Softmax converts logits into probabilities.
- Decoding chooses one token.
- The selected token becomes both output and new input for the next cycle. (LF Yadda – A Blog About Life)
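One way to simulate those four blocks, with an invented four-word vocabulary and random numbers standing in for the learned unembedding matrix; it also previews the greedy-vs-sampling activity below.

```python
# Blocks 16-19 in miniature: final hidden state -> logits -> probabilities
# -> one selected token. Vocabulary and weights are invented for the demo.
import numpy as np

rng = np.random.default_rng(2)
vocab = ["mat", "roof", "moon", "dog"]
d_model = 8

h = rng.normal(size=(d_model,))                     # final hidden state, last position
W_unembed = rng.normal(size=(d_model, len(vocab)))  # one output direction per token

logits = h @ W_unembed                              # raw, unnormalized scores
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                # softmax: logits -> probabilities

greedy = vocab[int(np.argmax(probs))]               # greedy decoding: top token wins
sampled = rng.choice(vocab, p=probs)                # sampling: draw according to probs

print(dict(zip(vocab, probs.round(3))), greedy, sampled)
```

In a real generation loop, whichever token is chosen is appended to the context and the entire pipeline runs again.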
Sample Lesson Activities
- Give students a small fake vocabulary and a made-up logit table.
- Let them compute rough probabilities and simulate greedy vs sampling output.
- Show how one selected token changes the next step.
Sample Homework
Explain why logits are not yet probabilities, and why probabilities are not yet the final output.
Unit 5 — The Master Distinction
Covers the ending conceptual synthesis:
- Type 1 comparison: prompt vs learned weights
- Type 2 comparison: prompt vs prompt in self-attention (LF Yadda – A Blog About Life)
Essential Question
What kinds of “comparison” actually happen inside an LLM?
Key Ideas
The source post’s most important conceptual distinction is that there is not just one kind of comparison inside the model. Sometimes live hidden states are projected against learned parameter directions. Other times prompt-derived vectors compare directly with other prompt-derived vectors in self-attention. Understanding that distinction makes the whole pipeline much clearer. (LF Yadda – A Blog About Life)
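The distinction fits in two lines of linear algebra. In this sketch (shapes and weights are arbitrary placeholders), Type 1 multiplies a live hidden state by a frozen parameter matrix, while Type 2 dots one prompt-derived vector against another.

```python
# The master distinction, in code. All values are random placeholders.
import numpy as np

rng = np.random.default_rng(3)
h  = rng.normal(size=(5, 8))     # live hidden states derived from the prompt
W  = rng.normal(size=(8, 8))     # frozen learned parameters
Wq = rng.normal(size=(8, 8))
Wk = rng.normal(size=(8, 8))

# Type 1: live prompt state vs stored learned structure
# (embedding lookup, Q/K/V projections, the MLP, and unembedding all work this way)
type1 = h @ W

# Type 2: live prompt state vs live prompt state
# (only the self-attention score computation works this way)
q, k = h @ Wq, h @ Wk            # both sides start from the prompt...
type2 = q @ k.T                  # ...and are compared directly with each other
```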
Sample Lesson Activities
- Sort operations into two bins:
- against stored learned structure
- against live prompt-derived context
- Debate which is more important for “intelligence”: stored geometry or dynamic routing.
Final Writing Prompt
“Why an LLM is neither a database nor a simple Markov chain.”
Sample Weekly Schedule
Week 1
Introduction to LLM inference
Read Blocks 1–4
Class discussion: “Where does meaning begin?”
Week 2
Tokenization and embeddings
Lab: build a toy tokenizer
Mini-quiz
Week 3
Positional information and hidden states
Diagramming workshop
Week 4
Transformer blocks and Q/K/V
Attention simulation lab
Week 5
Self-attention, softmax, and value aggregation
Short written reflection
Week 6
Residual streams and cumulative meaning
Midterm concept map
Week 7
MLP and feature detectors
Neuron activation exercise
Week 8
Layer repetition and progressive contextualization
Compare early-layer vs late-layer roles
Week 9
Final hidden states and unembedding
Vocabulary competition exercise
Week 10
Probabilities and decoding
Sampling vs greedy lab
Week 11
Two kinds of comparison
Synthesis seminar
Week 12
Student presentations and final project
Sample Course Materials
This section moves beyond the outline into actual sample course material.
Sample Lecture 1 Handout
Handout Title
What Happens the Moment You Enter a Prompt?
Opening Idea
When a person reads a sentence, it already feels meaningful. When a language model receives that same sentence, it begins with raw symbols, not meaning. The first phase is not understanding. It is conversion. The post explicitly frames the user prompt as the boundary between the outside symbolic world and the inside computational world. (LF Yadda – A Blog About Life)
Vocabulary
- Prompt: the text supplied by the user
- Tokenization: breaking text into model-recognizable pieces
- Token ID: the numeric ID assigned to each token
- Embedding: a learned vector retrieved for a token ID
- Position encoding: additional information that tells the model where each token sits in the sequence (LF Yadda – A Blog About Life)
Mini Example
Sentence:
“The cat sat on the mat.”
Possible simplified flow:
- Raw text arrives.
- Tokenizer splits it into pieces.
- Each piece gets a token ID.
- Each token ID retrieves an embedding vector.
- Position information is added.
- The sequence is now ready to enter transformer layers. (LF Yadda – A Blog About Life)
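For instructors who want a runnable version of this flow, here is a rough sketch. The whitespace segmentation, six-dimensional embeddings, and additive position trick are classroom simplifications invented here, not how production tokenizers or position encodings actually work.

```python
# The handout's mini example as code: the pre-transformer flow for one sentence.
import numpy as np

text = "The cat sat on the mat."
pieces = text.lower().replace(".", " .").split()   # crude segmentation stand-in
vocab = {p: i for i, p in enumerate(dict.fromkeys(pieces))}

ids = [vocab[p] for p in pieces]                   # token IDs: discrete addresses
emb_table = np.random.default_rng(4).normal(size=(len(vocab), 6))
x = emb_table[ids]                                 # embedding lookup
x = x + 0.1 * np.arange(len(ids))[:, None]         # position information added

print(pieces)       # ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']
print(ids)          # both occurrences of "the" share one ID...
print(x[0] - x[4])  # ...but their vectors now differ purely by position
```

The sequence `x` is what would enter the first transformer layer.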
Key Takeaway
The model does not begin with semantic understanding. It begins with structured conversion into vectors. (LF Yadda – A Blog About Life)
Sample Lecture 2 Mini-Script
Topic
Attention: How tokens decide which other tokens matter
Imagine every token in a sentence asking a question: “Who in the sentence matters to me right now?” That is what queries and keys help the model do. The hidden state at a token position is projected into three views: Query, Key, and Value. The query is what the token is looking for, the key is how another token advertises its relevance, and the value is the content it can contribute if chosen. The post explains that the Q/K/V stage prepares the ingredients for attention, and the self-attention stage then performs prompt-to-prompt comparison by taking dot products of queries with keys. (LF Yadda – A Blog About Life)
The raw attention scores are then normalized by softmax, which turns relevance evidence into influence weights. Those weights are used to blend value vectors into a new contextual representation. That result is projected back into the residual stream and added to the earlier state rather than replacing it outright. (LF Yadda – A Blog About Life)
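For the “simplified mathematical form” promised in the course goals, the two paragraphs above compress into the standard scaled dot-product attention formula from the transformer literature:

\[
\mathrm{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
\]

where \(d_k\) is the key dimension; dividing by \(\sqrt{d_k}\) keeps the dot products from growing so large that the softmax saturates.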
So attention is not “the model looking up the answer.” Attention is the model reorganizing the prompt internally according to relevance. (LF Yadda – A Blog About Life)
Sample Student Worksheet
Worksheet: Mapping the Inference Pipeline
Part A — Put These in Order
Number the following from earliest to latest:
- logits
- tokenization
- token selection
- embedding lookup
- residual add
- next-token probabilities
- self-attention score computation
- positional information added
Part B — Short Answer
- Why is tokenization not the same thing as semantic understanding?
- Why does the model need positional information?
- What is the difference between a query and a value?
- Why are residual connections important?
- Why is the final hidden state not the same as the token embedding?
Part C — Reflection
Complete this sentence:
“The model’s output is not retrieved from storage all at once; it is generated by…”
Expected answer direction:
“…repeatedly turning a contextual hidden state into logits, probabilities, and then a selected next token.” (LF Yadda – A Blog About Life)
Sample Quiz
Quiz 1: Inside an LLM
Multiple Choice
1. Tokenization is primarily:
A. a semantic search
B. a deterministic segmentation and ID assignment step
C. a probability calculation
D. an attention score
Answer: B (LF Yadda – A Blog About Life)
2. Embedding lookup does what?
A. Selects the most similar sentence from training data
B. Converts raw logits into probabilities
C. Retrieves a learned vector for each token ID
D. Chooses the next token
Answer: C (LF Yadda – A Blog About Life)
3. Self-attention score computation compares:
A. hidden states only to MLP neurons
B. queries to keys across token positions
C. logits to probabilities
D. residual streams to token IDs
Answer: B (LF Yadda – A Blog About Life)
4. Residual addition is important because it:
A. deletes prior information
B. replaces old meaning entirely
C. lets the model preserve prior state while adding new context
D. converts text into token IDs
Answer: C (LF Yadda – A Blog About Life)
5. Logits are:
A. already normalized probabilities
B. raw scores over vocabulary items
C. token IDs
D. positional vectors
Answer: B (LF Yadda – A Blog About Life)
Short Response
In 3–5 sentences, explain the difference between logits and next-token probabilities. (LF Yadda – A Blog About Life)
Sample In-Class Lab
Lab Title
Be the Transformer
Goal
Students physically simulate one attention pass.
Materials
- index cards
- markers
- board space
Setup
Use the sentence:
“The animal didn’t cross the street because it was tired.”
Assign students roles:
- Token cards: The / animal / didn’t / cross / the / street / because / it / was / tired
- One student plays the current token: “it”
- Other students hold simplified key/value labels such as:
- animal → possible antecedent
- street → another noun
- tired → descriptive clue
Procedure
- “It” forms a Query: “I need my referent.”
- Candidate nouns present Keys.
- Students assign rough relevance scores.
- Softmax is approximated by converting those scores into rough weights.
- Values are blended.
- The class decides whether “it” most likely refers to “animal” or “street.”
Learning Point
This demonstrates how attention can help resolve relationships by comparing current-token needs to earlier-token relevance signals. That directly reflects the article’s explanation of self-attention, softmax weighting, and value aggregation. (LF Yadda – A Blog About Life)
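A quick numeric companion for the debrief: the class's rough relevance scores can be pushed through a real softmax in a few lines. The scores below are made up to favor “animal.”

```python
# Turn mock relevance scores for "it" into softmax attention weights.
import math

scores = {"animal": 3.0, "street": 1.0, "tired": 2.0}   # invented class scores

exp = {w: math.exp(s) for w, s in scores.items()}
total = sum(exp.values())
weights = {w: e / total for w, e in exp.items()}

print(weights)   # "animal" dominates: ~0.67 vs ~0.09 for "street", ~0.24 for "tired"
```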
Sample Discussion Questions
- At what exact point does “meaning” begin inside the model?
- Is an embedding “meaning,” or only the beginning of machine-usable meaning?
- Why is attention not enough by itself?
- Why is the MLP often underappreciated in popular explanations?
- Why does repeated layering matter?
- Does the model ever “know” a whole sentence in advance, or only produce one token at a time? (LF Yadda – A Blog About Life)
Sample Midterm Assignment
Prompt
Write a 1200–1800 word essay explaining the full LLM inference pipeline to an intelligent nontechnical reader. Your essay must include:
- tokenization
- embeddings
- positional information
- self-attention
- MLP
- residual connections
- final hidden state
- logits
- probabilities
- token selection
You must also explain the difference between:
- prompt vs learned-weight comparison
- prompt vs prompt comparison. (LF Yadda – A Blog About Life)
Grading Criteria
- accuracy
- clarity
- organization
- ability to use analogy without losing correctness
- proper distinction of pipeline stages
Sample Final Project Options
Option 1
Create a visual wall chart of the 19-block pipeline.
Option 2
Write a teacher’s guide titled:
“How to Explain an LLM Without Saying It Just Predicts the Next Word.”
Option 3
Build a toy spreadsheet model showing:
- tokens
- token IDs
- mock embeddings
- mock attention scores
- weighted values
- mock logits
Option 4
Produce a “Frank-said / GPT-said” classroom dialogue version of the course as an educational script.
Sample Teacher Notes
A strong teaching move is to repeat this contrast all semester:
- symbolic stage: raw text, token pieces, token IDs
- geometric stage: embeddings, hidden states, projections, attention space
- decision stage: logits, probabilities, token selection
That three-part framing is faithful to the post’s overall structure from prompt entry, through layered semantic processing, to output competition and decoding. (LF Yadda – A Blog About Life)
Another strong teaching move is to keep returning to the article’s final distinction: not all “comparison” inside an LLM is the same. Some comparison is against frozen learned structure. Some is live token-to-token interaction within the prompt. That distinction is probably the single best conceptual spine for the whole course. (LF Yadda – A Blog About Life)
Suggested Required Materials
- the original LF Yadda post as the anchor text
- whiteboard or slide deck
- printed worksheets
- colored token cards
- optional simple Python notebooks for vector demos
Recommended Assessments
A balanced version could be:
- 15% quizzes
- 20% worksheets and labs
- 20% discussion participation
- 20% midterm essay
- 25% final project
One-Page Course Summary for Students
This course teaches what happens inside a large language model after a prompt is entered.
You will learn how text is tokenized, turned into vectors, combined with positional information, processed through transformer layers, routed through self-attention, transformed by feedforward circuitry, compressed into a final hidden state, compared against vocabulary output directions, converted into probabilities, and finally decoded into a new token. The central lesson is that an LLM is a structured inference machine built from both frozen learned geometry and live context-sensitive interaction. (LF Yadda – A Blog About Life)