How Neural Networks Store and Learn Word Relationships

The Big Picture

Imagine you’re trying to teach a computer to understand that “king” and “queen” are related words, while “king” and “car” aren’t very similar. Neural networks do this by converting words into numbers (called vectors) and adjusting these numbers until similar words have similar patterns.

What is Cosine Similarity?

Think of cosine similarity like measuring the angle between two arrows:

  • If two arrows point in almost the same direction, they’re very similar (high cosine similarity)
  • If they point in completely different directions, they’re not similar (low cosine similarity)

For words, we represent each word as an arrow (vector) with multiple dimensions. Words with similar meanings should have arrows pointing in similar directions.
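The arrow comparison takes only a few lines of code. This is a minimal sketch in plain Python with no machine-learning library assumed:

```python
import math

def cosine_similarity(a, b):
    """1.0 = same direction, 0.0 = perpendicular, -1.0 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [1, 0]))  # 1.0 -- same direction
print(cosine_similarity([1, 0], [0, 1]))  # 0.0 -- completely different
```

Note that the result depends only on direction, not length: doubling every number in a vector leaves its cosine similarity to other vectors unchanged.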

How Neural Networks Store Word Relationships

The Storage System

Neural networks don’t keep a giant table of “king is similar to queen.” Instead, they store each word as a list of numbers (like coordinates). For example:

  • “king” might be stored as [0.8, 0.6, 0.2]
  • “queen” might be stored as [0.7, 0.5, 0.3]
  • “car” might be stored as [0.1, 0.9, 0.8]

These numbers are stored in the network’s “weights” – essentially the network’s memory.
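A minimal sketch of that storage, using the hypothetical numbers above; in a real network this is a weight matrix with tens of thousands of rows, but plain lists show the idea:

```python
# Tiny 3-word "embedding table": each word maps to a row of weights.
vocab = {"king": 0, "queen": 1, "car": 2}   # word -> row index
weights = [
    [0.8, 0.6, 0.2],  # king
    [0.7, 0.5, 0.3],  # queen
    [0.1, 0.9, 0.8],  # car
]

def lookup(word):
    """Retrieving a word's pattern is just an array index -- very cheap."""
    return weights[vocab[word]]

print(lookup("queen"))  # [0.7, 0.5, 0.3]
```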

Reading the Relationships

When the network needs to know how similar two words are, it:

  1. Looks up both words’ number patterns
  2. Calculates how similar the patterns are (using cosine similarity)
  3. Uses this similarity for whatever task it’s doing
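The three steps can be sketched as one small function; the embedding table and its numbers are the hypothetical examples from above, not values from any real model:

```python
import math

embeddings = {
    "king":  [0.8, 0.6, 0.2],
    "queen": [0.7, 0.5, 0.3],
    "car":   [0.1, 0.9, 0.8],
}

def similarity(word_a, word_b):
    a, b = embeddings[word_a], embeddings[word_b]    # 1. look up patterns
    dot = sum(x * y for x, y in zip(a, b))           # 2. compare them
    return dot / (math.hypot(*a) * math.hypot(*b))   #    (cosine similarity)

print(similarity("king", "queen"))  # high (close to 1)
print(similarity("king", "car"))    # noticeably lower
```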

The Learning Process (Backpropagation)

Starting Point

Initially, all words get random number patterns, so the network doesn’t know that “king” and “queen” should be similar.

The Training Cycle

The network learns through a repetitive process:

  1. Make a Prediction: Given “king,” try to predict related words like “queen”
  2. Check the Mistake: See how wrong the prediction was
  3. Adjust the Numbers: Slightly change the number patterns for “king” and “queen” to make them more similar
  4. Repeat: Do this millions of times with different word pairs
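The cycle can be sketched with a toy update rule that simply pulls two co-occurring words' patterns toward each other. Real networks instead compute a prediction error and backpropagate it, but for a related pair the resulting movement is in the same direction:

```python
def train_pair(vecs, word, context, lr=0.1):
    w, c = vecs[word], vecs[context]
    error = [wi - ci for wi, ci in zip(w, c)]                  # 2. check the mistake
    vecs[word]    = [wi - lr * e for wi, e in zip(w, error)]   # 3. adjust both
    vecs[context] = [ci + lr * e for ci, e in zip(c, error)]   #    patterns slightly

vecs = {"king": [0.1, 0.2, 0.3], "queen": [0.4, 0.5, 0.6]}
for _ in range(50):                                            # 4. repeat
    train_pair(vecs, "king", "queen")
# "king" and "queen" now have nearly identical patterns
```

The learning rate `lr` controls how far each step moves; many small steps, rather than one big jump, is what keeps the rest of the landscape stable.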

What Actually Changes

Each time the network learns from a mistake:

  • It moves “king” and “queen” closer together in the number space
  • It moves unrelated words like “king” and “car” further apart
  • The entire landscape of word relationships gets reorganized slightly
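Both movements can be sketched with one hypothetical nudge rule; the push-apart case here is a simplified stand-in for techniques like word2vec's negative sampling:

```python
def nudge(a, b, lr=0.1, sign=+1):
    """sign=+1 pulls the two vectors together; sign=-1 pushes them apart."""
    return ([x + sign * lr * (y - x) for x, y in zip(a, b)],
            [y + sign * lr * (x - y) for x, y in zip(a, b)])

king, queen, car = [0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]
king, queen = nudge(king, queen, sign=+1)  # related pair: move closer
king, car   = nudge(king, car,   sign=-1)  # unrelated pair: move apart
```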

Why It’s Computationally Intensive

This process requires enormous amounts of calculation because:

Scale

  • Modern systems work with vocabularies of 50,000+ words
  • Each word might be represented by 300+ numbers
  • The network processes thousands of word pairs in each training batch

Repeated Calculations

For each training example, the network must:

  • Look up word patterns (fast)
  • Calculate predictions for thousands of possible words (slow)
  • Compute how wrong each prediction was (moderate)
  • Update the number patterns for all involved words (moderate)
  • Repeat this process millions of times
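Some back-of-envelope arithmetic using the sizes above shows why the "calculate predictions" step dominates (the sizes are the illustrative ones from this article, not measurements of any particular system):

```python
vocab_size = 50_000   # words in the vocabulary
dims = 300            # numbers per word

# Predicting the next word means scoring every candidate in the vocabulary,
# and each score is a dot product over all the dimensions.
mults_per_prediction = vocab_size * dims
print(f"{mults_per_prediction:,} multiply-adds per prediction")  # 15,000,000
```

A single lookup touches 300 numbers; a single prediction touches 15 million. Multiply that by billions of training examples and the need for specialized hardware becomes clear.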

The Reshaping Process

Every single training cycle slightly reorganizes the entire “landscape” of word relationships. It’s like having a map with 50,000 cities, and after each piece of new information, you need to adjust the positions of many cities slightly to better reflect their real relationships.

A Simple Example

Let’s say we start with random numbers:

  • “king”: [0.9, -0.4, 0.1]
  • “queen”: [-0.3, 0.8, 0.5]
  • “car”: [0.7, 0.8, 0.9]

Initially, “king” and “queen” aren’t very similar. But after training on examples like “king appears near queen in sentences,” the network adjusts:

  • “king”: [0.8, 0.6, 0.2]
  • “queen”: [0.7, 0.5, 0.3]
  • “car”: [0.1, 0.9, 0.8]

Now “king” and “queen” have much more similar number patterns, while “car” is quite different.
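Plugging the trained vectors into the cosine similarity formula confirms this; a quick check in plain Python:

```python
import math

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Vectors after training, from the example above
king, queen, car = [0.8, 0.6, 0.2], [0.7, 0.5, 0.3], [0.1, 0.9, 0.8]
print(round(cos(king, queen), 2))  # 0.99
print(round(cos(king, car), 2))    # 0.63
```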

Key Insights

  1. No Explicit Similarity Table: The network doesn’t store a table saying “king is 99% similar to queen.” Instead, it stores number patterns that, when compared, reveal these relationships.
  2. Dynamic Computation: Each time you ask “how similar are these words?”, the network calculates it fresh from the stored number patterns.
  3. Gradual Learning: The network doesn’t learn relationships overnight. It takes millions of small adjustments to get the word patterns right.
  4. Computational Trade-off: While training is extremely expensive (requiring powerful computers and lots of time), using the trained network is relatively fast.

The Bigger Picture

This same principle applies to modern AI systems like ChatGPT, but scaled up enormously. Instead of just learning that “king” and “queen” are similar, they learn complex patterns about how words, phrases, and concepts relate to each other across millions of dimensions of meaning.

The “cosine similarity model” mentioned in the technical article is really just a fancy way of describing this process of representing meaning as numerical patterns and measuring similarity by comparing these patterns.

