The Geometry of Stored Information in Neural Networks


Introduction: Information as Shape

When we think about information, we often picture it as symbols on a page, bits in a computer, or words in a book. But inside a large language model (LLM) — like GPT — information takes a very different form. It isn’t written in neat letters or stored in files. Instead, information lives in geometry: curved spaces, tangled surfaces, and hidden shapes buried in thousands of dimensions.

This essay will explore how neural networks store information as geometry. We’ll examine two key worlds:

  1. Parameter space — the vast, high-dimensional landscape of weights and biases, sculpted by backpropagation during training.
  2. Representation space — the hidden activations and embeddings that arise when the model processes input, where concepts form loops, curves, and other shapes called manifolds.

By the end, we’ll see that the geometry of these spaces is not incidental: it is the essence of how neural networks learn, store, and manipulate information. Information is shape, and learning is the art of shaping.


1. From Flat Symbols to Curved Spaces

Traditional computing treats information as flat and rigid. A number in a spreadsheet, a string in memory, a bit flipped to 1 or 0 — each is discrete and exact. Relationships among pieces of information are imposed externally by human logic or program design.

Neural networks, by contrast, don’t store knowledge in discrete tables. They create a continuous landscape, where related things live close together and unrelated things live far apart. This landscape is curved and high-dimensional. Geometry is not decoration here — it’s the medium of thought.

Take the word “king.” In an LLM, “king” is represented as a vector in a high-dimensional space. “Queen” lies nearby, but shifted along a “gender” direction. “President” might be close but aligned along a different direction. These aren’t arbitrary coordinates: they form structured shapes that reflect the meaning and relationships of concepts.
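To make this concrete, here is a tiny numeric sketch. The vectors and their three axes (roughly "royalty," "gender," "politics") are invented for illustration; real LLM embeddings have thousands of dimensions learned from data, not hand-picked axes.

```python
# Toy illustration (not real embeddings): concepts as points whose
# directions of difference carry meaning. The three axes are hypothetical
# "royalty", "gender", and "politics" features chosen by hand.
import numpy as np

emb = {
    "king":      np.array([0.9,  0.7, 0.1]),
    "queen":     np.array([0.9, -0.7, 0.1]),
    "man":       np.array([0.1,  0.7, 0.0]),
    "woman":     np.array([0.1, -0.7, 0.0]),
    "president": np.array([0.8,  0.6, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Moving along the "gender" direction carries king toward queen.
gender_direction = emb["woman"] - emb["man"]
shifted = emb["king"] + gender_direction
print(cosine(shifted, emb["queen"]))   # 1.0 in this toy setup
print(cosine(shifted, emb["king"]))    # noticeably lower
```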


2. Backpropagation: Sculpting the Landscape

How do these shapes arise? The answer lies in backpropagation, the training algorithm that adjusts network weights.

The Loss Landscape

Every neural network has a loss function $L(W)$, where $W$ denotes all the weights. This function measures how wrong the network is on its training data. If the network predicts “dog” when the answer is “cat,” the loss is high. If it predicts correctly, the loss is low.

Visualize this loss as a giant mountainous surface in a space with billions of dimensions. Each point on this surface corresponds to one possible setting of the network’s weights. Valleys correspond to good solutions; peaks correspond to bad ones.

The Gradient Descent Walk

Backpropagation computes the gradient of the loss — the steepest direction downhill — and updates the weights: $W \leftarrow W - \eta \nabla_W L(W)$, where $\eta$ is the learning rate.

This is gradient descent: sliding downhill step by step. Over millions of steps, the network’s parameters wander into valleys where predictions are good.
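A minimal sketch of this descent on a toy two-dimensional loss, standing in for the billion-dimensional landscape; the quadratic bowl and learning rate below are chosen only for illustration.

```python
# Minimal sketch of gradient descent on a toy 2-D "loss landscape".
# The quadratic bowl is a stand-in for the real, billion-dimensional
# surface; in a real network, backprop supplies the gradient.
import numpy as np

def loss(w):
    # A curved valley: steep in one direction, shallow in the other.
    return 3.0 * w[0] ** 2 + 0.5 * w[1] ** 2

def grad(w):
    return np.array([6.0 * w[0], 1.0 * w[1]])

w = np.array([2.0, 2.0])   # starting point on the landscape
eta = 0.1                  # learning rate (step size downhill)
for step in range(50):
    w = w - eta * grad(w)  # W <- W - eta * dL/dW

print(w, loss(w))          # w has slid close to the valley floor at (0, 0)
```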

Geometry at Work

The key insight is that the loss landscape is itself a geometric manifold. It has ridges, valleys, plateaus, and curvatures. Backpropagation doesn’t just find numbers; it navigates geometry. The shapes of this landscape — how steep, how curved, how connected — govern how and what the network learns.


3. Representation Space: Shapes of Meaning

Training shapes not just the parameters but also the representations that arise when the network processes inputs.

Activations as Points

Feed the word “blue” into a trained LLM. At some hidden layer, the model produces a vector of activations $\Psi(\text{“blue”})$. Do the same for “red,” “green,” “yellow,” and you get other vectors. Plot them (after dimension reduction), and they don’t scatter randomly. They fall into a loop resembling the color wheel.
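Here is a rough sketch of how such a loop can hide in thousands of dimensions and reappear under dimension reduction. The "activations" are synthetic stand-ins, built by embedding a circle into 768 dimensions with a random linear map; only the PCA step mirrors what is done with real model activations.

```python
# Sketch: a loop hidden inside a high-dimensional activation space.
# Twelve stand-in "color" activations are placed on a circle, mapped into
# 768 dimensions, and then recovered with PCA (dimension reduction).
import numpy as np

rng = np.random.default_rng(0)
angles = np.linspace(0, 2 * np.pi, 12, endpoint=False)   # 12 "colors"
circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)

embed = rng.normal(size=(2, 768))        # hypothetical hidden-layer mapping
activations = circle @ embed             # shape (12, 768): the loop is now hidden

# PCA via SVD of the centered data: the top two components reveal the loop.
centered = activations - activations.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
projected = centered @ vt[:2].T
print(projected.round(2))                # points trace out a closed ring
```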

Manifolds of Concepts

Researchers have found many such manifolds:

  • Colors form loops.
  • Dates of the year form cycles.
  • Years line up in a curve resembling a timeline (sometimes warped logarithmically).
  • Days of the week form circular structures.

These are not coincidences. They show that the network encodes concepts as continuous geometric shapes in representation space.

Why Curved Shapes?

Why not just store “red” at point A, “blue” at point B, “green” at point C? Because neural networks rely on linear operations (matrix multiplications, dot products). To make linear operations powerful enough to capture nonlinear relationships, the network bends concept manifolds into higher dimensions. This bending allows simple linear probes (like logistic regression) to extract rich, complex meanings.
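A minimal sketch of such a linear probe, assuming scikit-learn is available; the activations and labels are synthetic placeholders, but the recipe (fit a logistic regression on hidden vectors, read off a concept) is the same one applied to real models.

```python
# Sketch of a linear probe: a logistic regression that reads a concept
# straight off hidden activations. Data here is synthetic -- in practice
# the rows would be real hidden-layer vectors for labeled inputs.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, d = 200, 64
concept_direction = rng.normal(size=d)          # hypothetical feature axis
activations = rng.normal(size=(n, d))
labels = (activations @ concept_direction > 0).astype(int)  # e.g. "warm" vs "cool"

probe = LogisticRegression(max_iter=1000).fit(activations, labels)
print(probe.score(activations, labels))         # near 1.0: linearly readable
```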


4. The Bridge: Chain Rule Geometry

Now we connect parameter space (training) and representation space (meaning).

Backpropagation applies the chain rule: $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \Psi(x; W)} \cdot \frac{\partial \Psi(x; W)}{\partial W}.$

This equation is the bridge:

  • The first term $\frac{\partial L}{\partial \Psi}$ reflects how the geometry of representations affects the loss.
  • The second term $\frac{\partial \Psi}{\partial W}$ reflects how weight changes warp the representation manifold.

Thus, gradients in parameter space sculpt the geometry of representation space. The loops and curves we see in embeddings are the shadows of optimization dynamics in weight space.
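The bridge can be checked numerically on the smallest possible "network": one weight, one representation, one loss. The numbers below are arbitrary; the point is that the two chain-rule factors compose into exactly the gradient backpropagation would compute.

```python
# Numeric check of the chain-rule bridge for a tiny "network":
# a single weight w, representation psi = w * x, and loss L = (psi - y)^2.
import numpy as np

x, y, w = 2.0, 3.0, 0.5

psi = w * x                      # representation produced by the weight
dL_dpsi = 2.0 * (psi - y)        # how representation geometry affects the loss
dpsi_dw = x                      # how a weight change warps the representation
dL_dw = dL_dpsi * dpsi_dw        # chain rule: the gradient backprop computes

# Finite-difference check that the two factors really compose correctly.
eps = 1e-6
numeric = (((w + eps) * x - y) ** 2 - ((w - eps) * x - y) ** 2) / (2 * eps)
print(dL_dw, numeric)            # -8.0 and approximately -8.0
```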


5. Riemannian Geometry: A Common Language

Both the loss landscape and representation manifolds can be described by Riemannian geometry, the mathematics of curved spaces.

Metrics

In representation space, we measure closeness between vectors with cosine similarity. In parameter space, we measure distance with respect to the Hessian (the matrix of second derivatives of the loss). Both act as metrics — ways of defining lengths and angles in curved spaces.
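A small sketch of the representation-space ruler, with made-up "month" vectors; the parameter-space (Hessian) side is illustrated under Curvature below.

```python
# Sketch of the representation-space ruler: cosine similarity between
# activation vectors. The "month" vectors here are invented placeholders.
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

january = np.array([0.9, 0.1, 0.3])     # hypothetical month embeddings
february = np.array([0.8, 0.2, 0.35])
july = np.array([-0.7, 0.6, 0.1])

print(cosine_similarity(january, february))  # large: neighbors on the manifold
print(cosine_similarity(january, july))      # small (negative): far apart on the cycle
```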

Geodesics

In representation space, the shortest path along a manifold (say, from “January” to “June”) is a geodesic. The paper we discussed proves that cosine similarity encodes these geodesics.
In parameter space, optimization paths approximate geodesics in the loss landscape — the most efficient routes through curved error surfaces.

Curvature

Curvature tells us how much a space bends.

  • In representation space, curvature explains distortions (e.g., years stretched logarithmically).
  • In parameter space, curvature explains optimization difficulty (sharp vs. flat minima).

The same geometry describes both.
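To make "sharp vs. flat" concrete, here is a toy sketch: two quadratic losses with the same minimum but very different Hessian eigenvalues. The scale values are arbitrary; in a real network the Hessian is enormous and only ever estimated.

```python
# Sketch of curvature in parameter space: the Hessian's eigenvalues at a
# minimum say whether the valley is sharp or flat. Two toy 2-D losses
# share the same minimum but differ in curvature.
import numpy as np

def hessian_at_origin(a, b):
    # For L(w) = a*w0^2 + b*w1^2 the Hessian is constant: diag(2a, 2b).
    return np.diag([2.0 * a, 2.0 * b])

sharp = hessian_at_origin(50.0, 40.0)   # steep, narrow valley
flat = hessian_at_origin(0.1, 0.05)     # wide, gentle basin

print(np.linalg.eigvalsh(sharp))  # large eigenvalues: high curvature
print(np.linalg.eigvalsh(flat))   # small eigenvalues: low curvature
```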


6. Information as Geometry

What does it mean to say information is stored geometrically?

  1. Closeness encodes similarity. If two vectors are close in representation space, their meanings are related.
  2. Directions encode transformations. Moving along a particular direction corresponds to changing a feature (e.g., “gender” direction from king → queen).
  3. Curvature encodes structure. Cycles, spirals, and folds represent deeper regularities, like periodic time or hierarchical categories.

Thus, geometry is not a metaphor but the literal mechanism by which information is stored and retrieved.


7. Case Study: Time as Geometry

Take “years of the 20th century.” In GPT-2 small, embeddings of year tokens don’t scatter randomly. They line up along a curve. The order of years corresponds to position along this curve. Yet the curve is warped: recent years are more spread out. This matches a logarithmic rescaling, reflecting the model’s training data bias (recent years appear more frequently in its text corpus).

Here we see:

  • Topology: the structure (a line) matches our concept of a timeline.
  • Geometry: distances are warped logarithmically.
  • Interpretability: by analyzing this manifold, we learn how the model “thinks” about time.
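A rough sketch of how one might probe this year manifold, assuming the Hugging Face transformers and scikit-learn packages are installed; tokenization details are simplified (some years split into several BPE pieces, so their embeddings are averaged), and the output is only a starting point for inspection.

```python
# Rough sketch: project GPT-2 small's embeddings of year tokens to 2-D
# and look for a curve. Multi-token years are handled by averaging.
import numpy as np
from transformers import GPT2Tokenizer, GPT2Model
from sklearn.decomposition import PCA

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
embedding_matrix = model.get_input_embeddings().weight.detach().numpy()

years = [str(y) for y in range(1900, 2000, 10)]
vectors = []
for year in years:
    ids = tokenizer.encode(" " + year)          # leading space matters for BPE
    vectors.append(embedding_matrix[ids].mean(axis=0))

projected = PCA(n_components=2).fit_transform(np.array(vectors))
for year, point in zip(years, projected):
    print(year, point.round(3))   # inspect whether the points line up along a curve
```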

8. The Sculptor Analogy

Imagine a sculptor working clay:

  • The loss landscape is the feedback felt by the sculptor’s hands.
  • Backpropagation is the process of adjusting pressure to minimize resistance.
  • The representation manifolds are the shapes that emerge in the clay — spirals, curves, loops.

In this analogy, training is the act of sculpting, weights are the sculptor’s hands, and representations are the finished figures. The beauty of the result lies in the geometry of the clay.


9. Implications

Interpretability

Understanding representation manifolds may allow us to steer or edit models. If we can map the manifold of “political ideology” or “emotions,” we could nudge activations along desired paths.
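A conceptual sketch of such steering, with a purely hypothetical activation and concept direction; in practice the direction would be found by probing or by contrasting activations, and the nudge applied inside the model's forward pass.

```python
# Conceptual sketch of activation steering: nudge a hidden activation
# along a concept direction before it flows into the next layer.
# Both the activation and the direction here are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(2)
hidden = rng.normal(size=768)            # stand-in for one hidden activation
concept_direction = np.zeros(768)        # hypothetical unit "concept" axis
concept_direction[42] = 1.0
alpha = 3.0                              # steering strength

steered = hidden + alpha * concept_direction   # move along the chosen direction
print(hidden[42], steered[42])                 # the targeted coordinate shifts by alpha
```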

Alignment

If harmful concepts occupy certain manifolds, interventions could reshape or prune them, rather than relying on crude output filters.

Research

The paper we summarized suggests “manifold-aware” sparse autoencoders. These tools could disentangle features not as atomic switches but as continuous curved spaces.


10. Limitations and Open Questions

  • Scalability: Current methods require manual guesses about manifolds (circle, line, etc.). Automating this remains unsolved.
  • Noise: High-dimensional geometry is hard to estimate in practice; projections can distort.
  • Human alignment: Manifolds may represent concepts in ways unfamiliar to us, making them hard to interpret.

Still, the framework of geometry provides a unifying perspective.


11. Broader Perspective: Geometry Across Scales

The geometric story of information extends beyond AI:

  • In physics, states of systems live on manifolds defined by conserved quantities.
  • In biology, gene expression and protein folding explore curved spaces of possible forms.
  • In cognition, human concepts may live in mental spaces with geometric relationships (psychologists have long modeled similarity this way).

AI’s manifolds are thus part of a wider truth: information lives in shape.


Conclusion: The Geometry of Thought

We began by asking how neural networks store information. The answer is not in symbols, tables, or discrete logic, but in geometry.

  • In parameter space, backpropagation navigates curved loss landscapes.
  • In representation space, concepts crystallize as manifolds — loops, lines, spirals, warped surfaces.
  • Riemannian geometry provides the common language: metrics, geodesics, curvature.

This view reframes learning itself: to learn is to sculpt geometry. To store knowledge is to bend space. To reason is to navigate manifolds.

As we continue to probe the hidden structures of AI, we may find that geometry is not just a way to describe information, but the deepest reality of information itself.

Information is shape. Learning is shaping. And intelligence is geometry in motion.


