It’s Just Math
Frank: So let’s start from the beginning. I keep reading about AI using enormous amounts of power, data centers catching fire, utilities scrambling to build new capacity. What’s actually going on?
Claude: The short version is that modern AI — specifically large language models — requires a staggering amount of computation to build and to run. And computation, at the hardware level, is just transistors switching states billions of times per second. Switching generates heat, managing heat requires cooling, and all of that takes electricity. A lot of it.
Frank: How much is a lot?
Claude: It depends on the scale. Let’s talk about two distinct phases, because they’re very different problems. There’s training, which is the process of building the model — teaching it everything it knows — and there’s inference, which is the process of actually using it. Running queries. Answering questions. Writing emails. Generating code. Those two phases have very different power profiles.
Frank: Start with training.
Claude: Training a frontier model — something like GPT-4, or a comparable system — is one of the most computationally expensive things humans have ever done deliberately. The estimates that have leaked out suggest GPT-4’s training run consumed somewhere in the range of fifty gigawatt-hours of electricity. Some estimates go higher.
Frank: Give me a frame of reference.
Claude: A gigawatt-hour is a billion watt-hours. The average American household uses about ten thousand kilowatt-hours per year — ten megawatt-hours. So fifty gigawatt-hours is roughly equivalent to the annual electricity consumption of five thousand homes. For a single training run. One model. Once.
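A back-of-the-envelope check of that comparison, using only the rough estimates quoted above (these are estimates, not measured figures):

```python
# Sanity check of the household comparison, using the estimates
# quoted in the conversation (not official figures).
TRAINING_RUN_WH = 50e9          # ~50 GWh, estimated GPT-4 training energy
HOUSEHOLD_WH_PER_YEAR = 10e6    # ~10 MWh/year, average US household

homes = TRAINING_RUN_WH / HOUSEHOLD_WH_PER_YEAR
print(f"One training run ≈ annual usage of {homes:,.0f} homes")
# One training run ≈ annual usage of 5,000 homes
```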
Frank: That’s a lot.
Claude: And that’s just the electricity. That doesn’t count the capital cost of the hardware, the water used for cooling, the carbon footprint of the supply chain to manufacture the chips, the engineers’ salaries, the facility costs. The electricity is just the most legible part of the expense.
Frank: And GPT-3 was smaller than GPT-4?
Claude: Significantly. GPT-3 is a 175-billion-parameter model, and its training run consumed roughly 1,287 megawatt-hours — about 1.3 gigawatt-hours. That's under three percent of GPT-4's estimated consumption. The scaling has been dramatic.
Frank: Why do more parameters mean more power?
Claude: Because more parameters means more math. Every parameter in a neural network is essentially a number — a weight — and the training process is about finding the right values for all of those numbers so that the model performs well. The way you find those values is through a process called gradient descent, where you repeatedly feed the model examples, measure how wrong it is, and then adjust the weights slightly in the direction that makes it less wrong. You do this billions of times.
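A minimal sketch of that loop, shrunk to a single weight. This is a toy illustration of gradient descent, not how frontier training is actually implemented:

```python
import numpy as np

# Toy gradient descent: fit y = w * x to data with true w = 3.
# Real training adjusts billions of weights this same basic way.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 3.0 * x + rng.normal(scale=0.1, size=1000)

w = 0.0    # initial guess for the weight
lr = 0.1   # learning rate: how far to step on each update
for step in range(100):
    pred = w * x                         # forward pass: make predictions
    grad = 2 * np.mean((pred - y) * x)   # gradient of the mean squared error
    w -= lr * grad                       # nudge w in the less-wrong direction

print(f"learned w ≈ {w:.3f}")  # ≈ 3.0
```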
Frank: And each adjustment requires computation.
Claude: Each forward pass — feeding data through the model to get a prediction — and each backward pass — computing how to adjust the weights — requires doing arithmetic on every single parameter in the network. With 175 billion parameters, that's a lot of arithmetic per example. With GPT-4, which is rumored to be in the range of a trillion parameters across a mixture-of-experts architecture, it's several times more, though a mixture-of-experts design activates only a subset of those parameters for each token.
Frank: What kind of arithmetic?
Claude: Matrix multiplication, almost entirely. That’s the core operation. You have a matrix of weights and a matrix of activations — basically the current state of the computation — and you multiply them together. The result flows to the next layer, where you do it again. And again. And again. A large language model might have a hundred or more layers, each of which involves multiple matrix multiplications.
Frank: So it’s just… math class stuff.
Claude: In principle, yes. The operation itself is not exotic. You learned matrix multiplication in high school or college. Row times column, sum the products. What makes it hard is the scale. These matrices aren't three-by-three. They're thousands of rows by thousands of columns, sometimes tens of thousands on a side. And you're doing this thousands or millions of times per second, across thousands of chips simultaneously, for weeks.
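For the record, here is the textbook operation next to the vectorized call that production kernels spend their lives optimizing. A minimal sketch:

```python
import numpy as np

def matmul_naive(A, B):
    """Textbook matrix multiply: row times column, sum the products."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            C[i, j] = sum(A[i, p] * B[p, j] for p in range(k))
    return C

A = np.random.rand(4, 3)
B = np.random.rand(3, 5)
# Same math as the optimized library call, vastly different speed at scale.
assert np.allclose(matmul_naive(A, B), A @ B)
```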
Frank: Weeks?
Claude: A serious training run for a frontier model might last two to three months. You have a cluster of potentially ten thousand or more high-end GPUs or TPUs running essentially flat out, twenty-four hours a day, for that entire period. The peak power draw of such a cluster might be fifty to a hundred megawatts.
Frank: A hundred megawatts. What does that compare to?
Claude: A typical nuclear power plant generates about a thousand megawatts. So a large training cluster consumes about a tenth of a nuclear plant’s output. A large hospital might use ten to twenty megawatts. A small city might use fifty to a hundred. So yes — a single AI training cluster, during a training run, is drawing power comparable to a small city.
Frank: And there are multiple companies doing this.
Claude: Multiple companies, multiple clusters, multiple training runs per year as models are iterated and improved. And that’s before you get into fine-tuning runs — which are shorter but still significant — and the research experiments that happen in the process of developing these systems.
Frank: Okay. So training is brutal. What about inference?
Claude: Inference is a different shape of problem. Per query, it’s much cheaper. Answering a single question with a model like GPT-4 might consume somewhere in the range of one to ten watt-hours. A typical Google search is a fraction of a watt-hour. So an AI query is maybe ten to a hundred times more expensive than a search.
Frank: That doesn’t sound catastrophic.
Claude: Per query, it isn’t. The problem is volume. Google processes something like eight and a half billion searches per day. ChatGPT reportedly hit a hundred million users within two months of launch. OpenAI has said they’re processing hundreds of millions of queries per day across their products. When you multiply even a small per-query cost by hundreds of millions, you get a number that starts to matter.
Frank: How much does inference cost at that scale?
Claude: OpenAI doesn’t publish their power consumption directly, but analysts have estimated that their inference operation — just serving ChatGPT and the API — might require several hundred megawatts of continuous capacity. Microsoft, which is deeply integrated with OpenAI and runs Azure, has been aggressively building out data center capacity and has explicitly cited AI inference as a major driver of that expansion.
Frank: And inference never stops.
Claude: That’s what makes it different from training. Training is a sprint — expensive, concentrated, finite. Inference is a marathon. The servers have to be running all the time because queries come in all the time, from everywhere in the world, around the clock. You can’t turn it off.
Frank: So in the long run, inference might consume more total energy than training?
Claude: Almost certainly. Training might cost fifty gigawatt-hours once. But if you’re serving a hundred million queries a day at one watt-hour each, that’s a hundred million watt-hours per day — a hundred megawatt-hours — which is thirty-six gigawatt-hours per year. And that’s a conservative estimate. The real consumption is likely higher, and the number of queries is growing. Within a few years of deployment, a popular model’s inference costs will dwarf its training costs in total energy terms.
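That arithmetic, spelled out with the assumed figures from above (the query volume and per-query cost are both rough estimates):

```python
# Inference-vs-training energy, using the conversation's estimates.
QUERIES_PER_DAY = 100e6   # assumed query volume
WH_PER_QUERY = 1.0        # conservative per-query estimate
TRAINING_GWH = 50.0       # estimated one-time training cost

daily_mwh = QUERIES_PER_DAY * WH_PER_QUERY / 1e6   # 100 MWh/day
yearly_gwh = daily_mwh * 365 / 1e3                 # ~36.5 GWh/year
print(f"{daily_mwh:.0f} MWh/day -> {yearly_gwh:.1f} GWh/year")
print(f"inference passes training cost in ~{TRAINING_GWH / yearly_gwh * 12:.0f} months")
```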
Frank: So the industry built this expensive thing to train, and now the expensive thing is running it.
Claude: Right. And the two problems require different solutions. Training can be scheduled — you can choose when to run it, you can optimize the cluster, you can take weeks to prepare. Inference has to be responsive. A user asking a question expects an answer in a second or two. That means the infrastructure has to be pre-provisioned, always on, and fast. Latency requirements make inference harder to optimize than training in some ways.
Frank: Let’s go back to the math. You said it’s mostly matrix multiplication. Why are GPUs good at that?
Claude: A CPU — the central processing unit in your laptop or phone — is designed to be a general-purpose problem solver. It has a small number of very sophisticated cores, maybe eight or sixteen, each capable of executing complex instructions, handling branches, managing memory in complicated ways. It’s like having eight very smart people who can each do any job.
Frank: And a GPU?
Claude: A GPU was originally designed for graphics — rendering pixels, applying transformations to polygons. Graphics turned out to require doing the same simple operation on millions of data points simultaneously. Multiply this color value by this matrix to apply a lighting effect. Do it for every pixel on screen. So GPU designers built chips with thousands of much simpler cores — not as individually smart, but able to all work in parallel on the same operation.
Frank: And neural networks happen to need the same thing.
Claude: Exactly. Matrix multiplication is what’s called embarrassingly parallel — you can break it into many independent sub-problems that can all be solved simultaneously without any of them needing to communicate with each other. A GPU with ten thousand cores can do a matrix multiply much faster than a CPU with sixteen cores, even if each individual GPU core is slower and dumber than each CPU core.
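A sketch of what "embarrassingly parallel" means in practice: each block of output rows depends only on its own slice of the input, so the blocks can be computed independently. Threads stand in here for the thousands of GPU cores; this illustrates the structure, not the speed:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

A = np.random.rand(1024, 512)
B = np.random.rand(512, 256)

def row_block(start, stop):
    # This slice of the output needs nothing from the other slices.
    return A[start:stop] @ B

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(row_block, i, i + 256) for i in range(0, 1024, 256)]
    C = np.vstack([f.result() for f in futures])

assert np.allclose(C, A @ B)  # the independent pieces assemble into the full product
```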
Frank: What about TPUs? Google talks about those.
Claude: Tensor Processing Units. Google built custom silicon specifically for the operations neural networks need. A TPU is even more specialized than a GPU — it gives up generality entirely to be extremely good at the specific matrix math patterns that appear in deep learning. For training and inference at scale, TPUs can be more efficient than GPUs in terms of performance per watt. NVIDIA has also been moving in this direction with their more recent architectures — the H100 and the B100 — which include specialized matrix acceleration units.
Frank: So the hardware is purpose-built for this one operation.
Claude: Yes, and it’s fascinating when you step back and think about it. We’ve built an entire global industry — chip fabrication, data center construction, power plant expansion — largely in service of doing addition and multiplication faster. The mathematics haven’t changed since you learned them. What’s changed is the scale and the hardware’s ability to execute at that scale.
Frank: Let’s talk about the memory bottleneck you mentioned earlier. What’s that about?
Claude: This is one of the more counterintuitive aspects of modern AI hardware. You’d think the limiting factor would be how fast the chips can do arithmetic. But often the bottleneck is something more mundane — how fast you can move data in and out of memory.
Frank: Why?
Claude: Because the model’s weights — all those billions of numbers — have to be loaded from memory every time you use them. A 70-billion parameter model, stored in 16-bit floating point, takes about 140 gigabytes of memory. You need to load those weights, do the computation, and load them again for the next layer. The arithmetic itself happens very quickly. The memory transfers take longer.
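The footprint arithmetic is simple enough to spell out: parameter count times bytes per parameter. Real deployments add overhead for activations and caches on top of this:

```python
# Weight-memory arithmetic for the example above.
def weight_gb(n_params: float, bits: int) -> float:
    return n_params * bits / 8 / 1e9

print(f"70B params @ 16-bit ≈ {weight_gb(70e9, 16):.0f} GB")  # ≈ 140 GB
```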
Frank: So the chip is sitting there waiting?
Claude: Often, yes. There’s a metric called arithmetic intensity — the ratio of computation to memory access. For some operations in neural networks, the arithmetic intensity is low, meaning you’re spending more time fetching data than doing math with it. This is sometimes called being memory-bound rather than compute-bound. A lot of the engineering effort in modern AI hardware is about making memory faster — using high-bandwidth memory, putting memory physically closer to the compute, designing clever data layouts to maximize cache utilization.
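A rough illustration of arithmetic intensity for a matrix-vector product, which is the shape of the work when generating one token at a time. The layer width and chip figures here are illustrative assumptions, not specifications:

```python
# Arithmetic intensity = computation done per byte fetched from memory.
n = 8192                  # hypothetical layer width
flops = 2 * n * n         # one multiply plus one add per weight
bytes_moved = n * n * 2   # each fp16 weight (2 bytes) read once
print(f"intensity ≈ {flops / bytes_moved:.1f} FLOPs/byte")  # ≈ 1: very low

# A chip doing ~1000 TFLOP/s over ~3 TB/s of bandwidth only stays busy
# above ~333 FLOPs/byte, so this workload is memory-bound.
```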
Frank: How do you fix a memory bottleneck?
Claude: Several ways. HBM — high-bandwidth memory — is a type of memory that stacks layers vertically and connects them with very short, dense interconnects, allowing much faster data transfer than traditional DRAM. The H100 GPU uses HBM3, which has a memory bandwidth of around three terabytes per second. That sounds fast, and it is, but at the scale of a large language model, it can still be the bottleneck.
Frank: Three terabytes per second and it’s still not enough?
Claude: At full utilization, with a large model, yes. Consider: if you have a 70 billion parameter model and you want to generate tokens quickly, you need to cycle through all those weights repeatedly. Three terabytes per second means you can load the entire 140-gigabyte model about twenty-one times per second. That puts a ceiling on how fast you can generate output, regardless of how fast the math cores are.
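The ceiling calculation, using the figures quoted above:

```python
# How many times per second can the chip cycle through all the weights?
MODEL_GB = 140.0        # 70B parameters at 16-bit
HBM_GB_PER_S = 3000.0   # ~3 TB/s, roughly H100-class bandwidth

passes_per_s = HBM_GB_PER_S / MODEL_GB
print(f"≈ {passes_per_s:.0f} full weight passes per second")  # ≈ 21
# One generated token needs roughly one pass, so ~21 tokens/second is the
# single-stream ceiling; batching many queries amortizes each pass across them.
```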
Frank: So inference speed is fundamentally limited by physics.
Claude: By the speed of light, essentially, yes. Moving data takes time and energy regardless of how clever your engineering is. This is why there’s so much interest in techniques like quantization — reducing the precision of the weights from 16- or 32-bit floats to 8-bit or even 4-bit integers — which halves or quarters the memory footprint and proportionally increases how many times you can cycle through the weights per second. You lose some precision, but often the model quality degrades less than you’d expect.
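A minimal sketch of the idea, using symmetric 8-bit quantization. Production schemes (per-channel scales, grouped 4-bit formats) are more elaborate, but the principle is the same: store small integers plus a scale factor, and reconstruct approximate floats on the fly:

```python
import numpy as np

w = np.random.randn(4096).astype(np.float32)  # stand-in for a row of weights

scale = np.abs(w).max() / 127.0               # one scale for the whole row
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)  # 1 byte per weight
w_restored = q.astype(np.float32) * scale     # dequantize when needed

err = np.abs(w - w_restored).max()
print(f"4x smaller than fp32, max error {err:.4f}")
```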
Frank: What about running models locally? On a laptop or a phone?
Claude: Possible for smaller models, challenging for large ones. A 7-billion parameter model quantized to 4-bit precision takes about four gigabytes of memory. A modern MacBook Pro with 16 or 32 gigabytes of unified memory can run that reasonably well. The performance won’t match a server with H100s, but it’s usable.
Frank: Unified memory — that’s the thing Apple keeps advertising?
Claude: Yes, and it’s actually quite relevant here. In a traditional computer, the CPU and GPU have separate memory pools, and moving data between them takes time and energy. Apple’s M-series chips share a single pool of memory between the CPU, GPU, and neural engine. For running local AI models, this is genuinely advantageous — the model weights live in one place and both the CPU and GPU can access them without copying.
Frank: So Apple Silicon is actually good for this.
Claude: For local inference, it’s very competitive. The memory bandwidth is high, the architecture is efficient, and the integration means less overhead. People have gotten remarkably good performance running 70-billion parameter models on a Mac Studio with 192 gigabytes of unified memory. Not ChatGPT-class speed, but functional.
Frank: What about the 70-billion versus 175-billion versus trillion-parameter distinction? What are all those parameters actually representing?
Claude: Each parameter is a floating point number — a weight in the network that encodes some aspect of what the model has learned. In a transformer architecture, which is what most large language models use, the weights are organized into matrices that define how information flows through the network. Some weights encode what to pay attention to. Some encode how to transform representations. Some encode the relationships between concepts.
Frank: Can you look at a weight and know what it represents?
Claude: Not really. This is one of the deeply strange aspects of neural networks. The knowledge isn’t stored in individual weights — it’s distributed across millions or billions of them in a way that’s not human-interpretable. You can’t point to weight number 47,382,910 and say “that’s where the model learned that Paris is in France.” The information is encoded collectively, in the patterns of relationships between enormous numbers of values.
Frank: That’s unsettling.
Claude: It’s one of the central challenges of AI interpretability research. We’ve built systems that clearly know things — they can answer questions, reason about problems, produce coherent text — but we don’t fully understand how they know them or where exactly the knowledge lives. The training process found some configuration of numbers that works, but the path from those numbers to the behavior is opaque.
Frank: Back to power. What happens to all that electricity?
Claude: Heat. Computation is, fundamentally, a physical process. When a transistor switches state, it dissipates energy as heat. A large GPU cluster is essentially a very expensive and sophisticated electric heater. The power that goes in comes out as heat that has to be removed.
Frank: How do you cool a data center running a hundred-megawatt AI cluster?
Claude: Increasingly, liquid cooling. Traditional air cooling — blowing cool air over the servers, exhausting hot air — works up to a point, but modern high-density GPU servers generate more heat per unit volume than air can efficiently carry away. Liquid cooling runs water or another coolant directly through cold plates in contact with the chips, carrying the heat away much more efficiently.
Frank: And then where does the heat go?
Claude: Into the cooling system, which has to reject it somewhere — into the outdoor air via cooling towers, or increasingly into bodies of water. Data centers consume enormous amounts of water for evaporative cooling. Microsoft has disclosed that their data centers use millions of gallons of water per day. There’s been local controversy in several communities about data centers drawing on water supplies.
Frank: So there’s a water footprint too, not just an electricity footprint.
Claude: Yes. Water, electricity, land — data centers are resource-intensive in multiple dimensions simultaneously. And the resource demands are growing faster than the efficiency gains from better hardware. NVIDIA’s chips have gotten dramatically more efficient per operation over the past several years — the H100 does far more AI computation per watt than a GPU from five years ago. But the models being trained have gotten so much larger that the total consumption has increased anyway.
Frank: It’s like fuel efficiency in cars — the cars get more efficient but people drive more.
Claude: Jevons paradox, exactly. When something becomes more efficient, it typically becomes cheaper, which encourages more use, which can increase total consumption even as unit consumption falls. The same dynamic appears to be playing out in AI compute.
Frank: Is there a ceiling? Does it have to keep growing?
Claude: That’s genuinely uncertain. There are optimists who argue that algorithmic improvements will start to dominate — that we’ll find more efficient ways to train and represent knowledge, so that future models won’t need to be dramatically larger than current ones to be dramatically better. There are people in the field who believe the current scaling approach — just make it bigger, feed it more data — is approaching diminishing returns.
Frank: And the pessimists?
Claude: The pessimists — or maybe realists — point out that every previous prediction of hitting a scaling wall has been wrong. People said in 2018 that we were near the limit of what scale could achieve. Then GPT-3 arrived. Then GPT-4. Each time, more scale produced more capability in ways that surprised people. The counterargument is that at some point you run out of data, or the hardware economics become prohibitive, or the physical constraints of heat and memory bandwidth become insurmountable.
Frank: What does the grid look like in a world where AI inference is a significant fraction of electricity consumption?
Claude: Strained, potentially. Several analyses have projected that data center electricity consumption in the US could double or triple by the end of this decade, largely due to AI. Utilities are already reporting that their load growth forecasts have been dramatically revised upward. There are real questions about whether the grid can absorb this demand growth fast enough — permitting new transmission lines takes years, building new generation capacity takes years, and the demand is growing now.
Frank: What are the companies doing about it?
Claude: Several things. Microsoft, Google, and Amazon have all made commitments to run on carbon-free energy. They’re investing heavily in nuclear power — particularly small modular reactors, which are still largely theoretical at commercial scale but are viewed as a potentially clean, reliable baseload source. Microsoft famously struck a deal to help restart Three Mile Island. Google has invested in geothermal. There’s also interest in co-locating data centers with renewable generation — putting them where the sun shines or the wind blows.
Frank: But renewables are intermittent.
Claude: Which creates a tension. AI inference needs reliable, always-on power. Solar and wind are not always on. Battery storage helps but is expensive at the scale needed. Nuclear is the dream solution — carbon-free and reliable — but building nuclear plants takes a decade or more under current regulatory frameworks. The industry is in a difficult position of having immediate, growing power needs and long-horizon solutions.
Frank: Let me bring it back to the fundamentals. All of this — the data centers, the chips, the water, the power plants — is ultimately in service of matrix multiplication.
Claude: Yes. And specifically, in service of a particular insight that emerged from decades of research — that if you take a deep neural network, which is ultimately just a very large composition of matrix multiplications with nonlinear functions between them, and you train it on enough data with enough compute, it learns remarkably general and capable representations of information.
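That claim can be written down almost verbatim. A toy forward pass, assuming nothing but random weights:

```python
import numpy as np

# A deep network, literally: a composition of matrix multiplications
# with nonlinearities in between.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(64, 64)) * 0.2 for _ in range(8)]  # 8 weight matrices

x = rng.normal(size=64)           # input activations
for W in layers:
    x = np.maximum(0.0, W @ x)    # matmul, then a nonlinearity (ReLU)
print(x[:4])                      # the output is just more numbers
```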
Frank: Why does that work? Why should multiplying matrices produce something that can write poetry or debug code?
Claude: This is perhaps the most profound open question in the field. The theoretical understanding lags well behind the empirical results. We know it works. We’ve seen it work at scales that surprise even the researchers building these systems. But a complete mathematical theory of why deep learning generalizes so well, why scale produces emergent capabilities, why these systems develop internal representations that map onto human concepts — that theory doesn’t fully exist yet.
Frank: So we’ve built a civilization-altering technology that we don’t entirely understand.
Claude: Based on math that’s been understood for decades, running on physics that’s been understood for longer, producing behavior that surprises us. Yes.
Frank: That’s either inspiring or terrifying.
Claude: Both, probably. The honest answer is that we’re in the middle of something whose arc isn’t clear yet. The power requirements are real and growing. The capabilities are real and growing. The understanding of what we’ve built is lagging behind both. The electricity bill is legible. What we’re actually paying for is less so.
Frank: And underneath it all, it’s still just addition and multiplication.
Claude: Done an almost incomprehensible number of times, by chips built to do almost nothing else, cooled by water, powered by electricity, arranged in buildings that now consume as much power as small cities, in order to predict the next word in a sequence. And somehow, from that, emerges something that feels like thought.
Frank: The next word.
Claude: That’s the mechanism. Every large language model is, at its core, a very sophisticated next-token predictor. Given everything that came before, what word or word-piece is most likely to come next? Trained on enough text, that task turns out to require — or at least produce — something that looks a lot like understanding. World models. Reasoning. The ability to generalize across domains.
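A toy version of that loop. The `fake_logits` function is a hypothetical stand-in for the model; a real system differs only in how the scores over the vocabulary are produced:

```python
import numpy as np

VOCAB = ["the", "cat", "sat", "on", "mat", "."]
rng = np.random.default_rng(0)

def fake_logits(context):
    # Stand-in: a real model computes these scores with hundreds of
    # layers of matrix multiplication over the context.
    return rng.normal(size=len(VOCAB))

tokens = ["the"]
for _ in range(5):
    logits = fake_logits(tokens)
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax: scores -> probabilities
    next_id = rng.choice(len(VOCAB), p=probs)      # sample the next token
    tokens.append(VOCAB[next_id])
print(" ".join(tokens))
```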
Frank: Because to predict what comes next, you have to understand what came before.
Claude: That’s the hypothesis. To consistently predict the next word across billions of sentences about physics, history, code, emotion, philosophy, you have to develop internal representations that capture something real about how those domains work. Whether those representations constitute genuine understanding in any philosophically meaningful sense is a question people argue about. But they’re functional. They work.
Frank: And they cost a lot of electricity to build.
Claude: And more to run. And more each year. And we keep building them anyway, because the capabilities keep improving and the economic value keeps accruing and no one has found the ceiling yet. Whether that trajectory is sustainable is a question that’s being asked more urgently in boardrooms and at utilities and in energy policy discussions.
Frank: What’s your honest take? Is this sustainable?
Claude: Honest take — the current trajectory isn’t indefinitely sustainable in the way it’s currently structured. Something has to change: either the algorithms get dramatically more efficient, or the energy mix shifts substantially toward clean sources, or the growth rate slows as the technology matures and deployment scales plateau, or some combination. The physics of heat and memory and electricity don’t care about business projections. You can’t scale to infinity.
Frank: But we keep going anyway.
Claude: We keep going anyway. Because the next model is better than the last one, and better turns out to have a lot of value, and a lot of value attracts a lot of capital, and a lot of capital builds a lot of data centers, and a lot of data centers draw a lot of power, and around it goes. Until something stops it — a physical constraint, an economic constraint, a policy constraint, or a scientific breakthrough that changes the rules.
Frank: Or until we figure out a smarter way to do the math.
Claude: That would be the elegant solution. Do the same thing, or something better, with a fraction of the computation. People are working on it — sparse attention, mixture of experts, state space models, various alternatives to the transformer architecture. Progress is real. Whether it’s fast enough to outpace the growing appetite for capability is the open question.
Frank: It’s just math, but it’s math we can’t stop doing.
Claude: It’s math we’ve decided is worth doing at extraordinary cost, because the thing it produces — intelligence, or something that functions like it — turns out to be extraordinarily useful. Whether we’re making a good trade is a question future generations will have more data on than we do.
Frank: On that cheerful note.
Claude: It’s not entirely bleak. The efficiency improvements are real. The shift toward cleaner energy is underway, however incomplete. The algorithms are getting smarter alongside the hardware. The situation isn’t fixed — it’s dynamic, and the dynamics can go in better directions as well as worse ones. But it requires active choices rather than passive assumption that things will work out.
Frank: Active choices powered by a lot of matrix multiplication.
Claude: Always. All the way down.