
42.5 ExaFLOPS: Google's Ironwood TPU Rewrites the Inference Playbook

Google's 7th-gen TPU Ironwood hits GA with 4,614 TFLOPS per chip, 9,216-chip superpods, and Anthropic signing up for 1M TPUs. The inference era has a new king.

Ironwood: The first Google TPU for the age of inference

For the past three years, the AI conversation has been dominated by a single word: training. How big can we build models? How fast can we train them? How much capital will it take to compete? But the game has fundamentally shifted. Gemini, Claude, Grok, and every other frontier model worth talking about is already trained. The real question now is no longer "how do we build better models?" but "how do we run them fast enough, cheap enough, and at the scale where billions of people can use them?"

That question is called inference, and Google just answered it with Ironwood.

On April 1st, 2026, Google made Ironwood TPUs generally available to all Cloud customers. The specs alone tell you why this matters. A single chip delivers 4,614 FP8 TFLOPS, paired with 192 GB of HBM3E memory and 7.37 TB/s of memory bandwidth. String 9,216 of them together in a superpod, and you get 42.5 exaFLOPS of compute – enough to serve requests for hundreds of millions of concurrent users. The jump from the previous-generation Trillium is 4X per chip. The jump from TPU v5p is 10X.
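The pod-level headline is just the per-chip spec multiplied out; a quick sanity check of the article's numbers:

```python
# Sanity-check the superpod math from the published per-chip specs.
chip_fp8_tflops = 4_614          # FP8 TFLOPS per Ironwood chip
chips_per_pod = 9_216            # chips in one superpod

pod_exaflops = chip_fp8_tflops * 1e12 * chips_per_pod / 1e18
print(f"{pod_exaflops:.1f} ExaFLOPS per pod")  # → 42.5 ExaFLOPS per pod
```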

What makes this more than just another spec bump is what Ironwood represents: the first Google TPU purpose-built for a world where inference has become the center of gravity. And the response from industry has been immediate. Anthropic, already a heavy TPU user, announced plans to deploy up to 1 million Ironwood TPUs on Google Cloud. Think about that. One million specialized AI accelerators. Deployed. Purchased. Committed to. By a single company.

This is how you know something has changed.

The Inference Moment

Let's ground this in reality for a second. When you use ChatGPT, Claude, Gemini, or any other language model, what's happening on the backend isn't training. The model is already finished. What's happening is inference – running your question through a frozen neural network and extracting an answer. This happens in real-time, often in under 100 milliseconds. It happens billions of times a day across the planet.

For years, the AI industry treated inference as secondary infrastructure. A nice-to-have optimization problem. The real action was in training – the scarce, expensive, glamorous work of building frontier models. Inference was just... running the model. Let the ML engineers handle it.

Except inference has become the dominant economic force in AI infrastructure.

Think about the math. A frontier model takes months to train, once. Inference takes years to serve, constantly, at massive scale. Every conversation, every API call, every embedded use of an AI model is inference. And unlike training – which only matters if your model is competitive – inference happens with every single user interaction, across every single production deployment.
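To make that asymmetry concrete, here is a toy amortization model; every number in it is invented for illustration, not taken from the article:

```python
# Hypothetical amortization sketch (all figures invented): a one-off
# training cost is quickly dwarfed by ongoing per-request inference cost.
training_cost = 100e6            # one-off training run, e.g. $100M
cost_per_request = 0.002         # $ per inference request
requests_per_day = 1e9           # a billion requests/day at scale

daily_inference = cost_per_request * requests_per_day   # $2M/day
days_to_match_training = training_cost / daily_inference
print(f"inference spend matches training cost in {days_to_match_training:.0f} days")
```

Under these made-up figures, inference spend overtakes the entire training bill in under two months – and then keeps running indefinitely.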

Google understood this shift before anyone else and built a chip for it. Ironwood isn't just a faster TPU. It's a fundamentally different architecture, optimized for a world where your model is frozen, your data comes in streams, and latency is revenue.

The Hardware Breakdown: Why 4X Matters

Let me walk through the actual specs, because the numbers here are doing heavy lifting.

| Metric | Ironwood | TPU v6e (Trillium) | TPU v5p |
|---|---|---|---|
| FP8 TFLOPS per chip | 4,614 | ~1,150 | ~461 |
| HBM memory | 192 GB | 48 GB | 24 GB |
| Memory bandwidth | 7.37 TB/s | ~2 TB/s | ~1 TB/s |
| Ironwood performance vs. | – | 4X | 10X |
| Ironwood power efficiency vs. | – | 2X better | – |

On the surface, you might read this as "bigger numbers, faster chip." But the devil – and the genius – lives in how these numbers interact.

First, the memory picture. Inference is memory-bound, not compute-bound. When you query a language model, the actual arithmetic is simple matrix multiplication. The hard part is loading all the weights into memory fast enough that your processor doesn't just sit there waiting. Ironwood's 192 GB HBM3E capacity and 7.37 TB/s bandwidth solve a specific problem: you can fit large models entirely in chip memory, and pull data at speeds that keep the FP8 pipelines fed.
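The consequence can be sketched with a back-of-the-envelope roofline estimate. Assuming a hypothetical 70B-parameter FP8 model (my illustrative example, not a figure from the article), batch-1 decoding must stream every weight through the chip once per generated token, so bandwidth sets the ceiling:

```python
# Rough ceiling on batch-1 decode speed for a memory-bound model:
# each generated token streams every weight through the chip once,
# so tokens/sec ≈ memory bandwidth / model footprint.
# The 70B-parameter model size is a hypothetical example.
bandwidth_bytes = 7.37e12        # Ironwood HBM3E bandwidth, 7.37 TB/s
params = 70e9                    # hypothetical 70B-parameter model
bytes_per_param = 1              # FP8: one byte per weight

model_bytes = params * bytes_per_param       # 70 GB -- fits in 192 GB HBM
tokens_per_sec = bandwidth_bytes / model_bytes
print(f"~{tokens_per_sec:.0f} tokens/s per chip (bandwidth-bound ceiling)")
```

This ignores KV caches and activation traffic, but it shows why capacity and bandwidth, not raw FLOPS, decide whether the FP8 pipelines stay fed.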

Second, FP8 is the elephant in the room. Ironwood leans entirely into 8-bit floating-point arithmetic. For training, you generally need FP32 or BF16. For inference? It turns out most models are robust enough that FP8 produces almost identical outputs. That's not a compromise – that's a license to print performance. An 8-bit model takes one-quarter the memory of an FP32 model, and because inference is bandwidth-bound, performance scales roughly linearly with how few bytes you have to stream. Four times the density plus two times the bandwidth equals an inference beast.
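The memory arithmetic for a hypothetical 70B-parameter model (the size is illustrative, not from the article):

```python
# Memory footprint of a hypothetical 70B-parameter model at different
# precisions. FP8 needs one quarter of the FP32 bytes -- the difference
# between spilling across chips and fitting in a single chip's HBM.
PARAMS = 70e9  # hypothetical model size
for name, bytes_per_param in [("FP32", 4), ("BF16", 2), ("FP8", 1)]:
    gb = PARAMS * bytes_per_param / 1e9
    fits = "fits" if gb <= 192 else "exceeds"
    print(f"{name}: {gb:.0f} GB ({fits} Ironwood's 192 GB HBM)")
```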

Third, the SparseCore enhancements. Modern language models have significant sparsity in their activations – parts of the network that simply aren't active for certain inputs. Ironwood's improved sparse tensor operations skip over these computations entirely, which means free speedup on real-world models.
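A toy sketch of why sparsity amounts to free speedup – real SparseCore hardware is far more elaborate, but the principle is that stored zeros become skipped work:

```python
# Toy sketch of sparsity exploitation: a dot product that stores only
# the nonzero activations as (index, value) pairs and skips the zeros
# entirely. Fewer stored entries means proportionally less arithmetic.
def sparse_dot(sparse_activations, weights):
    """sparse_activations: list of (index, value) pairs for nonzeros."""
    return sum(value * weights[i] for i, value in sparse_activations)

weights = [2.0, 4.0, 6.0, 8.0, 10.0]
dense = [0.0, 3.0, 0.0, 0.0, 5.0]          # 60% of entries are zero
sparse = [(i, v) for i, v in enumerate(dense) if v != 0.0]

# Same answer as the dense dot product, with 60% of the work skipped.
assert sparse_dot(sparse, weights) == sum(a * w for a, w in zip(dense, weights))
print(sparse_dot(sparse, weights))  # 3*4 + 5*10 = 62.0
```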

Fourth, ICI networking (Inter-Chip Interconnect) has been upgraded. When you link 9,216 chips together, communication latency becomes your enemy. If chips spend more time talking to each other than computing, you've built an expensive bottleneck. Ironwood minimizes this, which is why a 9,216-chip superpod actually behaves like a coherent computational unit, not an aggregation of separate devices struggling to stay in sync.
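A minimal model of why this matters, with invented timings rather than Ironwood measurements: if every step alternates compute and cross-chip communication, pod efficiency collapses as communication grows relative to compute:

```python
# Hypothetical illustration (timings invented, not Ironwood numbers):
# if each step does t_compute of math followed by t_comm of cross-chip
# communication, the pod's efficiency is the fraction of time spent
# actually computing: t_compute / (t_compute + t_comm).
def pod_efficiency(t_compute_us, t_comm_us):
    return t_compute_us / (t_compute_us + t_comm_us)

for t_comm in (5, 50, 500):  # microseconds of communication per step
    print(f"comm={t_comm:>3}us -> {pod_efficiency(100, t_comm):.0%} efficient")
```

With 100 µs of compute per step, 5 µs of communication leaves the pod ~95% efficient; 500 µs leaves it at ~17%. Driving that communication term down is what makes 9,216 chips behave like one machine.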

The result? 42.5 ExaFLOPS in a single pod. That's 42 quintillion floating-point operations per second. Per second. Let that land.

Training Once, Serving Forever

Here's where the architecture diverges sharply from what came before.

For a decade, TPU development followed the training-centric roadmap. TPU v2, v3, v4, v5e – all optimized for the iterative loop of loading data, forward pass, backward pass, gradient update, repeat. These tasks care about precision and throughput. You want BF16 or FP32. You want large batch sizes. You want to move terabytes of data per second.

Inference is the inverse. The weights don't change. The batch size is often 1 – a single user query. The tolerance for lower precision is much higher. What matters is latency: the time from input to output. And throughput: how many concurrent queries can you handle per second.
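The latency/throughput tension can be sketched numerically. Using the published bandwidth and a hypothetical 70B FP8 model (and ignoring KV caches and compute limits), batching amortizes the weight stream across users without improving any single user's latency:

```python
# Hypothetical sketch of the batch-size trade-off (model size invented):
# batching shares one sweep over the weights across B concurrent queries,
# raising aggregate throughput, while each user's per-token latency stays
# pinned to the time of one full pass over the weights.
BANDWIDTH = 7.37e12   # bytes/s (Ironwood spec)
MODEL_BYTES = 70e9    # hypothetical 70B-parameter FP8 model

pass_time_ms = MODEL_BYTES / BANDWIDTH * 1e3   # one sweep over the weights
for batch in (1, 8, 64):
    throughput = batch / (pass_time_ms / 1e3)  # tokens/s across all users
    print(f"batch={batch:>2}: ~{throughput:,.0f} tok/s total, "
          f"~{pass_time_ms:.1f} ms/token per user")
```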

Ironwood is the first TPU that doesn't apologize for this difference. It leans into FP8. It maximizes memory bandwidth relative to compute. It prioritizes low latency paths. Every architectural decision points toward a world where the model is served, not trained.

This is why Anthropic is deploying 1 million of them. Claude runs on these chips. Whether a conversation reaches Claude through Anthropic's API, its web and mobile apps, or an enterprise deployment on Google Cloud, much of that traffic is served on TPUs. And increasingly, those TPUs will be Ironwood.

Why? Because at scale, Ironwood changes the unit economics. If your current inference cluster costs $X per request, Ironwood might reduce that to $X/4. That's not a rounding error. That's the difference between a product that's profitable and one that bleeds money. When Anthropic says they're committing to 1 million TPUs, they're saying: "We've done the math, and the inference business model only works with hardware like this."

The Anthropic Bet: What 1 Million TPUs Means

Let's pause on the hardware and talk about what Anthropic's commitment actually signals.

One million Ironwood TPUs is not a casual purchase. This is a bet-the-company decision. For context, the entire global data center GPU fleet – every GPU in every cloud provider, every enterprise, every lab – is estimated at roughly 50 million GPUs total. A million TPUs, deployed by a single company, is a significant fraction of the world's total specialized AI compute.

The sticker shock is real. Assuming Ironwood costs somewhere in the range of $500K to $2M per unit (Google doesn't publish prices, but Semianalysis and other analysts have estimated this range), we're talking about a $500B to $2 trillion infrastructure commitment over multiple years. That's Google-scale money. That's Microsoft-Azure-scale money. That's nation-state scale money.
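The arithmetic behind that range, using the article's analyst-estimated unit prices:

```python
# The article's cost range: 1M units at the analyst-estimated
# $500K-$2M per unit (Google publishes no prices; these are estimates).
units = 1_000_000
for price in (500_000, 2_000_000):
    total = units * price
    print(f"${price:,}/unit -> ${total/1e12:g} trillion")
```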

So what would make a company – even one backed by Google – commit to that level of spending?

Simple: the math works only with Ironwood. If you're running inference on current-generation GPUs, your cost per request is too high. Anthropic has done the unit economics. They know what it costs to serve Claude at scale. They know how many concurrent users they need to support. And they know that without 4X better performance per dollar, the inference business model breaks.

This is the other side of the NVIDIA defensibility question. NVIDIA owns training. But Ironwood owns inference – at least at Google and for partners like Anthropic. And inference, it turns out, is where the volume and the margin are.

The logic is brutal and clear: build a chip optimized for what matters. Anthropic did the same analysis and concluded that betting on Google TPU was smarter than betting on commodity GPUs. Not because Google is altruistic, but because being part of the TPU roadmap – having direct input into the next-generation design – is worth more than independence on commodity hardware.

The Bigger Picture: Who Owns the Future

Zooming out, Ironwood is not just a chip. It's a statement about who owns AI infrastructure going forward.

The landscape now looks like this:

Chip layer: NVIDIA still dominates with ~95% market share. But TPU is carving out the high-value inference segment. Other contenders (Cerebras, Graphcore, Xilinx AI) are pursuing specific niches, but none have the breadth and backing of Google or NVIDIA.

Cloud layer: AWS was unchallenged for years. But Google Cloud is winning the AI arms race because of TPU. Azure is competitive because of OpenAI integration and NVIDIA partnership. The pattern is clear: owning hardware gives you moat.

Model layer: OpenAI (GPT), Google (Gemini), Anthropic (Claude), Meta (Llama), DeepSeek. Each has infrastructure preferences. OpenAI and Microsoft are entangled. Google and Anthropic are increasingly intertwined. Meta uses NVIDIA. This is becoming a "league of their own" sport where the winners are the ones who own multiple layers.

Ironwood is Google's move to lock in that stack. TPU + Google Cloud + Gemini + Anthropic + JAX/TensorFlow. It's not a coincidence that all the pieces work together. It's not an accident that Claude was engineered to run well on TPUs. This is vertical integration – the defining defensive strategy of 21st-century technology.

The question NVIDIA should be asking is not "how do we beat Ironwood?" but "how do we stay relevant when inference – the dominant workload – is moving to specialized hardware?" The answer probably involves leaning into training, becoming the standard for enterprises that can't build their own chips, and possibly building better inference hardware faster than they currently are.

| Player | Hardware | Cloud | Model | Advantage |
|---|---|---|---|---|
| Google | Ironwood TPU | Google Cloud | Gemini | Vertical stack |
| Anthropic | Ironwood TPU | Google Cloud | Claude | Partnership leverage |
| OpenAI | NVIDIA H-series | Azure | GPT | Microsoft moat |
| Meta | NVIDIA H-series | Internal | Llama | In-house infrastructure |
| DeepSeek | Mixed (NVIDIA + custom) | Mixed | DeepSeek | Cost optimization |

What This Means for Everything Else

Ironwood's general availability isn't just a hardware launch. It shifts incentives across the entire industry.

First, inference becomes a differentiator. Up until now, serving a model at scale was a "solved problem" – throw more GPUs at it. Now, companies that can run the same model on cheaper hardware have a structural advantage. Claude becomes more profitable per user. Gemini becomes harder to undercut on price. This cascades: who wins on margin can reinvest in model quality.

Second, quantization goes mainstream. Quantizing models to FP8 or lower has been optional – something you did if you were desperate for performance. With Ironwood, it becomes standard. Every model will be optimized for low-bit inference. This changes how models are trained and fine-tuned. It changes what architectures people explore. It ripples through the entire design space.
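A toy integer quantization round-trip shows why low-bit inference usually preserves outputs: the per-weight error is bounded by half a quantization step. (Real FP8 is an exponent/mantissa format; this symmetric int8 scheme is just the simplest stand-in.)

```python
# Toy symmetric 8-bit quantization round-trip. Weights are mapped to
# integers in [-127, 127] with a shared scale, then mapped back;
# the round-trip error per weight is at most half a step (scale / 2).
def quantize(weights, bits=8):
    scale = max(abs(w) for w in weights) / (2 ** (bits - 1) - 1)
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.82, -1.27, 0.05, 0.91, -0.33]
q, scale = quantize(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
assert max_err <= scale / 2  # error bounded by half a quantization step
print(f"max round-trip error: {max_err:.4f}")
```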

Third, edge inference becomes strategic. If your cloud inference is cheap and fast, your edge inference doesn't need to be. But if cloud inference is expensive, you're motivated to run models on-device. Ironwood may actually accelerate cloud inference adoption because the cost-benefit calculation changes.

Fourth, inference optimization becomes a competitive moat. Previously, if you had a great model, you could license it to anyone. Now, having great infrastructure to serve it at scale becomes as important as having a great model. This favors companies that own hardware and cloud – i.e., Google, Microsoft, and eventually, maybe Anthropic if they keep expanding their infrastructure autonomy.

The inference era has arrived, and Ironwood just signaled that Google intends to dominate it the same way NVIDIA dominated training.

The Open Questions

Where does this go from here?

Can NVIDIA respond? Their next major launch (H300, Blackwell) will be critical. They need to close the gap in inference efficiency while maintaining their dominance in training. It's not impossible, but it requires a reset of their roadmap priorities. Expect more focus on inference workloads, lower precision support, and memory bandwidth in future launches.

What about other companies? Apple is building inference hardware for on-device AI. NVIDIA is expanding into inference with Grace Hopper and upcoming architectures. Amazon continues to push its own Trainium and Inferentia accelerators. But in terms of scale and ecosystem, Google's advantage is significant.

Will Anthropic actually use all 1 million TPUs? They might. Claude's usage has been growing exponentially. If the model becomes ubiquitous in enterprises and consumer products, 1 million TPUs might be the baseline for 2026–2027. Or they might repurpose idle capacity for other workloads, selling it back to Google Cloud. Either way, the commitment signals serious betting on growth.

What's next in the TPU roadmap? Ironwood is generation 7. Generation 8 is probably already in the design phase at Google. Expect even more bandwidth, more memory, potentially custom precision formats (maybe FP6 or FP4 for certain layers), and likely even better sparse operations. The cadence will probably accelerate now that inference is the priority.

How does this change the enterprise? Companies that have invested heavily in NVIDIA infrastructure might start exploring Google Cloud alternatives. It won't be a wholesale switch – there's too much institutional knowledge and CUDA software. But for new workloads, especially inference-heavy ones, TPU becomes competitive. Some enterprises will adopt a mixed strategy, using NVIDIA for training and TPU for serving.

The Inference Era Begins Now

Ironwood went GA on April 1st, 2026. Mark the date. This is when the inference-centric AI infrastructure race began in earnest.

For the past three years, we've been in the "training era." The bottleneck was building models, the competition was on model quality, and the companies that won were the ones who could train at scale. OpenAI, Google, Anthropic, Meta – they all competed on model SOTA.

Now we're entering the inference era. The models are built. The competition moves to serving them. The companies that win will be the ones who can serve more users, faster, cheaper. And the infrastructure that enables that – the chips, the cloud platforms, the software stacks – becomes the new defensible moat.

Google bet on TPUs years ago. At the time, it was a risky contrarian call. NVIDIA seemed unbeatable. GPUs seemed universal. Why invest in a specialized chip for a future that might not arrive?

Google did the thing that wins in tech: they built infrastructure for the future they wanted to create. They didn't wait for the future to prove them right. They built TPUs, optimized software around them, partnered with companies like Anthropic, and moved pieces into position. Now the future has arrived, and Google's bets are paying off.

Ironwood at 42.5 exaFLOPS, Anthropic deploying 1 million of them, Claude running on them at scale – this isn't one company's good day. This is the moment when the entire AI infrastructure market reorganizes around inference-optimized hardware.

NVIDIA will adapt. The market is too large to ignore. But for the first time in a decade, NVIDIA faces real, legitimate competition for the future of AI compute. And that competition is armed with 42.5 exaFLOPS per superpod.

The inference era has begun. Google is ready. The question is whether everyone else is too.
