Helpful Mindset Shifts for GPU Users Looking at TPUs

Disclaimer: The views and opinions expressed in this post are my own and do not necessarily reflect the official policy or position of my employer.

As someone who’s worked with TPUs for a while, I’ve seen a common pattern in people coming from the GPU world: they’ll take a look at a TPU chip’s specs, then look at a comparable GPU, and see numbers for the GPU chip that are obviously higher and scratch their head. Let’s look at TPU v5p (the training-optimized chip released in 2023) compared to the Nvidia H200 (the flagship AI GPU also released in 2023):

| Metric | TPU v5p | H200 |
|---|---|---|
| HBM capacity | 95 GB | 141 GB |
| HBM bandwidth | 2.8 TB/s | 4.8 TB/s |
| Peak BF16 FLOPS | ~459 TFLOPS | ~990 TFLOPS |

Note: The gap in per-chip specs has been closing with the more recent TPU v7 generation, but that wouldn't illustrate my point as well. :)

The key insight here is to understand the greater context and design origins of each chip: the GPU was traditionally meant to be sold as an individual chip and needed to be flexibly installed into all kinds of different hardware. On the other hand, the TPU was designed from the beginning to be one component in a massive vertically integrated ML hardware stack.

The better comparison is between GPU clusters and TPU pods. A slice is a subset of a TPU pod in which all chips are interconnected via high-bandwidth ICI (Inter-Chip Interconnect). You can select from various slice topologies: from a single chip (1x1), to 16 chips (4x4), up to hundreds or even thousands of chips depending on the TPU version. So to extend the above example, the more apt comparison for training workloads would be 8 H200s vs. a 4x4 TPU v5p slice:

| Metric | TPU v5p 4x4x1 slice | H200 x8 cluster |
|---|---|---|
| HBM capacity | 1.5 TB | 1.1 TB |
| Interconnect bandwidth (per chip) | 1,200 GB/s (ICI) | 900 GB/s (NVLink) |
| Peak BF16 FLOPS | ~7.3 PFLOPS | ~7.9 PFLOPS |
| On-demand price | ~$67/hr | ~$80/hr |
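As a sanity check, the cluster-level numbers in this table follow directly from the per-chip specs quoted earlier. A quick sketch of the arithmetic (per-chip figures from the first table; chip counts from the slice/cluster sizes above):

```python
# Sketch: deriving cluster-level totals from per-chip specs.
# Per-chip numbers come from the table earlier in the post.

def aggregate(chips, hbm_gb, bf16_tflops):
    """Scale per-chip HBM and peak BF16 compute up to a whole slice/cluster."""
    return {
        "chips": chips,
        "hbm_tb": chips * hbm_gb / 1000,            # total HBM capacity, TB
        "peak_pflops": chips * bf16_tflops / 1000,  # total peak BF16, PFLOPS
    }

tpu_v5p_4x4 = aggregate(chips=4 * 4, hbm_gb=95, bf16_tflops=459)
h200_x8 = aggregate(chips=8, hbm_gb=141, bf16_tflops=990)

print(tpu_v5p_4x4)  # 16 chips, ~1.5 TB HBM, ~7.3 PFLOPS
print(h200_x8)      # 8 chips, ~1.1 TB HBM, ~7.9 PFLOPS
```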

There are a few other things people miss when coming from the GPU world. First is that the memory architecture is different on TPUs: the matrix units are systolic arrays, meaning operands are streamed from cell to neighboring cell through the array rather than being fetched from registers for every operation. The How to Scale Your Model book has an excellent overview of how systolic arrays work. The upshot is that TPUs generally need less HBM bandwidth than GPUs for the same computation.
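To build intuition for this, here's a toy simulation of an output-stationary systolic array computing a matrix product. It's purely illustrative (a real MXU pipelines this in hardware), but it captures the key idea: each cell only ever sees operands arriving from its neighbors on a fixed schedule, with no per-operation fetches.

```python
# Toy simulation of an output-stationary systolic array computing C = A @ B.
# Inputs are skewed so that a[i][k] (flowing in from the left) and b[k][j]
# (flowing in from the top) meet at cell (i, j) on cycle i + j + k, where
# they are multiplied and accumulated in place. Illustrative only.

def systolic_matmul(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for t in range(3 * n - 2):       # cycles until the last product lands
        for i in range(n):
            for j in range(n):
                k = t - i - j        # the operand pair arriving this cycle
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(systolic_matmul(A, B))  # [[19, 22], [43, 50]]
```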

The second is networking: to scale beyond 8 H200s, you need to send data across the data center network (DCN), which is 1-2 orders of magnitude slower than ICI.

| Metric | TPU v5p (per chip) | H200 (per GPU) |
|---|---|---|
| Interconnect bandwidth | 1,200 GB/s ICI (bidirectional) | 900 GB/s NVLink (bidirectional) |
| DCN bandwidth | 6.25 GB/s (50 Gbps) | 50 GB/s (400 Gbps) |
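To make that gap concrete, here's a rough back-of-envelope sketch of how long it takes to move 1 GB between chips over each link type, using the bandwidths from the table above. Real transfers also pay latency and protocol overhead, which this ignores:

```python
# Back-of-envelope: time to move 1 GB over each link type.
# Bandwidths are from the table above; latency and overhead are ignored.

GB = 1  # payload size in gigabytes

links_gb_per_s = {
    "v5p ICI": 1200,
    "H200 NVLink": 900,
    "H200 DCN (400 Gbps)": 50,
    "v5p DCN (50 Gbps)": 6.25,
}

times_ms = {name: GB / bw * 1e3 for name, bw in links_gb_per_s.items()}
for name, t in times_ms.items():
    print(f"{name}: {t:.2f} ms")
# Intra-slice/cluster links finish in about a millisecond;
# DCN takes tens to hundreds of milliseconds for the same payload.
```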

Looking at the table above, it's clear that GPU design prioritizes DCN speed because GPUs lean on it far more often. TPUs, on the other hand, are generally used together within one slice, where every chip is connected by ICI.

TPUs have ICI baked directly into the silicon, and each chip is connected to all of its neighbors, forming a high-bandwidth 2D or 3D torus mesh. To achieve different meshes, TPU racks use optical circuit switches, meaning a TPU deployment can physically reconfigure its topology in seconds to match the ML workload – e.g., from a regular 3D torus to a twisted torus, or to scale from a single-chip slice up to a full pod of 8,960 chips (in the case of v5p).
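The wraparound links are what make the mesh a torus rather than a plain grid: every chip, including those on the faces, has the same number of neighbors. A tiny sketch of neighbor addressing in a 3D torus (the coordinate scheme here is illustrative, not real TPU addressing):

```python
# Sketch: the six ICI neighbors of a chip in a 3D torus of shape (X, Y, Z).
# The mod arithmetic implements the wraparound links, so chips on a face
# of the slice still have a full set of neighbors.
# Illustrative coordinates only, not real TPU addressing.

def torus_neighbors(coord, shape):
    x, y, z = coord
    X, Y, Z = shape
    return [
        ((x + 1) % X, y, z), ((x - 1) % X, y, z),
        (x, (y + 1) % Y, z), (x, (y - 1) % Y, z),
        (x, y, (z + 1) % Z), (x, y, (z - 1) % Z),
    ]

# Even a corner chip in a 4x4x4 slice has six neighbors, via wraparound:
print(torus_neighbors((0, 0, 0), (4, 4, 4)))
# [(1, 0, 0), (3, 0, 0), (0, 1, 0), (0, 3, 0), (0, 0, 1), (0, 0, 3)]
```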

To sum things up, the main mindset shift when working with TPUs is to zoom out from individual chip specs and look at your entire cluster of chips and interconnect as a whole, and at how it fits your specific ML workload. For production ML workloads, stats like per-chip peak FLOPS and HBM capacity take a backseat to metrics like MFU, goodput, and compute per dollar.
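As one example of those cluster-level metrics, MFU (model FLOPS utilization) is simply the useful model FLOPS your job sustains per second divided by the cluster's peak FLOPS. A sketch, reusing the 16-chip v5p numbers from earlier (the step time and per-step FLOP count below are made up for illustration):

```python
# Sketch: MFU = sustained useful model FLOP/s divided by peak hardware FLOP/s.
# The workload numbers here are invented for illustration, not a benchmark.

def mfu(model_flops_per_step, step_time_s, num_chips, peak_tflops_per_chip):
    achieved = model_flops_per_step / step_time_s       # FLOP/s actually used
    peak = num_chips * peak_tflops_per_chip * 1e12      # cluster peak FLOP/s
    return achieved / peak

# e.g. a training step doing 2.4e15 useful FLOPs in 0.8 s
# on a 16-chip v5p slice (459 peak BF16 TFLOPS per chip):
print(f"MFU: {mfu(2.4e15, 0.8, 16, 459):.1%}")  # MFU: 40.8%
```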

If you’re interested in diving deeper into how TPUs work, read the chapter How to Think About TPUs in How to Scale Your Model, or check out tinytpu.