NVIDIA B300 (Blackwell Ultra) vs H200

We're expanding our GPU fleet with two of the most significant GPUs NVIDIA has recently shipped: the Blackwell Ultra B300 and the Hopper H200, so we figured we'd take the opportunity to talk about them.

The two GPUs sit at opposite ends of a spectrum. The B300 is the new frontier-class flagship built on the Blackwell architecture, while the H200 is a proven workhorse that has become the default for production inference over the past year.

Head to Head

Spec	NVIDIA H200 (SXM)	NVIDIA B300 (Blackwell Ultra)
Architecture	Hopper (GH100 die)	Blackwell Ultra (dual-reticle)
GPU memory	141 GB HBM3e	288 GB HBM3e
Memory bandwidth	4.8 TB/s	8 TB/s
Dense FP8	~1,979 TFLOPS	Higher (FP4-optimized)
Dense FP4	—	~15 PFLOPS
FP64	Strong (HPC-capable)	Minimal (de-prioritized)
Power (TGP)	700W	1,400W
Cooling	Air or liquid	Liquid-oriented
Best for	Cost-efficient production inference, ≤100B models, FP64 HPC	Frontier-scale inference, long-context reasoning, MoE, agentic workloads

The H(Hopper)200:

The H200 was shipped as the H100's big brother, with an upgraded memory subsystem and the same core architecture (Hopper). The H200 ships with 141 GB of HBM3e and 4.8 TB/s of memory bandwidth, a significant step up from the H100's 80 GB of HBM3 and 3.35 TB/s (roughly a 76% jump in capacity and a 43% jump in bandwidth).
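The percentages above are plain ratios of the quoted specs. A quick sketch to reproduce them (the helper name `pct_uplift` is ours, purely for illustration):

```python
def pct_uplift(old: float, new: float) -> float:
    """Percentage increase going from `old` to `new`."""
    return (new / old - 1.0) * 100.0

# Spec values as quoted in this post
h100 = {"memory_gb": 80, "bandwidth_tbs": 3.35}
h200 = {"memory_gb": 141, "bandwidth_tbs": 4.8}

for key in h100:
    print(f"{key}: +{pct_uplift(h100[key], h200[key]):.0f}%")
# memory_gb: +76%
# bandwidth_tbs: +43%
```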

Because the compute die is unchanged from the H100, raw compute is effectively identical at around 1,979 TFLOPS of dense FP8. The H200's advantage lies in workloads gated by memory rather than computation. A good example is a 70B-parameter model, which at FP16 needs roughly 140 GB for weights alone. That exceeds a single H100's 80 GB, so it takes at least 2x H100s, whereas the weights fit on 1x H200 (just barely, with almost no headroom left for the KV cache). The same logic applies to autoregressive decoding (the token-by-token generation that dominates inference latency), which is largely memory-bandwidth-bound, so the H200's 43% bandwidth bump translates fairly directly into higher effective throughput.
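To make the memory argument concrete, here is a back-of-the-envelope sketch. It is a simplification, not a benchmark: it counts weights only (no KV cache, activations, or runtime overhead), treats 1 GB as 1e9 bytes, and models batch-size-1 decoding as streaming every weight from HBM once per token. The function names are ours.

```python
import math

def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weights-only footprint in GB (billions of params x bytes per param)."""
    return params_billion * bytes_per_param

def min_gpus_for_weights(weights_gb: float, gpu_memory_gb: float) -> int:
    """Fewest GPUs whose combined memory can hold the weights."""
    return math.ceil(weights_gb / gpu_memory_gb)

def decode_tokens_per_s_ceiling(weights_gb: float, bandwidth_tbs: float) -> float:
    """Rough upper bound on batch-1 decode speed: bandwidth / bytes read per token."""
    return bandwidth_tbs * 1000.0 / weights_gb

fp16_70b = weight_gb(70, 2)  # 70B params at 2 bytes each -> 140 GB
print(min_gpus_for_weights(fp16_70b, 80))   # H100: 2
print(min_gpus_for_weights(fp16_70b, 141))  # H200: 1

# Use an FP8 copy (70 GB) so the model fits on either GPU for a fair comparison
fp8_70b = weight_gb(70, 1)
h100 = decode_tokens_per_s_ceiling(fp8_70b, 3.35)
h200 = decode_tokens_per_s_ceiling(fp8_70b, 4.8)
print(f"ceiling ratio H200/H100: {h200 / h100:.2f}x")
```

Real throughput lands below these ceilings and depends on batch size, kernels, and the serving stack, but the ratio between the two GPUs tracks the bandwidth ratio.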

This is what makes the H200 the golden child for production inference on models up to the ~70-100B range, long-context serving with large KV caches, and large-batch training.

The B(Blackwell)300:

The B300 (officially Blackwell Ultra) is built on a different architecture. It uses a dual-reticle design (two compute dies in a single GPU, with 208 billion transistors and 160 streaming multiprocessors), and it's unmistakably optimized for inference at scale.

The jump from the B200 to the B300 looks very similar to the jump from the H100 to the H200: 288 GB of HBM3e (built from 12-high stacks, versus the B200's 8-high), 8 TB/s of memory bandwidth, and roughly 15 petaFLOPS of dense FP4 compute per GPU. That is roughly a 50% memory jump compared to the B200, along with a meaningful throughput uplift.
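The stack-height difference maps onto capacity with simple multiplication. The sketch below assumes 8 HBM3e stacks per GPU and 3 GB (24 Gb) per DRAM die; neither figure is stated above, so treat both as our assumptions.

```python
def hbm_capacity_gb(stacks: int, dies_per_stack: int, gb_per_die: float) -> float:
    """Total HBM capacity from stack count, stack height, and per-die capacity."""
    return stacks * dies_per_stack * gb_per_die

STACKS = 8        # assumption: HBM3e stacks per GPU
GB_PER_DIE = 3.0  # assumption: 24 Gb DRAM dies

b200 = hbm_capacity_gb(STACKS, 8, GB_PER_DIE)   # 8-high stacks
b300 = hbm_capacity_gb(STACKS, 12, GB_PER_DIE)  # 12-high stacks
print(b200, b300, f"+{(b300 / b200 - 1) * 100:.0f}%")
# 192.0 288.0 +50%
```

Under these assumptions the 12-high configuration works out to 288 GB, 1.5x the 8-high one.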

The significance of the B300 lies in its 288 GB of memory (roughly double that of an H200), which allows models, KV caches, and activations that would otherwise need multi-GPU sharding to run on a single GPU.
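A rough way to see what the extra memory buys is to estimate how many tokens of KV cache fit next to the weights. The sketch below uses the standard per-token KV size for a transformer (keys plus values, per layer, per KV head). The model shape (80 layers, 8 KV heads, head dimension 128, FP16 cache) is a hypothetical 70B-class configuration chosen for illustration, and the estimate ignores activations and runtime overhead.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV cache bytes per token: 2x for keys and values, per layer, per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def max_cached_tokens(gpu_memory_gb: float, weights_gb: float,
                      per_token_bytes: int) -> int:
    """Tokens of KV cache that fit in the memory left after weights (1 GB = 1e9 bytes)."""
    free_bytes = (gpu_memory_gb - weights_gb) * 1e9
    return max(0, int(free_bytes // per_token_bytes))

# Hypothetical 70B-class model shape, FP16 weights (about 140 GB)
per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
for name, mem_gb in [("H200", 141), ("B300", 288)]:
    print(name, max_cached_tokens(mem_gb, 140, per_token), "tokens of KV cache")
```

With these numbers the H200 has room for only a few thousand cached tokens beside FP16 weights, while the B300 has room for several hundred thousand, which is the single-GPU headroom the paragraph above describes.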

One notable caveat is that the B300 is weighted heavily toward low-precision formats like NVFP4 and FP8 and toward attention-style workloads, which comes with dramatically weaker FP64 performance compared to the Hopper series. To us, this choice signals that NVIDIA is building this chip for transformers and inference, not traditional double-precision scientific HPC.
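Part of the appeal of low-precision formats is that fewer bytes per weight means a larger model fits in the same memory. A minimal sketch, assuming nominal sizes of 2, 1, and 0.5 bytes per parameter for FP16, FP8, and FP4 (this ignores the small per-block scaling overhead that formats like NVFP4 carry) and a hypothetical 20% of memory reserved for KV cache and activations:

```python
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "FP4": 0.5}  # nominal sizes, no scale overhead

def max_params_billion(gpu_memory_gb: float, fmt: str,
                       reserve_fraction: float = 0.2) -> float:
    """Largest weights-only model (billions of params) that fits after the reserve."""
    usable_gb = gpu_memory_gb * (1.0 - reserve_fraction)
    return usable_gb / BYTES_PER_PARAM[fmt]

for gpu, mem_gb in [("H200", 141), ("B300", 288)]:
    sizes = {fmt: round(max_params_billion(mem_gb, fmt)) for fmt in BYTES_PER_PARAM}
    print(gpu, sizes)
```

The reserve fraction is a placeholder; the right value depends on context length, batch size, and the serving stack.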