There are exactly three NVIDIA datacenter GPUs that matter for AI workloads in 2026: the A100 (Ampere, 2020), the H100 (Hopper, 2022), and the H200 (Hopper refresh, 2024).
Pick wrong, and you are either burning capex on overkill or buying a memory bottleneck that throttles real-world performance.
This guide compares all three on specs, price, real cloud rates across 22+ providers, and the workloads each one actually wins. By the end, you will have a decision framework you can apply to your specific case.
Which GPU should you actually choose?
| If your workload is… | Best choice | Why |
|---|---|---|
| Training a 70B+ foundation model from scratch | H100 | FP8 Transformer Engine + 989 TFLOPS BF16 |
| Training a 7B–30B model | H100 (or A100 if existing fleet) | A100 still acceptable at this scale, despite no FP8 |
| Fine-tuning open-source models with LoRA / QLoRA | A100 80GB | More than sufficient, 60–70% cheaper |
| Long-context inference (≥ 64K context, large models) | H200 | Memory bandwidth is the bottleneck, not compute |
| High-throughput inference of small/medium models | H100 | Best compute density per dollar |
| Research, dev, exploratory work | A100 | Cheap, available, easy to acquire |
| Budget-constrained academic / startup | A100 (refurb/used) | $8–15K street price |
If you read no further, that table covers most of the decision. The rest of this article explains why, with the numbers behind it.
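If it helps to have the same framework in executable form, the sketch below encodes the table as a plain lookup. The workload keys and the `recommend` helper are illustrative names for this article, not part of any real API.

```python
# A minimal sketch of the decision table above as a lookup.
# Workload keys and the helper name are illustrative, not a real API.
RECOMMENDATIONS = {
    "train_70b_plus":         ("H100", "FP8 Transformer Engine + dense BF16 compute"),
    "train_7b_to_30b":        ("H100", "A100 acceptable if you already own a fleet"),
    "finetune_lora_qlora":    ("A100 80GB", "more than sufficient, 60-70% cheaper"),
    "inference_long_context": ("H200", "memory bandwidth is the binding constraint"),
    "inference_small_medium": ("H100", "best compute density per dollar"),
    "research_dev":           ("A100", "cheap, available, easy to acquire"),
    "budget_constrained":     ("A100 (used/refurb)", "$8-15K street price"),
}

def recommend(workload: str) -> str:
    gpu, why = RECOMMENDATIONS[workload]
    return f"{gpu} -- {why}"

print(recommend("inference_long_context"))  # H200 -- memory bandwidth is the binding constraint
```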
Specifications side-by-side
Compute alone does not tell you what to buy. Memory, memory bandwidth, FP8 support, NVLink, and TDP all matter, and the right answer depends on which constraint binds for your workload.
| Specification | A100 80GB SXM4 | H100 SXM5 | H200 SXM5 |
|---|---|---|---|
| Architecture | Ampere | Hopper | Hopper |
| Process node | TSMC 7nm (N7) | TSMC 4N | TSMC 4N |
| Released | 2020 | 2022 | 2024 |
| Memory | 80 GB HBM2e | 80 GB HBM3 | 141 GB HBM3e |
| Memory bandwidth | 2.0 TB/s | 3.35 TB/s | 4.8 TB/s |
| FP16 Tensor (dense) | 312 TFLOPS | 989 TFLOPS | 989 TFLOPS |
| BF16 Tensor (dense) | 312 TFLOPS | 989 TFLOPS | 989 TFLOPS |
| FP8 Tensor (dense) | not supported | 1,979 TFLOPS | 1,979 TFLOPS |
| FP8 Tensor (sparse) | not supported | 3,958 TFLOPS | 3,958 TFLOPS |
| TDP | 400W | 700W | 700W |
| NVLink bandwidth | 600 GB/s | 900 GB/s | 900 GB/s |
| Form factor | SXM4 (8-GPU HGX) | SXM5 (8-GPU HGX) | SXM5 (8-GPU HGX) |
| Typical purchase price (2026, approx.) | $8–15K (used/refurb) | $25–30K (OEM) | $32–40K (OEM) |
| Cloud $/hr (typical, 2026) | $1.10–1.80 | $1.99–3.50 | $2.50–4.50 |
Three observations from this table matter most:
1. H100 and H200 have the same compute. The Tensor Core throughput is identical: 989 TFLOPS BF16 and 1,979 TFLOPS FP8 dense. H200 is not a faster compute card. It is a wider memory pipe.
2. H100 is 3.2× the compute of A100 at 1.75× the power. For raw training throughput on modern transformer workloads, H100 is roughly twice as power-efficient as A100 (the arithmetic is sketched after this list).
3. A100 has no FP8. If your model uses FP8 mixed precision, A100 falls back to FP16, which increases memory pressure and reduces throughput.
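The arithmetic behind points 1 and 2 is short enough to check directly. A quick sketch using only the dense BF16 and TDP figures from the spec table:

```python
# Spec ratios from the table above (dense BF16 TFLOPS and TDP).
a100 = {"bf16_tflops": 312, "tdp_w": 400}
h100 = {"bf16_tflops": 989, "tdp_w": 700}

compute_ratio = h100["bf16_tflops"] / a100["bf16_tflops"]  # ~3.17x
power_ratio = h100["tdp_w"] / a100["tdp_w"]                # 1.75x
efficiency_gain = compute_ratio / power_ratio              # ~1.81x compute per watt

print(f"{compute_ratio:.2f}x compute at {power_ratio:.2f}x power "
      f"= {efficiency_gain:.2f}x compute per watt")
```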
NVIDIA A100: the workhorse that will not disappear
The A100 launched in 2020. Six years later, it is still everywhere, and not just because the capex has long since been written down. The A100 remains a good GPU for several use cases that do not need what H100 offers.
Where A100 still wins:
- Fine-tuning open-source models. A 7B or 13B model fine-tune with LoRA or QLoRA fits comfortably in 80 GB and does not need FP8. An A100 80GB at roughly $1.20/hr in the cloud is significantly cheaper than an H100, and for many fine-tunes the hourly savings outweigh the longer wall-clock time.
- Inference of small and medium models. Llama 3 8B, Mistral 7B, and smaller fine-tunes do not saturate H100, and A100 throughput is often sufficient.
- Research and exploration. Academic and small-team work usually has real budget constraints and moderate compute requirements.
- Refurbished / secondary market. A 2-year-old A100 80GB on the secondary market can list around $8–10K. For a startup with capital constraints, that can be the difference between owning compute and not.
Where A100 has fallen behind:
- No FP8. Modern training runs often use FP8 mixed precision via NVIDIA’s Transformer Engine. A100 falls back to FP16, increasing memory pressure and reducing throughput.
- Memory bandwidth. 2.0 TB/s was strong in 2020. By 2026, it can become the binding constraint for many LLM inference workloads, especially with longer contexts.
- HBM2e generation. Newer memory technology in H100 and H200 has better density and bandwidth.
A100s are best understood as the acquired-fleet option: if you already have them, keep using them for the workloads they are good at. If you are buying GPUs from scratch in 2026 and price is not the binding constraint, you probably would not buy A100s first.
For a deeper look at the A100/H100 trade-off specifically, see A100 vs H100: Cost, Performance, and When Each Makes Sense.
NVIDIA H100: the 2026 training standard
If 2026 has a default GPU for AI training, it is the H100. Frontier-model training capacity and high-end inference capacity have largely converged around H100-class systems.
The H100 thesis in three points:
The Transformer Engine is real. FP8 training can deliver significantly higher throughput than FP16 on transformer architectures. The Transformer Engine handles mixed precision automatically, which is one of the main reasons newer training runs moved from A100 to H100.
Memory bandwidth jumped to 3.35 TB/s. HBM3 with a 5,120-bit memory bus delivers a major bandwidth increase over A100. For inference workloads where weights flow through memory continuously, this matters.
NVLink 4 at 900 GB/s. For multi-GPU training jobs, inter-GPU bandwidth matters as much as compute. The NVLink bump over A100 reduces communication bottlenecks in tensor-parallel and pipeline-parallel setups.
H100 cloud pricing reality check:
The same H100 SXM5 SKU can price very differently across providers.
| Provider tier | H100 80GB SXM5 $/hr | Notes |
|---|---|---|
| Hyperscaler on-demand | $3.50–$5.00 | Premium for ecosystem integration |
| Tier-1 specialty | $2.50–$3.50 | Common middle-of-market range |
| Long-tail and regional | $1.99–$2.50 | Where many best deals live |
| Reserved (1–3 year) | $1.50–$2.20 | Requires commitment; discounted vs on-demand |
The spread between hyperscaler on-demand and long-tail providers for the exact same hardware is one of the most important facts in AI infrastructure pricing: identical SKUs routinely price 30%+ apart even between similar providers, and 2× or more across the full market.
For OEM purchase, H100 SXM5 systems in 8-GPU HGX configurations often land in the $250–320K range from major OEMs, which works out to roughly $25–30K per GPU once the server chassis, CPUs, networking, and storage are stripped out. PCIe variants can be cheaper, but they give up NVLink bandwidth.
For a single-GPU economics breakdown, see The Real Cost of an H100 GPU.
NVIDIA H200: memory-first inference
The H200 is the GPU that confuses people. It is named like a successor, but it is not a faster compute card. The compute throughput is identical to H100. The H200 thesis is entirely about memory.
What H200 actually changes:
- 141 GB HBM3e memory vs H100’s 80 GB, a 76% capacity increase
- 4.8 TB/s memory bandwidth vs H100’s 3.35 TB/s, a 43% bandwidth increase
- Same TDP and same NVLink bandwidth as H100
The H200 exists because, for many real 2026 workloads, memory is the bottleneck, not compute.
Specifically:
Long-context inference. Running a 70B model with a 128K context window means a KV cache of tens of gigabytes per concurrent request. Memory bandwidth determines how fast that cache can be read on every token generation step. H200’s bandwidth advantage can show up directly in inference latency at long contexts (a rough sizing sketch follows below).
Large model inference at higher batch sizes. Running a 70B+ model on a single H100 leaves limited memory for batching multiple requests. H200’s extra memory lets you serve more concurrent users per GPU, which can improve throughput per dollar even though the per-token compute is the same.
Mixture-of-experts models. MoE architectures have many parameters but only activate a subset per token. The many-parameters part creates memory pressure. The active-subset part can leave compute underused. H200 is built for that asymmetry.
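To make the long-context point concrete, here is a rough KV-cache sizing sketch. The architecture values (80 layers, 8 grouped-query KV heads, head dimension 128) are assumptions for a Llama-3-70B-style model, not figures from this article.

```python
# Rough KV-cache size per request. Architecture values are assumed
# (Llama-3-70B-style: 80 layers, 8 KV heads via GQA, head_dim 128).
def kv_cache_bytes(context_len: int,
                   n_layers: int = 80,
                   n_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:  # 2 bytes = FP16/BF16 cache
    # 2x for keys and values, stored per layer, per KV head, per position.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

gib = kv_cache_bytes(context_len=128 * 1024) / 2**30
print(f"~{gib:.0f} GiB of KV cache for one 128K-context request")  # ~40 GiB
```

At roughly 40 GiB of cache per 128K request, concurrency is limited by memory long before compute, which is exactly the pressure H200’s 141 GB and 4.8 TB/s are aimed at.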
Where H200 does not help:
If your workload is compute-bound, such as training, dense models with short contexts, or small-batch inference, H200 may not justify the premium. The compute is the same as H100. You would be paying for memory you do not use.
H200 OEM pricing in 2026 sits around $32–40K per GPU. Cloud rates typically land around $2.50–4.50/hr, with the same kind of cross-provider spread you see on H100.
For more on H200 specifically, see H200 Price: What It Actually Costs in 2026 and H100 vs H200 Cost: Is the Upgrade Worth It?.
Cost per effective FLOP
Spec sheets are not enough without normalization. The comparison that actually drives buying decisions is cost per useful FLOP for your workload’s precision.
For BF16 training workloads:
| GPU | Capex (per GPU, approx.) | BF16 TFLOPS (dense) | Capex $/TFLOPS |
|---|---|---|---|
| A100 80GB | $10,000 (used/refurb) | 312 | $32 |
| H100 SXM5 | $28,000 | 989 | $28 |
| H200 SXM5 | $36,000 | 989 | $36 |
H100 wins on BF16 compute density. H200 looks worse here, but only because BF16 compute is the wrong metric for H200.
For memory-bandwidth-bound inference workloads:
| GPU | Capex | Memory bandwidth | $/(GB/s) |
|---|---|---|---|
| A100 80GB | $10,000 | 2,000 GB/s | $5.00 |
| H100 SXM5 | $28,000 | 3,350 GB/s | $8.36 |
| H200 SXM5 | $36,000 | 4,800 GB/s | $7.50 |
On memory bandwidth per dollar, H200 can beat H100. That is the metric that matters for the workloads H200 was built for.
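Both normalizations fall out of the same three capex assumptions. A minimal sketch, using the approximate figures from the tables above:

```python
# Reproduces both tables above from the same approximate capex assumptions.
gpus = {
    "A100 80GB": {"capex": 10_000, "bf16_tflops": 312, "mem_bw_gbs": 2_000},
    "H100 SXM5": {"capex": 28_000, "bf16_tflops": 989, "mem_bw_gbs": 3_350},
    "H200 SXM5": {"capex": 36_000, "bf16_tflops": 989, "mem_bw_gbs": 4_800},
}

for name, g in gpus.items():
    per_tflop = g["capex"] / g["bf16_tflops"]  # compute-bound view (training)
    per_gbs = g["capex"] / g["mem_bw_gbs"]     # bandwidth-bound view (long-context inference)
    print(f"{name}: ${per_tflop:.0f}/TFLOPS, ${per_gbs:.2f}/(GB/s)")
```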
General rule for choosing between H100 and H200:
Pick H200 if memory bandwidth or capacity is the binding constraint.
Pick H100 if your workload is mostly compute-bound.
In practice: long-context inference of large models tilts toward H200. Training and short-context inference tilt toward H100.
Cloud pricing across providers
Hardware capex is one input. Cloud $/hr is the other. For most teams, cloud is how they actually consume these GPUs, so the cross-provider variance determines real cost.
A snapshot of the spread for each GPU on Mercatus GPU Index:
| GPU | Hyperscaler $/hr | Specialty $/hr | Long-tail $/hr | Spread |
|---|---|---|---|---|
| A100 80GB | $1.50–$3.00 | $1.10–$1.50 | $0.80–$1.10 | 3.75× |
| H100 80GB | $3.50–$5.00 | $2.50–$3.50 | $1.99–$2.50 | 2.5× |
| H200 141GB | $4.50–$7.00 | $3.50–$4.50 | $2.50–$3.50 | 2.8× |
A 2.5–3.75× spread for identical hardware is not a misprint. Hyperscalers bundle ecosystem services, specialty providers focus on price-performance, and long-tail providers often operate with lower power costs and lower sales overhead.
For buyers, the implication is straightforward: do not accept hyperscaler on-demand pricing for these GPUs unless you have a specific reason to. The savings from comparing providers are routinely meaningful enough to change the economics of a workload.
For continuously updated cross-provider H100, A100, and H200 pricing, see Mercatus GPU Index.
Workload decision framework
Choosing between A100, H100, and H200 should not start from what is newest. It should start from the binding constraint of the workload.
Training a foundation model from scratch (≥ 70B parameters)
Choose: H100
You need FP8 throughput, NVLink bandwidth for tensor parallelism, and dense compute. Modern foundation model training runs on H100 and H100-equivalent clusters for a reason: the throughput per dollar lands better than alternatives. H200’s extra memory is less useful here because training is usually compute-bound and memory pressure is handled through parallelism strategies.
Training mid-size models (7B–30B), full precision
Choose: H100; A100 acceptable if existing fleet
H100 is the right answer if you are buying new. A100 still works if you already have it. The FP8 disadvantage is real, but not always catastrophic for this size class.
For full estimates of training cost across model sizes, see How Much Does It Cost to Train an LLM?.
Fine-tuning open-source models (LoRA / QLoRA)
Choose: A100 80GB
This is the A100’s strongest remaining use case. LoRA fine-tunes do not need FP8 and often do not saturate H100. The cost difference matters. Use H100 only if you are already paying for it or if speed is more important than cost.
Inference: large models, long context
Choose: H200
This is the H200’s purpose. The KV cache pressure at long context tilts memory bandwidth into the binding constraint. H200’s bandwidth advantage can translate into better per-token latency and higher serving efficiency.
Inference: small/medium models, throughput-focused
Choose: H100
This is more compute-bound territory. H200’s memory advantage may not come into play, and H100’s compute density per dollar usually wins.
Inference: very large models, single-node serving
Choose: H200 in 8-GPU HGX configuration
A very large model at FP8 can require hundreds of gigabytes of memory just for weights. H200’s 141 GB per GPU gives an 8-GPU node much more headroom for weights, KV cache, and activations than H100.
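A quick headroom check makes the single-node case concrete. The 405B parameter count below is an illustrative frontier-scale example, not a claim about any specific model or deployment.

```python
# Single-node memory headroom at FP8 weights (illustrative 405B-parameter model).
params_b = 405               # billions of parameters (assumed, frontier-scale example)
weights_gb = params_b * 1.0  # FP8: ~1 byte per parameter
nodes = {"8x H100": 8 * 80, "8x H200": 8 * 141}  # GB of HBM per node

for name, total_gb in nodes.items():
    headroom = total_gb - weights_gb  # left for KV cache, activations, buffers
    print(f"{name}: {total_gb} GB total, {headroom:.0f} GB left after FP8 weights")
```

On this rough math, an 8× H100 node has a couple hundred gigabytes left for KV cache and activations, while an 8× H200 node has roughly three times that headroom.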
Research and dev
Choose: A100
Research and dev workloads are variable, lower-scale, and sensitive to cost. A100s on the secondary market or through lower-cost providers are often the right answer. Save the H100 budget for workloads that actually need it.
Total cost of ownership at scale
Per-GPU capex is only one component of total cost. At cluster scale, supporting infrastructure, power, cooling, colocation, networking, and operations can add 50–80% on top of hardware capex over a 3-year ownership horizon.
For a 100-GPU H100 cluster, the rough breakdown looks like this:
| Cost component | Year 1 | 3-year total | % of TCO |
|---|---|---|---|
| Hardware (100 H100s + servers) | $3.0M | $3.0M (depreciated) | ~50% |
| Power | $114K–$300K | $340K–$900K | ~10–15% |
| Colocation | $200K–$400K | $600K–$1.2M | ~10–20% |
| Networking, storage, ops | $150K–$300K | $450K–$900K | ~8–15% |
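The power line is the easiest one to sanity-check. The sketch below uses assumed PUE, server overhead, and electricity rates, chosen only to show how a range like $114K–$300K for year-1 power can arise:

```python
# Back-of-envelope year-1 power cost for 100 H100s (700W TDP each).
# PUE, server overhead, and $/kWh are assumptions, not measured values.
def annual_power_cost(n_gpus: int, gpu_tdp_w: float, server_overhead: float,
                      pue: float, usd_per_kwh: float) -> float:
    kw = n_gpus * gpu_tdp_w / 1000 * server_overhead * pue
    return kw * 8760 * usd_per_kwh  # 8,760 hours per year

low = annual_power_cost(100, 700, server_overhead=1.3, pue=1.2, usd_per_kwh=0.12)
high = annual_power_cost(100, 700, server_overhead=1.5, pue=1.4, usd_per_kwh=0.23)
print(f"Year-1 power: ${low/1e3:.0f}K - ${high/1e3:.0f}K")  # roughly $115K - $296K
```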
For the institutional version of this analysis, see Total Cost to Own 100 H100 GPUs.
The point for this article: when you compare A100/H100/H200 hardware capex, you are looking at one slice of the picture. At scale, choosing the wrong GPU is not only a hardware capex mistake. It can become a multi-million-dollar TCO mistake across power, colocation, operations, and depreciation cycles.
Should you buy or rent?
If your H100 utilization will sit above roughly 65% for at least 18 months, owning can become cheaper than renting at typical 2026 prices. Below that threshold, cloud rentals often win on pure cost. Above roughly 85% utilization for 24 months, owning can become dramatically cheaper.
The decision hinges on workload predictability, capital availability, and operational appetite.
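A simplified break-even sketch shows where a threshold around 65% comes from. The capex, amortization period, opex share, and $2.50/hr rental rate below are all illustrative assumptions, not the full model from the linked analysis.

```python
# Simplified buy-vs-rent break-even. All inputs are illustrative assumptions.
capex = 28_000           # approx. H100 SXM5 capex per GPU
amort_months = 36        # straight-line amortization horizon
opex_share = 0.5         # power, colo, ops as a fraction of amortized capex
rent_per_hr = 2.50       # typical specialty-provider on-demand rate
hours_per_month = 730

own_monthly = capex / amort_months * (1 + opex_share)
rent_monthly_at_full_util = rent_per_hr * hours_per_month

break_even_util = own_monthly / rent_monthly_at_full_util
print(f"Owning beats renting above ~{break_even_util:.0%} utilization")  # ~64%
```

Push the rental rate toward hyperscaler on-demand levels and the break-even utilization drops sharply, which is why the buy-vs-rent answer depends as much on which provider you compare against as on your own workload.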
Full breakdown: Buy vs Rent GPUs: When Does Owning Become Cheaper?.
One scenario the buy-vs-rent framing usually leaves out: what if you own GPUs and have spare capacity?
Most owners do. Utilization rarely sits at 100%. You are paying for the GPUs whether they run or sit idle. Selling spare capacity as inference tokens can recover idle cost without disrupting the primary workload.
This is the supply-side of the open AI compute economy: anyone with H100s, H200s, or A100s can list inference capacity on Mercatus and earn from idle GPU-hours that would otherwise be sunk cost.
Become a Provider
How GPU choice translates to token prices
Everything above is about the supply side of AI compute: the GPUs that produce the tokens buyers ultimately pay for. The connection between this article and what buyers pay per million tokens is direct.
Token prices are shaped by underlying GPU economics, provider markup, and market structure inefficiencies.
When the same H100 SKU prices 2.5× differently across providers, it is not because the hardware is different. It is because there is no transparent market where prices clear. Buyers cannot easily route to the cheapest qualified provider, and sellers cannot easily reach buyers without major sales investment.
That is the gap an open AI compute market closes.
For the full thesis, see The Open AI Compute Economy: Why Tokens Are the Next Open Market.
For day-to-day work, two practical product surfaces matter:
- For GPU price discovery: Mercatus GPU Index tracks live H100/H200/A100 cloud pricing across 22+ providers, with cross-provider rankings and historical trends.
- For token-level access: Mercatus Spot Market gives buyers a unified API across models and providers, with pay-as-you-go access and forward contracts to lock in cost over time.
Frequently Asked Questions
Is H200 worth the premium over H100?
Only if memory bandwidth is your bottleneck. For training and short-context inference, H100 wins on cost per FLOP. For long-context inference of large models and serving workloads with high concurrent batch sizes, H200’s memory bandwidth advantage can justify the premium.
Can A100s still train modern foundation models?
Yes, but with caveats. A100s lack FP8 support, so they fall back to FP16, which increases memory pressure and reduces throughput compared to H100 for transformer training. For under-30B models, A100 fleets are still usable. For 70B+ frontier-style models, H100 is the practical default.
What is the difference between SXM and PCIe variants?
SXM is the form factor used in 8-GPU HGX server systems with NVLink interconnect. PCIe variants are slot-card form factors that usually have weaker multi-GPU interconnect options. For multi-GPU training, SXM is usually the right choice. PCIe can be fine for single-GPU inference workloads.
How fast do these GPUs depreciate?
H100 has held value well compared with older generations. A100s have depreciated more aggressively. H200 is still newer, so the depreciation curve is less clear. For deeper analysis, see GPU Depreciation: How Fast Do H100s Lose Value?.
Should I wait for B100/B200?
Blackwell supply is still constrained and pricing is high. For many teams, buying H100/H200 today and upgrading on a 2-year cycle may make more sense than waiting. If the workload is urgent and H100/H200 economics work, waiting can be more expensive than deploying.
Track Blackwell availability through GPU Index.
How does GPU choice affect my LLM API costs as a buyer?
You do not choose GPUs directly as an API buyer, but the hardware choice your provider makes is reflected in the per-token price. Providers running H100/H200 fleets have different cost structures than providers running A100 fleets, and that flows through to what they charge.
Mercatus Token Index helps surface the resulting price differences.
Where do I find the cheapest H100 / H200 / A100 cloud pricing right now?
Mercatus GPU Index tracks live pricing across 22+ providers. Long-tail and specialty providers often offer materially lower rates than hyperscaler on-demand for the same SKUs.
Can I list my own GPU capacity for sale?
Yes. If you operate H100, H200, or A100 capacity, owned or reserved, you can sell idle inference capacity on Mercatus. List your endpoint, set your prices, and reach buyers without building a sales channel from scratch.
Become a Provider
Methodology
Specifications in this article are sourced from NVIDIA’s official datasheets for the A100 80GB SXM4, H100 SXM5, and H200 SXM5. Cloud pricing ranges and cross-provider spreads are derived from Mercatus GPU Index, which tracks H100/H200/A100 on-demand and reserved pricing across 22+ providers. OEM hardware pricing reflects public quotes from major distributors as of May 2026.
Last verified: 2026-05-04.
Methodology and data dictionary: docs.mercatus-ai.com/methodology
