Inference Cost Index — Q1 2026

Quarterly benchmarking of inference costs across major cloud providers and GPU generations. Covers pricing, throughput, and cost-per-token for leading open and closed models.

What's Inside

  • Cost-per-token benchmarks across 8 cloud providers
  • GPU generation comparison: H100 vs B200 vs TPU v6
  • Open model vs closed API pricing analysis
  • Regional pricing variations and arbitrage opportunities

Who It's For

  • CTOs evaluating inference infrastructure investments
  • FinOps teams tracking AI compute spend
  • Founders choosing between self-hosted and API-based architectures

Get this report — free with your email

Subscribe to Control Plane and get instant access to this report plus weekly AI infrastructure intelligence.

Executive Summary

Inference costs continued their rapid decline in Q1 2026, driven by two reinforcing forces: the initial rollout of NVIDIA Blackwell B200 instances and aggressive price competition among cloud GPU providers. The average cost-per-million-tokens for a 70B-parameter model dropped 28% quarter-over-quarter, the steepest single-quarter decline we have tracked since launching this index.

The most significant development is the widening gap between self-hosted and API pricing. Organizations running their own inference infrastructure on B200 hardware now pay $0.08-0.12 per million input tokens for Llama 4 70B, while comparable API pricing from leading providers sits at $0.30-0.60. This 3-5x gap is the “infrastructure premium”: the cost of convenience, managed uptime, and avoided engineering complexity. For high-volume workloads, the math increasingly favors self-hosting.
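
To make the trade-off concrete, here is a minimal break-even sketch in Python. The per-token rates are midpoints of the ranges above; the fixed monthly operations overhead is a hypothetical placeholder for the engineering and on-call cost that the infrastructure premium buys back.

    # Back-of-the-envelope comparison of self-hosted vs. API inference spend.
    # Per-token rates are midpoints of the report's Q1 2026 ranges; the ops
    # overhead is a hypothetical placeholder -- substitute your own figure.

    SELF_HOSTED_PER_M = 0.10         # $/M input tokens (report: $0.08-0.12)
    API_PER_M = 0.45                 # $/M input tokens (report: $0.30-0.60)
    OPS_OVERHEAD_MONTHLY = 15_000.0  # assumed monthly cost of running it yourself ($)

    def monthly_cost(tokens_m: float, per_m: float, fixed: float = 0.0) -> float:
        """Total monthly spend for a volume given in millions of tokens."""
        return tokens_m * per_m + fixed

    def breakeven_tokens_m() -> float:
        """Volume (M tokens/month) above which self-hosting wins."""
        return OPS_OVERHEAD_MONTHLY / (API_PER_M - SELF_HOSTED_PER_M)

    for volume in (10_000, 50_000, 100_000):  # M tokens per month
        api = monthly_cost(volume, API_PER_M)
        diy = monthly_cost(volume, SELF_HOSTED_PER_M, OPS_OVERHEAD_MONTHLY)
        print(f"{volume:>8,} M tokens/mo: API ${api:>9,.0f}  self-hosted ${diy:>9,.0f}")
    print(f"Break-even: {breakeven_tokens_m():,.0f} M tokens/month")

At the assumed overhead, self-hosting wins above roughly 43 billion tokens per month; the crossover moves linearly with whatever overhead figure you plug in.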

Key Findings

Cloud Provider Pricing. AWS, GCP, and Azure all introduced B200-based instances in Q1, but availability remained limited to select regions. On-demand pricing landed 15-25% above H100 instances on a per-hour basis, but 35-45% below on a per-token basis due to the throughput gains. CoreWeave and Lambda offered the most competitive B200 reserved pricing, with 12-month commitments coming in 20-30% below hyperscaler on-demand rates.
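
The per-hour versus per-token divergence falls out of simple arithmetic: a dearer instance that pushes more tokens per second can still be cheaper per token. A sketch, with invented hourly rates and throughputs rather than index measurements:

    # Convert an instance rate ($/GPU-hour) into a serving cost ($/M tokens).
    # Hourly rates and throughputs here are illustrative, not index data.

    def cost_per_m_tokens(hourly_rate: float, tokens_per_second: float) -> float:
        """$/M tokens = hourly rate divided by tokens served per hour."""
        return hourly_rate / (tokens_per_second * 3600) * 1_000_000

    # A B200 priced ~20% above an H100 per hour, but with ~2x the throughput,
    # comes out ~40% cheaper per token, matching the pattern described above.
    h100 = cost_per_m_tokens(hourly_rate=10.0, tokens_per_second=4_000)
    b200 = cost_per_m_tokens(hourly_rate=12.0, tokens_per_second=8_000)
    print(f"H100: ${h100:.2f}/M tokens  B200: ${b200:.2f}/M tokens")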

GPU Generation Comparison. We benchmarked four GPU configurations on a standardized workload (Llama 4 70B, 8K context, mixed batch sizes):

  • B200 SXM (FP4): $0.08/M input tokens, $0.24/M output tokens
  • H100 SXM (FP8): $0.14/M input tokens, $0.42/M output tokens
  • H100 PCIe (FP8): $0.19/M input tokens, $0.57/M output tokens
  • TPU v6 (BF16): $0.11/M input tokens, $0.33/M output tokens

Google’s TPU v6 remains highly competitive, particularly for teams already invested in the JAX ecosystem. The B200 takes the overall cost crown, but only when FP4 quantization is viable for the target workload.
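
To turn these rates into per-request costs, blend the input and output pricing for your traffic shape. A short sketch using the table above; the 6K-in/500-out request profile is an illustrative assumption, not part of the benchmark:

    # Blended request cost from the benchmarked rates above.
    RATES = {  # config: (input $/M tokens, output $/M tokens)
        "B200 SXM (FP4)":  (0.08, 0.24),
        "H100 SXM (FP8)":  (0.14, 0.42),
        "H100 PCIe (FP8)": (0.19, 0.57),
        "TPU v6 (BF16)":   (0.11, 0.33),
    }

    def request_cost(config: str, input_tokens: int, output_tokens: int) -> float:
        """Dollar cost of a single request at the benchmarked rates."""
        in_rate, out_rate = RATES[config]
        return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

    # Assumed RAG-style shape: 6K tokens in, 500 out (illustrative only).
    for config in RATES:
        per_1k = request_cost(config, 6_000, 500) * 1_000
        print(f"{config:>15}: ${per_1k:.2f} per 1K requests")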

Open vs. Closed Model Economics. The cost advantage of open models continues to grow. Running Llama 4 70B on self-hosted B200s costs roughly 5x less than calling GPT-4o through the OpenAI API, and 3x less than Claude via the Anthropic API. This gap was 2-3x a year ago. Closed model providers are competing on quality, latency, and developer experience rather than price — a sustainable strategy as long as their models maintain a quality edge, but one that becomes harder as open model quality converges.

Regional Arbitrage. We observed meaningful pricing differences across regions. US-East and EU-West remain the most expensive zones. Southeast Asia (Singapore, Jakarta) and the Middle East (Doha, Riyadh) offer 15-25% discounts on equivalent GPU instances, driven by new datacenter capacity and government subsidies. For latency-tolerant batch workloads, routing inference to these regions represents easy savings.
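
What that routing can look like in practice, as a minimal sketch; the per-region prices and latency figures are hypothetical, chosen only to be consistent with the 15-25% discounts described above:

    # Toy region selector for latency-tolerant batch inference.
    # Prices and latencies are hypothetical placeholders.
    REGIONS = {  # region: ($/GPU-hour, added round-trip latency in ms)
        "us-east":   (12.00,  10),
        "eu-west":   (11.80,  90),
        "singapore": ( 9.60, 230),
        "riyadh":    ( 9.90, 160),
    }

    def cheapest_region(max_latency_ms: int) -> str:
        """Lowest-cost region whose latency fits the workload's budget."""
        eligible = {r: price for r, (price, lat) in REGIONS.items()
                    if lat <= max_latency_ms}
        return min(eligible, key=eligible.get)

    print(cheapest_region(max_latency_ms=50))   # interactive -> us-east
    print(cheapest_region(max_latency_ms=500))  # batch -> singapore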

Methodology

All benchmarks were conducted using standardized configurations on production-grade instances. We measured end-to-end cost, including instance cost, network egress, and storage for model weights. Throughput was measured at P50 and P99 latency targets representative of production serving. Pricing data was collected between January 1 and January 31, 2026, from public pricing pages and direct quotes from provider sales teams. Full methodology documentation is available in the appendix.
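
As a sketch of how the end-to-end figure decomposes (all inputs below are illustrative placeholders, not report data):

    # Decomposition of the end-to-end $/M-token figure.
    def end_to_end_per_m_tokens(
        instance_per_hour: float,   # on-demand or reserved instance rate
        tokens_per_second: float,   # sustained throughput at the latency target
        egress_per_m: float,        # network egress attributable to 1M tokens
        storage_monthly: float,     # model-weight storage, per month
        utilization: float = 0.85,  # fraction of paid hours actually serving
    ) -> float:
        tokens_per_hour = tokens_per_second * 3600 * utilization
        instance_per_m = instance_per_hour / tokens_per_hour * 1_000_000
        # Amortize storage across the month's token volume (~730 hours).
        monthly_tokens_m = tokens_per_hour * 730 / 1_000_000
        return instance_per_m + egress_per_m + storage_monthly / monthly_tokens_m

    print(f"${end_to_end_per_m_tokens(12.0, 8_000, 0.002, 40.0):.3f}/M tokens")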

Get the next report first

Subscribe to Control Plane for weekly analysis and early access to premium reports.