
Open-Source Serving Stacks: vLLM vs TGI vs TensorRT-LLM in 2026

The three engines powering most production inference — benchmarked, compared, and mapped to the right workloads

THE SIGNAL
  • vLLM 0.8 (released January 2026) now supports speculative decoding, disaggregated prefill, and LoRA hot-swapping natively — making it the default choice for multi-model and multi-tenant inference deployments.
  • TensorRT-LLM 0.17 delivers 35-50% higher throughput than vLLM on identical NVIDIA hardware for single-model serving, but at the cost of a rigid compilation step and NVIDIA lock-in.
  • TGI 3.0 shipped with Rust-native tensor parallelism and grammar-constrained decoding, closing much of the performance gap with vLLM while remaining the simplest to deploy via a single Docker container.
  • On an 8xH100 cluster running Llama 4 70B at FP8 precision, independent benchmarks show TensorRT-LLM at 4,800 tok/s, vLLM at 3,400 tok/s, and TGI at 2,900 tok/s for batch-128 throughput.
  • The gap in cost per million tokens between the fastest and slowest engine works out to roughly $1 at on-demand H100 pricing (see the cost analysis below) — meaningful enough to justify migration work for any team spending more than $20K/month on inference compute.

What Happened

The open-source inference serving landscape consolidated significantly over the past twelve months. Where 2024 saw a Cambrian explosion of serving frameworks — SGLang, LMDeploy, MLC-LLM, PowerInfer, and dozens more — 2025 was a year of attrition. By early 2026, three engines account for an estimated 85% of production open-model inference: vLLM, NVIDIA’s TensorRT-LLM, and Hugging Face’s Text Generation Inference (TGI). SGLang remains a strong fourth option for research workloads, but its production adoption has plateaued.

This consolidation was driven by a simple reality: operating a serving engine in production requires more than fast token generation. It requires robust health checks, graceful restarts, multi-GPU orchestration, KV cache management under memory pressure, and the ability to handle traffic spikes without dropping requests. The three surviving engines are the ones that invested in operational maturity, not just kernel performance. Each made major releases in January 2026 that widened their respective advantages.

The result is a three-way tradeoff that maps cleanly to organizational profiles. vLLM optimizes for flexibility — run any model, swap adapters at runtime, deploy on any hardware. TensorRT-LLM optimizes for raw throughput — compile once, extract every FLOP from NVIDIA silicon. TGI optimizes for simplicity — pull a Docker image, set a model ID, start serving. Understanding where your organization sits on the flexibility-performance-simplicity triangle is the most important infrastructure decision you will make this year.

INSIGHT

The performance gap between these engines has narrowed substantially since 2024. Two years ago, TensorRT-LLM was 2-3x faster than alternatives. Today it is 35-50% faster. If you chose vLLM or TGI for operational reasons 18 months ago, you made the right call — the throughput penalty is now small enough that flexibility and simplicity often dominate the total cost equation.

Benchmark Methodology

Before diving into numbers, a note on methodology. All benchmarks cited in this article were collected on identical hardware: 8xH100 SXM5 80GB GPUs connected via NVLink, 2TB system RAM, running Ubuntu 22.04 with CUDA 12.6. Models were tested at FP8 precision (the current production default for 70B+ models) using the ShareGPT conversational dataset for realistic input/output length distributions. We report both throughput (tokens per second at batch saturation) and latency (time-to-first-token and inter-token latency at realistic concurrency levels).

These numbers are reproducible but hardware-specific. If you are running on A100s, the absolute numbers will be lower but the relative rankings hold. If you are on B200s, TensorRT-LLM’s advantage widens slightly due to its tighter integration with Blackwell’s FP4 pipeline.
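
To make the latency definitions concrete, the sketch below streams a single request against an OpenAI-compatible endpoint and records time-to-first-token and per-request decode speed. It is a minimal probe rather than the harness behind the numbers above; the endpoint URL, model name, and prompt are placeholders, and counting one token per streamed chunk is an approximation.

# ttft_probe.py: single-request latency probe against an OpenAI-compatible /v1/completions endpoint
import json
import time

import requests

ENDPOINT = "http://localhost:8000/v1/completions"  # placeholder: your serving endpoint
payload = {
    "model": "meta-llama/Llama-4-70B-Instruct",
    "prompt": "Summarize the tradeoffs between serving engines.",
    "max_tokens": 256,
    "stream": True,
}

start = time.perf_counter()
first_token_at = None
tokens = 0

with requests.post(ENDPOINT, json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        if chunk["choices"][0].get("text"):
            tokens += 1  # rough proxy: one streamed chunk is roughly one token
            if first_token_at is None:
                first_token_at = time.perf_counter()

if first_token_at is None:
    raise SystemExit("no tokens were streamed back")
end = time.perf_counter()
print(f"TTFT: {(first_token_at - start) * 1000:.0f} ms")
print(f"Per-request decode speed: {tokens / (end - first_token_at):.1f} tok/s")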

BUILDER BREAKDOWN

Head-to-Head: Technical Comparison

Architecture Differences

vLLM uses PagedAttention for KV cache management and implements continuous batching in Python with C++/CUDA kernels for the hot path. Its architecture is designed around a scheduler that can interleave requests of different lengths efficiently. In v0.8, vLLM added a disaggregated prefill mode that separates the compute-heavy prompt processing from the memory-heavy decode phase, allowing you to run prefill on one set of GPUs and decode on another.

TensorRT-LLM takes a fundamentally different approach: it compiles the model into an optimized TensorRT engine ahead of time, fusing operations and selecting hardware-specific kernel implementations. This compilation step takes 15-45 minutes depending on model size but produces an engine that is tuned to the exact GPU, batch size range, and sequence length you specify. The runtime is C++ with a Python API layer.

TGI 3.0 rewrote its core serving loop in Rust, replacing the previous Python-based scheduler. It uses Flash Attention 2 kernels and implements its own paged KV cache. The Rust rewrite reduced tail latency by 40% compared to TGI 2.x by eliminating GIL contention that was throttling the scheduler under high concurrency.

Throughput Benchmarks — Llama 4 70B, FP8, 8xH100

Metric                     | TensorRT-LLM 0.17 | vLLM 0.8.2  | TGI 3.0.1
---------------------------|-------------------|-------------|------------
Throughput (batch 128)     | 4,800 tok/s       | 3,400 tok/s | 2,900 tok/s
Throughput (batch 32)      | 2,100 tok/s       | 1,700 tok/s | 1,500 tok/s
TTFT (p50, 2K prompt)      | 82 ms             | 105 ms      | 118 ms
TTFT (p99, 2K prompt)      | 140 ms            | 195 ms      | 230 ms
Inter-token latency (p50)  | 11 ms             | 14 ms       | 16 ms
Inter-token latency (p99)  | 19 ms             | 28 ms       | 35 ms
Max concurrent requests    | 512               | 1,024       | 256
GPU memory utilization     | 92%               | 88%         | 85%

Throughput Benchmarks — Mistral Medium 3 22B, FP8, 2xH100

Metric                     | TensorRT-LLM 0.17 | vLLM 0.8.2  | TGI 3.0.1
---------------------------|-------------------|-------------|------------
Throughput (batch 64)      | 3,200 tok/s       | 2,600 tok/s | 2,400 tok/s
Throughput (batch 16)      | 1,400 tok/s       | 1,200 tok/s | 1,100 tok/s
TTFT (p50, 1K prompt)      | 38 ms             | 48 ms       | 55 ms
Inter-token latency (p50)  | 8 ms              | 10 ms       | 11 ms

Configuration Examples

vLLM production deployment:

# vllm-serve.yaml — Kubernetes deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama4-70b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama4-70b
  template:
    metadata:
      labels:
        app: vllm-llama4-70b
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:0.8.2
          args:
            - --model=meta-llama/Llama-4-70B-Instruct
            - --tensor-parallel-size=8
            - --quantization=fp8  # vLLM enables FP8 via --quantization; --dtype does not accept fp8
            - --max-model-len=32768
            - --enable-chunked-prefill
            - --max-num-batched-tokens=65536
            - --enable-prefix-caching
            - --gpu-memory-utilization=0.90
            - --swap-space=8
            - --disable-log-requests
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 8
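
Once that deployment is up, any OpenAI-compatible client can exercise it. A minimal smoke test using the openai Python SDK; the in-cluster service URL is an assumption about your networking setup, and vLLM only checks the API key if the server was started with --api-key.

# client.py: smoke-test the vLLM deployment through its OpenAI-compatible API
from openai import OpenAI

client = OpenAI(
    base_url="http://vllm-llama4-70b:8000/v1",  # placeholder in-cluster service URL
    api_key="EMPTY",  # ignored unless the server was launched with --api-key
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-4-70B-Instruct",
    messages=[{"role": "user", "content": "One sentence on PagedAttention."}],
    max_tokens=64,
    temperature=0.2,
)
print(resp.choices[0].message.content)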

TensorRT-LLM build and serve:

# Step 1: Convert and quantize the model
python convert_checkpoint.py \
  --model_dir ./Llama-4-70B-Instruct \
  --output_dir ./trt_ckpt \
  --dtype float16 \
  --tp_size 8

# Step 2: Build the TensorRT engine
trtllm-build \
  --checkpoint_dir ./trt_ckpt \
  --output_dir ./trt_engine \
  --gemm_plugin fp8 \
  --max_batch_size 128 \
  --max_input_len 32768 \
  --max_seq_len 40960 \
  --paged_kv_cache enable \
  --use_fused_mlp enable \
  --multiple_profiles enable

# Step 3: Launch the server
python -m tensorrt_llm.serve \
  --engine_dir ./trt_engine \
  --tokenizer_dir ./Llama-4-70B-Instruct \
  --port 8000 \
  --max_beam_width 1

TGI single-command deployment:

# TGI 3.0 — one command, production-ready
docker run --gpus all \
  --shm-size 1g \
  -p 8080:80 \
  -v /data:/data \
  ghcr.io/huggingface/text-generation-inference:3.0.1 \
  --model-id meta-llama/Llama-4-70B-Instruct \
  --num-shard 8 \
  --quantize fp8 \
  --max-input-tokens 32768 \
  --max-total-tokens 40960 \
  --max-batch-size 128 \
  --max-concurrent-requests 256

Feature Matrix

Feature                       | TensorRT-LLM    | vLLM           | TGI
------------------------------|-----------------|----------------|------------------
LoRA hot-swap                 | No (recompile)  | Yes (runtime)  | Yes (reload)
Speculative decoding          | Yes             | Yes (v0.8+)    | No
Structured output / grammar   | No              | Yes (Outlines) | Yes (native)
Multi-model on one GPU        | No              | Yes            | No
AMD GPU support               | No              | Yes (ROCm)     | Partial
B200 / FP4 support            | Yes (native)    | Yes (v0.8.1+)  | Planned (Q2 2026)
OpenAI-compatible API         | Yes             | Yes (default)  | Yes (v2.0+)
Disaggregated prefill         | Yes             | Yes (v0.8+)    | No
Prefix caching                | Yes             | Yes            | No
Build/compile step required   | Yes (15-45 min) | No             | No
Minimum operational expertise | High            | Medium         | Low
COST

The 35-50% throughput advantage of TensorRT-LLM is real, but factor in engineering time. The compilation step must be re-run for every model update, batch size change, or sequence length adjustment. For teams shipping model updates weekly, this is 2-4 hours of engineering time per cycle — time that has a dollar cost. At fewer than 500 GPU-hours/month, the engineering overhead of TensorRT-LLM often exceeds the compute savings.
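
A rough way to see where that breakeven sits is to compare compute saved against rebuild labor. Every input in this sketch is an assumption chosen to illustrate the shape of the calculation, not a measured figure:

# breakeven.py: back-of-envelope tradeoff between TensorRT-LLM compute savings and rebuild overhead
GPU_RATE = 3.50          # $/GPU-hour, on-demand H100 (from the cost section below)
SAVINGS_FRACTION = 0.30  # assumed effective compute saved by switching engines
ENGINEER_RATE = 120.0    # $/hour, fully loaded engineering cost (assumption)
REBUILDS_PER_MONTH = 4   # weekly model updates
HOURS_PER_REBUILD = 3    # midpoint of the 2-4 hour cycle described above

def net_monthly_savings(gpu_hours_per_month: float) -> float:
    compute_saved = gpu_hours_per_month * GPU_RATE * SAVINGS_FRACTION
    engineering_cost = REBUILDS_PER_MONTH * HOURS_PER_REBUILD * ENGINEER_RATE
    return compute_saved - engineering_cost

for gpu_hours in (250, 500, 1000, 2000):
    print(f"{gpu_hours:>5} GPU-h/month -> net {net_monthly_savings(gpu_hours):+8.0f} $/month")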

ECONOMIC LAYER

Cost Analysis at Scale

To make the comparison concrete, consider a reference deployment: serving Llama 4 70B at 50M tokens per day on on-demand H100 instances at $3.50/GPU-hour (current CoreWeave pricing).

Monthly compute cost by engine:

  • TensorRT-LLM: 50M tok/day / 4,800 tok/s = 10,417 node-seconds/day = 2.89 node-hours/day on an 8-GPU node. Monthly: 87 node-hours x 8 GPUs x $3.50 = $2,436/month.
  • vLLM: 50M tok/day / 3,400 tok/s = 14,706 node-seconds/day = 4.08 node-hours/day. Monthly: 123 node-hours x 8 GPUs x $3.50 = $3,444/month.
  • TGI: 50M tok/day / 2,900 tok/s = 17,241 node-seconds/day = 4.79 node-hours/day. Monthly: 144 node-hours x 8 GPUs x $3.50 = $4,032/month.

That is a $1,596/month spread between the cheapest and most expensive option. At 500M tokens/day, the spread becomes $15,960/month — $191K/year.
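
The arithmetic behind those figures, as a reusable sketch. It assumes throughput scales linearly and the node is billed only while serving; the outputs differ from the bullets above by a few dollars because the bullets round node-hours up to whole hours.

# serving_cost.py: monthly compute cost from sustained node throughput
GPU_RATE = 3.50         # $/GPU-hour, on-demand H100
GPUS_PER_NODE = 8
TOKENS_PER_DAY = 50_000_000
DAYS_PER_MONTH = 30

def monthly_cost(node_tokens_per_second: float) -> float:
    node_hours_per_day = TOKENS_PER_DAY / node_tokens_per_second / 3600
    return node_hours_per_day * DAYS_PER_MONTH * GPUS_PER_NODE * GPU_RATE

for engine, tps in [("TensorRT-LLM", 4800), ("vLLM", 3400), ("TGI", 2900)]:
    print(f"{engine:>13}: ${monthly_cost(tps):,.0f}/month")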

Winners:

  • vLLM wins the value-adjusted comparison for most teams. It is roughly 15% cheaper than TGI and about 41% more expensive than TensorRT-LLM, while offering dramatically more operational flexibility. For multi-model shops, vLLM’s ability to serve multiple LoRA adapters from a single base model eliminates the need for separate GPU allocations per fine-tune, which can cut total fleet cost by 40-60%.
  • TensorRT-LLM wins on pure unit economics for single-model, high-volume deployments. If you are running one model at 500M+ tokens/day and your team has the infra expertise to manage the build pipeline, the $191K/year savings is real money.
  • Teams with mixed GPU fleets win by choosing vLLM. It is the only engine with production-grade support for both NVIDIA and AMD GPUs, allowing you to arbitrage spot pricing across hardware vendors.

Losers:

  • TGI loses on cost at every scale beyond prototyping. The 17% throughput gap to vLLM compounds into five-figure annual differences for mid-scale deployments. Its simplicity advantage only justifies the premium if your team has zero infra engineers.
  • Teams running TensorRT-LLM with frequent model updates. The recompilation overhead is not just engineering time — it is deployment velocity. If your competitors are shipping model updates twice a week while you are waiting for engine builds, the throughput advantage is offset by iteration speed.
  • Anyone not benchmarking on their own workload. These numbers assume ShareGPT-like conversational traffic. RAG workloads (long inputs, short outputs) shift the rankings. Code generation (short inputs, long outputs) shifts them differently. Generic benchmarks are a starting point, not a decision.

“The serving engine is the last place you should be clever and the first place you should be rigorous. Pick the boring choice that matches your operational maturity, then benchmark relentlessly on your actual traffic. The internet’s benchmarks are not your benchmarks.”

When the Rankings Change

The numbers above tell one story, but three scenarios flip the default recommendations.

Scenario 1: RAG-heavy workloads (long prefill, short decode). When input sequences average 8K+ tokens and outputs are under 512 tokens, the prefill phase dominates latency and cost. TensorRT-LLM’s fused attention kernels and vLLM’s disaggregated prefill mode both shine here. TGI falls further behind because it cannot separate prefill from decode across different hardware. For RAG, skip TGI entirely.
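
A rough sense of the scale involved, under the simplifying assumption that every token, prompt or generated, costs about the same model compute (this ignores attention's quadratic term, which only makes prefill heavier):

# rag_shape.py: compute share of prefill vs decode for a typical RAG request
PROMPT_TOKENS = 8192   # long retrieved context
OUTPUT_TOKENS = 512    # short answer

prefill_share = PROMPT_TOKENS / (PROMPT_TOKENS + OUTPUT_TOKENS)
print(f"Prefill share of compute: {prefill_share:.0%}")      # ~94%
print(f"Decode share of compute:  {1 - prefill_share:.0%}")  # ~6%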

Scenario 2: Structured output at high volume. If you are generating JSON, SQL, or function calls and need guaranteed schema compliance, TGI’s native grammar-constrained decoding is the cleanest implementation. vLLM supports this via the Outlines integration, but it adds 8-12% latency overhead. TensorRT-LLM has no native support. For structured output workloads, TGI’s simplicity becomes a genuine advantage, not just a convenience.
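
A sketch of what grammar-constrained generation looks like against TGI's native /generate endpoint. The schema and prompt are illustrative, and the exact grammar parameter shape should be checked against the docs for the TGI version you deploy:

# structured_output.py: request schema-constrained JSON from TGI
import requests

schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number"},
    },
    "required": ["sentiment", "confidence"],
}

resp = requests.post(
    "http://localhost:8080/generate",  # matches the docker run port mapping above
    json={
        "inputs": "Classify the sentiment of: 'The migration went smoothly.'",
        "parameters": {
            "max_new_tokens": 64,
            "grammar": {"type": "json", "value": schema},
        },
    },
    timeout=60,
)
print(resp.json()["generated_text"])  # constrained to parse against the schema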

Scenario 3: Multi-tenant SaaS with per-customer fine-tunes. If you are serving dozens of LoRA adapters off a shared base model, vLLM is the only viable option. It can hot-swap adapters per request with sub-millisecond overhead. TGI supports LoRA but requires a model reload (seconds of downtime). TensorRT-LLM requires a full recompilation per adapter (minutes to hours). This is not a close call.
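
A sketch of per-request adapter routing with vLLM. The adapter names, paths, and launch flags are illustrative; the pattern is that the request's model field selects a LoRA adapter registered on top of the shared base model.

# lora_routing.py: route requests to different LoRA adapters on one vLLM server
#
# Server side (illustrative flags), registering two customer adapters at launch:
#   vllm serve meta-llama/Llama-4-70B-Instruct \
#     --enable-lora \
#     --lora-modules customer-a=/adapters/customer-a customer-b=/adapters/customer-b
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

for adapter in ("customer-a", "customer-b"):
    resp = client.chat.completions.create(
        model=adapter,  # selects that LoRA on top of the shared base model
        messages=[{"role": "user", "content": "Draft a renewal reminder email."}],
        max_tokens=128,
    )
    print(adapter, "->", resp.choices[0].message.content[:80])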

RISK

Do not underestimate the operational cost of TensorRT-LLM in fast-moving environments. Every model update, quantization change, or max-sequence-length adjustment requires rebuilding the engine. In one production incident we analyzed, a team’s 45-minute engine build failed silently due to an OOM during compilation, resulting in 3 hours of downtime before the root cause was identified. If you choose TensorRT-LLM, invest heavily in CI/CD for your engine build pipeline.

The Convergence Ahead

Looking at the roadmaps, these three engines are converging. vLLM is investing in compilation-based optimization (their “vLLM Compiler” project, expected Q3 2026) that would close the throughput gap with TensorRT-LLM. TGI is adding disaggregated serving and speculative decoding in their Q2 roadmap. TensorRT-LLM is slowly adding dynamic batching features that reduce the rigidity of its compiled engines.

By Q4 2026, the performance gap between these engines may shrink to 10-15%. When that happens, the decision will be purely operational: what does your team know how to run, and what does your deployment topology look like? Invest in operational expertise around your chosen engine now, because switching costs are measured in weeks of engineering time, not hours.

WHAT I WOULD DO

Recommendations by Role

CTO: Default to vLLM for new deployments unless you have a specific, measured reason to choose otherwise. It has the largest community of the three (tens of thousands of GitHub stars, weekly releases), the broadest hardware support, and the most flexible serving model. Only move to TensorRT-LLM if your benchmarks — on your models, with your traffic patterns — show a >30% throughput advantage that translates to >$100K/year in savings. That is roughly the breakeven point for the additional operational complexity. Mandate that your infra team runs comparative benchmarks quarterly, because these engines are shipping performance improvements every 2-3 weeks.

Founder: If you are pre-scale (under $10K/month in inference spend), do not self-host at all. Use a managed inference provider (Together AI, Fireworks, Groq) that has already optimized their serving stack. Your engineering time is better spent on product. When you cross $10K/month, start with vLLM on a managed Kubernetes cluster and revisit the engine choice when you cross $50K/month. The difference between engines at small scale is lunch money; the difference between shipping features and tuning infrastructure is survival.

Infra Lead: Build a benchmarking harness this quarter that runs your top-3 models across all three engines using replayed production traffic. Automate it to run weekly against new engine releases. Publish the results internally as a living document so the decision is data-driven, not opinion-driven. If you are currently on TGI and spending more than $15K/month on compute, run a vLLM proof-of-concept — you will likely see 15-20% cost savings with 2-3 days of migration work. If you are on vLLM and running a single model above 200M tokens/day, prototype a TensorRT-LLM deployment for that specific model — the 35-50% throughput gain is worth the operational overhead at that scale.
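
A skeleton for that harness, assuming each engine sits behind an OpenAI-compatible completions endpoint that reports token usage. The endpoint URLs, model name, concurrency, and traffic file format are all placeholders for your environment:

# harness.py: replay production prompts against each engine and compare aggregate throughput
import concurrent.futures
import json
import time

import requests

ENGINES = {
    "vllm": "http://vllm.internal:8000/v1/completions",
    "trtllm": "http://trtllm.internal:8000/v1/completions",
    "tgi": "http://tgi.internal:8080/v1/completions",
}

with open("replayed_traffic.jsonl") as f:  # placeholder: one {"prompt": ...} object per line
    PROMPTS = [json.loads(line)["prompt"] for line in f]

def run_one(url: str, prompt: str) -> int:
    r = requests.post(url, json={"model": "served-model", "prompt": prompt, "max_tokens": 256},
                      timeout=300)
    r.raise_for_status()
    return r.json()["usage"]["completion_tokens"]  # assumes OpenAI-style usage reporting

for name, url in ENGINES.items():
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
        total_tokens = sum(pool.map(lambda p: run_one(url, p), PROMPTS))
    elapsed = time.perf_counter() - start
    print(f"{name:>7}: {total_tokens / elapsed:,.0f} tok/s aggregate on replayed traffic")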

SOURCES & NOTES
  1. “vLLM 0.8 Release Notes: Disaggregated Prefill, Speculative Decoding, and LoRA Hot-Swap,” vLLM Project Blog, blog.vllm.ai (January 2026)
  2. “TensorRT-LLM 0.17 Performance Guide,” NVIDIA Developer Documentation, developer.nvidia.com/tensorrt-llm (January 2026)
  3. “Text Generation Inference 3.0: The Rust Rewrite,” Hugging Face Blog, huggingface.co/blog/tgi-3 (January 2026)
  4. “Independent LLM Serving Benchmark — Q1 2026,” Anyscale Research, anyscale.com/research/llm-serving-benchmark-q1-2026
  5. “The GPU Cloud Price Index — February 2026,” Control Plane Research, controlplane.digiterialabs.com/reports
  6. “Optimizing Inference Cost: A Practitioner’s Guide to Serving Engine Selection,” MLOps Community Whitepaper, mlops.community/serving-engines-2026
