What Open-Weight LLMs Actually Fit on a Single H200

We spent a week trying to stand up open-weight 2026 coding models as backends for Cline and similar agent-style coding tools. Hardware: a single H200 with 141 GB of VRAM. Every model we evaluated had a clean published spec, a benchmark number, and a recommended quantization. Almost every model also had a deployment surprise that broke the spec sheet math.

This post covers the gap between what the model card says and what the GPU actually allocates.

The Naive Sizing Math

Open-weight model cards report parameter count and recommend a quantization. The mental shortcut is straightforward: parameters times bytes-per-element equals VRAM footprint. Devstral 2 at 123B parameters in FP8 (half a byte per parameter) “should” need about 62 GB. Qwen3.6-27B in FP8 “should” need about 14 GB. A 1.6T MoE model in 4-bit “should” need around 800 GB.
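As a rough illustration, the shortcut can be written as a one-line function. This is a minimal sketch of the spec-sheet arithmetic only; the function name and the decimal-GB convention are our own choices, and it deliberately ignores KV cache, activations, and runtime overhead.

```python
def naive_weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Spec-sheet estimate: parameter count times bytes per element, in decimal GB."""
    bytes_per_param = bits_per_param / 8
    # Billions of parameters times bytes per parameter gives gigabytes directly.
    return params_billion * bytes_per_param


# The 1.6T-parameter MoE example from the text, at 4 bits per parameter.
estimate = naive_weight_gb(1600, 4)
print(f"{estimate:.0f} GB")  # 800 GB
```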

That math is wrong on dense models served with vLLM’s online quantization. It is correct on models with a properly prepared static FP8 checkpoint. The difference between those two paths is the first place reality diverges from the spec sheet.

Online FP8 Quantization Doubles the Footprint

Devstral 2 ships only as BF16 (~250 GB on disk). To run it on a 141 GB GPU you have to quantize at load time with vLLM’s --quantization fp8 flag. We did exactly that. vLLM reported back:

INFO gpu_model_runner.py:4959 Model loading took 119.5 GiB memory and 21.5 seconds

119.5 GiB, not 62 GB. The online quantizer keeps layer norms, embeddings, and certain attention layers at BF16 for numerical stability. The real footprint is roughly twice the naive estimate. That left 5 GB of KV cache headroom on a 141 GB GPU, which capped the practical context window at 31K tokens.

The fix is to use a pre-quantized FP8 checkpoint when one exists. Qwen ships Qwen/Qwen3-Coder-Next-FP8 as a true static FP8 release: 82 GB on disk, roughly 82 GB on the GPU, no surprise. If a model has no static FP8 release on Hugging Face, treat the BF16 download size as the GPU floor for online FP8 serving and budget hardware accordingly.
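The check for a static FP8 release can be scripted. The sketch below assumes the release follows the -FP8 suffix convention seen in Qwen/Qwen3-Coder-Next-FP8; the base repo name used in the usage note is our guess, and the helper names are hypothetical. The live lookup uses `HfApi.repo_exists` from the `huggingface_hub` package and needs network access, so the lookup is passed in as a callable that can be stubbed.

```python
from typing import Callable, Optional


def fp8_repo_candidate(repo_id: str) -> str:
    """Guess the static FP8 repo name by appending the -FP8 suffix (a naming assumption)."""
    return repo_id if repo_id.endswith("-FP8") else f"{repo_id}-FP8"


def find_static_fp8(repo_id: str, exists: Callable[[str], bool]) -> Optional[str]:
    """Return the FP8 repo id if the lookup confirms it exists, else None."""
    candidate = fp8_repo_candidate(repo_id)
    return candidate if exists(candidate) else None


def hub_exists(repo_id: str) -> bool:
    """Live lookup against Hugging Face; requires network and huggingface_hub."""
    from huggingface_hub import HfApi

    return HfApi().repo_exists(repo_id)
```

Calling `find_static_fp8("Qwen/Qwen3-Coder-Next", hub_exists)` would perform the real lookup. A `None` result only means no repo exists under the guessed name, so a manual search is still worth doing before sizing hardware.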

Coding Agents Eat 100K+ Context Routinely

Once a model is loaded, the next surprise is how much KV cache an interactive coding agent consumes. Cline (and similar tools like Aider and Continue.dev) inject a 25-30K token system prompt before any user instruction. Add a few file reads, a couple of tool results, and a single user turn, and the request prompt easily exceeds 65K tokens.

We initially set --max-model-len 65536 on the vLLM endpoint because that seemed generous. It was not. vLLM returned 400 Bad Request with a VLLMValidationError on roughly one in three multi-turn requests. Cline does not surface that error in its UI. It simply shows a spinner that never resolves. From the user’s perspective, the model “freezes.”
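One way to turn the silent spinner into a visible error is a pre-flight length check in whatever sits between the agent and the endpoint. The sketch below is hypothetical: the function name is ours, and it assumes the server counts prompt tokens plus requested output tokens against --max-model-len, which matches the overflow we saw but is worth confirming for your vLLM version.

```python
def check_request_fits(prompt_tokens: int, max_new_tokens: int, max_model_len: int) -> None:
    """Raise a readable error before sending a request the server would reject."""
    total = prompt_tokens + max_new_tokens
    if total > max_model_len:
        raise ValueError(
            f"request needs {total} tokens "
            f"({prompt_tokens} prompt + {max_new_tokens} output) "
            f"but the endpoint allows {max_model_len}"
        )


# A 60K-token prompt with 4K of output fits under a 65536 limit.
check_request_fits(60_000, 4_096, 65_536)
```

A 70K-token prompt with the same settings raises `ValueError`, which a proxy or wrapper can log or show to the user instead of leaving the client waiting.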

The fix is to raise --max-model-len to the model’s native maximum (or at least 131072) and budget KV cache memory accordingly. Use --kv-cache-dtype fp8 to halve KV memory consumption. On a single H200, a 27B model can support its full native 262K context with room to spare. A 123B model can support 32K-64K once you account for weights.
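To budget KV cache before launching, the standard per-token estimate for attention layers is two tensors (key and value) per layer, each of size KV heads times head dimension. The sketch below uses a made-up model shape (64 layers, 8 KV heads, head dimension 128); these numbers are not from any model card in this post, so substitute the values from the model's config. It covers standard attention layers only and ignores any state held by non-attention layers.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Bytes of KV cache that one token occupies across all attention layers."""
    # Factor of 2: each layer stores one key and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_cache_gb(context_tokens: int, num_layers: int, num_kv_heads: int,
                head_dim: int, bytes_per_elem: int) -> float:
    """KV cache size in decimal GB for a single sequence of the given length."""
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem)
    return context_tokens * per_token / 1e9


# Hypothetical model shape, NOT taken from any model card in this post.
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128
bf16 = kv_cache_gb(131072, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=2)
fp8 = kv_cache_gb(131072, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=1)
print(f"BF16 KV cache: {bf16:.1f} GB, FP8 KV cache: {fp8:.1f} GB")
```

For this assumed shape, one 131072-token sequence needs about 34.4 GB of cache at 16 bits and about 17.2 GB at 8 bits, which is the halving that --kv-cache-dtype fp8 buys.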

Hybrid Mamba Models Stall Mid-Response

Qwen3.6-27B is a hybrid architecture: most layers are standard attention, but several are selective state-space (Mamba) layers. The benchmarks are strong, the VRAM footprint is tiny, and the model is Apache 2.0. On paper it is the obvious pick for a single-GPU coding endpoint.

In practice, vLLM 0.21’s Triton kernel cache for Mamba operators is shape-specific. Each new combination of sequence length, batch size, and prefix cache state triggers a 5-30 second JIT compile during inference. The vLLM logs are explicit about it:

WARNING jit_monitor.py:103 Triton kernel JIT compilation during inference:
  _causal_conv1d_fwd_kernel. This causes a latency spike; consider extending
  warmup to cover this shape/config.

For a batch-inference workload this is fine, since each shape JITs once and amortizes across the batch. For an interactive coding agent where every conversation has a different prompt length, the JIT pauses stack with the model’s reasoning trace overhead to produce 10-80 second freezes between user turns. The cache populates over time and the experience improves, but the first day of real use is rough.
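The log message suggests extending warmup, so one partial mitigation is to send dummy prompts across a spread of lengths before real users arrive. The sketch below only generates a geometric schedule of prompt lengths and feeds it to a caller-supplied `send` function; both names are hypothetical. Whether a given grid actually covers the shapes that trigger compilation is an assumption we have not verified, since batch size and prefix cache state also matter.

```python
from typing import Callable, Iterable, List


def warmup_lengths(min_tokens: int, max_tokens: int, factor: float = 1.5) -> List[int]:
    """Geometric schedule of prompt lengths from min_tokens up to max_tokens."""
    if min_tokens < 1 or max_tokens < min_tokens or factor <= 1:
        raise ValueError("need 1 <= min_tokens <= max_tokens and factor > 1")
    lengths: List[int] = []
    n = float(min_tokens)
    while n < max_tokens:
        value = int(n)
        # Skip repeats that integer truncation can produce at small sizes.
        if not lengths or value != lengths[-1]:
            lengths.append(value)
        n *= factor
    lengths.append(max_tokens)  # always exercise the longest prompt
    return lengths


def run_warmup(send: Callable[[int], None], lengths: Iterable[int]) -> None:
    """Call send(prompt_tokens) once per length; send issues the actual request."""
    for n in lengths:
        send(n)


print(warmup_lengths(1000, 4000, factor=2.0))  # [1000, 2000, 4000]
```

In practice `send` would post a dummy prompt of roughly that many tokens to the endpoint and discard the response.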

If you need predictable interactive latency on a single GPU today, a pure-attention model (Codestral 25.08, Devstral 2, Qwen3-Coder-Next) will feel dramatically smoother than a hybrid-Mamba model with equivalent benchmark scores. Mamba serving will improve, but it is not there yet.

What Actually Fits on a Single H200

Here is the matrix as of May 2026, sorted by SWE-bench Verified score where published:

Model	Architecture	Native quant	Real VRAM	Fits 141 GB?
GLM-5.1	744B MoE	FP8	~860 GB	No
DeepSeek V4-Pro	1.6T MoE	FP4	~800 GB	No
Kimi K2.5	1T MoE	INT4	~600 GB	No
MiniMax M2.7	230B MoE	FP8	~220 GB	No
DeepSeek V4-Flash	284B MoE	FP4	~149 GB	No, by 8 GB
Devstral 2	123B dense	online FP8	~119 GB	Tight, no headroom
Qwen3-Coder-Next	80B MoE	static FP8	~82 GB	Yes
Kimi-Dev-72B	72B dense	online FP8	~73 GB	Yes
Qwen3.6-27B	27B hybrid	static FP8	~27 GB	Yes
Codestral 25.08	22B dense	static FP8	~22 GB	Yes
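The last column is simple subtraction against the 141 GB budget, which can be reproduced with the figures from the matrix. This is a sketch of paper headroom only: the usable KV cache space is smaller in practice, as the Devstral 2 run showed with about 5 GB left after loading.

```python
H200_GB = 141  # single-GPU VRAM used throughout this post

# "Real VRAM" column from the matrix above, in GB.
REAL_VRAM_GB = {
    "GLM-5.1": 860,
    "DeepSeek V4-Pro": 800,
    "Kimi K2.5": 600,
    "MiniMax M2.7": 220,
    "DeepSeek V4-Flash": 149,
    "Devstral 2": 119,
    "Qwen3-Coder-Next": 82,
    "Kimi-Dev-72B": 73,
    "Qwen3.6-27B": 27,
    "Codestral 25.08": 22,
}


def paper_headroom_gb(model: str, gpu_gb: int = H200_GB) -> int:
    """GPU memory left after weights, before any runtime overhead (negative = does not fit)."""
    return gpu_gb - REAL_VRAM_GB[model]


for name in REAL_VRAM_GB:
    gap = paper_headroom_gb(name)
    verdict = f"{gap} GB left on paper" if gap >= 0 else f"short by {-gap} GB"
    print(f"{name}: {verdict}")
```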

The pattern is consistent. Any frontier-class model that is competitive with Anthropic Opus on hard tasks requires 8x H100 or 8x H200 hardware. With current vLLM, single-GPU territory tops out at roughly 80B total parameters with usable headroom. 2x H100 and 2x H200 configurations do not exist as a public cloud SKU at any provider we evaluated, so the choice is one GPU or eight, with nothing in between.

Multi-GPU Is the Only Path to Frontier Models

DeepSeek V4-Pro and Kimi K2.5 are the open-weight models that genuinely compete with Opus 4.7 on SWE-bench Verified. Both require an 8x H200 node with NVLink for tensor parallelism. You cannot fake this with multiple single-GPU instances, because the model weights have to be sharded across GPUs that share a low-latency interconnect.
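The per-GPU arithmetic for an 8-way shard looks like this. It is a simplified sketch that assumes weights split evenly across the tensor-parallel group and ignores any replicated tensors or communication buffers; the function name is ours.

```python
def per_gpu_budget_gb(total_weights_gb: float, tensor_parallel_size: int,
                      gpu_gb: float = 141.0) -> tuple:
    """Return (weights per GPU, memory left per GPU) under an even split."""
    shard = total_weights_gb / tensor_parallel_size
    return shard, gpu_gb - shard


# DeepSeek V4-Pro at ~800 GB across an 8x H200 node.
shard, left = per_gpu_budget_gb(800, 8)
print(f"{shard:.0f} GB of weights per GPU, {left:.0f} GB left per GPU")
```

With the ~800 GB figure from the matrix, each GPU holds about 100 GB of weights and has about 41 GB left for KV cache and overhead.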

What This Means for Open-Weight LLM Plans

Three takeaways for anyone planning open-weight LLM deployment in 2026:

Treat published “FP8 footprint” as a floor, not an estimate, for any model without a static FP8 release. Online quantization through vLLM will roughly double it. Always check for *-FP8 repos on Hugging Face before sizing hardware.
Budget KV cache for the client, not the model. Cline-class agents need at least 131K context to avoid silent overflow. That dictates more KV cache memory than the model card suggests.
Hybrid Mamba and reasoning-trace models are not interactive-ready on vLLM 0.21. The benchmarks are real, but per-shape JIT plus visible reasoning tokens make TTFB unpredictable. Pure-attention models are smoother for agent workflows even when their benchmark numbers are lower.

Single-GPU open-weight serving is genuinely useful for offline batch work, privacy-sensitive code that cannot leave your infrastructure, or cheap iteration on contained tasks. It is not a one-to-one replacement for Anthropic Opus, and the marketing benchmarks will not tell you that.

The comparison is this: a single H200 running Qwen3-Coder-Next or Devstral 2 gets you 70-80% of Opus quality on routine work, 40-60% on hard work, and none of it on long-horizon multi-file refactors. If that is a fit, the economics can work. If you need Opus-class output on hard tasks, you need either Opus or an 8x H200 cluster running V4-Pro. The middle ground does not exist yet.
