What Open-Weight LLMs Actually Fit on a Single H200
Marketing benchmarks tell you a 123B model fits in 62 GB at FP8. The real footprint after vLLM loads it is 120 GB. Here's what 2026 open-weight LLMs actually need when you serve them as coding agent backends.
We spent a week trying to stand up open-weight 2026 coding models as backends for Cline and similar agent-style coding tools. Hardware: a single H200 with 141 GB of VRAM. Every model we evaluated had a clean published spec, a benchmark number, and a recommended quantization. Almost every model also had a deployment surprise that broke the spec sheet math.
This post is about the gap between what the model card says and what the GPU actually allocates.
The Naive Sizing Math
Open-weight model cards report a parameter count and recommend a quantization. The mental shortcut is straightforward: parameters times bytes per parameter equals VRAM footprint. The figures that circulate work out to half a byte per parameter: Devstral 2 at 123B parameters in FP8 “should” need about 62 GB, Qwen3.6-27B in FP8 “should” need about 14 GB, and a 1.6T MoE model in 4-bit “should” need around 800 GB.
That math is wrong on dense models served with vLLM’s online quantization. It is correct on models with a properly prepared static FP8 checkpoint. The difference between those two paths is the first place reality diverges from the spec sheet.
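The shortcut is simple enough to write down. The sketch below is a minimal illustration rather than a sizing tool: `naive_footprint_gb` is a hypothetical helper, and its 0.5 bytes-per-parameter default is chosen only because it reproduces the figures quoted above. An FP8 element is nominally one byte wide, so the width is an argument rather than a constant.

```python
def naive_footprint_gb(params_billions: float, bytes_per_param: float = 0.5) -> float:
    """Spec-sheet estimate in GB: parameter count times bytes per parameter.

    Billions of parameters times bytes per parameter gives decimal GB directly,
    because the 1e9 factors cancel.
    """
    return params_billions * bytes_per_param


# Parameter counts taken from the examples in this post.
for name, params_b in [("Devstral 2", 123), ("Qwen3.6-27B", 27), ("1.6T MoE", 1600)]:
    print(f"{name}: ~{naive_footprint_gb(params_b):.1f} GB at 0.5 B/param, "
          f"~{naive_footprint_gb(params_b, 1.0):.1f} GB at 1 B/param")
```

Running it with both widths side by side makes it easy to see how sensitive the spec-sheet number is to the assumed element size.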
Online FP8 Quantization Doubles the Footprint
Devstral 2 ships only as BF16 (~250 GB on disk). To run it on a 141 GB GPU you have to quantize at load time with vLLM’s --quantization fp8 flag. We did exactly that. vLLM reported back:
INFO gpu_model_runner.py:4959 Model loading took 119.5 GiB memory and 21.5 seconds
119.5 GiB, not 62 GB. The online quantizer keeps layer norms, embeddings, and certain attention layers at BF16 for numerical stability. The real footprint is roughly twice the naive estimate. That left about 5 GB of KV cache headroom on a 141 GB GPU, which capped the practical context window at 31K tokens.
The fix is to use a pre-quantized FP8 checkpoint when one exists. Qwen ships Qwen/Qwen3-Coder-Next-FP8 as a true static FP8 release: 82 GB on disk, roughly 82 GB on the GPU, no surprise. If a model has no static FP8 release on Hugging Face, treat roughly half the BF16 download size as the GPU floor for online FP8 serving, as with Devstral 2 (~250 GB on disk, 119.5 GiB loaded), and budget hardware accordingly.
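The headroom arithmetic is easy to get wrong because the vLLM log reports GiB while GPU capacity is quoted in GB. The sketch below is a rough estimate under stated assumptions: `kv_headroom_gib` is a hypothetical helper, the 0.95 value for vLLM's `--gpu-memory-utilization` setting is an assumption (the post does not say which value was used), and activation memory and CUDA graph overhead are ignored.

```python
GIB = 2**30   # bytes in one GiB (what the vLLM log line reports)
GB = 10**9    # bytes in one GB (what GPU spec sheets quote)


def kv_headroom_gib(gpu_capacity_gb: float, weights_gib: float,
                    gpu_memory_utilization: float = 0.95) -> float:
    """GiB left for KV cache once weights are loaded.

    gpu_memory_utilization mirrors vLLM's setting of the same name;
    0.95 is an assumed value, not one taken from the post.
    """
    usable_gib = gpu_capacity_gb * GB / GIB * gpu_memory_utilization
    return usable_gib - weights_gib


# 141 GB card, 119.5 GiB of weights as reported in the log line above.
print(f"{kv_headroom_gib(141, 119.5):.1f} GiB left for KV cache")
```

Under that assumption the result lands in the same range as the headroom described above. At a utilization of 0.90 the same calculation goes negative, which is one reason to check this number before committing to a model.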
Coding Agents Eat 100K+ Context Routinely
Once a model is loaded, the next surprise is how much KV cache an interactive coding agent consumes. Cline (and similar tools like Aider and Continue.dev) inject a 25-30K token system prompt before any user instruction. Add a few file reads, a couple of tool results, and a single user turn, and the request prompt easily exceeds 65K tokens.
We initially set --max-model-len 65536 on the vLLM endpoint because that seemed generous. It was not. vLLM returned 400 Bad Request with a VLLMValidationError on roughly one in three multi-turn requests. Cline does not surface that error in its UI. It simply shows a spinner that never resolves. From the user’s perspective, the model “freezes.”
The fix is to raise --max-model-len to the model’s native maximum (or at least 131072) and budget KV cache memory accordingly. Use --kv-cache-dtype fp8 to halve KV memory consumption. On a single H200, a 27B model can support its full native 262K context with room to spare. A 123B model can support 32K-64K once you account for weights.
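To see why `--kv-cache-dtype fp8` matters, it helps to estimate KV cache per sequence. The sketch below uses the standard accounting for attention layers (one key and one value vector per layer per KV head per token). The layer count, KV head count, and head dimension are hypothetical placeholders, not the configuration of any model named in this post; read the real values from the model's config before relying on the result.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_element: int) -> int:
    """Bytes of KV cache one token occupies across all attention layers."""
    # Factor 2: each layer stores one key and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def kv_cache_gib(context_tokens: int, num_layers: int, num_kv_heads: int,
                 head_dim: int, bytes_per_element: int) -> float:
    """GiB of KV cache for a single sequence of context_tokens tokens."""
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element)
    return context_tokens * per_token / 2**30


# Hypothetical model shape, for illustration only.
HYPOTHETICAL = dict(num_layers=64, num_kv_heads=8, head_dim=128)

for label, width in [("bf16", 2), ("fp8", 1)]:
    gib = kv_cache_gib(131072, bytes_per_element=width, **HYPOTHETICAL)
    print(f"{label}: {gib} GiB for one 131072-token sequence")
```

The estimate is per sequence, so concurrent conversations multiply it. State-space layers in hybrid models do not keep a per-token KV cache, so this formula applies only to the attention layers of such a model.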
Hybrid Mamba Models Stall Mid-Response
Qwen3.6-27B is a hybrid architecture: most layers are standard attention, but several are selective state-space (Mamba) layers. The benchmarks are strong, the VRAM footprint is tiny, and the model is Apache 2.0. On paper it is the obvious pick for a single-GPU coding endpoint.
In practice, vLLM 0.21’s Triton kernel cache for Mamba operators is shape-specific. Each new combination of sequence length, batch size, and prefix cache state triggers a 5-30 second JIT compile during inference. The vLLM logs are explicit about it:
WARNING jit_monitor.py:103 Triton kernel JIT compilation during inference:
_causal_conv1d_fwd_kernel. This causes a latency spike; consider extending
warmup to cover this shape/config.
For a batch-inference workload this is fine, since each shape JITs once and amortizes across the batch. For an interactive coding agent where every conversation has a different prompt length, the JIT pauses stack with the model’s reasoning trace overhead to produce 10-80 second freezes between user turns. The cache populates over time and the experience improves, but the first day of real use is rough.
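The log message suggests extending warmup to cover more shapes. One way to plan that is a geometric ladder of prompt lengths to send through the endpoint before real users arrive. The sketch below only generates the lengths: `warmup_lengths` is a hypothetical helper, and whether a ladder like this covers the shapes that matter depends on how the kernels are specialized, which is an assumption here rather than something the logs confirm.

```python
def warmup_lengths(min_tokens: int, max_tokens: int,
                   steps_per_doubling: int = 2) -> list[int]:
    """Geometrically spaced prompt lengths from min_tokens up to max_tokens."""
    ratio = 2 ** (1 / steps_per_doubling)
    lengths = []
    value = float(min_tokens)
    while value < max_tokens:
        lengths.append(int(round(value)))
        value *= ratio
    lengths.append(max_tokens)  # always include the upper bound itself
    return sorted(set(lengths))


# Plan warmup prompts between 1K and 128K tokens, two steps per doubling.
plan = warmup_lengths(1024, 131072)
print(len(plan), "warmup prompts:", plan)
```

Each length would then be sent as a throwaway request, ideally at the batch sizes you expect in production, so that the first real conversation does not pay for the compile.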
If you need predictable interactive latency on a single GPU today, a pure-attention model (Codestral 25.08, Devstral 2, Qwen3-Coder-Next) will feel dramatically smoother than a hybrid-Mamba model with equivalent benchmark scores. Mamba serving will improve, but it is not there yet.
What Actually Fits on a Single H200
Here is the matrix as of May 2026, sorted by SWE-bench Verified score where published:
| Model | Architecture | Native quant | Real VRAM | Fits 141 GB? |
|---|---|---|---|---|
| GLM-5.1 | 744B MoE | FP8 | ~860 GB | No |
| DeepSeek V4-Pro | 1.6T MoE | FP4 | ~800 GB | No |
| Kimi K2.5 | 1T MoE | INT4 | ~600 GB | No |
| MiniMax M2.7 | 230B MoE | FP8 | ~220 GB | No |
| DeepSeek V4-Flash | 284B MoE | FP4 | ~149 GB | No, by 8 GB |
| Devstral 2 | 123B dense | online FP8 | ~119 GB | Tight, no headroom |
| Qwen3-Coder-Next | 80B MoE | static FP8 | ~82 GB | Yes |
| Kimi-Dev-72B | 72B dense | online FP8 | ~73 GB | Yes |
| Qwen3.6-27B | 27B hybrid | static FP8 | ~27 GB | Yes |
| Codestral 25.08 | 22B dense | static FP8 | ~22 GB | Yes |
The pattern is consistent. Any frontier-class model that is competitive with Anthropic Opus on hard tasks requires 8x H100 or 8x H200 hardware. Comfortable single-GPU territory tops out at roughly 80B total parameters with current vLLM. 2x H100 and 2x H200 configurations do not exist as a public cloud SKU at any provider we evaluated, so the choice is one GPU or eight, with nothing in between.
Multi-GPU Is the Only Path to Frontier Models
DeepSeek V4-Pro and Kimi K2.5 are the open-weight models that genuinely compete with Opus 4.7 on SWE-bench Verified. Both require an 8x H200 node with NVLink for tensor parallelism. You cannot fake this with multiple single-GPU instances because the model weights have to be sharded across GPUs with low-latency interconnect.
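A quick way to sanity-check the node size is to ask how many 141 GB GPUs the weights alone need. The sketch below is a weights-only lower bound: `min_tp_degree` is a hypothetical helper, it leaves no room for KV cache or activations, and it rounds up to a power of two on the assumption that tensor-parallel degrees are chosen that way in practice.

```python
import math


def min_tp_degree(weights_gb: float, gpu_gb: float = 141.0) -> int:
    """Smallest power-of-two GPU count whose combined memory holds the weights."""
    needed = math.ceil(weights_gb / gpu_gb)
    degree = 1
    while degree < needed:
        degree *= 2
    return degree


# "Real VRAM" figures from the table in this post.
for name, weights_gb in [("DeepSeek V4-Pro", 800), ("Kimi K2.5", 600), ("MiniMax M2.7", 220)]:
    tp = min_tp_degree(weights_gb)
    print(f"{name}: TP={tp}, ~{weights_gb / tp:.0f} GB of weights per GPU")
```

For the two frontier models this lands on eight GPUs before any KV cache is budgeted, which matches the node size described above.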
What This Means for Open-Weight LLM Plans
Three takeaways for anyone planning open-weight LLM deployment in 2026:
- Treat published “FP8 footprint” as a floor, not an estimate, for any model without a static FP8 release. Online quantization through vLLM will roughly double it. Always check for *-FP8 repos on Hugging Face before sizing hardware.
- Budget KV cache for the client, not the model. Cline-class agents need at least 131K context to avoid silent overflow. That dictates more KV cache memory than the model card suggests.
- Hybrid Mamba and reasoning-trace models are not interactive-ready on vLLM 0.21. The benchmarks are real, but per-shape JIT plus visible reasoning tokens make TTFB unpredictable. Pure-attention models are smoother for agent workflows even when their benchmark numbers are lower.
Single-GPU open-weight serving is genuinely useful for offline batch work, privacy-sensitive code that cannot leave your infrastructure, or cheap iteration on contained tasks. It is not a one-to-one replacement for Anthropic Opus, and the marketing benchmarks will not tell you that.
The comparison is this: a single H200 running Qwen3-Coder-Next or Devstral 2 gets you 70-80% of Opus quality on routine work, 40-60% on hard work, and essentially none of it on long-horizon multi-file refactors. If that is a fit, the economics can work. If you need Opus-class output on hard tasks, you need either Opus or an 8x H200 cluster running V4-Pro. The middle ground does not exist yet.