Prefill vs decode
a.k.a. Prompt processing vs token generation · Prefill phase / decode phase
Two very different phases of LLM inference: prefill processes the prompt in parallel, decode generates one token at a time.
Why it matters
Almost every serving cost and latency question comes down to which phase you're in. Prefill is compute-bound and fast per token because the GPU processes every prompt token in parallel, as a few large matrix multiplications per layer. Decode is memory-bound and slow per token because each step depends on the previous one, so every new token means re-reading the model weights and the KV cache to do only a small amount of compute. Time-to-first-token (TTFT) is mostly prefill cost. Time-per-output-token (TPOT) is mostly decode cost. A throughput number means little unless it names the phase.
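The compute-bound versus memory-bound split can be sketched with a rough roofline-style estimate. This is a back-of-envelope model, not a benchmark: it assumes about 2 FLOPs per parameter per token, 2 bytes per parameter, and it ignores attention FLOPs and activation traffic. The `Hardware` numbers, the model size, and the KV cache size are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    peak_flops: float     # FLOP/s (hypothetical accelerator)
    mem_bandwidth: float  # bytes/s (hypothetical accelerator)


def step_time(flops, bytes_moved, hw):
    """Return (seconds, limiting resource) for one forward pass."""
    compute_s = flops / hw.peak_flops
    memory_s = bytes_moved / hw.mem_bandwidth
    bound = "compute" if compute_s >= memory_s else "memory"
    return max(compute_s, memory_s), bound


def prefill_time(n_params, prompt_tokens, hw, bytes_per_param=2):
    # All prompt tokens share a single read of the weights.
    flops = 2 * n_params * prompt_tokens
    bytes_moved = n_params * bytes_per_param
    return step_time(flops, bytes_moved, hw)


def decode_step_time(n_params, kv_cache_bytes, hw, bytes_per_param=2):
    # One token of compute, but the weights and KV cache are read again.
    flops = 2 * n_params
    bytes_moved = n_params * bytes_per_param + kv_cache_bytes
    return step_time(flops, bytes_moved, hw)


hw = Hardware(peak_flops=1e15, mem_bandwidth=2e12)
n_params = 7e9
prompt_tokens = 2048

ttft, prefill_bound = prefill_time(n_params, prompt_tokens, hw)
tpot, decode_bound = decode_step_time(n_params, kv_cache_bytes=1e9, hw=hw)

print(f"prefill: {ttft * 1e3:.1f} ms total, {prefill_bound}-bound")
print(f"decode:  {tpot * 1e3:.1f} ms per token, {decode_bound}-bound")
```

Under these assumptions, prefill costs far less per token than a decode step, even though prefill does the same arithmetic per token: the weights are read once for the whole prompt instead of once per token.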
You'll hear it when
Someone says 'we're decode-bound', meaning memory bandwidth (reading the weights and the KV cache at every step) is the bottleneck and adding GPUs won't help linearly. Or 'prefill amortizes over the batch', meaning a long shared system prompt is cheap per request if many users hit it at once, typically because its KV cache is computed once and reused.
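The 'prefill amortizes over the batch' claim can be put in numbers with a small sketch. It assumes the serving stack can compute the shared prefix once and reuse it across requests; the function name, parameters, and token counts are hypothetical and only count prefill tokens, not wall-clock time.

```python
def prefill_tokens_per_request(shared_prefix_tokens, unique_tokens,
                               n_requests, reuse_prefix):
    """Average number of prompt tokens prefilled per request."""
    if n_requests <= 0:
        raise ValueError("n_requests must be positive")
    if reuse_prefix:
        # The shared prefix is processed once for the whole group.
        total = shared_prefix_tokens + unique_tokens * n_requests
    else:
        # Every request pays for the full prompt on its own.
        total = (shared_prefix_tokens + unique_tokens) * n_requests
    return total / n_requests


without_reuse = prefill_tokens_per_request(4000, 100, 50, reuse_prefix=False)
with_reuse = prefill_tokens_per_request(4000, 100, 50, reuse_prefix=True)

print(without_reuse)  # 4100.0
print(with_reuse)     # 180.0
```

With a 4000-token system prompt, 100 unique tokens per user, and 50 concurrent users, the average prefill work per request drops from 4100 tokens to 180.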