Skip to main content
Living Glossary

The LLM Lexicon

A working vocabulary for talking about large language models the way practitioners actually do. Browse 25 terms across serving, training, prompting, and evaluation, or flip through them as flashcards. New entries land as the field shifts.

32 of 32 terms
Serving & Infra

Prefill vs decode

a.k.a. Prompt processing vs token generation · Prefill phase / decode phase

Two very different phases of LLM inference: prefill processes the prompt in parallel, decode generates one token at a time.

Prefill vs decode

Why it matters

Almost every serving cost and latency question comes down to which phase you're in. Prefill is compute-bound and fast per token because the GPU runs the whole prompt as one matmul. Decode is memory-bound and slow per token because each step depends on the previous one. Time-to-first-token (TTFT) is mostly prefill cost. Time-per-output-token (TPOT) is mostly decode cost. Talking about throughput without naming the phase is meaningless.

You'll hear it when

Someone says 'we're decode-bound', meaning the KV cache reads are the bottleneck and adding GPUs won't help linearly. Or 'prefill amortizes over the batch', meaning long shared system prompts are cheap to process if many users hit them at once.

Serving & Infra

KV cache

a.k.a. Attention cache · Key-value cache

The stored attention keys and values for every token already in the sequence. Read on every decode step, the silent driver of inference memory cost.

KV cache

Why it matters

After model weights, KV cache is the second-largest consumer of GPU memory and often the binding constraint on context length and batch size. A 70B model in FP8 can cost 80 GB of weights, then 0.5 to 2 MB per token of KV cache, which means a single 128K-context request eats 60 to 250 GB on its own. Every conversation about 'max context' on a given GPU is really a conversation about KV cache math.

You'll hear it when

Someone says they 'turned on FP8 KV cache' to fit a longer context, or that 'the KV cache is paged' so vLLM can fit more concurrent requests. Also: 'prefix caching hits the KV', meaning a shared system prompt was already in cache from a previous request.

Serving & Infra

Quantization

a.k.a. FP8 / FP4 / INT4 · AWQ / GPTQ / GGUF

Storing model weights and activations in fewer bits than the training precision so they fit on cheaper hardware.

Quantization

Why it matters

Every open-weight model release lives or dies by its quantized versions. BF16 is the training-time default at 2 bytes per parameter; FP8 cuts that to 1 byte; FP4 and INT4 cut it to half a byte. The catch is that not all quantizations are equal: a static, pre-quantized FP8 checkpoint behaves very differently from online FP8 done at load time, and the marketing 'FP8 footprint' number is often the floor, not the reality.

You'll hear it when

Someone says 'we run the AWQ INT4 build' or 'we online-quant to FP8 with vLLM'. The version they pick changes both the GPU footprint and the quality ceiling.

Serving & Infra

MoE vs dense

a.k.a. Mixture of experts · Sparse model · Active parameters

Two ways to scale a model: dense activates every parameter for every token, MoE activates a small fraction routed by a gating network.

MoE vs dense

Why it matters

MoE is how 2026 open weights crossed a trillion parameters. A 1T-parameter MoE model with 32B active parameters per token has the memory footprint of a 1T model but the compute footprint of a 32B model. The catch is that all the weights still have to live in VRAM, and routing imbalance can stall a GPU. When someone says 'it's a 230B model', the next question is always 'how many active?'

You'll hear it when

Someone calls a model '670B-A37B', meaning 670B total parameters with 37B active per token. Or says the model 'has 8 experts, top-2 routing', meaning the gating network picks 2 of 8 experts per token.

Serving & Infra

Context window

a.k.a. Max model length · Max sequence length

The maximum number of tokens, prompt plus output, the model can hold at once. Marketing number on the box; runtime budget once you serve it.

Context window

Why it matters

Published context windows (200K, 1M, 10M) are training-time maxima. The window you actually get at serve time is whatever fits in KV cache after weights are loaded. Coding agents and document chat eat context faster than people expect: a 30K-token system prompt plus a few file reads is a normal first turn for Cline. Serving with too small a window produces silent overflow errors that look like a frozen UI.

You'll hear it when

Someone says they 'set max-model-len to 131072' on a vLLM endpoint, or 'the context blew up past 200K and the agent stopped responding'. Also: 'we use sliding window attention for the last 4K tokens', meaning the model only attends to a window rather than the whole history.

Heard from

"Context engineering is the delicate art and science of filling the context window with just the right information for the next step."
— Andrej Karpathy , X (Twitter) (2025)
"A model's pre-trained weights are like a hazy recollection of something read a year ago, while information provided in the context window is like working memory — perfectly and immediately accessible."
— Andrej Karpathy , Dwarkesh Podcast (2025)
Serving & Infra deeper cut

Speculative decoding

a.k.a. Spec decoding · Draft model decoding · Medusa / EAGLE / Lookahead

Use a tiny fast model to guess several future tokens, then have the big model verify them all in one parallel step. Free speedup when guesses are right.

Speculative decoding

Why it matters

Decode is memory-bound and underuses the GPU's compute. Speculative decoding spends the slack on parallel verification of cheap guesses. When the small model's guess matches what the big model would produce, you get multiple tokens for the cost of one forward pass. Real-world speedups are 1.5x to 3x on greedy decoding, more on structured tasks. This is the single biggest reason modern serving stacks feel fast.

You'll hear it when

Someone says 'we run a 1B draft model in front of the 70B target' or 'our acceptance rate is 0.65', meaning 65% of guessed tokens survive verification. Also: 'Medusa heads beat external drafts on our traffic'.

Serving & Infra deeper cut

Tensor parallelism

a.k.a. TP · Model parallelism · TP / PP / EP

Splitting individual weight matrices across multiple GPUs so the same forward pass runs in parallel on each one. The only way to run frontier models that don't fit on one GPU.

Tensor parallelism

Why it matters

Above ~80B active parameters, no single GPU has enough VRAM for the weights plus KV cache. Tensor parallelism shards each matmul row-wise or column-wise across N GPUs, runs the math in parallel, and uses NVLink to combine the partial results. The catch is that every layer needs an all-reduce, so you need high-bandwidth interconnect; commodity PCIe will not keep up. This is why 'multi-GPU' really means 'multi-GPU on the same NVLink island'.

You'll hear it when

Someone says 'we run TP=8 on an H200 node', meaning the model is sharded across 8 GPUs over NVLink. Or 'EP for experts, TP for the rest', meaning experts go on different GPUs entirely while non-expert weights are tensor-parallel.

Serving & Infra

Test-time compute

a.k.a. Inference-time compute · Thinking compute · TTC scaling

Spending more compute at inference (reasoning tokens, search, self-consistency) to get better answers from a fixed model. The third scaling axis after data and parameters.

Test-time compute

Why it matters

Through 2024 the scaling laws were about pre-training: more data, more parameters, more flops at training time. OpenAI's o1 launch reframed the field by showing that the same base model could get dramatically better at hard problems if you let it think longer at inference. That single insight redirected roadmaps. Test-time compute is the axis you can scale per request, per user, per problem; you don't need a bigger training run to get smarter answers, you just need a bigger thinking budget.

You'll hear it when

Someone says 'we gave it a 30K thinking budget' or 'test-time compute closed most of the gap to GPT-5'. Also: 'high-effort mode' (a thin user-facing wrapper around 'use more test-time compute').

Heard from

"We're no longer bottlenecked by pretraining. We can now scale inference compute too."
— Noam Brown (OpenAI) , X (Twitter) — o1 launch thread (2024)
Training & Alignment

Pre-training

a.k.a. Base model training · Foundation training

The expensive first stage where a model learns to predict the next token on trillions of tokens of internet text and code.

Pre-training

Why it matters

Pre-training is where capability is set. A pre-trained base model can complete text but cannot follow instructions; that's what later stages do. The vast majority of an open-weight release's cost, often 95%+ of the total flops, is pre-training. Everything else, SFT, RLHF, DPO, steering, is shaping a model that already knows what it knows.

You'll hear it when

Someone says 'the base model has the capability but it won't surface it for free' or 'we annealed the last 10% of pre-training on math', meaning they shifted the data mix late in training to bias what the model is good at.

Heard from

"Pre-training as we know it will unquestionably end. We have but one internet. Data is not growing because we have but one internet. You could even say that data is the fossil fuel of AI."
— Ilya Sutskever , NeurIPS 2024 Test-of-Time keynote (2024)
"A one terabyte zip file — it's full of compressed knowledge from the internet."
— Andrej Karpathy , Deep Dive into LLMs like ChatGPT (YouTube) (2025)
"We're not building animals. We're building ghosts or spirits, because we're not doing training by evolution. We're doing training by imitation of humans and the data that they've put on the Internet."
— Andrej Karpathy , Dwarkesh Podcast (2025)
Training & Alignment

SFT

a.k.a. Supervised fine-tuning · Instruction tuning

Teach a base model to follow instructions by training it on curated prompt-and-response pairs.

SFT

Why it matters

SFT turns a next-token predictor into something resembling an assistant. The model is fine-tuned on tens of thousands to millions of (prompt, ideal response) pairs that demonstrate the behavior you want: answer this kind of question this way, format like this, refuse like this. Almost every chat-style model you've used had SFT as the second stage of its training. SFT is also the cheapest stage to do well: a few GPU-days on a 70B model can change behavior dramatically.

You'll hear it when

Someone says 'we SFT'd it on internal docs to make it answer in our voice' or 'the base was strong but the SFT data was thin'.

Heard from

"It's full of compressed knowledge from the internet, but it's the human touch in post-training that gives it a soul."
— Andrej Karpathy , Deep Dive into LLMs like ChatGPT (YouTube) (2025)
Training & Alignment

RLHF

a.k.a. Reinforcement learning from human feedback · Preference optimization (classic)

Train a reward model on human preference comparisons, then use RL to tune the LLM against that reward.

RLHF

Why it matters

RLHF is what made ChatGPT feel like ChatGPT. It's the stage that takes an SFT'd model from 'follows instructions' to 'follows instructions in the way people prefer'. The pipeline has three parts: collect pairwise preferences from humans, train a reward model, optimize the LLM with PPO against that reward. RLHF is also the most fragile part of the stack: reward hacking, mode collapse, and over-refusal all live here.

You'll hear it when

Someone says 'the RLHF made it sycophantic' or 'we replaced PPO with DPO because the reward model was overfitting'. Also: 'RLAIF' (the H is replaced by AI-generated preferences from a stronger model).

Heard from

"RLHF is just barely RL... RL is powerful. RLHF is not."
— Andrej Karpathy , X (Twitter) (2024)
Training & Alignment

DPO

a.k.a. Direct preference optimization · Reference-free preference optimization (variants)

Optimize directly against preference pairs without training a reward model or running PPO. Same outcome, far less infrastructure.

DPO

Why it matters

DPO replaced classic PPO-based RLHF in most open-weight releases because it does the same job with a fraction of the engineering. You take the SFT model, the same preference dataset you would have used for RLHF, and a closed-form loss that directly increases the model's probability of preferred completions relative to rejected ones. No reward model, no PPO loop, no rollout management. The catch is that DPO is sensitive to how the SFT model was initialized and to the temperature of the preference data.

You'll hear it when

Someone says 'we ran DPO over 80K preference pairs' or 'the DPO drove the refusal rate up too far'. Also: 'IPO/KTO/ORPO' are variants that change the loss shape but keep the same idea.

Training & Alignment

LoRA

a.k.a. Low-rank adaptation · QLoRA · PEFT

Fine-tune a frozen large model by training only two tiny low-rank matrices per weight matrix. Cheap, fast, swappable.

LoRA

Why it matters

LoRA made fine-tuning a 70B model possible on a single workstation GPU. Instead of updating the full weight matrix W, you freeze it and train a low-rank delta BA, where B is d×r and A is r×k with r in the range 4-64. The trainable parameters drop by 1000x or more, the memory footprint drops with them, and the adapter file is small enough to swap at runtime. QLoRA adds 4-bit quantization of the frozen base, dropping the memory floor even further.

You'll hear it when

Someone says 'we LoRA'd it over a few thousand examples' or 'we hot-swap adapters per customer'. Also: 'QLoRA on a single 4090' is a common boast for hobbyist fine-tuning of 70B-class models.

Training & Alignment deeper cut

Distillation

a.k.a. Teacher-student training · Knowledge distillation

Train a smaller 'student' model to imitate a larger 'teacher' model's outputs, getting most of the teacher's capability at a fraction of the size.

Distillation

Why it matters

Distillation is how an 8B-parameter open model can plausibly match a 70B model from two years ago. The student trains on outputs generated by the teacher (text, logits, or both) rather than on raw web text, so each training example carries higher signal density. The technique powers most of the modern small-model wave and most synthetic-data pipelines.

You'll hear it when

Someone says 'this is a distill of Opus' or 'we used GPT-4o as the teacher for our SFT mix'. Also: 'logit distillation vs text distillation', a distinction about whether the student trains on the teacher's full probability distribution or just its sampled outputs.

Heard from

"Imperfect replicas, a kind of statistical distillation of humanity's documents with some sprinkle on top."
— Andrej Karpathy , Dwarkesh Podcast (2025)
"Intelligence is compression."
— Ilya Sutskever , NeurIPS / Dwarkesh appearances 2023-2024 (2024)
Training & Alignment

The Bitter Lesson

a.k.a. Sutton's bitter lesson · Scale beats cleverness

Rich Sutton's claim that general methods which exploit computation outperform handcrafted methods that exploit human knowledge, over and over, by a large margin.

The Bitter Lesson

Why it matters

The Bitter Lesson is a 2019 essay that has become the most-cited justification in modern AI. Every time a team faces a choice between hand-engineering a clever algorithm and just throwing more compute at a general method, the bitter-lesson argument says compute wins eventually. Pre-training, RLHF replacement by simpler methods, the rise of MoE, the dominance of attention over older architectures: all are bitter-lesson trajectories in retrospect. The essay is short and worth reading directly.

You'll hear it when

Someone says 'that's a bitter-lesson decision' to defend choosing the more general, more compute-heavy approach over the bespoke one. Or 'we tried the clever thing and the bitter lesson got us', meaning the simpler scaled approach overtook the engineered one.

Heard from

"The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin."
— Rich Sutton , The Bitter Lesson (2019)
Training & Alignment

Scaling hypothesis

a.k.a. Scaling laws · Big Blob of Compute Hypothesis

The claim that a small number of inputs (compute, data, model size, training time) explain most LLM capability gains, and that scaling them up reliably gets you more capability.

Scaling hypothesis

Why it matters

The scaling hypothesis was the bet that drove the 2020-2023 LLM boom. It says: don't innovate on architecture, scale what works. The Kaplan and Chinchilla papers turned this into quantitative laws (loss as a power-law function of compute and data). The hypothesis has been amended several times (data quality matters, test-time compute is a new axis, pre-training data is finite), but the core remains the operating assumption of every frontier lab. When Amodei or Sutskever talks about 'scaling', this is the frame they're inside.

You'll hear it when

Someone says 'we're scaling-hypothesis pilled' or 'the scaling laws don't predict that capability'. Also: 'data-constrained scaling' (what happens when you hit the data ceiling).

Heard from

"All the cleverness, all the techniques... doesn't matter very much. There are only a few things that matter... it's a hypothesis I still hold."
— Dario Amodei , Dwarkesh Podcast (2023)
Training & Alignment

Shooting the vector

a.k.a. Activation steering · Representation engineering · Steering vectors

Pushing the model in a direction by adding a fixed vector to its hidden activations at inference time.

Shooting the vector

Why it matters

Steering is the cheapest way to change model behavior without retraining. You compute a direction in activation space that corresponds to a trait (helpful, refusing, sycophantic, hallucinating), then add or subtract it from the residual stream at runtime. Anthropic's interpretability team and Goodfire have shipped public demos. Practitioners talk about it because it sits between prompting (free, fragile) and fine-tuning (expensive, permanent).

You'll hear it when

Someone says they got a model to refuse less by 'shooting the refusal vector at -2', or that a hallucination probe lights up before generation so they can 'shoot the truth vector' to suppress it.

Prompting & Agents

System prompt

a.k.a. System message · Preamble · Context priming

The instructions you put before any user message to set the model's role, rules, tools, and constraints. The single most consequential piece of text in an LLM app.

System prompt

Why it matters

System prompts shape everything downstream. A coding agent's system prompt is often 20-30K tokens of role definition, tool schemas, formatting rules, and behavior policies. Get it wrong and every user turn inherits the mistake. The system prompt is also the most common point of failure for prompt injection and the easiest lever for changing model behavior without retraining.

You'll hear it when

Someone says 'we moved that rule into the system prompt' or 'the system prompt got too long and the model started ignoring the early instructions'. Also: 'we leaked our system prompt' (a security or competitive concern).

Heard from

"[We can give the AI] a set of principles (a 'constitution') against which it can evaluate its own outputs."
— Anthropic (Bai et al.) , Constitutional AI: Harmlessness from AI Feedback (2022)
Prompting & Agents

In-context learning

a.k.a. ICL · Few-shot prompting · Zero-shot prompting

Shaping model behavior at inference time by showing examples in the prompt, without any weight changes. The reason large models seemed to 'learn' from a few examples.

In-context learning

Why it matters

In-context learning is what made GPT-3 feel magical. You put a few input-output examples in the prompt, then a new input, and the model produces an output that follows the pattern. No fine-tuning, no training run, just text. The catch is that ICL is brittle: example order matters, label distribution matters, and at long contexts the effect fades. It also competes with explicit instruction-following in modern instruction-tuned models.

You'll hear it when

Someone says 'we got it to 80% with 5-shot' or 'zero-shot was fine on this task'. Also: 'the few-shots got contaminated', meaning the examples in the prompt accidentally biased the output.

Heard from

"A LLM is a repository of many (millions) of vector programs mined from human-generated data, learned implicitly as a by-product of language compression. A 'vector program' is just a very non-linear function that maps part of the latent space unto itself."
— François Chollet , X (Twitter) (2023)
Prompting & Agents

Chain of thought

a.k.a. CoT · Reasoning trace · Thinking tokens

Letting (or forcing) the model to write its reasoning out before its final answer, which improves accuracy on multi-step problems.

Chain of thought

Why it matters

CoT is the cheapest accuracy boost in prompting. Telling a model to 'think step by step' (or just letting it generate a working scratchpad) improves performance on math, code, and logic tasks by 10-50 points on hard benchmarks. The modern 'reasoning model' family (o1, R1, DeepSeek-R1) is CoT trained into the model itself, with the reasoning trace hidden from the user but still consuming output tokens. Every cost and latency conversation about reasoning models is really a CoT-token conversation.

You'll hear it when

Someone says 'we let it think for 30K tokens before the answer' or 'the reasoning trace blew the context'. Also: 'CoT distillation' (training a smaller model on the reasoning of a larger one).

Heard from

"o1 is trained with RL to 'think' before responding via a private chain of thought. The longer it thinks, the better it does on reasoning tasks. This opens up a new dimension for scaling. We're no longer bottlenecked by pretraining."
— Noam Brown (OpenAI) , X (Twitter) — o1 launch thread (2024)
Prompting & Agents

Tool calling

a.k.a. Function calling · Tool use · Action selection

The model emits a structured request to invoke an external function, the harness runs it, the result goes back into context, and the loop continues.

Tool calling

Why it matters

Tool calling is the substrate of every agent product. The model is given JSON schemas for tools it can call (search the web, read a file, run a query, send a message), and it can choose to emit a tool call instead of a text answer. The harness executes the tool, returns the output, and the model continues. Without tool calling, an LLM is a text completer. With it, it's a system that takes actions.

You'll hear it when

Someone says 'we exposed three tools to it' or 'it tool-called in a loop until it got the right shape'. Also: 'parallel tool calls' (the model emits multiple tool calls in one turn, executed concurrently).

Heard from

"LLMs not as a chatbot, but the kernel process of a new Operating System... it orchestrates input and output across modalities (text, audio, vision), code interpreter, ability to write and run programs..."
— Andrej Karpathy , X (Twitter) — 'LLM OS' (2023)
Prompting & Agents deeper cut

MCP

a.k.a. Model Context Protocol · MCP server / MCP client

An open protocol from Anthropic for exposing tools, resources, and prompts to LLM clients in a standard, swappable way. The USB-C of agent tooling.

MCP

Why it matters

Before MCP, every agent client (Claude Code, Cline, Cursor, Continue) had its own tool format. Adding a new capability meant writing per-client glue. MCP standardizes the wire format so a single MCP server (a small process exposing tools over stdio or HTTP) plugs into any MCP-aware client without changes. It's the closest thing the agent ecosystem has to a portability standard. Adoption is uneven but growing.

You'll hear it when

Someone says 'we exposed our database via an MCP server' or 'the MCP integration broke after the client update'. Also: 'MCP transport' (stdio for local, SSE/HTTP for remote).

Heard from

"Donating MCP to the Linux Foundation as part of the AAIF ensures it stays open, neutral, and community-driven as it becomes critical infrastructure for AI."
— Mike Krieger (Anthropic CPO) , Anthropic/Linux Foundation joint announcement (2025)
Prompting & Agents

Prompt injection

a.k.a. Indirect prompt injection · IPI · Tool poisoning

Untrusted text in the model's context overrides the system prompt's instructions and gets the model to do something the operator didn't intend.

Prompt injection

Why it matters

Prompt injection is the SQL injection of the agent era. Any time a model reads text it didn't author (web pages, emails, file contents, tool outputs), an attacker who controls that text can attempt to redirect the model's behavior. Direct injection is the user typing 'ignore previous instructions'. Indirect injection is far more dangerous: an attacker leaves instructions inside a web page or a document, and the agent reads them while doing its job. Mitigations are partial and the threat surface grows with every new tool.

You'll hear it when

Someone says 'the agent got prompt-injected by the README' or 'we sandbox tool outputs because of IPI'. Also: 'spotlighting' or 'data tagging' (techniques to mark which context is user vs untrusted).

Heard from

"[Prompt injection happens] when you take trusted prompt and untrusted prompt and you concatenate 'em together — that's the root of the whole cause."
— Simon Willison , Generationship podcast, ep. 39 (the term's coiner) (2024)
Prompting & Agents

Context engineering

a.k.a. Context construction · Context assembly

Treating what goes into the context window each step as a first-class engineering problem. The discipline that replaced 'prompt engineering' once agents got real.

Context engineering

Why it matters

Prompt engineering described a 2023 world where a person typed one prompt and the model answered. Context engineering describes the 2025-2026 world where an agent loop assembles tens of thousands of tokens per turn from system rules, retrieved documents, conversation history, and tool outputs, and the assembly is the work. The model is mostly fixed; what you put in front of it is variable. Job titles changed to match.

You'll hear it when

Someone says 'we moved the rule out of the system prompt into retrieval-on-demand to save tokens' or 'the context engineering is the product'. Also: 'we leaked stale context into the next turn' (a context-engineering bug).

Heard from

"Context engineering is the delicate art and science of filling the context window with just the right information for the next step."
— Andrej Karpathy , X (Twitter) (2025)
Prompting & Agents

Vibe coding

a.k.a. Vibes-driven development · Agentic coding

Building software by describing intent to an LLM coding agent and accepting most of the output without reading the code closely. A practice, not a methodology.

Vibe coding

Why it matters

Vibe coding is the user-facing consequence of long context plus tool calling plus reasoning models. The practitioner sets a goal, the agent edits files, runs tests, and reports back. The human reviews at a higher level of abstraction (intent, behavior) rather than the line level (syntax, structure). The term carries both an aspiration and a warning: it's how you ship faster, and also how you ship code that no human on the team actually understands.

You'll hear it when

Someone says 'this whole feature was vibe-coded' (sometimes proud, sometimes apologetic). Or 'the vibe-coded version passed tests but the architecture is a mess'. Karpathy coined the phrase in early 2025; by mid-year it had become both a job and a pejorative.

Heard from

"There's a new kind of coding I call 'vibe coding,' where you fully give in to the vibes, embrace exponentials, and forget that the code even exists."
— Andrej Karpathy , X (Twitter) (2025)
Eval & Capability deeper cut

Perplexity

a.k.a. PPL · Bits per byte (BPB)

How surprised the model is by held-out text. The training-time scoreboard, exponential of the cross-entropy loss.

Perplexity

Why it matters

Perplexity is the original LLM metric. Lower is better. It measures how well the model models the joint distribution of text on a held-out set, and it's tightly coupled to the loss that pre-training optimizes. Practitioners use it for two things: tracking training progress and comparing models on the same evaluation corpus. It does not measure task usefulness, instruction-following, or capability gain past a certain point. Beating GPT-2 on perplexity does not mean beating it on real tasks.

You'll hear it when

Someone says 'perplexity flattened around 4.5 after the data refresh' or 'we measure BPB instead of PPL because tokenizer differences ruin PPL comparisons'.

Heard from

"The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin."
— Rich Sutton , The Bitter Lesson (2019)
Eval & Capability

Pass@k

a.k.a. pass@1 / pass@10 · SWE-bench / HumanEval scoring

Sample K candidate solutions from the model; the model 'passes' if any one of them is correct. Standard metric for code generation evals.

Pass@k

Why it matters

Most coding benchmarks (HumanEval, MBPP, SWE-bench Verified) score with pass@k rather than single-shot accuracy. The reason is that code is verifiable: you can run the test suite. Sampling K candidates lets you measure not just whether the model can produce a correct answer, but how often it does. pass@1 measures greedy-decode correctness; pass@10 measures whether the model 'knows' the answer even if its top sample misses. The gap between pass@1 and pass@10 tells you how much improvement is possible from better sampling alone.

You'll hear it when

Someone says 'it's 65 on pass@1 but 82 on pass@10' or 'best-of-N at N=8 nearly closes the gap to GPT-4'. Also: 'pass^k', the harder variant where ALL K samples must be correct.

Eval & Capability

LLM-as-judge

a.k.a. Model-graded eval · AutoEval · Judge model

Use a strong LLM to grade outputs from another model. The default for tasks where there's no automatic verifier.

LLM-as-judge

Why it matters

Most useful tasks have no test suite. You can't unit-test 'is this summary good' or 'is this answer helpful'. LLM-as-judge fills the gap: you give a strong model the question, the candidate answer, and a rubric, and it returns a score. It's cheap, fast, and scales to thousands of samples per minute. The catch is that the judge model has biases (it prefers verbose answers, it prefers answers structured like its own, it gives higher scores to answers from its own family), and those biases can swamp the signal.

You'll hear it when

Someone says 'we use Opus as the judge' or 'the LLM-judge gave +5% to the longer answers'. Also: 'pairwise judge' (compare two responses) vs 'pointwise judge' (score one response on a scale).

Heard from

"RLHF is just barely RL... RL is powerful. RLHF is not."
— Andrej Karpathy , X (Twitter) — applies equally to LLM-as-judge reward proxies (2024)
Eval & Capability

Hallucination

a.k.a. Confabulation · Fabrication · Ungrounded generation

The model produces fluent text that is false. Not a bug in the usual sense, the natural consequence of training a model to produce plausible continuations.

Hallucination

Why it matters

Hallucinations are the failure mode that ships closest to production. The model isn't lying; it's doing what it was trained to do, which is produce a plausible continuation given the context. When there's no grounding source, plausibility and truth diverge. Mitigations exist (retrieval, citation requirements, confidence thresholds, reasoning traces), but no current model has 'eliminated' hallucination in any meaningful sense. The practical conversation is about reducing rate and surfacing uncertainty, not preventing.

You'll hear it when

Someone says 'it hallucinated the API endpoint' or 'we added retrieval to cut the hallucination rate'. Also: 'confabulation' (preferred in some research circles because 'hallucination' anthropomorphizes the failure).

Heard from

"In some sense, hallucination is all LLMs do. They are dream machines. We direct their dreams with prompts."
— Andrej Karpathy , X (Twitter) (2023)
Eval & Capability deeper cut

Saturation

a.k.a. Benchmark saturation · Ceiling effect

A benchmark stops discriminating between models because frontier systems all score near the top. When 'MMLU is saturated', the test has stopped being useful.

Saturation

Why it matters

Every benchmark has a useful life. When it was created, it discriminated meaningfully between models. Once the best models all score 90+ and a 5-point gap is within noise, the benchmark has saturated and the field needs harder tests. MMLU, HumanEval, GSM8K, and parts of SWE-bench Verified are all saturated or near it as of 2026. Understanding which benchmarks still discriminate and which are decorative is half of reading capability claims.

You'll hear it when

Someone says 'MMLU saturated two years ago' or 'we don't compare on HumanEval anymore'. Also: 'GPQA Diamond is the new floor' (a harder benchmark replacing a saturated one).

Heard from

"Pre-training as we know it will unquestionably end. We have but one internet."
— Ilya Sutskever , NeurIPS 2024 Test-of-Time keynote (2024)
Eval & Capability

Jagged intelligence

a.k.a. Jagged frontier · Uneven capability profile

The observation that LLMs are superhuman on some tasks while failing comically on neighboring trivial ones. Capability is not smooth across difficulty.

Jagged intelligence

Why it matters

Practitioners coined this to capture a behavior that contradicts the 'AGI is near' framing on one side and the 'LLMs are dumb' framing on the other. A model can solve a competition math problem and miscount letters in a word in the same conversation. Treating model capability as a single ordered axis ('smarter than') misses this; the right mental model is a high-dimensional ability surface with sharp peaks and dips. Naming the pattern lets you stop being surprised by it.

You'll hear it when

Someone says 'classic jagged intelligence' when a model nails a 200-line refactor then fails to count parentheses. Or 'we hit the jagged frontier' when a benchmark transfers poorly to a task that looks adjacent.

Heard from

"Jagged Intelligence — the strange, unintuitive fact that state of the art LLMs can both perform extremely impressive tasks (e.g. solve complex math problems) while simultaneously struggle with some very dumb problems."
— Andrej Karpathy , X (Twitter) (2024)
Eval & Capability deeper cut

World model

a.k.a. World simulator · Predictive model of the environment

An internal representation of how the environment evolves over time. The thing LeCun argues LLMs don't have and need, and Hassabis argues video models are starting to build.

World model

Why it matters

The 'world model' debate is the central capability argument of 2024-2026. Yann LeCun has spent years claiming auto-regressive LLMs are structurally incapable of planning, embodiment, and causal reasoning because they lack a model of the world distinct from a model of text. Demis Hassabis points to Genie and Veo-style video models as nascent world models. The argument shapes every roadmap conversation about whether scaling LLMs is enough or whether something architecturally different is needed.

You'll hear it when

Someone says 'LLMs don't have a world model, they have a model of text' (the LeCun line) or 'video models are world models with extra steps' (the counter). Also: 'JEPA' (LeCun's preferred non-LLM architecture) shows up in this conversation.

Heard from

"Auto-Regressive LLMs are doomed... they cannot be made factual, non-toxic, etc. They are not controllable."
— Yann LeCun , NYU CDS seminar (2025)

How to use this: Browse mode is the glossary view; Flashcards mode shows one term at a time for active recall. Click any card to flip it. In flashcard mode, use / to move and Space to flip.

Progress is stored locally in your browser; it doesn't leave the page.