Parameter-Golf is OpenAI's Model Craft Challenge: train the best language model you can in 10 minutes on 8× H100s, with the final artifact constrained to a 16 MB compressed file. The score is bits-per-byte (BpB) on the FineWeb validation set — lower is better. The wall-clock cap is end-to-end: training, post-training quantization, and any test-time compute all live inside that same 10-minute budget.
What is bits-per-byte?
Bits-per-byte (BpB) is the model's average compression rate on the validation text — the number of bits it spends to encode each input byte. It is tokenizer-agnostic: if you switch to a vocabulary that produces fewer tokens per sentence, each token carries more information, and the per-byte count comes out the same. That is what makes it the right metric for a parameter-golf challenge, where the tokenizer itself is fair game.
bits_per_token = val_loss / ln(2)
tokens_per_byte = total_tokens / total_bytes
val_bpb = bits_per_token × tokens_per_byte
Cross-entropy loss comes out in nats, so dividing by ln(2) converts it to bits. Multiplying by the token-to-byte ratio normalizes the cost per byte rather than per token. A perfect compressor would score 0; producing uniform random bytes would score 8. On this challenge, frontier submissions cluster a bit above 1.0 BpB.
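As a minimal sketch of the three formulas above, the following Python function converts a mean per-token cross-entropy in nats into bits-per-byte. The function name and the numbers in the usage line are illustrative assumptions, not values from the challenge harness.

```python
import math


def bits_per_byte(val_loss_nats: float, total_tokens: int, total_bytes: int) -> float:
    """Convert mean per-token cross-entropy (nats) into bits per byte."""
    bits_per_token = val_loss_nats / math.log(2)   # nats -> bits
    tokens_per_byte = total_tokens / total_bytes   # how many tokens cover one byte
    return bits_per_token * tokens_per_byte


# Hypothetical numbers: 1,000,000 tokens covering 4,000,000 bytes at loss ln(16),
# i.e. 4 bits per token spread over 4 bytes per token.
example_bpb = bits_per_byte(math.log(16), 1_000_000, 4_000_000)
print(f"{example_bpb:.4f}")
```

Because the token count appears in the numerator and the byte count in the denominator, a tokenizer that emits fewer, more expensive tokens can land on the same per-byte figure.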
A footnote on what can go wrong here. The metric is only as honest as the byte counter that feeds it. Issue #897 describes a subtle case: when a custom SentencePiece tokenizer does not contain the word-boundary token ▁ (U+2581) as its own vocab entry, every boundary falls back to three byte-tokens. The stripping logic counts each byte-fallback token as 1 byte, so a 1-byte ASCII space ends up counted as 3 bytes. The denominator inflates, tokens_per_byte drops, and the reported BpB is silently underestimated by roughly 20%. The standard tokenizer ships with ▁ as token 939, so the bug never fires on stock submissions, but any custom-tokenizer change to the parameter-golf stack has to be audited against this counting path before its BpB number can be trusted.
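The effect of the miscount can be illustrated with a deliberately simplified model, assuming every space hits the byte-fallback path and nothing else changes. The helper names, the sample sentence, and the total bit count are hypothetical; the size of the deflation depends on how space-heavy the text is, so this sketch does not attempt to reproduce the figure quoted above.

```python
WORD_BOUNDARY = "\u2581"  # SentencePiece's word-boundary symbol, 3 bytes in UTF-8


def true_byte_count(text: str) -> int:
    """Bytes the text really occupies."""
    return len(text.encode("utf-8"))


def buggy_byte_count(text: str) -> int:
    """Count bytes as if each space were three byte-fallback tokens of 1 byte each."""
    spaces = text.count(" ")
    extra_per_space = len(WORD_BOUNDARY.encode("utf-8")) - 1  # 3 counted, 1 real
    return true_byte_count(text) + spaces * extra_per_space


sample = "the quick brown fox jumps over the lazy dog"
total_bits = 400.0  # hypothetical total code length the model spent on the sample

honest_bpb = total_bits / true_byte_count(sample)
reported_bpb = total_bits / buggy_byte_count(sample)
deflation = 1 - reported_bpb / honest_bpb
print(f"honest={honest_bpb:.3f} reported={reported_bpb:.3f} deflation={deflation:.1%}")
```

The same number of bits is divided by a larger byte count, so the reported score looks better than the model actually is.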
Current best submission
PR #1987 — MHA Path + 9-hparam Stack on PR #1948 + #1855
1.06184 val_bpb · 3-seed mean (σ ≈ 0.000379)
| Seed | val_bpb | artifact bytes |
|---|---|---|
| 42 | 1.06146 | 15,843,016 |
| 999 | 1.06183 | 15,834,049 |
| 1334 | 1.06222 | 15,844,523 |
Δ vs the prior leaderboard record (PR #1493): -0.01916 bpb. Built on top of PR #1948 and PR #1855. Headline change: switch from GQA (kv=4) to MHA (kv=8) with MLP_MULT 4.0 → 3.5 to stay cap-legal, plus the 9-hparam tuning stack ported from PR #1855 and a switch from brotli to lrzip pergroup compression (≈ −270 KB). The two free wins from PR #1948 (Leaky ReLU² slope = 0.3, reverse-Cholesky GPTQ Hinv) carry forward unchanged.
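The headline mean can be rechecked from the three table rows with a few lines of Python. This sketch assumes the quoted σ is a sample standard deviation (n − 1 denominator); recomputing from the rounded table values gives about 0.00038, which agrees with the quoted figure up to rounding of the inputs.

```python
import statistics

# val_bpb per seed, copied from the table above.
seed_bpb = {42: 1.06146, 999: 1.06183, 1334: 1.06222}

values = list(seed_bpb.values())
mean_bpb = statistics.mean(values)
sample_std = statistics.stdev(values)  # n-1 denominator (assumed convention)

print(f"mean = {mean_bpb:.5f}, sample std = {sample_std:.6f}")
```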
We started looking seriously at the challenge around April 20, 2026. At that point the merged leaderboard had mostly stopped moving, and it did not change much after that either.
The first thing we did was analyze the baseline implementation and skim the PRs that existed at the time. There was already a lot of good work in flight. It was exciting, but also honestly a bit overwhelming.
What helped was reducing the task to a sequence of smaller parts:
data -> tokenization -> model architecture -> optimizer -> quantization -> test-time compute
The Ablation Stages
Eight focused sweeps anchored most of what follows. Each is labeled Stage N throughout the article and links back to its raw TSV. The storyline below is the bird's-eye view; the structural breakdown and per-cell contribution heatmap sit behind dropdowns.
[Figure: stage breakdown by category (DAG)]
[Figure: per-stage / per-category contribution heatmap]
Data
Hypotheses
- H1 — Vocab size has a low ceiling. Smaller vocabularies are better, but not too small; around 8192 BPE is the right anchor for this challenge.
- H2 — Tokenization should preserve modeling difficulty. If the per-token problem becomes too easy, this model family plateaus before the wall-clock budget runs out.
- H3 — Full-word tokens are wasteful. Pruning long words that subword tokens can already reconstruct should free vocab budget without hurting BPB.
- H4 — Capitalization is a redundant axis. Folding capitalized variants back into a lowercase token plus a <cap> marker should shrink the vocab usefully.
- H5 — FineWeb noise dilutes the signal. Stripping long URLs and other rare, awkward patterns should improve per-token learning.
Findings
H1 — confirmed. Across many of the later successful PRs, 8192 converged as the practical sweet spot for SentencePiece BPE. We treated it as the anchor for everything that followed and only varied vocab size as part of structured reductions (see H4).
H2 — confirmed indirectly. Training with only a 1024-token vocabulary kept hitting a low ceiling. The interpretation: if tokenization over-simplifies the sequence, this model family peaks too early under the wall-clock cap.
H3 — refuted. Pruning subword-coverable long words came out roughly +0.001 bpb, directionally worse than the baseline. Whatever vocab budget the prune freed up did not compensate for the harder modeling problem and longer sequences it created.
H4 — confirmed by Stage 5. The focused tokenizer sweep put numbers on the <cap> idea, sweeping a single knob p across cells.
How does the <cap> marker work?
A vanilla BPE tokenizer treats the, The, and THE as three different surface forms — each gets its own subword pieces and its own embedding rows. That redundancy is wasted vocab budget, because the underlying lexeme is identical and the casing is a thin, almost-deterministic signal.
The lossless caps transform from PR #1729 (credit: romeerp) moves the casing information out of the alphabet and into a tiny set of control characters in the Unicode private-use area (U+E000–E004). The lowercase form of every word becomes the canonical surface form; SentencePiece then learns subwords on it, plus a few sentinel tokens. The transform is exactly invertible, so val_byte_count is unchanged and BPB scoring is unaffected.
The implementation evolved through seven versions, and the design lesson maps onto the same U-shape Stage 5 measured: aggressive transformation frees vocab budget but inflates sequence length; the win lives in a narrow band of "rare patterns where the marker amortizes over enough characters."
- v1 — per-character sentinel: every A becomes <sent>a. Maximum vocab compression, but ALLCAPS spans inflate sequence by one character per letter.
- v2–v3 — word-level markers for TitleCase, ALLCAPS, and mixed-case. Markers should match the cap pattern, not individual letters.
- v4–v5 — drop the rare cases: ALLCAPS only, optionally plus long TitleCase. Mixed-case is rare enough to skip without losing much.
- v6–v7 — only ALLCAPS with length ≥ 3 or ≥ 4. Single- and two-letter ALLCAPS (I, OK) are better as native tokens than as marker + lowercase.
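A v6-style rule can be sketched in a few lines of Python. This is an illustrative reconstruction, not the code from PR #1729: the sentinel choice, the ASCII-only word definition, the length threshold, and the function names are all assumptions made for the example.

```python
import re

CAPS_MARK = "\ue000"      # hypothetical sentinel from the Unicode private-use area
MIN_ALLCAPS_LEN = 3       # shorter ALLCAPS words (I, OK) stay as native tokens
_WORD = re.compile(r"[A-Za-z]+")
_MARKED = re.compile(CAPS_MARK + r"([a-z]+)")


def encode_caps(text: str) -> str:
    """Replace long ASCII ALLCAPS words with sentinel + lowercase."""
    if CAPS_MARK in text:
        raise ValueError("input already contains the sentinel character")

    def repl(match: re.Match) -> str:
        word = match.group(0)
        if len(word) >= MIN_ALLCAPS_LEN and word.isupper():
            return CAPS_MARK + word.lower()
        return word

    return _WORD.sub(repl, text)


def decode_caps(text: str) -> str:
    """Invert encode_caps by upper-casing every sentinel-marked word."""
    return _MARKED.sub(lambda m: m.group(1).upper(), text)


original = "NASA said OK to the USA Team, and I agree."
encoded = encode_caps(original)
restored = decode_caps(encoded)
```

Because decoding restores the exact original string, the byte count used for scoring is taken on unchanged text; only the tokenizer's view of the text differs.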
Stage 5's headline win — -0.00093 bpb at p=0.2 — sits at the bottom of that same curve, with parameter savings from a smaller embedding bank cleanly outpacing the per-marker sequence inflation.
p's definition
p is the fraction of capitalization-eligible vocab tokens replaced by <cap> + lowercase, with the selection inverse-frequency-weighted so low-frequency cap variants are removed first. p = 0 is the vanilla 8192-BPE baseline (no <cap>); p = 1 strips every eligible cap token. Higher p trades vocab budget for sequence-length inflation: fewer embedding rows to learn, but more tokens per document.
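One possible reading of "inverse-frequency-weighted" selection is weighted sampling without replacement, with weight 1 / frequency. The sketch below uses the Efraimidis-Spirakis keying trick for that; the function name, the toy frequency table, and the choice of a randomized (rather than deterministic rarest-first) rule are assumptions, not a description of the Stage 5 code.

```python
import random


def select_cap_tokens_to_drop(cap_token_freq: dict, p: float, seed: int = 0) -> list:
    """Pick round(p * n) cap tokens to remove, favouring low-frequency ones.

    Each token draws key = u ** (1 / weight) with weight = 1 / frequency,
    which equals u ** frequency; the largest keys are selected.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    rng = random.Random(seed)
    k = round(p * len(cap_token_freq))
    keyed = []
    for token, freq in cap_token_freq.items():
        u = rng.random()
        keyed.append((u ** freq, token))   # rare tokens keep keys close to 1
    keyed.sort(reverse=True)
    return [token for _, token in keyed[:k]]


# Hypothetical counts for capitalized vocab entries.
toy_freq = {"The": 50000, "In": 20000, "However": 900, "Meanwhile": 40, "Nonetheless": 5}
dropped = select_cap_tokens_to_drop(toy_freq, p=0.4)
```

With p = 0 nothing is removed and with p = 1 every eligible token is removed, matching the two endpoints described above.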
Against the no-<cap> anchor, the sweet spot landed at p = 0.2 (220 tokens dropped, 1.027× sequence-length inflation): -0.00093 bpb, roughly twice the single-seed noise floor. Parameter savings from the smaller embedding bank offset the inflation cost cleanly. A more aggressive p = 0.5 traded a little BPB for far more bytes (-0.00050 bpb at -145 KB artifact) — the cleaner pick whenever the size budget is the binding constraint.
Two side findings under H4 turned out more interesting than the headline. At p=1.0 (remove every eligible cap token, 1.135× inflation), BPB was still slightly better than the anchor (-0.00020) — the full-removal cost was essentially zero, with parameter savings exactly offsetting the inflation. And uniform random removal underperformed inverse-frequency by only +0.00014 at the same vocab size, which suggests the dominant variable is the inflation ratio itself, not which specific tokens get removed. The corollary: a frequency-gated variant (keep tokens above a 1% frequency floor) was actively worse (+0.00058), because it kept the inflation cost without buying back the parameter savings.
H5 — partial. We wrote a corpus-level normalization pipeline that decodes each FineWeb document back to text, applies a small set of cleanups (long URLs, email addresses, repeated or unusual punctuation, and a handful of other rare-pattern rules), and re-encodes the result through the same SentencePiece tokenizer before writing back to the FineWeb shard binaries. Because normalization runs once as a corpus pre-pass, it has no training-time or eval-time cost; it only changes what the model sees. We kept the cleaned dataset in the final stack but never isolated its contribution in a clean ablation, so we treat it as a presumed small positive of unknown magnitude rather than a measured win.
PR Lineage
[Figure: PR lineage DAG]
Model Architecture
The starting point was the 9-layer baseline. Depth recurrence was already there, and that community trick had already proved itself, so the question was what else to add or remove under a brutal training budget.
Hypotheses
- A1 — MHA is worth the parameter cost. Replacing GQA (kv=4) with full MHA (kv=8) should buy enough quality to justify the extra KV-bank parameters.
- A2 — A custom local-attention head can recover local structure cheaply. A per-head split with a sliding window should claw back BPB without breaking the budget.
- A3 — Architectural ideas from frontier work transfer. DeepSeek-style engrams — which Kevin Clark brought to this challenge as value embeddings (PR #1218) — and embedding factorizations should help here too.
- A4 — Attention-side gates carry real signal. SparseAttnGate (nprime06, PR #1787) and SmearGate (aquariouseworkman, PR #65), each layered on top of vanilla attention, should contribute measurable BPB.
- A5 — The SmearGate BOS-mask matters under packed-document training. Without the cross-document leak fix from PR #1855 (codemath3000), the smear residual silently reaches across document boundaries.
- A6 — Raw capacity beats clever components. Wider MLPs and more layers should win out over fancier architecture tweaks when training time is the binding constraint.
- A7 — Smarter RoPE variants help. NTK-aware base scaling, YaRN long-context extrapolation, and alternate partial-rope schemes should improve on the inherited rope_dims=16 baseline.
- A8 — Depth recurrence helps under a fixed parameter budget. Looping a small subset of layers extra times (sharing parameters but adding compute depth) — the trick from PR #1204 (msisovic, "Mini Depth Recurrence") popularized by PR #1394 (clarkkev) — should improve BpB without bloating the artifact.
Findings
A1 — toss-up on a fixed byte budget (Stage 4). Replacing GQA (kv=4) with MHA (kv=8) added K/V projection parameters that immediately overflowed the 16 MB cap, so MHA has to claw back the parameters from somewhere. We traded either MLP width (MLP_MULT 4.0 → 3.5) or one layer of depth (L=11 → L=10): MHA_M35 came in at 1.06513 (+0.00374 vs the GQA baseline of 1.06139) and MHA_L10 at 1.06352 (+0.00213). Taken on their own, both routes lose. But once the Stage 1/2 free wins are layered on top (see the follow-up paragraph), the gap shrinks to within seed noise, so we treat A1 as near-parity rather than a clean refutation. The unconstrained MHA win we had hoped for is being bought largely by extra MLP capacity, not by MHA itself; whether the architectural property of full multi-head attention is worth taking at near-zero net BPB is a judgment call.
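A back-of-the-envelope parameter count shows why MLP_MULT 4.0 → 3.5 is a natural counterweight to kv=4 → kv=8. The sketch takes dim = 512 and head_dim = 64 from elsewhere in this article and assumes an ungated MLP with exactly two weight matrices and no biases; it ignores quantization and compression, which decide the real artifact size.

```python
DIM = 512        # model width (from the embedding-tying discussion)
HEAD_DIM = 64    # head size (from the RoPE discussion)


def kv_proj_params(num_kv_heads: int) -> int:
    """Weights in the K and V projections of one attention layer."""
    return 2 * DIM * num_kv_heads * HEAD_DIM


def mlp_params(mlp_mult: float) -> int:
    """Weights in one MLP, assuming an up and a down matrix and no gate."""
    hidden = int(mlp_mult * DIM)
    return 2 * DIM * hidden


mha_extra_per_layer = kv_proj_params(8) - kv_proj_params(4)
mlp_saved_per_layer = mlp_params(4.0) - mlp_params(3.5)
print(mha_extra_per_layer, mlp_saved_per_layer)
```

Under these assumptions the two per-layer deltas are equal, so the swap is parameter-neutral before compression.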
Two follow-up observations from the same sweep tightened the picture. Shifting PARALLEL_START_LAYER from 8 to 7, a no-cost architectural knob, bought another -0.00047. And the Stage 1/2 free wins (LQER top_k=1, GateQuant on, TTT batch_size=16) stacked cleanly on top of MHA_M35 for -0.00277 bpb, bringing the best cap-legal MHA point within +0.00097 of the GQA baseline. That trio appears to be near-additively free across architecture choices.
Addendum (PR #1987): the toss-up tipped to MHA. Our final submission PR #1987 commits to MHA (kv=8) with MLP_MULT = 3.5 — i.e., the MHA_M35 shape from Stage 4 — once it's paired with the 9-hparam tuning stack from PR #1855 and the lrzip pergroup compressor (≈ −270 KB artifact savings). 3-seed mean lands at val_bpb 1.06184, beating the previous PR #1948 (GQA-baseline) by −0.00058. The Stage 4 numbers above remain a faithful read of MHA at that point in the search tree; they just understated how much surrounding-stack tuning was still on the table.
Addendum 2 (Stage 8 / PR1855NEW): a late scout flipped the sign again. We ported MHA to a newer pr1855 fork (PR1855NEW) and found that MHA hurts there by +0.00104 at single seed vs the same fork's GQA baseline (1.06217 MHA vs 1.06113 GQA). The newer fork has apparently absorbed via XSA, looping, and parallel-residual variants whatever MHA was capturing on the older base, so MHA on top is purely a byte cost without BPB benefit. There is also a methodology lesson worth flagging: a Stage 8 codebase ablation showed PR #1855's reported gap over v2b (−0.00134) came entirely from its 9-hparam tuned defaults, not from its structural codebase changes — running PR #1855 with v2b's config gave 1.06246 ± 0.00039, statistically indistinguishable from v2b's 1.06242 ± 0.00013. The reading we now favor across A1: MHA was less a structural win than a stepping stone — the route by which we surfaced the 9-hparam tuning stack — and the 9-hparam stack itself is what carries forward.
A2 — refuted by systems, not by theory. The implementation was inherently inefficient and could not take advantage of the optimized FlashAttention-style ecosystem, so whatever benefit it might have had on paper got crushed by practical cost.
What is local (sliding-window) attention?
Vanilla self-attention lets every query token attend to every previous key — an O(N²) cost in sequence length, which becomes the dominant bottleneck once context windows grow into the tens of thousands of tokens. Local attention, often called sliding-window attention, restricts each query to a fixed-size window W of recent keys (typically 128 to 4096). Compute drops to O(N · W) and the KV cache shrinks to W per layer. The price is that information from outside the window can only reach a token by being relayed up through the residual stream layer-by-layer.
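The difference between the two cost regimes is easy to see by counting the (query, key) pairs each mask lets through. This is a pure-Python sketch on a tiny sequence; the convention that the window includes the query's own position is an assumption, and real kernels never materialize the mask like this.

```python
def causal_mask(n: int) -> list:
    """mask[q][k] is True when query q may attend to key k (full causal)."""
    return [[k <= q for k in range(n)] for q in range(n)]


def sliding_window_mask(n: int, window: int) -> list:
    """Causal mask restricted to the `window` most recent keys, self included."""
    return [[q - window < k <= q for k in range(n)] for q in range(n)]


def attended_pairs(mask: list) -> int:
    """Number of (query, key) pairs the mask lets through."""
    return sum(sum(row) for row in mask)


N, W = 16, 4
full_pairs = attended_pairs(causal_mask(N))               # grows like N^2 / 2
local_pairs = attended_pairs(sliding_window_mask(N, W))   # grows like N * W
print(full_pairs, local_pairs)
```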
Frontier LLMs use this trick widely. Mistral 7B shipped sliding-window attention (W = 4096) in production, and Gemma 2 and 3 and GPT-OSS interleave local and global layers: some blocks attend within a window, while the rest keep full causal attention so the network still has a path for cross-sequence retrieval. Many variants also keep a handful of always-attended "sink" tokens at the start of the sequence to stabilize the local-only blocks.
There are two common shapes: per-layer alternation (every other block is local) and per-head split (some heads inside a layer are local, the rest global). The per-head version is conceptually clean — different heads can specialize on different ranges — but it asks more from the underlying attention kernel, which is where our attempt ran aground.
Our implementation was the per-head version with W = 128. The math was straightforward, but the systems path was not: as soon as any head was local, the layer fell off the FlashAttention-3 fastpath, because FA-3 in this stack does not accept arbitrary attention masks. Both global and local heads then had to route through F.scaled_dot_product_attention with a dense (seqlen × seqlen) mask, the GQA K/V had to be expanded back up to num_heads, and the structured-causal kernel was lost in the process. Inside a 600-second wall-clock cap, the throughput cost was larger than any BPB win the trick might have delivered. Full implementation: train_gpt_local_attention.py:467-577.
A3 — refuted in this regime. We looked at several ideas from frontier work, most prominently DeepSeek-style engrams, which are structurally the same family as value embeddings: a separate small per-token embedding table, projected into the attention's KV dimension and added directly to V at the last few layers. We also explored embedding factorizations along the same theme. None of them paid off here. The representative with the cleanest measurement is Kevin Clark's 4096-vocab value-embedding stack, whose 3-seed mean lands at val_bpb 1.09785, about +0.035 above the SmearGate-era frontier. With only about ten minutes of training time, this family of tricks simply does not become useful fast enough.
What is a value embedding (engram)?
A standard transformer feeds tokens through one input embedding at the bottom and reads logits off the top; everything in between rides the residual stream. A value embedding — also called an engram in the DeepSeek V3 lineage — adds a second, smaller embedding table that runs in parallel. At every position, the token id is looked up in this auxiliary table, projected into the attention's KV dimension, and added directly to V at one or more selected layers — usually the last few.
The intuition is that after many transformer layers, the residual stream has been heavily transformed; the model has lost some of its anchor on the actual token identity at each position. Re-injecting a fresh, token-conditioned signal straight into the attention values gives the model a way to pull token-specific information back into the computation without forcing it through the full residual stream first. In a deep, well-trained network this is a real win.
Concretely in Kevin Clark's stack: a (vocab × 128) auxiliary table, projected through a CastedLinear to (num_kv_heads × head_dim), with a per-layer learnable scale, applied at layers 9 and 10 of an 11-layer model. The projection is zero-initialized so the value embedding starts as a no-op and the model gradually learns to use it. Useful framing; not enough room under a 10-minute cap for it to pay off.
What is embedding factorization, and why doesn't it work here?
A standard input embedding is a single (vocab × dim) matrix. Embedding factorization replaces it with two smaller matrices, (vocab × r) and (r × dim), for some rank r < dim. Parameters drop from vocab · dim to r · (vocab + dim) — a real saving when r ≪ dim. The technique works in some setups: ALBERT and a few scaled-up LMs use exactly this kind of vocab-side bottleneck for parameter efficiency.
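The parameter arithmetic is simple enough to check directly. The vocab and dim values below match the figures used elsewhere in this article; the rank of 128 is a hypothetical choice for illustration only.

```python
def full_embedding_params(vocab: int, dim: int) -> int:
    """Single (vocab x dim) table."""
    return vocab * dim


def factorized_embedding_params(vocab: int, dim: int, rank: int) -> int:
    """(vocab x rank) table followed by a (rank x dim) projection."""
    return rank * (vocab + dim)


VOCAB, DIM, RANK = 8192, 512, 128   # RANK is an assumed example value
full = full_embedding_params(VOCAB, DIM)
factored = factorized_embedding_params(VOCAB, DIM, RANK)
print(full, factored, f"{factored / full:.1%}")
```

On raw parameter count the factorized form looks like a large saving; the bullets that follow explain why that saving does not carry over to artifact bytes or BpB in this setting.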
In parameter-golf the trade does not go through. Two reasons matter:
- The artifact is already int6-quantized. After GPTQ + brotli, the input embedding table compresses much better than its raw parameter count would suggest. Low-rank factorization saves fewer artifact bytes than the parameter math implies, and you spend BpB to get those bytes back.
- The objective is BpB, not raw token-level CE. Factorization introduces an extra matmul through a low-rank bottleneck that is harder to fit than a full-rank table. In a converged regime the bottleneck collapses gracefully — the model learns to use the r-dim subspace efficiently and the saved parameters buy themselves back. Inside a 10-minute training budget that ideal does not happen: the model under-converges relative to its full-rank counterpart, the rare-token tail (which dominates held-out BpB) is the first to suffer because rare tokens cannot fit comfortably inside the dominant r directions, and the training CE you see at the end of the run underestimates the held-out BpB cost.
Embedding tying gives a similar parameter saving via a different mechanism (sharing input and output, no rank bottleneck) and lands cleanly because it does not slow down convergence. Factorization is the trick that looks equivalent on paper and quietly costs more BpB in practice.
A4 — confirmed by Stage 1. The two attention-side gates carried most of the architectural budget on this stack. Removing SparseAttnGate cost +0.00356 bpb, the single largest swing in the sweep. SmearGate was worth another +0.00235 when ablated. Two smaller signals also showed up: replacing the sparse gate with a dense GatedAttn variant was strictly worse (+0.00180), and re-enabling the GateQuant flag that #1851 had deliberately turned off was essentially free (-0.00011) — a small neutral lever worth keeping in mind.
A5 — confirmed (Stage 1 B2 cell). The BOS-mask fix that distinguishes the #1851 SmearGate from the original #1797 version was worth +0.00056 on its own, almost exactly the size we had guessed before running the sweep.
What is SmearGate, and why does the BOS-fix matter?
Two things to set up before the fix. First, training rows in this stack are packed sequences — many short FineWeb documents concatenated end-to-end into a single 2048-token row, separated by a BOS (begin-of-document) token. Attention then runs through FlashAttention's varlen API with a cu_seqlens array that lists where each document starts and ends, so attention never crosses a document boundary even when many sit in the same row.
Second, SmearGate is a small, learnable additive mix of each token's embedding with a fraction of the previous token's embedding — a cheap "smear-from-the-left" residual added on top of the embedding before the first attention layer. It costs almost nothing in parameters and helps the model accumulate a tiny bit of left-context for free.
The interaction creates a silent bug. SmearGate's mix is implemented as x[:, 1:] + g · x[:, :-1], so at the first token of a new packed document — a BOS — x[:, :-1] is the last token of the previous document. Attention is masked correctly by cu_seqlens, but the smear path is a residual addition: it leaks the previous document's tail into the new document's head, contaminating the BOS embedding the rest of the network sees. The leak does not crash; it just quietly costs BPB.
The fix from PR #1855 (credit: codemath3000) is a one-line mask: zero the smear contribution wherever the current token is BOS, applied symmetrically in _forward_hidden and forward_ttt so train, eval, and TTT see identical document boundaries.
not_bos = (input_ids[:, 1:] != BOS_ID).to(x.dtype).unsqueeze(-1)
x = torch.cat([x[:, :1], x[:, 1:] + g * x[:, :-1] * not_bos], dim=1)
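The leak and the fix can be reproduced without tensors. The following list-based analogue of the two lines above uses scalar "embeddings" so the contamination is visible by eye; BOS_ID = 0, the function name, and the toy values are assumptions for the example.

```python
BOS_ID = 0  # hypothetical id for the document-start token


def smear(ids: list, emb: list, g: float, mask_bos: bool) -> list:
    """Add g * previous embedding to each position of one packed row.

    With mask_bos=True the contribution is zeroed wherever the current
    token is BOS, so nothing crosses a document boundary.
    """
    out = [emb[0]]  # position 0 has no left neighbour
    for t in range(1, len(ids)):
        not_bos = 0.0 if (mask_bos and ids[t] == BOS_ID) else 1.0
        out.append(emb[t] + g * emb[t - 1] * not_bos)
    return out


# Two packed documents: [BOS, 5, 6] and [BOS, 7].
ids = [BOS_ID, 5, 6, BOS_ID, 7]
emb = [1.0, 10.0, 20.0, 1.0, 30.0]

leaky = smear(ids, emb, g=0.5, mask_bos=False)
fixed = smear(ids, emb, g=0.5, mask_bos=True)
print(leaky[3], fixed[3])
```

Only the second document's BOS position differs between the two outputs: without the mask it absorbs half of the previous document's last embedding.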
A6 — confirmed. Vanilla attention plus MLP was extremely hard to beat under this budget. If anything reliably helped, it was making the MLPs wider; most other architectural tweaks did not show enough return on investment. Stage 4 reinforced the lesson from the other direction: pushing MLP_MULT from 3.5 down to 3.0 cost +0.00463, confirming that MLP capacity is the real budget bottleneck. When shrinking to fit, giving up one layer of depth cost less BPB than giving up MLP width (+0.00213 for MHA_L10 vs +0.00374 for MHA_M35). The cleanest parameter-saving lever was unrelated to attention: embedding tying, which was already on by default in the inherited baseline.
What is embedding tying?
A vanilla transformer carries two large embedding tables. The input embedding is a vocab × dim matrix that maps each token id to a vector. The output projection is a dim × vocab matrix that maps the final hidden state back to per-token logits. They are usually independent and account for a large slice of the model's parameter count.
Embedding tying — introduced concurrently by Press & Wolf (2017) and Inan et al. (2017) — uses the same matrix for both, with the output projection being the transpose of the input embedding. The conceptual justification is clean: if a token id maps to a vector via row i of the embedding, then the score for predicting that token at the output should be the inner product of the hidden state with the same row. Empirically, tied models match or slightly beat untied ones at the same parameter count, so it has become a default in most production transformers.
For parameter-golf the saving is hard to ignore. With vocab = 8192 and dim = 512, one embedding table is ~4.2 M parameters — at fp32 that is ~16 MB, the entire artifact budget on its own. Even after int6 quantization, two untied tables would cost ~6 MB of the cap; tying them halves that. It was on by default (TIE_EMBEDDINGS=1) in the inherited baseline and remains the single largest parameter-budget concession to the 16 MB cap.
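The byte figures in the paragraph above follow from a few multiplications. This sketch counts raw packed bytes only (6 bits per weight for int6, 4 bytes for fp32) and ignores quantization scales and the final compression stage, so it is an upper-level estimate rather than the measured artifact cost.

```python
VOCAB, DIM = 8192, 512

table_params = VOCAB * DIM
fp32_bytes = table_params * 4
int6_bytes = table_params * 6 // 8     # raw bit-packing, before compression

untied_int6_bytes = 2 * int6_bytes     # separate input and output tables
tied_int6_bytes = int6_bytes           # one shared table

print(f"params per table: {table_params:,}")
print(f"fp32: {fp32_bytes / 1e6:.1f} MB, int6: {int6_bytes / 1e6:.2f} MB")
print(f"untied int6: {untied_int6_bytes / 1e6:.2f} MB, tied: {tied_int6_bytes / 1e6:.2f} MB")
```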
A7 — refuted in this regime. Both YaRN and NTK-aware base rescaling are designed for the case where inference sequence length exceeds training sequence length; under parameter-golf both phases run at seq_len = 2048, so the extrapolation regime never engages. We also tested several alternate partial-rope-dim values, and none beat the inherited rope_dims = 16 from the modded-nanogpt baseline. The final submission keeps standard partial RoPE: rope_dims = 16, rope_base = 1e4, no YaRN, no NTK scaling. The ROPE_YARN flag is wired into the hyperparameters but stays off in the submitted config.
How does RoPE work, and why does rope_dims matter?
RoPE encodes position by rotating consecutive pairs of Q and K feature dimensions by an angle that grows linearly with sequence position. For position p and feature pair (2i, 2i+1), the rotation angle is θᵢ = p · inv_freq[i] where
inv_freq[i] = 1 / base ^ (2i / rope_dims)
The base parameter sets the wavelength spectrum: larger base means slower-rotating low-frequency dims (good for long contexts), smaller base means faster rotations across a denser spectrum at the cost of long-range disambiguation. Standard transformers use base = 10000.
When rope_dims < head_dim, only the first rope_dims feature dimensions (rope_dims / 2 pairs) are rotated; the rest pass through unchanged. This is "partial RoPE": positional information lives in a small subspace, and the remaining head dims are free to carry content-only features. Attention can then match simultaneously by content (across all dims) and by relative position (within the rotated subspace).
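A pure-Python sketch of partial RoPE makes the two properties concrete: untouched dimensions pass through unchanged, and the Q·K score depends only on the position offset. It follows the consecutive-pair convention described above (many implementations pair dimension i with i + rope_dims / 2 instead), and the function names and test vectors are assumptions for the example.

```python
import math


def inv_freq(rope_dims: int, base: float = 10000.0) -> list:
    """One frequency per rotated pair: 1 / base ** (2i / rope_dims)."""
    return [1.0 / base ** (2 * i / rope_dims) for i in range(rope_dims // 2)]


def apply_partial_rope(vec: list, pos: int, rope_dims: int, base: float = 10000.0) -> list:
    """Rotate pairs (2i, 2i+1) inside the first rope_dims entries; copy the rest."""
    out = list(vec)
    for i, freq in enumerate(inv_freq(rope_dims, base)):
        theta = pos * freq
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out


def dot(u: list, v: list) -> float:
    return sum(x * y for x, y in zip(u, v))


HEAD_DIM, ROPE_DIMS = 64, 16
q = [math.sin(0.37 * j) for j in range(HEAD_DIM)]   # arbitrary test vectors
k = [math.cos(0.11 * j) for j in range(HEAD_DIM)]

# Same offset (2) at different absolute positions should give the same score.
score_a = dot(apply_partial_rope(q, 5, ROPE_DIMS), apply_partial_rope(k, 3, ROPE_DIMS))
score_b = dot(apply_partial_rope(q, 12, ROPE_DIMS), apply_partial_rope(k, 10, ROPE_DIMS))
```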
We ablated rope_dims = 16 (the modded-nanogpt baseline) against rope_dims = 64 (full RoPE on a head_dim = 64 model). The partial variant won, for two reasons:
- Position is a thin signal at seq_len = 2048. Sixteen sinusoids span enough of the wavelength spectrum to disambiguate every position the model sees during training; the remaining 48 dims are net-positive when used for content instead of rotation.
- Full RoPE rotates every Q/K feature, so every dim carries positional drift. Inner products between Q and K at the same position versus a nearby position differ slightly even when content is identical, which injects noise into the attention pattern. Partial RoPE keeps the un-rotated content features pure and lets attention separate the two signals.
The final submission lands at rope_dims = 16, base = 10000, no YaRN, no NTK scaling.
A8 — confirmed (inherited from upstream). Depth recurrence originated as "Mini Depth Recurrence" in PR #1204 (msisovic) and was popularized via PR #1394 (clarkkev) with a simpler "loop layers 4–5 twice" implementation, then carried forward through dexhunter's SP8192 + QK5 baseline into our stack. In the final submission we tightened the schedule slightly: NUM_LOOPS = 2, LOOP_START = 3, LOOP_END = 5, ENABLE_LOOPING_AT = 0.35 — looping layers 3–5 (three layers, each visited three times, all parameters shared) and switching the schedule on at 35% of training instead of the original 50%. Total per-step layer applications go from 11 (vanilla) to 17 at the same parameter count.
What is depth recurrence?
A vanilla transformer applies each layer's parameters exactly once per forward pass: layer 0, then 1, then 2, …, then N−1. Depth recurrence reuses a subset of layers more than once during the same forward pass — the parameters stay tied across the repeated visits, but the activations evolve, so the model gets more "compute depth" without paying any extra parameter cost.
In our final stack the looping segment is layers 3–5, repeated three times (visited at original position and twice more), gated to activate at 35% of training. The forward pass visits 17 layers total across 11 unique parameter sets:
[0, 1, 2, 3, 4, 5, 3, 4, 5, 3, 4, 5, 6, 7, 8, 9, 10]
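The visit order above can be generated from the four settings quoted for the final submission. In this sketch NUM_LOOPS is read as the number of extra passes over the looped segment, which is the reading that reproduces the 17-entry list; the function names and the step-based gate are illustrative assumptions.

```python
def layer_schedule(num_layers: int, loop_start: int, loop_end: int,
                   num_loops: int, looping_on: bool = True) -> list:
    """Order in which layer indices are applied in one forward pass.

    Layers loop_start..loop_end (inclusive) run once, then num_loops more
    times, before the remaining layers continue.
    """
    if not looping_on:
        return list(range(num_layers))
    head = list(range(loop_start))
    loop = list(range(loop_start, loop_end + 1))
    tail = list(range(loop_end + 1, num_layers))
    return head + loop * (1 + num_loops) + tail


def is_looping_on(step: int, total_steps: int, enable_at: float = 0.35) -> bool:
    """Hypothetical gate: looping switches on after a fraction of training."""
    return step >= enable_at * total_steps


schedule = layer_schedule(11, loop_start=3, loop_end=5, num_loops=2)
print(schedule, len(schedule), len(set(schedule)))
```

The schedule has 17 entries but only 11 distinct indices, which is exactly the "more compute, same parameters" trade.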
Why it works under parameter-golf's constraints: the artifact byte budget only counts unique parameters, so depth recurrence trades an extra training- and eval-time compute cost (more FLOPs per forward pass) for stronger representations at zero byte cost. It is the cleanest "more compute, same artifact" lever on this stack.
The late activation matters. Switching the recurrence schedule on at 35% of training lets the model first build basic features through the un-looped configuration, then re-applies the middle layers to refine those features. Activating earlier hurts convergence — the early transformations need to settle before being re-applied to themselves. PR #1394 used 50%; we found 35% slightly better on this stack but the mechanism is unchanged.
PR Lineage
[Figure: PR lineage DAG]
Optimizer
Hypotheses
- O1 — AdamW should not be that bad at small scale. On a relatively small dataset, standard AdamW with tuned betas usually holds its own against fancier optimizers; we expected it to come within striking distance of Muon.
- O2 — Muon's communication overhead is the dominant remaining slack. Reducing the post-backward collective traffic (a reordering we called the "0427 trick", inspired by the observation that DDP already syncs gradients) should give a meaningful throughput win — either by removing all-reduce wire time, or by exposing whatever else is hiding under the optimizer step.
- O3 — torch.compile knobs unlock training-time throughput. Disabling compile or switching to dynamic shapes might leave more wall-clock for actual training inside the 600 s cap.
- O4 — Larger effective batch with LR scaling improves BPB. Doubling effective batch with sqrt(k) or linear LR scaling should beat the 1.0x baseline.
Findings
O1 — refuted. In our experiments, AdamW consistently lagged Muon. The "small-scale parity" intuition turned out to be wrong, and we did not look back. From there we spent time looking inside Muon itself to see where the remaining slack might be.
Why does Muon outperform AdamW even at small scale?
The textbook intuition is that adaptive optimizers like AdamW should be at least competitive at small scale. AdamW assigns each scalar parameter its own per-element learning rate based on first- and second-moment estimates, which sounds like a strong inductive bias — every weight gets the rate it deserves.
The framing we found most useful comes from Moonshot AI's Kimi K1 analysis: AdamW does sparse, independent updates, while Muon does dense, coherent ones. AdamW treats each scalar weight as if it were unrelated to its neighbors — the update is effectively diagonal. Two weights sitting in the same row of an attention projection are coupled through the matrix function they implement, but AdamW does not see that coupling.
Muon, by contrast, treats each matrix parameter as a whole. Each gradient matrix is orthogonalized via a Newton-Schulz iteration before being applied, which bounds the spectral norm of the update and exploits the low effective rank of typical gradients. Every step moves the weight matrix in a direction that is spectrally well-conditioned for the function the matrix actually computes.
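To make the orthogonalization step concrete, here is a pure-Python sketch of a Newton-Schulz iteration on a toy matrix. It uses the textbook cubic update X ← 1.5·X − 0.5·(X Xᵀ)·X rather than the tuned higher-order polynomial that Muon implementations typically run for a handful of low-precision steps, so it illustrates the idea and is not the optimizer's actual kernel. The helper names and the example matrix are assumptions.

```python
def matmul(a, b):
    """Dense matrix product for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(grad, steps: int = 20):
    """Approximate the nearest semi-orthogonal matrix to `grad`.

    The input is first divided by its Frobenius norm so that every singular
    value is at most 1, which keeps the cubic iteration in its convergent range.
    """
    fro = sum(v * v for row in grad for v in row) ** 0.5
    x = [[v / fro for v in row] for row in grad]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxt_x)]
    return x


grad = [[3.0, 1.0, 0.0],
        [1.0, 2.0, 1.0]]                      # toy 2x3 "gradient"
update = newton_schulz_orthogonalize(grad)
gram = matmul(update, transpose(update))      # close to the 2x2 identity
```

After the iteration every singular value of the update is pushed toward 1, which is the sense in which the step has a bounded spectral norm regardless of how the raw gradient was scaled.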
The token-efficiency argument follows: Muon makes more BPB progress per token because every step is matrix-coherent. AdamW makes progress too, but it spends some of its update budget on per-element noise that does not move the function in a useful direction. Inside a 600-second wall-clock cap, that translates directly into a BPB gap — even on a dataset whose scale would normally favor adaptive methods. The two optimizers operate on different mathematical objects (element-wise vectors vs. spectral matrices) even though they both look like "one optimizer step" from the outside.
O2 — mostly refuted by Stage 6. The hypothesis named communication overhead, specifically wire time spent in collectives, as Muon's dominant slack. We tested it with the 0427 trick: a Muon variant that exploits the fact that DDP's backward already synchronizes gradients across ranks, so a deterministic Newton-Schulz iteration can run independently on every rank, skipping the extra all_gather and reduce_scatter a sharded Muon would otherwise need. Same-shape matrices across layers are batched into a single NS call. The result was a small win: an improvement of about 0.0005 bpb.
Stage 6 made the cost structure quantitative. Out of a 4.82 ms training step, exposed collective wait time was only 112 µs — about 2.3% of the step. The wire we had been targeting was already well-pipelined; the headroom for any "save wire time" intervention was tiny by construction. The real cost was hiding inside the optimizer's packed-grad path: opt.ar_packed took 2.77 ms per step (~57% of the whole step), but only 56 µs was wire — the remaining ~2.7 ms was tensor pack/unpack overhead (cat + copy_ of many small gradients inside _all_reduce_packed_grads). A fused pack via torch._foreach_* or a custom kernel could plausibly recover 20–30% throughput at 1.0x batch.
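The shares quoted above follow directly from the four timings. The short calculation below only restates that arithmetic; the variable names are ours, and the 20–30% recovery estimate is not derived here.

```python
# Timings as reported for Stage 6, all in milliseconds.
step_ms = 4.82              # one full training step
exposed_wait_ms = 0.112     # exposed collective wait (112 µs)
ar_packed_ms = 2.77         # total time inside opt.ar_packed
ar_packed_wire_ms = 0.056   # wire portion of opt.ar_packed (56 µs)

wire_share = exposed_wait_ms / step_ms                # about 2.3% of the step
ar_packed_share = ar_packed_ms / step_ms              # about 57% of the step
pack_overhead_ms = ar_packed_ms - ar_packed_wire_ms   # about 2.7 ms of pack/unpack
pack_overhead_share = pack_overhead_ms / step_ms      # about 56% of the step
```

Pack/unpack overhead is therefore roughly 24 times larger than the exposed wire time, which is why a wire-focused intervention had so little room to help.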
Reading the 0.0005 bpb win against that profile: the wire time saved by removing two extra collectives is roughly 100 µs/step (well under 5% of the eliminated collectives' wall time), which would translate to a barely measurable BPB delta. The pack/copy work skipped on those same collectives is roughly 2 ms/step, much closer in magnitude to the win we actually observed. So the cleanest reading is that the 0427 trick worked, but not for the reason the hypothesis named: it bought a little pack/copy savings as a side effect of removing collectives, while the wire-time mechanism the hypothesis pointed at was never the bottleneck.
Verdict: O2 is mostly refuted. The cleanly-formulated version would be "pack/copy in the optimizer's gradient path is Muon's main slack" — a different mechanism than "communication overhead", and one we could only state confidently after Stage 6 surfaced it. The 0427 trick is still worth keeping (small gain in the predicted direction), but it is a partial workaround for a problem we now know is upstream of the collective itself.
What is all-reduce, and why does it dominate DDP communication?
A modern data-parallel training run replicates the model on every GPU and feeds each GPU a different micro-batch. After backward, every GPU has its own local gradient — different from its peers, because each saw different data. To take a single coherent optimizer step across the cluster, the gradients have to be averaged.
That averaging is an all_reduce: every GPU's gradient tensor is summed (or averaged) with every other GPU's, and at the end every GPU holds the same combined result. NCCL implements this with a ring or tree algorithm; for the ring variant each GPU sends roughly 2N(W−1)/W bytes per step, where N is the size of the gradient buffer in bytes and W is the world size. At W = 8 that's about 1.75× the gradient buffer pushed through the network on every backward pass, and it is usually the dominant communication cost in DDP.
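The 2(W−1)/W factor can be checked with a toy simulation of a ring all-reduce. The sketch below models ranks as Python lists and counts elements sent per rank; it illustrates the textbook ring algorithm (reduce-scatter followed by all-gather) and is not NCCL's implementation. The function name and the toy sizes are assumptions.

```python
def ring_all_reduce(buffers):
    """Simulate a ring all-reduce (sum) over equal-length per-rank lists.

    Returns (reduced_buffers, elements_sent_per_rank).
    """
    world = len(buffers)
    n = len(buffers[0])
    assert n % world == 0, "toy version: length must divide evenly"
    chunk = n // world
    data = [list(b) for b in buffers]
    sent = [0] * world

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    def exchange(first_chunk, accumulate):
        for step in range(world - 1):
            # Snapshot payloads first so all sends in a step are simultaneous.
            outgoing = []
            for r in range(world):
                c = (r + first_chunk - step) % world
                outgoing.append((c, [data[r][i] for i in span(c)]))
            for r, (c, payload) in enumerate(outgoing):
                dst = (r + 1) % world
                for off, i in enumerate(span(c)):
                    if accumulate:
                        data[dst][i] += payload[off]
                    else:
                        data[dst][i] = payload[off]
                sent[r] += chunk

    exchange(first_chunk=0, accumulate=True)    # reduce-scatter phase
    exchange(first_chunk=1, accumulate=False)   # all-gather phase
    return data, sent

world, n = 8, 16
grads = [[float(rank + 1)] * n for rank in range(world)]   # toy local gradients
reduced, sent = ring_all_reduce(grads)
volume_ratio = sent[0] / n   # elements sent per rank, relative to buffer size
```

With W = 8 every rank ends up holding the same summed buffer after sending 1.75× the buffer length, matching the formula.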
For Muon specifically there is a second question on top of DDP's backward all-reduce: does the optimizer step itself need additional collectives? Many Muon implementations shard the Newton-Schulz iteration across ranks — each rank runs NS only for the matrices it owns, then an all_gather brings everyone's updated weights back together. The "0427 trick" / "optimal Muon" insight is that NS is deterministic, and DDP already left identical gradients on every rank, so every rank can run the entire NS pass independently and skip the post-backward all_gather and reduce_scatter entirely. The cost is W× redundant compute; the benefit is no extra communication on the critical path.
Whether the trade is worth taking depends on where the bottleneck actually is. For us, Stage 6 showed the wire was only 2.3% of the step; the trade gave a tiny win. The pack/copy inside _all_reduce_packed_grads — the cat-and-copy of many small per-parameter gradient tensors before the collective — turned out to be the real cost. A fused pack would route around that without changing anything about the collective itself.
O3 — refuted by Stage 2. Turning torch.compile off entirely crashed at serialization — the surrounding codepath quietly assumed compile-on. Switching the main training compile from static to dynamic=True was strictly worse (+0.00258 bpb vs the compile-on diagnostic baseline). Under a 600 s wall-clock cap there was no hidden throughput win in this knob; static compile was already the right default.
O4 — refuted by Stage 3. The wall-clock cap made the headline simpler than we expected: increasing the effective batch at a fixed LR reliably hurt. The 1.0x baseline came in at 1.06342 bpb; 1.5x batch at the same LR landed at 1.06570 (+0.00228), because fewer effective updates fit in 600 seconds. The classic sqrt(k) rule helped slightly at 1.5x (-0.00039 vs fixed), while linear LR scaling over-corrected (+0.00057). A sub-sqrt Muon factor looked promising at 1.75x batch, but those larger-batch runs were not iso-step, so a clean rerun is needed before reading too much into them.
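The three scaling rules compared above can be written as one power law in the batch multiplier k. This is only one convenient parameterization: the writeup does not say how the sub-sqrt factor was defined, so the exponent form and the function name below are assumptions.

```python
def lr_scale_factor(batch_scale, exponent):
    """Multiplier applied to the base LR when the batch grows by `batch_scale`.

    exponent = 0.0 -> fixed LR
    exponent = 0.5 -> sqrt(k) rule
    exponent = 1.0 -> linear rule
    Values between 0 and 0.5 give a "sub-sqrt" rule (hypothetical form).
    """
    return batch_scale ** exponent

k = 1.5
factors = {
    "fixed": lr_scale_factor(k, 0.0),
    "sqrt": lr_scale_factor(k, 0.5),
    "linear": lr_scale_factor(k, 1.0),
}
```

At k = 1.5 the sqrt rule raises the LR by about 22% and the linear rule by 50%, so the three cells in the sweep differ only in how hard the LR is pushed for the same batch.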
The harder boundary under O4 was memory. Every 2.0x batch cell OOMed in fused-CE backward on 80 GB H100s, regardless of LR scale. The practical ceiling on this stack was around 1.75x, and even there throughput collapsed (about 2.6 Mtok/s versus 6.6 Mtok/s at 1.0x). Under this budget, larger batch simply was not the right axis to push.
Quantization
Quantization still felt somewhat like a black box to us, even though we had done it before. Two questions kept coming up: which compensation tricks pay off, and are BPB and the artifact-byte cap independent constraints we can tune separately.
Hypotheses
- Q1 — Group quantization improves GPTQ. Per-group statistics should give a more stable estimate of weight distributions and improve quantization quality on this stack.
- Q2 — QAT should outperform GPTQ-style PTQ. Quantization-aware training, with quantization in the loop, should give the model time to adapt and beat post-training quantization.
- Q3 — LQER has substantial headroom. Low-rank quantization error compensation, layered on top of GPTQ, should improve BPB by roughly 0.001 to 0.003 beyond GPTQ alone.
- Q4 — More aggressive matrix-bit quantization is a viable tradeoff. Pushing matrix bits below the int6 default should buy artifact bytes at an acceptable BPB cost.
- Q5 — BPB and byte budget are independent constraints. Levers that improve BPB shouldn't fundamentally have to trade against the 16 MB artifact cap.
Findings
Q1 — refuted. Group quantization did give the predicted statistical-stability improvement, but the per-group scale storage compounded across every quantized matrix. The BPB improvement was on the order of 0.0001 bpb (under the single-seed noise floor on this stack), while the extra scale storage cost 300–450 KB after compression, enough to push the cap-legal artifact past the 16 MB limit. This is the cleanest concrete counterexample to Q5: BPB and bytes are joint constraints, and group quantization sits on the wrong side of that joint.
Why does grouped GPTQ push over the cap?
Implementation in gptq_quantize_weight: the grouped path replaces the per-row scale (shape [rows]) with one scale per group (shape [rows, n_groups], where n_groups = ceil(cols / group_size)) and chooses each group's scale by Hessian-weighted reconstruction error.
The byte arithmetic is unforgiving. At the default group_size = 64, scale storage grows by a factor of n_groups per matrix — roughly 8× for the 512-wide attention projections and up to 32× for the 2048-wide MLP matrices. Summed across the 11-layer stack, the additional fp16 scale storage came out around ~800 KB raw, dropping to roughly 300–450 KB after brotli.
With the cap-legal artifact already at ~15.84 MB (about 160 KB headroom under the 16 MB cap), spending another 300+ KB on group statistics turns a budget-legal submission into an over-cap one — for a BPB delta we cannot reliably measure against single-seed noise. The trade goes the wrong way at every input.
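The growth factors and the headroom check can be reproduced with a few lines of arithmetic. The sketch below assumes fp16 scales (2 bytes each), decimal megabytes for the cap, and a hypothetical row count of 512; only the column widths, group size, artifact size, and the 300 KB lower bound come from the text above.

```python
import math

def n_scales(rows, cols, group_size=None):
    """Number of stored scales: one per row, or one per (row, group)."""
    if group_size is None:
        return rows
    return rows * math.ceil(cols / group_size)

def extra_scale_bytes(rows, cols, group_size, bytes_per_scale=2):
    """Raw bytes added by switching one matrix from per-row to grouped scales."""
    return (n_scales(rows, cols, group_size) - n_scales(rows, cols)) * bytes_per_scale

ROWS = 512  # hypothetical row count, for illustration only
attn_growth = n_scales(ROWS, 512, 64) // n_scales(ROWS, 512)     # 512-wide matrix
mlp_growth = n_scales(ROWS, 2048, 64) // n_scales(ROWS, 2048)    # 2048-wide matrix

CAP_BYTES = 16_000_000        # assumption: the cap is measured in decimal MB
ARTIFACT_BYTES = 15_840_000   # cap-legal artifact size from the text
headroom_bytes = CAP_BYTES - ARTIFACT_BYTES
over_cap = ARTIFACT_BYTES + 300_000 > CAP_BYTES   # low end of the 300-450 KB range
```

Even the low end of the compressed overhead is almost twice the available headroom, so no choice within the reported range keeps the artifact legal.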
Q2 — untested. Our intuition still says QAT should work better than GPTQ-style PTQ on a long enough training run, but we never got a successful QAT run during this round. Inside a 600 s wallclock cap the engineering overhead of a working QAT loop kept eating the gain we were hoping it would unlock.
Q3 — refuted by Stage 1. Ablating LQER entirely cost only about +0.00038 bpb, well below the 0.001 to 0.003 improvement we had loosely predicted. Most of the LQER sub-knobs (sym vs asym, group sizes, factor bits) sat within noise of one another, and rank=8 actively hurt (+0.00077). One small, real win surfaced: dropping LQER top_k from 3 to 1 improved BPB by 0.00044. The default appears to over-spend compensation on tensors that did not really need it.
Q3 addendum (Stage 8): the LQER top_k optimum is codebase-dependent. We re-ran the LQER landscape (top_k ∈ {1, 2, 3, 4, 5}) on the PR #1855 base and found top_k = 2 wins (mean 1.06211, σ 0.000056 — the lowest σ in the whole landscape, and a clean −0.00052 vs the top_k=3 anchor). On the same codebase, top_k=1 is statistically indistinguishable from top_k=3. The Stage 1 result still holds for the PR #1851 codepath; the takeaway is that LQER's compensation sweet spot interacts with the surrounding stack rather than being a single-codebase constant.
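For readers unfamiliar with the idea behind LQER, the sketch below shows the core mechanism in its simplest form: quantize a matrix, take the quantization error, and store a low-rank approximation of that error alongside the quantized weights. Everything here is a simplification chosen for illustration: round-to-nearest stands in for GPTQ, the correction is rank 1 and found by power iteration, and the toy weights, bit width, and function names are assumptions rather than the stack's settings.

```python
import math

def fro_norm(m):
    return math.sqrt(sum(v * v for row in m for v in row))

def quantize_rows(w, bits):
    """Symmetric round-to-nearest with one scale per row (stand-in for GPTQ)."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for row in w:
        scale = max(abs(v) for v in row) / qmax or 1.0
        out.append([round(v / scale) * scale for v in row])
    return out

def rank1_factor(err, iters=50):
    """Leading singular triple (sigma, u, v) of `err` via power iteration."""
    rows, cols = len(err), len(err[0])
    v = [1.0 + j for j in range(cols)]   # arbitrary non-zero start vector
    sigma, u = 0.0, [0.0] * rows
    for _ in range(iters):
        u = [sum(err[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        u_norm = math.sqrt(sum(x * x for x in u))
        if u_norm == 0.0:                # nothing to compensate
            return 0.0, [0.0] * rows, [0.0] * cols
        u = [x / u_norm for x in u]
        v = [sum(err[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        sigma = math.sqrt(sum(x * x for x in v))
        v = [x / sigma for x in v]
    return sigma, u, v

# Toy weight matrix (deterministic, not real model weights).
weights = [[math.sin(1.3 * i + 0.7 * j) for j in range(6)] for i in range(4)]
quantized = quantize_rows(weights, bits=3)
err = [[w - q for w, q in zip(wr, qr)] for wr, qr in zip(weights, quantized)]

sigma, u, v = rank1_factor(err)
residual = [[err[i][j] - sigma * u[i] * v[j] for j in range(6)] for i in range(4)]
err_before, err_after = fro_norm(err), fro_norm(residual)
```

The correction always reduces the reconstruction error, but it also costs bytes for the stored factors, which is why the number of compensated tensors (top_k) has a sweet spot rather than a monotone payoff.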
Q4 — refuted by Stage 1 and Stage 2. Pushing matrix bits more aggressively (down to 5) was a disaster at +0.0269 bpb. Stage 2 confirmed the floor more cleanly: matrix_bits=4 was unrecoverable (pre-quant BPB jumped to 1.43), and adding LQER rank=8 as a rescue did not bring it back (1.28 vs the 1.06182 anchor). Int6 sits near the actual floor for this model size; below it, extra LQER compensation does not recover the loss.
Q5 — refuted (the most consequential finding). Stage 1 made the coupling impossible to ignore: embed_bits=8 and a looser MLP clip were the two settings that genuinely improved BPB (-0.00080 and -0.00141) — but both pushed the artifact past the size limit. Stage 2 quantified the other direction: dropping embed_bits from 8 to 6 cost only +0.00080 bpb while shrinking the artifact by 242 KB — one of the cleaner BPB-vs-bytes tradeoff levers in the stack. The takeaway across both sweeps: bit-budget gains and BPB gains are joint constraints that have to be co-optimized, not independent knobs.
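One way to compare levers under a joint constraint is to express each as an exchange rate in bpb per KB. The calculation below uses only the figures reported in this section; the framing as an exchange rate and the variable names are ours, and the grouped-GPTQ gain sits under the noise floor, so its rate should be read as a rough upper estimate.

```python
# Figures reported above.
embed_bpb_cost = 0.00080        # embed_bits 8 -> 6
embed_kb_saved = 242
group_bpb_gain = 0.0001         # grouped GPTQ (below single-seed noise)
group_kb_cost = (300, 450)      # compressed overhead range

# bpb exchanged per KB of artifact.
embed_rate = embed_bpb_cost / embed_kb_saved
group_rate_best = group_bpb_gain / group_kb_cost[0]
group_rate_worst = group_bpb_gain / group_kb_cost[1]

# How much more BPB a KB is "worth" on the embedding lever.
rate_ratio = embed_rate / group_rate_best
```

On these numbers a KB spent on group scales buys roughly a tenth of the BPB that a KB of embedding precision does, which is the sense in which the two constraints have to be optimized together.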
Test-Time Compute
Test-time compute felt the most like a backdoor lottery ticket. The space was noisy, and a lot of it felt less principled than the earlier parts of the stack. The guiding heuristic, at least from our perspective, was to align the trained distribution with the test-time distribution as much as possible — beyond that, many of the gains felt opportunistic rather than deeply structural.
Hypotheses
- T1 — TTT itself gives meaningful BPB headroom. Letting the model train briefly on each test chunk before scoring should beat the frozen-model baseline by a clearly measurable margin.
- T2 — Multi-phase TTT beats single-phase. Splitting TTT eval into multiple cumulative phases (each phase trains on the data scored in earlier phases) should beat a single all-at-once pass.
- T3 — Smaller TTT batches are too noisy. The standard intuition says larger batches give cleaner LoRA gradients, so the bs=64 baseline should beat bs=16.
- T4 — SMT (Sparse Matrix Tuning, our own technique from arXiv 2405.15525) at TTT should help. Selecting only the top-K highest-gradient 64×64 blocks per matrix and zeroing updates outside that mask should focus TTT capacity on the parameters that matter and beat full LoRA TTT.
Findings
T1 — confirmed by Stage 1. TTT was the place where the numbers most clearly justified the engineering. The baseline post-TTT score (1.06172) sat ~0.013 bpb below the pre-TTT floor (1.07507). That single delta was several times larger than every gated-attention or quantization knob we touched — TTT is by a wide margin the largest single lever in this search tree.
T2 — confirmed. Phased TTT preferred three phases: collapsing to one cost +0.00118, while five phases were within noise. The alpha and weight-decay values inherited from S9 were each slightly worse than the #1851 defaults, by small margins. A side wrinkle: replaying the older #1797 configuration on top of the #1851 stack came out at 1.06149 bpb, very slightly ahead of the #1851 baseline (1.06172). The two changes (BOS-mask off, GateQuant on) interact non-additively — at this resolution, leaderboard ordering is not a clean total order even within a single code lineage.
T3 — refuted by Stage 2. The data went the other way. At fixed LoRA rank 96, dropping batch size from 64 to 16 improved post-TTT BPB from 1.06195 to 1.06169. Going to rank 192 with batch size 16 was the best of the four cells (1.06166). The "noisier gradient hurts BPB" story did not hold up — TTT seemed to prefer more, smaller updates than the conventional bigger-batch-cleaner-gradient intuition suggests.
T4 — refuted. SMT selects the top-K (default keep_frac = 0.25) 64×64 blocks of each matrix's gradient and zeros updates outside that mask. The mask is computed once at chunk 0 by running a single forward+backward over the chunk's training-eligible sequences (_smt_select_masks), then frozen and applied as a per-step gradient mask for every remaining chunk. In principle, focused updates on the highest-gradient blocks should be cleaner than spreading TTT capacity thinly across the whole matrix.
In practice, the chunk-0 gradient signal turned out to be too unstable a basis for a frozen mask. The blocks that look most active in the first chunk are not consistently the blocks that matter for the rest of TTT: the test-time distribution drifts across chunks, and any one-shot mask bakes in the early signal at exactly the moment when the model has the least information. Unmasked LoRA TTT, which gets to re-allocate capacity at every chunk, was meaningfully more stable end-to-end. The final submission keeps SMT off and instead uses standard LoRA with rank 80, batch size 16, and three-phase Phased TTT.
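The block-selection step described for SMT can be sketched as follows. This is a toy version on nested lists with 2×2 blocks so the result is easy to inspect; the real path (_smt_select_masks) works on 64×64 blocks of tensors. Scoring a block by the sum of absolute gradient values is an assumption, as are the function names.

```python
def select_block_mask(grad, block=64, keep_frac=0.25):
    """Return a 0/1 mask keeping the highest-scoring blocks of `grad`.

    Assumption: a block's score is the sum of |g| over its entries.
    """
    rows, cols = len(grad), len(grad[0])
    assert rows % block == 0 and cols % block == 0
    scores = {}
    for bi in range(rows // block):
        for bj in range(cols // block):
            scores[(bi, bj)] = sum(
                abs(grad[i][j])
                for i in range(bi * block, (bi + 1) * block)
                for j in range(bj * block, (bj + 1) * block))
    n_keep = max(1, round(keep_frac * len(scores)))
    kept = set(sorted(scores, key=scores.get, reverse=True)[:n_keep])
    return [[1.0 if (i // block, j // block) in kept else 0.0
             for j in range(cols)] for i in range(rows)]

def apply_mask(grad, mask):
    """Zero every gradient entry outside the selected blocks."""
    return [[g * m for g, m in zip(gr, mr)] for gr, mr in zip(grad, mask)]

# Toy 4x4 gradient; the top-right 2x2 block carries the largest values.
chunk0_grad = [
    [0.1, 0.1, 2.0, 2.0],
    [0.1, 0.1, 2.0, 2.0],
    [0.3, 0.3, 0.2, 0.2],
    [0.3, 0.3, 0.2, 0.2],
]
mask = select_block_mask(chunk0_grad, block=2, keep_frac=0.25)
masked = apply_mask(chunk0_grad, mask)
```

Because the mask is computed once and then frozen, any later chunk whose gradient mass sits in a different block gets no update there, which is the failure mode described above.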
A Note on Working with Agents
Looking back at the search tree above, much of the substance was carried by agentic LLMs. Codex and Claude Code were not adjacent helpers — they were inside almost every step. We wrote prompts; they wrote scripts. We sketched a sweep; they wired up the runner, kicked off training on the HPC, scraped the logs, parsed the TSVs, ran the regressions, and handed back a summary that often included a sensible next experiment to try.
There was a threshold somewhere in the middle of the project where the loop tightened to the point of feeling almost autonomous. Once we had fed enough context — the codebase, the constraints, the prior PRs, the way the budget was being scored — the agent could take a one-line research goal, build the conda environment on a fresh HPC node, run the experiment to completion, write the results to disk, generate the diff against the previous run, and propose what it thought we should look at next. We mostly nudged.
This is not entirely a happy observation. The work above includes plenty of moments that felt genuinely ours: the choice of hypotheses, the willingness to keep a toss-up MHA result, the suspicion that the SmearGate BOS-mask was the right size. But the rote engineering (the part that, ten years ago, would have been the work) was largely automated away. The fraction of the effort that was distinctively human kept shrinking.
We are not lamenting the agents themselves; they are extraordinary, and this project would not have been possible at this depth without them. What is harder to sit with is how fast the line between operator and author moved. There is a version of this writeup that could have been, almost entirely, a transcript of what we asked the agents and what they handed back. We did not write it that way — but the fact that we could have is what makes the lament a real one.
A second thing worth flagging is the substrate. The agentic loop runs on hardware that has not gotten cheaper or more available. A 600-second wall-clock cap on 8× H100s is a small experiment by frontier standards, but every iteration of the search tree above still cost real GPU-hours. Most of the project's calendar time was spent waiting for jobs to slot into a shared queue, not waiting for the agent to think. Compute is now the binding constraint on how fast agentic research can move — not the agent's bandwidth, not the human's. We were lucky to have access; many researchers in this design space are not.
A third thing to say out loud is that the slow parts left in the loop are the parts that most need a human. Picking which hypothesis is worth a sweep, recognizing that the chunk-0 SMT mask was unstable rather than just noisy, deciding to call A1 a "toss-up" instead of a "refutation" — those judgment calls were ours. The agents could implement and measure, but the framing of what to measure, and the read of why a number was the size it was, still benefited noticeably from the human. Research efficiency has gone up enormously; the parts that didn't speed up are the parts where intuition and an honest second look at the data still matter, and for now those still seem to be a place where human researchers have an edge over LLM agents. We expect that gap to close. It hasn't yet.
An aside on autoresearch. Alongside the manual sweeps we also ran a Karpathy-style autoresearch hive — 24 agents across desktop 4090s, a 6× 4090 box, and a cloud H100 node, following the standard design (LLM proposes a train.py diff → fixed wall-clock train+eval → val_bpb → git commit if better, reset if not). After 1,114 completed runs over April, the hive's best result was val_bpb ≈ 1.0906 — about 0.029 bpb behind what two humans did in the same window.
A few reasons the loop didn't carry the way it does on nano-gpt: there is no cheap proxy here — parameter-golf's eval is the 10-minute hard cap, so the 12-experiments-per-hour flywheel that makes the ratchet compound on nano-gpt never spun up. With 24 agents each following their own greedy ratchet, they drifted to different local minima rather than converging on one. The wins that mattered for the 1.06 frontier weren't single-knob — they required correlated multi-knob optima (the 9-hparam stack), counterintuitive direction reversals (MHA only paying off when paired with the right surrounding tuning), specific bug fixes (the SmearGate BOS-mask), and byte-budget-aware engineering (lrzip pergroup) — none of which a greedy one-step ratchet can surface. And the ratchet has no mechanism for committing to a worse-looking intermediate, which is exactly how the manual track escaped the 1.09 basin.
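The structural limitation of the ratchet is easy to show on a toy problem. In the sketch below, two hypothetical knobs each look worse on their own but are better together; the scores are invented purely for illustration and are not results from the hive or from any run.

```python
def greedy_ratchet(proposals, evaluate):
    """Commit a proposed knob only if it improves the score; otherwise reset."""
    state = frozenset()
    best = evaluate(state)
    for knob in proposals:
        candidate = state | {knob}
        score = evaluate(candidate)
        if score < best:              # lower is better: commit
            state, best = candidate, score
        # otherwise the candidate is discarded and the state is unchanged
    return state, best

# Toy landscape (made-up numbers): "a" and "b" only pay off jointly.
TOY_SCORES = {
    frozenset(): 1.00,
    frozenset({"a"}): 1.02,
    frozenset({"b"}): 1.01,
    frozenset({"a", "b"}): 0.90,
}

final_state, final_score = greedy_ratchet(["a", "b", "a"], TOY_SCORES.__getitem__)
joint_score = TOY_SCORES[frozenset({"a", "b"})]
```

The ratchet ends where it started, because reaching the joint optimum requires first accepting a configuration that scores worse.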
Honest read: the hive reproduced roughly what the modded-nanoGPT community had already published — useful, but not a frontier-mover here. The 1.06 frontier moves were all human-found; the agent's contribution kicked in after a human had picked the direction.
Citation
If you would like to cite this writeup:
@misc{li2026paramgolf,
author = {Li, Billy and Shen, Tim},
title = {Ten Minutes, 16 Megabytes: A Parameter-Golf Field Report},
year = {2026},
month = apr,
howpublished = {\url{https://www.junchengbillyli.com/llm-notes.html}},
note = {Submission: \url{https://github.com/openai/parameter-golf/pull/1987}}
}