The NanoGPT Speedrun: From 45 Minutes to 81 Seconds

There is a GitHub repository called modded-nanogpt. It has no corporate backing, no PR team, and almost no mainstream coverage. What it does have is a leaderboard tracking exactly how fast a community of researchers and engineers can train a 124-million-parameter language model to a specific quality threshold on eight NVIDIA H100 GPUs.

In May 2024 that took 45 minutes. Today it takes 81 seconds.

Same hardware. No new chips. Pure software.

That is roughly a 33x speedup in under two years, and it tells you more about what is actually driving AI progress than most conference papers or investor decks.

What Is the Speedrun?

The target is simple to state: train a language model that achieves a cross-entropy loss of 3.28 or lower on the FineWeb validation set, using eight H100s, as fast as possible.
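To make the 3.28 figure concrete, here is a small sketch that converts it to perplexity. It assumes the loss is the mean per-token cross-entropy measured in nats (the natural-log convention common in PyTorch training code). The helper name `cross_entropy` and the toy probabilities are illustrative, not taken from the repo.

```python
import math

def cross_entropy(probs_of_correct_token):
    """Mean negative log-likelihood in nats over a list of predictions.

    Each entry is the probability the model gave to the token that
    actually came next.
    """
    n = len(probs_of_correct_token)
    return -sum(math.log(p) for p in probs_of_correct_token) / n

TARGET_LOSS = 3.28  # the speedrun threshold quoted above

# Perplexity is exp(loss): the size of a uniform guess with the same loss.
perplexity = math.exp(TARGET_LOSS)
print(f"perplexity at target: {perplexity:.1f}")  # roughly 26.6

# Toy check (assumed values): a model that always gives the correct token
# probability 1/perplexity sits exactly on the target loss.
toy_probs = [1.0 / perplexity] * 4
toy_loss = cross_entropy(toy_probs)
print(f"toy loss: {toy_loss:.2f}")
```

Read this way, hitting the target means the model is, on average, about as uncertain as a uniform choice among roughly 27 tokens.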

The 3.28 loss target comes from Andrej Karpathy's llm.c GPT-2 replication, which was the starting baseline. Karpathy's version needed 45 minutes and about 10 billion tokens. The repo itself is a direct descendant of that code, which is why it carries the "nanoGPT" name.

The rules are deliberately minimal. You cannot touch the training or validation data streams. You have to show, with statistical significance, that your runs actually reach the 3.28 target rather than getting there on one lucky seed. Everything else is fair game: architecture, optimizer, precision, kernels, scheduling, data loading, anything.

The result is an unusually honest benchmark. There is no cherry-picking a good run. You have to post logs. Other people verify. The clock is the judge.

Why a Speedrun?

Speedrunning is a format people usually associate with video games, not machine learning research. But it turns out the structure maps surprisingly well.

You have a fixed, well-defined goal. You have a community of contributors who build on each other's work, often competing and collaborating at the same time. You get a historical record of every meaningful improvement. And critically, because the goal is speed rather than a new capability, every single optimization has to actually work. You cannot hide behind improved benchmarks that nobody cares about in production.

The format also concentrates a lot of minds on a very specific problem: training efficiency. That is not a glamorous topic. Most AI coverage focuses on what models can do, not what it costs to produce them. The speedrun inverts that.

The Record History as a Story

Looking at the full progression of 82 records, a few phases stand out.

The first wave (records 1 through 11, May to November 2024) was almost entirely Keller Jordan and a handful of close collaborators. They established the baseline, introduced rotary embeddings, reworked the optimizer (more on that in a moment), modernized the architecture with QK-Norm and ReLU², and used skip connections borrowed from U-Net style designs. The time dropped from 45 minutes to around 7.

The attention revolution (records 12 through 20, November 2024 to January 2025) was where things got wild. Record 12, submitted by @KoszarskyB, replaced the standard 1024-token dense attention window with a 64,000-token FlexAttention context. The time dropped from 7 minutes to 5. Then long-short sliding window attention showed up, inspired by Gemma 2. Flash Attention 3 arrived later and shaved off another significant chunk. The speedrun community was independently re-deriving techniques that major labs were publishing at the same time.
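The idea behind a sliding window is easy to show in isolation. The sketch below builds a causal sliding-window mask in plain Python. It is not the FlexAttention code from the record, and the function name and the tiny sizes are chosen only for illustration.

```python
def sliding_window_causal_mask(seq_len, window):
    """Build a boolean attention mask.

    mask[q][k] is True when query position q may attend to key position k:
    the key must not be in the future (k <= q) and must lie within the
    last `window` positions (q - k < window).
    """
    return [
        [(k <= q) and (q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]

# Toy sizes (assumed): 6 tokens, each seeing at most the 3 most recent ones.
mask = sliding_window_causal_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# Count how many query-key pairs are actually computed.
attended_pairs = sum(sum(row) for row in mask)
dense_causal_pairs = 6 * (6 + 1) // 2
print(attended_pairs, "pairs instead of", dense_causal_pairs)
```

The saving grows with sequence length: the windowed cost scales with `seq_len * window` rather than `seq_len ** 2`. A long-short scheme, as the name suggests, mixes layers with a large window and layers with a small one.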

The sub-3-minute era (records 20 through 40, January to October 2025) shifted the focus to systems work. Batching, precision, gradient communication, kernel fusion. The time kept falling, but each gain required more sophisticated engineering. Record 20, the first to break 3 minutes, needed contributions from six different people.

The current phase (records 40 through 82, October 2025 to April 2026) has been a sustained grind into the low 80-second range. This is where you start seeing things like custom Triton kernels, asymmetric logit rescaling, multi-token prediction, and a technique called Bigram Hash Embedding that dropped the time by nearly 10 seconds in one shot.

The Muon Optimizer

The most consequential thing to come out of this project is not a record time. It is an optimizer.

Standard neural network training uses Adam or AdamW, which are adaptive gradient methods. They work well, but they scale every parameter independently and ignore the matrix structure of a layer's weights. Keller Jordan introduced an alternative in October 2024 for record 3: Muon, which uses the Newton-Schulz algorithm to orthogonalize the update applied to each weight matrix. The intuition is that many training updates contain redundant information, and by forcing the updates to be orthogonal you can reach the same result in fewer steps.
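Here is a minimal sketch of what "orthogonalize with Newton-Schulz" means. It is not Muon's implementation: it swaps in the classic cubic Newton-Schulz iteration and plain Python lists so the idea stays readable, whereas Muon itself runs a tuned higher-order polynomial for only a few steps on the GPU. The function names and the toy matrix are assumptions made for illustration.

```python
import math

def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]

def newton_schulz_orthogonalize(g, steps=15):
    """Approximate the nearest semi-orthogonal matrix to g.

    Classic cubic iteration: X <- 1.5 * X - 0.5 * (X X^T) X.
    Each step pushes every singular value of X toward 1 while leaving
    the singular vectors unchanged.
    """
    # Dividing by the Frobenius norm makes every singular value at most 1,
    # which keeps the iteration inside its convergence region.
    frob = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / frob for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * xv - 0.5 * cv for xv, cv in zip(x_row, c_row)]
             for x_row, c_row in zip(x, xxt_x)]
    return x

# Hypothetical 2x3 "gradient" matrix, values chosen only for the demo.
grad = [[3.0, 1.0, 0.0], [1.0, 2.0, 1.0]]
update = newton_schulz_orthogonalize(grad)

# For an orthogonalized 2x3 update, update @ update^T is close to the identity.
gram = matmul(update, transpose(update))
for row in gram:
    print([round(v, 6) for v in row])
```

The appeal of this family of iterations is that it needs only matrix multiplications, which is exactly what GPUs are fast at. No explicit singular value decomposition is required.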

Jordan published this as a blog post, not a paper. No institution, no peer review process. Just code and an explanation.

It worked. By record 5, combining Muon with a modernized architecture had already cut the time from 45 minutes to 15. And then something unusual happened: the optimizer escaped the speedrun.

The team behind Kimi K2, the large Moonshot AI model, cited Muon in their training setup. DeepSeek's engineers looked at it too. The optimizer went from "competitive submission in a GitHub repo" to "used in models with hundreds of billions of parameters" in less than a year.

And Keller Jordan was recruited directly to OpenAI.

That trajectory is worth sitting with for a moment. A solo researcher working on a public benchmark repo builds something useful enough that it ends up in production at the biggest AI labs in the world, and gets hired by one of them. No journal submission required.

Record 82 and What It Represents

The most recent world record as of this writing is #82, submitted on April 29, 2026. It clocks in at 1 minute 21.2 seconds, or about 81 seconds.

The contributor was Alex Wa, a Yale student who worked on this with two classmates as part of a university course project.

Their idea: add a single learnable scalar parameter per attention head. The parameter acts as a gate that suppresses redundant computation within each head. The concept is grounded in research on cross-layer attention patterns, but the implementation is roughly 15 lines of code.
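To give a feel for how small such a change can be, here is a hypothetical sketch of a per-head scalar gate. The record's actual code is not reproduced here. The sigmoid squashing, the point where the gate is applied (on each head's output), and all names and values are assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def gate_heads(head_outputs, gate_params):
    """Scale each head's output vector by sigmoid(gate) for that head.

    head_outputs: one output vector per attention head.
    gate_params:  one scalar per head; in a real model this would be a
                  learnable parameter updated by the optimizer.
    """
    return [
        [sigmoid(g) * v for v in head]
        for head, g in zip(head_outputs, gate_params)
    ]

# Toy values (assumed): 3 heads with head dimension 2.
head_outputs = [[1.0, -2.0], [0.5, 0.5], [4.0, 0.0]]
gate_params = [0.0, 10.0, -10.0]  # neutral, nearly open, nearly closed

gated = gate_heads(head_outputs, gate_params)
for head in gated:
    print([round(v, 4) for v in head])
```

A gate near zero effectively switches a head off, while a gate near one leaves it untouched. The cost is one extra parameter per head, which is negligible next to the rest of the model.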

What Alex published is genuinely worth reading. You can trace the whole process. He documents the failed experiments first, which is rare. Most research writing only shows the thing that worked. Here you can see the systematic search: the hypotheses that went nowhere, the tuning runs that gave nothing, and then the one idea that gave 0.6 seconds back and broke the record.

That feedback loop, from course project to world record, is exactly what makes this repo interesting. There is no prerequisite to contribute beyond having a working idea and enough compute to test it.

What This Has to Do With GPT-4 and Claude

The obvious question is whether any of this matters outside the benchmark.

The answer is yes, though the path is indirect.

Most of the techniques in modded-nanogpt are ones that production training runs can use. Flash Attention 3 is in production at multiple labs. Rotary embeddings are standard. The ideas around learning rate scheduling and batch size scheduling that appear in records 46 and 72 reflect real tradeoffs that matter at scale. The speedrun is effectively a fast-iteration lab for training improvements where the feedback loop is measured in minutes rather than months.

Frontier labs like Anthropic, OpenAI, and DeepSeek have all been able to release noticeably better models without proportional increases in compute. Better training efficiency is a big part of why. The speedrun makes that efficiency progress visible and measurable in a way that internal lab work usually does not.

It also functions as a talent pipeline, apparently.

The 10 Euro Question

Here is the detail that most people miss when they think about AI training being expensive and inaccessible.

Because of everything that has gone into these 82 records, today you can clone the repo, spend around 10 euros on GPU compute, and train your own GPT-2 class model in about a minute and a half. That was genuinely impossible two years ago regardless of budget, because the algorithms to do it did not exist yet.

The repo includes complete instructions:

git clone https://github.com/KellerJordan/modded-nanogpt.git && cd modded-nanogpt
pip install -r requirements.txt
python data/cached_fineweb10B.py 9
./run.sh

Official records are validated on hardware from Prime Intellect, so you can even run on the same setup used for submissions.

The Bigger Picture

There is a version of AI progress that is legible only to insiders. New models drop, people benchmark them, numbers go up. The mechanisms behind the improvement are rarely explained in public.

modded-nanogpt is the opposite of that. Every record has a log. Every technique has a description. The contributor list has grown from one person to over fifty, spanning students, independent researchers, and engineers from established labs. Some records were set with AI coding systems, which adds its own layer of strangeness.

The project started because one person wanted to see how fast you could train a specific model. It has turned into something closer to an open, running record of how much more efficiently we have learned to use compute over the past two years.

The record time keeps going down. There is no obvious reason for it to stop.
