If maximizing model performance with limited data but unlimited compute is part of your research or production roadmap, NanoGPT Slowrun is a benchmark that should be on your radar. This open initiative from Q Labs reverses the standard paradigm: instead of chasing speed or scaling up data, you're challenged to extract the maximum generalization from a fixed dataset using as much compute as needed. Recent results show rapid gains in data efficiency, with order-of-magnitude improvements projected, making this a proving ground for the next wave of breakthroughs in language modeling optimization.
Key Takeaways:
- NanoGPT Slowrun is an open benchmark for data-efficient language modeling: train on 100M FineWeb tokens, use unlimited compute, and minimize validation loss (source).
- Data efficiency has improved from 2.4x to 5.5x over the modded-nanogpt baseline in the first week, driven by multi-epoch training, regularization, optimizer innovation, and ensembling (source).
- Slowrun enables heavy, compute-intensive strategies—such as high weight decay and advanced optimizers—that are often impractical in speedrun benchmarks.
- Real-world feasibility depends on your access to compute resources, but the benchmark sets new standards for what’s possible with limited data.
- For production, classic speedrun or pretrained model approaches may be more practical, but Slowrun is where algorithmic advances are being pioneered.
What Is NanoGPT Slowrun?
NanoGPT Slowrun is a collaborative research benchmark and open-source repository from Q Labs, designed to push the limits of data-efficient language model training. The rules are intentionally simple: train only on a fixed sample of 100 million tokens from the FineWeb dataset—no augmentation or extra data allowed. The sole metric is lowest validation loss on this fixed split. There is no cap on compute time or hardware resources (source).
This contrasts sharply with the “speedrun” approach that dominates AI benchmarks—where the aim is to reach a result as fast as possible, often by scaling up data or maximizing throughput. Slowrun instead asks: how much can you generalize from a small, static dataset if you can use as much compute as you want?
This setting has immediate relevance for applied AI fields where collecting or labeling data is slow, expensive, or highly regulated (e.g., robotics, genomics, finance, healthcare). While compute capacity is outpacing data growth in many sectors, our current scaling laws assume both must increase together. Q Labs argues that in practice, intelligence will increasingly be bottlenecked by data, not compute—a reality already visible in fields outside large language modeling (source).
By focusing on data efficiency, Slowrun is fertile ground for research into:
- Multi-epoch training (repeatedly cycling through the same data)
- Aggressive regularization (high weight decay, dropout, data shuffling)
- Advanced optimizers (second-order methods, evolutionary algorithms, natural gradients)
- Model ensembling and architectural innovations
- Compression and complexity minimization techniques
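The first two items above interact: multi-epoch training only helps if each pass sees the data in a fresh order. Here is a minimal sketch of that idea; `multi_epoch_indices` is a hypothetical helper for illustration, not part of the Slowrun codebase.

```python
import numpy as np

def multi_epoch_indices(n_batches, n_epochs, seed=0):
    """Yield (epoch, batch_index) pairs for multi-epoch training,
    reshuffling the fixed dataset at the start of every epoch
    (illustrative helper; the actual Slowrun code may differ)."""
    rng = np.random.default_rng(seed)
    order = np.arange(n_batches)
    for epoch in range(n_epochs):
        rng.shuffle(order)          # fresh order each epoch
        for idx in order:
            yield epoch, int(idx)

# Every batch is seen exactly once per epoch, in a different order each time.
schedule = list(multi_epoch_indices(n_batches=4, n_epochs=3))
```

The key property is that each epoch is a full permutation of the same fixed data, so no batch is over- or under-sampled across the run.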
The process is fully open: all improvements are submitted as pull requests, and any method that achieves a lower validation loss on the fixed split is merged. The benchmark is not about theory alone—it’s a live, evolving leaderboard of practical, reproducible improvements.
Benchmarks and Data Efficiency
NanoGPT Slowrun’s progress is measurable and fast-paced, with community contributions validated in the open. The initial baseline delivered a 2.4x improvement in data efficiency over the modded-nanogpt speedrun baseline. Within a week, this was pushed to 5.5x, thanks to algorithmic innovations and collective experimentation (source).
Key technical advances driving these results include:
- Shuffling at epoch start: Randomizing data order at the beginning of each epoch significantly boosts generalization in multi-epoch training.
- Learned projections for value embeddings: Replacing standard embedding tables with learned projections improves representational efficiency.
- SwiGLU activations: Using SwiGLU in place of squared ReLU activation functions leads to more data-efficient model transformations.
- Model ensembling: Averaging outputs from multiple independently trained models further lowers validation loss.
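To make the SwiGLU item concrete, here is a minimal numpy sketch of the gated feed-forward transform, SiLU(xW) * (xV). The weight shapes are illustrative only; a real transformer block also includes an output projection, and this is not the Slowrun implementation.

```python
import numpy as np

def swiglu(x, W, V):
    """SwiGLU feed-forward gate: SiLU(x @ W) * (x @ V).
    Shapes here are illustrative, not from the Slowrun codebase."""
    a = x @ W
    b = x @ V
    silu = a / (1.0 + np.exp(-a))   # SiLU (swish) activation
    return silu * b

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))     # (batch, d_model)
W = rng.standard_normal((8, 16))
V = rng.standard_normal((8, 16))
out = swiglu(x, W, V)
```

Because SiLU(0) = 0, the gate fully closes for zero pre-activations, which is part of what makes the gated form behave differently from a plain squared-ReLU MLP.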
All improvements are measured on the same 100M-token FineWeb subset. The table below summarizes these officially reported results:
| Benchmark | Data | Compute Limit | Data Efficiency (vs modded-nanogpt) |
|---|---|---|---|
| NanoGPT Slowrun | 100M tokens (FineWeb) | Unlimited | 5.5x (March 2026) |
| modded-nanogpt | 100M tokens (FineWeb) | Fixed wall-clock | 1x (baseline) |
For context, the Slowrun baseline (using modest settings) trains in approximately 47 minutes on an 8xH100 GPU cluster (~$12 in cloud compute). However, top leaderboard submissions frequently run for much longer and use more compute, leveraging advanced regularization and optimizer strategies. Q Labs projects that 10x data efficiency is within reach soon, and 100x may be possible with further algorithmic innovation (source).
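The ~$12 figure is easy to sanity-check. Assuming a cloud rate of roughly $2 per H100 GPU-hour (an assumed rate, not from the source; prices vary by provider):

```python
# Back-of-the-envelope check of the quoted baseline cost.
gpus = 8
minutes = 47
usd_per_gpu_hour = 2.0   # assumed rate, not from the source
cost = gpus * (minutes / 60) * usd_per_gpu_hour
# cost comes out near the quoted ~$12
```

Long-running unlimited-track submissions scale this linearly with wall-clock time, which is why budget monitoring matters for leaderboard attempts.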
This benchmark is especially relevant if you work in domains where scaling up data is impractical. It provides a reference for the possible efficiency gains from optimizing algorithms, not just increasing dataset size.
Practitioner Guide: Setup and Examples
Getting Started with NanoGPT Slowrun
The NanoGPT Slowrun repository is designed for open experimentation and quick iteration. Key points before you start:
- All experiments use exactly 100M tokens from FineWeb, prepared via the provided prepare_data.py script.
- Three official tracks are available: unlimited compute, limited compute (1 hour on 8xH100), and tiny compute (15 minutes on 8xH100).
- The only metric that matters is validation loss on the fixed split.
- Community pull requests are merged if they achieve a lower val loss.
Code Example: Running the Baseline
```shell
# Clone the repository
git clone https://github.com/qlabs-eng/slowrun.git
cd slowrun

# Install dependencies
pip install -r requirements.txt

# Prepare the data (FineWeb 100M tokens)
python prepare_data.py

# Train the baseline model (limited compute track)
python train.py --epochs 1 --track limited
```
To experiment in the unlimited track, modify the --track parameter or omit it for custom training configurations. The codebase is intentionally minimal: train.py drives the training loop, and model logic can be modified directly for new optimizers, regularization methods, or architectural ideas.
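The "minimal codebase" design means most experiments reduce to swapping one callable inside the training loop. The skeleton below is a hypothetical sketch of that shape (the real train.py differs): the loop stays fixed while the step function carries the optimizer logic.

```python
# Hypothetical skeleton of a pluggable training loop, for illustration
# only -- the real train.py in the Slowrun repo is structured differently.
def train(params, batches, step_fn, n_epochs=1):
    """Run n_epochs over the data; step_fn(params, batch) -> new params."""
    for _ in range(n_epochs):
        for batch in batches:
            params = step_fn(params, batch)
    return params

# Example step function: plain SGD on the 1-D quadratic loss (x - 3)^2.
def sgd_step(x, _batch, lr=0.1):
    grad = 2.0 * (x - 3.0)   # analytic gradient of the toy loss
    return x - lr * grad

x_final = train(0.0, batches=range(50), step_fn=sgd_step)
```

Swapping in a new optimizer means writing a new `step_fn`; the rest of the loop, data pipeline, and evaluation stay untouched, which keeps experiments comparable.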
Advanced Patterns and Real-World Scenarios
- Multi-epoch training: Tune the number of epochs and shuffling strategies to enhance generalization. Submissions with aggressive shuffling and higher epoch counts have shown substantial gains.
- Custom optimizers: Swap AdamW for advanced methods like Muon, SOAP, or MAGMA. Optimizer logic lives in train.py and can be updated as needed.
- Regularization: Experiment with high weight decay (up to 16x typical values) and dropout. These parameters are easily adjusted in model.py.
- Activation and embedding tweaks: Try alternative activations such as SwiGLU, and replace embedding tables with learned projections to see their impact on val loss.
- Model ensembling: Train several models with different random seeds and average their outputs for best results.
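The ensembling item has a clean mathematical guarantee worth seeing in code: by Jensen's inequality, the negative log-likelihood of the averaged predictive distribution is never worse than the average NLL of the individual models. The sketch below uses a hypothetical `ensemble_nll` helper on toy probabilities, not the Slowrun evaluation code.

```python
import numpy as np

def ensemble_nll(prob_sets, targets):
    """NLL of the averaged predictive distribution from several models.
    Hypothetical helper for illustration, not from the Slowrun repo."""
    avg = np.mean(prob_sets, axis=0)   # average the probability vectors
    return -np.mean(np.log(avg[np.arange(len(targets)), targets]))

# Toy example: two 'models' with different errors over 3 classes.
p1 = np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])
p2 = np.array([[0.5, 0.4, 0.1], [0.3, 0.2, 0.5]])
targets = np.array([0, 2])

nll_each = [ensemble_nll([p1], targets), ensemble_nll([p2], targets)]
nll_avg = ensemble_nll([p1, p2], targets)   # averaged distribution
```

Note the guarantee is relative to the *average* member, not the best one; in practice diverse seeds are what make the averaged model beat every individual run.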
For practitioners focusing on production, these patterns represent a research-centric approach—Slowrun is about exploring new boundaries, not shipping code fast. However, these same strategies can inspire more efficient model tuning and optimization in real-world constrained environments.
Considerations, Trade-offs, and Alternatives
Limitations and Trade-offs
- Compute Constraints in Practice: While Slowrun assumes “unlimited compute,” actual access to clusters (e.g., 8xH100 GPUs) is a significant barrier for many organizations. Extended training and ensembling can drive up cloud costs, so evaluate feasibility for your situation.
- Generalization Beyond the Benchmark: All reported improvements are validated on a single 100M-token FineWeb split. There's no guarantee that these results will transfer directly to other languages, domains, or noisier real-world data. Always validate on your production data before deploying new methods.
Alternatives for Practitioners
- karpathy/nanoGPT: The original lightweight GPT framework, ideal for quick prototyping and transformer fundamentals. Best when you have abundant data and need fast iteration.
- Speedrun-Style Training: For environments where wall-clock efficiency is critical, modded-nanogpt and similar speedrun benchmarks are more applicable. These optimize for throughput and latency, which is often essential for production workloads.
- Commercial and Large-Scale Frameworks: For robust deployment, consider established toolkits or cloud providers that offer a wide range of pretrained models and infrastructure. These platforms trade off some fine-grained control for scalability and reliability.
Common Pitfalls and Pro Tips
- Overfitting on Small Data: Multi-epoch training exposes your model to the same data repeatedly, increasing overfitting risk. Counteract this with strong regularization, data shuffling, and vigilant early stopping based on validation loss.
- Chasing Irrelevant Metrics: Validation loss is the only metric that counts for Slowrun. Avoid spending cycles optimizing for perplexity, BLEU, or other secondary metrics unless they clearly correlate with improvements on the fixed split.
- Cloud Cost Management: Unlimited compute is a research abstraction. Monitor GPU usage and set budget controls to avoid runaway costs, especially when running long or multiple jobs on shared clusters.
- Keeping Up with Baselines: The leaderboard evolves quickly. Always benchmark new approaches against the latest results to ensure meaningful progress.
- Translating Research to Production: Techniques that excel in Slowrun’s fixed-data, infinite-compute setting may not generalize to your production pipeline. Always revalidate in your target environment before adoption.
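The early-stopping advice in the first pitfall can be made concrete. Below is one common patience-based recipe, sketched as a hypothetical helper; Slowrun submissions may use different stopping rules.

```python
def early_stop_index(val_losses, patience=2):
    """Return the checkpoint index with the best validation loss,
    stopping once `patience` consecutive evaluations fail to improve.
    A common recipe, not Slowrun's exact procedure."""
    best, best_i, bad = float("inf"), 0, 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, best_i, bad = loss, i, 0   # new best: reset patience
        else:
            bad += 1
            if bad >= patience:
                break                        # overfitting: stop early
    return best_i

# Loss improves, then degrades as the model starts to overfit.
stop_at = early_stop_index([2.1, 1.8, 1.6, 1.65, 1.7, 1.9])
```

With multi-epoch training on a fixed 100M-token split, the validation curve eventually turns upward, so checkpointing at the best validation loss rather than the final step is essential.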
Conclusion and Next Steps
NanoGPT Slowrun is reshaping the landscape for data-efficient language model research. If your goal is to maximize performance on limited datasets—whether for regulatory, economic, or practical reasons—this is the leading open benchmark to follow and contribute to. The project already demonstrates 5.5x data efficiency gains over conventional speedrun methods, with even larger improvements likely as new algorithms are tested (source).
The key insight: the frontier of language modeling is not just bigger datasets or faster hardware, but smarter use of the data you have. Clone the Slowrun repository, experiment with new optimizers or regularization strategies, and help drive the next wave of generalization advances.
Action steps: fork the Slowrun repo, implement a novel optimizer or regularization technique, and benchmark your results against the constantly evolving leaderboard. As the field advances, those who combine compute scaling with data efficiency will shape the future of production-ready AI systems.

