Hugging Face's new playbook reveals the messy, bug-filled secrets to training world-class LLMs.

Hugging Face's new playbook pulls back the curtain on the messy reality of LLM development, revealing that building world-class models isn't about a single secret recipe but a grueling, iterative journey of systematic experiments, unexpected bugs, and relentless debugging.

We tend to imagine AI development as a clean, predictable process executed by labs with secret recipes. A new, radically transparent guide from Hugging Face, The Smol Training Playbook, shatters that illusion by detailing the messy reality behind training SmolLM3, their 3B-parameter model trained on 11 trillion tokens. It turns out, building state-of-the-art AI is less about a magic formula and more about surviving a marathon of unexpected bugs, infrastructure failures, and relentless debugging.

The most shocking revelation: they had to restart the entire 11T token training run after already burning through 1 trillion tokens. The model’s performance was mysteriously lagging behind its smaller predecessor. After a frantic search, the culprit was found: a subtle bug where every GPU in a parallel group was initialized with the same random seed, crippling the model's ability to learn effectively. Because they had systematically tested every other component, they could isolate and fix the bug in a single day.

The playbook is a behind-the-scenes look at the hundreds of decisions that go into a model. It’s packed with practical insights, from high-level strategy to low-level hardware optimization.

Here are some of the key takeaways for anyone building with AI:

  • Don't train from scratch unless you must. First, try to solve your problem with existing open-source models. The playbook provides a "Training Compass" to decide if you have a legitimate reason, like novel research or extreme domain specificity.
  • Ablate everything. The core of their method is running hundreds of small-scale experiments (ablations) to "derisk" every single change—from the attention mechanism to the data mixture. Intuition is cheap, but GPUs are expensive.
  • Data curation is your biggest lever. The authors stress that the largest performance gains consistently come from improving the quality and mix of the training data, not from chasing novel architectures.
  • Expect scale to break everything. The SmolLM3 run was plagued by issues that never appeared in smaller tests, including a storage system that couldn't keep up and a dataloader that buckled under the pressure of a multi-trillion token run.

Why this matters: This guide demystifies the art of LLM training. It shows that even the experts at Hugging Face don't have a perfect recipe. Instead, they rely on a disciplined, empirical process to navigate the inherent chaos of building complex systems. For developers and researchers, the playbook provides a concrete framework for making better decisions, debugging faster, and understanding the real-world trade-offs involved in creating powerful models. It’s a masterclass in turning messy reality into world-class results.

Below, we dive into the details to give a more formal overview of everything you'll learn in the 2-4 days it'll take to read through the guide. Trust us, it's worth it.

The Smol Training Playbook: Pulling Back the Curtain on Building World-Class AI

Published research papers often paint a serene picture of AI development: a logical progression of strategic architectural choices, perfectly curated datasets, and flawless execution on massive GPU clusters. The loss curves are smooth, the results are polished, and every decision seems obvious in hindsight. But as a new, remarkably transparent guide from Hugging Face reveals, the reality is far messier, more iterative, and filled with the kind of late-night debugging sessions and unexpected failures that never make it into the final report.

Hugging Face's The Smol Training Playbook is a detailed account of the journey behind training SmolLM3, a 3-billion-parameter multilingual reasoning model trained on a staggering 11 trillion tokens. It’s not just a recipe for what worked; it's a candid look at the spiderweb of decisions, dead ends, and hard-won insights that define modern large language model (LLM) training. This is the story of infrastructure breakdowns, subtle bugs that forced a complete restart after 1 trillion tokens of training, and the systematic process required to navigate the chaos.

The Compass: Before You Burn Millions on GPUs...

The playbook begins not with code, but with a fundamental question that many teams skip: should we even be training this model? With a rich ecosystem of powerful open-source models like Llama, Qwen, and Gemma, the uncomfortable truth is that most organizations don’t need to train their own model from scratch. Bad reasons—"we have compute available" or "everyone else is doing it"—lead to wasted resources.

The authors propose a "Training Compass" (Why → What → How) to ground the process in strategy. Legitimate reasons for pretraining fall into three categories: 

  1. Novel research to answer a specific question. 
  2. Production needs driven by extreme domain specificity or deployment constraints. 
  3. Strategic open-source contributions to fill a clear gap in the ecosystem.

Once the "why" is established, the "what" (architecture, model size, data) and "how" (infrastructure, frameworks) follow. The playbook emphasizes that successful teams are defined not by genius, but by iteration speed—small, well-equipped teams shipping new models every few months learn fastest. Above all, they are obsessed with data curation, which consistently yields bigger performance gains than architectural tinkering.

The Art of the Ablation: Every Great Model Starts Small

If strategy is the map, empirical validation is the compass. The playbook argues that intuition is often wrong in LLM training; for example, what seems like "high-quality" data (like scientific papers from arXiv) can actually hurt the performance of smaller models. The solution is to run a lot of small-scale experiments, or ablations.

The process is a discipline of derisking:

  1. Choose a Proven Baseline: Start with a well-documented, battle-tested architecture like Llama or Qwen. Don't reinvent the wheel and rediscover every problem yourself.
  2. Test One Change at a Time: Against this baseline, test a single promising modification. Does a new attention mechanism improve performance? Does a different positional encoding help?
  3. Validate Systematically: A change is only "derisked" when testing shows it improves performance or provides a tangible benefit (like faster inference) without unacceptable tradeoffs.
  4. Use Reliable Evaluations: Looking at the training loss isn't enough. A robust suite of downstream benchmarks is needed to measure actual capabilities. For early-stage ablations, the playbook recommends using the "cloze formulation" over multiple-choice questions, as models learn the former much earlier in training.
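
To make the "one change at a time" discipline concrete, here is a minimal sketch of an ablation sweep. This is not the playbook's actual tooling: the config fields, the candidate changes, and the train_and_eval stub are placeholders standing in for a small proxy model and a downstream benchmark suite.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class AblationConfig:
    # Baseline mirrors a proven architecture; fields are illustrative, not SmolLM3's real config.
    attention: str = "MHA"
    positional_encoding: str = "RoPE"
    tie_embeddings: bool = False

def train_and_eval(cfg: AblationConfig) -> dict:
    # Stand-in for training a small proxy model and running the downstream eval suite.
    return {"benchmark_avg": 0.0}

baseline = AblationConfig()
candidates = {
    "gqa": replace(baseline, attention="GQA"),
    "nope_hybrid": replace(baseline, positional_encoding="RoPE/NoPE hybrid"),
    "tied_embeddings": replace(baseline, tie_embeddings=True),
}

results = {"baseline": train_and_eval(baseline)}
for name, cfg in candidates.items():
    # Each candidate differs from the baseline by exactly one field, so any shift
    # in the benchmarks can be attributed to that single modification.
    results[name] = train_and_eval(cfg)
```

A change only graduates from this loop into the main recipe once the benchmarks show a clear win (or a clear efficiency gain with no regression).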

This systematic process is not cheap. The authors reveal that for SmolLM3, the cost of ablations and debugging (161,280 GPU-hours) was more than half the cost of the final training run itself. But this upfront investment provides the confidence needed to navigate the inevitable surprises that emerge at scale.
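To put that figure in perspective: 161,280 GPU-hours is equivalent to running the full 384-GPU training cluster for 161,280 / 384 = 420 hours, or roughly 17.5 days of cluster time spent on experiments and debugging around the final run.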

Designing SmolLM3: From Architecture to Data

With a clear methodology, the playbook walks through the key architectural decisions for SmolLM3, a model designed for on-device use with strong multilingual, coding, math, and long-context capabilities.

  • Architecture: A 3B-parameter dense model was chosen to balance capability with on-device memory constraints. MoE and Hybrid models were ruled out due to timeline and memory footprint.
  • Attention: Grouped-Query Attention (GQA) was used over standard Multi-Head Attention (MHA) to reduce the memory footprint of the KV-cache at inference—a critical optimization for long context and on-device use—without sacrificing performance (see the back-of-the-envelope sketch after this list).
  • Long Context: The model was built from the start for long context. It alternated layers with and without Rotary Position Encoding (a RoPE/NoPE hybrid) and used intra-document masking to improve generalization to long sequences and speed up training.
  • Stability: Techniques like embedding sharing (tying input and output embeddings) and removing weight decay from embeddings were used to save parameters and improve training stability.
  • Tokenizer: After analyzing multiple options, Llama3’s 128k vocabulary tokenizer was chosen for its efficient balance across SmolLM3’s target languages.
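
To see why GQA matters for the KV-cache, here is a back-of-the-envelope calculation. The layer and head counts below are illustrative round numbers for a ~3B model, not SmolLM3's actual configuration.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, dtype_bytes=2):
    # Per layer the cache stores K and V: 2 * kv_heads * head_dim values per token.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# Illustrative config: 32 layers, 32 query heads of dim 128, bf16, 64k-token context.
mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=64 * 1024)
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=64 * 1024)  # 4 query heads share each KV head
print(f"MHA: {mha / 2**30:.0f} GiB  vs  GQA: {gqa / 2**30:.0f} GiB")  # cache shrinks by the query/KV head ratio
```

At long context the KV-cache, not the weights, dominates inference memory, which is why cutting it by the head-sharing ratio matters so much on-device.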

Just as critical was the data mixture. The playbook details a multi-stage training curriculum designed to balance competing domains. SmolLM3’s 11T token journey was split into three phases:

  1. Stage 1 (8T tokens): A foundational mix of web, multilingual, code, and math data.
  2. Stage 2 (2T tokens): An injection of higher-quality, filtered datasets like Stack-Edu for code and FineMath4+ for math.
  3. Stage 3 (1.1T tokens): During the final learning rate decay, the mixture was enriched with instruction and reasoning datasets to sharpen the model's capabilities.

This curriculum follows a key principle of modern training: reserve your highest-quality data for the later stages, as a model's final behavior is heavily influenced by the data it sees toward the end of training.

The Training Marathon: A Drama of Bugs and Breakthroughs

With everything planned, the full-scale training on 384 H100 GPUs began. This is where the "messy reality" truly hit. The playbook candidly documents a series of show-stopping issues that emerged only at scale:

  1. The Vanishing Throughput: Within hours, training speed plummeted. The culprit was the network-attached storage, which couldn't handle the 24TB dataset and began evicting data shards mid-training. The fix involved moving the entire dataset to fast, local NVMe storage on every single node and keeping a pre-loaded "spare node" ready for instant swaps when hardware failed.
  2. The Mysterious Dataloader Bug: Even with local storage, throughput still had sharp, periodic drops. The team discovered a bottleneck in their new nanosets dataloader; its internal index grew with the total number of training steps, causing slowdowns on long runs. The fix was to swap it out for the older, battle-tested dataloader from the SmolLM2 project.
  3. The 1 Trillion Token Restart: The most dramatic failure came after two days of seemingly smooth training. Evaluations revealed that the new 3B model was performing worse than its 1.7B predecessor. The loss curve looked fine, but the capability benchmarks told a different story. Because every other component had been derisked through ablations, the team was able to quickly isolate the one untested variable: Tensor Parallelism (TP). The bug was incredibly subtle: every GPU in a TP group was being initialized with the same random seed, leading to correlated weights that hampered learning. The only solution was to fix the bug and restart the entire 11T token run from scratch.
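
To illustrate what went wrong, here is a minimal PyTorch-style sketch (not the team's actual training code) contrasting the buggy behavior, where every tensor-parallel rank draws the same initial weights, with the fix of offsetting the seed per rank.

```python
import torch

def init_shard(base_seed: int, tp_rank: int, fixed: bool) -> torch.Tensor:
    # Bug: every rank in the tensor-parallel group seeds its RNG identically,
    # so the shards of a sharded weight matrix start out as copies of each other.
    # Fix: offset the seed by the TP rank so each shard is initialized independently.
    seed = base_seed + tp_rank if fixed else base_seed
    gen = torch.Generator().manual_seed(seed)
    return torch.randn(4, 4, generator=gen)

# With the bug, rank 0 and rank 1 hold identical (fully correlated) shards:
print(torch.equal(init_shard(42, 0, fixed=False), init_shard(42, 1, fixed=False)))  # True
# With the fix, the shards differ as intended:
print(torch.equal(init_shard(42, 0, fixed=True), init_shard(42, 1, fixed=True)))    # False
```

Nothing in the loss curve screams "correlated weights"; only the downstream benchmarks exposed the damage, which is exactly why the playbook insists on evaluating beyond loss.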

This story highlights the immense value of the systematic process. Without the confidence from prior ablations, finding such a subtle bug would have been like searching for a needle in a haystack.

Post-Training: Sculpting a Raw Model into a Reasoning Assistant

A pretrained base model is a powerful next-token predictor, but it's not a helpful assistant. The final chapter of the playbook details the post-training process used to turn SmolLM3 into a hybrid reasoner capable of both concise answers and step-by-step "thinking."

The process began with Supervised Fine-Tuning (SFT) on a mix of instruction-following and reasoning datasets. A crucial lesson emerged early: always "vibe-test" your models. While automated benchmarks looked fine, manually interacting with an early checkpoint revealed it was completely ignoring system prompts—a bug in the data processing pipeline had stripped them all out.
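
A lightweight guard of the kind that would have caught this is easy to bolt onto an SFT pipeline. The field names below (messages, role) are assumptions for illustration, not the playbook's actual data schema.

```python
def check_system_prompts(examples):
    # After the processing pipeline has run, verify that conversations which should
    # carry a system message still do; the bug described above silently stripped them all.
    missing = [
        i for i, ex in enumerate(examples)
        if not any(msg["role"] == "system" for msg in ex["messages"])
    ]
    if missing:
        raise ValueError(f"{len(missing)} examples lost their system prompt, e.g. index {missing[0]}")
```

Checks like this complement, rather than replace, the manual "vibe test" the authors recommend.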

To boost reasoning, the team performed continued pretraining (or "mid-training") on billions of tokens of distilled reasoning traces before SFT. This step alone nearly tripled performance on competitive math benchmarks.
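(In other words, the order was: base pretraining, then reasoning-heavy mid-training, then SFT, then preference optimization.)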

Finally, they applied Preference Optimization (PO). Instead of just showing the model correct examples, PO teaches it what "better" means by training on pairs of "chosen" and "rejected" responses. After ablating several algorithms, they found that Anchored Preference Optimization (APO) gave the best results, significantly improving both instruction-following and reasoning capabilities beyond what SFT alone could achieve.
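
Concretely, preference data looks like the record below: one prompt paired with a preferred and a dispreferred completion. The example itself is made up; it only shows the shape of the data that APO (like DPO and other preference methods) trains on.

```python
preference_example = {
    "prompt": "Explain in one sentence why the sky is blue.",
    # The response the optimizer should push the model toward...
    "chosen": "Sunlight scatters off air molecules, and shorter blue wavelengths scatter far more, so the sky appears blue.",
    # ...and the response it should push the model away from.
    "rejected": "Because the sky reflects the color of the ocean.",
}
```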

The Unsung Hero: Infrastructure

Underpinning this entire journey is the hardware. The playbook provides a crash course in infrastructure, demystifying the components that dictate performance. It details the GPU memory hierarchy, explaining why modern AI is often memory-bound, and explores the vast differences in communication bandwidth—from the blistering 786 GB/s of intra-node NVLink to the comparatively sluggish 42 GB/s of the inter-node network. Understanding these bottlenecks is key to designing efficient parallelism strategies and achieving high hardware utilization.
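
A quick back-of-the-envelope calculation shows why that bandwidth gap matters. Moving a 3B-parameter model's gradients in bf16 (about 6 GB) takes milliseconds over NVLink but an order of magnitude longer over the inter-node network. The sketch below ignores all-reduce algorithm factors and communication/compute overlap, so the numbers are illustrative only.

```python
params = 3e9                     # SmolLM3-scale parameter count
payload_gb = params * 2 / 1e9    # bf16 gradients: 2 bytes per parameter, ~6 GB total

for link, bandwidth_gb_s in [("intra-node NVLink", 786), ("inter-node network", 42)]:
    # Naive single-transfer time; real collectives add algorithm- and topology-dependent factors.
    print(f"{link}: {payload_gb / bandwidth_gb_s * 1e3:.0f} ms to move {payload_gb:.0f} GB")
```

The roughly 8 ms vs 143 ms gap is why parallelism strategies try to keep the chattiest communication (like tensor parallelism) inside a node and reserve the slower inter-node links for less frequent exchanges.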

A New Standard for Transparency

By openly sharing not just their successes but their failures and messy debugging stories in The Smol Training Playbook, Hugging Face has provided an invaluable resource for the entire AI community. It confirms that building world-class models is about a disciplined, empirical, and often grueling process of iteration. For anyone building with or on top of these models, it’s a powerful reminder that behind every smooth loss curve lies a story of chaos navigated and complexity tamed.


See you cool cats on X!
