Usually, when a new AI model drops, we get a flashy blog post and a set of "weights" (the final, trained brain of the AI). It’s like a chef giving you a delicious cake but refusing to show you the recipe, the ingredients, or the kitchen where it was baked.
Olmo 3 is different.
The team at the Allen Institute for AI (Ai2) just dropped Olmo 3, a family of 7B and 32B parameter models. But they didn’t just give us the cake. They gave us the farm where the wheat was grown, the blueprints for the oven, the chef’s diary, and the exact temperature of the kitchen.
This is a deep dive (based on their supporting technical paper) into the most transparent AI release in history. We are going to unpack everything—from the "Duplodocus" tool they built to deduplicate the internet, to the "Delta Learning" technique used to teach the model to think.
Buckle up. We’re going fully open-source.
📑 Table of Contents
- The TL;DR: the news in brief.
- The Philosophy: What is "Model Flow"?
- Olmo 3 Base: Building the Foundation (Pretraining)
- The Mid-Training Arc: Targeted Boosts & Decontamination
- The Long Context: How to Read 64k Tokens
- Olmo 3 Think: The "Reasoning" Revolution
- Olmo 3 Instruct: The Chatty Assistant
- Olmo 3 RL-Zero: The Science Experiment
- The Infrastructure: OlmoRL & Engineering Magic
Technical terms decoded in this issue:
- SFT (Supervised Fine-Tuning): Teaching a model by showing it examples of "Question -> Correct Answer."
- DPO (Direct Preference Optimization): Teaching a model by showing it "Answer A is better than Answer B."
- RLVR (RL with Verifiable Rewards): Teaching a model by grading its homework (Math/Code) automatically.
- Ablation: A scientific test where you remove one specific part of the system to see if it actually matters.
- Inference-Time Scaling: The concept that a model gets smarter if you let it "think" for longer before answering.
Okay, let's get into it!
1. The TL;DR: Ai2 just dropped the ultimate open-source blueprint.
Here's the TL;DR: The Allen Institute for AI (Ai2) released Olmo 3, a new family of 7B and 32B models. Unlike "open weights" models (like Llama) that only give you the final product, Olmo 3 releases the entire "Model Flow"—including the data, training recipes, and intermediate checkpoints.
- The "Think" Model: The flagship Olmo 3 Think is capable of "reasoning" (generating internal thoughts before answering), similar to OpenAI's o1. It is currently the strongest fully open thinking model, outperforming Qwen 2.5 32B and Llama 3.1 70B on math and reasoning benchmarks.
- The "Instruct" Model: Designed for speed and general chat, this version strips away the thinking traces for quick, concise answers and robust function calling (tool use).
- Data Transparency: Ai2 released Dolma 3, their 6-trillion token dataset. They used a new technique called "quality-aware upsampling," where they deliberately repeated the highest-quality 5% of data (like math and science PDFs) to boost intelligence.
- Efficiency: The 32B model rivals top-tier models like Qwen 3 despite being trained on 6x fewer tokens.
WHY IT’S IMPORTANT: This is a massive win for transparency. Most "open" models are black boxes—we don't know what data they were trained on. Olmo 3 allows developers to inspect the exact data that went into the model, making it the safest bet for compliance-heavy industries.
Ai2 isn't just giving you the data; they are giving you a flashlight to look inside the black box. They launched OlmoTrace, a tool integrated directly into their Playground. If the model hallucinates or gives a weird fact, you can verify it instantly. It traces the model’s output back to the exact training documents that influenced it. This closes the loop between "What the AI said" and "What the AI read."
WHAT TO DO:
- For Researchers: Stop using closed weights for benchmarks. Use Olmo 3 Base or RL-Zero as your control variable. Because the entire lineage is public, you can trace specific behaviors back to the exact training data—something impossible with Llama or GPT-4.
- For Developers: If you need a reasoning model but can't send data to OpenAI/DeepSeek, Olmo 3 Think 32B is your new best friend. It's small enough to self-host but smart enough to handle complex logic chains.
- For Founders: Explore "mid-training." Because Ai2 released the intermediate checkpoints, you can insert your proprietary data during the training process (not just after), allowing for much deeper customization than standard fine-tuning.
How to use it: You can try it out on the Ai2 playground, or download the models here and run them with LM Studio. They're also on HuggingFace if you want to download and run them with another tool.
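If you'd rather script it than use a GUI, here is a minimal sketch using Hugging Face transformers. The repo id below is a guess at the naming pattern, not the official one—check Ai2's collection on HuggingFace for the exact model names before running.

```python
# Minimal sketch: running an Olmo 3 model locally with Hugging Face transformers.
# The repo id is an assumption -- look up the exact name on Ai2's HuggingFace page.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/Olmo-3-7B-Instruct"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Explain sliding window attention in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```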
Now, let's dive into the full details from the technical report.
2. The Philosophy: Welcome to the "Model Flow"
Most "Open Source" AI isn't actually open. Llama 3? That’s "Open Weights." You can use the model, but you don't know exactly what data trained it.
Olmo 3 introduces the concept of Model Flow. This is the full lifecycle of the model, including every stage, checkpoint, datapoint, and dependency.
Why does this matter?
If you want to customize an AI, usually you just tweak the final version. But with Olmo 3, you can intervene at any stage. Want to change how the model learns math? Go back to the "Midtraining" stage. Want to change how it filters safety data? Go to the "Post-training" stage.
The Family Portrait:
- Olmo 3 Base: The raw foundation.
- Olmo 3 Think: The flagship "reasoning" model (competitor to DeepSeek R1 and OpenAI o1).
- Olmo 3 Instruct: The chatty, tool-using assistant.
- Olmo 3 RL-Zero: A pure reinforcement learning experiment.
Let’s break down how they built this beast, step by step.
3. Olmo 3 Base: Building the Foundation
The Base model is the bedrock. It comes in 7B and 32B sizes. The goal? High performance across the board, but specifically designed to be "post-trainable"—meaning it’s primed to learn reasoning later on.
The Data: Dolma 3 Mix
Training a model starts with data. Ai2 curated Dolma 3 Mix, a massive dataset of 5.9 Trillion tokens.
Here is the secret sauce of Dolma 3:
A. The "Duplodocus" Deduplication
The internet is full of garbage and repeated text. To fix this, the team built a custom tool called Duplodocus (written in Rust, because obviously). It performs deduplication in three stages:
- Exact Dedupe: Removes identical copies. (Reduced 38.7B documents to 12.8B).
- Fuzzy Dedupe (MinHash): Finds documents that are mostly the same (like the same article on two different news sites). It uses Jaccard Similarity checks.
- Tech Term Explainer: Jaccard Similarity 🤓 -> A statistic used for gauging the similarity and diversity of sample sets. If two docs share 80% of the same unique words, they are treated as duplicates (see the sketch after this list).
- Substring Dedupe: This is the cool part. They use Suffix Arrays to find repeated paragraphs (like "Copyright 2024" footers) and nuke them, even if the rest of the document is unique.
Result: They shrank the web corpus by 75%, leaving only the high-quality stuff.
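To make the fuzzy stage concrete, here is a minimal sketch of a Jaccard-similarity check over word shingles. Duplodocus approximates this with MinHash so it scales to billions of documents; the shingle size and threshold below are illustrative, not Ai2's settings.

```python
# Toy fuzzy-dedup check: two documents are near-duplicates if the Jaccard similarity
# of their word shingles exceeds a threshold. Real pipelines use MinHash signatures
# to approximate this cheaply at corpus scale.
def shingles(text: str, n: int = 5) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)          # shared shingles / all shingles

def is_near_duplicate(doc_a: str, doc_b: str, threshold: float = 0.8) -> bool:
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold
```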
B. Quality-Aware Upsampling
Most models just filter out "bad" data. Olmo 3 does something smarter. They identified the highest quality data (using a classifier trained on OpenHermes and UltraChat) and upsampled it.
- Top 5% quality data? Repeated ~7 times.
- Bottom 40%? Discarded entirely.
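A toy version of that bucketing, using the split described above. The repeat count and cut-offs come straight from the article; the `quality_score` field is a stand-in for the output of Ai2's quality classifier.

```python
# Quality-aware upsampling sketch: repeat the best slice of the corpus, drop the worst.
def upsample_by_quality(documents):
    """documents: dicts with a 'quality_score' field (stand-in for the classifier)."""
    ranked = sorted(documents, key=lambda d: d["quality_score"], reverse=True)
    n = len(ranked)
    top = ranked[: int(0.05 * n)]                    # best 5%: repeated ~7 times
    middle = ranked[int(0.05 * n): int(0.60 * n)]    # kept once
    # the bottom 40% is discarded entirely
    return top * 7 + middle
```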
C. olmOCR: Unlocking the PDF Matrix
PDFs are usually a nightmare for AI. They are visual soup. Ai2 used a tool called olmOCR. Instead of just copying the text, it renders the PDF as an image and uses a vision model to extract the text, preserving the structure, math formulas, and layout. This created a massive dataset of scientific papers that other models simply can't read.
The Architecture & Training
The model architecture is standard (Decoder-only Transformer), but with a twist to handle long conversations:
- Sliding Window Attention (SWA): Instead of looking at every previous word forever, the model looks at a window of 4096 tokens for 3 out of every 4 layers.
- Full Attention: Every 4th layer looks at the entire context.
- Tech Term Explainer: Sliding Window Attention 🤓 -> Imagine reading a book. Usually, you remember every word you've read (Full Attention). SWA is like only keeping the last 10 pages in your active memory, which saves massive amounts of computing power, while the occasional "Full Attention" layer lets you peek back at the start of the chapter.
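A rough sketch of that layer pattern as an attention mask. Only the 3-in-4 windowed / 1-in-4 full split is from the paper; which specific layer indices get full attention is an assumption here.

```python
# Sketch of the SWA layer pattern: 3 of every 4 layers see only the last 4096 tokens,
# every 4th layer sees the whole (causal) context.
import torch

def attention_mask(layer_idx: int, seq_len: int, window: int = 4096) -> torch.Tensor:
    """Boolean mask where True means query position i may attend to key position j."""
    i = torch.arange(seq_len).unsqueeze(1)    # query positions (column)
    j = torch.arange(seq_len).unsqueeze(0)    # key positions (row)
    causal = j <= i                           # can only look backwards
    if (layer_idx + 1) % 4 == 0:              # every 4th layer: full causal attention
        return causal
    return causal & (i - j < window)          # otherwise: only the last `window` tokens

mask = attention_mask(layer_idx=0, seq_len=16, window=8)  # tiny demo sizes
```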
The Result:
Olmo 3 Base 32B outperforms Llama 3.1 8B and Qwen 2.5 7B on math and code, and rivals the 32B parameter class leaders.
You might notice Olmo 3 compares itself to Qwen 3 VL in the benchmarks. Why compare a text model to a Vision-Language model? In this Reddit thread, the Olmo authors revealed a pro-tip: Qwen 3 VL is "secretly an amazing text-only model," especially at the 32B size. By beating it, Olmo 3 isn't just winning against other text models; it's beating the best multimodal heavyweights, too.
4. The Mid-Training Arc
After pretraining on the generic web, the model is smart, but unfocused. Enter Stage 2: Midtraining.
They trained for another 100 Billion tokens on a specific mix called Dolma 3 Dolmino Mix. This stage is all about "Capability Boosts."
The "Microanneal" Method
How do you know if a new dataset is good without spending $100k training a model? You use Microannealing.
- Take a dataset.
- Train a small proxy model for just 5B to 10B tokens.
- See if it gets smarter.
- If yes, add it to the main pile.
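In pseudocode, the vetting loop looks roughly like this. Every helper here is a hypothetical placeholder standing in for Ai2's internal training and eval harness, not their actual code.

```python
# Microannealing sketch: vet each candidate dataset with a short, cheap training run
# before committing it to the expensive main mix.
def microanneal(candidates, base_checkpoint, train_short, evaluate,
                budget_tokens=5_000_000_000):
    """train_short / evaluate are stand-ins for the real training and benchmark harness."""
    baseline = evaluate(base_checkpoint)         # benchmark score before any anneal
    accepted = []
    for dataset in candidates:
        proxy = train_short(base_checkpoint, dataset, max_tokens=budget_tokens)
        if evaluate(proxy) > baseline:           # did the cheap run make it smarter?
            accepted.append(dataset)             # promote it to the main midtraining mix
    return accepted
```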
The Data Enhancements (Math & Code)
The team found that existing open datasets had restrictive licenses (often because they were generated by Llama, which has a specific license). So, they recreated them:
- CraneMath: A fully open recreation of "SwallowMath." They took raw math web data and used Qwen 3 (which has a permissive license) to rewrite the math problems to be clearer.
- TinyMATH: They took the famous MATH benchmark and synthetically generated 100 new variations of every single problem.
- Stack-Edu (FIM): They took coding data and applied Fill-In-The-Middle (FIM) transformations.
- Tech Term Explainer: FIM 🤓 -> Instead of just predicting the next line of code, the model is given the top and bottom of a function and asked to write the middle. This is crucial for coding assistants like Copilot.
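Here's a minimal sketch of what a FIM transformation does to a code sample. The sentinel strings are illustrative; the real tokens depend on the tokenizer used for Stack-Edu.

```python
# Fill-In-The-Middle sketch: cut a file into prefix / middle / suffix and reorder it
# so the model learns to generate the middle given both sides.
import random

def to_fim(code: str) -> str:
    lines = code.splitlines(keepends=True)
    if len(lines) < 3:
        return code                                            # too short to split usefully
    a, b = sorted(random.sample(range(1, len(lines)), 2))      # two random cut points
    prefix, middle, suffix = "".join(lines[:a]), "".join(lines[a:b]), "".join(lines[b:])
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{middle}"
```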
The Decontamination Protocol
This is vital. A lot of AI models "cheat" by accidentally training on the test questions.
Ai2 ran a massive N-gram Decontamination sweep. They scanned their training data for any 8-word sequences that matched the benchmark questions (like GSM8K or MMLU) and deleted them.
- Surprising Finding: Decontaminating GSM8K (math) actually improved performance. Why? Because the "test" versions of questions often have weird formatting that confuses the model during training.
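A simplified version of that 8-gram check looks like this. Ai2's real pipeline runs at corpus scale, and its text normalization isn't reproduced here.

```python
# N-gram decontamination sketch: flag a training document if it shares any 8-word
# sequence with a benchmark question.
def ngrams(text: str, n: int = 8) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_blocklist(benchmark_questions: list[str], n: int = 8) -> set:
    blocked = set()
    for question in benchmark_questions:
        blocked |= ngrams(question, n)
    return blocked

def is_contaminated(document: str, blocklist: set, n: int = 8) -> bool:
    return not blocklist.isdisjoint(ngrams(document, n))
```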
5. The Long Context
Most models get confused if you paste a 50-page document. Olmo 3 extends its context window from 8,192 tokens to 65,536 tokens (64k).
The Recipe for Long Context:
- YaRN (Yet another RoPE extensioN): A mathematical trick to stretch the model's "positional embeddings" (how it knows word #1 is different from word #1000) without breaking the model (see the config sketch after this list).
- Data Mix: They didn't just use long books. They used a 34% Long / 66% Short mix. If you only train on long stuff, the model forgets how to be concise.
- Synthetic Augmentation (CWE & REX):
- CWE (Common Word Extraction): They verify the model is paying attention by asking it to count specific words in a massive document.
- REX (Rewriting EXpressions): They ask the model to summarize a long document in a specific style (e.g., "Explain this 50-page paper like a 5-year-old").
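For a sense of what YaRN-style scaling looks like in practice, here is a hedged sketch using the generic rope_scaling configuration in Hugging Face transformers. Whether the released Olmo 3 checkpoints accept exactly these keys is an assumption, and the repo id is hypothetical; the factor of 8 simply matches the 8,192 → 65,536 stretch described above.

```python
# Sketch: extending a model's context with YaRN-style RoPE scaling via transformers'
# generic rope_scaling config. Key names follow transformers' conventions; Olmo 3
# support for this exact config is an assumption.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "allenai/Olmo-3-7B"  # hypothetical repo id
config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 8.0,                              # 8,192 * 8 = 65,536
    "original_max_position_embeddings": 8192,
}
config.max_position_embeddings = 65536
model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
```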
The Result: On the RULER benchmark (the gold standard for long context), Olmo 3 32B scores a 96.1 at 4k length and holds strong up to 64k, beating Llama 3.1 8B and Apertus.
6. Olmo 3 Think
This is the crown jewel. Olmo 3 Think is designed to compete with "reasoning" models that "think" before they speak (generating hidden chains of thought).
This uses a three-stage post-training recipe: SFT → DPO → RLVR.
Stage 1: Supervised Fine-Tuning (SFT) with Dolci Think
They curated a dataset called Dolci Think.
- Source: They took prompts from OpenThoughts, WildChat, and Tulu 3.
- The Trick: They generated reasoning traces (the "inner monologue") using strong models like QwQ-32B and DeepSeek R1.
- Filtering: They removed any reasoning chain that contained Chinese political values (a byproduct of using DeepSeek) or excessive repetition.
Stage 2: Preference Tuning with Delta Learning
This is where it gets technical and fascinating. They used DPO (Direct Preference Optimization).
- Tech Term Explainer: DPO 🤓 -> Instead of training a separate "Judge" model to score answers (like in RLHF), DPO uses the math of the model itself to optimize for "Answer A is better than Answer B."
The Innovation: Delta Learning 📉
Usually, you want to show the model a Great Answer and a Good Answer. Ai2 found that didn't work well for reasoning.
Instead, they used Delta Learning: They pair a Great Answer (from a smart model like Qwen 32B) with a Terrible Answer (from a tiny model like Qwen 0.6B).
- Why? The massive gap (the "Delta") creates a much stronger learning signal. It screams at the model: "BE LIKE THIS, NOT LIKE THAT."
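A sketch of how such pairs could be assembled. The generator callables are placeholders for whatever inference stack you use; only the strong-vs-tiny pairing idea comes from the paper.

```python
# Delta Learning sketch: build DPO preference pairs with a deliberately huge quality gap
# by pairing a strong model's answer (chosen) with a tiny model's answer (rejected).
def build_delta_pairs(prompts, strong_generate, weak_generate):
    """strong_generate / weak_generate: callables mapping a prompt to a completion,
    e.g. a 32B-class model vs. a 0.6B-class model behind any inference backend."""
    return [
        {
            "prompt": p,
            "chosen": strong_generate(p),    # great answer from the strong model
            "rejected": weak_generate(p),    # terrible answer from the tiny model
        }
        for p in prompts
    ]  # rows in the {prompt, chosen, rejected} format DPO trainers expect
```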
Stage 3: RL with Verifiable Rewards (The Cherry on Top)
Reinforcement Learning (RL) is usually hard because it's subjective. "Write a funny poem" is hard to grade. "Solve this math problem" is easy to grade.
Olmo 3 Think uses RLVR (Reinforcement Learning with Verifiable Rewards).
- Domains: Math, Code, and Instruction Following (IF).
- The Verifier: A piece of code that automatically checks if the answer is right.
- Math: Did you get the number right?
- Code: Did the code pass the unit tests?
- IF: Did you follow the constraint (e.g., "Use exactly 2 paragraphs")?
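The math domain is the easiest to picture. A toy reward function might look like this; real verifiers handle fractions, units, and LaTeX much more carefully, so this only shows the shape of the idea.

```python
# Toy verifiable reward for the math domain: compare the last number in the model's
# output against the reference answer. 1.0 = correct, 0.0 = wrong or no answer found.
import re

def math_reward(model_output: str, reference_answer: str) -> float:
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == reference_answer else 0.0
```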
The Infrastructure: OlmoRL
To train this, they built OlmoRL, enabling Active Sampling.
- Tech Term Explainer: Active Sampling 🤓 -> In RL, many generated answers yield zero reward (they are just wrong). These are useless for training. Active Sampling throws these out immediately and keeps requesting new answers until it fills a "batch" with useful, high-signal data. This sped up training by 4x.
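Conceptually, the sampling loop looks something like this. The rollout and grading callables are placeholders for the RL stack's own functions, not OlmoRL's actual API.

```python
# Active Sampling sketch: keep requesting prompts and discard the ones where every
# sampled completion earns zero reward (no learning signal), until the batch is full.
def fill_batch(prompt_stream, sample_completions, reward, batch_size):
    batch = []
    for prompt in prompt_stream:
        completions = sample_completions(prompt)
        rewards = [reward(prompt, c) for c in completions]
        if max(rewards) == 0.0:        # every attempt failed: useless for training
            continue                   # throw it away, request a new prompt
        batch.append((prompt, completions, rewards))
        if len(batch) == batch_size:
            break
    return batch
```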
Key Finding: RL works best when applied after DPO. If you skip DPO, the model isn't "primed" enough to learn from the RL signal.
7. Olmo 3 Instruct
While "Think" models are cool, sometimes you just want a quick answer. Olmo 3 Instruct is built for speed and utility.
Function Calling (Tool Use)
A modern assistant needs to use tools (calculators, web search). Ai2 trained Olmo 3 Instruct on two types of data:
- Real Interactions: Trajectories from agents using MCP (Model Context Protocol) servers to search the web or read scientific papers.
- SimFC (Simulated Function Calling): They synthesized 200k conversations where the user asks for something, and the AI "pretends" to call a tool.
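To make the "simulated" part concrete, a single SimFC-style training example might look roughly like this. The schema and tool definition below are assumptions for illustration, not Ai2's exact format.

```python
# Illustrative shape of one simulated function-calling conversation.
# Field names and the tool schema are assumptions; the real SimFC format may differ.
simfc_example = {
    "tools": [{
        "name": "get_weather",
        "parameters": {"city": {"type": "string"}},
    }],
    "messages": [
        {"role": "user", "content": "What's the weather in Seattle right now?"},
        {"role": "assistant", "tool_call": {"name": "get_weather",
                                            "arguments": {"city": "Seattle"}}},
        {"role": "tool", "name": "get_weather",
         "content": '{"temp_c": 11, "conditions": "rain"}'},
        {"role": "assistant", "content": "It's about 11°C and raining in Seattle."},
    ],
}
```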
Controlling Length Bias
Here is a weird quirk of AI: Models think "Longer = Better." DPO usually makes models yammer on forever.
For the Instruct model, Ai2 applied Length Control. During the DPO phase, they penalized the model if the "Chosen" answer was significantly longer than the "Rejected" answer.
- Result: A model that is concise, punchy, and doesn't write a novel when you ask for a sentence.
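The article describes a penalty applied during the DPO phase; a simpler data-level filter that captures the same intent (drop pairs where "chosen" is much longer than "rejected") looks like this. The 1.5× ratio is an illustrative threshold, not Ai2's setting.

```python
# Length-control sketch: remove preference pairs where the "chosen" answer is much
# longer than the "rejected" one, so the model can't learn that longer means better.
def length_controlled(pairs, max_ratio=1.5):
    kept = []
    for pair in pairs:
        chosen_len = len(pair["chosen"].split())
        rejected_len = max(len(pair["rejected"].split()), 1)
        if chosen_len / rejected_len <= max_ratio:
            kept.append(pair)
    return kept
```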
8. Olmo 3 RL-Zero: The Science Experiment
Here is the most scientifically interesting part of the release.
Olmo 3 RL-Zero is a model trained via Reinforcement Learning directly from the Base model, skipping SFT entirely.
Why do this?
DeepSeek proved you can get reasoning behaviors purely from RL (DeepSeek R1-Zero). Ai2 wanted to reproduce this in a fully open environment to prove it wasn't a fluke or result of data contamination.
The Experiment:
- They took Olmo 3 Base.
- They fed it math problems.
- They gave it a binary reward: 1 for correct, 0 for wrong.
- The Result: The model learned to reason. It started generating "thoughts" to solve problems, improving its score on the AIME math benchmark significantly.
The "Spurious Reward" Check:
To prove the model wasn't just memorizing answers, they ran a "Negative Control." They gave the model Random Rewards (rewarding it for nothing). The model did not improve. This proves the gains in RL-Zero are real, genuine learning, not just dataset leakage.
Check out the launch livestream they did with Hugging Face for more juicy insights.
9. The Infrastructure & Conclusion
This release isn't just about the models; it's also about the code that built them.
Olmo-Core:
They released the training code. It is fast.
- 7B model training speed: 7,700 tokens/second/GPU.
- This utilizes PyTorch FSDP2 (Fully Sharded Data Parallel) and custom kernels to squeeze every ounce of juice out of the H100 GPUs.
The Economics of Open Source
Usually, the cost to train these models is a closely guarded corporate secret. Because Ai2 is fully open, the authors spilled the beans on Reddit.
- The Cost: Training the 7B model cost roughly $500,000 (about 220k H100-hours).
- The 32B: Estimated around $2.25 million.
Considering this buys you a state-of-the-art model and the recipe to recreate it, that is remarkably efficient compared to the hundreds of millions burnt by closed labs.
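A quick sanity check on those figures:

```python
# Back-of-the-envelope check on the reported numbers: ~$500k over ~220k H100-hours
# works out to roughly $2.3 per GPU-hour, in line with bulk cloud pricing for H100s.
cost_7b_usd = 500_000
h100_hours_7b = 220_000
print(round(cost_7b_usd / h100_hours_7b, 2))  # ~2.27 USD per H100-hour
```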
What’s Missing? (The MoE Roadmap)
If you're wondering where the Mixture of Experts (MoE) version is (an efficient architecture that routes each input to specific "expert" sub-models rather than activating the entire network for every task), the Ai2 team confirmed on Reddit that it is actively on the roadmap. One researcher called it "one of my regrets" that it didn't land this year, so expect an efficient, sparse Olmo-MoE in 2026.
The Verdict...
Olmo 3 Think-32B is currently the strongest fully open thinking model on the planet. It beats Qwen 2.5-32B-Instruct. It beats Google's Gemma 2 27B. It narrows the gap to the closed-source giants.
But more importantly, Ai2 just handed the keys to the kingdom to every researcher, developer, and student. You don't just get the brain; you get the memories, the textbooks, and the teachers.
Open Source just got a whole lot more open. And for that, the entire world should be thankful.