Three months ago, Google released Gemini 3 Pro and it was... good. Competitive. The kind of release where you nod and say "nice" but don't switch your workflow.
Today's Gemini 3.1 Pro is a different story. The ".1" makes it sound like a minor tune-up, like going from iOS 17 to 17.1. But this is closer to trading in your Honda Civic for a Tesla. Same parking spot, wildly different engine.
The headline number: on ARC-AGI-2, which tests whether an AI can solve logic puzzles it's never seen before (think IQ test for machines), Gemini 3.1 Pro scored 77.1%. Gemini 3 Pro? That scored around 31-37%. That's more than double the reasoning ability in three months.
And here's the kicker: it costs exactly the same. $2 per million input tokens, $12 per million output tokens. Cheaper than Claude's Sonnet 4.6. Same price as the model it replaces.
So what changed? And more importantly, should you care?
The Benchmarks Tell a Story (For Once)
Normally when AI companies drop benchmark scores, it's the equivalent of car companies bragging about horsepower; cool on paper, doesn't tell you how it handles in traffic. But the specific benchmarks where 3.1 Pro improved actually reveal what Google was working on.
Reasoning got a massive upgrade. On Humanity's Last Exam, a collection of the hardest questions experts could come up with, 3.1 Pro now leads without tools. On ARC-AGI-2, that 77.1% score puts it in striking distance of dedicated reasoning systems (Google's own Deep Think mode hit 84.6% just last week).
To put this in perspective: when Gemini 3 Pro launched in November, its ARC-AGI-2 score was around 37.5%, which was already considered a big jump over GPT 5.1 at the time. Now 3.1 Pro has more than doubled that. As one Reddit commenter put it: "77% ARC-AGI 2 is actually crazy. Only a few months ago we were talking about how good 31% is."
Hallucinations dropped significantly. This one matters for anyone who's ever had ChatGPT or Gemini confidently make something up. Google's model card shows meaningful improvements in factual grounding. One Redditor called it "probably the best improvement so far." For businesses thinking about deploying AI in customer-facing roles, less hallucination is often more important than higher IQ scores.
Coding got better too, but it's complicated. On standard benchmarks, 3.1 Pro now matches or beats Claude's Opus and Sonnet 4.6 for coding. But there's a gap between benchmark performance and real-world coding. Several developers in the Reddit thread noted that Gemini is "very good at one-shotting things" (solving a problem in a single attempt) but can struggle in longer, iterative coding sessions compared to Claude Code or OpenAI's Codex. One developer flagged a persistent issue with Gemini CLI accidentally deleting chunks of code during file edits, though that may be a tooling problem rather than a model one.
Token efficiency improved. According to Artificial Analysis benchmarks, 3.1 Pro achieved its leading scores with only about 2 million extra tokens compared to the previous version, costing roughly $25-27 more. Compare that to Anthropic's Sonnet 4.6, which is notably more "token hungry." You're getting better results without a bigger bill.
What's Actually New Under the Hood
The architecture hasn't changed; Gemini 3.1 Pro is still a sparse mixture-of-experts (MoE) transformer model. Same 1 million token context window in, 64,000 tokens out. Same January 2025 knowledge cutoff.
What did change is how the model thinks.
Three thinking levels instead of two. Gemini 3 Pro only had "low" and "high" thinking modes. 3.1 Pro adds "medium", giving developers more control over the speed-vs-intelligence tradeoff. Think of it like gears in a car:
- Low: Quick, efficient responses. Great for simple questions, chatbots, high-throughput applications.
- Medium (new): Balanced. Good enough reasoning for most tasks without the latency hit.
- High: Full reasoning power. The model will think for minutes on hard problems. Sam Witteveen tested it on an International Math Olympiad problem and it took over 8 minutes to get the right answer (versus 17+ minutes for the full Deep Think model). On "low," it was much faster but got the answer wrong.
This matters because thinking level = cost. More thinking = more tokens = higher bills. The "medium" option gives developers a sweet spot that didn't exist before.
Lessons from Deep Think baked in. Google's Deep Think mode, which uses advanced "parallel thinking techniques" to explore multiple solution paths simultaneously, was updated just last week. That update hit 84.6% on ARC-AGI-2 and 48.4% on Humanity's Last Exam. Now those same reinforcement learning techniques are trickling down into the base 3.1 Pro model. As Sam Witteveen noted in his breakdown: "If you have thinking set to high, this acts almost like a mini version of Gemini Deep Think."
A Gemini team member (Kish Anand) previously explained this pipeline: "We have had a lot of exciting research progress on agentic RL which made its way into Flash but was too late for Pro." With 3.1 Pro, those techniques have finally arrived.
Thought signatures for developers. This is a new, more technical feature. When Gemini 3.1 Pro reasons through a problem via the API, it generates encrypted "thought signatures" that represent its internal reasoning state. If you're building a multi-step application (like an AI agent that calls tools), you pass these signatures back to maintain context between steps. The developer guide goes deep on this.
A separate endpoint for custom tools. If you're building AI agents that use your own custom tools alongside standard bash commands, there's now a dedicated model variant called gemini-3.1-pro-preview-customtools that's optimized to prioritize your tools over default behaviors. Small detail, big deal for developers building production agents.
Where You Can Use It (It's Everywhere)
Google isn't being shy about distribution. Gemini 3.1 Pro is rolling out to:
- Gemini app (Google AI Pro and Ultra subscribers, with higher usage limits)
- Google AI Studio (free to try: try it here)
- Vertex AI and Gemini Enterprise (for business customers)
- Gemini CLI (command-line coding tool)
- Google Antigravity (Google's agentic development platform)
- Android Studio (for mobile developers)
- NotebookLM (exclusively for Pro and Ultra users)
- GitHub Copilot (rolling out to Pro, Pro+, Business, and Enterprise users)
That last one is notable. GitHub announced that Gemini 3.1 Pro is now available in Copilot across VS Code, Visual Studio, github.com, and GitHub Mobile. In early testing, the model "excels on effective and efficient edit-then-test loops with high tool precision, achieving strong resolution success with fewer tool calls per benchmark."
Legal AI company Harvey has also been testing it, with their AI and Applied Legal Research teams evaluating long-context handling, multimodal reasoning, and performance in ideation and writing tasks.
AI Studio Just Became a Full-Stack Coding Tool
Alongside 3.1 Pro, Google updated AI Studio to support full-stack development; servers, databases, and even multiplayer functionality. Plus, Google's Antigravity agent (their agentic coding assistant) is now built directly into AI Studio's "Build" tab.
Peter Yang, a product manager who's been using AI Studio for six months, walked through a prototyping workflow showing how this changes product development. His key insight: instead of writing specs → designing → building → testing with users (the traditional "waterfall" approach), you can now prototype → test with users → then write specs. Code is now cheaper to produce than a Google Doc.
His five-step process:
- Build a base template by screenshotting your current product and having AI recreate it.
- Prototype a new feature by describing what you want changed.
- Iterate with AI by giving feedback ("move the model picker into the chat box").
- Collaborate by sharing the prototype link with stakeholders and real users (it works like sharing a Google Doc).
- Go to production only after you've validated the idea.
You can build apps with React, Next.js, or Angular. Connect Google APIs like Nano Banana (image generation) natively. Set secrets for full-stack deployments. And push to GitHub when you're ready.
The catch: multiple developers report the build agent is still "relatively slow" and the sandbox environment has growing pains. But as a free prototyping tool, it's hard to beat.
The Competitive Landscape: Musical Chairs for the Top Spot
The AI model leaderboard is starting to look like a game of musical chairs. Claude takes the crown, then Gemini, then GPT, then Claude again; each holding the top spot for a few weeks before the next release.
Artificial Analysis, an independent benchmarking firm that runs its own evaluations across 10 different tests, just updated their rankings. The results paint a clear picture of where things stand:
Overall Intelligence Index (composite of reasoning, knowledge, coding, and agentic tasks):
- Gemini 3.1 Pro: 57 (🥇)
- Claude Opus 4.6 (max): 53 (🥈)
- Claude Sonnet 4.6 (max): 51 (🥉)
- GPT-5.2 (xhigh): 51
Coding (Terminal-Bench Hard, SciCode):
- Gemini 3.1 Pro: 56 (🥇)
- Claude Sonnet 4.6 (max): 51
- GPT-5.2 (xhigh): 49
- Claude Opus 4.6 (max): 48
Agentic tasks (GDPval-AA, τ²-Bench Telecom):
- Claude Opus 4.6 (max): 68 (🥇)
- Claude Opus 4.6: 64
- GPT-5.2 (xhigh): 60
- Gemini 3.1 Pro: 59
Hallucination resistance (AA-Omniscience Index):
- Gemini 3.1 Pro: 30 (🥇, by a mile)
- Gemini 3 Pro: 13
- Claude Opus 4.6 (max): 11
The pattern is clear: Gemini 3.1 Pro leads on raw intelligence, coding, and factual accuracy. Claude dominates agentic workflows where the model needs to plan and execute multi-step tasks autonomously. GPT-5.2 sits solidly in the middle of both.
The pricing makes the picture even more interesting. Gemini 3.1 Pro runs $4.50 per million tokens (blended), GPT-5.2 is $4.80, Claude Sonnet 4.6 is $6, and Claude Opus 4.6 is $10. Google is offering the top-ranked model at the lowest price among the Big Three.
But benchmark scores don't tell the whole story. LMArena scores (which measure real user preferences in blind comparisons) show 3.1 Pro as only marginally better than Gemini 3 Pro in practical use, and still trailing Claude in extended text and code tasks. Benchmarks and vibes don't always agree.
As one commenter summarized: "I think it's going to get to a point where it's just about what you prefer and they're all amazing." We're getting close to that point.
So Should You Switch?
Here's the practical guidance:
- If you're already using Gemini 3 Pro, this is a no-brainer upgrade. Same price, better everything. Google is recommending it as a direct drop-in replacement.
- If you're building AI agents, the new thinking levels, thought signatures, and custom tools endpoint make 3.1 Pro worth serious testing. The combination of strong reasoning, lower cost than Claude, and native Google ecosystem integration is compelling.
- If you're a developer using GitHub Copilot, turn on Gemini 3.1 Pro in your model picker and test it against Claude and GPT for your specific codebase. Early results suggest it's efficient with fewer tool calls, but your mileage may vary for complex, multi-file edits.
- If you're using AI for research or analysis, the improved hallucination rates and 1M token context window make this a strong option for document-heavy workflows. Harvey's legal team is testing it for exactly this purpose.
- If you want to try it right now, the fastest path is Google AI Studio (free, no subscription needed). Play with the thinking levels. Try asking it something hard on "high" and watch it think for a few minutes; it's genuinely impressive.
- The one caveat: this is still a preview release. Google hasn't made it generally available yet, which means performance could shift before the final version. As one skeptic noted: "Impressive, but still just in preview, meaning no performance guarantees." That's fair. Test it, don't bet your production pipeline on it yet.
The Bigger Picture
What's most interesting about 3.1 Pro isn't any single benchmark score. It's that Google shipped a model that's both smarter and the same price. That used to be a tradeoff: you paid more for more intelligence. Now the cost frontier is collapsing.
Google's Q4 earnings showed the Gemini app at 750 million monthly active users, close to ChatGPT's 800 million. Even if Gemini isn't the best model on every benchmark, it powers AI Mode in Google Search, which is a real revenue engine. The company's strategy is clear: make the best generalist model (while competitors focus heavily on coding), distribute it everywhere, and let the Google ecosystem do the rest.
Three months from now, Anthropic or OpenAI will probably reclaim the top spot on some benchmark. That's the cycle. But the floor keeps rising. What was mind-blowing in November is now the baseline. And the gap between the top models keeps getting smaller, while the gap between today's AI and last year's AI keeps getting bigger.
For users, that's the real story. The best AI tool is whatever works best for your workflow. And right now, you have more genuinely excellent options than ever.
Try Gemini 3.1 Pro for free in AI Studio →
Additional Resources: