Claude Opus 4.7 has reclaimed the top spot on SWE-bench. It beat GPT-5.4 and Gemini 3.1 Pro across coding benchmarks to get back to number one — but once you actually use it, your token budget is going to feel the pain.

At a Glance

What: Anthropic's latest flagship model, Claude Opus 4.7, released April 16, 2026

The gist: SWE-bench Pro score of 64.3% reclaims the coding top spot, vision resolution up 3x, agentic workflow performance improved by 14%

The catch: A new tokenizer turns the same input into up to 1.35x as many tokens, and output tokens spike significantly during high-effort reasoning

What Is It?

Opus 4.7, released by Anthropic on April 16, is a direct upgrade to its predecessor, Opus 4.6. Anthropic's core pitch: you can hand it the hardest coding tasks without anyone watching over it.

In practice, the model's self-verification abilities are impressive. In one test, it built a text-to-speech engine from scratch in Rust, then ran the audio it generated through a separate speech recognizer to verify it matched a Python reference implementation — autonomously. That's senior-engineer-level work covering months of effort, done without human supervision.

Key shift: Opus 4.7 follows instructions literally. Where previous models would loosely interpret your prompts, this one executes them precisely — which means prompts that worked fine before may produce unexpected results. Anthropic officially recommends re-tuning your prompts.

Pricing matches Opus 4.6: $5 per million input tokens, $25 per million output tokens. It's available right now on the Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.

What Changes?

Let's start with the numbers. Opus 4.7 isn't first everywhere, but it holds a clear lead in the areas developers actually care about.

| Benchmark | Opus 4.6 | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-bench Verified | 80.8% | 87.6% | - | 80.6% |
| SWE-bench Pro | 53.4% | 64.3% | 57.7% | 54.2% |
| MCP-Atlas (Tool Use) | 75.8% | 77.3% | 68.1% | 73.9% |
| OSWorld (Computer Use) | 72.7% | 78.0% | 75.0% | - |
| GPQA Diamond (Reasoning) | 91.3% | 94.2% | 94.4% | 94.3% |
| BrowseComp (Web Search) | 83.7% | 79.3% | 89.3% | 85.9% |
| GDPVal-AA (Knowledge Work Elo) | - | 1,753 | 1,674 | 1,314 |

Clear number one in coding and tool use, effectively tied in pure reasoning, and actually down 4.4 points in web search (BrowseComp). This isn't a do-everything model — it's purpose-built for coding and agentic work.

Heads Up: Opus 4.7 (79.3%) actually scores lower than 4.6 (83.7%) on BrowseComp. If you're running an agent where web research is central, GPT-5.4 Pro (89.3%) or Gemini 3.1 Pro (85.9%) may be a better fit.
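The comparisons above are simple arithmetic on the benchmark table, and a short script makes them easy to re-check. This is a minimal sketch: the scores are copied from the table, the GDPVal-AA row is left out because Opus 4.6 has no entry there, and the function names are illustrative.

```python
# Scores copied from the benchmark table; None marks a missing entry.
SCORES = {
    "SWE-bench Verified": {"Opus 4.6": 80.8, "Opus 4.7": 87.6, "GPT-5.4": None, "Gemini 3.1 Pro": 80.6},
    "SWE-bench Pro": {"Opus 4.6": 53.4, "Opus 4.7": 64.3, "GPT-5.4": 57.7, "Gemini 3.1 Pro": 54.2},
    "MCP-Atlas (Tool Use)": {"Opus 4.6": 75.8, "Opus 4.7": 77.3, "GPT-5.4": 68.1, "Gemini 3.1 Pro": 73.9},
    "OSWorld (Computer Use)": {"Opus 4.6": 72.7, "Opus 4.7": 78.0, "GPT-5.4": 75.0, "Gemini 3.1 Pro": None},
    "GPQA Diamond (Reasoning)": {"Opus 4.6": 91.3, "Opus 4.7": 94.2, "GPT-5.4": 94.4, "Gemini 3.1 Pro": 94.3},
    "BrowseComp (Web Search)": {"Opus 4.6": 83.7, "Opus 4.7": 79.3, "GPT-5.4": 89.3, "Gemini 3.1 Pro": 85.9},
}


def delta_vs_predecessor(benchmark):
    """Percentage-point change from Opus 4.6 to Opus 4.7 on one benchmark."""
    row = SCORES[benchmark]
    return round(row["Opus 4.7"] - row["Opus 4.6"], 1)


def leader(benchmark):
    """Model with the highest reported score, ignoring missing entries."""
    reported = {model: s for model, s in SCORES[benchmark].items() if s is not None}
    return max(reported, key=reported.get)


for name in SCORES:
    print(f"{name}: {delta_vs_predecessor(name):+.1f} points vs 4.6, leader: {leader(name)}")
```

Run on the table's numbers, the only negative delta is BrowseComp, and Opus 4.7 leads every row except GPQA Diamond and BrowseComp.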

Vision That's 3x Sharper

Image processing resolution is now up to 2,576px on the long side (roughly 3.75 megapixels) — more than triple the previous model's capability. Autonomous security testing firm XBOW confirmed visual accuracy jumped from 54.5% to 98.5%. Screenshot-reading computer use agents, complex technical diagram interpretation, dense UI navigation — things that were previously too blurry to work with are now fair game.
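If you prepare screenshots yourself before sending them, the long-side limit translates into a small sizing calculation. The sketch below assumes you want to pre-size images client-side to the 2,576px figure quoted above; the function name is hypothetical, and whether the API resizes larger images on its own is not covered here.

```python
MAX_LONG_SIDE = 2576  # px, the long-side figure quoted in the article


def fit_long_side(width, height, max_long_side=MAX_LONG_SIDE):
    """Scale (width, height) down so the longer side fits the limit.

    Aspect ratio is preserved and images already within the limit are
    returned unchanged (no upscaling).
    """
    long_side = max(width, height)
    if long_side <= max_long_side:
        return width, height
    scale = max_long_side / long_side
    return max(1, round(width * scale)), max(1, round(height * scale))


w, h = fit_long_side(5152, 2912)
print(w, h, f"{w * h / 1_000_000:.2f} MP")
```

For a 5152 x 2912 capture this yields 2576 x 1456, which works out to about 3.75 megapixels, in line with the figure above.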

Real Gains in Agentic Workflows

There are changes here that don't reduce to a single number.

+14% (Notion): Multi-step workflow success rate improved, with tool errors down to one-third
3x (Rakuten): three times as many production tasks solved on SWE-Bench as 4.6
70% (Cursor): CursorBench score (4.6 scored 58%), major gains in autonomous coding

The CEO of Cognition (Devin) said: "4.7 works consistently for hours and doesn't give up on hard problems." Factory Droids noted that "a model that used to stop halfway through now sees things to completion," and Replit's CEO described it as "like a colleague who pushes back in technical discussions."

The Token Cost Shadow

Here's the thing — Opus 4.7 genuinely thinks more, and it definitely spends more.

Two reasons token usage is up:
1. New tokenizer: the same input now converts to between 1.0x and 1.35x as many tokens.
2. Deeper reasoning — output tokens spike significantly, especially in later turns of agentic sessions.
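To get a feel for what the tokenizer change alone does to a bill, here is a rough sketch using the list prices quoted earlier ($5 and $25 per million tokens) and the 1.0x to 1.35x range. It applies the factor to input tokens only; the increase in output tokens is not modeled because the article gives no figure for it. The function names are illustrative.

```python
INPUT_PRICE_PER_MTOK = 5.00    # USD per million input tokens
OUTPUT_PRICE_PER_MTOK = 25.00  # USD per million output tokens
TOKENIZER_FACTOR_RANGE = (1.0, 1.35)  # reported range for the new tokenizer


def cost_usd(input_tokens, output_tokens):
    """List-price cost of a workload in USD."""
    total = input_tokens * INPUT_PRICE_PER_MTOK + output_tokens * OUTPUT_PRICE_PER_MTOK
    return total / 1_000_000


def cost_range_after_migration(old_input_tokens, output_tokens):
    """(best case, worst case) cost when only the input token count inflates."""
    low, high = TOKENIZER_FACTOR_RANGE
    return (
        cost_usd(old_input_tokens * low, output_tokens),
        cost_usd(old_input_tokens * high, output_tokens),
    )


best, worst = cost_range_after_migration(1_000_000, 200_000)
print(f"${best:.2f} to ${worst:.2f}")
```

For a workload that used to be 1M input tokens and 200K output tokens, this gives a range of $10.00 to $11.75. Treat it as a floor: any extra output from deeper reasoning is billed at the higher output rate on top of this.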

In Decrypt's real-world testing, a single session burned through the entire token quota. The model would finish all the code, then — labeled as "bug fixes and improvements" — rewrite the whole thing from scratch. Then do it again. This is behavior that never showed up with Opus 4.6.

Anthropic is aware of the issue and has introduced a new effort parameter and task budget to address it.

| Effort Level | Characteristics | Best For |
|---|---|---|
| low/medium | Fast responses, minimal reasoning | Simple queries, data transformation |
| high | Balanced reasoning | General coding, analysis |
| xhigh (new) | Deep reasoning, between high and max | Complex agentic coding (Claude Code default) |
| max | Maximum reasoning, maximum tokens | Only for the hardest problems |

Task budget is in public beta and lets you cap an agent's token usage to prevent surprise bills.
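One way to keep effort choices consistent across a codebase is to encode the guidance above as a lookup. This is only a sketch: the task categories and the function name are made up for illustration, the table groups low and medium together so picking "low" for that row is an assumption, and how the chosen level is actually passed to the API is left to the official documentation.

```python
# Effort guidance from the article, keyed by hypothetical task categories.
EFFORT_GUIDE = [
    ("low", {"simple_query", "data_transformation"}),  # table lists low/medium together
    ("high", {"general_coding", "analysis"}),
    ("xhigh", {"agentic_coding"}),
    ("max", {"hardest_problem"}),
]


def suggest_effort(task_kind, default="high"):
    """Return the effort level suggested for a task category.

    Unknown categories fall back to `default` rather than raising, so a
    new task type does not silently land on the most expensive level.
    """
    for level, kinds in EFFORT_GUIDE:
        if task_kind in kinds:
            return level
    return default


print(suggest_effort("agentic_coding"))
print(suggest_effort("something_else"))
```

Defaulting to "high" rather than "max" follows the article's advice to reserve max for the hardest problems.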

Getting Started

Here's what you need to know when migrating from Opus 4.6 to 4.7.

  1. Start with prompt re-tuning
    4.7 follows instructions literally. Loosely worded "just figure it out" prompts can produce unexpected results. Test with representative traffic before switching over.
  2. Set your effort level
    For coding and agentic tasks, start with high or xhigh. Reserve max for your hardest problems. Claude Code defaults to xhigh.
  3. Measure token costs
    The new tokenizer means the same input can consume up to 35% more tokens. Measure your cost delta on real traffic before committing to a full rollout.
  4. Use Task Budget
    For long-running agents, use the API's task budget (beta) to set a token ceiling and avoid unexpected charges.
  5. Be careful with web search agents
    BrowseComp scores dropped, so for research-heavy workflows, it's worth evaluating GPT-5.4 Pro alongside Opus 4.7.
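Until you have wired up the API's task budget, a client-side ceiling can serve as a stopgap for step 4. The sketch below is not the task budget feature itself. It is a hypothetical local guard that adds up the token usage reported for each turn and stops the loop once a ceiling is crossed; the class and function names are illustrative.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when cumulative token usage passes the configured ceiling."""


@dataclass
class TokenBudget:
    limit: int
    used: int = 0

    @property
    def remaining(self) -> int:
        return max(0, self.limit - self.used)

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        """Record one turn's usage and raise if the ceiling is crossed."""
        self.used += input_tokens + output_tokens
        if self.used > self.limit:
            raise BudgetExceeded(f"used {self.used} of {self.limit} tokens")


def run_turns(budget, turn_usages):
    """Charge each (input, output) pair in order.

    Returns how many turns completed within the ceiling. In a real agent
    loop the pairs would come from the usage data of each API response.
    """
    completed = 0
    for input_tokens, output_tokens in turn_usages:
        try:
            budget.charge(input_tokens, output_tokens)
        except BudgetExceeded:
            break
        completed += 1
    return completed


budget = TokenBudget(limit=1000)
print(run_turns(budget, [(300, 200), (200, 100), (400, 200), (10, 10)]))
```

Note that a guard like this only notices the overrun after the turn that caused it has already been billed, so set the ceiling with some headroom below your real limit.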

Other Features Shipping Alongside It

There are a few more updates that shipped with Opus 4.7.

1. /ultrareview: A dedicated review session in Claude Code that checks your changes at a senior reviewer level. Pro and Max users get three free sessions.
2. Auto Mode expanded: Max users now have access to Auto Mode, where Claude makes decisions autonomously. Long tasks run without interruption.
3. Cyber Verification Program: A credentialing program that gives security professionals (penetration testers, vulnerability researchers, etc.) access to Opus 4.7's cybersecurity capabilities.

Deep Dive Resources

Anthropic Official Announcement: The full release document, including benchmarks, safety profile, and migration guide for Opus 4.7. (anthropic.com)

Vellum Benchmark Analysis: Detailed comparisons across key benchmarks (SWE-bench, MCP-Atlas, GPQA Diamond) and recommendations by migration scenario. (vellum.ai)

Decrypt Hands-On Review: Real test results using a game-building prompt. Highest quality output yet, but it burned through the entire token quota in a single session. (decrypt.co)

VentureBeat Deep Dive: Migration strategy from an enterprise perspective, plus analysis of Anthropic's market positioning. (venturebeat.com)

TNW Tech Summary: A concise tech media rundown of pricing, availability, and key benchmarks. (thenextweb.com)

Claude Opus 4.7 Migration Guide: What to watch for when switching from 4.6 to 4.7, and how to tune effort levels. (platform.claude.com)