
Claude Opus 4.7 — Back on Top of SWE-bench, But It'll Drain Your Token Budget


Introducing Claude Opus 4.7 (Anthropic official announcement)

Anthropic releases Claude Opus 4.7, narrowly retaking lead for most powerful GA LLM

Claude Opus 4.7 leads on SWE-bench and agentic reasoning

Claude Opus 4.7 has reclaimed the top spot on SWE-bench. It beat GPT-5.4 and Gemini 3.1 Pro across coding benchmarks to get back to number one — but once you actually use it, your token budget is going to feel the pain.

At a Glance

What: Anthropic's latest flagship model, Claude Opus 4.7, released April 16, 2026

The gist: SWE-bench Pro score of 64.3% reclaims the coding top spot, vision resolution up 3x, agentic workflow performance improved by 14%

The catch: A new tokenizer converts the same input into up to 1.35x as many tokens, and output tokens spike significantly during high-effort reasoning

What Is It?

Opus 4.7, released by Anthropic on April 16, is a direct upgrade to its predecessor, Opus 4.6. Anthropic's core pitch: you can hand it the hardest coding tasks without anyone watching over it.

In practice, the model's self-verification abilities are impressive. In one test, it built a text-to-speech engine from scratch in Rust, then ran the audio it generated through a separate speech recognizer to verify it matched a Python reference implementation — autonomously. That's senior-engineer-level work covering months of effort, done without human supervision.

Key shift: Opus 4.7 follows instructions literally. Where previous models would loosely interpret your prompts, this one executes them precisely — which means prompts that worked fine before may produce unexpected results. Anthropic officially recommends re-tuning your prompts.

Pricing matches Opus 4.6: $5 per million input tokens, $25 per million output tokens. It's available right now on the Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.

What Changes?

Let's start with the numbers. Opus 4.7 isn't first everywhere, but it holds a clear lead in the areas developers actually care about.

Benchmark	Opus 4.6	Opus 4.7	GPT-5.4	Gemini 3.1 Pro
SWE-bench Verified	80.8%	87.6%	-	80.6%
SWE-bench Pro	53.4%	64.3%	57.7%	54.2%
MCP-Atlas (Tool Use)	75.8%	77.3%	68.1%	73.9%
OSWorld (Computer Use)	72.7%	78.0%	75.0%	-
GPQA Diamond (Reasoning)	91.3%	94.2%	94.4%	94.3%
BrowseComp (Web Search)	83.7%	79.3%	89.3%	85.9%
GDPVal-AA (Knowledge Work Elo)	-	1,753	1,674	1,314

Clear number one in coding and tool use, effectively tied in pure reasoning, and actually down 4.4 points in web search (BrowseComp). This isn't a do-everything model — it's purpose-built for coding and agentic work.

Heads Up: Opus 4.7 (79.3%) actually scores lower than 4.6 (83.7%) on BrowseComp. If you're running an agent where web research is central, GPT-5.4 Pro (89.3%) or Gemini 3.1 Pro (85.9%) may be a better fit.
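To make the "pick the model per workload" point concrete, here is a minimal sketch that looks up the leader for a given benchmark. The scores are copied from the table above; the `SCORES` dictionary and the `leader` helper are illustrative names, not part of any vendor tooling, and only a subset of rows is included.

```python
# Scores (percent) copied from the benchmark table in this article.
# None marks a result the table does not report.
SCORES = {
    "SWE-bench Pro": {"Opus 4.6": 53.4, "Opus 4.7": 64.3, "GPT-5.4": 57.7, "Gemini 3.1 Pro": 54.2},
    "MCP-Atlas": {"Opus 4.6": 75.8, "Opus 4.7": 77.3, "GPT-5.4": 68.1, "Gemini 3.1 Pro": 73.9},
    "OSWorld": {"Opus 4.6": 72.7, "Opus 4.7": 78.0, "GPT-5.4": 75.0, "Gemini 3.1 Pro": None},
    "GPQA Diamond": {"Opus 4.6": 91.3, "Opus 4.7": 94.2, "GPT-5.4": 94.4, "Gemini 3.1 Pro": 94.3},
    "BrowseComp": {"Opus 4.6": 83.7, "Opus 4.7": 79.3, "GPT-5.4": 89.3, "Gemini 3.1 Pro": 85.9},
}


def leader(benchmark):
    """Return (model, score) for the highest reported score on a benchmark."""
    reported = {m: s for m, s in SCORES[benchmark].items() if s is not None}
    best = max(reported, key=reported.get)
    return best, reported[best]


for name in SCORES:
    model, score = leader(name)
    print(f"{name}: {model} ({score}%)")
```

A single benchmark score is a rough proxy at best, so treat the result as a shortlist for your own evaluation rather than a final answer.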

Vision That's 3x Sharper

Image processing resolution is now up to 2,576px on the long side (roughly 3.75 megapixels) — more than triple the previous model's capability. Autonomous security testing firm XBOW confirmed visual accuracy jumped from 54.5% to 98.5%. Screenshot-reading computer use agents, complex technical diagram interpretation, dense UI navigation — things that were previously too blurry to work with are now fair game.

Real Gains in Agentic Workflows

There are changes here that don't reduce to a single number.

Notion: +14% multi-step workflow success rate, tool errors down to one-third

Rakuten: 3x more production tasks solved on SWE-Bench vs. 4.6

Cursor: 70% CursorBench score (4.6 was 58%), major gains in autonomous coding

The CEO of Cognition (Devin) said: "4.7 works consistently for hours and doesn't give up on hard problems." Factory Droids noted that "a model that used to stop halfway through now sees things to completion," and Replit's CEO described it as "like a colleague who pushes back in technical discussions."

The Token Cost Shadow

Here's the thing — Opus 4.7 genuinely thinks more, and it definitely spends more.

Two reasons token usage is up:
1. New tokenizer: the same input now converts to 1.0–1.35x as many tokens.
2. Deeper reasoning: output tokens spike significantly, especially in later turns of agentic sessions.
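The tokenizer effect alone can be turned into a rough cost range. The sketch below uses the prices quoted in this article ($5 and $25 per million tokens) and the 1.0–1.35x multiplier. It assumes the multiplier applies only to input and holds output tokens fixed, which likely understates the real increase given the deeper reasoning described above. The function names and the example workload are made up for illustration.

```python
INPUT_PRICE_PER_MTOK = 5.00    # USD per million input tokens (from the article)
OUTPUT_PRICE_PER_MTOK = 25.00  # USD per million output tokens (from the article)
TOKENIZER_MULTIPLIER_RANGE = (1.0, 1.35)  # same text, new tokenizer


def cost_usd(input_tokens, output_tokens):
    """Cost of a workload at the quoted per-million-token prices."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000


def projected_cost_range(old_input_tokens, output_tokens):
    """Return (low, high) cost when only the input token count is inflated.

    Assumption: output tokens stay the same, so this is a lower bound on
    the real change if the model also reasons for longer.
    """
    low_mult, high_mult = TOKENIZER_MULTIPLIER_RANGE
    return (
        cost_usd(old_input_tokens * low_mult, output_tokens),
        cost_usd(old_input_tokens * high_mult, output_tokens),
    )


# Hypothetical workload: 2M input tokens (as counted before), 400K output tokens.
low, high = projected_cost_range(2_000_000, 400_000)
print(f"${low:.2f} to ${high:.2f}")  # prints "$20.00 to $23.50"
```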

In Decrypt's real-world testing, a single session burned through the entire token quota. The model would finish all the code, then — labeled as "bug fixes and improvements" — rewrite the whole thing from scratch. Then do it again. This is behavior that never showed up with Opus 4.6.

Anthropic is aware of the issue and has introduced a new effort parameter and task budget to address it.

Effort Level	Characteristics	Best For
low/medium	Fast responses, minimal reasoning	Simple queries, data transformation
high	Balanced reasoning	General coding, analysis
xhigh (new)	Deep reasoning, between high and max	Complex agentic coding (Claude Code default)
max	Maximum reasoning, maximum tokens	Only for the hardest problems

Task budget is in public beta and lets you cap an agent's token usage to prevent surprise bills.
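As a rough illustration of how effort levels and a token ceiling fit together, here is a client-side sketch. It does not call the Claude API and is not the API's task budget feature: `EFFORT_BY_TASK`, `pick_effort`, and `TokenBudget` are hypothetical names, and the task-type keys are my own reading of the table above. In a real integration you would pass the chosen effort level to the API using whatever parameter shape the official docs specify.

```python
from dataclasses import dataclass

# Effort levels as listed in the table above. The table groups low and medium
# together, so the split between them here is an arbitrary assumption.
EFFORT_BY_TASK = {
    "simple_query": "low",
    "data_transformation": "medium",
    "general_coding": "high",
    "agentic_coding": "xhigh",
    "hardest_problems": "max",
}


def pick_effort(task_type: str) -> str:
    """Pick an effort level for a task type, defaulting to 'high'."""
    return EFFORT_BY_TASK.get(task_type, "high")


@dataclass
class TokenBudget:
    """Client-side running total of tokens spent by one agent session."""
    limit: int
    used: int = 0

    def record(self, input_tokens: int, output_tokens: int) -> None:
        self.used += input_tokens + output_tokens

    @property
    def remaining(self) -> int:
        return max(self.limit - self.used, 0)

    def exhausted(self) -> bool:
        return self.used >= self.limit


# Example: stop an agent loop once the session has spent its budget.
budget = TokenBudget(limit=50_000)
fake_turns = [(8_000, 12_000), (9_000, 15_000), (10_000, 20_000)]  # made-up usage
for input_tokens, output_tokens in fake_turns:
    if budget.exhausted():
        break
    budget.record(input_tokens, output_tokens)
print(pick_effort("agentic_coding"), budget.used, budget.remaining)
```

A guard like this only stops the loop between turns, so a single expensive turn can still overshoot the limit.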

Getting Started

Here's what you need to know when migrating from Opus 4.6 to 4.7.

Start with prompt re-tuning
4.7 follows instructions literally. Loosely worded "just figure it out" prompts can produce unexpected results. Test with representative traffic before switching over.
Set your effort level
For coding and agentic tasks, start with high or xhigh. Reserve max for your hardest problems. Claude Code defaults to xhigh.
Measure token costs
The new tokenizer means the same input can consume up to 35% more tokens. Measure your cost delta on real traffic before committing to a full rollout.
Use Task Budget
For long-running agents, use the API's task budget (beta) to set a token ceiling and avoid unexpected charges.
Be careful with web search agents
BrowseComp scores dropped, so for research-heavy workflows, it's worth evaluating GPT-5.4 Pro alongside Opus 4.7.

Other Features Shipping Alongside It

There are a few more updates that shipped with Opus 4.7.

/ultrareview — A dedicated review session in Claude Code that checks your changes at a senior reviewer level. Pro and Max users get three free sessions.

Auto Mode expanded — Max users now have access to Auto Mode, where Claude makes decisions autonomously. Long tasks run without interruption.

Cyber Verification Program — A credentialing program that gives security professionals (penetration testers, vulnerability researchers, etc.) access to Opus 4.7's cybersecurity capabilities.

Deep Dive Resources

Anthropic Official Announcement: The full release document, including benchmarks, safety profile, and migration guide for Opus 4.7. (anthropic.com)

Vellum Benchmark Analysis: Detailed comparisons across key benchmarks (SWE-bench, MCP-Atlas, GPQA Diamond) and recommendations by migration scenario. (vellum.ai)

Decrypt Hands-On Review: Real test results using a game-building prompt. Highest quality output yet, but it burned through the entire token quota in a single session. (decrypt.co)

VentureBeat Deep Dive: Migration strategy from an enterprise perspective, plus analysis of Anthropic's market positioning. (venturebeat.com)

TNW Tech Summary: A concise tech media rundown of pricing, availability, and key benchmarks. (thenextweb.com)

Claude Opus 4.7 Migration Guide: What to watch for when switching from 4.6 to 4.7, and how to tune effort levels. (platform.claude.com)

FAQ

Is Opus 4.7 strictly better than GPT-5.4?

Not universally. Opus 4.7 leads in coding (SWE-bench Pro: 64.3% vs. 57.7%) and tool use (MCP-Atlas: 77.3% vs. 68.1%), but GPT-5.4 has the edge in web search (BrowseComp: 89.3% vs. 79.3%) and on the Humanity's Last Exam reasoning benchmark (HLE: 58.7% vs. 54.7%). It depends on what you're using it for.

How much more will I pay in token costs?

The per-token rate is unchanged ($5/$25 per million tokens), but the new tokenizer converts the same text into 1.0–1.35x as many tokens. Add in the reasoning token increases at higher effort levels, and your real-world cost can go up substantially.

Can I just swap 4.6 for 4.7 directly?

A direct swap can cause your existing prompts to behave differently than expected. Since Opus 4.7 follows instructions literally, any loosely worded prompts need re-tuning before you switch. Anthropic recommends a phased migration.
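One common way to phase a migration is to send a fixed percentage of traffic to the new model and raise it as results hold up. The sketch below shows that idea with a stable hash of a request or user ID. This is a generic rollout technique, not something Anthropic prescribes, and the model strings are placeholders rather than real API model identifiers.

```python
import hashlib


def rollout_bucket(request_id: str) -> int:
    """Map an ID to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_model(request_id: str, percent_on_new: int,
                 old_model: str = "opus-4.6",       # placeholder name
                 new_model: str = "opus-4.7") -> str:  # placeholder name
    """Route an ID to the new model if its bucket falls under the rollout percentage."""
    if rollout_bucket(request_id) < percent_on_new:
        return new_model
    return old_model


# Example: 10% of IDs go to the new model; the same ID always gets the same answer.
for rid in ("user-1", "user-2", "user-3"):
    print(rid, choose_model(rid, percent_on_new=10))
```

Because the bucket depends only on the ID, raising the percentage adds new IDs to the 4.7 group without moving anyone already on it back to 4.6.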

Written by Rush

Tracking where business meets AI.
