Claude Opus 4.7 SWE-bench 벤치마크 비교 — Vellum AI

cdn.sanity.io

The AI Coding Agent Ranking Just Changed — Why Claude Opus 4.7 Beat GPT-5.4

Claude Opus 4.7, Amazon Bedrock, SWE-bench, Adaptive Thinking, coding agentDev

Introducing Anthropic's Claude Opus 4.7 model in Amazon Bedrock

Introducing Claude Opus 4.7

Claude Opus 4.7 — Amazon Bedrock Model Card

The AI coding agent ranking just changed. In April 2026, Claude Opus 4.7 hit 64.3% on SWE-bench Pro — beating both GPT-5.4 (57.7%) and Gemini 3.1 Pro (54.2%). And now it's live on Amazon Bedrock. This isn't just "another access channel" — something specifically changed when it arrived on Bedrock.

Quick Summary

#1 on SWE-bench Pro → Adaptive Thinking introduced → temperature param removed → Bedrock enterprise infra → Start in 3 lines of code

What's the 64.3%?

There's a benchmark called SWE-bench. It measures how well AI can resolve bugs and feature requests pulled from real GitHub open-source repos. SWE-bench Pro is the hardest version — it uses actual production issues from major open-source projects. It's the most realistic indicator of "how useful is this coding agent in the real world."

Opus 4.7 scored 64.3%. That's up from 53.4% in Opus 4.6 — a 10.9-point improvement. It leads GPT-5.4 (57.7%) by 6.6 points and Gemini 3.1 Pro (54.2%) by over 10 points. If you're building or using a coding agent, this gap is genuinely noticeable in practice.

64.3%

SWE-bench Pro (Opus 4.7)

87.6%

SWE-bench Verified

77.3%

MCP-Atlas tool use (best-in-class)

It's not just coding. On MCP-Atlas — which measures how well an AI handles external tools — Opus 4.7 hit 77.3%, ahead of GPT-5.4 (75.3%) and Gemini (73.9%). That's the key metric for building multi-agent workflows. The one regression: BrowseComp (web research) dropped to 79.3% from 83.7% in 4.6. The team focused on coding and tool use, and made a trade-off on web search.

Vision got a major upgrade too. Max image resolution jumped to 2,576 pixels on the long edge — more than 3x previous models. That matters for UI screenshot analysis, complex diagram parsing, and dense document processing. CharXiv visual reasoning jumped 13 points to 82.1% (from 69.1%).

What actually changed?

The biggest technical change in Opus 4.7 is Adaptive Thinking. Up through Opus 4.6, you had to manually set thinking.type: "enabled" and budget_tokens — telling the model "think for up to 1,000 tokens on this task" or "use 5,000 tokens here." Developer-tuned, every call. In 4.7, that's gone.

With 4.7, it's just thinking.type: "adaptive". The model judges task complexity itself and allocates reasoning tokens automatically. Simple questions get minimal compute; complex refactoring gets deep thinking. No more budget_tokens tuning — it's fully automatic.

	Opus 4.6	Opus 4.7
Reasoning setup	thinking.type: "enabled" + manual budget_tokens	Just thinking.type: "adaptive"
temperature/top_p	Adjustable	Not supported — remove from requests
SWE-bench Pro	53.4%	64.3% (+10.9pts)
Image resolution	Previous level	Up to 2,576px on long edge (3x+)
Prompt cache TTL	5 minutes	5 min · 1 hour (your choice)
Visual reasoning (CharXiv)	69.1%	82.1% (+13pts)

Migration warning from 4.6

Plugging Opus 4.6 code directly into 4.7 will throw a 400 error. You need to change thinking.type to "adaptive" and completely remove temperature, top_p, and top_k parameters. budget_tokens is also gone — Adaptive Thinking handles this automatically.

Pricing is unchanged: $5/M input tokens, $25/M output tokens — same as Opus 4.6. One thing to note: a new tokenizer means the same content may generate 1.0–1.35x more tokens than before. Actual costs may increase slightly, heads up.

The Quick Start: Bedrock in 5 Steps

Set up AWS account + Bedrock API key
Generate a long-term API key in the Amazon Bedrock console. Set it as the AWS_BEARER_TOKEN_BEDROCK environment variable.
Install the SDK
For the Messages API: pip install -U "anthropic[bedrock]". For Converse/Invoke: pip install boto3. Pick one.
Send your first request
Model ID is anthropic.claude-opus-4-7, region defaults to us-east-1. For thinking, use only {"type": "adaptive"} — using enabled or budget_tokens throws a 400.
Optimize costs with prompt caching
Set cache checkpoints for repeated system prompts or documents (min 4,096 tokens). Choose 5-min or 1-hour TTL. Big savings on repeated calls.
Reduce latency with Geo inference
From Asia, use jp.anthropic.claude-opus-4-7 (Tokyo/Osaka routing) or global.anthropic.claude-opus-4-7 for automatic optimal region.

Bedrock's enterprise edge

Bedrock's next-generation inference engine prevents operator access to customer data. If you're already running VPC, IAM, and CloudWatch in AWS, you get enterprise-grade data isolation with no extra security setup.

If You Want to Dig Deeper

Introducing Claude Opus 4.7 — Anthropic The official launch post. Covers Adaptive Thinking design principles, safety evaluations, and cross-platform availability. anthropic.com

Claude Opus 4.7 in Amazon Bedrock — AWS Blog Official Bedrock launch post with Playground walkthrough, API code samples, and regional availability details. aws.amazon.com

Claude Opus 4.7 Benchmarks Explained — Vellum AI Deep-dive on MCP-Atlas, OSWorld, CharXiv, and side-by-side comparisons with GPT-5.4 and Gemini 3.1 Pro. vellum.ai

Amazon Bedrock Model Card — AWS Docs Adaptive Thinking migration guide, prompt caching setup, service tiers, and per-region routing specs in one place. docs.aws.amazon.com

Claude Opus 4.7 vs GPT-5.5 — DataCamp Side-by-side on coding, reasoning, and pricing. Includes areas where GPT-5.5 still leads (Terminal-Bench). datacamp.com

FAQ

Can I use Opus 4.6 code directly with 4.7?

Not directly — it'll throw a 400 error. You need to change thinking.type to 'adaptive' and completely remove the temperature, top_p, and top_k parameters. budget_tokens is also gone — Adaptive Thinking takes over automatically.

Does a higher SWE-bench score mean it'll do better on my actual project?

Generally yes — there's a correlation. SWE-bench Pro uses real production issues, making it the closest benchmark to real-world performance. That said, highly domain-specific code or internal library-heavy codebases may show less of a gap. Your own A/B test is the most accurate way to know.

If Adaptive Thinking auto-allocates tokens, is cost prediction harder?

Fair point. Reasoning tokens vary per call, so costs fluctuate. To manage this, combine prompt caching (cache repeated content over 4,096 tokens) with Bedrock's Flex service tier for non-time-sensitive work to bring average cost down.

How can I use the 2,576px image support in a coding agent?

Sending UI screenshots and asking 'find the bug in this screen' is the obvious use case. You can also pass architecture diagrams for code structure review, or submit error stack trace screenshots. The high-res support means dense documents and code screenshots can now be read accurately.

Is there latency if I use Bedrock from Asia?

Latency will be higher if you connect to us-east-1 directly. Use jp.anthropic.claude-opus-4-7 for automatic Tokyo/Osaka routing, or global.anthropic.claude-opus-4-7 for automatic best-region selection.

Written by Rush

Tracking where business meets AI.

Did you find this reference helpful?

Get curated references delivered to your inbox weekly

Share this reference

Antioch — Meet the Cursor for Robot AI

Physical AI startups no longer need to rent warehouses or build million-dollar test facilities. Antioch brings software-speed development to robotics through cloud simulation — and just raised $8.5M seed to prove it.

Explore more AI workflow guides on similar topics

$20K and 12 AI Tools Built a $1.8B Telehealth Company — And Then the Red Flags Arrived

morningbrew.com

Medvi telehealth, AI startup leverage, GLP-1 startup, one-person unicorn, AI operations

$20K and 12 AI Tools Built a $1.8B Telehealth Company — And Then the Red Flags Arrived

Matthew Gallagher built Medvi, a GLP-1 telehealth startup, in 14 months with $20,000 and AI tools. 2 employees. 16.2% net margin. $401M in year one. Here's how the model works — and where it's breaking.

AI That Works While You Sleep — Automating Recurring Tasks with Claude Code Scheduled Task

substackcdn.com

What if your code review was already done when you woke up, and your newsletter

AI That Works While You Sleep — Automating Recurring Tasks with Claude Code Scheduled Task

What if your code review was already done when you woke up, and your newsletter sources were already organized? Here's how to automate recurring tasks with Claude Code Scheduled Task.

Next →Antioch — Meet the Cursor for Robot AI