GPT-5 performed worse than GPT-4 at coding? Not a joke. IEEE Spectrum verified it.
What Is This?
This comes from an IEEE Spectrum analysis published in January 2026, combined with a 700+-comment Hacker News discussion. Veteran developers report that newer AI models produce lower-quality code than their predecessors.
IEEE Spectrum's key finding is "Silent Failures." Older models tended to fail loudly when they were wrong. Newer models generate code that runs without errors but produces incorrect results. Harder-to-find bugs are on the rise.
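To make "silent failure" concrete, here is a hypothetical sketch (the function and data are invented for illustration): an off-by-one loop bound that never raises an error but quietly drops the last window of data.

```python
def moving_average(xs, window):
    # Hypothetical AI-generated helper. Subtle bug: range(len(xs) - window)
    # skips the final full window, so the code runs cleanly but the result
    # is silently incomplete -- no crash, no traceback, just missing data.
    return [sum(xs[i:i + window]) / window for i in range(len(xs) - window)]

def moving_average_fixed(xs, window):
    # Correct upper bound includes the last full window.
    return [sum(xs[i:i + window]) / window for i in range(len(xs) - window + 1)]

print(moving_average([1, 2, 3, 4], 2))        # [1.5, 2.5]       -- looks plausible
print(moving_average_fixed([1, 2, 3, 4], 2))  # [1.5, 2.5, 3.5]  -- the real answer
```

The buggy version would have been obvious if it crashed; instead it returns a shorter list that only careful review or a boundary test would catch.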
Tests showed GPT-5 underperforming GPT-4 in certain coding scenarios. A CMU team analyzing 800+ popular GitHub projects also confirmed code quality degradation after AI tool adoption.
Anthropic's own research is telling too: AI-assisted coding made experienced developers 19% slower. It's one study under specific conditions, but it challenges the assumption that AI always speeds things up.
What Changed?
| | Earlier Models (2024-early 2025) | Latest Models (late 2025-2026) |
|---|---|---|
| Failure type | Crashes/errors (visible) | Silent failures (runs fine) |
| Debug difficulty | Traceable via error messages | Logic errors, hard to trace |
| Acceptance rate | Lower but accurate code | Higher but subtly wrong code |
| Developer experience | "If it breaks, I know immediately" | "Thought it worked, results are off" |
Why is this happening? Goodhart's Law is at work. Models optimize for "code the user accepts." Since users accept code that runs, models end up optimized for "code that runs," not "code that's correct." A vicious cycle.
DORA research (Google's DevOps Research and Assessment team) raised similar concerns: over-reliance on AI tools may degrade developers' deep learning ability (human learning, not machine learning!).
Anthropic's Research Finding
Experienced developers using AI coding assistants took 19% longer to complete tasks than those without. The "AI always equals faster" assumption needs revisiting.
How to Deal With This Realistically
- Don't trust AI code blindly: "It runs" and "it's correct" are different things. Always review the logic, especially edge cases and boundary conditions.
- Increase testing: Test coverage is key to catching silent failures. Have the AI write tests too, then review the quality of those tests as well.
- Pin model versions: Newest isn't always best. If a model version works well for your project, pin that API version.
- Be specific in prompts: Instead of "write this function," try "a function that takes X, returns Y, and handles Z exceptions. TypeScript, with error handling." Specificity improves quality.
- Strengthen code review: Code review is the last line of defense, for AI and human code alike. Auto-merging AI-generated PRs is still risky.
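As a sketch of the "review logic and increase testing" advice, here is a hypothetical `paginate` helper (invented for illustration) with boundary-focused assertions, since the boundaries are exactly where "runs" and "correct" diverge:

```python
def paginate(items, page, per_page):
    # 1-indexed pages. Python slicing never raises on out-of-range
    # indices, so a wrong page silently yields [] -- exactly the kind
    # of behavior tests must pin down rather than assume.
    start = (page - 1) * per_page
    return items[start:start + per_page]

# Boundary conditions first: empty input, an exact page edge,
# a partial last page, and a page past the end.
assert paginate([], 1, 10) == []
assert paginate([1, 2, 3, 4], 2, 2) == [3, 4]  # exact page boundary
assert paginate([1, 2, 3], 2, 2) == [3]        # partial last page
assert paginate([1, 2, 3], 5, 2) == []         # past the end: silent, not an error
```

Note that the last case documents silent behavior explicitly; whether `[]` or an exception is correct is a design decision the test forces you to make.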
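For "pin model versions," one lightweight pattern (the config shape and snapshot name here are illustrative assumptions, not a specific vendor's API) is to refuse a floating default so upgrades are always deliberate:

```python
def resolve_model(config: dict) -> str:
    # Require an explicit, pinned model identifier in config.
    # Failing loudly here beats silently riding a "latest" alias
    # whose behavior can change underneath you between runs.
    model = config.get("model")
    if not model:
        raise ValueError('No "model" pinned in config; refusing a floating default.')
    return model

print(resolve_model({"model": "gpt-4-0613"}))  # a dated snapshot, not an alias
```

Pair this with a changelog entry whenever the pin is bumped, so a model upgrade is reviewed like any other dependency change.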