There's a specific moment when coding agents crossed from "sometimes works" to "mostly works." Simon Willison — Django co-creator — says it was November 2025. He laid it all out in a five-minute lightning talk at PyCon US 2026.
What actually happened in November?
November 2025 was a strange month for LLMs. The "best model" crown changed hands five times in five months — all clustered around November. Claude Sonnet 4.5 (September), then GPT-5.1, Gemini 3, GPT-5.1 Codex Max, Claude Opus 4.5. Anthropic, OpenAI, and Google trading the top spot in a tight loop.
But that's not the real story. The bigger shift Willison points to: coding agents crossed from "often-work" to "mostly-work." Before, asking an AI to write code meant constant babysitting. Now, you can actually hand it off and expect results.
The technical driver: RLVR (Reinforcement Learning from Verifiable Rewards). OpenAI and Anthropic spent most of 2025 focused on it. The insight is simple: code gives objective, near-instant feedback (it compiles and the tests pass, or they don't). Training on that signal instead of human preferences drove a massive quality jump. Coding turned out to be the perfect RLVR problem.
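To make "verifiable reward" concrete, here is a minimal sketch of the kind of signal involved: run a candidate solution against a set of tests and score it 1.0 or 0.0. This is an illustration only, not how any lab's training pipeline actually works. The function name `verifiable_reward`, the binary scoring, and the toy `add` task are all assumptions made for the example.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def verifiable_reward(candidate_source: str, test_source: str, timeout_s: float = 10.0) -> float:
    """Return 1.0 if the candidate code passes the tests, otherwise 0.0.

    Hypothetical helper for illustration. It runs the code in a plain
    subprocess with no sandboxing, so only feed it code you trust.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "check.py"
        # Candidate first, then the assertions that exercise it.
        script.write_text(candidate_source + "\n\n" + test_source, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Code that hangs counts as a failure.
            return 0.0
    # Exit code 0 means no syntax error and no failed assertion.
    return 1.0 if result.returncode == 0 else 0.0


# Toy task (assumed for the example): implement add(a, b).
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

print(verifiable_reward(good, tests))  # 1.0
print(verifiable_reward(bad, tests))   # 0.0
```

The point is that nobody has to judge the output by eye: the same check gives the same answer every time, which is what makes it usable as a training signal at scale.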
So what actually changed for the people using these tools?
Willison calls November–January the "LLM psychosis period." Coding agents were suddenly working well enough that developers went on wild, ambitious project sprints. Willison himself built micro-javascript — a JavaScript implementation in Python. Utterly impractical, no real users. But it worked, and that felt incredible.
Then February hit. OpenClaw exploded. It's an open-source personal AI assistant you run on your own hardware. The Mac Mini M4 sold out across the country. Drew Breunig's framing stuck: "The Mac Mini is an aquarium for your Claw." Running your own AI, rather than renting someone else's cloud, hit a cultural nerve.
| | Before Nov 2025 | After Nov 2025 |
|---|---|---|
| Coding agent reliability | often-work (needs babysitting) | mostly-work (actually delegatable) |
| Local model quality | Clearly inferior to cloud models | Beating top cloud models on specific tasks |
| Personal AI servers | Niche / complex setup | OpenClaw made it accessible |
| Model competition | OpenAI-dominated | Anthropic, Google, Chinese models all competing |
The most surprising development is the local model story. In April 2026, Willison ran Qwen3.6-35B-A3B on his laptop and it drew a better SVG than Claude Opus 4.7. On that one task, a 20.9GB laptop model beat Anthropic's flagship cloud model. Then GLM-5.1 (754B parameters, open weights from China's Z.ai) scored 58.4% on SWE-Bench Pro, ahead of both Claude Opus 4.6 (57.3%) and GPT-5.4 (57.7%).
What's the pelican-on-a-bicycle benchmark?
It's Willison's informal sanity test: "Draw me an SVG of a pelican riding a bicycle." The model almost certainly hasn't trained on this specific combination, so it works as a test of genuine visual reasoning rather than recall. It's meant as a joke, but there has actually been a solid correlation between pelican quality and general model usefulness.
Getting started: what to actually do
- Take coding agents seriously again
  Claude Code, Cursor, GitHub Copilot: the reliability gap is real. If you tried these six months ago and gave up, give them another shot. Start with repetitive tasks, test generation, and refactoring.
- Try running a local model
  Ollama or LM Studio + Qwen3.6-35B works on a MacBook with 32GB+ RAM. The quantized build is 20.9GB. Good for private work where you don't want data leaving your machine.
- Stop being loyal to one model
  Five crown changes in five months. What was best in November might not be best now. Test periodically. Claude Code for coding, Gemini 3.1 Pro for image tasks, GLM-5.1 API for agentic coding.
- GLM-5.1 via OpenRouter API
  754B parameters means local is impractical (needs 8x H200), but it's accessible via OpenRouter. Currently the top open-source model for agentic coding tasks.
- Watch out for LLM psychosis
  When AI suddenly works better, you get tempted into wildly ambitious projects. Willison learned this firsthand. Ask "who actually needs this?" before building.
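If you go the local route with Ollama, a rough sketch of talking to it from Python looks like this. It assumes Ollama is running on its default local port and that you have already pulled a model. The tag `qwen3.6:35b` is a placeholder, not a confirmed model name, so swap in whatever `ollama list` shows on your machine. The helper names `build_request` and `generate` are made up for this example.

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"
# Placeholder tag (an assumption): replace with a tag from `ollama list`.
MODEL_TAG = "qwen3.6:35b"


def build_request(prompt: str, model: str = MODEL_TAG) -> urllib.request.Request:
    """Build the POST request without sending it."""
    # "stream": False asks for one JSON object instead of a stream of chunks.
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = MODEL_TAG, timeout_s: float = 300.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout_s) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # In non-streaming mode the generated text is in the "response" field.
    return body["response"]
```

With the server up, `generate("Draw me an SVG of a pelican riding a bicycle")` returns the model's reply as a string. Nothing leaves your machine, which is the whole appeal for private work.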




