Simon Willison의 PyCon US 2026 라이트닝 토크 슬라이드 - 지난 6개월 LLM 변화

static.simonwillison.net

When Coding Agents Started Actually Working — The November 2025 Inflection Point

coding agents, local LLM, RLVR, Qwen3.6, GLM-5.1, November 2025 inflection pointDev

The last six months in LLMs in five minutes

Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7

Z.ai Releases GLM-5.1: 754B Model Tops SWE-Bench Pro

There's a specific moment when coding agents crossed from "sometimes works" to "mostly works." Simon Willison — Django co-creator — says it was November 2025. He laid it all out in a five-minute lightning talk at PyCon US 2026.

30-second summary

Nov 2025 inflection → RLVR boosts coding agents → OpenClaw explodes → Local models rebel → Qwen beats Opus 4.7

What actually happened in November?

November 2025 was a strange month for LLMs. The "best model" crown changed hands five times in five months — all clustered around November. Claude Sonnet 4.5 (September), then GPT-5.1, Gemini 3, GPT-5.1 Codex Max, Claude Opus 4.5. Anthropic, OpenAI, and Google trading the top spot in a tight loop.

But that's not the real story. The bigger shift Willison points to: coding agents crossed from "often-work" to "mostly-work." Before, asking an AI to write code meant constant babysitting. Now, you can actually hand it off and expect results.

The technical driver: RLVR (Reinforcement Learning from Verifiable Rewards). OpenAI and Anthropic spent most of 2025 here. The insight is simple — code has objective, instant feedback (compiler pass/fail). Training on that signal instead of human preferences drove a massive quality jump. Coding turned out to be the perfect RLVR problem.

5×

"Best model" crown changes in Nov 2025

often→mostly

Coding agent reliability shift

20.9GB

Qwen3.6 quantized — runs on a laptop

So what actually changed for the people using these tools?

Willison calls November–January the "LLM psychosis period." Coding agents were suddenly working well enough that developers went on wild, ambitious project sprints. Willison himself built micro-javascript — a JavaScript implementation in Python. Utterly impractical, no real users. But it worked, and that felt incredible.

Then February hit. OpenClaw exploded. It's an open-source personal AI assistant you run on your own hardware. Mac Mini M4 sold out across the country. Drew Breunig's framing stuck: "The Mac Mini is an aquarium for your Claw." Running your own AI, not renting someone else's cloud, hit a cultural nerve.

	Before Nov 2025	After Nov 2025
Coding agent reliability	often-work (needs babysitting)	mostly-work (actually delegatable)
Local model quality	Clearly inferior to cloud models	Beating top cloud models on specific tasks
Personal AI servers	Niche / complex setup	OpenClaw made it accessible
Model competition	OpenAI-dominated	Anthropic, Google, Chinese models all competing

The most surprising development is the local model story. In April 2026, Willison ran Qwen3.6-35B-A3B on his laptop and it drew a better SVG than Claude Opus 4.7. A 20.9GB laptop model beat Anthropic's flagship cloud model. Then GLM-5.1 (754B parameters, open weights from China's Z.ai) scored 58.4% on SWE-Bench Pro — ahead of both Claude Opus 4.6 (57.3%) and GPT-5.4 (57.7%).

What's the pelican-on-a-bicycle benchmark?

It's Willison's informal sanity test: "Draw me an SVG of a pelican riding a bicycle." The model almost certainly hasn't trained on this specific combination — so it's testing genuine visual reasoning. It's meant as a joke, but there's actually been solid correlation between pelican quality and general model usefulness.

Getting started: what to actually do

Take coding agents seriously again
Claude Code, Cursor, GitHub Copilot — the reliability gap is real. If you tried these six months ago and gave up, give them another shot. Start with repetitive tasks, test generation, and refactoring.
Try running a local model
Ollama or LM Studio + Qwen3.6-35B works on a MacBook with 32GB+ RAM. The quantized build is 20.9GB. Good for private work where you don't want data leaving your machine.
Stop being loyal to one model
Five crown changes in five months. What was best in November might not be best now. Test periodically. Claude Code for coding, Gemini 3.1 Pro for image tasks, GLM-5.1 API for agentic coding.
GLM-5.1 via OpenRouter API
754B parameters means local is impractical (needs 8x H200), but it's accessible via OpenRouter. Currently the top open-source model for agentic coding tasks.
Watch out for LLM psychosis
When AI suddenly works better, you get tempted into wildly ambitious projects. Willison learned this firsthand. Ask "who actually needs this?" before building.

🔗

더 깊이 파고 싶다면

The last six months in LLMs in five minutes

Willison's full annotated slides from PyCon US 2026 lightning talk

Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7

The moment Willison's laptop model beat Anthropic's best

Z.ai Releases GLM-5.1: 754B Model Tops SWE-Bench Pro

Full spec and benchmarks for the current top open-source coding model

RLVR: Verifiable Rewards for Reliable Enterprise LLMs

How compiler feedback transformed coding agent quality

OpenClaw — Personal AI Assistant

The open-source project that sold out Mac Mini M4 inventory nationwide

Qwen 3.6-35B-A3B Complete Guide

How to run Qwen3.6 locally, setup walkthrough, benchmark comparisons

FAQ

Why did coding agents used to fail so often before RLVR?

Earlier models were trained on human preference signals — raters saying 'this answer feels better.' Code, however, has an objective judge: the compiler. RLVR plugs that compiler feedback directly into training, which is why code quality jumped so dramatically.

Can Qwen3.6 actually run on my MacBook?

If you have an M1/M2/M3/M4 MacBook with 32GB+ unified memory, yes. The quantized build is 20.9GB. Use LM Studio or Ollama — both have simple GUI installers. Expect slower generation than cloud models, but solid quality.

Is GLM-5.1 better than GPT-5 or Claude overall?

On SWE-Bench Pro (resolving real GitHub issues), it's currently #1 among open-source models. That's a coding-specific benchmark — for general conversation or creative writing, results will differ. But for agentic coding tasks specifically, it's genuinely competitive.

What's different about OpenClaw vs just using ChatGPT?

The key difference is local control. OpenClaw runs on your own hardware (typically a Mac Mini), meaning your data never leaves your machine. It's also persistent — it can run 24/7 and maintain long-term memory in a way cloud sessions don't.

How do I avoid 'LLM psychosis'?

Willison's own takeaway: ask 'who actually needs this?' before you start building. AI making something possible doesn't mean it's worth building. Start with a real, specific problem before exploring what's technically achievable.

Written by Rush

Tracking where business meets AI.

Did you find this reference helpful?

Get curated references delivered to your inbox weekly

Share this reference

Antioch — Meet the Cursor for Robot AI

Physical AI startups no longer need to rent warehouses or build million-dollar test facilities. Antioch brings software-speed development to robotics through cloud simulation — and just raised $8.5M seed to prove it.

Explore more AI workflow guides on similar topics

$20K and 12 AI Tools Built a $1.8B Telehealth Company — And Then the Red Flags Arrived

morningbrew.com

Medvi telehealth, AI startup leverage, GLP-1 startup, one-person unicorn, AI operations

$20K and 12 AI Tools Built a $1.8B Telehealth Company — And Then the Red Flags Arrived

Matthew Gallagher built Medvi, a GLP-1 telehealth startup, in 14 months with $20,000 and AI tools. 2 employees. 16.2% net margin. $401M in year one. Here's how the model works — and where it's breaking.

AI That Works While You Sleep — Automating Recurring Tasks with Claude Code Scheduled Task

substackcdn.com

What if your code review was already done when you woke up, and your newsletter

AI That Works While You Sleep — Automating Recurring Tasks with Claude Code Scheduled Task

What if your code review was already done when you woke up, and your newsletter sources were already organized? Here's how to automate recurring tasks with Claude Code Scheduled Task.

Next →Antioch — Meet the Cursor for Robot AI