
Constrain the Agent, Get Better Results — The Statewright State Machine Paradox

The more tools you give an AI agent, the better it performs, right? According to Statewright's results, the opposite holds. When tool access was restricted per phase at the protocol level, constrained models of 13B parameters and up consistently beat unconstrained frontier models.

30-second summary

Why agents fail → Define phases (States) → Enforce tool restrictions → Small model = frontier performance → Up to 80% cost savings

Why do agents keep making mistakes?

If you've used a coding agent, you know the feeling. When it works, it's incredible — but when it goes sideways, it really goes sideways. Is this just because the model isn't smart enough?

AI researcher Chip Huyen worked through the math. If each step succeeds independently with 95% accuracy, the chance that all 10 steps of a task succeed is 0.95^10, about 60%. Over 100 steps it falls to 0.95^100, about 0.6%. Errors compound geometrically. That's a structural problem, not a model quality problem.
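The compounding figures above can be reproduced in a few lines. This is a minimal sketch that assumes every step succeeds independently with the same probability; the function name `chain_success` is illustrative and not taken from any source.

```python
def chain_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that all `steps` independent steps succeed."""
    return per_step_accuracy ** steps


# Print the overall success rate for a 95%-accurate agent at several task lengths.
for n in (1, 10, 100):
    print(f"{n:>3} steps: {chain_success(0.95, n):.1%}")
# Output:
#   1 steps: 95.0%
#  10 steps: 59.9%
# 100 steps: 0.6%
```

Real agents rarely have independent, equally reliable steps, so treat this as an illustration of the trend rather than a prediction.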

Anthropic acknowledged this directly: "The autonomous nature of agents can lead to higher costs and the potential for compounding errors." The usual response? Use a bigger model. Extend the context window. Statewright goes the opposite direction. Make the problem space smaller.

The core idea is elegant. Based on which phase (State) the agent is currently in, physically restrict which tools it can access. Planning phase: read-only tools only. Implementation: edit tools unlocked. Testing: bash commands only. This isn't a prompt asking the model to "please only use these tools" — it's a protocol-level rejection of unauthorized tool calls.

The core principle: "Agents are suggestions, states are laws"

That's creator Ben Cochran's framing. When a model tries to skip a phase or use the wrong tool, it doesn't get a politely worded warning; the protocol itself rejects the call. It's structural enforcement, not advisory guidance.
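To make the difference between advisory and enforced concrete, here is a minimal sketch of a gate that sits between the model and its tools. It is not Statewright's implementation or API. The state names follow the phases described above, while the tool names (`read_file`, `list_dir`, `edit_file`, `bash`), the `StateGate` class, and the `ToolCallRejected` exception are all hypothetical.

```python
class ToolCallRejected(Exception):
    """Raised when a tool is requested in a state that does not allow it."""


# Hypothetical state -> allowed-tools mapping, mirroring the phases in the text.
ALLOWED_TOOLS = {
    "planning": {"read_file", "list_dir"},
    "implementation": {"read_file", "edit_file"},
    "testing": {"bash"},
}


class StateGate:
    """Runs a tool handler only if the current state permits that tool."""

    def __init__(self, state: str = "planning") -> None:
        self.state = state

    def call(self, tool: str, handler, *args):
        # The check happens before the handler runs, so a disallowed call
        # never executes, whatever the model was told in its prompt.
        if tool not in ALLOWED_TOOLS[self.state]:
            raise ToolCallRejected(f"{tool!r} is not allowed in state {self.state!r}")
        return handler(*args)


gate = StateGate()  # starts in "planning"
print(gate.call("read_file", lambda path: f"contents of {path}", "README.md"))
try:
    gate.call("edit_file", lambda path: None, "README.md")
except ToolCallRejected as err:
    print("rejected:", err)
```

The point of the design is that the rejection lives outside the model: the prompt can say anything, but the handler for a disallowed tool is never invoked.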

What actually changes?

There are already similar-looking tools — LangGraph, XState, Claude Code. Here's how Statewright differs.

Aspect                    Existing frameworks        Statewright
Tool access control       Prompt-based (advisory)    State machine (enforced)
On rule violation         Model can ignore it        Protocol-level rejection
Model routing             Manual configuration       Automatic per-phase routing
Input token efficiency    Full tool list exposed     Current-state tools only
Cost reduction potential  n/a                        Up to 80% on multi-phase workflows

LangGraph connects agents as graph nodes with specialized roles. The philosophy of specialization improving performance is similar — but LangGraph still relies on prompts to guide which tools to use, not physical enforcement. Compared to Claude Code, the difference is even more striking — Claude Code starts with 35,000+ tokens of context overhead. Statewright exposes only the tools relevant to the current state, dramatically reducing input tokens and improving cache efficiency.
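The token argument can be illustrated with a small sketch: if only the current state's tools are advertised to the model, the tool list sent with each request shrinks. Everything here is hypothetical, including the catalog, the tool names, and the use of JSON character count as a rough stand-in for token count. It does not reflect how Statewright or Claude Code actually serialize tools.

```python
import json

# Hypothetical tool catalog; each tool lists the states in which it is exposed.
TOOL_CATALOG = {
    "read_file": {"description": "Read a file from the workspace.",
                  "states": {"planning", "implementation"}},
    "list_dir": {"description": "List the entries of a directory.",
                 "states": {"planning"}},
    "edit_file": {"description": "Apply an edit to a file in the workspace.",
                  "states": {"implementation"}},
    "bash": {"description": "Run a shell command and return its output.",
             "states": {"testing"}},
}


def tools_for_state(state: str) -> list:
    """Return only the tool definitions that the given state exposes."""
    return [
        {"name": name, "description": spec["description"]}
        for name, spec in TOOL_CATALOG.items()
        if state in spec["states"]
    ]


def payload_size(tools: list) -> int:
    """Character length of the serialized tool list (a crude proxy for tokens)."""
    return len(json.dumps(tools))


all_tools = [{"name": n, "description": s["description"]} for n, s in TOOL_CATALOG.items()]
for state in ("planning", "implementation", "testing"):
    visible = tools_for_state(state)
    names = [t["name"] for t in visible]
    print(f"{state}: {names} ({payload_size(visible)} of {payload_size(all_tools)} chars)")
```

With a four-tool catalog the saving is small; the effect matters when the full tool list is large and most of it is irrelevant to the current phase.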

And here's the most counterintuitive finding. Above roughly 13B parameters, structurally constrained smaller models consistently outperformed unconstrained frontier models. The pattern held across Qwen-coder, GPT-OSS, Gemma4, Haiku, Sonnet, and Opus. Around the same time, an independently published project called Forge reached the same conclusion; two projects converging on the same result is a meaningful signal.

The quick-start guide

1. Install: connect to your editor via MCP
   Install the Statewright plugin for Claude Code, Codex, Oh-My-Codex, or other MCP-compatible editors. The core engine and agent crates are Apache 2.0, so there's no cost to get started.

2. Define your workflow states
   Use YAML or JSON to define states and transition conditions. Specify phases like "planning → implementation → testing" and set guard conditions for each phase transition.

3. Assign tool access per state
   Specify which tools are allowed in each state. Planning: file reads only. Implementation: editing tools. Testing: bash execution only. These constraints are enforced at the protocol level, not in the prompt.

4. Configure per-phase model routing (optional)
   To cut costs, route planning to Haiku, implementation to Sonnet, and review to Opus. On multi-phase workflows, this can reduce costs by up to 80%.

5. Run and check audit logs
   Statewright logs every state transition and tool access attempt. Full traceability of what was blocked and when, making it suitable for SOC 2 compliance and enterprise change management workflows.
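The steps above (states, guarded transitions, per-state tools, audit log) can be sketched as a small state machine. This is an illustration under stated assumptions, not Statewright's configuration schema or API: the `WORKFLOW` layout, the guard names `plan_approved` and `files_changed`, the tool names, and the `Workflow` class are all hypothetical. The definition is written as a Python dict shaped like the JSON a real tool might load.

```python
import json
from datetime import datetime, timezone

# Hypothetical workflow definition (JSON-like). "next" maps target state -> guard name.
WORKFLOW = {
    "initial": "planning",
    "states": {
        "planning": {"tools": ["read_file"],
                     "next": {"implementation": "plan_approved"}},
        "implementation": {"tools": ["read_file", "edit_file"],
                           "next": {"testing": "files_changed"}},
        "testing": {"tools": ["bash"], "next": {}},
    },
}

# Hypothetical guard conditions, evaluated against a context dict.
GUARDS = {
    "plan_approved": lambda ctx: ctx.get("plan_approved", False),
    "files_changed": lambda ctx: len(ctx.get("changed_files", [])) > 0,
}


class Workflow:
    def __init__(self, definition: dict, guards: dict) -> None:
        self.definition = definition
        self.guards = guards
        self.state = definition["initial"]
        self.audit_log = []

    def _record(self, event: str, detail: str, allowed: bool) -> None:
        # Every attempt is logged, including the ones that are blocked.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "state": self.state,
            "event": event,
            "detail": detail,
            "allowed": allowed,
        })

    def request_tool(self, tool: str) -> bool:
        allowed = tool in self.definition["states"][self.state]["tools"]
        self._record("tool_call", tool, allowed)
        return allowed

    def transition(self, target: str, ctx: dict) -> bool:
        guard_name = self.definition["states"][self.state]["next"].get(target)
        # Blocked if there is no such edge or if the guard does not pass.
        allowed = guard_name is not None and bool(self.guards[guard_name](ctx))
        self._record("transition", target, allowed)
        if allowed:
            self.state = target
        return allowed


wf = Workflow(WORKFLOW, GUARDS)
wf.request_tool("edit_file")                               # blocked while planning
wf.transition("testing", {})                               # blocked: no edge from planning
wf.transition("implementation", {"plan_approved": True})   # guard passes
wf.request_tool("edit_file")                               # now allowed
print(json.dumps(wf.audit_log, indent=2))
```

Logging the blocked attempts alongside the allowed ones is what gives the trail its audit value: the log shows not only what the agent did but also what it tried to do.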

~80%: Cost reduction on multi-phase workflows (with per-phase model routing)

13B+: Parameter threshold for consistent improvement over unconstrained frontier models

Apache 2.0: License for core engine + agent crates (fully open source)
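The cost claim for per-phase model routing comes down to simple arithmetic: tokens per phase multiplied by the price of the model that handles that phase. The sketch below uses placeholder numbers only. The relative prices and token counts are invented for illustration, are not real vendor pricing, and are not measurements from Statewright, so the printed saving says nothing about what a real workflow would achieve.

```python
# Hypothetical relative prices per 1,000 tokens. NOT real vendor pricing.
PRICE_PER_1K = {"small": 1.0, "medium": 4.0, "large": 20.0}

# Hypothetical token usage per phase for one workflow run.
PHASE_TOKENS = {"planning": 40_000, "implementation": 50_000, "review": 10_000}


def run_cost(routing: dict) -> float:
    """Total cost of one run, given a phase -> model-tier routing."""
    return sum(
        PHASE_TOKENS[phase] / 1000 * PRICE_PER_1K[routing[phase]]
        for phase in PHASE_TOKENS
    )


# Baseline: every phase goes to the most expensive tier.
single = run_cost({phase: "large" for phase in PHASE_TOKENS})
# Routed: cheaper tiers for planning and implementation, expensive tier for review.
routed = run_cost({"planning": "small", "implementation": "medium", "review": "large"})
saving = 1 - routed / single
print(f"single-model: {single:.0f}, routed: {routed:.0f}, saving: {saving:.0%}")
```

The saving depends entirely on the token split across phases and the price ratios you plug in, so substitute your own numbers before drawing conclusions.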

Go deeper

Show HN: Statewright Discussion (news.ycombinator.com)
Q&A with creator Ben Cochran. Licensing policy, design intent, and direct comparisons to LangGraph and XState, all in one thread.

Building Effective Agents, Anthropic (anthropic.com)
Anthropic's take on the structural causes of agent unreliability and why simplicity should come first. The foundational read for anyone building with agents.

LangGraph: Multi-Agent Workflows (langchain.com)
The graph-based multi-agent orchestration framework. Worth reading alongside Statewright to understand the two different philosophical approaches to agent control.

Agents, Chip Huyen (huyenchip.com)
The mathematical breakdown of why agent errors compound. The "95% accuracy → 0.6% success at 100 steps" calculation comes from here.

XState Documentation (xstate.js.org)
The reference for state machines in UI development. Useful context for understanding what Statewright adapted specifically for agentic tool access control.

FAQ

Does it work with smaller models (7–13B)?

Statewright's test results show consistent improvement only above 13B parameters. Below that, models tend to follow state machine constraints unreliably, which limits the benefit.

Is it hard to migrate from LangGraph or LangChain?

Statewright is a standalone framework, so existing LangChain/LangGraph code isn't directly portable — you'd need to remodel your workflows as state machines. It's most practical as a fresh start or for adding new modules to existing projects.

Can it be used for workflows beyond coding agents?

Yes. Real-world use cases include content pipelines (research → draft → review → publish), SOC 2 compliance auditing, and enterprise change management (plan → review → implement → approve → deploy). It's particularly well-suited to processes that require step-by-step audit trails.

How does the FSL 1.1 license work in practice?

The core engine and agent crates are Apache 2.0 — fully open source. Plugins and the gateway are FSL 1.1, which converts to Apache 2.0 after 3 years. Solo developers and researchers have explicit patent exclusions.

Will it conflict with Claude Code if used together?

Statewright is designed to work alongside Claude Code via MCP integration. Claude Code filters its tool list at execution time (no cache impact), while Statewright controls tool access by state — they're more complementary than conflicting.

Written by Rush

Tracking where business meets AI.
