Harness Engineering concept diagram: designing the system that controls AI agents

philschmid.de

Harness Engineering: How to Actually Control Your AI Coding Agent

As AI coding agents get more powerful, control becomes everything.


OpenAI shipped a million-line codebase with zero lines written by a human. The secret? They handed the code to agents — but first, they designed the environment those agents would work in.

TL;DR

AI agents going rogue → Introduce the Harness concept → Control via Commands, Skills & Hooks → Compare 3 real-world frameworks → Apply it to your project

What Is It?

A harness is what you use to control a horse — reins, saddle, bit. It's a set of tools for directing something powerful that would otherwise go wherever it wants. AI coding agents need exactly the same thing.

Harness Engineering is an emerging discipline focused on designing the environment, constraints, and feedback loops that AI agents operate within. If prompt engineering is about "what to tell the agent," harness engineering is about "what world to put the agent in."

Philipp Schmid (formerly of Hugging Face) frames it with a computer analogy: if the model is the CPU and the context window is RAM, then the harness is the operating system. It manages context, handles boot sequences (prompts and hooks), and provides standard drivers (tool handling).
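To make the OS analogy concrete, here is a minimal Python sketch. Everything in it is hypothetical (the `Harness` class, the action-dict format, and a budget counted in messages rather than tokens); real harnesses count tokens and usually summarize with the model itself. It shows the three jobs the analogy assigns to the harness: a boot sequence (system prompt), context management (a naive compression step when the "RAM" budget is exceeded), and tool dispatch with hooks that fire after every call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical types: a "model" reads the context and returns an action dict,
# either {"type": "tool", "tool": name, "arg": text} or {"type": "finish", "answer": text}.
Action = Dict[str, str]
Model = Callable[[List[str]], Action]


@dataclass
class Harness:
    model: Model                                  # the "CPU"
    tools: Dict[str, Callable[[str], str]]        # the "drivers"
    max_context: int = 6                          # the "RAM" budget in messages (assumes >= 3)
    post_tool_hooks: List[Callable[[str, str], None]] = field(default_factory=list)
    context: List[str] = field(default_factory=list)

    def remember(self, message: str) -> None:
        """Append to context; when over budget, keep the first message and the newest ones."""
        self.context.append(message)
        if len(self.context) > self.max_context:
            head = self.context[0]
            tail = self.context[-(self.max_context - 2):]
            dropped = len(self.context) - 1 - len(tail)
            # Naive "compression": replace the middle with a placeholder summary line.
            self.context = [head, f"[summary of {dropped} earlier messages]"] + tail

    def run(self, system_prompt: str, task: str, max_steps: int = 10) -> str:
        self.context = []
        self.remember(system_prompt)              # the "boot sequence"
        self.remember(task)
        for _ in range(max_steps):
            action = self.model(self.context)
            if action["type"] == "finish":
                return action["answer"]
            result = self.tools[action["tool"]](action["arg"])  # tool "driver" call
            for hook in self.post_tool_hooks:     # deterministic: runs after every tool call
                hook(action["tool"], result)
            self.remember(f"{action['tool']} -> {result}")
        return "step limit reached"
```

A scripted function is enough to stand in for the model while testing: return tool actions until the context shows the work is done, then return a `finish` action. Note that the hooks live in the loop, not in the prompt, which is what makes them run every time.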

Here's the thing: this isn't just theory. Over five months, the OpenAI Codex team shipped over a million lines of agent-written code, with three engineers merging an average of 3.5 PRs per engineer per day. What those engineers actually did wasn't write code; it was design the harness. LangChain kept the model the same, changed only the harness, and jumped from the Top 30 to the Top 5 on a benchmark.

Key Takeaway

Prompt engineering = crafting effective instructions for a single interaction
Context engineering = optimizing what information the model sees
Harness engineering = designing the entire agent system — environment, constraints, feedback, and lifecycle

The Building Blocks: Commands, Skills, and Hooks

@_petercha's vibe coding glossary series does a great job laying out these components. Using Claude Code as the reference, here are the three core tools that make up a harness.


Commands & Skills

Think of these as abilities you teach the agent. Both are markdown files: Commands live in .claude/commands/ and run when you invoke them as slash commands, while Skills live in .claude/skills/ and the agent loads them on its own when a task calls for them. Superpowers is the most extreme example of what's possible here.
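Below is a toy sketch of skill discovery, assuming the layout Claude Code uses for skills (one folder per skill containing a `SKILL.md` whose frontmatter has `name` and `description`). The keyword overlap in the hypothetical `pick_skills` is only a stand-in: the real agent judges relevance with the model itself, and this frontmatter parser handles flat `key: value` lines, not full YAML.

```python
from pathlib import Path
from typing import Dict, List


def parse_frontmatter(text: str) -> Dict[str, str]:
    """Read a flat '---'-delimited header of 'key: value' lines (not a full YAML parser)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: Dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def load_skills(root: Path) -> Dict[str, str]:
    """Map skill name -> description for every <root>/<skill>/SKILL.md file."""
    skills: Dict[str, str] = {}
    for path in sorted(root.glob("*/SKILL.md")):
        meta = parse_frontmatter(path.read_text(encoding="utf-8"))
        skills[meta.get("name", path.parent.name)] = meta.get("description", "")
    return skills


def pick_skills(skills: Dict[str, str], request: str) -> List[str]:
    """Toy relevance test: a skill applies if its description shares a word with the request."""
    words = set(request.lower().split())
    return [name for name, desc in skills.items() if words & set(desc.lower().split())]
```

Pointing `load_skills` at `.claude/skills` and passing the result to `pick_skills` with the user's request mimics, very roughly, the "picks them up and applies them" behavior; the description line is what does the work, which is why it pays to write it carefully.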


Hooks

Shell commands that fire automatically at specific points in the agent's lifecycle. The key difference from Skills: Hooks run every single time, no exceptions — they're deterministic by design. You can enforce automatic linting after every code change, or fire a Slack notification the moment a task completes.
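A PostToolUse lint hook could look like the Python sketch below. It assumes Claude Code's hook contract as we understand it: the event arrives as JSON on stdin with `tool_name` and `tool_input.file_path`, and exit code 2 feeds stderr back to the agent. Check the official hooks reference before relying on it. `lint_text` is a hypothetical stand-in for your real linter.

```python
import json
import sys
from pathlib import Path
from typing import List, Tuple

MAX_LINE = 100  # hypothetical project rule; swap in your real linter


def lint_text(name: str, text: str) -> List[str]:
    """Tiny stand-in linter: flags over-long lines and trailing whitespace."""
    problems = []
    for n, line in enumerate(text.splitlines(), start=1):
        if len(line) > MAX_LINE:
            problems.append(f"{name}:{n}: line longer than {MAX_LINE} chars")
        if line != line.rstrip():
            problems.append(f"{name}:{n}: trailing whitespace")
    return problems


def handle(payload: dict) -> Tuple[int, str]:
    """Return (exit_code, stderr_message) for one PostToolUse event."""
    if payload.get("tool_name") not in {"Edit", "Write"}:
        return 0, ""
    file_path = payload.get("tool_input", {}).get("file_path", "")
    path = Path(file_path)
    if path.suffix != ".py" or not path.is_file():
        return 0, ""
    problems = lint_text(file_path, path.read_text(encoding="utf-8"))
    # Exit code 2 is the "blocking" code: stderr goes back to the agent so it can fix the file.
    return (2, "\n".join(problems)) if problems else (0, "")


def main() -> None:
    """Entry point: read the hook event as JSON from stdin and report via exit code."""
    code, message = handle(json.load(sys.stdin))
    if message:
        print(message, file=sys.stderr)
    sys.exit(code)
```

To use it, save it as a script (for example a hypothetical `.claude/hooks/lint.py`), append `if __name__ == "__main__": main()`, and register it for `PostToolUse` with an `Edit|Write` matcher. Because the check lives in a script rather than in the prompt, it runs after every edit whether or not the agent "remembers" the rule.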


Subagents & Agent Teams

For tasks too big for one agent, you can split the work and run it in parallel. Subagents work independently and only hand their results back to the main agent, while Agent Teams talk to each other and coordinate through a shared task list. Imagine separate agents handling the frontend, backend, and database at the same time.
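The difference between the two patterns can be sketched in plain Python, with threads standing in for agents (all names here are hypothetical): subagents are a fan-out/fan-in map whose workers never see each other, while a team shares one lock-protected task list and message log.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Optional, Tuple


def subagent(task: str) -> str:
    """Hypothetical worker; in a real harness this would be a separate agent session."""
    return f"done: {task}"


def run_subagents(tasks: List[str]) -> List[str]:
    """Subagent pattern: fan out, collect results in order; workers never see each other."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        return list(pool.map(subagent, tasks))


class SharedTaskList:
    """Agent-team pattern: one task list and one message log shared by every teammate."""

    def __init__(self, tasks: List[str]) -> None:
        self._lock = threading.Lock()
        self.pending: List[str] = list(tasks)
        self.claimed: Dict[str, str] = {}
        self.messages: List[Tuple[str, str]] = []

    def claim(self, agent: str) -> Optional[str]:
        """Atomically take the next unclaimed task, or None when the list is empty."""
        with self._lock:
            if not self.pending:
                return None
            task = self.pending.pop(0)
            self.claimed[task] = agent
            return task

    def post(self, agent: str, text: str) -> None:
        """Leave a message that every other teammate can read."""
        with self._lock:
            self.messages.append((agent, text))


def teammate(name: str, board: SharedTaskList) -> None:
    # Keep claiming work until the shared list is empty, announcing each result.
    while (task := board.claim(name)) is not None:
        board.post(name, f"finished {task}")


def run_team(names: List[str], tasks: List[str]) -> SharedTaskList:
    board = SharedTaskList(tasks)
    threads = [threading.Thread(target=teammate, args=(n, board)) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return board
```

The lock is the important design choice: without atomic claiming, two teammates could pick up the same task, which is exactly the kind of coordination problem a shared task list exists to prevent.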

What Changes?

You've probably seen what happens when you run an agent without a harness. It drifts in the wrong direction, repeats the same mistakes, and forgets your early instructions once the context gets long enough. It doesn't matter how good the model is — without a harness, you're working against yourself.

Aspect	Without a Harness	With a Harness
Direction	Prompt-dependent, drifts often	Auto-corrects via constraints and feedback
Quality consistency	Different results every time	Enforced by architecture rules + linters
Long-running tasks	Drifts after ~50 steps	Context compression + subagent distribution
Team collaboration	Every developer does it differently	Shared harness keeps the whole team aligned
Scalability	Constant human intervention required	Autonomous operation → humans just review

According to NxCode's breakdown, harness engineering rests on three pillars:

Context Engineering
Making sure the agent sees the right information at the right time. Use AGENTS.md as a table of contents, not an encyclopedia — put the detailed docs in a structured docs/ directory.
Architecture Constraints
Instead of telling the agent to "write good code," you mechanically enforce what good code looks like. Dependency direction, naming conventions, and file size limits get verified by linters and CI.
Entropy Management
AI-generated code drifts over time. Run periodic "cleanup agents" that scan for documentation inconsistencies, pattern violations, and dead code.
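As an illustration of the Architecture Constraints pillar, here is a hypothetical Python checker that a CI job could run. It assumes a three-layer layout (`ui`, then `service`, then `repo`) in which a module may import only its own layer or a lower one, plus an arbitrary 300-line file limit. Dedicated tools such as import-linter cover dependency rules more thoroughly; this only shows how "write good code" becomes a check that can fail.

```python
import ast
from typing import List, Optional

# Hypothetical layering, top to bottom: a module may import its own layer or any layer below it.
LAYERS = ["ui", "service", "repo"]
MAX_LINES = 300  # hypothetical file-size limit


def layer_of(module: str) -> Optional[str]:
    """Map a dotted module name to its layer, or None for stdlib/third-party modules."""
    top = module.split(".")[0]
    return top if top in LAYERS else None


def check_module(name: str, source: str) -> List[str]:
    """Return constraint violations for one module; an empty list means it passes."""
    errors = []
    if len(source.splitlines()) > MAX_LINES:
        errors.append(f"{name}: exceeds {MAX_LINES} lines")
    own = layer_of(name)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            targets = [node.module]
        else:
            continue
        for target in targets:
            dep = layer_of(target)
            # Importing "upward" (toward ui) breaks the dependency direction.
            if own and dep and LAYERS.index(dep) < LAYERS.index(own):
                errors.append(f"{name}: {own} must not import {dep} ({target})")
    return errors
```

In CI you would run `check_module` over every source file and fail the build on any non-empty result, so the agent gets the same mechanical feedback a human contributor would.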

3 Real-World Harness Frameworks Compared

Enough theory — here's a look at open-source harness frameworks you can actually use today.

Framework	Core Philosophy	GitHub Stars	Supported Agents
Superpowers	TDD-driven autonomous dev workflow	76.5K	Claude Code, Codex, Cursor, OpenCode
Oh-my-claudecode	Teams-first multi-agent orchestration	9.2K	Claude Code (+ Codex, OpenCode variants)
Ouroboros	Socratic intent extraction, oscillation prevention	1.1K	Claude Code

Superpowers, built by Jesse Vincent, is an agentic skills framework. Its standout feature is enforced TDD — the agent must write tests before writing any code, then follow the red → green → refactor cycle. Install it and you get a full automated workflow: brainstorming → design → planning → subagent execution → code review → branch cleanup.
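The source doesn't say how Superpowers enforces test-first internally, so here is a hypothetical, mechanical version of the same rule: an audit over an agent's action log that flags code written without a failing test to satisfy, and refactoring done while tests are not green. The event names are invented for illustration.

```python
from typing import List

# Hypothetical action-log vocabulary (not Superpowers' own format):
#   "write_test", "red" (test run with failures), "write_code",
#   "green" (test run with all tests passing), "refactor"


def tdd_violations(log: List[str]) -> List[int]:
    """Return indices of steps that break the red -> green -> refactor order."""
    violations = []
    test_written = False
    last_run = None  # outcome of the most recent test run: "red" or "green"
    for i, event in enumerate(log):
        if event == "write_test":
            test_written = True
        elif event in ("red", "green"):
            last_run = event
        elif event == "write_code" and not (test_written and last_run == "red"):
            violations.append(i)  # code written without a failing test to satisfy
        elif event == "refactor" and last_run != "green":
            violations.append(i)  # refactoring while tests are failing or never ran
    return violations
```

Because `last_run` stays `"green"` after a passing run, starting the next feature requires a new failing run before any code, which is the cycle the framework describes.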

Oh-my-claudecode, built by Korean developer Yeachan Heo, organizes 21 specialized agents (code reviewer, debugger, architect, and more) so that they collaborate like a team. It comes in a family of variants: oh-my-claudecode for Claude Code, oh-my-codex for Codex, and oh-my-openagent for OpenCode.

Ouroboros, built by Q00, follows the philosophy of "Stop prompting. Start specifying." It uses Socratic questioning to draw out the user's real intent, then tracks ambiguity (Ambiguity Score) and agent confusion (Oscillation) as numeric metrics — so you can actually see when your agent is going in circles.
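The source doesn't describe how Ouroboros computes its Oscillation metric, so the following is only a hypothetical stand-in for the idea: fingerprint the workspace after each agent step and count how often it jumps back to a state it had already left (A, B, A, B).

```python
import hashlib
from typing import List, Optional


def fingerprint(snapshot: str) -> str:
    """Hash a workspace snapshot (e.g. concatenated file contents) into a stable id."""
    return hashlib.sha256(snapshot.encode("utf-8")).hexdigest()


def oscillation_count(snapshots: List[str]) -> int:
    """Count steps that return to an earlier state (steps that changed nothing are ignored)."""
    seen = set()
    previous: Optional[str] = None
    revisits = 0
    for snap in snapshots:
        fp = fingerprint(snap)
        if fp in seen and fp != previous:
            revisits += 1
        seen.add(fp)
        previous = fp
    return revisits


def is_oscillating(snapshots: List[str], threshold: int = 2) -> bool:
    """Hypothetical alarm: repeated reversions suggest the agent is going in circles."""
    return oscillation_count(snapshots) >= threshold
```

A harness could stop the run or ask the user a clarifying question once `is_oscillating` fires, instead of letting the agent keep flipping between two edits.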

Getting Started

Start with CLAUDE.md (or AGENTS.md)
Document your project's architecture, coding conventions, and directory structure — in 100 lines or fewer. Table of contents, not encyclopedia.
Set deterministic rules with Hooks
In .claude/settings.json, add a PostToolUse hook to auto-run linting, and a Stop hook to send a completion notification.
Install a harness framework and try it out
If you're on Claude Code, one line does it: /plugin install superpowers. TDD workflow kicks in automatically.
Add constraints gradually
Don't try to build the perfect harness on day one. Start with basic linting, add architecture constraints once you see patterns, and bring in subagents only when you actually need them.
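Steps 1 and 2 can be partly scripted. The Python sketch below checks that a CLAUDE.md or AGENTS.md stays within the 100-line budget and only references docs/ files that exist, then merges a PostToolUse hook and a Stop hook into .claude/settings.json. It assumes the `hooks` → event → `matcher`/`hooks` nesting that Claude Code settings use; the hook scripts `lint.py` and `notify.py` are hypothetical.

```python
import json
import re
from pathlib import Path
from typing import Callable, List, Optional

MAX_CONTEXT_LINES = 100  # step 1: "100 lines or fewer"


def check_context(text: str, doc_exists: Callable[[str], bool]) -> List[str]:
    """Step 1: CLAUDE.md / AGENTS.md should be short and only point at docs that exist."""
    problems = []
    n = len(text.splitlines())
    if n > MAX_CONTEXT_LINES:
        problems.append(f"{n} lines (limit {MAX_CONTEXT_LINES})")
    for ref in re.findall(r"docs/[\w./-]+\.md", text):
        if not doc_exists(ref):
            problems.append(f"missing {ref}")
    return problems


def add_hook(settings: dict, event: str, command: str, matcher: Optional[str] = None) -> dict:
    """Step 2: register a command hook under `event`, skipping exact duplicates."""
    entry: dict = {"hooks": [{"type": "command", "command": command}]}
    if matcher is not None:
        entry["matcher"] = matcher
    entries = settings.setdefault("hooks", {}).setdefault(event, [])
    if entry not in entries:
        entries.append(entry)
    return settings


def update_settings(path: Path) -> dict:
    """Merge both hooks into .claude/settings.json without touching other keys."""
    settings = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    add_hook(settings, "PostToolUse", "python3 .claude/hooks/lint.py", matcher="Edit|Write")
    add_hook(settings, "Stop", "python3 .claude/hooks/notify.py")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return settings
```

For the file check, something like `check_context(Path("CLAUDE.md").read_text(), lambda ref: Path(ref).is_file())` works from the repository root. Skipping duplicate entries keeps the script safe to re-run as you add constraints gradually.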

Heads Up: The over-engineering trap

Philipp Schmid's key advice: "Build to Delete." Models improve fast; what needed a complex pipeline in 2024 might be a single prompt in 2026. If you over-engineer your control flow, the next model update can break your whole system.


Deep Dive Resources

Harness Engineering — OpenAI

The original harness design philosophy from the team that shipped a million lines with Codex

Superpowers — GitHub

TDD-based agentic skills framework, 76.5K stars

Oh-my-claudecode — GitHub

Team orchestration with 21 specialized agents

Ouroboros — GitHub

Socratic intent extraction to prevent agent oscillation

The importance of Agent Harness in 2026

Philipp Schmid's harness-as-OS analogy and where this is all heading

Vibe Coding Terms Part 2 — Peter's Place

Clean breakdown of Hooks, Subagents, and the Harness concept

FAQ

Can I use multiple harness frameworks at the same time?

Yes, you can. A common approach is using Superpowers for the TDD workflow and layering on custom Hooks separately. Just make sure your Skills and Hooks don't conflict — it's worth documenting the priority order upfront.

Will using a harness increase my token costs?

There's a small upfront cost from additional context injection, but here's the thing — the cost of an agent going off the rails and having to backtrack is much higher. In practice, clearer constraints mean the agent converges faster, which tends to reduce overall token usage over time.

Can I use harnesses with Cursor or Codex, not just Claude Code?

Superpowers supports Claude Code, Codex, Cursor, and OpenCode. For Hooks specifically, Claude Code currently has the richest support (18 lifecycle events), while Codex CLI and Gemini CLI have started adding support as well. Coverage varies by agent, so check the official docs for each one.

Our team doesn't have senior developers — can we still adopt harness engineering?

Honestly, it's even more valuable in that situation. The key is that encoding your architecture rules and coding standards into the harness means the agent enforces them automatically, raising your quality floor. Install Superpowers and you get TDD and code review automated from day one.

Written by Kevin

Dissecting AI tools and workflows from a developer's lens.
