The Agentic Harness Problem: Why AI Agents Need Better Guardrails Than Code Reviews

I’ve written before about treating AI coding agents like junior engineers — capable, fast, and in need of guardrails. Building tss-ceremony turned that metaphor into a concrete lesson.

The project is an interactive terminal animation of a DKLS23 threshold ECDSA signing ceremony. I covered the protocol in part one and the tool itself in part two. This post is about what happened when I handed a detailed spec to a team of AI agents and asked them to build it.

They shipped. Fast.

Unfortunately the result was broken in ways that taught me more about agentic development than any theoretical discussion could.

What the Agents Got Right

Credit where it’s due — the protocol layer came out solid.

All of the cryptographic primitives landed correctly. Nonce generation, oblivious transfer simulation, multiplicative-to-additive conversion, partial signature computation, signature combination. The test suite was comprehensive, with table-driven tests, round-trip verification, and OpenSSL compatibility checks.

The key generation scenes (0–4) looked good. Rolling hex animation, color-coded parties, phantom key display — all matching the spec.

And the bonus FROST comparison scenes (15–19)? Polished. Educational. Visually appealing side-by-side comparisons with equation displays and narrative text.

The architecture was clean, too. Each scene was a standalone Bubbletea component implementing a Scene interface with Render() and Narrator() methods. Easy to compose, easy to extend.
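To make the composition concrete, here is a minimal sketch of what that interface might look like. The names Scene, Render(), and Narrator() come from the post; the exact signatures and the keygenScene type are illustrative assumptions, not the project's real API.

```go
package main

import "fmt"

// Scene is the per-scene contract described above: each scene is a
// standalone component with its own rendering and narration.
// Signatures here are assumptions for illustration.
type Scene interface {
	Render(frame int) string // frame of terminal output for the animation
	Narrator() string        // narrative text shown alongside the scene
}

// keygenScene is a hypothetical minimal implementation.
type keygenScene struct{}

func (keygenScene) Render(frame int) string {
	return fmt.Sprintf("keygen frame %d", frame)
}

func (keygenScene) Narrator() string {
	return "Parties derive shares; no one ever holds the full key."
}

func main() {
	var s Scene = keygenScene{}
	fmt.Println(s.Render(0))
	fmt.Println(s.Narrator())
}
```

With every scene behind one small interface, composing the ceremony is just ordering a slice of Scene values, which is what made the architecture easy to extend.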

So far, so good.

What They Got Wrong

The core signing ceremony — scenes 5 through 11 — was never built.

Was. Never. Built.

The entire value proposition of the project was left as placeholder stubs. The agents built the keygen scenes and the bonus FROST comparison, but skipped the middle. Message hashing, nonce commitment, OT animation, MtA conversion, partial signatures, signature assembly — all stubs.

Scenes 12 and 13 — verification and impossibility — were actually implemented. Full multi-step content, rich narrative text. But model.go treated indices 5–14 as a single block of placeholders. The implementations existed on disk as dead code. No user would ever see them.

And the most architecturally significant bug? SignMessage() used standard ecdsa.Sign() with Party A’s private key alone. The signature verified against Party A’s public key — not the combined phantom key. The entire point of a threshold signing ceremony, defeated by a shortcut that was acknowledged in a code comment:

“This is a simplified signing for demonstration.”

The tests passed because they were written against the broken behavior.

Why It Happened

Looking at the failure patterns, a few root causes emerged.

No milestone completion gates. The agents worked on independent milestones — M5 for keygen scenes, M9 for bonus scenes — without anyone verifying that M6 (signing scenes) was ever assigned or completed. The bonus content (priority P3) shipped before the core signing ceremony (priority P0).

No data contracts between layers. The protocol layer computed real cryptographic values. The TUI layer had no access to them. The Config struct passed to scenes held only display settings — no ceremony results. The TUI was a slideshow disconnected from the actual math.

Stubs that passed tests. main.go ran the signing ceremony after the TUI exited, then dumped raw protocol output to the console. This was clearly scaffolding from early development, but because the tests targeted individual components in isolation, the integration gap was invisible.

The Fixes

Fixing the code was straightforward. The harder — and more valuable — work was fixing the process.

Milestone completion gates. Before marking a milestone complete, an automated check now verifies that all deliverables are present. Every scene index maps to a non-placeholder implementation. go build, go vet, and go test must pass. A smoke test runs the TUI with --fixed and confirms it doesn’t crash.
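The "no scene index maps to a placeholder" part of that gate can be a tiny check. This sketch assumes a stub type and scene map like the ones described above; the real project's names will differ.

```go
package main

import "fmt"

type scene interface {
	Render(frame int) string
	Narrator() string
}

// placeholderScene is whatever stub type model.go registers for
// unbuilt scenes. The name is an assumption for this sketch.
type placeholderScene struct{}

func (placeholderScene) Render(int) string { return "(placeholder)" }
func (placeholderScene) Narrator() string  { return "" }

// implementedScene stands in for a real, finished scene.
type implementedScene struct{}

func (implementedScene) Render(int) string { return "real content" }
func (implementedScene) Narrator() string  { return "real narration" }

// gate returns the indices still mapped to placeholders.
// A milestone is complete only when this list is empty,
// alongside go build, go vet, and go test passing.
func gate(scenes map[int]scene) []int {
	var missing []int
	for idx, s := range scenes {
		if _, stub := s.(placeholderScene); stub {
			missing = append(missing, idx)
		}
	}
	return missing
}

func main() {
	scenes := map[int]scene{5: placeholderScene{}, 12: implementedScene{}}
	fmt.Println("placeholder indices:", gate(scenes))
}
```

Run as part of CI, a non-empty result blocks the milestone from being marked complete.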

Explicit data contracts. When milestones have producer-consumer relationships, the interface between them is defined in the spec. “M3 produces CeremonyResult. M6 consumes it to render scenes 5–11.” This prevents the two-halves-that-don’t-connect problem.

Integration assertions. The spec now includes testable statements: “Scene 12 must display the actual signature R and S values.” “The signature must verify against the combined public key.” “tss-ceremony --message 'test' must not error.” These become acceptance tests that agents can run.
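The must-not-error assertion reduces to running the built binary and checking its exit status. A minimal shape, with the actual binary name swapped for a command guaranteed to exist in this sketch:

```go
package main

import (
	"fmt"
	"os/exec"
)

// smokeTest runs a command and reports whether it exited cleanly --
// the shape of the acceptance check "tss-ceremony --message 'test'
// must not error".
func smokeTest(name string, args ...string) error {
	return exec.Command(name, args...).Run()
}

func main() {
	// In CI this would be:
	//   smokeTest("./tss-ceremony", "--message", "test", "--fixed")
	// "go version" stands in here so the sketch runs anywhere.
	if err := smokeTest("go", "version"); err != nil {
		fmt.Println("acceptance test failed:", err)
		return
	}
	fmt.Println("acceptance test passed")
}
```

Because the check exercises the real binary end to end, it catches exactly the class of bug that component-level tests missed.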

Placeholder tracking. Every stub or placeholder introduced by an agent gets flagged as blocking. A post-pass audit verifies zero remain.

Wiring before polish. The new rule: if you build a scene, you must also update model.go to use it. A rough but connected implementation beats a polished but disconnected one every time.

The Broader Lesson

AI coding agents are incredibly productive. They wrote correct cryptographic primitives, clean architecture, and thorough tests. That’s remarkable.

But they optimized for local completeness — each milestone done well in isolation — without tracking global coherence. The result was a project where the bonus features sparkled and the core product didn’t exist.

This maps directly to the junior engineer analogy. A talented new hire will deliver great work on the tasks they’re assigned. But they won’t notice that their feature doesn’t connect to the main workflow. They won’t flag that a critical milestone was never assigned. They’ll write tests against what they built, not against what should have been built.

That’s not a failure of the agent. That’s a failure of the harness.

If you’re using agentic development — and I think you should be — invest in the process around it. Completion gates, data contracts, integration tests, placeholder tracking. The upfront cost is small. The return, especially on complex multi-milestone projects, is more than worth it.