HITL in Vibe Coding and IaC: Avoid the Long Bill

Generative AI already writes specs, code, and complete pipelines. Removing the human from the process doesn’t give you speed: it gives you burned tokens, drift in production, and on-call nights you could have avoided with a single well-placed checkpoint.

In 2026, corporate discourse goes almost entirely in one direction: autonomous agents, full automation, self-healing pipelines. The promise is seductive because it sells. The operational reality is that almost no team is ready to execute that promise without a human validating the critical points of the flow.

I’ve written before about the false promise of bought autonomy. That post diagnosed the problem. This is the prescriptive version: where to place the HITL (Human In The Loop), what to validate at each gate, and why skipping that discipline gets paid for in tokens, MTTR, and rework hours.

HITL is not a brake, it’s a multiplier

There’s a deeply rooted confusion in teams just exploring agentic flows: thinking that HITL means “the human reviews everything” or “the human slows down the AI.” Neither of those scales.

Well-designed HITL works as a gate in critical state transitions. The AI suggests, proposes, optimizes, detects typos the human eye misses. The human approves the step from one stage to the next when that step is costly to reverse. It’s the same logic we’ve applied in CI/CD for years: you don’t block every commit, you block the merge to main and the deploy to production.

Applied to AI-driven flows, the two places where the gate pays for itself with interest are spec definition in Vibe Coding and the IaC path to production.

HITL in Vibe Coding: the gate is in the SDD, not in the code

When an agent delivers a mediocre Pull Request, the natural reflex is to review the code line by line. It’s too late. The error is almost never in the code; it’s in the spec or prompt that generated that code.

Spec Driven Development (SDD) gives structure to the agent: requirements, scenarios, design decisions, tasks. Without that structure, the agent hallucinates interfaces, invents contracts, and mixes domains. With that structure, the agent advances with less noise and more predictability.

The problem is that a poorly defined spec is radioactive. The agent will interpret it literally and generate 800 lines of code that perfectly fulfill something that wasn’t what you wanted. Then come the corrections, the re-prompts, the partial rollbacks. Each iteration consumes the full repo context, intermediate specs, and the conversation history.

A conservative estimate based on real projects: a medium feature started with a loose spec usually requires between 3 and 5 extra correction iterations, each consuming between 30k and 80k tokens. That’s roughly 90k to 400k burned tokens that didn’t add value and only undid a decision that was made wrong at the start.
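The arithmetic behind that range, as a small sketch. The function name is made up for illustration, and the inputs are the figures from the estimate above, not measurements. The low end is 3 × 30k = 90k (about 100k when rounded) and the high end is 5 × 80k = 400k.

```python
def wasted_tokens(extra_iterations: int, tokens_per_iteration: int) -> int:
    """Tokens spent on correction rounds that only undo a bad spec decision."""
    return extra_iterations * tokens_per_iteration


# Bounds taken from the estimate in the text: 3 to 5 extra iterations,
# 30k to 80k tokens each.
low = wasted_tokens(3, 30_000)
high = wasted_tokens(5, 80_000)

if __name__ == "__main__":
    print(f"{low:,} to {high:,} tokens")
```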

The HITL gate in SDD is cheap by comparison: ten minutes checking that the spec describes the correct problem, that the scenarios cover the edge cases you know, and that the decisions reflect the project’s real stack. That gate prevents the agent from generating half a system based on a wrong premise.

SDD frameworks are not magic

Here’s a point many teams overlook: adopting an SDD framework like OpenSpec or SpecKit doesn’t solve the problem just by installing it. The framework gives you the skeleton: folder structure, artifact types, execution flow, hooks. What it doesn’t give you is domain context, your organization’s rules, or your stack’s conventions.

If you leave the framework at its default configuration, the agent keeps hallucinating. It doesn’t hallucinate less by using OpenSpec; it hallucinates differently. It will invent libraries, suggest microservice patterns where your project is a modular monolith, propose event-driven architectures when your five-person team can’t operate them well.

Customizing the framework is engineering work: domain glossary injected into the context, project coding rules, stack restrictions (runtime versions, allowed databases, vetted libraries), naming conventions, testing criteria. That layer is what turns a generic framework into something that actually reduces hallucinations.
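One way to picture stack restrictions as something checkable rather than tribal knowledge is the sketch below. It is not the API of OpenSpec or SpecKit: StackRules, find_violations, and the shape of the spec dictionary are all hypothetical, chosen only to show the idea of rejecting a spec that names a library or database outside the vetted list.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class StackRules:
    """Hypothetical project rules: what the agent is allowed to propose."""
    allowed_libraries: frozenset[str]
    allowed_databases: frozenset[str]


def find_violations(spec: dict, rules: StackRules) -> list[str]:
    """Return one message per item in the spec that breaks the rules."""
    violations = []
    for lib in spec.get("libraries", []):
        if lib not in rules.allowed_libraries:
            violations.append(f"library not vetted: {lib}")
    for db in spec.get("databases", []):
        if db not in rules.allowed_databases:
            violations.append(f"database not allowed: {db}")
    return violations


if __name__ == "__main__":
    # Example values are invented for illustration.
    rules = StackRules(
        allowed_libraries=frozenset({"requests", "sqlalchemy"}),
        allowed_databases=frozenset({"postgres"}),
    )
    spec = {"libraries": ["requests", "leftpad"], "databases": ["mongodb"]}
    for message in find_violations(spec, rules):
        print(message)
```

A check like this only narrows what reaches the reviewer. Whether the rules themselves are still right for the project remains a human call.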

HITL coexists with all of this. The human validates that the framework is well configured, that the rules stay up to date, and that each generated spec respects those rules before the agent moves to implementation. Without that validation, the framework only gives the appearance of rigor to a flow that remains chaotic.

HITL in IaC and GitOps: the gate goes before the apply, not after

In infrastructure, the temptation to let the AI run CLI commands directly is high. There are agents that can run terraform plan, terraform apply, kubectl apply, gh workflow run. They work. The problem is that the cost of an error in infra isn’t measured in re-prompts, it’s measured in outages.

A real case that keeps repeating: an agent generates a Terraform change in which a for_each receives a map whose keys differ from the ones recorded in the state. Terraform tracks each for_each instance by its key, so it treats a renamed key as one resource to destroy and another to create. To a reviewer without enough context, the diff looks reasonable. The plan shows “5 to add, 5 to destroy.” If nobody reviews that plan with judgment, the apply deletes five production resources and recreates them with new IDs. The result is broken endpoints and downtime measured in minutes in the best case, or in hours if DNS or anything else that propagates slowly is involved.
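A minimal sketch of how a plan like that could be surfaced to the reviewer instead of buried in the diff. It assumes the plan has been exported to JSON (for example with terraform show -json) and already parsed into a dictionary, and it relies on the resource_changes list and the per-resource change.actions field of that format. The function name and the sample data are invented for illustration.

```python
def destructive_changes(plan: dict) -> list[str]:
    """Return the addresses of resources the plan would delete or replace.

    A replacement appears in the plan JSON as an actions list that
    contains both "delete" and "create", so checking for "delete"
    covers plain deletions and replacements alike.
    """
    flagged = []
    for resource in plan.get("resource_changes", []):
        actions = resource.get("change", {}).get("actions", [])
        if "delete" in actions:
            flagged.append(resource["address"])
    return flagged


# Trimmed, made-up plan: a renamed for_each key shows up as one delete
# plus one create, under two different addresses.
SAMPLE_PLAN = {
    "resource_changes": [
        {"address": 'aws_instance.api["old-a"]', "change": {"actions": ["delete"]}},
        {"address": 'aws_instance.api["new-a"]', "change": {"actions": ["create"]}},
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["update"]}},
        {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
    ]
}

if __name__ == "__main__":
    for address in destructive_changes(SAMPLE_PLAN):
        print(f"needs human review: {address}")
```

The script does not decide anything. It only makes sure the destructive part of the plan is the first thing the human sees.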

The HITL in IaC doesn’t mean a human approves every terraform apply. That creates too much friction and ends up in rubber-stamping, which is worse than having no gate. The useful HITL is at two specific points:

Pull Request review before the merge, with the plan attached in the PR (Atlantis style, Terraform Cloud, or Argo CD with preview). The human reads the plan and approves the change when they understand what will be touched.
Promotion gate between environments (staging → prod), where a human confirms that what was applied in staging behaved as expected before propagating to prod.
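The promotion gate can be sketched as a small decision function. Everything here is hypothetical (the PromotionRequest fields, the function name, the messages). It only illustrates the rule that prod requires a prior staging apply, healthy staging checks, and a named human approver.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class PromotionRequest:
    """Hypothetical record of a change waiting to go from staging to prod."""
    change_id: str
    applied_in_staging: bool
    staging_checks_passed: bool
    approved_by: Optional[str] = None  # name of the human approver, if any


def promotion_decision(request: PromotionRequest) -> Tuple[bool, str]:
    """Return (allowed, reason). Prod stays closed until every condition holds."""
    if not request.applied_in_staging:
        return False, "change has not been applied in staging"
    if not request.staging_checks_passed:
        return False, "staging checks did not pass"
    if not request.approved_by:
        return False, "waiting for human approval"
    return True, f"approved by {request.approved_by}"


if __name__ == "__main__":
    pending = PromotionRequest("chg-1", applied_in_staging=True, staging_checks_passed=True)
    print(promotion_decision(pending))
```

The order of the checks is deliberate: the human is only asked to approve once the automated evidence from staging is already in place.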

What the AI contributes in this flow is valuable and specific: it catches typos, checks that the configuration is valid, suggests module optimizations, compares the diff against the state, and annotates potential risks in the PR. It’s work a human does slowly and poorly because it’s repetitive. The AI does it quickly and consistently.

An estimate based on teams that adopted this model: gating the merge and promotion with HITL reduces serious incidents attributable to infra changes by 30 to 50%. It doesn’t eliminate incidents, but pushes them to less costly categories and leaves the MTTR much healthier because the rollback is decided with context, not in panic.

Gate checklist: where to place the human

Actionable summary, designed for teams just setting up their flow with agents:

Gate 1, Approved spec: before the agent generates a single line of code, a human validates that the spec describes the correct problem, the scenarios cover the known edge cases, and the decisions reflect the real stack.
Gate 2, SDD framework configured: periodically review that the framework’s rules, glossaries, and restrictions are up to date with the project’s evolution. Frameworks don’t self-administer.
Gate 3, PR review with visible plan: in IaC, no merge to main unless the terraform plan (or equivalent) output from the pipeline run is attached to the PR and has been read by a human who understands which resources it touches.
Gate 4, Environment promotion: the staging to prod step requires human confirmation, ideally with staging metrics attached. Automatic apply in prod without prior staging validation is technical debt disguised as speed.
Gate 5, Agent output audit: periodic spot-check of PRs approved by agents to detect quality drift before it becomes systemic.
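For Gate 5, a spot-check can be as simple as drawing a random sample of PRs for a human to reread. This is a sketch under assumptions: the function name, the integer PR identifiers, and the optional seed parameter are illustrative and not part of any specific tool.

```python
import random
from typing import Iterable, List, Optional


def pick_for_audit(pr_ids: Iterable[int], sample_size: int,
                   seed: Optional[int] = None) -> List[int]:
    """Pick up to sample_size distinct PRs at random for human review.

    Passing a seed makes the draw reproducible, which helps when the
    audit itself needs to be traceable.
    """
    candidates = list(pr_ids)
    rng = random.Random(seed)
    k = min(sample_size, len(candidates))
    return sorted(rng.sample(candidates, k))


if __name__ == "__main__":
    # PR numbers are invented for illustration.
    print(pick_for_audit(range(100, 120), sample_size=5, seed=7))
```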

Conclusion

The interesting conversation in 2026 is no longer whether to use AI in the development cycle, but where to let it decide alone and where to force a human in the middle. The teams getting real ROI understood it: HITL isn’t resistance to change, it’s engineering discipline applied to the new stack.

Skipping that discipline out of enthusiasm or pressure from an aggressive roadmap is cheap at the beginning and expensive at the end. The bill comes in the form of consumed tokens, production incidents, and eroded trust with the business. Placing the correct gates costs less than any of those three things.

AI is a brutal multiplier when it operates within a framework we define for it. Without that framework, the multiplier works just as well for chaos.