Give Claude a way to verify its infra work

Antrieb: Disposable Environments for AI Agents

Coding agents have become remarkably capable in recent years. Give Claude, Codex, or Gemini access to a repository and they can write features, fix bugs, refactor code, and complete complex tasks with minimal supervision.

Part of that improvement comes from better models, but a large part comes from the loop around the model. Modern coding agents go beyond writing code: they execute it, run tests, inspect results, and use those results to improve their work. Anthropic captures the pattern succinctly:

Claude stops when the work looks done. Without a check it can run, "looks done" is the only signal available, and you become the verification loop: every mistake waits for you to notice it. Give Claude something that produces a pass or fail, and the loop closes on its own. Claude does the work, runs the check, reads the result, and iterates until the check passes.
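The loop in the quote can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_until_pass`, `CheckResult`, and the `attempt` and `check` callables are hypothetical names, not part of Claude or of any agent framework.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    passed: bool  # the pass/fail signal
    output: str   # what the agent reads before the next attempt


def run_until_pass(
    attempt: Callable[[str], str],
    check: Callable[[str], CheckResult],
    max_rounds: int = 5,
) -> tuple[bool, int]:
    """Do the work, run the check, read the result, iterate.

    Returns (passed, rounds_used). Both callables are hypothetical stand-ins:
    `attempt` produces an artifact from the last feedback, `check` judges it.
    """
    feedback = ""
    for round_no in range(1, max_rounds + 1):
        artifact = attempt(feedback)  # do the work
        result = check(artifact)      # run the check
        if result.passed:
            return True, round_no
        feedback = result.output      # read the result, then iterate
    return False, max_rounds
```

Without a `check` that returns a real pass or fail, the only way out of this loop is a human reading each artifact by hand.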

This lesson applies equally to infrastructure code. Unfortunately, checking an infrastructure task is non-trivial and has received less attention in the industry. How does a model check the SONiC or Arista config it just generated? How does it validate that a newly generated deployment script actually works?

Antrieb fills this gap so that humans stop being the verification loop for LLM-generated infra code. It does so by offering environment-level primitives to spin up VM clusters instantly, wire them, use them, and dispose of them. We created short demos showing Antrieb in Action.

The Runtime for Infra Is a Full Environment

Software has runtimes in the form of containers or virtual machines. Infrastructure has a larger runtime: an environment with machines, clusters, networks, switches, routers, storage, operating systems, and services. Once those things exist, the checks are familiar. Can the nodes reach one another? Did DHCP assign addresses correctly? Did the cluster form? Are pods distributed as expected?
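As a minimal sketch, the first of those questions (can the nodes reach one another?) can be turned into a pass/fail check. The names below are hypothetical, and the `can_reach` probe is an assumption supplied by the caller; in a real environment it might run a ping from one node to another.

```python
from itertools import permutations
from typing import Callable, Iterable


def unreachable_pairs(
    nodes: Iterable[str],
    can_reach: Callable[[str, str], bool],
) -> list[tuple[str, str]]:
    """Return every ordered (source, destination) pair that failed the probe.

    `can_reach` is a hypothetical probe supplied by the caller.
    """
    return [
        (src, dst)
        for src, dst in permutations(list(nodes), 2)
        if not can_reach(src, dst)
    ]


def mesh_check(nodes: Iterable[str], can_reach: Callable[[str, str], bool]) -> bool:
    """Pass/fail form of the question: can the nodes reach one another?"""
    return not unreachable_pairs(nodes, can_reach)
```

Returning the failing pairs, and not only a boolean, gives the agent something concrete to read before its next attempt.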

There are obvious, apparently simple answers: use staging, give the agent access to a shared lab, or hand it a hypervisor. Each helps, but each misses part of the loop. A human-operated environment leaves the human as the conduit. A shared long-lived environment accumulates state and loses reproducibility. Raw hypervisor access pushes too much low-level orchestration into the agent’s reasoning loop. The useful environment is one the agent can operate directly: it mimics the real world (faithfulness), can be recreated quickly (instantaneity), and can be thrown away after each attempt (disposability).

Faithfulness

Faithful means the environment preserves the parts of the real system that matter for the check. Faithfulness has many dimensions: operating system image, kernel version, package set, network topology, etc. Each infrastructure task depends on a different subset of those dimensions, so faithfulness has to be chosen deliberately.

For instance, a BGP exercise depends heavily on routing behavior, while Kubernetes scheduling may care more about node topology and labels. Reproducing every dimension is usually expensive and often unnecessary, so the useful environment is the one that preserves the properties that decide the outcome being exercised.
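One way to make that choice explicit is to name the dimensions a task depends on and compare the test environment with the target system on those dimensions only. The sketch below is illustrative: the function name, the dimension keys, and the sample values are all assumptions, not an Antrieb feature.

```python
def faithfulness_gaps(
    target: dict[str, str],
    env: dict[str, str],
    required: set[str],
) -> dict[str, tuple]:
    """Report mismatches only on the dimensions the task depends on.

    Maps each mismatching dimension to (target_value, env_value).
    Dimensions outside `required` are ignored on purpose.
    """
    gaps = {}
    for dim in sorted(required):
        if target.get(dim) != env.get(dim):
            gaps[dim] = (target.get(dim), env.get(dim))
    return gaps


# Hypothetical descriptions of the target system and the test environment.
target = {"os_image": "ubuntu-22.04", "kernel": "5.15", "topology": "two-lan"}
env = {"os_image": "ubuntu-22.04", "kernel": "6.8", "topology": "two-lan"}

# A scheduling exercise that only cares about topology sees no gap,
# while a kernel-sensitive task sees one.
scheduling_gaps = faithfulness_gaps(target, env, {"topology"})
kernel_gaps = faithfulness_gaps(target, env, {"kernel"})
```

The same environment is faithful enough for one task and not for another, which is why the required set belongs to the task and not to the environment.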

Two further dimensions also affect faithfulness: the age of the environment under test (for example, legacy systems) and its physical distance from the datacenter (for example, edge or field deployments).

Disposability

A test runner starts clean: no leftover state, no half-applied fixes, no residue from the previous attempt. Infrastructure loses that property quickly because the agent patches, retries, changes routes, restarts services, leaves temporary files, and modifies cluster state. After a few rounds, "it works now" may only mean the environment has evolved around the mistakes.

The same applies when the agent's original hypothesis rested on a wrong assumption, such as wrong NIC names. The agent should be able to throw the environment away and recreate a clean one.

A disposable environment restores the clean slate. If a solution cannot be reproduced on a fresh environment, it does not count.
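The rule can be sketched with a context manager that always destroys the environment, paired with a helper that accepts a solution only when it passes on a freshly created one. The `create`, `destroy`, `apply_solution`, and `check` callables are hypothetical placeholders for whatever provisions, tears down, configures, and verifies the environment.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, TypeVar

Env = TypeVar("Env")


@contextmanager
def disposable(
    create: Callable[[], Env],
    destroy: Callable[[Env], None],
) -> Iterator[Env]:
    """Yield a fresh environment and always throw it away afterwards."""
    env = create()
    try:
        yield env
    finally:
        destroy(env)  # runs even when the solution or the check raises


def reproduces_on_fresh(
    create: Callable[[], Env],
    destroy: Callable[[Env], None],
    apply_solution: Callable[[Env], None],
    check: Callable[[Env], bool],
) -> bool:
    """A solution counts only if it passes on a newly created environment."""
    with disposable(create, destroy) as env:
        apply_solution(env)
        return check(env)
```

Because every call starts from `create()`, fixes that only worked thanks to residue from earlier attempts fail here.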

Instantaneity

Instantaneity matters because the loop has to stay usable. Slow provisioning creates more opportunities for timeouts, retries, partial output, stale assumptions, and half-explained state. Those artifacts become part of the environment state and of the agent's working context. That means the model has to reason over more noise while still deciding what to try next.

The human side follows the same pattern. A loop measured in seconds keeps supervision engaged, while a loop measured in minutes sends attention elsewhere, to coffee, Slack, email, or some other distraction. By the time the environment is ready, the context the person was holding has to be rebuilt.

Verification works best when the environment can be recreated fast enough to stay inside the loop, both for the model and for the person watching it. In fact, this applies to both manual and automated verification. Slow or awkward verification creates pressure to skip verification. That is how "looks done" infrastructure code can make its way toward production.

Antrieb

Antrieb is an attempt to implement the runtime described in this article. It gives infrastructure agents access to the primitives they need to close the loop themselves: machines, clusters, LANs, NICs, switches, routers, egress, operating system images, command execution, and runbooks. An agent can provision up to four virtual machines in roughly two seconds, wire them up, deploy packages, run checks, inspect failures, and destroy the environment when done. Because the environment appears quickly, verification stays inside the agent's reasoning loop instead of becoming a separate human operational task.

The available operating systems include general-purpose Linux distributions, network operating systems, and firewall/router images such as VyOS, OpenWrt, SONiC, and OPNsense. Users can also create custom images to extend faithfulness.

The reader is welcome to try Antrieb. The easiest way is through Claude.ai. Add a connector with the MCP server: https://antrieb.sh/mcp. Then run an infra task.

For example:

Create a SONiC switch with two isolated LANs, attach three Ubuntu nodes, install k3s, deploy nginx with anti-affinity, and verify failover.

Or:

Create a 4-node cluster. On three nodes, install httpd on port 6700; on the fourth, install HAProxy and configure it as a load balancer.
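Prompts like these imply checks that the agent has to run once the environment is up. As a rough sketch, two of them can be written as pure pass/fail functions. The function names and input shapes are assumptions: the pod placement might be gathered from `kubectl get pods -o wide`, and the list of responders from repeated requests to the load balancer, but how the data is collected is left out here.

```python
from collections import Counter


def anti_affinity_holds(pod_to_node: dict[str, str]) -> bool:
    """True when no two replicas were scheduled onto the same node."""
    per_node = Counter(pod_to_node.values())
    return all(count == 1 for count in per_node.values())


def all_backends_served(responders: list[str], backends: set[str]) -> bool:
    """True when exactly the expected backends answered at least once."""
    return set(responders) == backends
```

For example, `anti_affinity_holds({"nginx-a": "node1", "nginx-b": "node1"})` fails because both replicas share a node, and `all_backends_served(["n1", "n1"], {"n1", "n2", "n3"})` fails because two backends never answered.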

Not every check needs live execution by the AI agent. When infra code is generated, some questions can be answered from the existing configuration through static analysis, simulation, or modelling, and Antrieb supports these approaches as well. For instance, Cisco and Arista configs can be checked with the Batfish images. These approaches answer different questions, and the agent can reach for whichever the task needs.

Conclusion

The advice from Anthropic's guide is simple:

"Claude does the work, runs the check, reads the result, and iterates until the check passes."

That loop is now ordinary in software, and infrastructure deserves the same treatment. Today, many infrastructure prompts produce infrastructure code that, in the optimistic case, a human then verifies. The model writes, and the human provisions, deploys, inspects, diagnoses, and decides what passes.

A realistic disposable environment changes that loop because the agent can provision the machines, wire them, deploy the system, inspect the outcome, run checks, and keep iterating until the checks pass. That is the idea behind instant ephemeral infrastructure for AI agents. See Antrieb in Action.

If AI infrastructure agents are going to become genuinely useful and safe, they need a place where reality can answer back. Antrieb gives them that place.