Anvil doesn't just write code — it plans, executes, verifies, and recovers. Built on 210K+ agent traces across 21 projects. Stops only when tests pass.
Most coding agents write code and hand it to you. Anvil runs a continuous Plan → Execute → Verify → Recover loop — iterating until every test passes.
Analyze the task, decompose into steps, identify dependencies and constraints before writing a single line.
Generate code with full context awareness — understanding project structure, conventions, and test expectations.
Run the test suite. If tests fail, the loop doesn't stop — every failure is a signal, not a dead end.
Parse errors, diagnose root causes, and apply targeted fixes. Then re-verify. Repeat until green.
Unlike other agents, Anvil doesn't just write code — it verifies it, recovers from errors, and only stops when tests pass. This is the forge: heat, hammer, quench, test. Repeat.
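
Here is a rough sketch of what that loop could look like in code. It is not Anvil's actual implementation: `run_loop`, `VerifyResult`, the four callables, and the `max_attempts` cap are all hypothetical names chosen for illustration, and the "project" is a toy one-line function with a planted bug.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VerifyResult:
    """Outcome of one verification pass (hypothetical shape)."""
    passed: bool
    failures: List[str] = field(default_factory=list)


def run_loop(task, plan, execute, verify, recover, max_attempts=5):
    """Plan once, execute once, then verify and recover until tests pass.

    max_attempts is an assumption added so the sketch always terminates.
    """
    steps = plan(task)                          # Plan
    code = execute(steps)                       # Execute
    for attempt in range(1, max_attempts + 1):
        result = verify(code)                   # Verify
        if result.passed:
            return code, attempt
        code = recover(code, result.failures)   # Recover, then re-verify
    raise RuntimeError(f"tests still failing after {max_attempts} attempts")


# Toy stand-ins for the four stages.
def plan(task):
    return [task]


def execute(steps):
    # Deliberately buggy first draft: subtracts instead of adding.
    return "def add(a, b):\n    return a - b\n"


def verify(code):
    namespace = {}
    exec(code, namespace)  # load the candidate code
    got = namespace["add"](2, 3)
    if got == 5:
        return VerifyResult(passed=True)
    return VerifyResult(False, [f"add(2, 3) returned {got}, expected 5"])


def recover(code, failures):
    # A real recovery step would diagnose the failure; this one hard-codes the fix.
    return code.replace("a - b", "a + b")


final_code, attempts = run_loop("implement add", plan, execute, verify, recover)
print(f"green after {attempts} attempts")
```

The first verify pass fails, recovery patches the operator, and the second pass goes green. The cap on attempts is a design choice of this sketch, not something the loop description above specifies.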
Watch the verification loop run end-to-end. Anvil writes code, runs tests, catches failures, and fixes them — all without human intervention.
A complete ecosystem for building, training, and deploying self-verified coding agents — from the core loop to specialized models to production infrastructure.
The self-verified coding agent. Plans, executes, verifies, recovers — until tests pass.
The verification loop engine. Runs tests, parses failures, feeds signals back to recovery.
Structured error diagnosis and targeted fix generation from test failure signals.
Multi-agent orchestration for parallel task execution with shared context.
14B-parameter coding model fine-tuned on verified agent traces. Full code understanding.
Lightweight shell command model for terminal interaction and environment execution.
Reasoning critic model for verifying logic, catching edge cases, and detecting hallucinations.
210K+ verified agent traces across Python, JS, Rust, and Go: the training fuel for Anvil models.
Filters, cleans, and distills raw agent traces into high-quality training examples.
Compiles multi-step agent traces into structured, tokenized training data for fine-tuning.
Production runtime for deploying Anvil agents with sandboxed execution and monitoring.
Observability for agent loops — trace every verify-recover cycle, token usage, and latency.
Automated benchmarking framework for evaluating agent performance across coding tasks.
Three purpose-built models — each trained on verified agent traces, each designed to make the verification loop stronger.
Install Anvil, run your first task, and watch the verification loop in action.
Anvil with the verification loop outperforms the no-verification baseline on every benchmark below, by 12.7 to 17.1 percentage points.
| Benchmark | Anvil (w/ Verify) | Without Verify | Δ (pp) |
|---|---|---|---|
| HumanEval | 89.2% | 72.1% | +17.1 |
| MBPP | 84.6% | 68.3% | +16.3 |
| SWE-Bench Lite | 31.4% | 18.7% | +12.7 |
| LiveCodeBench | 71.8% | 55.2% | +16.6 |
| MultiPL-E (avg) | 76.3% | 61.9% | +14.4 |
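
The Δ column is a plain subtraction of the two scores, so it is a difference in percentage points rather than a relative improvement. A small sketch that recomputes it from the numbers in the table (the names `scores` and `delta_pp` are illustrative only):

```python
# Scores copied from the table above: (with verify, without verify), in percent.
scores = {
    "HumanEval": (89.2, 72.1),
    "MBPP": (84.6, 68.3),
    "SWE-Bench Lite": (31.4, 18.7),
    "LiveCodeBench": (71.8, 55.2),
    "MultiPL-E (avg)": (76.3, 61.9),
}


def delta_pp(with_verify, without_verify):
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(with_verify - without_verify, 1)


deltas = {name: delta_pp(*pair) for name, pair in scores.items()}

for name, delta in deltas.items():
    print(f"{name}: +{delta} pp")
```
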
Join the community shaping the future of self-verified coding agents. Every issue, PR, and trace makes Anvil stronger.