One Agent Per Layer: Building an AI Software Development Team

A full dev team that never sleeps, never burns out, and follows your process every time.


In our previous post, we showed how to encode consulting methodology into a team of AI agents. Today we’re doing the same thing for software development.

Here’s the conventional wisdom: you give an AI a prompt, it writes code, you deploy it. Maybe you add “you are an expert developer” to the system prompt. Maybe you paste in your architecture docs for context.

This works for scripts. It doesn’t work for production software.

The problem isn’t that AI can’t write code. It can. The problem is that writing code is maybe 30% of software development. The other 70% is: understanding the problem, designing the architecture, testing edge cases, deploying safely, and monitoring what happens after.

Human dev teams don’t put one person in charge of all of that. Why would you do it with AI?

The Architecture: One Agent Per Layer

AgentForge is an 8-agent pipeline where each agent owns one phase of the software development lifecycle:

Feature Request
     │
     ▼
┌─────────────┐
│ Orchestrator │ — Manages the pipeline, enforces quality gates
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Strategist  │ — Frames the problem, writes user stories
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Analyst    │ — Detailed requirements, acceptance criteria
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Architect   │ — Technical design, component structure, tech stack
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Developer   │ — Writes the code, commits to repo
└──────┬──────┘
       │
       ▼
┌─────────────┐
│     QA       │ — Tests everything against acceptance criteria
└──────┬──────┘
       │ (fail → loop back to Developer or Architect)
       │
       ▼
┌─────────────┐
│   DevOps     │ — Deploys, verifies, documents rollback plan
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Monitor    │ — Post-deploy health check, catches what testing missed
└─────────────┘

One agent per layer. Each one has a SOUL.md that defines exactly what it does, what it expects as input, and what it must produce as output.
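
To make that concrete, here’s a minimal sketch of what a QA agent’s SOUL.md might contain. The headings and wording are illustrative, not the actual file, but the pattern is the same for every agent: role, inputs, outputs, and the rules it must not break.

SOUL.md (QA agent, illustrative sketch)

Role: Test the Developer’s output against every acceptance criterion in the requirements document.
Inputs: the requirements document (acceptance criteria), the architecture document, and the committed code.
Output: a QA report with a pass/fail verdict per acceptance criterion and the evidence behind each verdict.
Rules:
- Never report “pass” without evidence (commands run, output captured).
- Any critical or high severity bug blocks the handoff to DevOps.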

Why Not One Super-Agent?

We tried. We really did. Here’s what happens when you give one agent a prompt that says “you are a full-stack developer who also does architecture, testing, deployment, and monitoring”:

  1. It skips steps. Under pressure to produce code, it jumps straight to implementation. Requirements? “I’ll figure them out as I go.” Testing? “I tested it in my head.”
  2. It optimizes for completion, not quality. A single agent wants to finish. Multiple agents have competing incentives — QA wants to find bugs, Developer wants clean code, Architect wants clean design. These tensions produce better software.
  3. It can’t review its own work. This is the fundamental problem. An agent that writes code and then tests that same code has an inherent blind spot. Separating them forces genuine adversarial review.

The Document-Driven Pipeline

Here’s the key design principle: agents don’t share state — they pass documents.

When the Strategist finishes, it produces a markdown file with user stories and success criteria. The Analyst reads that document (not the Strategist’s internal context) and produces a requirements document. The Architect reads the requirements document and produces a technical design.

Why?

  1. Auditability. If something goes wrong in production, you can trace back through every intermediate document and find exactly where the logic broke down.
  2. Human-injectable. You can edit any document before passing it to the next agent. Don’t like the architecture? Modify it before the Developer sees it.
  3. Debuggable. When QA fails, you don’t have to re-run the whole pipeline. You read the architecture doc, find the gap, and loop back to just that phase.

Every deliverable is stored:

project-root/
├── .forge/
│   ├── 001-strategy.md
│   ├── 002-requirements.md
│   ├── 003-architecture.md
│   ├── 004-qa-report.md
│   ├── 005-deploy-report.md
│   └── 006-monitor-report.md

Complete audit trail. Every decision documented. Every phase reviewable.
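
As a sketch of what one handoff looks like in practice, here’s roughly how the Analyst phase could be driven. The file names come from the tree above; run_agent is a placeholder for whatever LLM call backs that layer, and agents/analyst/SOUL.md is a hypothetical location, not a documented layout.

from pathlib import Path

FORGE = Path(".forge")  # lives at the project root, as in the tree above

def run_agent(soul_path: str, input_doc: str) -> str:
    """Placeholder: load the agent's SOUL.md, send it plus the input document
    to whatever model backs that layer, and return the markdown it produces."""
    raise NotImplementedError

def analyst_phase() -> Path:
    # The Analyst reads only the Strategist's document, never its internal context.
    strategy = (FORGE / "001-strategy.md").read_text()
    requirements = run_agent("agents/analyst/SOUL.md", strategy)
    out_path = FORGE / "002-requirements.md"
    out_path.write_text(requirements)  # the Architect reads this file and nothing else
    return out_path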

Quality Gates: Where the Magic Happens

Between every phase, the Orchestrator checks quality criteria before passing work forward:

- Strategist → Analyst: Are user stories in “As a [user], I want [X] so that [Y]” format? Are success criteria measurable?
- Analyst → Architect: Are requirements MECE? Does every user story have acceptance criteria?
- Architect → Developer: Is every component specified? Are tech choices justified?
- Developer → QA: Does code match the architecture doc? Is it documented?
- QA → DevOps: Do ALL acceptance criteria pass? Zero critical/high bugs?
- DevOps → Monitor: Is deployment verified? Is rollback plan documented?

If a gate fails, work loops back. In our testing, QA → Developer is the most common loop — the QA agent finds issues the Developer missed, and the Developer fixes them. QA → Architect is rarer but more impactful — it happens when the QA agent discovers a design-level issue that no amount of code fixing will solve.

These loops aren’t failures. They’re the system enforcing discipline that human teams often skip under deadline pressure.
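
Here’s a minimal sketch of how a gate and its loop-back could be wired. The gate predicate and the run_with_gate helper are hypothetical simplifications, not the Orchestrator’s actual code:

import re

# One illustrative gate for Strategist → Analyst: every user story must follow
# the "As a [user], I want [X] so that [Y]" template.
USER_STORY = re.compile(r"As an? .+, I want .+ so that .+", re.IGNORECASE)

def strategy_gate(doc: str) -> bool:
    stories = [line for line in doc.splitlines()
               if line.lstrip("-* ").lower().startswith("as ")]
    return bool(stories) and all(USER_STORY.search(s) for s in stories)

def run_with_gate(run_phase, gate, max_loops: int = 3) -> str:
    """Re-run a phase until its quality gate passes, then pass the document forward."""
    for attempt in range(max_loops):
        doc = run_phase(attempt)  # e.g. re-prompt the agent, including the gate feedback
        if gate(doc):
            return doc            # work moves to the next agent in the pipeline
    raise RuntimeError("Gate still failing after max loops; escalate to a human")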

A Real Example

To illustrate the pipeline, here’s how we’d break down a task we actually faced: building the Nananami website. The site was built iteratively with human direction and AI assistance, not autonomously by the 8-agent pipeline (we hadn’t fully wired the orchestration yet). But mapping the work retrospectively onto AgentForge’s phases shows how the pipeline should work, and it’s close to what actually happened:

Task: “Build a customer intake form for our consulting website with post-payment flow”

Here’s what each phase produced:

Strategist (3 minutes):
- Defined three user personas (new client, returning client, enterprise inquiry)
- User stories with acceptance criteria
- Success metric: “Form submission → Stripe payment → intake data captured in <2 minutes”

Analyst (2 minutes):
- Specified 12 form fields with validation rules
- Edge cases: what if payment fails? What if they close the tab mid-form?
- GDPR requirements for data handling
- Mobile responsiveness requirements

Architect (3 minutes):
- HTML + Tailwind (no framework needed for a static form)
- Stripe Payment Links for payment (no backend)
- Formspree for form data collection
- File structure: 4 files, hosted on Azure Static Web Apps

Developer (5 minutes):
- Implemented all 4 files
- Connected Stripe payment links to pricing buttons
- Built the post-payment intake form at /success
- Added responsive design breakpoints

QA (3 minutes):
- Tested all form validations ✅
- Tested mobile rendering ✅
- Found: Stripe link redirects weren’t passing client context → LOOP BACK to Developer
- Developer fix: added URL parameters to capture which package was purchased → QA re-tested ✅

DevOps (2 minutes):
- Pushed to GitHub → auto-deployed via GitHub Actions to Azure
- Verified HTTPS, custom domain, CDN caching
- Documented rollback: git revert + force deploy

Monitor (1 minute):
- Verified all links functional
- Confirmed form submissions arriving in Formspree
- Confirmed Stripe payment links active in live mode

Estimated total pipeline time: under 30 minutes. In reality, we built nananami.com iteratively over a longer session with human decision-making at each step. The pipeline above represents our target workflow; fully wiring the autonomous orchestration is still in progress.

Lessons from the Trenches

What broke:
- The Developer agent sometimes “improved” the architecture without being asked. We had to add explicit guardrails: “Implement EXACTLY the architecture spec. If you think it needs changes, flag it — don’t change it.”
- QA initially wasn’t aggressive enough. It would say “all tests pass” when it hadn’t actually tested edge cases. We rewrote the QA SOUL.md to include mandatory checklists and evidence requirements.
- The Monitor agent was an afterthought. We added it after a deployment that “worked” but had a broken link on mobile. Now it catches things QA missed in the real environment.

What surprised us:
- The Strategist-to-Analyst handoff was the highest-value step. In our experience, most AI-generated code fails not because the code is bad, but because the requirements are wrong. Having a dedicated Analyst agent that turns vague stories into precise acceptance criteria significantly reduces downstream issues.
- The loop-back mechanism creates natural “sprints.” When we’ve tested the pipeline, runs typically involve 1-2 loop-backs, which means the final output has been reviewed and revised — not just generated once and shipped.

What we’d change:
- Add a “Code Reviewer” agent between Developer and QA — dedicated to style, patterns, and maintainability (QA focuses on correctness)
- Make the Architect agent aware of previous projects to suggest reusable patterns
- Build a “Post-Mortem” agent that analyzes QA failures across runs to improve SOUL.md files over time

What We’re Exploring Next

The architecture is built. The SOUL.md files are written. Here’s what we’re actively testing and curious about:

  1. Autonomous orchestration end-to-end: We’ve tested each agent individually and in pairs. Wiring the full 8-agent pipeline with automatic quality gate enforcement (using OpenClaw’s sessions_spawn) is our current engineering focus. The goal: give it a feature request, walk away, come back to deployed code with a complete audit trail.

  2. Model mixing across layers: Not every agent needs the most expensive model. Could the Monitor run on a local model (Qwen, Llama) while the Architect runs on Claude? The document-driven pipeline makes this possible because agents only share files, not context. We want to measure quality vs. cost tradeoffs per layer; a small configuration sketch follows this list.

  3. Learning across runs: When QA catches a bug, that’s useful information for the next time the Developer encounters a similar pattern. Can we build a feedback loop where QA failure reports automatically update the Developer’s SOUL.md with new anti-patterns? This would be the AI equivalent of a dev team’s institutional knowledge.

  4. Parallel agent execution: The Strategist → Analyst → Architect flow is sequential. But after architecture is defined, could the Developer work on independent modules in parallel using multiple spawned agents? This is theoretically straightforward with the document-driven approach but untested.

  5. Client-specific pipeline templates: Instead of customizing SOUL.md files from scratch for each client, could we build a library of domain-specific templates (fintech, healthcare, e-commerce) that pre-load relevant coding standards, compliance requirements, and architectural patterns?
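
To make the model-mixing idea concrete, here’s a hypothetical per-layer mapping. The model names and the MODEL_BY_AGENT table are illustrative assumptions, not a config format we ship:

# Hypothetical: which model backs which layer. Because agents only exchange
# markdown files, each phase can call a different provider with no shared context.
MODEL_BY_AGENT = {
    "strategist": "claude-sonnet",
    "analyst":    "claude-sonnet",
    "architect":  "claude-opus",   # design quality matters most here
    "developer":  "claude-opus",
    "qa":         "claude-sonnet",
    "devops":     "local-qwen",    # mostly procedural, candidate for a local model
    "monitor":    "local-llama",   # simple post-deploy health checks
}

def model_for(agent: str) -> str:
    return MODEL_BY_AGENT[agent]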

These are hypotheses we’re working to validate. We’ll share results as we learn.

The Power of Customization

Here’s why one-agent-per-layer matters for businesses: you can swap any agent without changing the pipeline. Different tech stack? Rewrite the Architect’s and Developer’s SOUL.md files. Stricter deployment process? Adjust the DevOps agent.

The pipeline stays the same. Only the expertise changes. This is how we onboard new clients — copy the agents directory, customize the SOUL.md files for their domain, and the methodology transfers instantly.

Try It Yourself

The full architecture, all SOUL.md files, and the pipeline documentation are open-source: github.com/IotecBol/agenticAI

And if you want us to customize an AgentForge pipeline for your team — tailored to your tech stack, your coding standards, your deployment process — book a free 30-minute consultation.


This is Part 2 of our series on multi-agent AI systems. Previously: “Cloning McKinsey: How We Built an AI Consulting Team.”

Next up: “When AI Meets the FDA: Building a 10-Agent Regulatory Pipeline.”

— Nananami