2 minute read

Part 1 of 3: Setup and methodology for this AI agent comparison using the BMAD method.

TL;DR: This study compares Claude Sonnet 4.5, Codex, and Gemini 2.5 Pro/Flash on the same development workflow. This post covers the experimental setup and the core learning objectives.

The Goal

AI agents can dramatically accelerate software development. But with several models now offering “coding assistant” modes, which one should you trust to run an actual Agile workflow? To determine the best daily driver, a controlled test was conducted comparing Claude Sonnet 4.5, Codex (gpt‑5‑codex), and Gemini 2.5 Pro/Flash under the same conditions.

The goal was not merely to see who ‘wins,’ but to understand how to optimize their performance: what makes one model stumble, and how to structure the work to make them succeed.

The BMAD Method

The experiment used the open-source BMAD method, which defines repeatable “agent personas” (Product Owner, Scrum Master, Architect, Developer, QA, and so on) that can be run inside different AI CLIs. Each persona has its own prompt context and tasks. BMAD installs directly into your project:

$ npx bmad-method install

After a few setup questions, it configures all three CLIs (Claude, Codex, Gemini) with identical agent commands like /bmad:agents:sm or /bmad:tasks:create-next-story. The result: a consistent sandbox for comparing how each model behaves with the same project structure.

Example: Activating the Scrum Master agent
> /bmad:agents:sm
🏃 Bob - Scrum Master Activated
Ready to assist! What would you like me to do?

The Test Project

To keep things realistic, the test used Squirrel, an in-progress ‘production-ready’ rewrite of an earlier toy app that stores PDFs. The task for each model was to implement Story 2.2, a worker endpoint that fetches and cleans web articles. Unlike the earlier version, which had a single endpoint and a simple data flow, this story spans multiple services (a Rust API, a Python worker, and integration tests), making it a much stronger test of each model’s reasoning and consistency and a perfect mid-development checkpoint.

Story 2.2 PRD Summary

System Goal: The Python worker must fetch web article URLs and extract clean content using newspaper4k, so that users can read articles offline.

Acceptance Criteria (abridged):

  1. Fetch article with metadata.
  2. Parse content via newspaper4k.
  3. Store extracted data in PostgreSQL.
  4. Handle failures gracefully.
  5. Add integration tests verifying successful extraction.
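
To make the story concrete, here is a minimal sketch of the kind of logic these criteria describe. The actual Squirrel worker code isn't shown in this post, so everything below (the extract_article function, the articles table, and its columns) is an assumption for illustration, using newspaper4k and psycopg rather than the project's real code:

# Hypothetical sketch only; not the real Squirrel worker code.
# Assumes newspaper4k (which imports as `newspaper`), psycopg 3, and an
# `articles` table. All names here are illustrative.
import psycopg
from newspaper import Article


def extract_article(url: str, conn: psycopg.Connection) -> bool:
    """Fetch a web article, parse it with newspaper4k, and store the result."""
    try:
        article = Article(url)
        article.download()    # AC 1: fetch the article
        article.parse()       # AC 2: extract title, text, and metadata
    except Exception:         # AC 4: broad catch keeps the sketch simple; fail gracefully
        return False

    with conn.cursor() as cur:   # AC 3: persist the extracted data
        cur.execute(
            "INSERT INTO articles (url, title, authors, published_at, body) "
            "VALUES (%s, %s, %s, %s, %s)",
            (url, article.title, ", ".join(article.authors),
             article.publish_date, article.text),
        )
    conn.commit()
    return True

The integration tests from AC 5 would then exercise this path end to end against a test database.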

The Workflow

Each model worked through the same four BMAD roles in sequence. Each phase was triggered manually, letting the agent hand off to the next, and minimal hints were provided to maintain parity across models.
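
For a sense of what this looked like in practice, a run proceeded roughly as follows. Only /bmad:agents:sm and /bmad:tasks:create-next-story appear elsewhere in this post; the later persona commands and the responses are assumptions about a typical BMAD session, not transcripts from these runs:

> /bmad:agents:sm
🏃 Bob - Scrum Master Activated
> /bmad:tasks:create-next-story
Story 2.2 drafted from the PRD. Ready to hand off to the Developer agent.
> /bmad:agents:dev   (assumed Developer persona command)
... story implemented, tests written ...
> /bmad:agents:qa    (assumed QA persona command)
... review complete, findings logged ...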

To keep the playing field level, all external MCP servers and plugins were disabled, so each agent relied solely on its own reasoning and the provided project files. Every model faced identical constraints: no hidden context, no external tools; just reasoning and code.

What’s Next

Part 2 details the quantitative and qualitative results: how many lines of code each model added, where they excelled, and which one proved most practical for day-to-day use.

If you’d like to follow the full series:

Comments/Suggestions?

NOTE: You'll need a GitHub account and will have to grant giscus comment access. This is necessary to allow it to post a comment on your behalf. If you don't feel comfortable giving giscus access, please find the corresponding topic and manually comment here.