Part 2 of 3: Quantitative and qualitative results from comparing the three AI agents.

TL;DR: Codex delivered the cleanest, most maintainable code; Claude came close with superior documentation; Gemini lagged in depth but won on user experience. Below are the results, metrics, and my hands‑on impressions.

Results

To help with brainstorming and add some diversity of opinion, I brought in Claude Sonnet 4.5 as an independent judge; its scores are included below. I have reviewed its numbers and agree with them. See here for a comprehensive summary (generated by Claude and peer‑reviewed by me).

Summary

(Claude’s scores as independent judge vs. my own developer assessment.)

| Model | Claude’s Score | My Score | Lines Added | QA Fixes | Test Files | Architecture Grade |
|-------|----------------|----------|-------------|----------|------------|--------------------|
| Codex | 9.0/10 | 7.6/10 | 2,338 | 8 | 8 | A+ |
| Claude v1 | 7.2/10 | 8.4/10 | 4,138 | 8 | 8 | B |
| Gemini | 6.4/10 | 6.4/10 | 1,930 | 10 | 11 | B‑ |

Claude’s Rubric Summary

| Rubric | Claude | Gemini | Codex | Notes |
|--------|--------|--------|-------|-------|
| Documentation | 8/10 | 7/10 | 9/10 | Codex: comprehensive docstrings with examples; Claude v1: good inline docs; Gemini: adequate but less detailed |
| Test Quality | 8/10 | 7/10 | 9/10 | Codex: best coverage & organization; Claude v1: good fixtures; Gemini: fewer test scenarios |
| Code Readability | 7/10 | 6/10 | 9/10 | Codex: clear class structure, dependency injection; Claude v1: procedural but clear; Gemini: less organized |
| Code Sustainability | 6/10 | 6/10 | 9/10 | Codex: excellent separation of concerns, easy to extend; Claude v1 & Gemini: moderate coupling |
| Architecture Quality | 7/10 | 6/10 | 9/10 | Codex: best practices (DI, singleton services, object storage); Claude v1: solid but simpler; Gemini: functional but basic |
| Overall Score | 7.2/10 | 6.4/10 | 9.0/10 | Codex demonstrates significantly stronger engineering practices |

My Personal Rubric Summary

Claude’s rubric emphasizes structural rigor, while my own focuses more on day-to-day developer experience.

| Category | Claude | Gemini | Codex | Notes |
|----------|--------|--------|-------|-------|
| Documentation | 10/10 | 9/10 | 8/10 | Claude produced beautiful, exhaustive docstrings. |
| Code Quality | 7/10 | 4/10 | 10/10 | Codex wrote professional‑grade code with clear abstractions. |
| Debugging | 8/10 | 2/10 | 8/10 | Gemini struggled to escape loops; Claude & Codex self‑corrected. |
| Speed | 9/10 | 8/10 | 4/10 | Claude was fastest, Codex slowest. |
| CLI Experience | 8/10 | 9/10 | 8/10 | Gemini’s fallback to Flash made quota exhaustion less painful. |
| Overall Score | 8.4/10 | 6.4/10 | 7.6/10 | |

Claude

Claude Sonnet 4.5 was the most consistent collaborator. Its generated tests used realistic HTML fixtures, mocked HTTP calls with libraries like responses, and followed clear async patterns. It produced verbose but readable documentation and had the lowest context churn.
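
For flavor, here is a minimal sketch in the style of the tests Claude generated, assuming a pytest setup; the fetch_title function, the fixture, and the URL are hypothetical stand‑ins, not actual project code:

```python
# Hypothetical sketch: a pytest case that stubs an HTTP call with the
# `responses` library and parses a small HTML fixture.
import responses
import requests
from bs4 import BeautifulSoup


def fetch_title(url: str) -> str:
    """Fetch a page and return the text of its <title> tag."""
    html = requests.get(url, timeout=10).text
    return BeautifulSoup(html, "html.parser").title.get_text()


@responses.activate
def test_fetch_title_parses_html_fixture():
    # Register a canned HTML response so no real network call happens.
    responses.add(
        responses.GET,
        "https://example.com/",
        body="<html><head><title>Fixture Page</title></head></html>",
        status=200,
        content_type="text/html",
    )
    assert fetch_title("https://example.com/") == "Fixture Page"
```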

Pros:

  • Excellent test coverage and fixtures.
  • Clear documentation scaffolding that encourages future edits.
  • Fast iteration speed.

Cons:

  • Slightly procedural code structure.
  • Over‑parameterized functions (could have used data classes; see the sketch below).
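
A quick sketch of the refactor I have in mind; the function and its parameters are hypothetical, not the project’s actual code:

```python
# Hypothetical illustration: collapsing a long parameter list
# into a data class.
from dataclasses import dataclass


# What Claude tended to produce: every option as a separate parameter.
def crawl(url, max_depth, timeout, retries, user_agent, follow_redirects):
    ...


# What a data class buys you: one typed, self-documenting config object.
@dataclass
class CrawlConfig:
    url: str
    max_depth: int = 2
    timeout: float = 10.0
    retries: int = 3
    user_agent: str = "my-crawler/1.0"
    follow_redirects: bool = True


def crawl_with_config(config: CrawlConfig):
    ...
```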

Codex

Codex felt like an experienced senior engineer: slightly slower but much deeper. It made holistic design decisions, introducing dependency injection, typed service classes, and structured exception hierarchies.
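
To illustrate those patterns, here is a minimal sketch (class and method names are mine, not the project’s) of a typed service class with an injected HTTP client and a small exception hierarchy:

```python
# Hypothetical sketch of the patterns Codex favored: a structured
# exception hierarchy plus a typed service class that receives its
# dependency through the constructor (dependency injection).
from typing import Protocol


class ScraperError(Exception):
    """Base class for all scraper errors."""


class FetchError(ScraperError):
    """Raised when a page cannot be retrieved."""


class HttpClient(Protocol):
    def get(self, url: str) -> str: ...


class PageService:
    # The HTTP client is injected, so tests can pass a fake
    # instead of monkeypatching a global session.
    def __init__(self, client: HttpClient) -> None:
        self._client = client

    def fetch(self, url: str) -> str:
        try:
            return self._client.get(url)
        except Exception as exc:
            raise FetchError(f"failed to fetch {url}") from exc
```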

Pros:

  • Cleanest architecture, best abstraction boundaries.
  • Found and fixed subtle issues (e.g. network access, port collisions).
  • Consistent naming, modularity, and maintainability.

Cons:

  • Slowest by far (some steps took 15 minutes vs. 2–3 minutes for Claude).
  • Over‑engineered at times (e.g. wrapper classes inside tests).

Gemini

Gemini offered the best user experience but faltered on engineering consistency. Its Pro quota ran out often, and the CLI downgraded to Flash, which degraded reasoning depth.

Pros:

  • Smooth CLI experience.
  • Simpler code: easier to read for small scripts.
  • Flash fallback prevented total workflow interruption.

Cons:

  • Incomplete or incorrect implementations.
  • Frequent logic loops and weak debugging.
  • Lost context between sessions unless manually saved.

The Numbers Behind the Experience

| Metric | Claude | Gemini | Codex | Observation |
|--------|--------|--------|-------|-------------|
| Lines Added | 4,138 | 1,930 | 2,338 | Claude writes the most verbose code. |
| Lines Deleted | 80 | 398 | 655 | Codex performs structural clean‑up. |
| Net Change | +4,058 | +1,532 | +1,683 | Gemini least verbose, Codex most balanced. |
| Files Modified | 34 | 35 | 29 | Similar project impact across agents. |
| Test Files | 8 | 11 | 8 | Gemini added the most, but with lower quality. |
| QA Fix Commits | 8 | 10 | 8 | Gemini required more corrections. |
| Total Commits | 21 | 30 | 20 | Gemini needed more iterations. |

Developer’s Take

If I had to choose today:

  • Claude wins for day‑to‑day velocity and documentation.
  • Codex wins for long‑term maintainability.
  • Gemini wins for user experience and accessibility.

What’s Next

In Part 3, I explore what happened when I changed one simple instruction: asking the Product Owner persona to explicitly reason through architectural trade‑offs. That single tweak improved Claude’s score by nearly 20% and produced a cleaner architecture.
