Claude vs Codex vs Gemini: Results and Analysis
Part 2 of 3: Quantitative and qualitative results from comparing the three AI agents.
- Part 1: Setup and methodology
- Part 2 (this post): Results and analysis
- Part 3: Why asking the model to reason changed everything
TL;DR: Codex delivered the cleanest, most maintainable code; Claude came close with superior documentation; Gemini lagged in depth but won on user experience. Below are the results, metrics, and my hands‑on impressions.
Results
To help with brainstorming and bring in some diversity of opinion, I added Claude Sonnet 4.5 as an independent judge and have included its scores below. I reviewed its numbers and agree with them. See here for a comprehensive summary (generated by Claude and peer-reviewed by me).
Summary
(Scores from the Claude judge vs. my own developer assessment.)
| Model | Claude’s Score | My Score | Lines Added | QA Fixes | Test Files | Architecture Grade |
|---|---|---|---|---|---|---|
| Codex | 9.0/10 | 7.6/10 | 2,338 | 8 | 8 | A+ |
| Claude v1 | 7.2/10 | 8.4/10 | 4,138 | 8 | 8 | B |
| Gemini | 6.4/10 | 6.4/10 | 1,930 | 10 | 11 | B‑ |
Claude’s Rubric Summary
| Rubric | Claude | Gemini | Codex | Notes |
|---|---|---|---|---|
| Documentation | 8/10 | 7/10 | 9/10 | Codex: comprehensive docstrings with examples; Claude v1: good inline docs; Gemini: adequate but less detailed |
| Test Quality | 8/10 | 7/10 | 9/10 | Codex: best coverage & organization; Claude v1: good fixtures; Gemini: fewer test scenarios |
| Code Readability | 7/10 | 6/10 | 9/10 | Codex: clear class structure, dependency injection; Claude v1: procedural but clear; Gemini: less organized |
| Code Sustainability | 6/10 | 6/10 | 9/10 | Codex: excellent separation of concerns, easy to extend; Claude v1 & Gemini: moderate coupling |
| Architecture Quality | 7/10 | 6/10 | 9/10 | Codex: best practices (DI, singleton services, object storage); Claude v1: solid but simpler; Gemini: functional but basic |
| Overall Score | 7.2/10 | 6.4/10 | 9.0/10 | Codex demonstrates significantly stronger engineering practices |
My Personal Rubric Summary
Claude’s rubric emphasizes structural rigor, while my own focuses more on day-to-day developer experience.
| Category | Claude | Gemini | Codex | Notes |
|---|---|---|---|---|
| Documentation | 10/10 | 9/10 | 8/10 | Claude produced beautiful, exhaustive docstrings. |
| Code Quality | 7/10 | 4/10 | 10/10 | Codex wrote professional‑grade code with clear abstractions. |
| Debugging | 8/10 | 2/10 | 8/10 | Gemini struggled to escape loops; Claude & Codex self‑corrected. |
| Speed | 9/10 | 8/10 | 4/10 | Codex was slowest, Claude fastest. |
| CLI Experience | 8/10 | 9/10 | 8/10 | Gemini’s fallback to Flash made quota exhaustion less painful. |
| Overall Score | 8.4/10 | 6.4/10 | 7.6/10 | |
Claude
Claude Sonnet 4.5 was the most consistent collaborator. Its generated tests used realistic HTML fixtures, mocking libraries like `responses`, and clear async patterns. It produced verbose but readable documentation and had the lowest context churn.
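To make that concrete, here is a minimal sketch of the testing pattern I mean: a saved HTML fixture replayed through the `responses` library instead of hitting the network. The module, function, and fixture names are placeholders I made up, not code any of the agents actually wrote.

```python
# Sketch only: fixture path and extract_title are hypothetical stand-ins.
import pathlib

import responses

from myproject.scraper import extract_title  # hypothetical function under test

FIXTURE = pathlib.Path("tests/fixtures/article.html")  # saved real-world HTML


@responses.activate
def test_extract_title_from_real_markup():
    # Serve the fixture in place of the live page.
    responses.add(
        responses.GET,
        "https://example.com/article",
        body=FIXTURE.read_text(),
        status=200,
        content_type="text/html",
    )
    assert extract_title("https://example.com/article") == "Expected headline"
```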
Pros:
- Excellent test coverage and fixtures.
- Clear documentation scaffolding that encourages future edits.
- Fast iteration speed.
Cons:
- Slightly procedural code structure.
- Over‑parameterized functions (could have used data classes; see the sketch after this list).
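A rough before/after illustration of that last point; the function and field names are invented for illustration, not lifted from Claude's output.

```python
from dataclasses import dataclass


# The pattern Claude tended toward: many loosely related keyword arguments
# threaded through every call site.
def fetch_page(url, timeout, retries, user_agent, proxy, verify_ssl):
    ...


# One tidier alternative: bundle the knobs into a small config object.
@dataclass
class FetchConfig:
    timeout: float = 10.0
    retries: int = 3
    user_agent: str = "my-bot/1.0"
    proxy: str | None = None
    verify_ssl: bool = True


def fetch_page_v2(url: str, config: FetchConfig | None = None) -> str:
    config = config or FetchConfig()
    ...  # perform the request using config.timeout, config.retries, etc.
```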
Codex
Codex felt like an experienced senior engineer: slightly slower but much deeper. It made holistic design decisions, introducing dependency injection, typed service classes, and structured exception hierarchies.
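Roughly the style I mean, sketched with invented names (and `httpx` assumed as the HTTP client; the real project may use something else):

```python
# Sketch of Codex's preferred shape: a typed service class that receives its
# dependencies, plus a small structured exception hierarchy. Names are
# illustrative, not Codex's actual output.
from __future__ import annotations

import httpx  # assumed HTTP client for this sketch


class ScraperError(Exception):
    """Base class for scraper failures."""


class FetchError(ScraperError):
    """Raised when a remote page cannot be retrieved."""


class PageService:
    def __init__(self, client: httpx.Client, base_url: str) -> None:
        # Dependencies are injected, so tests can pass a stubbed client.
        self._client = client
        self._base_url = base_url

    def get_page(self, path: str) -> str:
        try:
            response = self._client.get(f"{self._base_url}{path}")
            response.raise_for_status()
        except httpx.HTTPError as exc:
            raise FetchError(f"failed to fetch {path}") from exc
        return response.text
```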
Pros:
- Cleanest architecture, best abstraction boundaries.
- Found and fixed subtle issues (e.g. network access, port collisions).
- Consistent naming, modularity, and maintainability.
Cons:
- Slowest by far (15 minutes for some steps vs. 2–3 minutes with Claude).
- Over‑engineered at times (e.g. wrapper classes inside tests).
Gemini
Gemini had the best UX of the three but faltered in engineering consistency. Its Pro quota ran out often, and the CLI downgraded to Flash, which degraded reasoning depth.
Pros:
- Smooth CLI experience.
- Simpler code: easier to read for small scripts.
- Flash fallback prevents total workflow interruption.
Cons:
- Incomplete or incorrect implementations.
- Frequent logic loops and weak debugging.
- Lost context between sessions unless manually saved.
The Numbers Behind the Experience
| Metric | Claude | Gemini | Codex | Observation |
|---|---|---|---|---|
| Lines Added | 4,138 | 1,930 | 2,338 | Claude writes the most verbose code. |
| Lines Deleted | 80 | 398 | 655 | Codex performs structural clean‑up. |
| Net Change | +4,058 | +1,532 | +1,683 | Gemini least verbose, Codex most balanced. |
| Files Modified | 34 | 35 | 29 | Similar project impact across agents. |
| Test Files | 8 | 11 | 8 | Gemini added the most but with lower quality. |
| QA Fix Commits | 8 | 10 | 8 | Gemini required more corrections. |
| Total Commits | 21 | 30 | 20 | Gemini needed more iterations. |
Developer’s Take
If I had to choose today:
- Claude wins for day‑to‑day velocity and documentation.
- Codex wins for long‑term maintainability.
- Gemini wins for user experience and accessibility.
What’s Next
In Part 3, I explore what happened when I changed one simple instruction: asking the Product Owner persona to explicitly reason through architectural trade‑offs. That single tweak improved Claude’s score by nearly 20% and produced a cleaner architecture.
If you’d like to follow the full series:
- Part 1: Setup and methodology
- Part 2 (this post): Results and analysis
- Part 3: Why asking the model to reason changed everything