Claude vs Codex vs Gemini: Results and Analysis
Part 2 of 3: Quantitative and qualitative results from comparing the three AI agents.
- Part 1: Setup and methodology
- Part 2 (this post): Results and analysis
- Part 3: Why asking the model to reason changed everything
TL;DR: Codex delivered the cleanest, most maintainable code; Claude came close with superior documentation; Gemini lagged in depth but won on user experience. Below are the results, metrics, and my hands‑on impressions.
Results
To help with brainstorming and bring in some diversity of opinion, I added Claude Sonnet 4.5 as an independent judge and have included its scores below. I reviewed its numbers and agree with them. See here for a comprehensive summary (generated by Claude and peer-reviewed by me).
Summary
(Scores from the Claude judge vs. my own developer assessment.)
| Model | Claude’s Score | My Score | Lines Added | QA Fixes | Test Files | Architecture Grade |
|---|---|---|---|---|---|---|
| Codex | 9.0/10 | 7.6/10 | 2,338 | 8 | 8 | A+ |
| Claude v1 | 7.2/10 | 8.4/10 | 4,138 | 8 | 8 | B |
| Gemini | 6.4/10 | 6.4/10 | 1,930 | 10 | 11 | B‑ |
Claude’s Rubric Summary
| Rubric | Claude | Gemini | Codex | Notes |
|---|---|---|---|---|
| Documentation | 8/10 | 7/10 | 9/10 | Codex: comprehensive docstrings with examples; Claude v1: good inline docs; Gemini: adequate but less detailed |
| Test Quality | 8/10 | 7/10 | 9/10 | Codex: best coverage & organization; Claude v1: good fixtures; Gemini: fewer test scenarios |
| Code Readability | 7/10 | 6/10 | 9/10 | Codex: clear class structure, dependency injection; Claude v1: procedural but clear; Gemini: less organized |
| Code Sustainability | 6/10 | 6/10 | 9/10 | Codex: excellent separation of concerns, easy to extend; Claude v1 & Gemini: moderate coupling |
| Architecture Quality | 7/10 | 6/10 | 9/10 | Codex: best practices (DI, singleton services, object storage); Claude v1: solid but simpler; Gemini: functional but basic |
| Overall Score | 7.2/10 | 6.4/10 | 9.0/10 | Codex demonstrates significantly stronger engineering practices |
My Personal Rubric Summary
Claude’s rubric emphasizes structural rigor, while my own focuses more on day-to-day developer experience.
| Category | Claude | Gemini | Codex | Notes |
|---|---|---|---|---|
| Documentation | 10/10 | 9/10 | 8/10 | Claude produced beautiful, exhaustive docstrings. |
| Code Quality | 7/10 | 4/10 | 10/10 | Codex wrote professional‑grade code with clear abstractions. |
| Debugging | 8/10 | 2/10 | 8/10 | Gemini struggled to escape loops; Claude & Codex self‑corrected. |
| Speed | 9/10 | 8/10 | 4/10 | Codex was slowest, Claude fastest. |
| CLI Experience | 8/10 | 9/10 | 8/10 | Gemini’s fallback to Flash made quota exhaustion less painful. |
| Overall Score | 8.4/10 | 6.4/10 | 7.6/10 | |
Claude
Claude Sonnet 4.5 was the most consistent collaborator. Its generated tests used realistic HTML fixtures, mocking libraries like `responses`, and clear async patterns. It produced verbose but readable documentation and had the lowest context churn.
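To make that concrete, here is a minimal sketch of the testing pattern I mean: a saved HTML fixture replayed through the `responses` library instead of hitting the network. The module, function, and fixture names are placeholders I made up, not code any of the agents actually wrote.

```python
# Sketch only: fixture path and extract_title are hypothetical stand-ins.
import pathlib

import responses

from myproject.scraper import extract_title  # hypothetical function under test

FIXTURE = pathlib.Path("tests/fixtures/article.html")  # saved real-world HTML


@responses.activate
def test_extract_title_from_real_markup():
    # Serve the fixture in place of the live page.
    responses.add(
        responses.GET,
        "https://example.com/article",
        body=FIXTURE.read_text(),
        status=200,
        content_type="text/html",
    )
    assert extract_title("https://example.com/article") == "Expected headline"
```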
Pros:
- Excellent test coverage and fixtures.
- Clear documentation scaffolding that encourages future edits.
- Fast iteration speed.
Cons:
- Slightly procedural code structure.
- Over‑parameterized functions (could have used data classes; see the sketch after this list).
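A rough before/after illustration of that last point; the function and field names are invented for illustration, not lifted from Claude's output.

```python
from dataclasses import dataclass


# The pattern Claude tended toward: many loosely related keyword arguments
# threaded through every call site.
def fetch_page(url, timeout, retries, user_agent, proxy, verify_ssl):
    ...


# One tidier alternative: bundle the knobs into a small config object.
@dataclass
class FetchConfig:
    timeout: float = 10.0
    retries: int = 3
    user_agent: str = "my-bot/1.0"
    proxy: str | None = None
    verify_ssl: bool = True


def fetch_page_v2(url: str, config: FetchConfig | None = None) -> str:
    config = config or FetchConfig()
    ...  # perform the request using config.timeout, config.retries, etc.
```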
Codex
Codex felt like an experienced senior engineer: slightly slower but much deeper. It made holistic design decisions, introducing dependency injection, typed service classes, and structured exception hierarchies.
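Roughly the style I mean, sketched with invented names (and `httpx` assumed as the HTTP client; the real project may use something else):

```python
# Sketch of Codex's preferred shape: a typed service class that receives its
# dependencies, plus a small structured exception hierarchy. Names are
# illustrative, not Codex's actual output.
from __future__ import annotations

import httpx  # assumed HTTP client for this sketch


class ScraperError(Exception):
    """Base class for scraper failures."""


class FetchError(ScraperError):
    """Raised when a remote page cannot be retrieved."""


class PageService:
    def __init__(self, client: httpx.Client, base_url: str) -> None:
        # Dependencies are injected, so tests can pass a stubbed client.
        self._client = client
        self._base_url = base_url

    def get_page(self, path: str) -> str:
        try:
            response = self._client.get(f"{self._base_url}{path}")
            response.raise_for_status()
        except httpx.HTTPError as exc:
            raise FetchError(f"failed to fetch {path}") from exc
        return response.text
```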
Pros:
- Cleanest architecture, best abstraction boundaries.
- Found and fixed subtle issues (e.g. network access, port collisions).
- Consistent naming, modularity, and maintainability.
Cons:
- Slowest by far (15 minutes for some steps vs. 2–3 minutes with Claude).
- Over‑engineered at times (e.g. wrapper classes inside tests).
Gemini
Gemini had the best UX of the three but faltered in engineering consistency. Its Pro quota ran out often, and the CLI downgraded to Flash, which degraded reasoning depth.
Pros:
- Smooth CLI experience.
- Simpler code: easier to read for small scripts.
- Flash fallback prevents total workflow interruption.
Cons:
- Incomplete or incorrect implementations.
- Frequent logic loops and weak debugging.
- Lost context between sessions unless manually saved.
The Numbers Behind the Experience
| Metric | Claude | Gemini | Codex | Observation |
|---|---|---|---|---|
| Lines Added | 4,138 | 1,930 | 2,338 | Claude writes the most verbose code. |
| Lines Deleted | 80 | 398 | 655 | Codex performs structural clean‑up. |
| Net Change | +4,058 | +1,532 | +1,683 | Gemini least verbose, Codex most balanced. |
| Files Modified | 34 | 35 | 29 | Similar project impact across agents. |
| Test Files | 8 | 11 | 8 | Gemini added the most but with lower quality. |
| QA Fix Commits | 8 | 10 | 8 | Gemini required more corrections. |
| Total Commits | 21 | 30 | 20 | Gemini needed more iterations. |
Developer’s Take
If I had to choose today:
- Claude wins for day‑to‑day velocity and documentation.
- Codex wins for long‑term maintainability.
- Gemini wins for user experience and accessibility.
What’s Next
In Part 3, I explore what happened when I changed one simple instruction: asking the Product Owner persona to explicitly reason through architectural trade‑offs. That single tweak improved Claude’s score by nearly 20% and produced a cleaner architecture.
If you’d like to follow the full series:
- Part 1: Setup and methodology
- Part 2 (this post): Results and analysis
- Part 3: Why asking the model to reason changed everything