AI Tooling Evaluation

Evaluation of software-engineering AI tools

A pragmatic evaluation of software-engineering AI tooling intended for use by more than 1,000 software engineers, focused on measuring productivity in realistic development tasks rather than relying only on demos, vendor claims, or subjective impressions.

The pilot compared GitHub Copilot, Cursor, and Claude Code across real Adevinta software-engineering work. It involved 77 engineers, 14 teams, and 165 tracked tasks over four weeks. The objective was practical: decide which tooling could create meaningful value in real codebases, with real workflows, and with enough measurement discipline to support an enterprise rollout decision.

The pilot allowed us to choose the best tool for Adevinta at the time, several months ahead of the overall consensus.

Evaluation design

The central measurement artifact was a task tracker. For each task, engineers recorded task characteristics, AI-tool experience level, estimated time without AI, expected PR revisions, actual time with AI, actual PR revisions, task outcome, and perceived usefulness. This made it possible to compare estimated cycle time without AI against actual delivery time with AI, while still capturing qualitative evidence from surveys, focus groups, and manager discussions.
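As an illustration of how tracker rows like these could be turned into a comparison, here is a minimal sketch in Python. It is not the pilot's actual analysis: the field names, the time-saved formula (one minus actual hours divided by estimated hours), and the placeholder tool labels are all assumptions made for this example, and the sample rows are invented rather than taken from the pilot.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaskRecord:
    """One tracker row. Field names are hypothetical, not the pilot's schema."""
    tool: str
    estimated_hours_without_ai: float
    actual_hours_with_ai: float
    completed: bool


def summarize_by_tool(records):
    """Aggregate a time-saved ratio and a completion rate per tool.

    Assumed proxy: 1 - (total actual hours / total estimated hours).
    Summing hours before dividing keeps small tasks from dominating.
    """
    grouped = defaultdict(list)
    for record in records:
        grouped[record.tool].append(record)

    summary = {}
    for tool, rows in grouped.items():
        estimated = sum(r.estimated_hours_without_ai for r in rows)
        actual = sum(r.actual_hours_with_ai for r in rows)
        if estimated <= 0:
            raise ValueError(f"no positive estimate recorded for {tool}")
        summary[tool] = {
            "tasks": len(rows),
            "time_saved_ratio": 1.0 - actual / estimated,
            "completion_rate": sum(r.completed for r in rows) / len(rows),
        }
    return summary


# Invented sample rows with placeholder tool labels, for illustration only.
sample_records = [
    TaskRecord("tool_a", 8.0, 6.0, True),
    TaskRecord("tool_a", 4.0, 3.0, False),
    TaskRecord("tool_b", 10.0, 5.0, True),
]

summary = summarize_by_tool(sample_records)
for tool, stats in sorted(summary.items()):
    print(tool, stats)
```

Because the baseline is an engineer's estimate rather than a measured control, a ratio like this is best read as a directional proxy, which is why it sits alongside the qualitative evidence rather than replacing it.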

Results

The directional result was that Claude Code showed the strongest performance in this pilot: a higher productivity proxy, a higher completion rate, and a stronger user rating. Cursor showed meaningful gains, especially for more experienced users. GitHub Copilot, already familiar to many engineers, showed smaller gains in this specific setup.

You can find many more details in the following links: