How we evaluated Claude Code, Cursor, and GitHub Copilot across 77 engineers and 165 real tasks to determine productivity impact
March 9th, 2026
Originally published on Adevinta Tech Blog
Manuel Sánchez Hernández, Ramón Salvadó Ferrero
Albert Puigsech Galicia
Mario Viñas Ruiz
Francesca Lorenzoni
Diego Duchowney
Ferran Grau
> **How to interpret the results of this pilot.** This was a pragmatic, decision-oriented pilot, done with real work across our teams. Results are directional and internally valid to Adevinta codebases and workflows.
>
> **Limitations and design trade-offs:**
> - Our results rely on experienced software engineers' assessments of how long a task should take, which can be subjective.
> - Tools were assigned to teams to minimize disruption, so results reflect real adoption contexts rather than a randomized trial.
> - Our sample size is limited, so we include confidence intervals with our data.
> - The study lasted only one month, so long-term effects are not measured.
> - The pilot was done in May-June 2025, with the state of the tools at that point in time. All tools have evolved since then.
We evaluated three AI coding assistants: Cursor, Claude Code and GitHub Copilot, with 77 software engineers over 165 real work tasks and four weeks, to understand which AI assistant delivers meaningful, cost-effective impact in Adevinta's environment. The objective of the pilot was to make a decision about which AI tool to deploy; it had to be pragmatic, limited in scope and time, and hence is not an academic study. The data is grounded in Adevinta codebases and workflows.
However, we believe our study is still relevant if the results are interpreted directionally, especially as there is a lack of published results in enterprise settings. In the absence of reliable metrics, in an area where there is so much noise, hype and evolution, we believe that it is important for a company like Adevinta to evaluate first hand different AI tools before making a decision.
Our main findings were: Claude Code demonstrated the strongest performance across all metrics, showing the largest productivity boost, the highest completion rate, and the highest user rating. Cursor provided moderate but meaningful gains, with a solid completion rate but weaker user sentiment. GitHub Copilot, which was the tool most engineers already had access to, showed small, almost negligible productivity gains, the lowest completion rate, and the lowest rating.
The reasons for this difference lie in: (1) The default configuration of Claude Code (which at the time of the pilot worked with a "pay as you go" pricing model) makes it automatically select which model should be used for each prompt, while for Cursor and GitHub Copilot (flat monthly rate) more powerful models need to be selected manually. Claude Code selected the most powerful models 60 times more often than the other tools. (2) Claude Code is easier for users to start with: we observed inexperienced users getting value out of it, while Cursor has a steeper learning curve. We also observed that moderately experienced Cursor users can reach productivity similar to Claude Code's. (3) Cursor, which is an IDE based on VSCode, and GitHub Copilot, a plugin that works for IntelliJ and VSCode but not Xcode, created a temporary learning curve for IntelliJ and Xcode users, which may also have influenced our metrics.
With the data gathered, we estimated that Claude Code had the best ROI and integrates best within Adevinta workflows. Deploying this tool to our more than 2,000 software engineers is a first step towards increasing productivity through AI coding assistants while we work on other fronts, such as automated metrics, training, software engineering workflow integration… We also plan to do periodic evaluations, similar to the one done in this pilot, of different software engineering tools to adjust our strategy.
Generative AI coding assistants are promising double-digit productivity gains for developers. For a company like Adevinta with thousands of software engineers across brands such as Leboncoin, mobile.de, Kleinanzeigen or Marktplaats, even single-digit improvements translate into millions of euros of capacity a year.
At the same time, the AI software engineering tooling ecosystem is changing very rapidly, with new tools arriving every quarter, promising large productivity gains and sometimes disappointing with inconsistent performance. The amount of hype seems to increase at the same pace, with industry leaders claiming the end of human software engineers by the end of 2025.
Contributing to this noise is the jagged frontier: no one knows on which types of tasks AI models will succeed or fail unless they test them themselves. We were quite conscious of that, since our teams have been building and deploying AI models for over a decade, with over 250 use cases currently delivering value.
Additionally, Adevinta had already integrated AI coding tools like GitHub Copilot. While initial engineer feedback was largely positive, we couldn't conclusively link these tools to productivity gains. Informal tests with other software engineering AI tools also yielded inconclusive results.
In summary, due to the potential of this new wave of AI tools, the absence of reliable metrics, the amount of marketing hype, the importance of software engineering to Adevinta, the high growth of the ecosystem and the jagged frontier, we opted to conduct our own pilot program. We believe that first-hand experience with the most advanced tooling was required to decide on the best path forward.
In May and June 2025 we ran a structured evaluation of GitHub Copilot, Cursor and Claude Code, with the aim of answering which assistant delivers the most value in our real-world codebases. We shortlisted these three tools because they were industry leaders at the time and had received positive feedback from many informal tests run by our engineers.
Our pilot lasted four weeks, with 77 engineers from 14 different teams and four marketplaces, working on real tasks from their teams' backlogs. All teams were product teams, each responsible for a part of Adevinta's marketplaces, and made up of various engineering roles such as backend, frontend, platform…
Each team was assigned to one of the three AI tools for the whole period. Before the pilot started we made sure that every engineer received onboarding training on their AI tool from subject matter experts. We also held weekly sessions where they could ask questions, and there were online support channels with experts on each tool available.
Pilot data was gathered from various sources, including two surveys (pre- and post-pilot), three focus groups, four meetings with the team leaders, cost information and telemetry from the AI tools, and information for each individual task performed by the engineers. In total we were able to track 165 tasks during the pilot, which were added to a table, the “task tracker”.
For a summary of the design of the pilot, see the following table:
| Aspect | Design choice |
|---|---|
| Participants | 77 software engineers, 14 cross-functional product squads (back- and front-end, data & mobile developers) |
| Duration | 4 weeks, during May-June 2025 |
| Tools | GitHub Copilot, Cursor and Claude Code |
| Task mix | Bug-fixing, refactoring, feature development, migrations, documentation, planning, tooling creation, each flagged with its corresponding area: backend, frontend, Android, iOS, infra or full-stack |
| Metrics captured | Estimated-vs-actual cycle time, task outcome, PR size & comments, survey ratings, API usage and costs |
We conducted sessions with the teams' managers to align on how to best measure the impact of AI. Even though different metrics were already available (some teams tracked DORA metrics), they were not consistent across teams and marketplaces, hence we agreed that the best option was to track them manually for this pilot. In fact, several software engineering productivity tracking tools resort to polls to track AI productivity.
A key artifact we employed was our “task tracker”, a table where task details were collected. These tasks were pre-existing in the teams’ backlogs, integrated into their plans, and not specifically chosen for the pilot. For each task, information was collected both before it started and after it was completed.
Before beginning each task, engineers provided the following details:
Once the task was completed, engineers were required to add the following information:
Engineers received definitions and guidance for each attribute in the task tracker, with weekly meetings held to address any questions. A comprehensive list and further details are available in Appendix 1.
From the task tracker, we calculated a productivity proxy, or percentage of time saved, by comparing the actual cycle time to the estimated cycle time without AI, using the following formula:

P_tool = 1 − (Σᵢ Aᵢ) / (Σᵢ Eᵢ)

Where P_tool is the productivity of the tool, Σᵢ Aᵢ is the sum of the actual times for all tasks done with the tool, and Σᵢ Eᵢ is the sum of the estimated times without AI for all those tasks. With this formula, longer tasks naturally weigh more. We also determined a 95% confidence interval.
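The metric above can be sketched in a few lines of code. This is a minimal illustration with hypothetical task data; the pilot's exact confidence-interval procedure is not described here, so the bootstrap below is one plausible way to obtain a 95% interval, not necessarily the method used.

```python
import random


def productivity(tasks):
    """Percentage of time saved: P = 1 - (sum of actual days) / (sum of estimated days).

    `tasks` is a list of (actual_days, estimated_days_without_ai) pairs.
    Summing before dividing means longer tasks naturally weigh more.
    """
    total_actual = sum(a for a, _ in tasks)
    total_estimated = sum(e for _, e in tasks)
    return 1.0 - total_actual / total_estimated


def bootstrap_ci(tasks, n_resamples=10_000, alpha=0.05, seed=42):
    """95% confidence interval via bootstrap resampling of tasks
    (an assumption for illustration; other interval methods exist)."""
    rng = random.Random(seed)
    stats = sorted(
        productivity([rng.choice(tasks) for _ in tasks])
        for _ in range(n_resamples)
    )
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples)]
    return lo, hi


# Hypothetical example: 3 tasks, (actual days with AI, estimated days without AI)
tasks = [(1.0, 2.0), (2.0, 3.0), (0.5, 1.0)]
print(f"productivity: {productivity(tasks):.0%}")  # 1 - 3.5/6.0 ≈ 42%
lo, hi = bootstrap_ci(tasks)
print(f"95% CI: [{lo:.0%}, {hi:.0%}]")
```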
In Appendix 2 we analyze different sources of error for the metric and how they were mitigated.
Before starting the pilot, we surveyed our population of 77 engineers to have a better understanding of the adoption of AI tooling. With 70 answers, the most notable results are summarized in the table below:
| Metric | Value |
|---|---|
| Engineers using AI tools daily | 47.1% |
| Used GitHub Copilot | 62.9% |
| Used Cursor | 27.1% |
| Used Claude Code | 8.6% |
| Time in coding activities (%) | 66.8% |
Almost half of the engineers already used AI tools daily, with GitHub Copilot the most used tool, as it was already available within Adevinta for all engineers, followed by Cursor and Claude Code.
We also asked them to estimate how much of their time they spend on coding activities; the average answer was 67%. This result surprised us, as in many other sources the average is between 25% and 40% (Kumar 2025, Meyer 2019, IDC report 2024).
We also asked what development environments they use, and the results are in the chart below. More than half of the users used IntelliJ, with VSCode the second choice. This is relevant because each of the tools tested works differently across environments, and this became an important factor in the results of the pilot.
This is a summary of the main metrics we tracked:
These results were relevant to our decision-making, and some were actually very different from our initial assumptions.
In the following table and violin plot we provide more information:
The distribution in the plot shows that, for many tasks, the productivity of both Cursor and Claude Code is very high. However, a set of tasks drags Cursor's performance down in comparison to Claude Code. We analyze this further in the discussion section.
Some additional results from the pilot were:
One important result from the focus groups and comments collected in the polls concerned how well the tools integrated with the different IDEs and workflows of the software engineers. In particular, according to many comments, the integration of Claude Code into Adevinta's development workflows was easy, as it didn't require changing or learning a new IDE. In contrast, to use Cursor, many engineers who were used to IntelliJ (63% of the total) or Xcode (16%) needed to learn a new tool, as Cursor is a fork of VSCode. Users found GitHub Copilot easier to start with due to its availability as a plugin for both VSCode and IntelliJ.
Additionally, we observed the following:
Results seem to show that Claude Code outperforms the other AI tools on many metrics: our productivity proxy, user experience and average AI completion rate. This outcome was unexpected for us, but further analysis revealed several factors that likely contributed to it:
At the time of the pilot (June 2025), Cursor and GitHub Copilot had a flat rate close to $40 per user per month, while Claude Code used a "pay as you go" consumption model. Our 10 users with the highest consumption spent $202 on Claude Code during the month.
By default, Claude Code selects which model to use depending on the complexity of the task, while Cursor and GitHub Copilot select less powerful models by default. Users could change this, and sometimes did (in Cursor these were called Max models, which may incur extra cost beyond a certain usage), more so as they became more experienced with the tool.
We observed that the amount of tokens consumed by Claude Code on these models was 60 times larger than for the other two tools, which helps explain its higher performance.
Over 50% of our engineers used IntelliJ or Xcode, while Cursor is a fork of VSCode and GitHub Copilot, being a plugin, works for both IntelliJ and VSCode but not for Xcode. These users needed to learn an IDE they were not familiar with, which we believe also reduced their effectiveness during the pilot. This was mentioned by some engineers in the focus groups and the post-survey. This learning curve is temporary, and we believe it also contributed to the difference in productivity by level of experience with the AI tool.
The pilot challenged our initial assumptions, and we adapted our marketplaces' strategy (Mobile, Leboncoin, Marktplaats, Kleinanzeigen and Subito). We plan to continue evaluating different tools with pilots similar to the one presented in this article, and to evolve our approach accordingly.
Building the capability to independently and rapidly evaluate emerging AI tooling first hand, and to generate reliable, evidence-based results, is ultimately more valuable than identifying a short-term winner. As the AI landscape evolves at high speed, with a significant noise-to-signal ratio and often unreliable public benchmarks, developing this internal evaluation muscle is a key long-term advantage.
Increasing the speed of our software engineers, however, requires more than just deploying a tool; that is why we are working on several other fronts: automated metrics, training, integrating the tool across the whole engineering workflow… We hope to share results about this soon.
Kumar, S., Goel, D., Zimmermann, T., Houck, B., Ashok, B., & Bansal, C. (2025). Time Warp: The Gap Between Developers’ Ideal vs Actual Workweeks in an AI-Driven Era. arXiv preprint arXiv:2502.15287. Retrieved from https://arxiv.org/abs/2502.15287
Meyer, A. N., Barr, E. T., Bird, C., & Zimmermann, T. (2019). Today was a Good Day: The Daily Life of Software Developers. IEEE Transactions on Software Engineering. Microsoft Research preprint available at: https://www.microsoft.com/en-us/research/wp-content/uploads/2019/04/devtime-preprint-TSE19.pdf
IDC. (2024). How Do Software Developers Spend Their Time? IDC Report (as summarized in InfoWorld). https://my.idc.com/getdoc.jsp?containerId=US53204725
Dell’Acqua, F., McFowland III, E., Mollick, E., Lifshitz-Assaf, H., Kellogg, K. C., Rajendran, S., Krayer, L., Candelon, F., & Lakhani, K. R. (2023). Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality. Harvard Business School Working Paper, No. 24-013.
The following table represents the data that was consolidated in our task tracker. Some information was added before the task started and some after it finished.
| Field | Definition |
|---|---|
| Task ID | Unique identifier from our issue tracking system |
| Task Description | Brief, clear description of what the task involves |
| Engineer Name | Name of the person accountable for the task |
| Experience with the AI Coding Tool | Level of experience of the engineer accountable for the task, which will likely change after 1-2 weeks. First time means the user is understanding how it works, with very little experience. Intermediate means that the user understands and has used the main features, and is comfortable with them. Expert is someone that can provide tips and best practices to others. |
| AI Tool Used | AI tool that was used for the task: GitHub Copilot, Cursor, or Claude Code. Note that teams (and hence engineers) were assigned to one of the tools for the whole period. |
| Task Type | Type of task among the following categories: Bug Fixing, Refactoring, Feature Development, Code Migration, Documentation, Tooling or Other |
| Task difficulty | Indicate how challenging the task is: Small, Medium or Large |
| Area | Specify which functional area the task is mostly about among the following options: Backend, Frontend, Android, iOS, Infra, Fullstack or Other |
| Estimated Time Without AI (days) | Engineer estimation of the task cycle time without AI assistant, e.g. by comparing to similar tasks, using Three-Point-Estimation, or the experience of the engineer. |
| Estimated # PR revisions | How many reviews the engineer expects the PR would typically require, e.g., tasks with low complexity would only require 1 review, whereas high complexity tasks may require several reviews. |
| Confidence in estimate | How confident the engineer is in the time estimates. Values were Low, Medium or High. |
| Field | Definition |
|---|---|
| Actual Time with AI (days) | Actual time taken to complete the task using the AI assistant in days |
| Actual # PR revisions | How many revisions were actually needed before PR approval |
| Task Outcome | Indicates if task was (1) Completed, fully done by AI, (2) Completed, partially done with AI, (3) Completed manually, AI not able to help, (4) Not completed, due to external factors not related to AI |
| Rate the help of AI tool | Rating of the usefulness of AI tooling for this task, from 1 being the lowest to 5 |
| Link to the Pull Request | Link to the Pull Request for the task |
| Optional comments | Any notable observations about the AI’s performance or limitations for this task |
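The fields above could be modeled as a simple record. The following is a minimal sketch of such a schema; the class, field names and enum values are our own illustration derived from the table, not the pilot's actual data model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TaskOutcome(Enum):
    """Task outcomes as defined in the tracker."""
    COMPLETED_FULLY_BY_AI = 1
    COMPLETED_PARTIALLY_WITH_AI = 2
    COMPLETED_MANUALLY = 3
    NOT_COMPLETED_EXTERNAL = 4


@dataclass
class TrackedTask:
    # Filled in before the task starts
    task_id: str                      # unique ID from the issue tracker
    description: str
    engineer: str
    ai_tool: str                      # "GitHub Copilot", "Cursor" or "Claude Code"
    experience: str                   # "First time", "Intermediate" or "Expert"
    task_type: str                    # e.g. "Bug Fixing", "Feature Development"
    difficulty: str                   # "Small", "Medium" or "Large"
    area: str                         # e.g. "Backend", "Frontend", "iOS"
    estimated_days_without_ai: float
    estimated_pr_revisions: int
    estimate_confidence: str          # "Low", "Medium" or "High"
    # Filled in after completion
    actual_days_with_ai: Optional[float] = None
    actual_pr_revisions: Optional[int] = None
    outcome: Optional[TaskOutcome] = None
    ai_rating: Optional[int] = None   # 1 (lowest) to 5
    pr_link: Optional[str] = None
    comments: str = ""
```

Splitting the defaults this way mirrors the two-phase data entry: pre-task fields are mandatory, post-task fields start empty and are filled in on completion.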
Our method offers a pragmatic approach to measuring productivity gains from AI tooling. However, several sources of error could affect the validity of this metric. Below, we outline these potential biases and describe the steps taken to mitigate them.
People often underestimate task durations (the planning fallacy), either due to optimism bias or because estimates do not include unexpected eventualities that produce delays. This can inflate the apparent productivity gain when actual times are longer. Conversely, some estimates may be done conservatively, making the tool look less effective than it is. This effect can be mitigated by adding a control group.
In fact, as GitHub Copilot was already deployed in the company before the pilot and adopted by 63% of the population, using a control group without software engineering AI tooling would have meant withdrawing it from some users or teams. Instead, we decided to include GitHub Copilot as one of the tools to compare, and this group can be considered the control group.
The “actual time” logged may include overhead unrelated to code writing (e.g. waiting for code reviews, access permissions, builds, or clarifications). These delays distort the metric since they are not influenced by the AI assistant.
At the same time, these delays are also included in the time estimations and the real time, which should reduce the effect of this factor.
Different task types (e.g. bug-fixing vs. code refactoring) vary widely in predictability and in how much AI can help. Averaging them together risks masking the fact that tools may excel in one category but underperform in others.
We present the results stratified by task type and area in Appendix 4. We observe that, for all tools, the most common task types are Feature Development, Code Refactoring and Bug Fixing. Bug Fixing is actually underrepresented in Claude Code, with only 2 tasks, which could possibly affect the result. We see, however, that Bug Fixing had a positive impact for Cursor, increasing its metrics.
Senior vs. junior developers estimate differently and use tools differently. Some may benefit more from AI (e.g. juniors for boilerplate, seniors for exploring alternatives). To reduce the effect of this factor, we assigned the tools by teams, which by design have a mix of roles and seniority.
Weighting long tasks more heavily can skew the average. A single large task with an unusual length could dominate the results, even if smaller tasks consistently showed steady benefits.
In the following table we present the same data after removing the 5% longest tasks (by their predicted time). All of these tasks were from Cursor and had an average estimated duration of 7.6 days. We see that the results remain directionally consistent with the original results.
| AI Tool Used | % of productivity 95% confidence range | # of tasks | Average AI Completion Rate | Average Rating | Average estimated time without AI (days) |
|---|---|---|---|---|---|
| Claude Code | 42.5 - 44.5 | 44 | 74% | 4.2 | 1.3 |
| Cursor | 23.5 - 24.6 | 59 | 60% | 3.6 | 1.4 |
| GitHub Copilot | 0.5 - 2.7 | 54 | 44% | 2.8 | 1.1 |
| Grand Total | 21.2 - 21.5 | 157 | 58% | 3.5 | 1.3 |
We also include the average estimated time without AI in days for each tool after removing the 5% longest tasks, and show that they are quite similar, with a variation of 0.3 days between the shortest (GitHub Copilot) and the longest (Cursor).
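This robustness check can be sketched as follows, a minimal illustration with hypothetical task data: drop the tasks in the top 5% by estimated duration, then recompute the productivity proxy on what remains.

```python
def productivity(tasks):
    """P = 1 - (sum of actual days) / (sum of estimated days without AI)."""
    return 1.0 - sum(a for a, _ in tasks) / sum(e for _, e in tasks)


def trim_longest(tasks, fraction=0.05):
    """Drop the `fraction` of tasks with the longest *estimated* duration.

    `tasks` is a list of (actual_days, estimated_days) pairs. Sorting by the
    estimate (index 1) mirrors the pilot's choice of trimming by predicted time.
    """
    n_drop = max(1, round(len(tasks) * fraction))  # drop at least one task
    return sorted(tasks, key=lambda t: t[1])[:-n_drop]


# Hypothetical data: one long task dominates the sums
tasks = [(1.0, 1.5), (0.8, 1.2), (1.1, 1.4), (9.0, 8.0)]
print(f"all tasks: {productivity(tasks):.0%}")             # ≈ 2%
print(f"trimmed:   {productivity(trim_longest(tasks)):.0%}")  # ≈ 29%
```

Because the metric sums times before dividing, a single long task can dominate; comparing the trimmed and untrimmed values is a quick way to check whether the conclusion depends on such outliers.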