The Problem With Measuring AI Productivity
Every time I read about some study measuring the productivity gain of AI-assisted development, I raise an eyebrow. I find them uninteresting and useful only for the online fights between “there is no AI productivity for software developers” and “Claude Code increased my productivity by a billion percent.” The problem is that the landscape is so varied that, honestly, you can stretch the data to prove either assertion.
My personal opinion is that many papers on the subject are just messy, and often they do not test at all how people who use AI efficiently for work actually use it. There are many issues I identified: they give too-easy problems, they impose arbitrary time constraints, they give problems to people not familiar with the framework they have to use (a big no-no), and, surprisingly, I still see a lot of studies where the test subjects work by copy-pasting code back and forth from a web chat interface. That is a way of working with AI that I think every developer stopped using in 2024.
But setting this aside, my main problem is that they never measure a specific class of task: the tasks I would not have even started without AI assistants. We can discuss whether “this AI agent increases productivity by 2%” or “5%” or “20%” for as long as we want, but the reality is that much of the software and many of the features I built in the last year would not exist at all without AI agents. I have folders with a sea of tools, automations, and single-use scripts that I would never have started.
So how could I measure that? For me, that is an increase of infinite percent, because they didn’t help me complete a project faster; they are the reason some projects exist. In some sense, it is a kind of Pascal’s wager. How am I supposed to take the claim that “AI agents improve developer output by only 2%” when I have concrete evidence on my hard drive of things that would otherwise have remained notes in my notebook for all eternity?
Maybe it is my ADHD talking. Probably other people function differently and can do everything simply by deciding to. But for me, lowering the activation energy that blocks me from starting to work on something is 100% worth it, even if I end up doing 80% of the work by hand.
It is a personal thing. I know that. I also understand if you work differently. And that’s the point: I have seen developers use these tools in wildly different ways, for very different purposes, and with very different outcomes, and I don’t think it’s possible or fair to reduce all of that to a universal percentage number in some random study.
Research Code vs. Commercial Code

Since the beginning of my working life, I have been torn between my researcher self and my software developer self. As a software development enthusiast, during my years as a Ph.D. student I suffered a lot looking at software written by researchers (my own code is often in that set too). Working with research code is usually a horrible experience. Researchers make so many trivial software development mistakes that it makes me want to cry. The result is a lot of duplicated work reimplementing modules that already exist, and a lot of time spent on integration, debugging, and understanding each other’s code.
On the other hand, it is very unlikely that a researcher will learn the basics of software development from a book, because 1) nobody cares (and they really should!) and 2) books on this topic are mostly focused on commercial software development. This is a problem because, even if best practices overlap for roughly 80% of cases, research code and commercial code are driven by completely different priorities.
So, because I am just a lonely man in the middle of this valley, neither a good research code writer nor a good commercial code writer, I can share my neutral opinion. Maybe I will convince you, fellow researcher, that it may be worth spending some time improving your coding practices.