The Problem With Measuring AI Productivity

Every time I read about some study measuring the productivity gains of AI-assisted development, I raise an eyebrow. I find them uninteresting and useful only for the online fights between “there is no AI productivity for software developers” and “Claude Code increased my productivity by a billion percent.” The problem is that the landscape is so varied that, honestly, you can torture the data to prove either assertion.

My personal opinion is that many papers on the subject are just messy, and often they do not test at all how people who use AI efficiently for work actually use it. There are many issues I have identified: they give too-easy problems, they impose arbitrary time constraints, they give problems to people unfamiliar with the framework they have to use (a big no-no), and, surprisingly, I still see a lot of studies where the test subjects copy-paste code back and forth between a web chat interface and their editor. That is a way of working with AI that I think every developer stopped using in 2024.

But setting this aside, my main problem is that they never measure a specific class of task: the tasks I would not have even started without AI assistants. We can argue about whether “this AI agent increases productivity by 2%” or “5%” or “20%” for as long as we want, but the reality is that much of the software and many of the features I built in the last year would not exist at all without AI agents. I have folders with a sea of tools, automations, and single-use scripts that I would never have started.

So how could I measure that? For me, that is an increase of infinite percent, because they didn’t help me complete a project faster; they are the reason some projects exist. In some sense, it is a kind of Pascal’s wager. How am I supposed to take the claim “AI agents improve developer output by only 2%” when I have concrete evidence on my hard drive of things that would have remained annotated in my notebook for all eternity?

Maybe it is my ADHD talking. Probably other people function differently and can do everything just by deciding to. But for me, lowering the activation energy that blocks me from starting to work on something is 100% worth it, even if I end up doing 80% of the work by hand.

It is a personal thing. I know that. I also understand if you work differently. And that’s the point. I have seen developers using these tools in wildly different ways, for very different purposes, and with very different outcomes, and I don’t think it’s possible or fair to reduce that to a universal percentage in some random study.