Since the beginning of my working life, I have been torn between my researcher self and my software developer self. As a software development enthusiast, I suffered a lot during my time as a Ph.D. student looking at software implemented by researchers (many times my own code is in that set too). Working with research code is usually a horrible experience. Researchers make so many trivial software development mistakes that I want to cry. The result: a lot of duplicated work reimplementing modules that already exist, and a lot of time spent integrating, debugging and understanding each other’s code.
On the other hand, it is almost impossible for a researcher to learn the basics of software development from a book because 1) nobody cares (and they really should!) and 2) books on this topic mostly focus on commercial software development. This is a problem because, even if about 80% of the best practices overlap, research code and commercial code are driven by completely different priorities.
So, because I am just a lonely man in the middle of this valley, neither a good research code writer nor a good commercial code writer, I can share my neutral opinion. Maybe I will convince you, fellow researcher, that it may be worth spending some time improving your coding practices.
The Priority Rift
The first stop of our journey is the Priority Rift. This is where 80% of the differences between commercial and research code live. You will never be effective in explaining software development techniques to researchers if you don’t look at the world through their priority scale.
Commercial software development has three big priorities:
- Usability.
- Features.
- Stability.
In (very) short, commercial software must be usable in order to get a lot of customers. It must have sexy and appealing features in order to get customers. It must be stable, because you want customers to stay.
On the other side, the only user of research software is the developer himself (and sometimes other researchers). You don’t care about “a lot of customers”. So the priorities are:
- **Flexibility.**
- Stability/Robustness (but in a controlled environment).
- Features (but with a different meaning).
I put flexibility in bold because it is the main difference. In research code, the software specifications change after every run. You run an experiment, the experiment produces some kind of result (that you don’t know a priori) and then your supervisor says “Mmm. Let’s try this.” And this often requires doing something that you didn’t imagine when you started writing the code. You have to minimize the impact of these changes on the software. That’s where you gain the most.
Every software development technique that does not take into account the rapid mutability of research software is doomed to fail hard. If you want to sell a software development workflow to a research lab, you have to push on how easy it will be to modify the software in the future, how you can modify it without breaking anything, and how this will help the lab add features to its research software.
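To make the idea a bit more concrete, here is a minimal Python sketch of one possible way to keep an experiment easy to modify. Every name in it (`ExperimentConfig`, `SCORERS`, `run_experiment`) is hypothetical and made up for illustration; it is not a standard recipe, just one way to turn a “Let’s try this” into a small addition instead of a rewrite.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List

# Hypothetical registry: each experiment variant is a named function,
# so trying a new idea means adding an entry, not editing the runner.
SCORERS: Dict[str, Callable[[List[float]], float]] = {
    "mean": lambda xs: sum(xs) / len(xs),
    "max": max,
}


@dataclass(frozen=True)
class ExperimentConfig:
    """All the knobs of a run live in one place (illustrative fields)."""
    scorer: str = "mean"
    scale: float = 1.0


def run_experiment(data: List[float], config: ExperimentConfig) -> float:
    """Scale the data, then apply the scorer selected in the config."""
    scaled = [x * config.scale for x in data]
    return SCORERS[config.scorer](scaled)


data = [1.0, 2.0, 3.0]
baseline = ExperimentConfig()
print(run_experiment(data, baseline))  # 2.0

# "Mmm. Let's try this": register a new scorer and derive a new config.
SCORERS["range"] = lambda xs: max(xs) - min(xs)
variant = replace(baseline, scorer="range", scale=2.0)
print(run_experiment(data, variant))  # 4.0
```

The baseline run is untouched by the new variant, which is the property you want when the specifications change after every run.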
Now that we know the top priorities for researchers when they approach software development, we have to stop in the Motivation Valley. Even if you can match the researcher’s priority scale, many times researchers don’t really care about good software development anyway.
The first objection is: “software is not the important part, the results are”. This is true, but we have to keep in mind that software is what produces the results. You want a lot of results, you want good results, and you want flexibility in software design in order to accommodate flexibility in results production. Moreover, good software is a stimulus for other researchers to replicate your experiment and to use your software to explore other directions. Because you want citations, you want this to happen!
And this brings us to the second objection. “Research software doesn’t have users.” That’s not correct. There are two main users for every research project:
- You. You are the user. Good software is the only way to avoid going crazy when, a year from now, you come back to your code to extend it. Your future self is a completely different person who doesn’t know anything about your software. Documentation and unit testing should be mandatory.
- Colleagues and other researchers. Your code will be used by other researchers too. At the very least, they are the colleagues in your lab. You want to work with them as smoothly as possible. A precise code versioning workflow, unit testing and continuous integration systems are the Holy Grail of collaboration. Unfortunately, they are used very badly in many laboratories.
I’ve seen the same algorithm, the same software, the same thing implemented over and over and over too many times because other people’s software is impossible to understand. I’ve measured weeks and months of lost work because an undocumented, breaking change to a library was introduced directly into the codebase (without review, without a pull request). When this happens and there is no unit test discipline and no continuous integration system, finding where the error is hiding is almost impossible.
Many laboratories work with terrible efficiency because of bad management of the software development side. I hope that, if you are a researcher who writes code every day, I’ve convinced you that learning and applying some “best practices” is a good investment for your research.
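As a small illustration of what “unit test discipline” can look like, here is a sketch in Python. The `normalize` function is a made-up example of a helper shared across a lab, not code from any real project. The tests are plain functions with assertions, in the style that a runner such as pytest collects; if someone silently changes the behavior of `normalize`, they fail immediately instead of weeks later.

```python
def normalize(values):
    """Scale values linearly so that they span the range [0, 1].

    A constant input has no spread, so it maps to all zeros.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def test_normalize_spans_unit_interval():
    # Pins the documented behavior: min maps to 0, max maps to 1.
    assert normalize([2, 4, 6]) == [0.0, 0.5, 1.0]


def test_normalize_constant_input():
    # Pins the edge case that an undocumented change would likely break.
    assert normalize([3, 3]) == [0.0, 0.0]
```

Running tests like these automatically on every proposed change is, roughly, what a continuous integration system does for you.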
But where can you start? I will probably write something in the future but, for now, some videos or books on test-driven development and refactoring are more than enough. I drop here some recommended books on Test Driven Development and Refactoring that, without too much philosophy, can teach you a tiny bit of software architecture.