Research software (i.e., software written to support research in engineering or the sciences) is usually a tangled mess of spaghetti code that only the author knows how to use. Very occasionally I encounter well organized research software that can be used without having an email conversation with the author (who has invariably spent years iterating through many versions).
Spaghetti code is not unique to academia; there is plenty to be found in industry.
Structural differences between academia and industry make it likely that research software will always be a tangled mess, only usable by the person who wrote it.
Using MODULEs, IMPLICIT NONE, picky compiler options, and multiple compilers can help. A general problem is that graduate students, post-docs, and professors are usually rewarded for publications, not their software.
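To make the first part of that advice concrete, here is a minimal sketch (the module, the procedure, and the data are invented for the example; the exact picky options differ between compilers):

```fortran
! Hypothetical example: IMPLICIT NONE forces every variable to be declared,
! and the explicit interface provided by the module lets the compiler check
! argument types, ranks, and intents at every call site.
module running_mean_mod
   implicit none
   private
   public :: running_mean
contains
   pure function running_mean(x) result(m)
      real, intent(in) :: x(:)
      real :: m
      m = sum(x) / max(1, size(x))
   end function running_mean
end module running_mean_mod
```

Typical "picky" invocations would then be along the lines of `gfortran -Wall -Wextra -std=f2018 -fcheck=all` or `ifort -warn all -check all`, repeated with at least one other compiler to catch extensions you did not mean to rely on.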
And perhaps things are evolving. For example, in the "Ten Simple Rules" articles, we can read:
Prlić, Andreas, and James B. Procter. "Ten Simple Rules for the Open Development of Scientific Software". PLoS Computational Biology 8, no. 12 (6 December 2012): e1002802. https://doi.org/10.1371/journal.pcbi.1002802.
Sandve, Geir Kjetil, Anton Nekrutenko, James Taylor, and Eivind Hovig. "Ten Simple Rules for Reproducible Computational Research". Edited by Philip E. Bourne. PLoS Computational Biology 9, no. 10 (24 October 2013): e1003285. https://doi.org/10.1371/journal.pcbi.1003285.
Taschuk, Morgan, and Greg Wilson. "Ten Simple Rules for Making Research Software More Robust". PLOS Computational Biology 13, no. 4 (13 April 2017): e1005412. https://doi.org/10.1371/journal.pcbi.1005412.
Lee, Benjamin D. "Ten Simple Rules for Documenting Scientific Software". Edited by Scott Markel. PLOS Computational Biology 14, no. 12 (20 December 2018): e1006561. https://doi.org/10.1371/journal.pcbi.1006561.
One of the comments below the article says "people do like to reinvent the wheel". Well… it's not that people like to reinvent the wheel. A lot of scientists and engineers don't even know those "wheels" exist, and some of the wheels are wrapped in APIs sophisticated yet complicated enough that learning to use them is more time-consuming than just inventing a get-the-job-done wheel.
As a graduate student, I always have to balance "spending more time on the readability and optimization of my code so that I, my advisor, and the graduate students who follow can benefit from it" against "getting the job done so I can get the paper published and thereby get my degree sooner". It's a never-ending battle…
It looks to me (an outsider) like there may be a parallel between how people use APIs and how they use mathematics: there is a wealth of sophisticated results available that may very well apply to the particular case at hand, but sometimes it takes more time to find out whether and how they apply than to build an ad-hoc solution.
Incidentally, I am happy to see the recent activity around modern Fortran here and on related sites. At the moment my own needs are mostly in symbolic computation and not exactly high performance, but I follow it all with interest.
I agree with the title "Research software code is likely to remain a tangled mess", but probably not for all research codes. I think it is more or less normal in research, and it is the case for my own codes, for several reasons:
A researcher (or student/postdoc/colleague) adds new functionality to a code that was not planned from the beginning. Therefore, the code starts to get messy. Of course, you could write a new code more properly, but it takes time (for nothing?). Moreover, the data from the previous code will probably not be compatible with the new one. Then, with your new code, you'll have the same problem whenever you want to add new functionality. It's never-ending!
Nevertheless, sometimes the code becomes so messy that you spend more time patching it than adding new functionality. Then it is better to write a new version.
As a chemist, I did learn some languages (mainly Fortran 77), but I never learned how to program properly (code structure, tests, manuals, …). It is the same for most of my colleagues around the world. Furthermore, most of them don't want to learn how to program, because it takes too much time (language, code structure, tests, make or CMake, git or others, manuals, parallelization, …). fpm could help a lot with some aspects, if people are willing to learn how to use it.
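For what it is worth, getting started with fpm does not require much. A manifest like the following (the project name and metadata are made up for the example) is enough for `fpm build`, `fpm run`, and `fpm test` to work on the standard src/, app/, and test/ directory layout:

```toml
# fpm.toml -- hypothetical minimal manifest for a small research code
name = "diffusion-solver"
version = "0.1.0"
license = "MIT"
```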
Of course, you can hire a software engineer (or try to: it is very difficult to justify that in a mainly experimental chemistry lab!). Then the engineer rewrites the code the right way. So far so good! But…
Sometimes the researcher does not understand the new version (or doesn't make the effort to understand it, finding it too complicated) and therefore keeps using the old messy code.
Other times, it works well and the researcher and the software engineer work as a team.
Either way, the difficulty of adding new functionality is still there…
About other comments:
"Reinventing the wheel", as han190 and others say, is not that simple. I did it several times! Some reasons:
I didn't know the wheel existed.
I knew of its existence, but it was simpler to rewrite the wheel (taking less time, less complex in terms of dependencies, …).
I knew of its existence, but by writing the code myself I understood better how the wheel works. That is particularly valuable for students.
The existing wheel did not fit properly into the code structure, or its functionality was not exactly what was needed.
"Very occasionally I encounter well organized research software that can be used without having an email conversation with the author." The reason for that has at least two sides: (i) the code can be messy; (ii) the science (physics, math, …) behind the algorithms can be hard to understand, and may not be well understood by a new user in the field.
I have tried Test Driven Development on one of my research codes, and it can be very useful as the code grows. Having automated tests makes it easier to refactor code, because you are far less afraid of breaking something. So, regularly, when the code becomes messy (or you need to optimize some parts), you refactor it and run the tests with confidence. And so on.
Yes, you are right. I also use some tests to check new functionality or a new code version, although not as automatically as I should. I'm moving to something more automatic.
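For anyone who wants to start small, even a plain Fortran program that exits with a non-zero status on failure can serve as an automated test, runnable by hand, by `fpm test`, or by a CI script. A sketch (the tested function is a hypothetical stand-in for a routine of the real code):

```fortran
! test_mean.f90 -- hypothetical stand-alone test program.
! A non-zero exit status (via error stop) is all a test runner needs
! to flag a broken refactoring.
program test_mean
   implicit none
   real, parameter :: tol = 1.0e-6

   if (abs(mean([1.0, 2.0, 3.0]) - 2.0) > tol) error stop "mean([1,2,3]) /= 2"
   if (abs(mean([4.0]) - 4.0) > tol) error stop "mean([4]) /= 4"
   print *, "all tests passed"
contains
   ! stand-in for whatever routine of the real code is under test
   pure real function mean(x)
      real, intent(in) :: x(:)
      mean = sum(x) / size(x)
   end function mean
end program test_mean
```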
I also find the Software Engineering guidelines from the DLR (German Aerospace Center) on this topic quite interesting: DLR Software Engineering Guidelines
Essentially, they divide code into four application classes:
Application Class 0: For software in this class, the focus is on personal use in conjunction with a small scope. The distribution of the software within and outside DLR is not planned.
Software corresponding to this application class frequently arises in connection with detailed research problems.
Application Class 1: For software of this class, it should be possible, for those not involved in the development, to use it to the extent specified and to continue its development. This is the basic level to be strived for if the software is to be further developed and used beyond personal purposes.
Application Class 2: For software in this class, it is intended to ensure long-term development and maintainability. It is the basis for a transition to product status.
Application Class 3: For software in this class, it is essential to avoid errors and to reduce risks. This applies in particular to critical software and that with product characteristics.
Thanks @everythingfunctional for that blog article. I knew JOSS (Journal of Open Source Software), where I published about gtk-fortran, but not JOSE (Journal of Open Source Education), which could be very interesting for me. I never tried to publish about teaching, but this journal could be an opportunity.
And I agree that researchers are learners, eternal learners. I don't know if Learn Fortran - Fortran Programming Language is the place, but it could be interesting to have somewhere a page with articles about research software development, like the Ten Simple Rules papers cited above. I have a collection of such articles, and I have learned a lot reading them over the last ten years.
Learning Fortran is a good thing, but learning good development practices and methods is also important. Whatever the language, if you have bad programming practices the output will not be optimal! (And soon you will be too scared to modify anything in your dear messy code…)
This is a very interesting discussion. For collaborative codes where many people come and go over the years, I believe it's very important that the code is organized so that clashes between developers are minimized. In my experience, the best strategy is to have well-written, documented, and organized main routines (main, I/O, globals, parallelization), even with certain aspects fixed in a sort of protocol, while the developers of particular modules should have the freedom to organize their own work as they "like", as long as it fits the global picture/plan.
It is a bit like building a telescope. Once the building is constructed, the size of the dome is fixed, the control room is set in a certain place, and the main mirrors and supporting structure are there, one can let various groups build their own instruments. They should have the freedom to optimize them as they want, but they still have to respect the overall blueprint and avoid clashes with other groups doing the same. In my experience, researchers joining an already developed code often tend to reinvent the blueprint, or to see only their particular module without considering other developers. This leads to a complete mess. It is also my impression, from a small sample, that an even worse mess may be created by IT people assigned to research teams to optimize the code without fully understanding its purpose. As Knuth wisely said, "premature optimization is the root of all evil".
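One way to encode such a blueprint in the code itself is to fix only the interface that every instrument module must satisfy and leave the internals to each group. A rough Fortran sketch, with names invented for the illustration:

```fortran
! Hypothetical "blueprint" module: the core code only ever talks to
! instruments through this abstract type, so each group can organize its
! own module freely as long as it implements these two procedures.
module instrument_api
   implicit none
   private
   public :: instrument_t

   type, abstract :: instrument_t
   contains
      procedure(setup_iface), deferred :: setup
      procedure(observe_iface), deferred :: observe
   end type instrument_t

   abstract interface
      subroutine setup_iface(self, config_file)
         import :: instrument_t
         class(instrument_t), intent(inout) :: self
         character(*), intent(in) :: config_file
      end subroutine setup_iface

      subroutine observe_iface(self, field, signal)
         import :: instrument_t
         class(instrument_t), intent(inout) :: self
         real, intent(in)  :: field(:,:)
         real, intent(out) :: signal(:)
      end subroutine observe_iface
   end interface
end module instrument_api
```

Concrete instrument types then extend `instrument_t` and implement the deferred procedures, while the main routines, I/O, and parallelization never need to know their internals.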
And one more remark. There is a lot of risk in adopting various existing subroutines, and in many cases what seems to be a shortcut turns, in the end, into a major restriction. For example, one could get a nice subroutine for various finite difference formulae. However, in a real code, the real trouble often starts with boundary conditions, which come with a lot of ambiguity; if the developer does not have full control of the FD implementation, there is a big chance that sooner or later he or she will have to rewrite that FD module from scratch.
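To illustrate the point about keeping control, one option is to keep the boundary treatment as an explicit argument of a small FD routine of your own instead of burying it in a third-party module. A rough sketch (the scheme and the option names are invented for the example):

```fortran
! Hypothetical FD module: a second-order centred first derivative where the
! boundary treatment is chosen by the caller, not hidden in a library.
module fd_1d
   implicit none
   private
   public :: ddx
contains
   subroutine ddx(u, dx, bc, dudx)
      real, intent(in)         :: u(:)     ! field values on a uniform grid
      real, intent(in)         :: dx       ! grid spacing
      character(*), intent(in) :: bc       ! "periodic" or "one_sided"
      real, intent(out)        :: dudx(:)  ! derivative, same size as u
      integer :: i, n
      n = size(u)
      do i = 2, n - 1
         dudx(i) = (u(i+1) - u(i-1)) / (2.0*dx)
      end do
      select case (bc)
      case ("periodic")
         dudx(1) = (u(2) - u(n)) / (2.0*dx)
         dudx(n) = (u(1) - u(n-1)) / (2.0*dx)
      case ("one_sided")                   ! first-order one-sided stencils
         dudx(1) = (u(2) - u(1)) / dx
         dudx(n) = (u(n) - u(n-1)) / dx
      case default
         error stop "ddx: unknown boundary condition"
      end select
   end subroutine ddx
end module fd_1d
```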
I notice that on a lot of research projects, those blueprints are missing along with guidelines for contributing and requirements for merging. There is momentum in the right direction, but until project leads value drafting and maintaining those guidelines as much as they value their code, the spaghetti code won't change. The hope, at the very least, is that spaghetti projects will stand out as expensive relative to others and this will be the incentive for change.
To those of you addressing this problem in your own projects, keep showing the community the way!
I agree with you, @vmagnin. In my experience, writing programs in Fortran is very easy. I feel this is because Fortran has a small set of rules that can be mastered easily. Consequently, programmers feel confident and in control while working with Fortran.
However, many books on Fortran do not talk about packaging, distribution, and CI/CD. I feel such project/code-management techniques should be covered (with reference to modern Fortran) on our fortran-lang website.
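As a starting point, a continuous-integration job does not have to be elaborate; a sketch like the following (the file layout, compiler, and commands are just one possible choice) already compiles the code and runs the tests on every push:

```yaml
# .github/workflows/ci.yml -- hypothetical minimal CI job for a Fortran project
name: CI
on: [push, pull_request]

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install gfortran
        run: sudo apt-get update && sudo apt-get install -y gfortran
      - name: Build and run tests
        run: |
          gfortran -Wall -Wextra -fcheck=all src/*.f90 test/*.f90 -o run_tests
          ./run_tests
```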
Yes, @Niko, I have experienced and done such things. However, it is not very common, because often the desired code is part of a big package, and extracting it from the big library is very difficult for many reasons. This forces me to develop my own code (i.e., to reinvent the wheel). To overcome this issue, we should break the big packages up into collections of useful objects, modules, and data types.
Very true: there is a lot of powerful tooling around for Fortran, but documentation is rare, or the workflows are difficult to grasp without deep prior knowledge of the toolchains involved. CMake is one of the prime examples of a much-relied-upon yet not well documented tool. Also, conda, usually wrongly perceived as a Python-only package manager, can be a powerful packaging tool for Fortran projects.
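For reference, a minimal CMake setup for a Fortran library plus one test can be fairly short. The following is only a sketch with invented file names, not a recommended template:

```cmake
# CMakeLists.txt -- hypothetical minimal build for a Fortran library and test
cmake_minimum_required(VERSION 3.20)
project(diffusion_solver LANGUAGES Fortran)

# Collect compiled .mod files in one place and let every target find them.
set(CMAKE_Fortran_MODULE_DIRECTORY ${CMAKE_BINARY_DIR}/mod)
include_directories(${CMAKE_Fortran_MODULE_DIRECTORY})

add_library(diffusion src/solver_mod.f90)

enable_testing()
add_executable(test_solver test/test_solver.f90)
target_link_libraries(test_solver PRIVATE diffusion)
add_test(NAME solver COMMAND test_solver)
```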
Fortran-lang is a good place to collect those resources or write new ones. I can only encourage everyone here again to keep an eye out for interesting Fortran projects and learning material and to submit them to the fortran-lang webpage as a pull request (GitHub - fortran-lang/fortran-lang.org: (deprecated) Fortran website).
I'm happy to start a joint effort on writing introductions and providing examples/templates for build and packaging infrastructure for Fortran. We already have a bit of material at fortran-lang:
In summary, most scientific modelling codes are expected to be used by user-developers with extensive internal knowledge of the code, the model, and the assumptions behind it, who routinely perform a wide variety of correctness checks before doing anything with the results. In the right hands, you can have a lot of confidence that sensible, rigorous results are being obtained; however, these codes are not for non-expert users.