Testing in Fortran

I am looking for ways to test Fortran code. The Fortran Wiki lists some frameworks. Do you have experience with these?

The codes I have worked on so far did not have unit tests but rather used system tests (total calculation results were compared to reference results from before). What are your thoughts on that?


For frameworks, take a look at test-drive by @awvwgk and vegetables by @everythingfunctional.
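To give a feel for the first one, here is a rough sketch of a test-drive unit test, based on the patterns in the test-drive README; treat the details as illustrative and check the README for the exact, current API:

```fortran
! Hedged sketch of a test-drive style unit test module.
! The module/procedure names follow the test-drive README.
module test_mymath
  use testdrive, only : new_unittest, unittest_type, error_type, check
  implicit none
  private
  public :: collect_mymath
contains

  !> Register the individual tests with the test runner
  subroutine collect_mymath(testsuite)
    type(unittest_type), allocatable, intent(out) :: testsuite(:)
    testsuite = [new_unittest("addition", test_addition)]
  end subroutine collect_mymath

  !> A trivial check; `error` stays unallocated on success
  subroutine test_addition(error)
    type(error_type), allocatable, intent(out) :: error
    call check(error, 1 + 2, 3)
    if (allocated(error)) return
  end subroutine test_addition

end module test_mymath
```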

I went through phases of writing unit tests, and then not. I’ve had a difficult time deciding what makes a useful unit test. I usually end up doing end-to-end tests (what’s the output given the input) and working backward from there. Other people like writing unit tests before they write the main code, which is slower, but you end up more confident that the code is correct. There’s probably no right answer; there are as many answers as there are people. But don’t listen to me, listen to those guys I mentioned above.


Your “system tests” are often what I have to fall back on when starting to work on “legacy” systems, but they are prone to a logical fallacy: confirmation bias. You’re basically left saying “those results look like I would expect, so they must be right,” but what “looks” right could actually be wrong. I call that “regression” or “reference” testing, and it’s a good starting point, but it’s not where you want to leave it.

I’m going to be giving a talk at FortranCon titled "Your Requirements Specification as an Executable Test Suite" where I discuss my technique/style for writing tests, and provide some examples/demonstration using vegetables. I also offer live training/courses on the topic through Sourcery Institute.

P.S. thanks @milancurcic for mentioning me


I’m currently making a lot of use of unit testing. I found it quite liberating to use unit tests in the development process to stage and debug new features. Part of my unit tests end up being a reference implementation to compare against my actual implementation, like a numerical differentiation to check the gradient of an energy expression. But maybe those do not strictly qualify as actual unit tests.
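A reference-implementation test like that might look roughly like the sketch below, which checks an analytic gradient against a central finite difference. The toy energy function and the tolerances are my own assumptions, purely for illustration:

```fortran
! Sketch: verify an analytic gradient against a central
! finite difference. energy()/denergy() are toy stand-ins.
program check_gradient
  implicit none
  integer, parameter :: dp = selected_real_kind(15)
  real(dp), parameter :: h = 1.0e-6_dp, tol = 1.0e-8_dp
  real(dp) :: x, g_analytic, g_numeric

  x = 0.7_dp
  g_analytic = denergy(x)
  g_numeric  = (energy(x + h) - energy(x - h)) / (2.0_dp * h)

  if (abs(g_analytic - g_numeric) > tol) then
    print *, "FAIL: gradient mismatch", g_analytic, g_numeric
    stop 1
  end if
  print *, "PASS"

contains

  pure function energy(x) result(e)
    real(dp), intent(in) :: x
    real(dp) :: e
    e = x**4 - 2.0_dp*x**2      ! toy double-well energy
  end function energy

  pure function denergy(x) result(g)
    real(dp), intent(in) :: x
    real(dp) :: g
    g = 4.0_dp*x**3 - 4.0_dp*x  ! its analytic derivative
  end function denergy

end program check_gradient
```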

Unit testing certainly imposes certain constraints on the implementation, because everything you want to test must be pretty much self-contained and shouldn’t depend (too much) on global state. I think this is a good thing to have, but it can be difficult to achieve in an existing, historically grown code base.

As for regression tests, I have mixed feelings about those, especially if they are the only way to test a project. An unfortunate thing about them is that they tend to grow over time and, depending on the problem, can become expensive to run. This makes it cumbersome to run them regularly without a dedicated, powerful development machine.

I think regression and end-to-end testing are certainly an important part of a test suite for a project, but that doesn’t mean a project can’t also have a unit testing / smoke testing suite, which can be run more often in the development process to catch errors early.


I generally find different codes require different techniques, but if you have real, confirmed measured data or known confirmed values, a “regression” test comparing results to those values is invaluable. It may, however, only tell you your values are wrong (or, at a minimum, changed), so incorporating other tests like unit tests can be much more valuable in detecting where the cause is, especially if you don’t like debuggers. For regression tests comparing values in a file, it is nice to compare floating-point values using some kind of tolerance measure. The simple numeric difference program numdiff is an example of a tolerant numeric difference tool; there are several listed on the Fortran Wiki.
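In code, such a tolerant comparison is typically a mixed relative/absolute check; here is a minimal sketch (the function name and tolerance scheme are my own, not from any particular package):

```fortran
! Sketch of a tolerant comparison for regression tests:
! two values "match" if they agree within a relative
! tolerance, with an absolute floor for values near zero.
pure logical function values_match(a, b, rtol, atol)
  integer, parameter :: dp = selected_real_kind(15)
  real(dp), intent(in) :: a, b      ! values to compare
  real(dp), intent(in) :: rtol      ! relative tolerance
  real(dp), intent(in) :: atol      ! absolute floor near zero
  values_match = abs(a - b) <= max(rtol * max(abs(a), abs(b)), atol)
end function values_match
```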

Other testing-related packages are becoming available as fpm packages that you can easily use as dependencies. I have found lately that if I set up my build as an fpm package, I can pull in routines for tolerant floating-point comparisons, unit testing, logging and report mechanisms, and other current and emerging tools very easily.

If you have fpm set up, that same repository (the GPF one) runs about 1,000 unit tests when you do an “fpm test”, as it includes a unit test framework and some modules that do statistics, floating-point comparisons, and assertion tests that might be useful examples.

I have been meaning to put my unit testing package out there, but there are already several others available, so it has not been a big push. One of the reasons I have stuck with mine is that it allows an external process to be called: in my environment that builds an sqlite3 file which is used to create automated reports, but if I put the same tests out on github the same calls just write a simple ASCII text report, as running “fpm test” on many of the GPF-related packages such as “M_strings” demonstrates.

You can see other tests such as the reference BLAS/LAPACK packages that do a very nice job. If anyone knows of some existing publicly-available packages with good tests I think it would be useful to list them here. Some testing schemes as set up with Jenkins or github are interesting, as are some language and compiler test suites.

Confidence testing is invaluable in allowing you to make quick changes to your codes. Timing tests and/or profiling tools (GNU users can see gprof(1), for example) are invaluable for identifying bottlenecks, so if performance is an issue, think about including performance tests as well. Even some (conditionally compiled) CPU usage and wall-clock measurements can be useful, and they perform a valuable service when combined with unit tests: in many cases they let you catch changes that impact performance as soon as you introduce them.
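The intrinsics cover the basics here; a timing wrapper along these lines (the routine under test is a placeholder) can live next to the unit tests:

```fortran
! Sketch: wrap a hot routine with wall-clock and CPU timing so
! a performance regression shows up alongside the unit tests.
subroutine timed_section()
  integer, parameter :: dp = selected_real_kind(15)
  integer  :: t0, t1, rate
  real(dp) :: cpu0, cpu1
  call system_clock(t0, rate)
  call cpu_time(cpu0)
  ! call do_work()   ! placeholder for the routine under test
  call cpu_time(cpu1)
  call system_clock(t1)
  print '(a,f8.3,a,f8.3,a)', 'wall: ', real(t1 - t0, dp)/rate, &
        ' s   cpu: ', cpu1 - cpu0, ' s'
end subroutine timed_section
```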

So whether you grow your own or use a package, a combination of unit/regression/timing tests can have big payoffs; the biggest comes when you are working with code that you want to develop rapidly.

The easiest targets can be numeric libraries that you can regression-test against known properties, like mathematical functions or steam table properties. I was involved in several libraries generating material properties, and in one case they were using an eight-inch-thick printed reference manual and spot-checking a few thousand values by EYE (which is what led to the numdiff(1) program I mentioned earlier, the first time they asked me to be the one to do the checks!).

Statistics and graphics are some of the more overlooked tools in unit testing, in my experience, especially when random numbers and field measurements are involved. (Some people have bit-repeatable pseudo-random number generators in their codes just so they can do solid regression testing, which can be a very good idea.) But even without other more accurate but sometimes costly methods, the human mind is amazingly good at picking up anomalies from a good old plot.

I mention the GPF resources as actual examples you can pull using fpm in a few minutes, but make sure to look at the Fortran Wiki for a list of tools and ideas available. Maybe some of the upcoming talks will make it onto fortran-lang or the Wiki, but I guess we will both be tracking the FortranCon presentation.

@Beliavsky has some nice lists in the Wiki and his github repository that are related that you do not want to overlook.

I forgot to mention one of my favorites: prepare an input file for programs that read them and use a little program to randomize some of the input. Making sure your code responds well to that, producing good diagnostics for bad or questionable input, can be nearly as important as making sure it produces the right answers when given “correct” input.
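Such a randomizer can be very small; here is a sketch, where the file names and the 20% mangling rate are arbitrary assumptions to adapt to the program under test:

```fortran
! Sketch: perturb numeric lines of an input file at random and
! write a fuzzed copy to feed to the program under test, to
! check that its diagnostics for bad input are sensible.
program fuzz_input
  implicit none
  character(len=256) :: line
  real    :: r, value
  integer :: ios
  ! Assumed file names; adapt to the program under test.
  open(10, file='input.txt',        status='old',     action='read')
  open(11, file='input_fuzzed.txt', status='replace', action='write')
  do
    read(10, '(a)', iostat=ios) line
    if (ios /= 0) exit
    read(line, *, iostat=ios) value   ! does the line start numeric?
    call random_number(r)
    if (ios == 0 .and. r < 0.2) then  ! mangle ~20% of numeric lines
      write(11, *) value * (10.0*r - 1.0)  ! scale by a random factor
    else
      write(11, '(a)') trim(line)     ! pass other lines through
    end if
  end do
  close(10); close(11)
end program fuzz_input
```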


That sounds awful. :face_vomiting: Although I’ve used numdiff on several occasions (it’s what I’m currently using for automated regression testing on a project), so I’m glad at least something useful came of it.

Like most people who have already replied, I use both unit tests and system tests. I am also the author of one of the frameworks (ftnunit). Unit tests are quite useful if you can set them up from the start or for a library. The main thing about them is that you should be able to implement them in a small amount of code. Otherwise the possibility of errors lurking in the “scaffolding” code becomes a problem, both for verifying that the code is correct and for maintenance. And implementing unit tests for an existing, large program is very tedious.
System tests (mostly regression tests) are easier to establish, as in general you can use the existing program and merely check that the output does not change, or does not change too much, from one version to the next.
Another advantage of system tests is that you do not have to prepare the input for complicated pieces of code and examine the output for correctness. For instance, it is not uncommon to have a bunch of routines working together to produce a sensible result that you can check, but it would be at the very least very tedious to examine the results of the intermediate routines for correctness. Just think: your program parses an input file, stores everything in a complicated but adequate data structure, does its calculations, and produces the result in the form of a readable report. Formulating unit tests to check that the data structure is filled correctly might involve a lot of extra code that needs to be changed whenever the data structure changes, whereas the program as a whole will give you the answers in a succinct way without that extra effort.
But unit tests are quite good for locating mistakes because you exercise small pieces of the code only.


Thanks a lot for all your answers!

I have had a bit of experience with (unit) testing in Django/Python but not so much with Fortran. And it seemed to me that writing tests in Fortran is a bit more work, with all the strict typing and other constraints. I guess that’s why a lot of the frameworks listed on the Wiki use another language and generate the test code dynamically.

I see at least two dimensions in test classification. One is the scope and ranges from unit test over integration test to system test (maybe even more steps). Whether or not something is a regression test is in another dimension. Phrased differently, the first one answers what is tested, the second why, and the end-to-end test would be how, which would be a third. Probably overthinking. :thinking:

@everythingfunctional Thanks for the link. I have now registered for the conference and am looking through the talks from last year.

@urbanjost At least they had tests, only the automation was missing. :sweat_smile:

Subtopic: Do you have additional advice for testing multiprocess code, e.g. with MPI and ScaLAPACK?

For any parallel program that’s meant to be deterministic (i.e. not stochastic), test for bit-for-bit reproducibility. Specifically:

  • The program should produce the same output when run in serial vs. parallel.
  • The program should produce the same output when run consecutively with the same inputs. Different output between consecutive runs of the same executable program given the same inputs may be due to race conditions, for example if there’s a parallel barrier (synchronization) missing or in the wrong place. The order of parallel reduction of floating point numbers can also cause this, as the order of operations may be different run-to-run.
  • The program should produce exactly the same output with different numbers of parallel processes.
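One practical way to make all three checks easy is to dump results in a fixed, full-precision format, so runs can be compared bit-for-bit with a plain diff. A sketch (routine name and format are my own choices):

```fortran
! Sketch: write a result array in a fixed, full-precision
! format so serial and parallel runs can be compared
! bit-for-bit with a plain text diff.
subroutine dump_result(a, filename)
  integer, parameter :: dp = selected_real_kind(15)
  real(dp), intent(in)     :: a(:)
  character(*), intent(in) :: filename
  integer :: u, i
  open(newunit=u, file=filename, status='replace', action='write')
  do i = 1, size(a)
    write(u, '(es24.16e3)') a(i)   ! full double precision, fixed width
  end do
  close(u)
end subroutine dump_result
```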

I think stochastic programs can be tested in a similar way by choosing a set random seed value. I have less experience with those though.
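With the standard intrinsic PRNG, fixing the seed might look like the following sketch (the fill-with-one-value scheme is a simplification; a real code might want a better-spread seed array):

```fortran
! Sketch: seed the intrinsic PRNG deterministically so that
! stochastic runs are repeatable for regression testing.
subroutine seed_rng(seed)
  integer, intent(in)  :: seed
  integer              :: n
  integer, allocatable :: s(:)
  call random_seed(size=n)   ! how many integers the seed needs
  allocate(s(n))
  s = seed                   ! crude: fill every element the same
  call random_seed(put=s)
end subroutine seed_rng
```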

The above tests are for the correctness of a parallel program. Also helpful are regression tests for parallel scalability, i.e. measuring the run-time of the program for different numbers of parallel processes, and paying attention that the scalability (how does the runtime decrease with the increase of the number of parallel processes?) doesn’t drop unexpectedly.


Yes. The “why” of parallel development is reduced wall-clock time, so performance and scalability tests are a much bigger requirement. Running such tests can consume a lot of resources, though, so it is fine to include small performance tests with each change, but it is best to run a larger set of tests on a regular schedule so you are not using your system entirely for testing, assuming you use the same facilities for production work.

The other change, as mentioned, is to take into account the types of errors that can occur only because your code is parallel, which vary with the methods used. The list above is dead-on about the kinds of permutations you need in a test suite for a parallel code that do not really enter into scalar codes (albeit a lack of reproducibility in a scalar code is an indicator of other errors that occur even there).


I’d recommend as much as possible keeping the serial and parallel logic separate. That way you can independently test whether the calculations are correct (the serial parts) vs the coordination/communication parts. That’s more of a design challenge than a testing challenge, but often the two aren’t mutually exclusive.

In general, @milancurcic 's advice holds even after taking the above into account.

We’ve been using FRUIT on a small scale and have been quite happy with it.
It can generate JUnit-format XML reports, which are easy to load into your favourite build system. Useful for making sure you haven’t broken anyone else’s code.