What kind of tests are sufficient: Some personal thoughts

Code must be battle-tested before it becomes software. How should we test the libraries that we develop? What kind of tests are sufficient?

Suppose that we are developing a package of numerical solvers (for linear algebra, optimization, PDE, etc). I suppose this is the case for many programmers here. I keep PRIMA (Reference Implementation for Powell’s methods with Modernization and Amelioration) in mind at the time of writing. This is a package for solving general nonlinear optimization problems without using derivatives.

One may believe it suffices to test a few problems and observe whether the results are expected. If yes, then “the implementation is correct”. It seems that many people are happy with such a test. For me, this is a joke (sorry to say so, but please read on).

I would like to elaborate a bit more about the importance of testing and verification, motivated by a conversation with a friend who uses PRIMA in his projects.

This friend refers to his projects as “critical projects”, which are directly related to the life and death of humans. Imagine, for example, the design of a new medicine (although this is not what my friend works on). The importance of the reliability of the solver and the reproducibility of the solution cannot be overstated in such projects. This is quite different from machine learning problems whose objective is only to decide how to post advertisements, where nobody will die if the solver does something wrong. In critical projects, however, people may die.

So, what kind of tests are sufficient for PRIMA, which is designed for (and is being used in) critical projects?

No test will be sufficient. I can only tell that the following tests are necessary.

  1. A large number (e.g., hundreds) of test problems, representing as many as possible of the different challenges that may occur in applications.

  2. TOUGH tests. In applications, function evaluations may fail, may return NaN or infinity from time to time, and are very likely contaminated by noise. Any test is insufficient without trying such problems.

  3. Randomized tests. It is impossible to cover all the possible difficulties with a fixed set of problems. Some bugs can only be triggered under very particular conditions that are “difficult” to encounter without randomization (a bug that is rarely triggered is still a bug!!!). Therefore, tests must be randomized, and the random seed must be changed periodically (daily or weekly).

  4. Stress tests. If a solver is designed to solve 100-dimensional problems, then we must test it on (randomized) 1000-dimensional problems and make sure that it does not crash.

  5. Automated tests. It is not enough to randomize the tests. Randomized tests must also be executed automatically every day and night, for example, using GitHub Actions.

  6. Tests on various platforms under different systems using all compilers/interpreters available. Our software should not crash on any platform. Without thorough tests, the only thing I know is that we do not know what will happen.

  7. A sufficiently long time of testing. In general, I do not feel confident about a solver if the accumulated testing time is below 10 years.
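
To make item 2 concrete, here is a minimal sketch of what a TOUGH-style wrapper could look like in Python. It is not how PRIMA implements its tests; `make_tough`, `EvaluationFailure`, the failure probabilities, and the noise level are all hypothetical choices made for illustration.

```python
import math
import random


class EvaluationFailure(Exception):
    """Raised to mimic an objective function that fails to evaluate."""


def make_tough(fun, rng, p_fail=0.05, p_nan=0.05, p_inf=0.05, noise=1e-3):
    """Wrap `fun` so that it sometimes fails, returns NaN/inf, or is noisy."""
    def tough(x):
        r = rng.random()
        if r < p_fail:
            raise EvaluationFailure("simulated evaluation failure")
        if r < p_fail + p_nan:
            return math.nan
        if r < p_fail + p_nan + p_inf:
            return math.inf
        # Relative noise, so that the contamination scales with the value.
        return fun(x) * (1.0 + noise * rng.gauss(0.0, 1.0))
    return tough


def sphere(x):
    """A trivial test objective: the squared Euclidean norm."""
    return sum(xi * xi for xi in x)


rng = random.Random(2024)
tough_sphere = make_tough(sphere, rng)

# Count what a solver would see over many evaluations.
outcomes = {"ok": 0, "nan": 0, "inf": 0, "failed": 0}
for _ in range(2000):
    x = [rng.uniform(-1.0, 1.0) for _ in range(5)]
    try:
        f = tough_sphere(x)
    except EvaluationFailure:
        outcomes["failed"] += 1
        continue
    if math.isnan(f):
        outcomes["nan"] += 1
    elif math.isinf(f):
        outcomes["inf"] += 1
    else:
        outcomes["ok"] += 1
```

A solver under test would be handed `tough_sphere` instead of `sphere`, and the test would check that it neither crashes nor hangs and still returns something sensible.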

Comparing this with “testing a few problems and observing whether the result is expected”, I hope it is clear what I meant by “it is a joke” (sorry again for saying so). Recalling that the solvers may serve projects that decide human life, I guess it is clear why such a joke is not enough.

My experiences in past years have taught me three things.

  1. I do not know what will happen in a particular case until I have made sufficiently many tests about it.

  2. When I believe a test is stupid and unnecessary, the test will show me later that I am the stupid one.

  3. When I believe that I know numerical computation and my code well enough, some tests will show me that I don’t.

PRIMA has been tested in this way for more than 20 years, summing up the testing time of all the parallel tests. I insist that any porting/translation of PRIMA should go through the same level of tests. Otherwise, we cannot be sure whether it is proper.

I put testing and verification at the very center of development. (Indeed, I feel that many, if not most, libraries have not been sufficiently tested.) Today (20240108), I received the following comment:

thank you for modernizing Powell’s solvers and taking verification serious. This is such important work!

I am delighted that my efforts in testing are appreciated. (You should check the cartoon).

How do you test the libraries that you develop?


[This is a copy of What kind of tests are sufficient for the porting or translation of PRIMA? · libprima · Discussion #39 · GitHub with slight adaptations.]

Perhaps external to testing, but I have started using valgrind to check programs using libraries for memory leaks. I have also started using block constructs for test modules to ensure local pointers are nullified, local allocatables are deallocated, and local finalizable objects are finalized.

Here are a few thoughts on this topic:

  • Users should be encouraged to contribute test problems from their respective domains; a good example of this in practice is the SuiteSparse matrix collection where users can submit their sparse matrices, so that future versions of the library will work well.
  • Since most problems won’t have known solutions, one methodology potentially worth looking at is approval testing. Here is a video about it. The way I imagine this is, the users provide their minimization problem and an initial value of the objective function. When the tests are rerun, the value should be the same or lower. Large changes (for better or worse) should be carefully analyzed to pinpoint the origin of the change. If deemed reasonable, the new objective value can be taken as the new threshold for the approval test.
  • Study how other programming language communities do testing. For example, Go comes with a testing package built into the standard library. The command go test will run the tests.
  • Don’t misuse GitHub/GitLab actions (CI) for documenting/specifying tests. Prefer to put your testing commands in scripts (Bash/Python/CMake…) which can be easily launched also locally during development. Limit the role of the CI service to automating the platform setup and launching of the test scripts.
  • I’m not sure if anyone has done something like this in Fortran, but I kind of like the way the LLVM Testing Infrastructure works using the lit command. The test commands are embedded in comments:
    ! RUN: %python %S/../test_errors.py %s %flang_fc1 -fopenmp
    ! Check OpenMP compatibility with the DEC STRUCTURE extension
    structure /s/
    end structure
    end
    
    The downside is that there is a lot of work setting up the infrastructure to parse the test commands from the comments. But I think it would be worth experimenting with this in the future.
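
For the approval-test idea in the second bullet, a minimal sketch in Python might look as follows. The JSON baseline file, the function names `approval_check` and `approve`, the problem name, and the tolerance are assumptions made for illustration, not an existing tool.

```python
import json
import os
import tempfile


def load_baselines(baseline_path):
    """Read the approved objective values, or return {} if none exist yet."""
    if not os.path.exists(baseline_path):
        return {}
    with open(baseline_path) as f:
        return json.load(f)


def approval_check(name, new_value, baseline_path, rel_tol=1e-8):
    """Compare `new_value` with the approved objective value stored for `name`.

    Returns "new" if there is no baseline yet, "approved" if the value is the
    same or lower, and "rejected" if it is worse than the approved value.
    """
    baselines = load_baselines(baseline_path)
    if name not in baselines:
        return "new"
    approved = baselines[name]
    slack = rel_tol * max(1.0, abs(approved))
    return "approved" if new_value <= approved + slack else "rejected"


def approve(name, value, baseline_path):
    """Record `value` as the new threshold after a human has reviewed the change."""
    baselines = load_baselines(baseline_path)
    baselines[name] = value
    with open(baseline_path, "w") as f:
        json.dump(baselines, f, indent=2)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "approved.json")
    first = approval_check("rosenbrock_2d", 3.2e-7, path)   # no baseline yet
    approve("rosenbrock_2d", 3.2e-7, path)                  # reviewer accepts it
    better = approval_check("rosenbrock_2d", 1.0e-9, path)  # improvement
    worse = approval_check("rosenbrock_2d", 5.0e-3, path)   # regression
```

The important design choice is that `approve` is a separate, deliberate step: a changed value never overwrites the baseline on its own.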

I haven’t seen a framework like this for Fortran, but I use this a lot in Python using doctest. I have found this extremely useful in the frame of unit testing as the documentation becomes “alive” for the developer. Would love to see something like this for Fortran.
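
For readers who have not used it, here is a small sketch of the doctest style. The function `clamp` is a made-up example; the `doctest` calls are from the Python standard library.

```python
import doctest


def clamp(x, lower, upper):
    """Project the scalar `x` onto the interval [lower, upper].

    >>> clamp(5.0, 0.0, 1.0)
    1.0
    >>> clamp(-2.0, 0.0, 1.0)
    0.0
    >>> clamp(0.25, 0.0, 1.0)
    0.25
    """
    return max(lower, min(x, upper))


# Parse the examples embedded in the docstring and run them.
parser = doctest.DocTestParser()
test = parser.get_doctest(clamp.__doc__, {"clamp": clamp}, "clamp", None, 0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
```

In everyday use one would simply run `python -m doctest` on the file or call `doctest.testmod()`; the explicit parser and runner are used here only so that the result can be inspected.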

Regarding the philosophy of testing, I find it useful to conceptually split between unit tests, functional tests and regression tests. Each one covers a different level of complexity and integration of the code base. Having them in place also accelerates subsequent developments, especially unit tests, as verifying that one has not broken important parts of the rest of the code becomes very quick. They also force one to revisit the “aesthetics” of the user interface, which can lead to a virtuous cycle in which “simplification” actually leads to more robust and performant code.

This is paramount for serious development!! No matter how hard one tries to foresee all possible scenarios, users will always find a way to hit the bugs, even those where one left a comment like “!TODO: fix this weird behavior, but that should be fine for the moment” … so their reports should be embraced gracefully XD

Totally agree.

PRIMA does the following.

  • Functional tests are called verifications. The code is verified against a reference version (basically the last release, with appropriate modifications) over a large set (thousands) of randomized test problems to make sure that it produces exactly (bit-to-bit) the same results as the reference. Each new commit of code is verified in this way, so that it does not introduce unintended functional changes. In addition, TOUGH tests and stress tests are done periodically to make sure that the code works properly even if the inputs are strange, the objective function encounters failures, or the problem is much larger than the dimension that the solvers are designed to handle.

  • Regression tests are done by profiling. Using Performance Profiles (a standard metric for benchmarking optimization solvers), each commit is compared with three reference versions, namely the original Fortran 77 implementation, the last release, and the last commit, in order to make sure that the new commit does not introduce performance regression.

  • Instead of unit tests, PRIMA adopts the methodology of programming by contract. Each subroutine checks a set of preconditions and postconditions to make sure that the inputs and outputs are “correct”. The preconditions and postconditions are checked only in the debug mode. In the code that users receive, they are disabled by default. In the debug mode, if some subroutine receives strange inputs or produces strange outputs, the program will raise an error so that the developer can check the issue and fix it.
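
To illustrate the verification idea from the first bullet, here is a toy sketch in Python. The three `*_norm2` functions are hypothetical stand-ins for a reference version and two candidate commits; the real verification of PRIMA runs full solvers, which is not reproduced here.

```python
import random
import struct


def bits(x):
    """Return the IEEE 754 bit pattern of a float, so that comparison is exact."""
    return struct.pack("<d", x)


def reference_norm2(x):
    """Stand-in for the reference version (e.g., the last release)."""
    s = 0.0
    for xi in x:
        s += xi * xi
    return s


def candidate_norm2(x):
    """Stand-in for a new commit: same operations in the same order."""
    s = 0.0
    i = 0
    while i < len(x):
        s = s + x[i] * x[i]
        i += 1
    return s


def reordered_norm2(x):
    """A seemingly harmless refactoring that sums in reverse order."""
    s = 0.0
    for xi in reversed(x):
        s += xi * xi
    return s


def verify(candidate, reference, seed, n_problems=1000):
    """Return the indices of the random problems on which the results differ."""
    rng = random.Random(seed)
    mismatches = []
    for k in range(n_problems):
        n = rng.randint(1, 50)
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if bits(candidate(x)) != bits(reference(x)):
            mismatches.append(k)
    return mismatches


clean = verify(candidate_norm2, reference_norm2, seed=1)
changed = verify(reordered_norm2, reference_norm2, seed=1)
```

The reordered version is mathematically equivalent, yet its rounding differs on some inputs. A bit-to-bit comparison flags it, which is exactly the kind of unintended functional change the verification is meant to catch.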

Surely, conducting unit tests and checking pre/postconditions are different, and they cannot replace each other. The disadvantage of pre/postconditions is that they are not separated from the code. The advantage is that they are checked during each and every execution of the subroutine, provided that the code is running in the debug mode. For example, pre/postconditions are checked during the functional tests (verifications) mentioned above. In this way, the longer time passes, the more confidence we will have in our code.
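
As a rough illustration of the pre/postcondition approach (written in Python, not taken from the PRIMA code base), one could imagine something like the following. The `DEBUG` switch, the `contract` decorator, and `project_onto_box` are all hypothetical names.

```python
import math

DEBUG = True  # hypothetical switch; a release build would set this to False


class ContractViolation(Exception):
    """Raised in debug mode when a precondition or postcondition fails."""


def contract(pre=None, post=None):
    """Check `pre(*args)` and `post(result, *args)`, but only when DEBUG is on."""
    def decorate(func):
        def wrapper(*args):
            if DEBUG and pre is not None and not pre(*args):
                raise ContractViolation(f"precondition of {func.__name__} violated")
            result = func(*args)
            if DEBUG and post is not None and not post(result, *args):
                raise ContractViolation(f"postcondition of {func.__name__} violated")
            return result
        return wrapper
    return decorate


def valid_inputs(x, lo, hi):
    """Sizes agree, the box is nonempty, and x contains no NaN."""
    return (len(x) == len(lo) == len(hi)
            and all(l <= h for l, h in zip(lo, hi))
            and not any(math.isnan(v) for v in x))


def inside_box(y, x, lo, hi):
    """The output respects the bounds."""
    return all(l <= v <= h for v, l, h in zip(y, lo, hi))


@contract(pre=valid_inputs, post=inside_box)
def project_onto_box(x, lo, hi):
    """Project the point x onto the box lo <= y <= hi."""
    return [min(max(v, l), h) for v, l, h in zip(x, lo, hi)]


y = project_onto_box([2.0, -3.0, 0.5], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])

try:
    project_onto_box([0.0], [1.0], [0.0])  # lower bound above upper bound
    violation_caught = False
except ContractViolation:
    violation_caught = True
```

Because the checks run on every call in debug mode, every randomized verification run also exercises them, without any separate unit test having to be written for `project_onto_box`.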

Maybe my understanding of unit tests is wrong, but unit tests feel quite insufficient to me if the datasets or problems involved in the tests are limited and deterministic (of course, they are better than doing nothing). This is partially what I meant when saying the following.

This is a very good point and I totally agree. Ideally, each test should take only one line in the GitHub yml file (although this is not always possible, as is the case for many other ideals), and the implementation of the test should be coded in other scripts, so that it can be conducted easily both on local machines and in the cloud.

I watched a nice presentation by Chris Rackauckas, where he separates tests into:

  • Unit test
  • Integration test
  • Interface test
  • Regression test
  • Downstream test

You can find the details at Maintaining Large Scale Julia Ecosystems - JuliaHEP Workshop, Chris Rackauckas.

This talk is a Gem!! thanks for sharing it!!

Thanks for sharing your thoughts about testing. At the companies where I have worked, I write lots of little programs with turnaround times of around one week. I do some simple testing – if a subroutine is supposed to fit a certain statistical model, I generate data from that model and check that the estimated parameters are close to the true ones. And I check that a financial trading strategy does not spuriously produce profits on random walk prices :). It is difficult to find time for testing as extensive as you have outlined.
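
The "generate data from the model and recover the parameters" check can be very short. Here is a sketch for a straight-line fit; the model, the sample size, and the tolerances are illustrative assumptions.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


# Simulate data from a known model, then check that the fit recovers it.
rng = random.Random(12345)
true_a, true_b, sigma = 1.5, -0.7, 0.2
xs = [rng.uniform(0.0, 10.0) for _ in range(5000)]
ys = [true_a + true_b * x + rng.gauss(0.0, sigma) for x in xs]
a_hat, b_hat = fit_line(xs, ys)

# Tolerances are several standard errors wide, so a failure signals a real bug.
intercept_ok = abs(a_hat - true_a) < 0.05
slope_ok = abs(b_hat - true_b) < 0.01
```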

A new form of testing that should be considered is to give ChatGPT code with a docstring (comments) explaining what a procedure does and to ask if it sees logic errors. I just found a bug in my Python code and wondered if ChatGPT would have found it. It did. For the code below I asked

Do you see any logic errors in this Python code? Are there any specific changes I should make?

def seasonal_ma_lagged(x:npt.NDArray, ma_length:int, period:int = 1,
    partial=True) -> npt.NDArray:
    """ Compute the seasonal rolling moving average, excluding the current data point.
    For monthly data, period=12 would give the moving average for the same month over
    the last ma_length years """
    n = len(x)
    y = np.zeros(n)
    if partial: # compute average if at least one term is available
        for i in range(n):
            i1 = max(0, i-ma_length*period)
            i2 = max(0, i-period+1)
            xslice = x[i1:i2:period]
            y[i] = np.mean(xslice) if len(xslice) == ma_length else np.nan
    else: # require that ma_length terms be available
        for i in range(n):
            i1 = i-ma_length*period
            i2 = i-period+1
            y[i] = np.nan
            if i1 >= 0:
                xslice = x[i1:i2:period]
                len_slice = len(xslice)
                if len_slice == ma_length:
                    y[i] = np.mean(xslice)
    return y

Amid some verbiage, ChatGPT said

The condition if len(xslice) == ma_length in the partial average calculation seems incorrect. Since this is a partial average, you probably want to calculate the mean as long as there is at least one element (len(xslice) > 0 ), rather than requiring exactly ma_length elements.

Maybe some big companies already have LLMs that scan code and ask developers questions before it is committed to a repository.

Instead of keeping comments in the code as mentioned above, a method we used for many years was keeping everything in HTML format. Fortran and C code, test data, and test scripts were kept in XMP sections (XMP is considered deprecated in HTML5). This allowed the code to be easily placed on internal servers and browsed. Most users restricted themselves to HTML2, except for the use of CSS style sheets, making HTML a simple format to construct by hand, much like Markdown is used today. It also allowed for links to non-ASCII documentation and images. Simple wrappers around compiler commands were used in make(1) files to automate the extraction and compilation. The makefile generation was automated as well for building libraries, with include files available for dozens of platforms (SunOS, Unicos, HP-UX, AIX, Linux, …), so most users only maintained a file listing the files to build, which was used as input for the system. The make(1) files used platform- and compiler-specific subdirectories for scratch and output space, so multiple builds could occur simultaneously on different platforms. So I have seen and used a system where the source (i.e., the HTML document, Fortran, C, or ksh/sh scripts) along with a custom preprocessor (which was incidentally written in Fortran) allowed for automated builds, testing, and documentation generation.

The system still exists, except that the HTML (still supported) has been supplanted by Markdown format. It is rather trivial to make scripts that compile .md files as well as .f90 files, and just extract the code between ~~~fortran and ~~~ lines. Many md-to-html filters exist, so the documentation/source can easily be converted to HTML or used with systems that allow browsing Markdown files (this Discourse site being an example). The preprocessor used now is the prep(1) Fortran preprocessor, as it already allows input to be Markdown files.

The concepts worked so well for so long that the features provided by perl and python and other languages always seemed very awkward to use, as they were code-oriented with special processing of the code files, instead of based on common ASCII file formats with various file types allowed as part of the documentation instead of the other way around.

I have been told a lot of the code is also now just kept as .f90 files with the browsable versions generated by ford(1).

Allowing the source/document/testing files to be Markdown also works well with github sites.

But lately for testing I have been trying to build as many libraries as I can make public with fpm/ford/github/github CD/CI/git using the fpm package M_framework for unit testing.

I have found one of my favorite test methods is “test to failure” where you test with problems and problem sizes that ultimately (try to) break the libraries. The resulting tests often help produce much more robust versions of the code that detect inappropriate use and/or document the range of usage supported and tested. For command-driven programs (a lot of programs use a simple Unix-like shell language for input) we have found taking existing test command files and randomly selecting lines from the files dramatically improves the code being able to identify common input errors and giving the user useful information on what course of action is required.

We make heavy use of regression testing, comparing new values to previous values, for numeric libraries, as that is usually easy and helps prevent inadvertent result changes across platforms; but we find it of more limited value with more complex codes. For something like a steam table library or for testing many basic mathematical procedures, it is still a useful test (partly because it is usually simple enough to generate that people actually do it :>).

I started making a sample github site that uses fpm for building and the “fpm test” command to run standard CD/CI tests from github, as an example for others, which maybe I will finish one of these days. It highlights the newer approach in

that uses more readily available tools than the proprietary approach described above. I think it is useful for new users (including the very common Fortran programmer who has extensive Fortran experience but has only recently been exposed to git(1), github/gitlab/…, ford(1), and fpm(1)), which integrate very well with library-based testing frameworks such as those described on the Fortran Wiki, if anyone wants to build on that.

If one wants to go the extra mile in testing, one approach that I rarely see applied in science and engineering is mutation testing.

Mutation testing doesn’t directly test the code. It basically tests the tests and motivates the developer to improve the tests. Random bugs are added to the code, and the tests are run. The modified code is usually referred to as a mutant. If the mutant passes the tests, that indicates the tests are inadequate. Usually the specific bug the mutant introduced gives a good idea of what sort of test is needed to make the tests fail (kill the mutant in mutation testing terminology). The process is automated. For a code with high code coverage, it may take thousands of mutants to identify one that passes tests.

Last year, I wrote a crude mutation tester for Fortran in Python. Rather than properly parsing the code, I simply used regex to identify parts of the code to mutate. This was a proof-of-concept. I plan to eventually release a version of this software, though I think it would be better to use something with a proper parser.

Mutation operators add bugs to the code. Different classes of bugs are introduced with different mutation operators. It would be useful to know the most common types of bugs in your problem as that would inform which mutation operators to add.

The simplest mutation operator is commenting out random (non-empty) lines. I found this to be far more useful than a code coverage report. High line or even branch coverage doesn’t mean that the code is adequately tested. Commenting out random lines found quite a bit of untested code. These lines had no impact on the tests, even though these lines were run. Adding tests for many of them was easy.

Another set of mutation operators I have mutates the code to add off-by-one errors. While I don’t think I found any bugs through this, it did increase my confidence a lot as off-by-one errors are common and not easy to find. Mutation testing is the best approach I’m aware of for finding off-by-one errors.

Lastly, partly due to the crudeness of my mutation tester and partly due to the nature of mutation testing in general, I had to add annotations in comments to indicate that certain mutation operators should not be applied to particular lines. I found quite a few “equivalent mutants” that didn’t change the result of the code. In other words, they weren’t bugs. And I wasn’t necessarily interested in mutations that changed unimportant outputs. (Though I did switch to structured logging so that I could test the outputs more easily.)

Overall, I think mutation testing is worth trying for those who want to test their codes thoroughly. Given the lack of a proper Fortran mutation tester at the moment, I’d recommend writing your own mutation tester adapted to your specific use case. My initial version was developed during a weekend as I recall.
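
To give an idea of how little is needed to get started, here is a minimal sketch of the simplest mutation operator (commenting out a random line), written from scratch for this post and not taken from my tester. The Fortran snippet and all function names are made up.

```python
import random

FORTRAN_SOURCE = """\
function clip(x, lo, hi) result(y)
    real, intent(in) :: x, lo, hi
    real :: y
    y = x
    ! Enforce the bounds.
    if (y < lo) y = lo
    if (y > hi) y = hi
end function clip
"""


def candidate_lines(lines):
    """Indices of lines worth mutating: non-empty and not already comments."""
    return [i for i, line in enumerate(lines)
            if line.strip() and not line.lstrip().startswith("!")]


def comment_out(lines, index):
    """Return a mutant in which line `index` is turned into a Fortran comment."""
    mutant = list(lines)
    mutant[index] = "! MUTANT: " + mutant[index]
    return mutant


def generate_mutants(source, n_mutants, seed):
    """Yield (line_number, mutated_source) pairs, one commented-out line each."""
    rng = random.Random(seed)
    lines = source.splitlines()
    candidates = candidate_lines(lines)
    for index in rng.sample(candidates, min(n_mutants, len(candidates))):
        yield index + 1, "\n".join(comment_out(lines, index)) + "\n"


mutants = list(generate_mutants(FORTRAN_SOURCE, n_mutants=3, seed=0))
```

In a real run, each mutant would be written over the source file, the project's test command would be executed, and any mutant for which the tests still pass would be reported together with its line number.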

Thanks for your mutation testing idea. A Python script that could pick out lines of a Fortran code that could be deleted without causing a syntax error would be useful, although one could delete lines truly at random and discard non-compilable versions of the code. Besides deleting individual lines one could try adding RETURN before the executable sections of individual procedures and checking that the tests catch this. Setting the RHS of assignments to 0 or some other constant could be tried.

My mutation tester takes the random approach. It makes no attempt to produce valid syntax. In each iteration, it modifies a random line, runs make test, and checks the exit code. Consequently, it can’t tell the difference between a compiler error and tests failing. This is sufficient for a proof-of-concept, though not entirely satisfactory. For time efficiency, producing valid code would be best.

You have good ideas for more mutation operators. One mutation operator that I’d like to try that’s similar to commenting/deleting lines is deleting terms in an assignment. Say x = y + z is mutated to x = y. I think this is likely to be quite useful, though it’s more complex to implement. In the verification of computational PDE codes, there are some publications about how some exact/manufactured solutions have zero cross-derivative terms, so they aren’t testing every term. This would help there.

To avoid creating syntax errors when removing certain lines, for

if (condition) then
or
else if (condition) then

you could create versions of the code where condition is successively replaced by .true. and .false..

do i=i1,i2
could be replaced by
do while (.false.)
to effectively comment out the body of the loop.
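
These two operators are easy to prototype with regular expressions. The sketch below is a rough illustration only: the patterns are my assumptions, they handle single-line statements and nothing else (no continuation lines, labelled loops, or named constructs).

```python
import re

# `if (cond) then` and `else if (cond) then`; the condition must not span lines.
IF_THEN = re.compile(r"^(\s*(?:else\s*)?if\s*)\((.*)\)(\s*then\s*)$", re.IGNORECASE)
# `do i = i1, i2` (optionally with a stride); `do while` and `do concurrent` are skipped.
DO_LOOP = re.compile(r"^(\s*)do\s+\w+\s*=.*$", re.IGNORECASE)


def mutate_line(line):
    """Return the syntactically valid mutants of one Fortran line (possibly none)."""
    m = IF_THEN.match(line)
    if m:
        head, _cond, tail = m.groups()
        return [f"{head}(.true.){tail}", f"{head}(.false.){tail}"]
    m = DO_LOOP.match(line)
    if m:
        return [f"{m.group(1)}do while (.false.)"]
    return []


if_mutants = mutate_line("    if (n > 0 .and. (x < 1.0)) then")
elseif_mutants = mutate_line("else if (a == b) then")
do_mutants = mutate_line("do i=i1,i2")
untouched = mutate_line("if (y < lo) y = lo")  # one-line if: no `then`, left alone
```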

I’ve seen the related term of “invariants” be used in this context.

Looking for relevant invariants is another strategy of testing. For example a 2-D stencil code may look something like this:

integer :: i, j
alpha = dt/h**2
do concurrent(i=1:nx,j=1:ny)
 unew(i,j) = u(i,j) + alpha*((u(i+1,j) - 2*u(i,j) + u(i-1,j)) + &
    (u(i,j+1) - 2*u(i,i) + u(i,j-1)))
end do

Due to the symmetry of the stencil, the result of this loop should be invariant with respect to a swap of variables i and j (notice the error above).
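
Such an invariant can be turned into an automatic test. Here is a small Python sketch of the idea (not the MOM6 approach): apply the update to the transposed grid and compare with the transpose of the update. The `buggy` flag reproduces the `u(i,i)` typo from the Fortran snippet; the function names and the square 6×6 grid are illustrative assumptions.

```python
import math
import random


def step(u, alpha, buggy=False):
    """One explicit 5-point stencil update on the interior of the square grid `u`."""
    n = len(u)
    unew = [row[:] for row in u]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            centre = u[i][i] if buggy else u[i][j]  # the typo from the post
            unew[i][j] = u[i][j] + alpha * (
                (u[i + 1][j] - 2 * u[i][j] + u[i - 1][j])
                + (u[i][j + 1] - 2 * centre + u[i][j - 1])
            )
    return unew


def transpose(u):
    """Swap the roles of the two indices."""
    return [list(col) for col in zip(*u)]


def is_swap_invariant(update, u, alpha):
    """Check that updating the transposed grid equals transposing the update."""
    a = update(transpose(u), alpha)
    b = transpose(update(u, alpha))
    return all(math.isclose(x, y, rel_tol=1e-12, abs_tol=1e-12)
               for ra, rb in zip(a, b) for x, y in zip(ra, rb))


rng = random.Random(7)
u = [[rng.random() for _ in range(6)] for _ in range(6)]
correct_ok = is_swap_invariant(step, u, 0.1)
buggy_ok = is_swap_invariant(lambda v, a: step(v, a, buggy=True), u, 0.1)
```

The correct stencil passes the check and the version with the typo fails it, without needing any known exact solution.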

I think it was in some climate codes (MOM6 perhaps?) that they used scripts or some other method to replace the indexes and help them verify that these types of symmetries hold. This was the presentation: Dimensional and Rotational Testing of MOM6

In addition to testing whether the modified code compiles and produces correct results, the programmer must also ensure that the code is actually executed. Programs typically have sections of code that are executed only with particular inputs, so if the input case does not exercise that line of code, getting the correct results back after modification didn’t really test anything. Once this situation is recognized, the programmer then modifies the input, creates a new reference case that does execute that code section, and then proceeds. It seems like this last step would be difficult to automate; some kind of programmer+user intervention is indicated.

When I do this manually, I sometimes first modify the code in such a way that I know it will produce incorrect results, and then run the test cases to ensure that it waves the expected warning flags. Then I go back and make the intended modifications, such as a bug fix, or an algorithm change, or an optimization change.

I don’t think they’ve been mentioned yet, but ABI compliance tests also exist:

Perhaps these are not relevant for PRIMA, since releases occur in the form of source code.

I suppose such tools are one way to get a “semantic versioning diff”, or at least the confirmation at the object level whether compatibility has been broken. Unfortunately, the tools don’t support Fortran modules, so that is uncharted territory. The authors of the Spack package manager have also been working in this area: https://youtu.be/gWe2K_oCp6A?si=xjrFFNyq-sqjrA_2.

But I guess that ABI compatibility at the Fortran level is a utopia. In the System V ABI forum I’ve read that:

For what it’s worth, in my experience working with several FORTRAN compiler vendors to capture a minimal x86-64 ABI before, they actually did not want anything more than what is already in it. They wanted the freedom to do all kinds of tricks in order to achieve performance, some which were proprietary, like how arrays were laid out in memory as well as references, when an ABI could be stifling to them. In the end, since FORTRAN is a much simpler language, source-code compatibility is pretty much a given and therefore binary compatibility is not so important as it is for C and C++.

But if some users on Linux are interested in stability at the shared object level, a library of static Fortran77-like external routines may be the right choice, as it is part of the System V ABI (PDF, 483 KB) (see chapter 9.2, pg. 107 in version 1.0).

There are also frameworks for formal verification of software, e.g. search for “TLA+”. There are also formally verified compilers for C (CompCert, Arm) and I suppose other languages like Ada, which guarantee the correctness of compilers (but not necessarily the software!?). Probably you can find out more on pages from NASA or ESA. ESA has published an Independent Software Verification and Validation Handbook.

This is a very good point. I (partially) overcome this difficulty by randomized tests that run automatically, with a random seed that changes weekly (thanks to GitHub Actions!). In this way, we can hope that most hidden bugs will be triggered after a sufficiently long time of testing. Therefore, the more time passes, the more confidence I will have in my code.
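
One simple way to obtain a seed that changes weekly yet stays reproducible is to derive it from the calendar. The sketch below is a generic illustration in Python, not the formula used in the PRIMA tests; `weekly_seed` is a hypothetical name.

```python
import datetime
import random


def weekly_seed(today=None):
    """Seed that stays fixed within an ISO week and changes every Monday."""
    if today is None:
        today = datetime.date.today()
    year, week, _weekday = today.isocalendar()
    return 100 * year + week


seed = weekly_seed()
print(f"random seed for this run: {seed}")  # log it so failures can be reproduced
rng = random.Random(seed)
```

Since the seed depends only on the date, a failure seen in an automated run can be reproduced locally by passing the same date (or the logged seed) to the test script.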