What kind of tests are sufficient: Some personal thoughts

@zaikunzhang Can you please explain what you mean by a TOUGH problem? I have a feeling you mean something quite specific but my search skills aren’t up to the task of finding a good explanation.

This is where code coverage comes in handy. I’ve never actually done this with my Fortran code, but after a quick Google, it seems this is possible with GCC: Using the GNU Compiler Collection (GCC): Instrumentation Options

I have used Fortran compilers that will insert code to count the number of times each line is executed. You run the code and then examine the log file. This is usually intended to identify hot spots so the programmer can focus optimization on those parts, but it also achieves the goal of ensuring that any modified code is actually executed.
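The same idea can be sketched in Python, where `sys.settrace` plays the role of the compiler-inserted counters (`hot_loop` is just a made-up example function):

```python
import sys
from collections import Counter

line_counts = Counter()  # line number -> execution count

def tracer(frame, event, arg):
    # Count every executed line, mimicking compiler-inserted counters.
    if event == "line":
        line_counts[frame.f_lineno] += 1
    return tracer

def hot_loop(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

sys.settrace(tracer)
result = hot_loop(100)
sys.settrace(None)

# The loop body runs far more often than the surrounding lines,
# so it stands out as the "hot spot".
for lineno, count in line_counts.most_common(3):
    print(lineno, count)
print(result)
```

Examining the counts serves both purposes mentioned above: the highest counts point at optimization targets, and a count of zero on a modified line means your tests never reached it.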

I was wondering the same thing myself. The concept reminded me of these blog posts from Bill Rider:

The honest and slightly pessimistic answer is probably that testing will never be sufficient and there will always be bugs, at least in my codes. You will always find edge cases that are not handled, and it is worse when you have a GUI.

In my developer days working on GUIs and interfaces, I focused on integration tests and acceptance tests. Now that I do more backend and scientific programming, I tend to focus on unit tests.

I came up with an easy-to-remember acronym that covers pretty much the kinds of tests I felt were sufficient: :beers: IPA & SOUR :beers:

Integration testing

Integration testing involves two or more modules of an application that are combined and tested as a cohesive unit. The primary goal of this testing is to identify any defects related to the interface, communication, and data flow between these modules.
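As a toy illustration, here is an integration test exercising the seam between two hypothetical modules, a parser and an aggregator (both invented for this sketch):

```python
# Two hypothetical "modules": a parser and an aggregator.
def parse_record(line):
    """Parse 'name,value' into a (name, float) pair."""
    name, value = line.split(",")
    return name.strip(), float(value)

def total_by_name(pairs):
    """Aggregate parsed pairs into per-name totals."""
    totals = {}
    for name, value in pairs:
        totals[name] = totals.get(name, 0.0) + value
    return totals

def test_parser_feeds_aggregator():
    # Integration test: exercises the interface and data flow between
    # the two units (e.g. whitespace is handled by the parser, so the
    # aggregator must not expect to strip it again).
    lines = ["a, 1.5", "b, 2.0", "a, 0.5"]
    totals = total_by_name(parse_record(l) for l in lines)
    assert totals == {"a": 2.0, "b": 2.0}

test_parser_feeds_aggregator()
print("integration test passed")
```

Each module may pass its own unit tests and still fail this one, which is exactly the class of interface defect integration testing targets.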

Performance testing

Performance testing involves evaluating the stability and response time of an application by subjecting it to a load.
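A minimal sketch of the idea in Python, with an invented `respond` routine and an arbitrary time budget standing in for real load parameters:

```python
import time

def respond(query):
    # Stand-in for the routine under load.
    return sum(i for i in range(1000)) + len(query)

def test_latency_under_load(n_requests=200, budget_seconds=2.0):
    start = time.perf_counter()
    for i in range(n_requests):
        respond(f"request-{i}")
    elapsed = time.perf_counter() - start
    # Fail if throughput degrades past the budget.
    assert elapsed < budget_seconds, f"too slow: {elapsed:.3f}s"
    return elapsed

elapsed = test_latency_under_load()
print(f"{elapsed:.4f} s for 200 requests")
```

In practice the budget would come from a requirement or a baseline measurement rather than being picked out of thin air as it is here.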

Acceptance testing

Acceptance testing is a software testing approach in which stakeholders or customers evaluate the software using real-life scenarios. It also involves tasks such as installing, uninstalling, and updating the software. This one is very important: I have run into issues far too often with missing dependencies (such as the C++ redistributable on Windows, MKL, or the Intel redistributables). So installation should be tested in a clean, fresh environment.

System testing

System testing is performed in black box mode without knowing the internal structure of the piece of software. Tests focus on input and output data.
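A small black-box sketch in Python: the "system" is run as an external process, and the test knows nothing but its input and output (the one-line doubling program is just a placeholder for a real executable):

```python
import subprocess
import sys

# Black-box test: run the system under test as an external process and
# check only its input/output behaviour, with no knowledge of internals.
# Here the "system" is a trivial doubling program.
program = "print(int(input()) * 2)"

result = subprocess.run(
    [sys.executable, "-c", program],
    input="21\n", capture_output=True, text=True, check=True,
)
assert result.stdout.strip() == "42"
print("system test passed")
```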

Object testing

Object-oriented testing ensures that each object works as expected on its own. Since objects encapsulate their own logic, they should be tested independently before verifying that they interact correctly with other objects. Mocks, stubs, and fakes can be used to mimic the behavior of other objects.
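For instance, with Python's `unittest.mock`, a mock database can stand in for a real dependency (the `ReportGenerator` class is a made-up example):

```python
from unittest.mock import Mock

class ReportGenerator:
    """Object under test; depends on a database object."""
    def __init__(self, db):
        self.db = db

    def summary(self):
        rows = self.db.fetch_rows()
        return f"{len(rows)} rows"

# A Mock stands in for the real database, so ReportGenerator can be
# tested in isolation, without a live connection.
fake_db = Mock()
fake_db.fetch_rows.return_value = [1, 2, 3]

report = ReportGenerator(fake_db)
assert report.summary() == "3 rows"
fake_db.fetch_rows.assert_called_once()  # the interaction is verified too
print("object test passed")
```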

Unit testing

Typically, unit testing focuses on a single procedure, exercised through a list of assertions.
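For example (with `clamp` as an invented procedure under test):

```python
def clamp(x, lo, hi):
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))

# A unit test: one procedure, a list of assertions covering the normal
# case, both boundaries, and both out-of-range sides.
assert clamp(5, 0, 10) == 5
assert clamp(-1, 0, 10) == 0
assert clamp(11, 0, 10) == 10
assert clamp(0, 0, 10) == 0
assert clamp(10, 0, 10) == 10
print("unit tests passed")
```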

Regression testing

Regression testing checks that unchanged features of the program are not affected by bug fixes, new functionality, code cleanup, or refactoring.
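One common way to implement this is to pin "golden" outputs recorded from a known-good version; the `price_with_tax` routine below is purely illustrative:

```python
def price_with_tax(amount, rate=0.2):
    # Routine that gets refactored over time; behaviour must not change.
    return round(amount * (1.0 + rate), 2)

# Regression test: outputs recorded from the last known-good version
# are pinned; any change that alters them fails immediately.
golden = {
    (100.0, 0.2): 120.0,
    (19.99, 0.2): 23.99,
    (0.0, 0.2): 0.0,
}
for (amount, rate), expected in golden.items():
    assert price_with_tax(amount, rate) == expected
print("regression tests passed")
```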

And don’t forget “testing is doubting” :sweat_smile:.


[See this video to get a feeling for a TOUGH test: https://x.com/historyinmemes/status/1749511750020403222?s=20]

Hi @RobertPincus, TOUGH (Tolerance Of Untamed and Genuine Hazards) Test is a term I coined. The idea is similar to the tests mentioned by @btrettel:

  • invoke your solver on problems that are so difficult that you do not expect it will work;
  • try to crash your solver using exceedingly difficult problems;
  • push your solver to well above its limit and see how it reacts.

This kind of test will reveal many hidden bugs in your solver, especially those triggered by floating-point exceptions.
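A toy sketch of the idea (this is not PRIMA's actual test harness; `newton_sqrt` is a deliberately fragile invented solver): feed the solver hostile inputs and record how it reacts, rather than whether it succeeds:

```python
import math

def newton_sqrt(a, iters=60):
    """Toy Newton iteration for sqrt(a), with no input validation."""
    x = a if a > 1.0 else 1.0
    for _ in range(iters):
        x = 0.5 * (x + a / x)
    return x

# TOUGH-style test: push the solver well outside its comfort zone and
# observe the reaction; crashes and NaN/Inf results reveal hidden bugs.
tough_inputs = [-1.0, 0.0, float("inf"), float("nan"), 1e308]
for a in tough_inputs:
    try:
        x = newton_sqrt(a)
        status = "nan/inf result" if (math.isnan(x) or math.isinf(x)) else "finished"
    except ZeroDivisionError:
        status = "crash (division by zero)"
    print(a, "->", status)
```

Here a negative input crashes the solver outright, and `inf`/`nan` inputs silently poison the result: exactly the kind of floating-point-exception behaviour such a test is designed to surface.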

For more details, see the following screenshot or slide 23 of my talk on PRIMA at the 10th International Congress on Industrial and Applied Mathematics.

In PRIMA, the test problems of the TOUGH Test are generated by a MATLAB script based on the MatCUTEst package.

For numerical methods, I often follow these questions:

  1. Does it build?
  2. Does it run?
  3. Does it get the right answer?

These apply on all target hardware I plan to support. Since I work on a library for spectral element methods, getting the right answer means that the calculus and interpolation operations demonstrate spectral accuracy. This leads to a number of tests for exactness (to machine precision) and for estimation of the appropriate convergence rates. Further, since the library is meant to be used to build conservation-law solvers, I also test example solvers for linear and nonlinear PDEs that use the library the way we intend it to be used.
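The exactness and convergence-rate parts of such tests can be sketched generically; the centered finite difference below is just a stand-in for a real spectral operator, with its design order of 2 in place of spectral accuracy:

```python
import math

def dfdx_centered(f, x, h):
    # Second-order centered difference; stand-in for a discrete operator.
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Exactness test: the operator is exact (to roundoff) for polynomials
# up to its design order.
err_exact = abs(dfdx_centered(lambda x: 3.0 * x + 1.0, 0.7, 0.1) - 3.0)
assert err_exact < 1e-12

# Convergence-rate test: halving h should divide the error by ~4,
# i.e. an observed order of about 2 for a smooth function.
f, df = math.sin, math.cos
x = 1.0
e1 = abs(dfdx_centered(f, x, 1e-2) - df(x))
e2 = abs(dfdx_centered(f, x, 5e-3) - df(x))
order = math.log(e1 / e2) / math.log(2.0)
assert 1.9 < order < 2.1, f"observed order {order:.2f}"
print(f"observed order of accuracy: {order:.2f}")
```

For a spectral operator, the exactness test would cover all polynomials representable on the grid, and the rate test would check for faster-than-algebraic error decay instead of a fixed order.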

As far as I know, the end users are not doing life-or-death calculations, and I typically will not put more rigor into testing what is already free and open-source software unless there is a paying customer on the other end so I can afford the time to tighten up the testing further… and also to afford the insurance.

For a program to give correct results, not only must the data be processed correctly, but the data itself must be correct. Today I discovered that some of my data files have duplicate lines, which I should have caught by checking that the dates in the data files were strictly ascending. Each domain probably has its own common patterns of bad data. For stock prices, one must ensure that stock splits and dividends are properly accounted for and that prices of delisted stocks are not carried forward indefinitely. There is a field called “anomaly detection”. Since some bad data may slip through, non-robust measures such as the mean of computed quantities should be compared with robust measures such as the trimmed mean. One could also run a program with deliberately bad data and check whether the anomalies are detected.
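Both checks mentioned here, strictly ascending dates and a robust/non-robust comparison, are easy to automate; the data below is fabricated for illustration:

```python
from datetime import date
from statistics import mean

def check_strictly_ascending(dates):
    """Flag duplicate or out-of-order rows before they poison results."""
    return [i for i in range(1, len(dates)) if dates[i] <= dates[i - 1]]

def trimmed_mean(values, frac=0.2):
    """Robust companion to the mean: drop the extreme frac on each side."""
    v = sorted(values)
    k = int(len(v) * frac)
    return mean(v[k:len(v) - k])

# A duplicate date at index 2, like the duplicate lines described above.
rows = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 2), date(2024, 1, 3)]
print("duplicate/out-of-order rows at indices:", check_strictly_ascending(rows))

# One bad tick slipped through: the mean explodes, the trimmed mean barely moves.
prices = [100.0, 101.0, 99.5, 100.5, 100.0, 9999.0]
print("mean:", round(mean(prices), 2), "trimmed mean:", round(trimmed_mean(prices), 2))
```

A large gap between the two measures is itself a cheap anomaly detector: it says "go look at the raw data" without requiring any model of what the bad data looks like.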


A comment copied from the same discussion on Julia Discourse:


Here is a video of a TOUGH test: https://x.com/historyinmemes/status/1749511750020403222?s=20 :upside_down_face: