Modernising old libraries - error handling

I have taken it upon me to modernise a few libraries from Netlib: use modern language constructions, provide a more modern interface. One thing I am struggling with is the way to handle errors. Most such libraries provide subroutines with some argument that indicates an error or warning. Quite often they also print to “standard output” to indicate what was wrong.

I prefer to have a clean interface:

  • Functions return a value (scalar or if necessary a derived type or an array) but do not change their arguments (no output arguments).
  • Output to screen/standard output gives trouble if you call the function in a write statement, so they should be avoided. But you also want the user to be able to recover from them, so an error stop in case of a more or less serious error is out of the question.
  • Providing information about an unusual condition - for instance in the case of interpolation over two-dimensional data points, you may want to know if extrapolation occurred. But you may not necessarily want to check such a condition with each call to the interpolation function.
  • The library could store a status variable that can cleared and queried, but that is not thread-safe per se.

My question is: what is a good strategy?

It is difficult to imagine a single solution will fit all situations, but maybe two or more methods could do.

5 Likes

For new libraries, I see three approaches

  1. The standard approach of having an (optional) integer stat argument; optionally with a deferred-length character for an error message. In case the user does not query the stat argument the routine will fail gracefully (issuing a message). The library should uphold to some conventions, like using negative values for errors, 0 for success, and positive values for warnings or other possible outcomes. The library should also offer integer parameters, that can be used to query for specific errors.
  2. A status variable encapsulated in a derived type, which is also the entry point for the library (e.g. an integrator object); the derived type should carry routines to query the status flag and provide a message.
  3. A more modern approach would be to use a dedicated library such as ErrorFx: Fortran exception library. While more powerful, it also puts burden on the consumer since they have to learn not only how to use the error library, but also how to build it and link it with their application.

A few more guidelines which I find relevant for programs that might produce command-line output (taken from https://clig.dev):

  • Return zero exit code on success, non-zero on failure. Exit codes are how scripts determine whether a program succeeded or failed, so you should report this correctly. Map the non-zero exit codes to the most important failure modes.

  • Send output to stdout . The primary output for your command should go to stdout . Anything that is machine readable should also go to stdout —this is where piping sends things by default.

  • Send messaging to stderr . Log messages, errors, and so on should all be sent to stderr . This means that when commands are piped together, these messages are displayed to the user and not fed into the next command.

For logging, the stdlib logger_type can be used. One downside of adding a (local) logger is it requires the library to offer an initialization routine that is called by the consumer in the main program. Alternatively, the logger needs to be initialized in the first call

subroutine library_entry_point( ... )
  use library_internals: library_logger
  logical, save :: first_call = .true.

  if (first_call) then
    call library_logger % add_log_unit( ... )
    first_call = .false.
  end if

  ! ... actual library stuff ...

end subroutine

Some extra care is needed to make sure the logger calls are thread-safe.

7 Likes

If a function result or subroutine argument should always be positive, for example an estimated standard deviation or variance, returning a value of -1 can signal an error (if for example fewer than 2 observations were provided). I wonder what value to use when the quantity could be any real number, for example skewness. Some possibilities are -999, huge(x) or -huge(x), or NaN. A related question is what values to use in an array to signal missing floating point data.

I tend to prefer returning a NaN, because then it will be immediately clear something is wrong. A value like -999 does have appeal, as it can be exactly represented, but it can also easily be ignored.

I suspect that using NaNs to signal errors would run into problems when aggressive optimisations are turned on, see this SO post

This looks strange to me, being just opposite to what we are accustomed to in Fortran, IOSTAT specifier in I/O statements or STAT in ALLOCATE being obvious examples.

I develop a fairly large library for a specialized scientific task (the flow of electromagnetic radiation through a planetary atmosphere). For this task the data being computed requires the use of output arguments, especially derived types, some of which may be user-defined.

For this project I adopted the convention that functions do modify output arguments, and their return value is a string. An empty string means success. Non-empty strings indicate errors can can be appended at various layers of the call stack.

If there were ways to improve on this that still allowed the use of polymorphic derived types I’d be interested to hear about them.

I should probably have been a bit more precise about this:

I was thinking of interface like:

result = mydata%interpolate(x,y) ! No error information at all

A simple-looking function that returns a result. You can of course change this to something like:

result = mydata%interpolate(x,y, err)  ! We could ignore the error information, but for now check it
if ( err /= 0 ) then
   write(*,*) 'Oops! - check coordinates: ', x, y
endif

but that feels rather clumsy. The clumsiness is even worse when it comes to (special) mathematical functions - you may get out of the domain for instance. I now tend to use a return value of NaN in that case, but in the orginal code I see various warnings being printed for situations where the approximation is less accurate than in other parts of the domain.

An alternative might be to use a subroutine, then it feels more “natural” to have several output arguments.

Note that an exception mechanism is awkward as well. Then to handle errors in place ,you have to surround every call to the function that might throw an exception, with a handler because otherwise you do not know where the problem is coming from.

I think the best option is #2 of @ivanpribec, especially for modernizing existing libraries, as it does not requires any changes to the API of the procedures, just adding a few extra routines for error handling. And, if the library gets incorporated into a module, as it probably should be done, the access to the error status and message variables is straightforward, even if one makes them private and puts the access to them via module procedures.

The only question would be whether the error status clearing is left to the user or is done automatically on entry to each of the library procedures. This is important if the user does not check the error status after every single invocation of the library routine (be it by mistake/omission or by purpose). But I guess this could be made an option to choose at start.

As for the thread-safety, I guess it would not be a problem for coarrays parallelism, as long as the error handling is done locally, per-image. It might be an issue for OpenMP/MPI codes but those are not standard Fortran, so it is somewhat implementation-dependent anyway.

The opposite is also a possibility. The two conventions are applied inconsistently in practice.

Here are just a few (counter-)examples:

  • Intel MKL Pardiso: zero on success, negative on failure
  • NLopt: positive for success, negative on failure
  • LAPACK dgetrf: zero for success, negative for illegal input, positive if matrix is singular
  • Netlib ODE solver RKC: positive 1-2 for success, positive 3-5 for failure

The majority of Netlib ODE solvers use zero for success, and negative for failure, and then you have RKC which uses the opposite…

This has been a continuous source of frustration for me while working with Fortran libraries. You always need to consult the documentation carefully, and write your own “error handler” subroutine to decipher the error codes and fail gracefully in case of failure.

1 Like

If you use FINDLOC on a set of logical constraints to set an integer error flag named something like err, it is convenient for err = 0 to indicate no error condition and err > 0 to flag an error. I wrote about this here with code here.

1 Like

One could make subroutine arguments ALLOCATABLE, even if they are scalars, and return them ALLOCATED only for proper inputs. This stops the user from using quantities that could not be computed. An example is below. I have not done this in my own codes (where I return an integer error flag) and am just tossing out an idea.

module m
implicit none
contains
subroutine my_sqrt(x,xsqrt)
real                , intent(in)  :: x
real   , allocatable, intent(out) :: xsqrt
if (x >= 0) xsqrt = sqrt(x)
end subroutine my_sqrt
end module m
!
program main
use m, only: my_sqrt
implicit none
real, allocatable :: y
call my_sqrt(4.0,y)
print*,"allocated(y)=",allocated(y)
if (allocated(y)) print*,"y =",y
call my_sqrt(-1.0,y)
print*,"allocated(y)=",allocated(y)
if (allocated(y)) print*,"y =",y
print*,"y=",y ! crash if unallocated agument is used
end program main

gfortran output (ifort is similar):

 allocated(y)= T
 y =   2.00000000    
 allocated(y)= F

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

I believe this pattern is used also in fpm. It reminds me a bit of the std::option in Rust although with quite different semantics.

It’s a useful pattern for high-level code where you are only working with a few derived-type object instances, but I don’t think it’s the type of code you’d like to see in a low-level array kernel loop.

This is also used in test-drive if I am not mistaken

subroutine test_valid(error)
  type(error_type), allocatable, intent(out) :: error

  call check(error, 1 + 2 == 3)
  if (allocated(error)) return

  ! equivalent to the above
  call check(error, 1 + 2, 3)
  if (allocated(error)) return
end subroutine test_valid

To my surprise, in a toy code using an allocatable subroutine output argument and testing its status in the caller was faster than setting an error flag and testing whether it was returned nonzero:

For

module m
implicit none
contains
subroutine my_sqrt_alloc(x,xsqrt)
real                , intent(in)  :: x
real   , allocatable, intent(out) :: xsqrt
if (x >= 0) xsqrt = sqrt(x)
end subroutine my_sqrt_alloc
!
subroutine my_sqrt_err(x,xsqrt,ierr)
real                , intent(in)  :: x
real   , allocatable, intent(out) :: xsqrt
integer             , intent(out) :: ierr
if (x >= 0) then
   xsqrt = sqrt(x)
   ierr = 0
else
   xsqrt = -1.0
   ierr  = 1
end if
end subroutine my_sqrt_err
end module m
!
program main
use m, only: my_sqrt_alloc,my_sqrt_err
implicit none
integer, parameter   :: n = 10**8
real   , allocatable :: x(:)
real   , allocatable :: y
real                 :: sum_sqrt_alloc = 0.0, sum_sqrt_err = 0.0
integer              :: i,ierr
real                 :: t(3)
allocate (x(n))
call random_number(x)
x = x - 0.5
call cpu_time(t(1))
do i=1,n
   call my_sqrt_alloc(x(i),y)
   if (allocated(y)) sum_sqrt_alloc = sum_sqrt_alloc + y
end do
call cpu_time(t(2))
do i=1,n
   call my_sqrt_err(x(i),y,ierr)
   if (ierr == 0) sum_sqrt_err = sum_sqrt_err + y
end do
call cpu_time(t(3))
print "(3a15)"  ,"method","alloc","err"
print "(a15,2f15.2)","sum_sqrt",sum_sqrt_alloc,sum_sqrt_err
print "(a15,2f15.2)","time",t(2:3) - t(1:2)
end program main

on Windows, ifort -O3 results were

         method          alloc            err
       sum_sqrt    16777216.00    16777216.00
           time           4.02           7.31

and gfortran -O3 results were

         method          alloc            err
       sum_sqrt    16777216.00    16777216.00
           time           3.00           5.53

The subroutine with an allocatable argument is shorter, and in a code with many outputs, you don’t need to read the documentation to see what an error code refers to – you just test whether an output is allocated before using it and are reminded by a run-time crash if you neglect to do so and it could not be computed.

To me, a simple

call calculate(input1,input2,...,output1,output2,...,err)

is the best option.
err should be optional, and by convention the subroutine stops with error stop if err is not given.
This allows to make subroutines without I/O pure.

Having a function that returns a derived type could result in a vast number of derived types. To me, it feels clumsy. It basically shows that Fortran lacks an intrinsic list or tuple type.

1 Like

For MPI this shouldn’t be a problem, because MPI runs multiple images/processes.
For OpenMP it should be enough to declare the module error variable(s) as threadprivate.

This has been my typical practice. Unfortunately there is a bug in gfortran that I first encountered a year ago (and was reported a year earlier by someone else) that will break this pattern in all but the simplest of cases. I wish one of the gfortran developers would tackle this one.

I would never use a library that returns NaNs.

Considering the current Fortran standard with rudimentary to no support for this aspect, KISS will be preferable for FOSS libraries in Fortran:

  1. FUNCTION subprograms: employ ERROR STOP for the exceptions that, hopefully, are as such!
  2. SUBROUTINEs: include STAT and ERRMSG arguments and follow intrinsic procedures: a) STAT is INTENT(OUT) and zero value means SUCCESS and b) ERRMSG is INTENT(INOUT)

When using code that follows convention 2 b), I would set ERRMSG to blank before the call and check that it is blank afterwards. Instead I prefer to make ERRMSG an INTENT(OUT) argument that is blank if there is no error.