Compile-time unit checking in Fortran: some practical experiences

btrettel · August 18, 2024, 12:18am

Unit checking has been a longstanding interest of many developers. There are hundreds of implementations of the idea, and much previous discussion in this forum. There’s a great FortranCon video by Arjen Markus on the tradeoffs of compile-time, run-time, and static analysis approaches which might be good to watch before reading this post.

Most implementations seem to be applied to toy problems and not actual production code, which makes me think that real problems that would appear in practice may not have been identified. So, I took it upon myself to write my own compile-time unit checking system for Fortran (similar to quaff) that I call genunits and use it for a new computational fluid dynamics code I am developing. Here I’d like to report some non-obvious things I’ve found in the process.

In summary: I probably will be removing unit checking from my code due to poor performance. Even with compile-time checking, there’s an 11 to 35 times slowdown with current compilers for an SOR Poisson solver. (Edit: Seems that the vast majority of the slowdown can be avoided with inlining as suggested by wspector.) The main advantage seems to be finding certain bugs sooner, not finding bugs that would be missed with rigorous testing. Given that testing has no run-time penalty, it seems to be a better choice at the moment if performance is a concern. For compile-time unit checking to be done properly, I believe it needs to be a compiler feature, preferably a standard feature.

A bit about my system: genunits reads an input file defining the desired unit system and generates custom source code for a Fortran module. Unit checking is done with derived types at compile-time through defined operations. Once this module is used, variables are assigned units by declaring them with the appropriate type (for example, type(unitless) :: x), and most mathematical operations work the same as before. You can see some examples of the system in action in this test file.

Why have compile-time unit checking in general?

Before I implemented genunits, I might have said that the main purpose of genunits would be to catch bugs. But now that I’ve thought about it, I don’t think that unit checking will inherently find bugs that other forms of testing can’t. The advantages of compile-time unit checking instead are that bugs are found earlier in development and often the precise location of a bug is pinpointed with the compiler error. I can say that the bugs in my own code that genunits has found so far have been easy to fix. If I relied only on conventional testing, identifying where the bugs were would have taken longer. But does this benefit outweigh the problems?

Why have unit checking as part of the compiler?

Specifically, what can’t be done with a compile-time derived type implementation that a compiler implementation would allow?

By the way: All of the problems listed below aside from the last can be mitigated, at least to an extent, by compiler developers without adding a units feature.

Run-time performance takes a huge hit even avoiding run-time checks

Unfortunately, run-time performance is by far the biggest disadvantage. Even with optimizations, run time increases by at least an order of magnitude for the SOR Poisson solver test I’m using. This is unacceptable for many applications, and surprising to me given that the unit checking is at compile-time. Something about writing units as derived types inherently causes a slowdown. I’m not a compiler engineer, so I don’t know what’s going on under the hood. Some specific numbers from my tests:

gfortran -O2: 11.2x slowdown
ifort -O2: 13.7x slowdown
ifx -O2: 19.1x slowdown
nvfortran -fast: 35.4x slowdown

Slow compilation and effort needed to mitigate slow compilation

Van Snyder has discussed the huge number of units needed to cover part of what I call a unit system. Having a huge number of units will lead to slow compilation for some compilers.

genunits has been designed to minimize the number of units to make compilation faster. Essentially, genunits requires some seed units that form a basis for the unit system, and a user-specified number of units are generated from that basis in rough order of likelihood of appearance. However, this is not a panacea. As development proceeds, the size of the unit system required tends to expand due to what I call intermediate units. The genunits configuration then has to be adjusted to generate more units, which slows compilation. Intermediate units are not used for the defined variables, but appear in expressions as mathematical operations are performed. For example, consider m with units of kg, rho with units of kg/m^3, and x, y, and z with units of m. The equation m = rho * x * y * z will include the intermediate unit formed by rho * x, which has units of kg/m^2. This unit needs to be part of the unit system for the unit checking to function. If kg/m^2 is not part of the unit system, the user will get a compiler error identical to that when there is a unit mismatch, despite there being no actual error.
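
To make the intermediate-unit problem concrete, here is a minimal hand-written sketch of the m = rho * x * y * z example. It is not genunits output: the module name, the type names (mass_per_area, mass_per_length) and the use of plain interface blocks are all assumptions made for illustration. Because multiplication is evaluated left to right, the expression passes through kg/m^2 and then kg/m before reaching kg, so both intermediate types and their operators must exist even though no variable is ever declared with them.

```fortran
module intermediate_units_sketch
    implicit none
    integer, parameter :: wp = kind(1.0d0)

    ! Hypothetical unit types; each simply wraps a real value.
    type :: length            ! m
        real(wp) :: v
    end type
    type :: mass              ! kg
        real(wp) :: v
    end type
    type :: density           ! kg/m^3
        real(wp) :: v
    end type
    type :: mass_per_area     ! kg/m^2, appears only inside expressions
        real(wp) :: v
    end type
    type :: mass_per_length   ! kg/m, appears only inside expressions
        real(wp) :: v
    end type

    ! Every step of rho * x * y * z needs its own implementation.
    interface operator(*)
        module procedure mul_density_length
        module procedure mul_mass_per_area_length
        module procedure mul_mass_per_length_length
    end interface

contains

    elemental function mul_density_length(a, b) result(c)
        type(density), intent(in) :: a
        type(length), intent(in) :: b
        type(mass_per_area) :: c
        c%v = a%v * b%v
    end function

    elemental function mul_mass_per_area_length(a, b) result(c)
        type(mass_per_area), intent(in) :: a
        type(length), intent(in) :: b
        type(mass_per_length) :: c
        c%v = a%v * b%v
    end function

    elemental function mul_mass_per_length_length(a, b) result(c)
        type(mass_per_length), intent(in) :: a
        type(length), intent(in) :: b
        type(mass) :: c
        c%v = a%v * b%v
    end function

end module intermediate_units_sketch

program intermediate_demo
    use intermediate_units_sketch
    implicit none
    type(density) :: rho
    type(length) :: x, y, z
    type(mass) :: m

    rho = density(2.0_wp)
    x = length(1.0_wp)
    y = length(3.0_wp)
    z = length(4.0_wp)

    ! Evaluated as ((rho * x) * y) * z: kg/m^3 -> kg/m^2 -> kg/m -> kg
    m = rho * x * y * z
    print *, m%v   ! value is 24.0
end program intermediate_demo
```

Deleting mul_mass_per_area_length from this sketch should make the assignment to m fail to compile, even though the expression is dimensionally correct. That is the false positive described above.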

Compiler error messages when units mismatch are often unclear

A compiler implementation could have far more helpful and descriptive error messages. And as I said, many compiler error messages for genunits are false positives in a sense, in that there is no actual unit mismatch, but the unit system needs to be expanded to include required intermediate units.

Exponentiation operators are limited in a derived type implementation

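For example, a derived type implementation can’t determine the units of x**2 unless x is unitless. The units of the result depend on the value of the exponent, but Fortran picks a defined operation from the types of its operands, not from their values. The programmer will instead have to write x*x or use a convenience function like square(x). A compiler implementation would not have this limitation.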
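
A minimal sketch of the square(x) workaround, under the assumption of two hypothetical types, length and area (these names and the module are illustrative, not taken from genunits). A defined operator(**) taking (length, integer) could only ever return one fixed type, so it cannot give x**2 and x**3 different units. A separately named function avoids the problem because each name has its own result type.

```fortran
module square_sketch
    implicit none
    integer, parameter :: wp = kind(1.0d0)

    type :: length   ! m
        real(wp) :: v
    end type
    type :: area     ! m^2
        real(wp) :: v
    end type

    ! One specific procedure per input unit; the result unit is fixed
    ! by the procedure, not by a run-time exponent.
    interface square
        module procedure square_length
    end interface

contains

    elemental function square_length(x) result(y)
        type(length), intent(in) :: x
        type(area) :: y
        y%v = x%v * x%v
    end function

end module square_sketch

program square_demo
    use square_sketch
    implicit none
    type(length) :: x
    type(area) :: a

    x = length(3.0_wp)
    a = square(x)
    print *, a%v   ! value is 9.0
end program square_demo
```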

Closing

My main goal here is to inform potential future Fortran language and compiler developments. I’m also hoping people will have some ideas about how to mitigate the performance issues, but I suspect little can be done on my end.

There’s a lot more I could write about this, but I want to prevent this post from being even longer. I’m happy to answer any questions about this.

certik · August 18, 2024, 1:25am

When you say compile-time checking, you mean the following code?

github.com/btrettel/flt/blob/e0766a7e683b698c982fd4b6e879692af6b2f218/test/test_units.f90#L465

    !        end do

    !        write(unit=*, fmt="(a)")
    !    end do
    end subroutine poisson_real

    subroutine poisson_units(mmax, nmax, itmax, u)
        integer, intent(in) :: mmax  ! number of interior $x$ grid points
        integer, intent(in) :: nmax  ! number of interior $y$ grid points
        integer, intent(in) :: itmax ! maximum number of iterations allowed
        type(si_energy), allocatable, intent(out) :: u(:, :) ! numerical solution

        real(kind=WP), parameter :: A     = 1.0_WP ! $x$ dimension
        real(kind=WP), parameter :: B     = 1.0_WP ! $y$ dimension
        real(kind=WP), parameter :: OMEGA = 1.0_WP ! relaxation parameter
        real(kind=WP), parameter :: TOL   = 0.005_WP ! tolerance for maximum of absolute value of residual

        type(si_length) :: hx
        type(si_length) :: hy
        type(unitless)  :: q

Which uses type(si_energy). If so, then I would say it’s expected that it will be a lot slower at runtime, since you use derived types instead of just arrays.

Consequently, I would call these “runtime units”.

And I agree with your conclusion: the only way this can work in practice is if units are part of the compiler, there is no overhead at runtime, and the compiler is implemented in such a way as to allow fast compilation.

btrettel · August 18, 2024, 1:49am

Yes, that’s an example. I don’t have the generated module committed, which is probably making what’s happening unclear. See here for the generated module: http://trettel.us/units.f90

As I understand it, there are two ways to implement unit checking with derived types in Fortran. One uses a single derived type and the unit checking is done at run-time (for example: PhysUnits). I would assume that has a heavy run-time cost. The approach I’m taking (also used by quaff) has many derived types and the unit checking is done at compile-time. All the type-bound operators do in the second case is reimplement the operations. The operators in the second case have no explicit code for checking the units. It wasn’t obvious to me that this approach would necessarily have a significant run-time cost.

As I’ve said, I don’t fully understand what compilers are doing under the hood. But it seems to me that it’s possible (though perhaps not easy) in the compile-time unit checking case for the compiler’s optimizer to essentially remove the layer provided by the derived types and get near normal performance. That’s obviously not happening at present.

Edit: I realized after posting this that what I wrote is probably unclear for some, so I’m going to explain how compile-time unit checking works here.

If the derived types don’t check the units in the operators at run-time, then how are the units checked? Basically, only valid operations are implemented. So addition and subtraction are implemented between two identical units, but not between two different units. For multiplication and division, the unit for the result is determined in the generator given particular combinations of left and right arguments. genunits will create all of the appropriate operations, and none of the inappropriate operations. Consequently, if a unit error is made, the compiler won’t know what to do, and a compile-time error will result. This is what I mean by “compile-time unit checking”.
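
A minimal hand-written sketch of this idea follows. The module and type names (units_sketch, length, time, velocity) are assumptions for illustration, and plain interface blocks are used here rather than type-bound operators; the real generated module covers far more units and operations. Only length + length and length / time are implemented, so any other combination has no matching procedure and the compiler rejects it.

```fortran
module units_sketch
    implicit none
    integer, parameter :: wp = kind(1.0d0)

    ! One derived type per unit; each just wraps a real value.
    type :: length     ! m
        real(wp) :: v
    end type
    type :: time       ! s
        real(wp) :: v
    end type
    type :: velocity   ! m/s
        real(wp) :: v
    end type

    ! Only dimensionally valid combinations get an implementation.
    ! Note that neither procedure contains any unit-checking code.
    interface operator(+)
        module procedure add_length_length
    end interface

    interface operator(/)
        module procedure div_length_time
    end interface

contains

    elemental function add_length_length(a, b) result(c)
        type(length), intent(in) :: a, b
        type(length) :: c
        c%v = a%v + b%v
    end function

    elemental function div_length_time(a, b) result(c)
        type(length), intent(in) :: a
        type(time), intent(in) :: b
        type(velocity) :: c
        c%v = a%v / b%v
    end function

end module units_sketch

program units_demo
    use units_sketch
    implicit none
    type(length) :: x, y
    type(time) :: t
    type(velocity) :: u

    x = length(3.0_wp)
    y = length(1.0_wp)
    t = time(2.0_wp)

    u = (x + y) / t   ! length / time gives velocity
    print *, u%v      ! value is 2.0

    ! Uncommenting the next line should fail to compile, because no
    ! operator(+) is defined for the combination (length, time):
    ! x = x + t
end program units_demo
```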

wspector · August 18, 2024, 7:38pm

Have you tried profiling the test code yet? It would be interesting to know where the ‘hot spots’ are. Might be something relatively simple to fix.

I suspect you are generating a lot of procedure calls within hot loops. Perhaps turning on compiler inlining might help.

btrettel · August 18, 2024, 8:07pm

You were spot on with the suggestion about inlining. Thank you.

I had some notes about inlining with ifx that suggested adding -flto. I ran ifx again:

ifx -O2: 21.3x as slow (similar to before)
ifx -O2 -flto: 1.37x as slow

Only a 37% increase in run time is an enormous improvement. Looks like inlining for nvfortran is more complicated and I’ll have to figure out how to do it for gfortran.
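
For anyone who wants to try a comparison like this on a smaller case, here is a rough timing sketch. It is not the SOR Poisson test from the post: the type, the loop, and the array size are all made up for illustration. It sums the same values once as plain reals and once through a defined operator(+) on a hypothetical wrapper type, timing each loop with system_clock. One caveat: with everything in a single file the compiler may already inline the operator, so to see the effect of a flag like -flto the module would probably need to live in its own source file.

```fortran
module wrapped_real
    implicit none
    integer, parameter :: wp = kind(1.0d0)
    integer, parameter :: i8 = selected_int_kind(18)

    ! Hypothetical unit type wrapping a real value.
    type :: length
        real(wp) :: v
    end type

    interface operator(+)
        module procedure add_length
    end interface

contains

    elemental function add_length(a, b) result(c)
        type(length), intent(in) :: a, b
        type(length) :: c
        c%v = a%v + b%v
    end function

end module wrapped_real

program time_wrapped
    use wrapped_real
    implicit none
    integer, parameter :: n = 1000000
    real(wp), allocatable :: a(:)
    type(length), allocatable :: b(:)
    real(wp) :: s_plain
    type(length) :: s_units
    integer(i8) :: t0, t1, t2, rate
    integer :: i

    allocate(a(n), b(n))
    do i = 1, n
        a(i) = real(i, wp)
        b(i) = length(real(i, wp))
    end do

    call system_clock(t0, rate)

    ! Baseline: intrinsic addition on plain reals.
    s_plain = 0.0_wp
    do i = 1, n
        s_plain = s_plain + a(i)
    end do
    call system_clock(t1)

    ! Same arithmetic routed through the defined operator.
    s_units = length(0.0_wp)
    do i = 1, n
        s_units = s_units + b(i)
    end do
    call system_clock(t2)

    ! Printing the sums keeps the loops from being optimized away.
    print *, "plain real (s): ", real(t1 - t0, wp) / real(rate, wp), s_plain
    print *, "unit type (s):  ", real(t2 - t1, wp) / real(rate, wp), s_units%v
end program time_wrapped
```

A loop this small may finish too quickly for the clock resolution, so n probably needs to be increased, or the loops repeated, before the two timings can be compared meaningfully.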

Also: I haven’t profiled the code, mostly because I’m not particularly familiar with profiling code. My background is more physics than computation. I’ll look into it; perhaps the remaining slowdown can be reduced even further.

tyranids · August 18, 2024, 8:29pm

You can inline across compilation units (files) in gfortran with -flto.

ivanpribec · August 18, 2024, 9:28pm

I’ve never used such unit checking libraries, but I know there are a dozen of them for C++, and the idea has received on-and-off attention from various standardization committees. I’ve read about external tools which can analyze units, for instance FPT, but I don’t know how mature the technology is.

Taking a step back, I think the question of units in code hits upon the interesting duality of code as data and its opposite, data (or “configuration”) as code. Could the system be built in a way that a configuration phase checks the units, and then passes them forward to the computation phase?

My experience is that profiling is getting easier nowadays, with multiple open source and vendor tools available. Here are a few I’m aware of:

In case you belong to an EU academic organization, you can also get in touch with programmes that offer support or can do the profiling for you, such as:

The easiest tool to start with IMO is the Intel Application Performance Snapshot (a subtool of the VTune profiler). Assuming you have VTune installed in the standard location, all you need to do is:

$ source /opt/intel/oneapi/setvars.sh
$ aps <my_fortran_application>

This will generate an HTML report you can view in your browser. Here is some example output I got measuring the assembly phase of a sparse matrix program:

The numbers don’t mean much unless you know what to expect. Here are a few things I observed:

The first positive sign is that the application is using “packed” double precision floating-point operations (DP FLOPS), or in other terms vector/SIMD instructions.
The IPC (instructions per cycle) is > 1, meaning the application is making use of instruction pipelining and is likely instruction-bound. An IPC < 1 likely means the memory is stalling (the arithmetic units are waiting for something to load). (Take this with a grain of salt.)
The value of 31 GFLOPS is impressive if you think about it: about 30 billion floating-point operations per second. Just for reference, the Cray X-MP could reach 800 MFLOPS. I happen to know that 31 GFLOPS is still far from the peak on this processor (which is okay for me).
APS reports a memory bandwidth of 1.15 GB/s on average; my own calculation gave me 1.5 GB/s. My CPU (Intel i7-11700K) has a maximum memory bandwidth of 50 GB/s. This means there is still room for improvement, but in practice you can rarely achieve the maximum rates (unless you go deep into performance tricks, at the expense of code simplicity). From my knowledge of the kernel, I know that it isn’t very memory heavy, and the computation part is most likely the bottleneck.

The next thing I did was to launch the “Hotspot Analysis” with the Intel VTune profiler. This included a nice visualization called the Flame Graph, showing the active call stacks; the wider a block, the more time was spent in that routine (note the x-axis is not time!):

Already we can see that the application appears to be spending a lot of time in DGETRF and DGETRS (the LU factorization and solve routines from LAPACK). By looking at the Top-down Tree view I could confirm that is where 70 % of the time is spent (see image below). Of the remaining 30 %, 15 % was spent reading the input files (the blocks on the left part of the flame graph), and 15 % in the procedures that construct the matrices and copy the values into the sparse matrix. Based on the profiling I decided further tuning was not needed for now. On 8 cores I’m currently assembling about 850000 matrix rows per second. When we look at the wall time (the thread timeline in the image above), the serial section of reading the input file (ASCII txt) takes longer than the calculation part does.

gak · August 19, 2024, 1:00am

I don’t have anything directly to add to this conversation. I do know, though, that the Camfort project is working on lightweight verification of Fortran codes, and the first thing mentioned in the specifications section of the project overview is unit checking.

I’ve found the developers really responsive to questions about the project. Camfort itself is written in Haskell.

wspector · August 19, 2024, 3:14am

Great! With gfortran, you’ll want to look at the -finline… options. The gprof utility is very handy for profiling codes. The gfortran -pg option is needed to properly generate code for gprof.

Note that the -finline and -pg options are documented in the gcc and g++ man pages. The gfortran man page only documents Fortran-specific options.
