I’ve never used such unit checking libraries myself, but I know there are a dozen of them for C++, and the idea has received on-and-off attention from various standardization committees. I’ve also read about external tools that can analyze units, for instance FPT, but I don’t know how mature the technology is.
Taking a step back, I think the question of units in code hits upon the interesting duality of code as data and its opposite, data (or “configuration”) as code. Could the system be built in a way that a configuration phase checks the units, and then passes them forward to the computation phase?
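Just to make that idea concrete, here is a minimal sketch of my own (not any particular library): a derived type carries SI dimension exponents next to the value, the overloaded operators check consistency once during the configuration phase, and only the plain real values are handed on to the computation phase. All the names below are hypothetical:

```fortran
module units_check
   implicit none
   private
   public :: quantity, operator(+), operator(*)

   !> A value tagged with SI dimension exponents (metre, kilogram, second)
   type :: quantity
      real    :: val     = 0.0
      integer :: dims(3) = 0
   end type quantity

   interface operator(+)
      module procedure add_q
   end interface
   interface operator(*)
      module procedure mul_q
   end interface

contains

   function add_q(a, b) result(c)
      type(quantity), intent(in) :: a, b
      type(quantity) :: c
      if (any(a%dims /= b%dims)) error stop "unit mismatch in addition"
      c%val  = a%val + b%val
      c%dims = a%dims
   end function add_q

   function mul_q(a, b) result(c)
      type(quantity), intent(in) :: a, b
      type(quantity) :: c
      c%val  = a%val * b%val
      c%dims = a%dims + b%dims   ! exponents add under multiplication
   end function mul_q

end module units_check

program config_phase
   use units_check
   implicit none
   type(quantity) :: a, b, total, area

   a = quantity(3.0, [1, 0, 0])   ! 3 m
   b = quantity(2.0, [1, 0, 0])   ! 2 m

   total = a + b                  ! fine: same dimensions
   area  = a * b                  ! exponents add -> m**2

   ! total = a + quantity(1.0, [0, 0, 1])   ! would abort with "unit mismatch in addition"

   ! only the checked plain values are handed to the computation phase
   print *, total%val, area%val
end program config_phase
```

Compile-time variants of this idea exist as well, but even a runtime check confined to a configuration phase costs nothing in the inner loops.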
In my experience, profiling is getting easier nowadays, with multiple open-source and vendor tools available. Here are a few I’m aware of:
If you belong to an EU academic organization, you can also get in touch with programmes that offer support or can do the profiling for you, such as:
The easiest tool to start with, IMO, is the Intel Application Performance Snapshot (a subtool of the VTune profiler). Assuming you have VTune installed in the standard location, all you need to do is:
```
$ source /opt/intel/oneapi/setvars.sh
$ aps <my_fortran_application>
```
This will generate an HTML report you can view in your browser. Here is some example output I got while measuring the assembly phase of a sparse matrix program:
The numbers don’t mean much unless you know what to expect; here are a few things I observed:
- The first positive sign is that the application is using “packed” double-precision floating-point operations (DP FLOPs), in other words vector/SIMD instructions.
- The IPC (instructions per cycle) is > 1, meaning the application is making use of instruction pipelining and is likely compute-bound rather than memory-bound. An IPC < 1 likely means memory is stalling the pipeline (the arithmetic units are waiting for data to load). (Take this with a grain of salt.)
- The value of 31 GFLOPS is impressive if you think about it: 31 billion floating-point operations per second. Just for reference, the Cray X-MP could reach 800 MFLOPS. I happen to know that 31 GFLOPS is still far from the peak on this processor, which is okay with me (a rough peak estimate is sketched after this list).
- APS reports a memory bandwidth of 1.15 GB/s on average; my own calculation gave me 1.5 GB/s (the kind of arithmetic behind both estimates is sketched after this list). My CPU (an Intel i7-11700K) has a maximum memory bandwidth of 50 GB/s, so there is still room for improvement, but in practice you can rarely reach the maximum rates (unless you go deep into performance tricks, at the expense of code simplicity). From my knowledge of the kernel, I know it isn’t very memory-heavy, and the computation part is most likely the bottleneck.
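For completeness, this is the back-of-the-envelope arithmetic I mean: the nominal DP peak as cores × clock × SIMD lanes × 2 (FMA) × FMA units per core, and the achieved bandwidth as bytes moved divided by elapsed time. All the numbers below are placeholders, not the figures from my run; you would substitute your own machine parameters and measured traffic:

```fortran
program peak_estimates
   implicit none
   ! Assumed machine parameters (placeholders -- adjust for your own CPU):
   real, parameter :: ncores      = 8.0    ! physical cores used
   real, parameter :: freq_ghz    = 3.6    ! sustained clock [GHz]
   real, parameter :: simd_lanes  = 4.0    ! doubles per AVX2 register
   real, parameter :: flops_fma   = 2.0    ! a fused multiply-add counts as 2 FLOPs
   real, parameter :: fma_units   = 2.0    ! FMA pipes per core

   ! Assumed kernel traffic (placeholders -- use your own counts and timings):
   real, parameter :: bytes_moved = 3.0e9  ! bytes read + written by the kernel
   real, parameter :: elapsed_s   = 2.0    ! measured wall time [s]

   print '(a,f8.1,a)', ' nominal DP peak    ~', &
      ncores*freq_ghz*simd_lanes*flops_fma*fma_units, ' GFLOPS'
   print '(a,f8.2,a)', ' achieved bandwidth ~', &
      bytes_moved/elapsed_s/1.0e9, ' GB/s'
end program peak_estimates
```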
The next thing I did was to launch the “Hotspot Analysis” with the Intel VTune profiler. This included a nice visualization called the Flame Graph, showing the active call stacks; the wider a block, the more time was spent in that routine (note the x-axis is not time!):
Already we can see that the application appears to be spending a lot of time in DGETRF and DGETRS (the LU factorization and solve routines from LAPACK). By looking at the Top-down Tree view I could confirm that this is where 70 % of the time is spent (see image below). Of the remaining 30 %, 15 % was spent reading the input files (the blocks on the left part of the flame graph), and 15 % in the procedures that construct the matrices and copy the values into the sparse matrix. Based on the profiling I decided further tuning was not needed for now. On 8 cores I’m currently assembling about 850,000 matrix rows per second. When we look at the wall time (the thread timeline in the image above), the serial section reading the input file (ASCII text) takes longer than the calculation part does.
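As a quick sanity check of that wall-time split, independent of VTune, a few system_clock timers around the phases already go a long way. This is only a generic sketch; read_input, assemble_matrix and solve are placeholder names standing in for whatever the real application calls:

```fortran
program time_phases
   use, intrinsic :: iso_fortran_env, only: int64
   implicit none
   integer(int64) :: t0, t1, t2, t3, rate

   call system_clock(count_rate=rate)

   call system_clock(t0)
   call read_input()        ! serial ASCII read
   call system_clock(t1)
   call assemble_matrix()   ! assembly of the sparse matrix
   call system_clock(t2)
   call solve()             ! the DGETRF/DGETRS part
   call system_clock(t3)

   print '(a,f8.2,a)', ' read     :', real(t1 - t0)/real(rate), ' s'
   print '(a,f8.2,a)', ' assembly :', real(t2 - t1)/real(rate), ' s'
   print '(a,f8.2,a)', ' solve    :', real(t3 - t2)/real(rate), ' s'

contains

   ! Placeholder routines standing in for the real application phases
   subroutine read_input()
   end subroutine read_input

   subroutine assemble_matrix()
   end subroutine assemble_matrix

   subroutine solve()
   end subroutine solve

end program time_phases
```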