Are there really any reliable ways to time Fortran code?

This thread was prompted by @JohnCampbell's suggestion in this thread to use SYSTEM_CLOCK instead of CPU_TIME. In general I agree with his suggestion, but I wanted to see if it holds true when you use optimization etc. Therefore, I wrote a set of timer routines and a test program (see here) that times a summation over 1.E8 iterations. I tested on a Linux Mint 21.3 system (Ubuntu 22.04 LTS) on an AMD Ryzen 5 5600X, using Fortran's intrinsic DATE_AND_TIME, SYSTEM_CLOCK, and CPU_TIME routines along with the C gettimeofday and clock_gettime functions. clock_gettime has several different types of clocks you can specify; the most reliable appears to be the CLOCK_MONOTONIC option. I also tried the Linux-specific CLOCK_THREAD_CPUTIME_ID option, which returns the CPU time for the current thread.

When you compile the test program with no optimization (-O0), the program returns fairly consistent times for all the timers (around 0.25 seconds). However, when you turn on optimization (-O2), the results vary a lot. For the compilers I tried (ifx, ifort, gfortran, nvfortran, and AMD flang), SYSTEM_CLOCK and DATE_AND_TIME return a zero elapsed time to 15 decimal places. I expected a drastic drop in time with optimization, but not zero. However, given how fast modern processors are, I can see how even 1.E8 iterations might not be enough work to register a measurable elapsed time. Consistent with John's observation, only the CPU-time functions showed anything like an observable elapsed time, so I will have to agree with John about CPU_TIME.

This raises the question: are current procedures for timing code really sufficient when optimization is turned on? Please try the test program and let me know if I'm not doing something right, and post your results on other systems. The C gettimeofday and clock_gettime functions are POSIX, so they may work on Windows (but don't hold your breath). If you really need to time code, I think it's best to use a dedicated profiler that can descend to the loop level.

Update:

I think the reason that SYSTEM_CLOCK and DATE_AND_TIME are showing zero delta time is that there is an interval between updates of the values these routines return. On my Linux system, this appears to be on the order of 10^-5 to 10^-6 seconds for SYSTEM_CLOCK, depending on how often you run the program. What is probably happening is that with -O2 optimization, the time taken in the summation loops is less than the SYSTEM_CLOCK update increment, so the starting and ending times come back as the same value. The moral of this story (at least for me) is: don't trust any of the Fortran timers for fine-grained profiling of loops or code segments. That job should be done by a dedicated profiling tool like gprof etc. Most of the Fortran timers are O.K. for computing the total run time (i.e., wall-clock time) of a program, but using them for timing loops is probably not the best idea.
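One way to see that update interval directly is to spin until the returned count changes. A minimal sketch (assuming SYSTEM_CLOCK is supported, i.e. a nonzero count rate; not part of the original test program):

program tick_probe
   use iso_fortran_env, only: int64
   implicit none
   integer(int64) :: t0, t1, rate
   call system_clock(t0, rate)
   do                                    ! spin until the returned count changes
      call system_clock(t1)
      if (t1 /= t0) exit
   end do
   print *, 'smallest observable tick ~', real(t1 - t0)/real(rate), 'seconds'
end program tick_probe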


I think that is correct, and also on most compilers that interval depends on the integer and real KINDs of the arguments. If you use int64 arguments, for example, then you will likely get finer-grained clock counts.
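For instance, this sketch shows the kind dependence; the example rates in the comments are what gfortran on Linux typically reports, and are only an assumption for other compilers:

program rate_kinds
   use iso_fortran_env, only: int32, int64
   implicit none
   integer(int32) :: rate32
   integer(int64) :: rate64
   call system_clock(count_rate=rate32)   ! gfortran typically reports 1000
   call system_clock(count_rate=rate64)   ! gfortran typically reports 1000000000
   print *, 'int32 rate:', rate32, '  int64 rate:', rate64
end program rate_kinds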

As far as which to use, it depends on what you want to time. The cpu time (typically the sum of the cpu times over all threads) is different than the wall clock elapsed time. Consider a program that is waiting for you to respond to a prompt for input. While you are scratching your head wondering what value to type in, the cpu time is not accumulating whereas the wall time is ticking right along.
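A contrived sketch of that difference (nothing here is from the test program; it just illustrates the two clocks diverging at a prompt):

program cpu_vs_wall
   use iso_fortran_env, only: int64
   implicit none
   real :: c0, c1, x
   integer(int64) :: w0, w1, rate
   call cpu_time(c0)
   call system_clock(w0, rate)
   print *, 'enter a value:'              ! head-scratching happens here
   read *, x
   call cpu_time(c1)
   call system_clock(w1)
   print *, 'cpu time:', c1 - c0, '  wall time:', real(w1 - w0)/real(rate)
end program cpu_vs_wall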

Agreed. Another issue with CPU_TIME is when you run threaded applications: if I remember correctly, it reports the total CPU time accumulated by all processors. That's one reason I looked at the C clock_gettime function. With the CLOCK_MONOTONIC clock id it appears to consistently return values that I can almost believe; I can't say that about CPU_TIME and, to a certain extent, SYSTEM_CLOCK. Another issue with SYSTEM_CLOCK is that once it reaches its COUNT_MAX value, COUNT cycles back to zero. However, with 64-bit integers, COUNT_MAX is such a large number that in most practical applications it will probably never be reached.

Note: I updated my test_timer program to add a guesstimate of the update intervals for SYSTEM_CLOCK and DATE_AND_TIME.

Though I also think that there are lots of caveats about measuring time, in the particular case above, isn’t the “zero time” because -O2 optimizes away the entire calculation (because sum is not used later)? Indeed, if I insert

print *, "sum = ", sum` 

before each write statement, the timing becomes about 0.09 sec (*2) on my PC (*1). Below is the comparison of the results (all with gfortran-12.3 + -D__linux__).

  • Original test_timers.F90 + options = -O0 -march=native
C gettimeofday
  delta time C ctd                 = 0.354956150054932
C clock_gettime with MONOTONIC clock
  delta time C cgt monotonic       = 0.353862497024238
C clock_gettime with REALTIME clock
  delta time C cgt realtime        = 0.353832483291626
C clock_gettime with CLOCK_THREAD_CPUTIME_ID clock (Linux only)
  delta time C cgt thread cputime  = 0.353834986000000
DATE_AND_TIME
  delta time DT                    = 0.353999999817461
CPU_TIME
  delta time CPU                   = 0.355595000000000
SYSTEM_CLOCK
  delta time SC                    = 0.355354215949774
  • Original test_timers.F90 + options = -O2 -march=native
C gettimeofday
 delta time C ctd                 = 0.00000000000000
C clock_gettime with MONOTONIC clock
 delta time C cgt monotonic       = 0.698491930961609E-7
C clock_gettime with REALTIME clock
 delta time C cgt realtime        = 0.238418579101562E-6
C clock_gettime with CLOCK_THREAD_CPUTIME_ID clock (Linux only)
 delta time C cgt thread cputime  = 0.100100000000004E-5
DATE_AND_TIME
 delta time DT                    = 0.00000000000000
CPU_TIME
 delta time CPU                   = 0.999999999999916E-6
SYSTEM_CLOCK
 delta time SC                    = 0.307336449623108E-7
  • Modified test_timers.F90 (with print *,sum) + options = -O2 -march=native
C gettimeofday
 sum =    1000100000000.0000     
 delta time C ctd                 = 0.885128974914551E-1
C clock_gettime with MONOTONIC clock
 sum =    1000100000000.0000     
 delta time C cgt monotonic       = 0.884691430255771E-1
C clock_gettime with REALTIME clock
 sum =    1000100000000.0000     
 delta time C cgt realtime        = 0.889539718627930E-1
C clock_gettime with CLOCK_THREAD_CPUTIME_ID clock (Linux only)
 sum =    1000100000000.0000     
 delta time C cgt thread cputime  = 0.890282150000000E-1
DATE_AND_TIME
 sum =    1000100000000.0000     
 delta time DT                    = 0.889999999199063E-1
CPU_TIME
 sum =    1000100000000.0000     
 delta time CPU                   = 0.888979999999999E-1
SYSTEM_CLOCK
 sum =    1000100000000.0000     
 delta time SC                    = 0.889583956450224E-1

(*1) My PC is Ubuntu 22 + Ryzen 5700X, but the maximum CPU frequency is capped at 3400 MHz via the BIOS (no boost) so that the machine will not get very hot.

(*2) 0.09 sec x 3.4 GHz ~ 3 * 10^8 operations, which seems close to three additions x the number of loop iterations (n = JMAX * IMAX = 10^8), but I'm not sure to what extent this kind of direct counting corresponds to the actual timing. Also, there may be some "warm-up" time for the CPU to increase its frequency (e.g. 2200 → 3400 MHz for the above PC).


BTW, I think it would be very useful to gather various info and tips on benchmarking, profiling, and related tools in some Fortran-lang page (or even in one dedicated thread on the Discourse site), so that people can refer to it conveniently. I would definitely read it, because I have little experience with such profiling tools yet.

To get more info, I have tried changing the number of iterations from the original value (n = 10**8) down to n = 10 for all the timers available in timers.F90 (linked in the first post). From the results below, it seems (for this particular test) that (i) DATE_AND_TIME breaks down very soon; (ii) SYSTEM_CLOCK is among the best and better than CPU_TIME; and (iii) the highest resolution (for one sample of calculation) may be 10 microseconds or so (?).

(all data with gfortran-12.3 -O2 -march=native + Ryzen5700X / 3.4GHz (fixed))
gettimeofday
    n = 10**7 : time = 0.886297225952148E-2
    n = 10**6 : time = 0.884056091308594E-3
    n = 10**5 : time = 0.891685485839844E-4
    n = 10**4 : time = 0.905990600585938E-5
    n = 10**3 : time = 0.953674316406250E-6
    n = 10**2 : time = 0.953674316406250E-6
    n = 10**1 : time = 0.00000000000000
clock_gettime with MONOTONIC clock
    n = 10**7 : time = 0.886544678360224E-2
    n = 10**6 : time = 0.895485281944275E-3
    n = 10**5 : time = 0.885054469108582E-4
    n = 10**4 : time = 0.902730971574783E-5
    n = 10**3 : time = 0.991858541965485E-6
    n = 10**2 : time = 0.170432031154633E-6
    n = 10**1 : time = 0.800937414169312E-7
clock_gettime with REALTIME clock
    n = 10**7 : time = 0.884652137756348E-2
    n = 10**6 : time = 0.884294509887695E-3
    n = 10**5 : time = 0.884532928466797E-4
    n = 10**4 : time = 0.882148742675781E-5
    n = 10**3 : time = 0.953674316406250E-6
    n = 10**2 : time = 0.00000000000000
    n = 10**1 : time = 0.00000000000000
clock_gettime with CLOCK_THREAD_CPUTIME_ID clock
    n = 10**7 : time = 0.884679000000000E-2
    n = 10**6 : time = 0.884829000000000E-3
    n = 10**5 : time = 0.894180000000000E-4
    n = 10**4 : time = 0.986800000000012E-5
    n = 10**3 : time = 0.177399999999996E-5
    n = 10**2 : time = 0.100100000000004E-5
    n = 10**1 : time = 0.981999999999849E-6
DATE_AND_TIME
    n = 10**7 : time = 0.900000007823110E-2
    n = 10**6 : time = 0.100000016391277E-2
    n = 10**5 : time = 0.00000000000000
    n = 10**4 : time = 0.00000000000000
    n = 10**3 : time = 0.00000000000000
    n = 10**2 : time = 0.00000000000000
    n = 10**1 : time = 0.00000000000000
CPU_TIME
    n = 10**7 : time = 0.884799999999999E-2
    n = 10**6 : time = 0.889000000000000E-3
    n = 10**5 : time = 0.899999999999998E-4
    n = 10**4 : time = 0.100000000000000E-4
    n = 10**3 : time = 0.300000000000018E-5
    n = 10**2 : time = 0.200000000000005E-5
    n = 10**1 : time = 0.999999999999916E-6
SYSTEM_CLOCK
    n = 10**7 : time = 0.886124838143587E-2
    n = 10**6 : time = 0.884152948856354E-3
    n = 10**5 : time = 0.885073095560074E-4
    n = 10**4 : time = 0.887643545866013E-5
    n = 10**3 : time = 0.931322574615479E-6
    n = 10**2 : time = 0.129453837871552E-6
    n = 10**1 : time = 0.298023223876953E-7

Thanks. These are very interesting results. I also suspected that the compiler might be optimizing away the summation, but was not sure how to prove it. I think your results reinforce the point I was trying to make with this exercise: you need to be aware of the differences and potential problems among the various intrinsic Fortran timing functions if you want to profile loops etc. at a fine-grained level. For Linux users, I think the THREAD_CPUTIME option for clock_gettime might be of use if you want an estimate of CPU time per thread. Obviously, using DATE_AND_TIME is not the best option, but it's one I've seen used before.


I always use system_clock with int64 arguments. That gives you the higher-precision timing you want. Also, make sure your program actually uses the output of whatever computation you are benchmarking; usually I just write a value to the console after the benchmark. If you don't actually use the variable you're working on in the benchmark code, a good compiler will optimize it away completely when you go above -O0.
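Put together, a minimal sketch of that pattern (the summation loop is a stand-in workload, not the test program from this thread):

program bench
   use iso_fortran_env, only: int64, real64
   implicit none
   integer(int64) :: t0, t1, rate
   integer :: i
   real(real64) :: total
   total = 0.0_real64
   call system_clock(t0, rate)           ! int64 count and count_rate
   do i = 1, 100000000
      total = total + real(i, real64)
   end do
   call system_clock(t1)
   print *, 'sum =', total               ! use the result so -O2 cannot discard the loop
   print *, 'elapsed:', real(t1 - t0, real64)/real(rate, real64), 'seconds'
end program bench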

Timing is essential in (my) coarray programming as well. But in the context of Unified Acceleration programming, I first check whether a (future) device/accelerator actually supports a clock at all, and if not, I use alternative (inexact) ways to measure time, e.g. (explicit) parallel loop iteration counts. See the following code snippet from my channel implementation (still working on this) to initiate a self-adjusting timer:

!****************************************************************
  ! initiate the timer for the ChannelReceive
  call cpu_time (fo % m_r_cpuTime1)
  call system_clock(count_rate = fo % m_r_countRate)

  !**************
  ! initiate the timer for the ChannelReceive (using system_clock):
  if (fo % m_r_countRate == 0) then
    fo % m_l_system_clock_supported = .false. ! system/device has no clock
  else
    fo % m_l_system_clock_supported = .true. ! system/device has a clock
  end if

  !**************
  ! initiate the timer for the ChannelReceive (using cpu_time):
  if (fo % m_r_cpuTime1 < 0) then ! system/device does not support cpu_time
    fo % m_l_cpu_time_supported = .false.
  else ! system/device does support cpu_time
    fo % m_l_cpu_time_supported = .true.
  end if

  !**************
  if (fo % m_l_system_clock_supported) then ! the system/device does support system_clock / has a clock
    call system_clock(count = fo % m_i_time1) ! initiate the timer
    fo % m_r_timeLimitInSec = 0.1 ! each channel's IsReceive does reinitiate the timer
    fo % m_l_timeLimitInSecCustomizedSetting = .false.
    fo % m_l_automaticTimeLimitAdjustmentEnabled = .true.
    fo % m_r_maxValueInSecForAutomaticTimeLimitAdjustment = 8.0
    ! default value, customized setting through the set_timeLimitInSeconds procedure
    ! below
  else if (fo % m_l_cpu_time_supported) then ! system/device does support cpu_time
    call cpu_time (fo % m_r_cpuTime1)
    fo % m_r_timeLimitInSec = 0.1 ! each channel's IsReceive does reinitiate the timer
    fo % m_l_timeLimitInSecCustomizedSetting = .false.
    fo % m_l_automaticTimeLimitAdjustmentEnabled = .true.
    fo % m_r_maxValueInSecForAutomaticTimeLimitAdjustment = 8.0
    ! default value, customized setting through the set_timeLimitInSeconds procedure
    ! below
  else ! system_clock and cpu_time are not supported
    fo % m_i_parallelLoopIterationCounter = 0
    fo % m_i_loopIterationLimit = 1000
    fo % m_l_automaticIterationLimitAdjustmentEnabled = .true.
    fo % m_i_maxValueForAutomaticIterationLimitAdjustment = 25000
  end if  
  !**************

Timing comes up here relatively frequently, so I was gathering up some of the info discussed here and in other threads, and seeing what might help improve the fpm-compatible version of stopwatch and some fpm plugin prototypes for using gprof and gdb that I started putting together in a WIP package:

https://github.com/lockstockandbarrel/uranus

and looking around at the state of the art, I was surprised at the sophistication of some of the tools in @Beliavsky's lists. There is more out there than I thought.

That little site I just made is basically a temporary sandbox for trying out things to add to M_stopwatch, but it got me wondering.

Are people more interested in tooling that simplifies using tools like gcov, gdb, gprof, valgrind, and related vendor tools from fpm(1); in packages like M_stopwatch; or just in some simple examples to act as frameworks they can insert into their codebase to do simple timing?

I was toying with making the output from SYSTEM_CLOCK automatically use an appropriate format based on the size of COUNT_RATE, and found that on some systems gfortran under Cygwin returns 1000000000 when the clock on that PC has a resolution at least 100 times coarser than that; so I am wondering how much that number can be trusted. I also found that if COUNT_RATE is a 128-bit real, IFX gets an ICE on at least two systems while IFORT does not, with the versions I have; but since you get the highest value the system supports using 64-bit values, it is not a big deal. Still, any ICE is good to eliminate.

My biggest issues with SYSTEM_CLOCK are that once you reach COUNT_MAX the clock count cycles back to zero, and that you have no way of knowing a priori where COUNT is in the 0 to COUNT_MAX cycle, so COUNT is somewhat arbitrary. Granted, with 64-bit integers COUNT_MAX is a very large number that you will probably never reach, but you never know. That was another reason for looking at some of the clock_gettime options: they don't have "discontinuities" from correcting for leap seconds etc. Again, SYSTEM_CLOCK is still your second-best bet if you don't want to use a profiler; at least that's what I see based on the results shown in this thread.
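For what it's worth, a single rollover can be corrected for when computing the delta. A minimal sketch, assuming at most one wrap between the two calls:

program wrap_safe
   use iso_fortran_env, only: int64
   implicit none
   integer(int64) :: c1, c2, rate, cmax, delta
   call system_clock(c1, rate, cmax)
   ! ... code being timed goes here ...
   call system_clock(c2)
   if (c2 >= c1) then
      delta = c2 - c1
   else                                  ! the count wrapped past COUNT_MAX
      delta = (cmax - c1) + c2 + 1_int64
   end if
   print *, 'elapsed:', real(delta)/real(rate), 'seconds'
end program wrap_safe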

Also, I’ve seen some people use MPI’s MPI_WTIME (and OpenMP has omp_get_wtime as well). I’m not sure what they are based on, but they might be another option to consider.

Update: It appears that Open MPI uses the C gettimeofday function as the default for MPI_WTIME but allows systems to call clock_gettime if that is a “cheaper” alternative. Not sure about other MPI implementations. See here for an explanation of Open MPI’s MPI_WTIME.
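For reference, the OpenMP timer is trivial to call from Fortran. A minimal sketch (compile with -fopenmp or your compiler's equivalent):

program wtime_demo
   use omp_lib, only: omp_get_wtime
   implicit none
   double precision :: t0, t1
   t0 = omp_get_wtime()                  ! wall-clock seconds as a double
   ! ... code being timed goes here ...
   t1 = omp_get_wtime()
   print *, 'elapsed:', t1 - t0, 'seconds'
end program wtime_demo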

For the common case of serial code, if both DATE_AND_TIME() and SYSTEM_CLOCK() return values, then on the first call you could take a delta between the milliseconds from DATE_AND_TIME() and the SYSTEM_CLOCK() count and, knowing the rate of the system clock, create a date composed of an int64 and a real64 that is an epoch date (Unix epoch time, Julian time, ...): whole seconds in the integer component (probably adjusted to the Zulu time zone), with the SYSTEM_CLOCK information used only to calculate the fractional second.

This user-defined type for tracking time, a large integer for the whole seconds and a 64-bit float for the fraction, comes from earlier discussions moderated by @beliavsky (I think). Let's call it a ZULU. Because what calendrical and clock information is available varies so much, and because more and more of the things being timed run on GPUs (which are unlikely to fully support a clock and date, and/or have their own clocks), I am not sure how well it will hold for the general case. But I was thinking of starting with a procedure that returns a ZULU and seeing how well that works, plus adding d2z and z2d functions to the M_time module to convert to and from DAT times, to allow for flexible formatting of the dates. If the start time is a ZULU time, you only depend on the SYSTEM_CLOCK() value to calculate the second, and you should be able to calculate how many rollovers occurred. As long as the rate is large it is reasonable to assume no more than one rollover, but from the descriptions of the routines it should be set up to work with rollovers at least down to one day. So that is where I was heading until I hit a show-stopper: combine the available info to make a ZULU time, as accurate as available on the platform, that is a user-defined type composed of a 64-bit int for the seconds of an epoch time and a 64-bit real for the calendrical second. If that works, a lot of the rest is easy. My biggest concern is how well that will hold up with parallel code on GPUs.

Either way, I think I will rewrite M_time to internally use ZULU units.
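For concreteness, a minimal sketch of what such a type might look like (the names zulu, seconds, and frac are hypothetical, not an existing M_time API):

module zulu_time
   use iso_fortran_env, only: int64, real64
   implicit none
   type :: zulu
      integer(int64) :: seconds = 0_int64     ! whole seconds of an epoch (UTC/Zulu) time
      real(real64)   :: frac    = 0.0_real64  ! fractional second, derived from SYSTEM_CLOCK
   end type zulu
end module zulu_time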

I have been working for a few months on a benchmarking library: benchmark.f. I initially did not plan to release it so soon, but looking at these discussions about benchmarking gave me the extra motivation to finish it.

The library allows you to time any function/subroutine in your code. It is as simple as

type(runner) :: br
 
benchmark(br, run(arg1, arg2, foo))

Under the hood, the code does several things for you:

  • retrieve hardware info
  • retrieve compiler info
  • warm up
  • repeat the function calls to either reach a minimum time or a maximum number of executions. The clock can be picked among the ones discussed in this thread; by default system_clock is used.
  • repeat the previous step until reaching steady state (checked by t-test on the mean)
  • compute statistics and report the results

Most of the parameters can be customized either in code or through a configuration file.

I spent some time creating documentation with various examples that I hope will be useful. Give it a try, and do not hesitate to reach out if you have any suggestions.


Use the clock on the wall, preferably one with a second hand. Once you have stopped chuckling over this suggestion, read on …

If your program executes too quickly for you to measure accurately enough using the second hand on a wall clock, why are you bothering with its execution time? If you really, really want to measure the time of a short program, then put it in a loop and execute it 100, 1000, 10^n times until you get accurate enough timings from the clock.

Look, this approach doesn't need to worry about whether the system clock rolls over during execution; what the implementation differences between this platform and that one are; what happens with parallel programs; or any of the other technical problems raised in previous answers.

One second of accuracy is all you need for programs with short execution times; for long execution times you can ignore the second hand. What use are the 14 additional insignificant figures you get from calculating the run time using 64-bit floats, or even 32-bit floats?
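The same repetition trick also rescues the programmatic timers, of course; a sketch (do_work is a hypothetical stand-in kernel):

program repeat_trick
   use iso_fortran_env, only: int64, real64
   implicit none
   integer, parameter :: reps = 1000
   integer(int64) :: t0, t1, rate
   integer :: i
   real(real64) :: acc
   acc = 0.0_real64
   call system_clock(t0, rate)
   do i = 1, reps
      call do_work(acc)                  ! repeat until the total time is measurable
   end do
   call system_clock(t1)
   print *, acc                          ! use the result so it is not optimized away
   print *, 'time per call:', real(t1 - t0, real64)/real(rate, real64)/reps, 'seconds'
contains
   subroutine do_work(x)                 ! stand-in workload
      real(real64), intent(inout) :: x
      integer :: j
      do j = 1, 1000
         x = x + sqrt(real(j, real64))
      end do
   end subroutine do_work
end program repeat_trick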


A good point, but hard to script and automate. In retrospect, it is odd that I have not seen anyone in the forum using the GNU time(1) command or the shell built-ins, particularly the tcsh/csh built-in that you can have report statistics for any command taking over N seconds. A lot of HPC batch jobs include uses like that to give overall command timing information, and it is seen quite a bit when just writing a little program that calls a routine a lot and then timing it externally as in "time cmd".
Not sure what the MS Windows equivalent is. I use the GNU command quite often myself. You might want to add that to

https://fortranwiki.org/fortran/show/Timing

and the verbose mode of "fpm run" and "fpm test" could easily display wall-clock run times by timing the calls to external commands, which would be a nice feature I think.

Originally there was a wrapper routine for all calls to EXECUTE_COMMAND_LINE. If that is still true, it would only take a few lines. If DATE_AND_TIME is used, several packages on the Fortran Wiki page that work with fpm can do the time diff between start and end times; or just do it with the values returned by the intrinsic. Doing it externally with a wall clock or a command is really the simplest if what is running is a simple benchmark loop program.
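A minimal sketch of that idea, assuming the command to be timed is passed as the first argument:

program time_cmd
   use iso_fortran_env, only: int64
   implicit none
   integer(int64) :: t0, t1, rate
   character(len=4096) :: cmd
   call get_command_argument(1, cmd)     ! the external command to time
   call system_clock(t0, rate)
   call execute_command_line(trim(cmd))
   call system_clock(t1)
   print '(a,f0.3,a)', 'elapsed: ', real(t1 - t0)/real(rate), ' seconds'
end program time_cmd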

The attributes of all timing routines discussed vary significantly between operating systems.

Three points should be remembered:

  1. “Wall clock” timers should be more precise, as they just monitor a timer in the OS. (On Windows, RDTSC is the best timer, but it is not easily accessible, which is a pity.)
  2. “cpu_time” timers depend on how frequently the OS updates the CPU usage pool. (Windows 10 does this only 64 times per second, which is pathetic for short-duration calculations.)
  3. If a timer reports no change between calls (i.e. delta time = 0), then the calculation duration is too small for the precision of the timer being used. This can be seen in some of the posts above.

So, to your point, I started the sub-page

https://fortranwiki.org/fortran/show/timer

that just runs a command and returns the wall-clock time in D-HH:MM:SS.SSS format as a starter page. Feel free to finish it.

I also made links to that new page and some other related Fortran resources. If anyone else has others, please add them. A good example of using gprof(1) might be useful.

Not sure how portable the example program is to legacy OS systems. Please check whether it builds on MS Windows and/or with other compilers; I do not have access at the moment.

Good points, and certainly not obvious ones. As was mentioned above, I was surprised that the clock sampling rates returned are often nowhere near the actual sample rates. Combined with your points, timing anything below a macro level is far more problematic than I realized, unless you stay within one programming environment and come to understand its low-level behavior!

Microsoft used to make available a timethis utility. I copied it to my PC running Windows 10, where it still works.

#!/bin/bash
cd /tmp
git clone https://github.com/lockstockandbarrel/uranus
cd uranus/TimeThis
fpm run -- dir|tee -a $0
#fpm install  # assuming set up for default install to work and want to, remove leading #
exit
README.md  app	build  fpm.toml
TimeThis :  Command Line :  dir
TimeThis :    Start Time :  Fri Sep 27 13:04:41 2024
TimeThis :      End Time :  Fri Sep 27 13:04:41 2024
TimeThis :  Elapsed Time :  0-00:00:00.75

It should be a look-alike? From what I gleaned, the original did not seem to have any options.

In standard Fortran.