Module to time code

To time sections N sections of a code, one may declare real :: times(:N+1) and insert
call cpu_time(times(i)) in the right places. It’s simple but tedious, and if you want to add a new check point between two check points, you need to renumber times(:). At the end of the program you need to label and print the elapsed times. I wrote a simple module to streamline this, printing both CPU and wall times given check points. A main program could be

program xwatch
! 11/20/2021 08:15 PM driver for watch and print_elapsed_times
use kind_mod , only: dp
use watch_mod, only: watch,print_elapsed_times
implicit none
integer          , parameter :: n = 10**7, iunit = 20
character (len=*), parameter :: xfile = "temp.bin"
real(kind=dp)                :: x(n),xchk(n)
call random_seed()
call watch("init")
call random_number(x)
call watch("random_number()")
print*,"sum(x) =",sum(x)
call watch("sum(x)")
print*,"sum(sin(x)) =",sum(sin(x))
call watch("sum(sin(x))")
open (unit=iunit,file=xfile,action="write",status="replace",form="unformatted")
write (iunit) x
close (iunit)
call watch("wrote x")
open (unit=iunit,file=xfile,action="read",status="old",form="unformatted")
read (iunit) xchk
call watch("read x")
print*,"max_diff =",maxval(abs(x-xchk))
call watch("check")
call print_elapsed_times()
end program xwatch

for the modules

module kind_mod
implicit none
integer, public, parameter :: dp = kind(1.0d0)
end module kind_mod
module watch_mod
use kind_mod, only: dp
implicit none
public              :: watch,print_elapsed_times
integer, parameter  :: max_times = 1000
real(kind=dp)       :: cpu_times(max_times)
integer             :: wall_times(max_times)
integer, save       :: ntimes = 0
character (len=100) :: time_labels(max_times)
subroutine watch(label)
character (len=*), intent(in), optional :: label
ntimes = ntimes + 1
if (ntimes <= max_times) then
   call cpu_time(cpu_times(ntimes))
   call system_clock(wall_times(ntimes))
   if (present(label)) then
      time_labels(ntimes) = label
      time_labels(ntimes) = ""
   end if
end if
end subroutine watch
subroutine print_elapsed_times()
integer :: i, itick
call system_clock(count_rate = itick)
write (*,"(/,a20,2a12)") "task","cpu_time","wall_time"
do i=2,ntimes
   write (*,"(a20,2f12.6)") trim(time_labels(i)),cpu_times(i)-cpu_times(i-1), &
   if (i == ntimes) write (*,"(a20,2f12.6)") "TOTAL",cpu_times(i)-cpu_times(1), &
end do
end subroutine print_elapsed_times
end module watch_mod

Sample output is

 sum(x) =   4999193.4915036112     
 sum(sin(x)) =   4596353.7277946780     
 max_diff =   0.0000000000000000     

                task    cpu_time   wall_time
     random_number()    0.078125    0.078000
              sum(x)    0.000000    0.000000
         sum(sin(x))    0.375000    0.375000
             wrote x    0.093750    0.313000
              read x    0.031250    0.047000
               check    0.031250    0.031000
               TOTAL    0.609375    0.844000

Wall and CPU times tend to be very close except for I/O. I could add an option to print the elapsed times in descending order to clarify where a program spends more time.


There are other timing libraries like StopWatch but I like this one better than any other I have seen, even though I maintain a copy of StopWatch (the original is public domain from NIST but is no longer maintained).

I use vendor-supplied utilities or gprof(1) for profiling, but sometimes you just want built-in timing info or something quick for testing, and this looks to be very nice. I hope you turn it into an fpm(1) package, I think it would make a great addition to the fpm(1) registry.

Timing-related issues come up frequently on this forum, I think this would be nice in fpm(1) and/or stdlib(1) so it could easily be used, at least with code that is being discussed where those are in context.

I don’t think this approach is viable. You appear to be timing functions that take a few processor cycles, with a CPU_time function that on windows is updated about once every 60 million processor cycles.
I can only comment for windows OS where the only possible timer available for this approach could be RDTSC, while the intrinsic SYSTEM_CLOCK at best is updated every 1024 processor cycles, but typically with default integers, much less frequently.
A closer inspection of cpu_time reports will demonstrate the problem.This is why programmers may choose 100 million calls of a task to get a sensible time report.
I understand linux based timers may be more accurate, but I douby if they approach the accuracy required. I would like to know what timing resolution is available on other O/S.
For use in Fortran programs, we need access to more precise timers and certainly better documentation of the timer precision. Those who have tried to utilise rdtsc on windows would appreciate the complexity of using this timer. The Fortran compiler should include this.


I think cpu_time and system_clock are OK for some uses. Could an example be given of timing a Fortran code at a higher resolution using other methods? Ideally a wrapper could be provided for that too.

Yes, it would be nice to have a reliable benchmark library with nice reporting like BenchmarkTools.jl.

Here is how I currently benchmark:

! The sizes must be divisible by 512 (=64 doubles)
integer(i8), parameter :: sizes(*) = [ &
    512, &           
    1024, &         ! 1 KB                         
    2 * 1024, &                                                              
    4 * 1024, &                          
    3 * 1024, &    
    6 * 1024, &                          
    8 * 1024, &
    10 * 1024, &                                          
    16 * 1024, &                                               
    32 * 1024, &                               
    64 * 1024, &   
    96 * 1024, &                         
    128 * 1024, &
    196 * 1024, &                                          
    256 * 1024, &                                              
    400 * 1024, &                    
    512 * 1024, &  
    600 * 1024, &                           
    800 * 1024, &
    900 * 1024, &                                              
    1024 * 1024, &    ! 1 MB      
    1400 * 1024, & 
    1800 * 1024, &                               
    2 * 1024 * 1024, &
    4 * 1024 * 1024, &
    8 * 1024 * 1024, &                             
    16 * 1024 * 1024, & ! 16 MB
    32 * 1024 * 1024  &
do j = 1, size(sizes)
    Ntile = sizes(j) / 8 ! Double precision (8 bytes) as array element size
    M = 1024*10000*6*10*2 / Ntile
    if (Ntile > 32768) M = M / 5
    if (M == 0) M = 1
    allocate(r(Ntile), x(Ntile))
    call random_number(x)
    x = x*(xmax-xmin)+xmin

    call cpu_time(t1)
    do k = 1, M
        call array_read(Ntile, x)
    end do
    call cpu_time(t2)
    time_read = (t2-t1)/(M*Ntile)

    call cpu_time(t1)
    do k = 1, M
        call array_write(Ntile, r)
    end do
    call cpu_time(t2)
    time_write = (t2-t1)/(M*Ntile)

    call cpu_time(t1)
    if (benchmark_type == benchmark_type_fast) then
        ! This is the fast high accuracy version, the candidate for inclusion
        ! into LFortran's runtime library
        do k = 1, M
            call kernel_sin1(Ntile, x, r)
        end do
    elseif (benchmark_type == benchmark_type_fastest) then
        ! This is the fastest possible but low accuracy version
        ! It still uses (-pi/2, pi/2) reduction
        do k = 1, M
            call kernel_sin4(Ntile, x, r)
        end do
    elseif (benchmark_type == benchmark_type_fastest2) then
        ! This is the fastest possible but low accuracy version
        ! It uses (-pi, pi) reduction
        do k = 1, M
            call kernel_sin42pi(Ntile, x, r)
        end do
    elseif (benchmark_type == benchmark_type_gfortran_sin) then
        ! The default gfortran sin
        do k = 1, M
            call kernel_gfortran_sin(Ntile, x, r)
        end do
        error stop "Benchmark type not implemented"
    end if
    call cpu_time(t2)
    time_kernel = (t2-t1)/(M*Ntile)

    print "(i10, i10, es15.6, es15.6, es15.6)", Ntile, M, time_kernel, time_read, time_write
    ! To prevent the compiler to optimize out the above loop
    open(newunit=u, file="log.txt", status="replace", action="write")
    write(u, *) r(1:10)

    deallocate(r, x)
end do

You have to loop it to last roughly 1s (could probably by less), then you get very accurate timings.

I’ve got what I think is a nice module for timing sections of code. It implements a lightweight method for creating and maintaining a nested tree of timers. The tree is automatically generated via the nested starting and stopping of named timers. You can read more about it here, and the module is here. It uses the intrinsic system_clock and so is probably not suitable for timing very small sections of code, as has been discussed above.


[quote=“Beliavsky, post:4, topic:2313”]
timing a Fortran code at a higher resolution
[/quote] could be achieved on Windows OS by providing a link to integer*8 Function RDTSC_tick ()
This could be named something else, eg Processor_ticks () and provided for all OS.

A number of problems are suggested for this timer, but none cancel out the benefit of a higher resolution timer. These include the time can vary between threads, which can be overcome by timing outside !$OMP regions or accepting the accuracy available.
There is also a question about if the cpu boost clock can affect the Rdtsc clock rate reported, which I don’t fully understand.
I have received code for gFortran Rdtsc_ticks () and Rdtsc_initialise (), but they are not robust.
There is also an issue of availability of Rdtsc_rate (), which can be overcome by calibrating against another timer, over the program total run time.
It would be much better if these implementation issues could be resolved by the compiler when providing 8-byte integer functions and documenting the available precision and latency.
It is not a good outcome if the timer latency is much greater than what is being timed, although this needs to be understood when considering timing of short snippets of code, as in the initial post.

I have used these timers for trying to understand OpenMP delays for constructs (which was another disappointing outcome, especially when expressed as processor cycles)