Large wall time, small CPU_TIME, what does that tell you?

Sure. This one seems working:

Code:

   real*8, allocatable :: large_array(:,:,:)
   integer :: i,j,k, ni,nj,nk
   real*8    :: tic,toc, start, end

   ni = 500
   nj = 1000
   nk = 1500
   allocate ( large_array(ni,nj,nk) )

   write (*,*) 'GOOD access'
   call CPU_TIME(start)
   tic = wall_time () ! wall time counter
   
   do k = 1,nk
     do j = 1,nj
       do i = 1,ni
         large_array(i,j,k) = i+j+k
       end do
     end do
   end do

   toc = wall_time() ! wall time counter
   call CPU_TIME(end)

   write(*,*) 'cpu time is: ', end-start
   write(*,*) 'wall time is: ', toc-tic

   write (*,*) 'BAD  access'
   call CPU_TIME(start)
   tic = wall_time () ! wall time counter
   
   do i = 1,ni
     do j = 1,nj
       do k = 1,nk
         large_array(i,j,k) = i+j+k
       end do
     end do
   end do

   toc = wall_time() ! wall time counter
   call CPU_TIME(end)

   write(*,*) 'cpu time is: ', end-start
   write(*,*) 'wall time is: ', toc-tic

  contains

   real function wall_time ()
    integer*8 :: ticks, rate, start_tick = -1
    call system_clock ( ticks, rate )
     if ( start_tick == -1) start_tick = ticks
     ticks = ticks - start_tick
    wall_time = dble(ticks)/dble(rate)
    write (*,*) 'wall_time :',ticks, rate, wall_time
   end function wall_time 
1 Like

This is the code I used in my previous post. I explained why it failed and two ways to fix it (one suggested by you earlier), and I posted timing results with the fixed version.

Portability is not the problem, it is just an extension that is no longer necessary to use.

Back in the f77 days, I routinely used real*8 declarations because it was more portable than the standard conventions at that time. That’s right, it was nonstandard, but more portable. But now there are several alternatives to specify kind values that are standard and that can be used instead, and that’s the way I think we should address this issue here in this discussion group so that new fortran programmers can get off to a good start in the language.

1 Like

Thanks! I see. This issue may be remotely kind of related with the following two posts:

1 Like

Just out of curiosity, is it also possible to try changing the first allocation statement from

allocate ( large_array(ni,nj,nk) )

to

allocate ( large_array(ni,nj,nk), source=0.0d0 )

on your laptop (I assume Intel + VisualStudio has been used throughout)?
(Here I just focus only on timing, not coding style.) On my mac, flang-21.18 shows an overhead of 0.3–0.4 sec in wall time for the ā€œGOODā€ case (because it involves the initial assignment of the array), while gfortran-15 does not seem to show any difference with the above change… (possibly gfortran does not perform the assignment with ā€œsourceā€ until the array elements are actually used? Then the timing seems more consistent on my mac.)

Also, it may be better to print the result of the array somehow after printing the time to avoid the unrealistic ā€œoptimizationā€ (like some calculation is optimized away), though in the above code this did not change the timing results on my mac…

Interestingly, the BAD case is slower than the GOOD case by about x20 for flang -O3,while the speed is almost the same for gfortran-15 -O3 (when I also output the sum of array values after time measurement). But the runtime is 2 times faster for flang than gfortran, so the timing seems complicated… (I will try the comparison more carefully later.)

RE the remaining difference of CPU_TIME and wall time on your laptop, I wonder if there is some ā€œWindows-specificā€ overhead or security check for OS-related functionalities like time measurement (whose effects may be different for CPU_TIME and system_clock…?)

1 Like

The reason the original code results in zero wall time was not due to integer or floating point overflow. It is due to the loss of precision when converting the integer values to, eventually, real32 floating point. Note that in the statement

wall_time = dble(ticks)/dble(rate)

one might think this would be alright because the initial conversions are real64 values, and then only at the end converted to real32. So lets print out a couple of actual values to see what is happening. I added the line:

print*, ticks, dble(ticks), rate, dble(rate), wall_time  

at the end of the function, and here are some values that result:

 GOOD access
  1769195415213892000   1.7691954152138921E+018           1000000000   1000000000.0000000        1.76919539E+09
  1769195416641231000   1.7691954166412311E+018           1000000000   1000000000.0000000        1.76919539E+09
 cpu time is :    1.42736697    
 wall time is:    0.00000000    
 BAD  access
  1769195416641265000   1.7691954166412649E+018           1000000000   1000000000.0000000        1.76919539E+09
  1769195424213143000   1.7691954242131430E+018           1000000000   1000000000.0000000        1.76919539E+09
 cpu time is :    7.57191992    
 wall time is:    0.00000000   

This is with gfortran on MacOS. As you can see, there appears to be no problem with the int64 values themselves; i.e. no overflow, and the values increase as expected between calls. And there is no problem here with the conversion to real64; some precision is lost, of course (64 integer bits going to 53 mantissa bits), but the difference is still large enough so that the computed times increase the correct amount. There is no problem with conversion of rate either, that integer value is converted correctly. But the conversion to real32 keeps only the high-order bits and discards the low-order bits, and the high-order mantissa bits are the same for those two large values, so the returned times are the same. Thus the differences of the returned times are zero, as we know they should be now that the individual steps are exposed.

There are three ways to fix this problem. One way is to declare all the times, including the wall_time() function result, to be real64. That change results in the output:

 GOOD access
  1769192977478843000   1.7691929774788429E+018           1000000000   1000000000.0000000        1769192977.4788430     
  1769192978912242000   1.7691929789122419E+018           1000000000   1000000000.0000000        1769192978.9122419     
 cpu time is:    1.4334149999999999     
 wall time is:    1.4333989620208740     
 BAD  access
  1769192978912272000   1.7691929789122719E+018           1000000000   1000000000.0000000        1769192978.9122720     
  1769192985396408000   1.7691929853964081E+018           1000000000   1000000000.0000000        1769192985.3964081     
 cpu time is:    6.4840510000000000     
 wall time is:    6.4841361045837402     

Another way to fix the problem is to declare ticks and rate to be int32 instead of int64. With a real32 function result, this gives the output:

 GOOD access
  1817530319   1817530319.0000000             1000   1000.0000000000000        1817530.38    
  1817531752   1817531752.0000000             1000   1000.0000000000000        1817531.75    
 cpu time is :    1.43310905    
 wall time is:    1.37500000    
 BAD  access
  1817531752   1817531752.0000000             1000   1000.0000000000000        1817531.75    
  1817539168   1817539168.0000000             1000   1000.0000000000000        1817539.12    
 cpu time is :    7.41588593    
 wall time is:    7.37500000

This output might seem paradoxical because a less precise calculation (ticks and rate are now less precise than before) results in a more accurate final result, but it does make sense when you look at the computation step by step. The high-order bits in the ticks value no longer exist, leaving just the important low-order bits.

The third way to fix this problem is to manually eliminate the common high-order bits in the ticks value before returning the function result. This approach was given above by @JohnCampbell. Here is my version of that code:

   function wall_time ()
      real(real32) :: wall_time
      integer(int64) :: ticks, rate
      integer(int64), save :: tick0 = -1
      logical, save :: first = .true.
      call system_clock ( ticks, rate )
      if ( first ) then
         first = .false.
         tick0 = ticks
      endif
      wall_time = real(ticks-tick0)/real(rate)
      print*, ticks, tick0, (ticks-tick0), rate, wall_time
      return
   end function wall_time

  GOOD access
  1769194889704100000  1769194889704100000                    0           1000000000   0.00000000    
  1769194891166727000  1769194889704100000           1462627000           1000000000   1.46262693    
 cpu time is :    1.46265602    
 wall time is:    1.46262693    
 BAD  access
  1769194891166762000  1769194889704100000           1462662000           1000000000   1.46266198    
  1769194897533354000  1769194889704100000           7829254000           1000000000   7.82925415    
 cpu time is :    6.36665297    
 wall time is:    6.36659241 

I explicitly give the SAVE attribute in the declarations of first and tick0. This is redundant, of course, since initialized variables have an implicit SAVE, but this fortran feature is nonintuitive and often trips up new fortran programmers, so the explicit declaration seems best here in this discussion.

You might need to look at the assembler code to see, but I would guess that with -O3 the ā€œBADā€ loop order is rearranged to be identical to the ā€œGOODā€ loop order, resulting in the observed identical times for gfortran. Apparently flang doesn’t do that?

1 Like