Large wall time, small CPU_TIME, what does that tell you?

Sure. This one seems working:

Code:

   real*8, allocatable :: large_array(:,:,:)
   integer :: i,j,k, ni,nj,nk
   real*8    :: tic,toc, start, end

   ni = 500
   nj = 1000
   nk = 1500
   allocate ( large_array(ni,nj,nk) )

   write (*,*) 'GOOD access'
   call CPU_TIME(start)
   tic = wall_time () ! wall time counter
   
   do k = 1,nk
     do j = 1,nj
       do i = 1,ni
         large_array(i,j,k) = i+j+k
       end do
     end do
   end do

   toc = wall_time() ! wall time counter
   call CPU_TIME(end)

   write(*,*) 'cpu time is: ', end-start
   write(*,*) 'wall time is: ', toc-tic

   write (*,*) 'BAD  access'
   call CPU_TIME(start)
   tic = wall_time () ! wall time counter
   
   do i = 1,ni
     do j = 1,nj
       do k = 1,nk
         large_array(i,j,k) = i+j+k
       end do
     end do
   end do

   toc = wall_time() ! wall time counter
   call CPU_TIME(end)

   write(*,*) 'cpu time is: ', end-start
   write(*,*) 'wall time is: ', toc-tic

  contains

   real function wall_time ()
    integer*8 :: ticks, rate, start_tick = -1
    call system_clock ( ticks, rate )
     if ( start_tick == -1) start_tick = ticks
     ticks = ticks - start_tick
    wall_time = dble(ticks)/dble(rate)
    write (*,*) 'wall_time :',ticks, rate, wall_time
   end function wall_time 
1 Like

This is the code I used in my previous post. I explained why it failed and two ways to fix it (one suggested by you earlier), and I posted timing results with the fixed version.

Portability is not the problem, it is just an extension that is no longer necessary to use.

Back in the f77 days, I routinely used real*8 declarations because it was more portable than the standard conventions at that time. That’s right, it was nonstandard, but more portable. But now there are several alternatives to specify kind values that are standard and that can be used instead, and that’s the way I think we should address this issue here in this discussion group so that new fortran programmers can get off to a good start in the language.

1 Like

Thanks! I see. This issue may be remotely kind of related with the following two posts:

1 Like

Just out of curiosity, is it also possible to try changing the first allocation statement from

allocate ( large_array(ni,nj,nk) )

to

allocate ( large_array(ni,nj,nk), source=0.0d0 )

on your laptop (I assume Intel + VisualStudio has been used throughout)?
(Here I just focus only on timing, not coding style.) On my mac, flang-21.18 shows an overhead of 0.3–0.4 sec in wall time for the “GOOD” case (because it involves the initial assignment of the array), while gfortran-15 does not seem to show any difference with the above change
 (possibly gfortran does not perform the assignment with “source” until the array elements are actually used? Then the timing seems more consistent on my mac.)

Also, it may be better to print the result of the array somehow after printing the time to avoid the unrealistic “optimization” (like some calculation is optimized away), though in the above code this did not change the timing results on my mac


Interestingly, the BAD case is slower than the GOOD case by about x20 for flang -O3,while the speed is almost the same for gfortran-15 -O3 (when I also output the sum of array values after time measurement). But the runtime is 2 times faster for flang than gfortran, so the timing seems complicated
 (I will try the comparison more carefully later.)

RE the remaining difference of CPU_TIME and wall time on your laptop, I wonder if there is some “Windows-specific” overhead or security check for OS-related functionalities like time measurement (whose effects may be different for CPU_TIME and system_clock
?)

1 Like

The reason the original code results in zero wall time was not due to integer or floating point overflow. It is due to the loss of precision when converting the integer values to, eventually, real32 floating point. Note that in the statement

wall_time = dble(ticks)/dble(rate)

one might think this would be alright because the initial conversions are real64 values, and then only at the end converted to real32. So lets print out a couple of actual values to see what is happening. I added the line:

print*, ticks, dble(ticks), rate, dble(rate), wall_time  

at the end of the function, and here are some values that result:

 GOOD access
  1769195415213892000   1.7691954152138921E+018           1000000000   1000000000.0000000        1.76919539E+09
  1769195416641231000   1.7691954166412311E+018           1000000000   1000000000.0000000        1.76919539E+09
 cpu time is :    1.42736697    
 wall time is:    0.00000000    
 BAD  access
  1769195416641265000   1.7691954166412649E+018           1000000000   1000000000.0000000        1.76919539E+09
  1769195424213143000   1.7691954242131430E+018           1000000000   1000000000.0000000        1.76919539E+09
 cpu time is :    7.57191992    
 wall time is:    0.00000000   

This is with gfortran on MacOS. As you can see, there appears to be no problem with the int64 values themselves; i.e. no overflow, and the values increase as expected between calls. And there is no problem here with the conversion to real64; some precision is lost, of course (64 integer bits going to 53 mantissa bits), but the difference is still large enough so that the computed times increase the correct amount. There is no problem with conversion of rate either, that integer value is converted correctly. But the conversion to real32 keeps only the high-order bits and discards the low-order bits, and the high-order mantissa bits are the same for those two large values, so the returned times are the same. Thus the differences of the returned times are zero, as we know they should be now that the individual steps are exposed.

There are three ways to fix this problem. One way is to declare all the times, including the wall_time() function result, to be real64. That change results in the output:

 GOOD access
  1769192977478843000   1.7691929774788429E+018           1000000000   1000000000.0000000        1769192977.4788430     
  1769192978912242000   1.7691929789122419E+018           1000000000   1000000000.0000000        1769192978.9122419     
 cpu time is:    1.4334149999999999     
 wall time is:    1.4333989620208740     
 BAD  access
  1769192978912272000   1.7691929789122719E+018           1000000000   1000000000.0000000        1769192978.9122720     
  1769192985396408000   1.7691929853964081E+018           1000000000   1000000000.0000000        1769192985.3964081     
 cpu time is:    6.4840510000000000     
 wall time is:    6.4841361045837402     

Another way to fix the problem is to declare ticks and rate to be int32 instead of int64. With a real32 function result, this gives the output:

 GOOD access
  1817530319   1817530319.0000000             1000   1000.0000000000000        1817530.38    
  1817531752   1817531752.0000000             1000   1000.0000000000000        1817531.75    
 cpu time is :    1.43310905    
 wall time is:    1.37500000    
 BAD  access
  1817531752   1817531752.0000000             1000   1000.0000000000000        1817531.75    
  1817539168   1817539168.0000000             1000   1000.0000000000000        1817539.12    
 cpu time is :    7.41588593    
 wall time is:    7.37500000

This output might seem paradoxical because a less precise calculation (ticks and rate are now less precise than before) results in a more accurate final result, but it does make sense when you look at the computation step by step. The high-order bits in the ticks value no longer exist, leaving just the important low-order bits.

The third way to fix this problem is to manually eliminate the common high-order bits in the ticks value before returning the function result. This approach was given above by @JohnCampbell. Here is my version of that code:

   function wall_time ()
      real(real32) :: wall_time
      integer(int64) :: ticks, rate
      integer(int64), save :: tick0 = -1
      logical, save :: first = .true.
      call system_clock ( ticks, rate )
      if ( first ) then
         first = .false.
         tick0 = ticks
      endif
      wall_time = real(ticks-tick0)/real(rate)
      print*, ticks, tick0, (ticks-tick0), rate, wall_time
      return
   end function wall_time

  GOOD access
  1769194889704100000  1769194889704100000                    0           1000000000   0.00000000    
  1769194891166727000  1769194889704100000           1462627000           1000000000   1.46262693    
 cpu time is :    1.46265602    
 wall time is:    1.46262693    
 BAD  access
  1769194891166762000  1769194889704100000           1462662000           1000000000   1.46266198    
  1769194897533354000  1769194889704100000           7829254000           1000000000   7.82925415    
 cpu time is :    6.36665297    
 wall time is:    6.36659241 

I explicitly give the SAVE attribute in the declarations of first and tick0. This is redundant, of course, since initialized variables have an implicit SAVE, but this fortran feature is nonintuitive and often trips up new fortran programmers, so the explicit declaration seems best here in this discussion.

You might need to look at the assembler code to see, but I would guess that with -O3 the “BAD” loop order is rearranged to be identical to the “GOOD” loop order, resulting in the observed identical times for gfortran. Apparently flang doesn’t do that?

2 Likes

Sure.
With allocate ( large_array(ni,nj,nk) ) I got


With allocate ( large_array(ni,nj,nk), source=0.0d0 ) I got

Yes I am using OneAPI+VS2019. Also used OneAPI+VS2022. Results are similar.

About the small CPU_TIME and large wall time. I found it is a little bit strange.
→ On my laptop which is a company laptop, with intel i7-1260p, I ran that serial Fortran + C++ code (no heavy I/O) I mentioned at the beginning multiple times, reported CPU_TIME is always around 40 seconds, while the reported wall time are always around 120 seconds.
→ Now the same code, I run it on my own laptop which has xeon-2186M (this and i7-1260p both has AVX2, so the code built with -xHost flag on i7-1260p can be used on xeon-2186M), every time the reported wall time is around 100 seconds, so, similar with my company laptop, this reported wall time is consistent. However the reported CPU_TIME are very inconsistent. Sometimes it reports CPU_TIME as just 40 seconds, sometimes it reports CPU_TIME as about 100 seconds which is very similar with wall time.

So overall, I would just say, at least for this code, for some reason, cpu_time is just inconsistent, CPU_TIME may be more useful for measuring a small simple piece of code. While wall time is more consistent and reliable for a large complicated code.

1 Like

This is pretty much the opposite of what is supposed to be observed. It is the wall times that are supposed to vary widely from run to run (as your cpu responds to external emails, or network pauses, or browser refreshes, etc.), while the cpu time should be the consistent value. I do not know why you are seeing the opposite.

In my previous reply, I was talking about how the high-order bits were cancelled, but then I showed the results as decimal numbers. I should have showed them in binary or hex format to demonstrate that. With the write statement

      write(*,'(3z17.16,es11.4,z17.16,*(1x,es11.4))') &
           & ticks, tick0, (ticks-tick0), real(ticks-tick0), rate, real(rate), dble(rate)-real(rate), wall_time

Here is the output:

 GOOD access
 188DB6E6D313E658 188DB6E6D313E658 0000000000000000 0.0000E+00 000000003B9ACA00  1.0000E+09  0.0000E+00  0.0000E+00
 188DB6E6E9AF15E0 188DB6E6D313E658 00000000169B2F88 3.7927E+08 000000003B9ACA00  1.0000E+09  0.0000E+00  3.7927E-01
 cpu time is :   0.378356934    
 wall time is:   0.379269004    
 BAD  access
 188DB6E6E9AF6BD0 188DB6E6D313E658 00000000169B8578 3.7929E+08 000000003B9ACA00  1.0000E+09  0.0000E+00  3.7929E-01
 188DB6E710B69DE8 188DB6E6D313E658 000000003DA2B790 1.0341E+09 000000003B9ACA00  1.0000E+09  0.0000E+00  1.0341E+00
 cpu time is :   0.654802918    
 wall time is:   0.654782951    

with this, you really can see the high-order bits cancelling out, leaving just the important low-order bits, which can then be successfully converted to the real32 function value.

You can also still see that the cpu time is slightly larger than the wall time. I still do not know for certain what is causing that. I’m still guessing, as before, that it is some multitasking that is occurring somewhere within some system library routine.

1 Like

Exceeding the available physical memory and resorting to virtual memory would be a likely explanation of the cause of a significant disruption to cpu usage and a significant wall time delay.

@CRquantum, are you able to provide measures of available memory vs memory required for the initial problem?

For my type of pc usage, I rarely have multiple programs running that demand a lot of memory, so these memory estimates are easy to make. Even with OpenMP multi-thread, the private memory demand of each thread is usually fairly clear.

1 Like

Thanks much!
The program is a small serial code (no mpi, no openMP), only cost about 5MB memory, definitely no more than 10MB all the most :laughing:, and very small and infrequent I/O. Today I ran the code several times on my company i7-1260p laptop and my personal xeon 2186M laptop, results are listed as follows:

Company i7-1260p (efficiency cores are all disabled in BIOS) laptop (Win 11 enterprise edition):
reported CPU time, consistently about 40±5 seconds.
reported Wall time, consistently about 110±10 seconds.
There are some background service running such as:


or windows defender sometimes perhaps. But they does not matter it seems. I disabled thinkpad’s Virtual Lock Sensor, but my CPU_TIME and wall time discrepancy is still there.

Personal xeon 2186M laptop (Win 11 Pro):
reported CPU time, consistently about 120±5 seconds.
reported Wall time, consistently about 120±5 seconds.
Sometimes windows defender are running perhaps.
if I watch a youtube video at the same time, the reported cpu time and wall time are still simialr with reported above.

I think the cpu time and wall time reported on my personal laptop are reasonable, and they are similar.
On my company laptop, the wall time (110±10 seconds) is reasonable (i7-1260p and xeon 2186M single core performance are similar, so wall time around 110-120 seconds are reasonable). However the CPU time reported are just way much lower than wall time. It gives a me a feeling that the CPU time on i7-1260p did not really record all the CPU time. This may be an example to show that, sometimes CPU time may not be a reliable measurement.

The original author of code use CPU_TIME all the time in the Fortran part of code (even when the code is in MPI mode), main reason is at that time (about 20 years ago) the author did not aware that Fortran can actually have wall time :laughing:

1 Like