real*8, allocatable :: large_array(:,:,:)
integer :: i,j,k, ni,nj,nk
real*8 :: tic,toc, start, end
ni = 500
nj = 1000
nk = 1500
allocate ( large_array(ni,nj,nk) )
write (*,*) 'GOOD access'
call CPU_TIME(start)
tic = wall_time () ! wall time counter
do k = 1,nk
do j = 1,nj
do i = 1,ni
large_array(i,j,k) = i+j+k
end do
end do
end do
toc = wall_time() ! wall time counter
call CPU_TIME(end)
write(*,*) 'cpu time is: ', end-start
write(*,*) 'wall time is: ', toc-tic
write (*,*) 'BAD access'
call CPU_TIME(start)
tic = wall_time () ! wall time counter
do i = 1,ni
do j = 1,nj
do k = 1,nk
large_array(i,j,k) = i+j+k
end do
end do
end do
toc = wall_time() ! wall time counter
call CPU_TIME(end)
write(*,*) 'cpu time is: ', end-start
write(*,*) 'wall time is: ', toc-tic
contains
real function wall_time ()
integer*8 :: ticks, rate, start_tick = -1
call system_clock ( ticks, rate )
if ( start_tick == -1) start_tick = ticks
ticks = ticks - start_tick
wall_time = dble(ticks)/dble(rate)
write (*,*) 'wall_time :',ticks, rate, wall_time
end function wall_time
This is the code I used in my previous post. I explained why it failed and two ways to fix it (one suggested by you earlier), and I posted timing results with the fixed version.
Portability is not the problem, it is just an extension that is no longer necessary to use.
Back in the f77 days, I routinely used real*8 declarations because it was more portable than the standard conventions at that time. Thatâs right, it was nonstandard, but more portable. But now there are several alternatives to specify kind values that are standard and that can be used instead, and thatâs the way I think we should address this issue here in this discussion group so that new fortran programmers can get off to a good start in the language.
Just out of curiosity, is it also possible to try changing the first allocation statement from
allocate ( large_array(ni,nj,nk) )
to
allocate ( large_array(ni,nj,nk), source=0.0d0 )
on your laptop (I assume Intel + VisualStudio has been used throughout)?
(Here I just focus only on timing, not coding style.) On my mac, flang-21.18 shows an overhead of 0.3â0.4 sec in wall time for the âGOODâ case (because it involves the initial assignment of the array), while gfortran-15 does not seem to show any difference with the above change⊠(possibly gfortran does not perform the assignment with âsourceâ until the array elements are actually used? Then the timing seems more consistent on my mac.)
Also, it may be better to print the result of the array somehow after printing the time to avoid the unrealistic âoptimizationâ (like some calculation is optimized away), though in the above code this did not change the timing results on my macâŠ
Interestingly, the BAD case is slower than the GOOD case by about x20 for flang -O3,while the speed is almost the same for gfortran-15 -O3 (when I also output the sum of array values after time measurement). But the runtime is 2 times faster for flang than gfortran, so the timing seems complicated⊠(I will try the comparison more carefully later.)
RE the remaining difference of CPU_TIME and wall time on your laptop, I wonder if there is some âWindows-specificâ overhead or security check for OS-related functionalities like time measurement (whose effects may be different for CPU_TIME and system_clockâŠ?)
The reason the original code results in zero wall time was not due to integer or floating point overflow. It is due to the loss of precision when converting the integer values to, eventually, real32 floating point. Note that in the statement
wall_time = dble(ticks)/dble(rate)
one might think this would be alright because the initial conversions are real64 values, and then only at the end converted to real32. So lets print out a couple of actual values to see what is happening. I added the line:
at the end of the function, and here are some values that result:
GOOD access
1769195415213892000 1.7691954152138921E+018 1000000000 1000000000.0000000 1.76919539E+09
1769195416641231000 1.7691954166412311E+018 1000000000 1000000000.0000000 1.76919539E+09
cpu time is : 1.42736697
wall time is: 0.00000000
BAD access
1769195416641265000 1.7691954166412649E+018 1000000000 1000000000.0000000 1.76919539E+09
1769195424213143000 1.7691954242131430E+018 1000000000 1000000000.0000000 1.76919539E+09
cpu time is : 7.57191992
wall time is: 0.00000000
This is with gfortran on MacOS. As you can see, there appears to be no problem with the int64 values themselves; i.e. no overflow, and the values increase as expected between calls. And there is no problem here with the conversion to real64; some precision is lost, of course (64 integer bits going to 53 mantissa bits), but the difference is still large enough so that the computed times increase the correct amount. There is no problem with conversion of rate either, that integer value is converted correctly. But the conversion to real32 keeps only the high-order bits and discards the low-order bits, and the high-order mantissa bits are the same for those two large values, so the returned times are the same. Thus the differences of the returned times are zero, as we know they should be now that the individual steps are exposed.
There are three ways to fix this problem. One way is to declare all the times, including the wall_time() function result, to be real64. That change results in the output:
GOOD access
1769192977478843000 1.7691929774788429E+018 1000000000 1000000000.0000000 1769192977.4788430
1769192978912242000 1.7691929789122419E+018 1000000000 1000000000.0000000 1769192978.9122419
cpu time is: 1.4334149999999999
wall time is: 1.4333989620208740
BAD access
1769192978912272000 1.7691929789122719E+018 1000000000 1000000000.0000000 1769192978.9122720
1769192985396408000 1.7691929853964081E+018 1000000000 1000000000.0000000 1769192985.3964081
cpu time is: 6.4840510000000000
wall time is: 6.4841361045837402
Another way to fix the problem is to declare ticks and rate to be int32 instead of int64. With a real32 function result, this gives the output:
GOOD access
1817530319 1817530319.0000000 1000 1000.0000000000000 1817530.38
1817531752 1817531752.0000000 1000 1000.0000000000000 1817531.75
cpu time is : 1.43310905
wall time is: 1.37500000
BAD access
1817531752 1817531752.0000000 1000 1000.0000000000000 1817531.75
1817539168 1817539168.0000000 1000 1000.0000000000000 1817539.12
cpu time is : 7.41588593
wall time is: 7.37500000
This output might seem paradoxical because a less precise calculation (ticks and rate are now less precise than before) results in a more accurate final result, but it does make sense when you look at the computation step by step. The high-order bits in the ticks value no longer exist, leaving just the important low-order bits.
The third way to fix this problem is to manually eliminate the common high-order bits in the ticks value before returning the function result. This approach was given above by @JohnCampbell. Here is my version of that code:
function wall_time ()
real(real32) :: wall_time
integer(int64) :: ticks, rate
integer(int64), save :: tick0 = -1
logical, save :: first = .true.
call system_clock ( ticks, rate )
if ( first ) then
first = .false.
tick0 = ticks
endif
wall_time = real(ticks-tick0)/real(rate)
print*, ticks, tick0, (ticks-tick0), rate, wall_time
return
end function wall_time
GOOD access
1769194889704100000 1769194889704100000 0 1000000000 0.00000000
1769194891166727000 1769194889704100000 1462627000 1000000000 1.46262693
cpu time is : 1.46265602
wall time is: 1.46262693
BAD access
1769194891166762000 1769194889704100000 1462662000 1000000000 1.46266198
1769194897533354000 1769194889704100000 7829254000 1000000000 7.82925415
cpu time is : 6.36665297
wall time is: 6.36659241
I explicitly give the SAVE attribute in the declarations of first and tick0. This is redundant, of course, since initialized variables have an implicit SAVE, but this fortran feature is nonintuitive and often trips up new fortran programmers, so the explicit declaration seems best here in this discussion.
You might need to look at the assembler code to see, but I would guess that with -O3 the âBADâ loop order is rearranged to be identical to the âGOODâ loop order, resulting in the observed identical times for gfortran. Apparently flang doesnât do that?
Yes I am using OneAPI+VS2019. Also used OneAPI+VS2022. Results are similar.
About the small CPU_TIME and large wall time. I found it is a little bit strange.
â On my laptop which is a company laptop, with intel i7-1260p, I ran that serial Fortran + C++ code (no heavy I/O) I mentioned at the beginning multiple times, reported CPU_TIME is always around 40 seconds, while the reported wall time are always around 120 seconds.
â Now the same code, I run it on my own laptop which has xeon-2186M (this and i7-1260p both has AVX2, so the code built with -xHost flag on i7-1260p can be used on xeon-2186M), every time the reported wall time is around 100 seconds, so, similar with my company laptop, this reported wall time is consistent. However the reported CPU_TIME are very inconsistent. Sometimes it reports CPU_TIME as just 40 seconds, sometimes it reports CPU_TIME as about 100 seconds which is very similar with wall time.
So overall, I would just say, at least for this code, for some reason, cpu_time is just inconsistent, CPU_TIME may be more useful for measuring a small simple piece of code. While wall time is more consistent and reliable for a large complicated code.
This is pretty much the opposite of what is supposed to be observed. It is the wall times that are supposed to vary widely from run to run (as your cpu responds to external emails, or network pauses, or browser refreshes, etc.), while the cpu time should be the consistent value. I do not know why you are seeing the opposite.
In my previous reply, I was talking about how the high-order bits were cancelled, but then I showed the results as decimal numbers. I should have showed them in binary or hex format to demonstrate that. With the write statement
GOOD access
188DB6E6D313E658 188DB6E6D313E658 0000000000000000 0.0000E+00 000000003B9ACA00 1.0000E+09 0.0000E+00 0.0000E+00
188DB6E6E9AF15E0 188DB6E6D313E658 00000000169B2F88 3.7927E+08 000000003B9ACA00 1.0000E+09 0.0000E+00 3.7927E-01
cpu time is : 0.378356934
wall time is: 0.379269004
BAD access
188DB6E6E9AF6BD0 188DB6E6D313E658 00000000169B8578 3.7929E+08 000000003B9ACA00 1.0000E+09 0.0000E+00 3.7929E-01
188DB6E710B69DE8 188DB6E6D313E658 000000003DA2B790 1.0341E+09 000000003B9ACA00 1.0000E+09 0.0000E+00 1.0341E+00
cpu time is : 0.654802918
wall time is: 0.654782951
with this, you really can see the high-order bits cancelling out, leaving just the important low-order bits, which can then be successfully converted to the real32 function value.
You can also still see that the cpu time is slightly larger than the wall time. I still do not know for certain what is causing that. Iâm still guessing, as before, that it is some multitasking that is occurring somewhere within some system library routine.
Exceeding the available physical memory and resorting to virtual memory would be a likely explanation of the cause of a significant disruption to cpu usage and a significant wall time delay.
@CRquantum, are you able to provide measures of available memory vs memory required for the initial problem?
For my type of pc usage, I rarely have multiple programs running that demand a lot of memory, so these memory estimates are easy to make. Even with OpenMP multi-thread, the private memory demand of each thread is usually fairly clear.
Thanks much!
The program is a small serial code (no mpi, no openMP), only cost about 5MB memory, definitely no more than 10MB all the most , and very small and infrequent I/O. Today I ran the code several times on my company i7-1260p laptop and my personal xeon 2186M laptop, results are listed as follows:
Company i7-1260p (efficiency cores are all disabled in BIOS) laptop (Win 11 enterprise edition):
reported CPU time, consistently about 40±5 seconds.
reported Wall time, consistently about 110±10 seconds.
There are some background service running such as:
or windows defender sometimes perhaps. But they does not matter it seems. I disabled thinkpadâs Virtual Lock Sensor, but my CPU_TIME and wall time discrepancy is still there.
Personal xeon 2186M laptop (Win 11 Pro):
reported CPU time, consistently about 120±5 seconds.
reported Wall time, consistently about 120±5 seconds.
Sometimes windows defender are running perhaps.
if I watch a youtube video at the same time, the reported cpu time and wall time are still simialr with reported above.
I think the cpu time and wall time reported on my personal laptop are reasonable, and they are similar.
On my company laptop, the wall time (110±10 seconds) is reasonable (i7-1260p and xeon 2186M single core performance are similar, so wall time around 110-120 seconds are reasonable). However the CPU time reported are just way much lower than wall time. It gives a me a feeling that the CPU time on i7-1260p did not really record all the CPU time. This may be an example to show that, sometimes CPU time may not be a reliable measurement.
The original author of code use CPU_TIME all the time in the Fortran part of code (even when the code is in MPI mode), main reason is at that time (about 20 years ago) the author did not aware that Fortran can actually have wall time