real*8, allocatable :: large_array(:,:,:)
integer :: i, j, k, ni, nj, nk
real*8 :: tic, toc, start, end

ni = 500
nj = 1000
nk = 1500
allocate ( large_array(ni,nj,nk) )

write (*,*) 'GOOD access'
call CPU_TIME(start)
tic = wall_time ()          ! wall time counter
do k = 1, nk                ! innermost loop runs over the first index: contiguous (stride-1) access
   do j = 1, nj
      do i = 1, ni
         large_array(i,j,k) = i+j+k
      end do
   end do
end do
toc = wall_time ()          ! wall time counter
call CPU_TIME(end)
write(*,*) 'cpu time is: ', end-start
write(*,*) 'wall time is: ', toc-tic

write (*,*) 'BAD access'
call CPU_TIME(start)
tic = wall_time ()          ! wall time counter
do i = 1, ni                ! innermost loop runs over the last index: large-stride access
   do j = 1, nj
      do k = 1, nk
         large_array(i,j,k) = i+j+k
      end do
   end do
end do
toc = wall_time ()          ! wall time counter
call CPU_TIME(end)
write(*,*) 'cpu time is: ', end-start
write(*,*) 'wall time is: ', toc-tic

contains

   real function wall_time ()
      integer*8 :: ticks, rate, start_tick = -1   ! the initializer gives start_tick the SAVE attribute implicitly
      call system_clock ( ticks, rate )
      if ( start_tick == -1 ) start_tick = ticks
      ticks = ticks - start_tick
      wall_time = dble(ticks)/dble(rate)
      write (*,*) 'wall_time :', ticks, rate, wall_time
   end function wall_time

end
This is the code I used in my previous post. I explained why it failed and two ways to fix it (one suggested by you earlier), and I posted timing results with the fixed version.
Portability is not the problem; it is just an extension that is no longer necessary to use.
Back in the f77 days, I routinely used real*8 declarations because they were more portable than the standard conventions at that time. That's right, it was nonstandard, but more portable. Now, however, there are several standard ways to specify kind values that can be used instead, and that is how I think we should address this issue here in this discussion group, so that new Fortran programmers can get off to a good start in the language.
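For example, here is a minimal sketch (program and parameter names are just illustrative) of the standard ways to request a double-precision kind:

program kind_demo
   use, intrinsic :: iso_fortran_env, only: real64    ! standard named kind constant (F2008)
   implicit none
   integer, parameter :: dp = kind(1.0d0)             ! or: selected_real_kind(p=15, r=307)
   real(real64) :: a       ! instead of real*8
   real(dp)     :: b
   a = 1.0_real64
   b = 2.0_dp
   print *, a, b
end program kind_demo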
Just out of curiosity, is it also possible to try changing the first allocation statement from
allocate ( large_array(ni,nj,nk) )
to
allocate ( large_array(ni,nj,nk), source=0.0d0 )
on your laptop (I assume Intel + Visual Studio has been used throughout)?
(Here I focus only on timing, not coding style.) On my Mac, flang-21.18 shows an overhead of 0.3–0.4 sec in wall time for the “GOOD” case (because it involves the initial assignment of the array), while gfortran-15 does not seem to show any difference with the above change… (Possibly gfortran does not perform the assignment with “source” until the array elements are actually used? Then the timing would seem more consistent on my Mac.)
Also, it may be better to print the result of the array somehow after printing the time, to avoid unrealistic “optimization” (i.e., some calculation being optimized away entirely), though in the above code this did not change the timing results on my Mac…
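For instance (just a sketch; any reduction over the array would do), one could add a line like this after each timed loop nest:

write (*,*) 'checksum = ', sum(large_array)   ! referencing the result keeps the stores from being optimized away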
Interestingly, the BAD case is slower than the GOOD case by about 20x for flang -O3, while the speed is almost the same for gfortran-15 -O3 (when I also output the sum of array values after the time measurement). But the runtime is about 2 times faster for flang than for gfortran, so the timing seems complicated… (I will try the comparison more carefully later.)
Re the remaining difference between CPU_TIME and wall time on your laptop, I wonder if there is some “Windows-specific” overhead or security check on OS-related functionality like time measurement (whose effect might differ between CPU_TIME and system_clock…?)
The reason the original code results in zero wall time is not integer or floating-point overflow; it is the loss of precision when the integer values are eventually converted to real32 floating point. Note that in the statement
wall_time = dble(ticks)/dble(rate)
one might think this would be all right because the initial conversions are to real64 values, and only the final result is converted to real32. So let's print out a couple of actual values to see what is happening. I added the line

print *, ticks, dble(ticks), rate, dble(rate), wall_time

at the end of the function, and here are some values that result:
GOOD access
1769195415213892000 1.7691954152138921E+018 1000000000 1000000000.0000000 1.76919539E+09
1769195416641231000 1.7691954166412311E+018 1000000000 1000000000.0000000 1.76919539E+09
cpu time is : 1.42736697
wall time is: 0.00000000
BAD access
1769195416641265000 1.7691954166412649E+018 1000000000 1000000000.0000000 1.76919539E+09
1769195424213143000 1.7691954242131430E+018 1000000000 1000000000.0000000 1.76919539E+09
cpu time is : 7.57191992
wall time is: 0.00000000
This is with gfortran on macOS. As you can see, there appears to be no problem with the int64 values themselves; i.e., no overflow, and the values increase as expected between calls. And there is no problem here with the conversion to real64; some precision is lost, of course (64 integer bits going to 53 mantissa bits), but the difference is still large enough so that the computed times increase by the correct amount. There is no problem with the conversion of rate either; that integer value is converted correctly. But the conversion to real32 keeps only the high-order bits and discards the low-order bits, and the high-order mantissa bits are the same for those two large values, so the returned times are the same. Thus the differences of the returned times are zero, as we know they should be now that the individual steps are exposed.
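To see this step in isolation, here is a small sketch (the program name is illustrative) using the two tick values from the output above:

program real32_rounding
   use, intrinsic :: iso_fortran_env, only: int64, real32, real64
   implicit none
   integer(int64), parameter :: t1 = 1769195415213892000_int64   ! first tick value from the output above
   integer(int64), parameter :: t2 = 1769195416641231000_int64   ! second tick value, ~1.43 s later
   integer(int64), parameter :: rate = 1000000000_int64
   real(real64) :: s1, s2
   s1 = dble(t1)/dble(rate)                     ! 1769195415.2138920: real64 still resolves the difference
   s2 = dble(t2)/dble(rate)                     ! 1769195416.6412311
   print *, s2 - s1                             ! ~1.43 s, as expected
   print *, real(s1, real32), real(s2, real32)  ! both round to 1.76919539E+09
   print *, real(s2, real32) - real(s1, real32) ! zero: the real32 spacing near 1.77e9 is 128 s
end program real32_rounding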
There are three ways to fix this problem. One way is to declare all the times, including the wall_time() function result, to be real64. That change results in the output:
GOOD access
1769192977478843000 1.7691929774788429E+018 1000000000 1000000000.0000000 1769192977.4788430
1769192978912242000 1.7691929789122419E+018 1000000000 1000000000.0000000 1769192978.9122419
cpu time is: 1.4334149999999999
wall time is: 1.4333989620208740
BAD access
1769192978912272000 1.7691929789122719E+018 1000000000 1000000000.0000000 1769192978.9122720
1769192985396408000 1.7691929853964081E+018 1000000000 1000000000.0000000 1769192985.3964081
cpu time is: 6.4840510000000000
wall time is: 6.4841361045837402
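For reference, a minimal sketch of this first fix (keeping the same diagnostic print) might look like:

function wall_time ()
   use, intrinsic :: iso_fortran_env, only: int64, real64
   real(real64) :: wall_time            ! real64 result: no low-order bits are discarded
   integer(int64) :: ticks, rate
   call system_clock ( ticks, rate )
   wall_time = dble(ticks)/dble(rate)
   print *, ticks, dble(ticks), rate, dble(rate), wall_time
end function wall_time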
Another way to fix the problem is to declare ticks and rate to be int32 instead of int64. With a real32 function result, this gives the output:
GOOD access
1817530319 1817530319.0000000 1000 1000.0000000000000 1817530.38
1817531752 1817531752.0000000 1000 1000.0000000000000 1817531.75
cpu time is : 1.43310905
wall time is: 1.37500000
BAD access
1817531752 1817531752.0000000 1000 1000.0000000000000 1817531.75
1817539168 1817539168.0000000 1000 1000.0000000000000 1817539.12
cpu time is : 7.41588593
wall time is: 7.37500000
This output might seem paradoxical because a less precise calculation (ticks and rate are now less precise than before) results in a more accurate final result, but it does make sense when you look at the computation step by step. The high-order bits in the ticks value no longer exist, leaving just the important low-order bits.
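A sketch of this second variant (only the kinds of ticks and rate change; the computation and the diagnostic print are as above):

function wall_time ()
   use, intrinsic :: iso_fortran_env, only: int32
   real :: wall_time                    ! function result is still default real32
   integer(int32) :: ticks, rate        ! int32 clock: smaller count, rate = 1000 in the output above
   call system_clock ( ticks, rate )
   wall_time = dble(ticks)/dble(rate)
   print *, ticks, dble(ticks), rate, dble(rate), wall_time
end function wall_time

Note, incidentally, that the returned times are now quantized to the real32 spacing near 1.8e6, which is 0.125 s; that is why the wall times above print as 1.375 and 7.375.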
The third way to fix this problem is to manually eliminate the common high-order bits in the ticks value before returning the function result. This approach was given above by @JohnCampbell. Here is my version of that code:
function wall_time ()
   use, intrinsic :: iso_fortran_env, only: real32, int64
   real(real32) :: wall_time
   integer(int64) :: ticks, rate
   integer(int64), save :: tick0 = -1
   logical, save :: first = .true.
   call system_clock ( ticks, rate )
   if ( first ) then
      first = .false.
      tick0 = ticks
   endif
   wall_time = real(ticks-tick0)/real(rate)    ! the common high-order bits cancel before the real32 conversion
   print*, ticks, tick0, (ticks-tick0), rate, wall_time
   return
end function wall_time
GOOD access
1769194889704100000 1769194889704100000 0 1000000000 0.00000000
1769194891166727000 1769194889704100000 1462627000 1000000000 1.46262693
cpu time is : 1.46265602
wall time is: 1.46262693
BAD access
1769194891166762000 1769194889704100000 1462662000 1000000000 1.46266198
1769194897533354000 1769194889704100000 7829254000 1000000000 7.82925415
cpu time is : 6.36665297
wall time is: 6.36659241
I explicitly give the SAVE attribute in the declarations of first and tick0. This is redundant, of course, since initialized variables have an implicit SAVE, but this Fortran feature is nonintuitive and often trips up new Fortran programmers, so the explicit declaration seems best here in this discussion.
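As a reminder of the trap being avoided, here is a tiny sketch (names are illustrative): a local variable with an initializer keeps its value across calls, exactly as if SAVE had been written.

program implicit_save_demo
   implicit none
   print *, call_count()   ! prints 1
   print *, call_count()   ! prints 2, not 1: n was implicitly SAVEd
contains
   integer function call_count ()
      integer :: n = 0     ! the initializer gives n the SAVE attribute implicitly
      n = n + 1
      call_count = n
   end function call_count
end program implicit_save_demo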
You might need to look at the assembler code to see, but I would guess that with -O3 the “BAD” loop order is rearranged to be identical to the “GOOD” loop order, resulting in the observed identical times for gfortran. Apparently flang doesn't do that?