Speed of ALL(x==y) for large arrays

Based on the times for the code below, it appears that gfortran -O3 evaluates all(x==y) very fast for large arrays and is smart enough to not evaluate the full expressions x==y or sin(x)==sin(y). My loop equivalent to all(x==y) is slower. Generating random x(:) and setting y(:) with merge takes the most time. I am a little surprised at the speed of all(), and I may need to amend a tip I posted.

module m
implicit none
private
public :: wp, all_equal
integer, parameter :: wp = kind(1.0d0)
contains
pure function all_equal(x,y) result(x_eq_y)
real(kind=wp), intent(in) :: x(:),y(:)
logical                   :: x_eq_y
integer                   :: i,n
x_eq_y = .false.
n = size(x)
if (size(y) /= n) return
do i=1,n
   if (x(i) /= y(i)) return
end do
x_eq_y = .true.
end function all_equal
end module m
!
program xall_equal
use m, only: wp, all_equal
implicit none
integer, parameter :: n = 10**8, niter = 10, nt = 6
real(kind=wp)      :: x(n),y(n),t(nt),dt(nt-1)
integer            :: iter
logical            :: tf_1,tf_2,tf_3
real(kind=wp), parameter :: frac_diff = 1.0d-5
dt = 0.0_wp
do iter=1,niter
   call cpu_time(t(1))
   call random_number(x)           ; call cpu_time(t(2))
   ! set y equal to x except for the fraction frac_diff of elements
   y = merge(x,2.0_wp,x>frac_diff) ; call cpu_time(t(3))
   tf_1 = all_equal(x,y)           ; call cpu_time(t(4))
   tf_2 = all(x == y)              ; call cpu_time(t(5))
   ! see if the compiler fully evaluates sin(x) and sin(y), which would be slow
   tf_3 = all(sin(x) == sin(y))    ; call cpu_time(t(6))
   if (tf_1 .neqv. tf_2 .or. tf_2 .neqv. tf_3) error stop "tf_1, tf_2, tf_3 should be equal"
   dt = dt + t(2:nt) - t(1:nt-1)
end do
print "(/,'times',/,*(a12,1x))","RNG","merge","all_equal","all(x==y)","all_sin"
print "(*(f12.8,1x))",dt
end program xall_equal

output:

times
         RNG        merge    all_equal    all(x==y)      all_sin
 12.75000000   7.51562500   0.00000000   0.01562500   0.10937500
3 Likes

If all were really short circuiting, wouldn’t it take the same amount of time as your all_equal function?

Modifying the program to run the simulation 10 times
module m
implicit none
private
public :: wp, all_equal
integer, parameter :: wp = kind(1.0d0)
contains
pure function all_equal(x,y) result(x_eq_y)
real(kind=wp), intent(in) :: x(:),y(:)
logical                   :: x_eq_y
integer                   :: i,n
x_eq_y = .false.
n = size(x)
if (size(y) /= n) return
do i=1,n
   if (x(i) /= y(i)) return
end do
x_eq_y = .true.
end function all_equal
end module m
!
program xall_equal
use m, only: wp, all_equal
implicit none
integer, parameter :: n = 10**8, niter = 10, nt = 6, nruns = 10
real(kind=wp)      :: x(n),y(n),t(nt),dt(nt-1)
integer            :: irun,iter
logical            :: tf_1,tf_2,tf_3
real(kind=wp), parameter :: frac_diff = 1.0d-5
print "(/,'times',/,*(a12,1x))","RNG","merge","all_equal","all(x==y)","all_sin"
do irun=1,nruns
   dt = 0.0_wp
   do iter=1,niter
      call cpu_time(t(1))
      call random_number(x)           ; call cpu_time(t(2))
      ! set y equal to x except for the fraction frac_diff of elements
      y = merge(x,2.0_wp,x>frac_diff) ; call cpu_time(t(3))
      tf_1 = all_equal(x,y)           ; call cpu_time(t(4))
      tf_2 = all(x == y)              ; call cpu_time(t(5))
      ! see if the compiler fully evaluates sin(x) and sin(y), which would be slow
      tf_3 = all(sin(x) == sin(y))    ; call cpu_time(t(6))
      if (tf_1 .neqv. tf_2 .or. tf_2 .neqv. tf_3) error stop "tf_1, tf_2, tf_3 should be equal"
      dt = dt + t(2:nt) - t(1:nt-1)
   end do
   print "(*(f12.8,1x))",dt
end do
end program xall_equal

both all and all_equal run too fast to reliably measure their execution times, as shown below. Since all_equal is short-circuiting, I think all must be too.

times
         RNG        merge    all_equal    all(x==y)      all_sin
 10.45312500   2.81250000   0.00000000   0.00000000   0.09375000
  9.67187500   2.31250000   0.00000000   0.00000000   0.10937500
  9.67187500   2.39062500   0.00000000   0.00000000   0.12500000
  9.64062500   2.29687500   0.00000000   0.01562500   0.14062500
  9.76562500   2.39062500   0.00000000   0.00000000   0.07812500
  9.64062500   2.32812500   0.01562500   0.00000000   0.03125000
  9.65625000   2.34375000   0.00000000   0.00000000   0.14062500
  9.84375000   2.37500000   0.00000000   0.00000000   0.10937500
  9.70312500   2.34375000   0.00000000   0.00000000   0.07812500
  9.67187500   2.32812500   0.00000000   0.00000000   0.10937500
2 Likes

Instead of using merge, which can cause the short-circuiting to occur after only a few elements, it may be better to change only a single value.

      ! set y equal to x except for one value
      y = x
      y(max(int(n * x(1)),n/2)) = 2

times
         RNG        merge    all_equal    all(x==y)      all_sin
  1.05091300   0.63109800   0.19972300   0.19834900   4.15434400

So, likely a negligible difference between your function and all. You can prove this with the -fdump-tree-optimized option.

1 Like

Shouldn’t we expect the timing to be quite dependent on the distribution of the data values and the specifics of the compiler and its runtime?

I put in @kargl’s suggested change, and added

   tf_4 = .not.any(x /= y)         ; call cpu_time(t(7))

and obtained the following timings on an NUC with an i7-10710:

           RNG          merge      all_equal      all(x==y)        all_sin .not.any(x/=y)
    3.75100000     1.03100000     0.46800000     0.44000000     5.37400000     0.37300000

[side comment: The forum formatter seems to regard program text following a semicolon in code as a comment – as it is in some assembly languages]

(base) urbanjs@venus:/tmp$ ifort -O3 tmp2.f90
(base) urbanjs@venus:/tmp$ ./a.out

times
         RNG        merge    all_equal    all(x==y)      all_sin
 36.04356200   1.48177000   0.00139300   1.09523400  16.40145700

(base) urbanjs@venus:/tmp$ nvfortran -O3 tmp2.f90
(base) urbanjs@venus:/tmp$ ./a.out
gfortran -O3 tmp2.f90
./a.out

times
         RNG        merge    all_equal    all(x==y)      all_sin
  2.64983916   1.87623382   0.00148511   1.44972777  20.16094804

(base) urbanjs@venus:/tmp$ gfortran -O3 tmp2.f90
(base) urbanjs@venus:/tmp$ ./a.out

times
         RNG        merge    all_equal    all(x==y)      all_sin
  7.74313400   1.88132500   0.00248500   0.00233300   0.04209800

Why Fortranners are notorious for rolling their own. Nothing in the standard says one way or another when and where a short-circuit will occur, but ANY and ALL take any type and order (so I use them a lot when intelligibility and convenience make it justified).

Optimization is still very platform-specific although it gets better all the time. And the differences in the PRNG might be a quality issue as well (trading speed for quality) so once you are into optimizing taking anything for granted is a risk, and profiling the code (which is also part art) becomes more important, as what is no problem in one PE (programming environment) might be in another, or flipping a compiler switch can change everything.

PS: short-circuiting is almost certainly the biggest cause in the time differences, and there are times when memory use would be more important, but notice that when they are the same the fastest time above for the sin(x) == sin(y) comparison becomes the slowest; so if that was the common case you would see totally different issues

         RNG        merge    all_equal    all(x==y)      all_sin
  7.81182000   1.48567100   1.50350300   1.61926500  29.71101300 gfortran
 36.24548100   1.42185800   1.04758700   1.09871900  16.39151000 ifort
  2.65723276   1.48142457   1.57551217   1.46997046  21.24921322 nvfortran