Speed of array intrinsics

The array intrinsic functions of Fortran are convenient, but it appears that if you are doing multiple calculations on the same array, it may be faster to write your own function. For the program

module min_max_mod
implicit none
integer, parameter :: wp = kind(1.0d0)
contains
pure function min_max(x) result(y)
real(kind=wp), intent(in) :: x(:)
real(kind=wp)             :: y(2)
integer                   :: i,n
n = size(x)
if (n < 1) return
y(1) = x(1)
y(2) = y(1)
do i=2,n
   if (x(i) < y(1)) y(1) = x(i)
   if (x(i) > y(2)) y(2) = x(i)
end do
end function min_max
!
pure function min_max_mean(x) result(y)
real(kind=wp), intent(in) :: x(:)
real(kind=wp)             :: y(3)
integer                   :: i,n
n = size(x)
if (n < 1) return
y(1)   = x(1)
y(2:3) = y(1)
do i=2,n
   if (x(i) < y(1)) y(1) = x(i)
   if (x(i) > y(2)) y(2) = x(i)
   y(3) = y(3) + x(i)
end do
if (n > 0) y(3) = y(3)/n
end function min_max_mean
end module min_max_mod
!
program xmin_max
use min_max_mod
implicit none
integer :: i,n
integer, parameter :: nt = 5, nlen = 30
real(kind=wp), allocatable :: x(:)
real(kind=wp)              :: extremes(2,2),t(nt),xmin_max_mean(3,2)
character (len=nlen) :: labels(nt-1) = [character(len=nlen) :: &
   "minval(x), maxval(x)","min_max(x)", &
   "minval(x),maxval(x),sum(x)/n","min_max_mean(x)"]
character (len=*), parameter :: fmt_cr = "(a20,*(1x,f16.12))"
n = 10**8
allocate (x(n))
call random_number(x)
call cpu_time(t(1))
extremes(:,1) = [minval(x),maxval(x)]
call cpu_time(t(2))
extremes(:,2) = min_max(x)
call cpu_time(t(3))
xmin_max_mean(:,2) = [minval(x),maxval(x),sum(x)/n]
call cpu_time(t(4))
xmin_max_mean(:,1) = min_max_mean(x)
call cpu_time(t(5))
print fmt_cr,"min, max:",extremes(:,1)
print fmt_cr,"min, max:",extremes(:,2)
print fmt_cr,"min, max, mean:",xmin_max_mean(:,1)
print fmt_cr,"min, max, mean:",xmin_max_mean(:,2)
print "(/,a8,5x,a30)","time","task"
print "(f8.4,5x,a30)",(t(i+1)-t(i),trim(labels(i)),i=1,nt-1)
end program xmin_max

the output for gfortran -O3 on WSL is

           min, max:   0.000000009614   0.999999996113
           min, max:   0.000000009614   0.999999996113
     min, max, mean:   0.000000009614   0.999999996113   0.499997321868
     min, max, mean:   0.000000009614   0.999999996113   0.499997321868

    time                               task
  0.2214               minval(x), maxval(x)
  0.1194                         min_max(x)
  0.3329       minval(x),maxval(x),sum(x)/n
  0.2380                    min_max_mean(x)

and for ifort -O3 on WSL it is

           min, max:   0.000000003725   0.999999968801
           min, max:   0.000000003725   0.999999968801
     min, max, mean:   0.000000003725   0.999999968801   0.499970722298
     min, max, mean:   0.000000003725   0.999999968801   0.499970722298

    time                               task
  0.1138               minval(x), maxval(x)
  0.0624                         min_max(x)
  0.1568       minval(x),maxval(x),sum(x)/n
  0.0657                    min_max_mean(x)
2 Likes

If one wants the largest and smallest elements of an array of intrinsic type, the compiler’s optimization quality is what counts. In your function min_max, 2(n-1) comparisons are required. If each element of the array is of a user-defined type composite variable, it may be worthwhile to reduce the number of comparisons. For example, the following algorithm gets by with 3.n/2 - 2 comparisons, when n is even.

  1. Take the first two elements, compare their keys and use the results to initialize key_min and key_max.
  2. Take the next pair, identify which of the two is larger. Compare the smaller to the previous min. Compare the larger to the previous max.
  3. Repeat Step 2 until all pairs are processed.

If the size of the array, n, is odd, modify the algorithm to suit.

2 Likes

Thanks for the suggestion, but I am finding that algorithm to be slower. For the code

module min_max_mod
implicit none
integer, parameter :: wp = kind(1.0d0)
contains
pure function min_max(x) result(y)
real(kind=wp), intent(in) :: x(:)
real(kind=wp)             :: y(2)
integer                   :: i,n
n = size(x)
if (n < 1) return
y(1) = x(1)
y(2) = y(1)
do i=2,n
   if (x(i) < y(1)) y(1) = x(i)
   if (x(i) > y(2)) y(2) = x(i)
end do
end function min_max
!
pure function min_max_pairs(x) result(y)
real(kind=wp), intent(in) :: x(:)
real(kind=wp)             :: y(2)
integer                   :: i,n
logical                   :: odd
n = size(x)
if (n < 1) return
if (n == 1) then
   y(1) = x(1)
   y(2) = x(2)
   return
end if
odd = mod(n,2) == 1
if (x(2) > x(1)) then
   y(1) = x(1)
   y(2) = x(2)
else
   y(1) = x(2)
   y(2) = x(1)
end if
do i=3,n-1,2
   if (x(i+1) > x(i)) then
      if (y(1) > x(i))   y(1) = x(i)
      if (y(2) < x(i+1)) y(2) = x(i+1)
   else
      if (y(1) > x(i+1)) y(1) = x(i+1)
      if (y(2) < x(i))   y(2) = x(i)
   end if
end do
if (odd) then
   y(1) = min(y(1),x(n))
   y(2) = max(y(1),x(n))
end if
end function min_max_pairs
!
pure function min_max_local(x) result(y)
! same as min_max_pairs but sets 
! xi to x(i) and xi1 to x(i+1)
real(kind=wp), intent(in) :: x(:)
real(kind=wp)             :: y(2)
real(kind=wp)             :: xi,xi1
integer                   :: i,n
logical                   :: odd
n = size(x)
if (n < 1) return
if (n == 1) then
   y(1) = x(1)
   y(2) = x(2)
   return
end if
odd = mod(n,2) == 1
if (x(2) > x(1)) then
   y(1) = x(1)
   y(2) = x(2)
else
   y(1) = x(2)
   y(2) = x(1)
end if
do i=3,n-1,2
   xi1 = x(i+1)
   xi  = x(i)
   if (xi1 > xi) then
      if (y(1) > xi)   y(1) = xi
      if (y(2) < xi1) y(2) = xi1
   else
      if (y(1) > xi1) y(1) = xi1
      if (y(2) < xi)   y(2) = xi
   end if
end do
if (odd) then
   y(1) = min(y(1),x(n))
   y(2) = max(y(1),x(n))
end if
end function min_max_local
!
pure function min_max_mean(x) result(y)
real(kind=wp), intent(in) :: x(:)
real(kind=wp)             :: y(3)
integer                   :: i,n
n = size(x)
if (n < 1) return
y(1)   = x(1)
y(2:3) = y(1)
do i=2,n
   if (x(i) < y(1)) y(1) = x(i)
   if (x(i) > y(2)) y(2) = x(i)
   y(3) = y(3) + x(i)
end do
if (n > 0) y(3) = y(3)/n
end function min_max_mean
end module min_max_mod
!
program xmin_max
use min_max_mod
implicit none
integer :: i,n
integer, parameter :: nt = 7, nlen = 30, ncol_extremes = 4
integer                    :: icol
real(kind=wp), allocatable :: x(:)
real(kind=wp)              :: extremes(2,ncol_extremes),t(nt),xmin_max_mean(3,2)
character (len=nlen) :: labels(nt-1) = [character(len=nlen) :: &
   "minval(x), maxval(x)","min_max(x)","min_max_pairs(x)", &
   "min_max_local(x)","minval(x),maxval(x),sum(x)/n","min_max_mean(x)"]
character (len=*), parameter :: fmt_cr = "(a20,*(1x,f16.12))"
n = 10**8
allocate (x(n))
call random_number(x)
call cpu_time(t(1))
extremes(:,1) = [minval(x),maxval(x)]
call cpu_time(t(2))
extremes(:,2) = min_max(x)
call cpu_time(t(3))
extremes(:,3) = min_max_pairs(x)
call cpu_time(t(4))
extremes(:,4) = min_max_local(x)
call cpu_time(t(5))
xmin_max_mean(:,2) = [minval(x),maxval(x),sum(x)/n]
call cpu_time(t(6))
xmin_max_mean(:,1) = min_max_mean(x)
call cpu_time(t(7))
do icol=1,ncol_extremes
   print fmt_cr,"min, max:",extremes(:,icol)
end do
print fmt_cr,"min, max, mean:",xmin_max_mean(:,1)
print fmt_cr,"min, max, mean:",xmin_max_mean(:,2)
print "(/,a8,5x,a30)","time","task"
print "(f8.4,5x,a30)",(t(i+1)-t(i),trim(labels(i)),i=1,nt-1)
end program xmin_max

with gfortran -O3 on WSL2 I get

           min, max:   0.000000002249   0.999999986684
           min, max:   0.000000002249   0.999999986684
           min, max:   0.000000002249   0.999999986684
           min, max:   0.000000002249   0.999999986684
     min, max, mean:   0.000000002249   0.999999986684   0.499991319230
     min, max, mean:   0.000000002249   0.999999986684   0.499991319230

    time                               task
  0.2200               minval(x), maxval(x)
  0.1264                         min_max(x)
  0.2363                   min_max_pairs(x)
  0.2429                   min_max_local(x)
  0.3305       minval(x),maxval(x),sum(x)/n
  0.2371                    min_max_mean(x)

where min_max_pairs and min_max_local are the functions that try to reduce the number of comparisons, and with ifort -O3 on WSL2 I get

           min, max:   0.000000003725   0.999999968801
           min, max:   0.000000003725   0.999999968801
           min, max:   0.000000003725   0.999999968801
           min, max:   0.000000003725   0.999999968801
     min, max, mean:   0.000000003725   0.999999968801   0.499970722298
     min, max, mean:   0.000000003725   0.999999968801   0.499970722298

    time                               task
  0.1142               minval(x), maxval(x)
  0.0630                         min_max(x)
  0.2162                   min_max_pairs(x)
  0.2153                   min_max_local(x)
  0.1548       minval(x),maxval(x),sum(x)/n
  0.0674                    min_max_mean(x)

Interesting cases.
I found for my Win7-gfortran 11.1 on an i5-2300:
using the pairs approach was the slowest approach,
using more optimisation helped, but
using -O3 -ffast-math with intrisnics was the fastest.

Why don’t pairs do better :frowning:

Of interest, my minor changes to min_max, min_max_pair and timer, which achieved very little were:

  pure function min_max_new(x) result(y)
  ! same as min_max but different starting case
  real(kind=wp), intent(in) :: x(:)
  real(kind=wp)             :: y(2)
  integer                   :: i,n

  n = size(x)
  if (n < 1) return

   y = x(n)

   do i = 1,n-1
     if ( x(i) < y(1) ) y(1) = x(i)
     if ( x(i) > y(2) ) y(2) = x(i)
   end do

  end function min_max_new

  pure function min_max_dev(x) result(y)
  ! same as min_max_pairs, but different start case eliminates ODD
  real(kind=wp), intent(in) :: x(:)
  real(kind=wp)             :: y(2)
  integer                   :: i,n

  n = size(x)
  if (n < 1) return

   y = x(n)

   do i = 1,n-1,2
     if ( x(i) > x(i+1) ) then
       if ( x(i+1) < y(1) ) y(1) = x(i+1)
       if ( x(i)   > y(2) ) y(2) = x(i)  
     else
       if ( x(i)   < y(1) ) y(1) = x(i)  
       if ( x(i+1) > y(2) ) y(2) = x(i+1)
     end if
   end do

  end function min_max_dev

  subroutine elapse_time (sec)
  real(kind=wp)  :: sec
  integer*8 :: tick, rate
!z  call cpu_time (sec)
  call system_clock ( tick, rate )
  sec = dble(tick) / dble(rate)
  end subroutine elapse_time

It’s not the fastest for me, but it’s indeed 2x faster with -ffast-math or -Ofast

I think because this involves too many if/then/else that break some optimizations, while the intrinsics min/max are quite fast. As a matter of fact, with gfortran v1 = min(v1,a(i)) is much faster than if (a(i) < v1) v1 = a(i), at least with safe optimizations.

I’ve compared minval/maxval, 2 separate hand-made loops with min/max intrinsics, 1 single hand-made loop with min/max intrinsics, and 1 single hand-made loop with 2 flavoes of if tests:

$ gfortran -O3 minmax.f90 && ./a.out 
minval/maxval      0.180815E-07   0.100000E+01     time=   0.218761E+00
2 loops            0.180815E-07   0.100000E+01     time=   0.968360E-01
1 loop             0.180815E-07   0.100000E+01     time=   0.469800E-01
1 loop with if1    0.180815E-07   0.100000E+01     time=   0.163348E+00
1 loop with if2    0.180815E-07   0.100000E+01     time=   0.127367E+00

$ gfortran -O3 -Ofast minmax.f90 && ./a.out 
minval/maxval      0.183796E-07   0.100000E+01     time=   0.958210E-01
2 loops            0.183796E-07   0.100000E+01     time=   0.874210E-01
1 loop             0.183796E-07   0.100000E+01     time=   0.453200E-01
1 loop with if1    0.183796E-07   0.100000E+01     time=   0.766570E-01
1 loop with if2    0.183796E-07   0.100000E+01     time=   0.834430E-01

In all cases the single loop with min/max intrinsics is the fastest. What is weird is that without -Ofast, the code with 2 loops is much faster than minval/maxval.

program minmaxs
use iso_fortran_env
implicit none

integer, parameter :: wp=real64
real(wp), allocatable :: a(:)
real(wp) :: v1, v2
integer :: N = 2**26
integer(real64) :: tic, toc, rate

allocate( a(N) )
call random_number(a)

v1 = minval(a)
v2 = maxval(a)

call system_clock(tic,rate)
v1 = minval(a)
v2 = maxval(a)
call system_clock(toc)
write(*,"(A,2E15.6,A,E15.6)") &
   "minval/maxval   ", v1, v2, "     time=", (real(toc-tic,kind=real64))/rate

call system_clock(tic)
call min_max(a,v1,v2)
call system_clock(toc)
write(*,"(A,2E15.6,A,E15.6)") &
   "2 loops         ", v1, v2, "     time=", (real(toc-tic,kind=real64))/rate

call system_clock(tic)
call minmax(a,v1,v2)
call system_clock(toc)
write(*,"(A,2E15.6,A,E15.6)") &
   "1 loop          ", v1, v2, "     time=", (real(toc-tic,kind=real64))/rate

call system_clock(tic)
call minmax_if(a,v1,v2)
call system_clock(toc)
write(*,"(A,2E15.6,A,E15.6)") &
   "1 loop with if1 ", v1, v2, "     time=", (real(toc-tic,kind=real64))/rate

call system_clock(tic)
call minmax_if2(a,v1,v2)
call system_clock(toc)
write(*,"(A,2E15.6,A,E15.6)") &
   "1 loop with if2 ", v1, v2, "     time=", (real(toc-tic,kind=real64))/rate


contains


   subroutine min_max(a,mi,ma)
   real(wp), intent(in) :: a(:)
   real(wp), intent(out) :: mi, ma
   integer :: n, i
   
   n = size(a)
   mi = a(1) ; ma = a(1)
   do i = 2, n
      mi = min(mi,a(i))
   end do
   do i = 2, n
      ma = max(ma,a(i))
   end do
   end subroutine


   subroutine minmax(a,mi,ma)
   real(wp), intent(in) :: a(:)
   real(wp), intent(out) :: mi, ma
   integer :: n, i
   
   n = size(a)
   mi = a(1) ; ma = a(1)
   do i = 2, n
      mi = min(mi,a(i))
      ma = max(ma,a(i))
   end do
   end subroutine
   
   
   subroutine minmax_if(a,mi,ma)
   real(wp), intent(in) :: a(:)
   real(wp), intent(out) :: mi, ma
   integer :: n, i
   
   n = size(a)
   mi = a(1) ; ma = a(1)
   do i = 2, n
      if (a(i) < mi) mi = a(i)
      if (a(i) > ma) ma = a(i)
   end do
   end subroutine
  
  
   subroutine minmax_if2(a,mi,ma)
   real(wp), intent(in) :: a(:)
   real(wp), intent(out) :: mi, ma
   integer :: n, i
   
   n = size(a)
   mi = a(1) ; ma = a(1)
   do i = 2, n
      if (a(i) < mi) then
         mi = a(i)
      else if (a(i) > ma) then
         ma = a(i)
      end if
   end do
   end subroutine
   
end program

If any of you are curious about a C++ solution:

// cpp_minmax.cpp
#include <algorithm> // C++11
#include <span> // C++20


extern "C"
void cpp_minmax(int n, double *af, double *mi, double *ma)
{
  const std::span<const double> a{af,(size_t) n};
  auto [min,max] = std::minmax_element(a.begin(),a.end());

  // unpack iterators
  *mi = *min;
  *ma = *max;
}

For compilation you need to set the standard explicitly: g++ -std=c++20 -Wall -O3 -c cpp_minmax.cpp.

The corresponding Fortran interface is:

interface
   subroutine cpp_minmax(n,a,mi,ma) bind(c,name="cpp_minmax")
      use, intrinsic :: iso_c_binding, only: c_int, c_double
      integer(c_int), intent(in), value :: n
      real(c_double), intent(in) :: a(n)
      real(c_double), intent(out) :: mi, ma
   end subroutine
end interface

On my old Thinkpad with an Intel Ivy Bridge (2012), and compiled with GCC 10, it is slower than the pair of Fortran intrinsics.

2 Likes

@PierU
A couple of ideas about the testing:

  • the order of the test can be an issue, so I repeated the tests npass times
  • final average is reported
  • I don’t like “integer(real64)”

After all that, the results were consistent in my testing, but different to yours.
Making the test order random can help, but probably not justified in this case.
Other compile options might change a bit ?

program minmaxs
  use iso_fortran_env
  implicit none
  
  integer, parameter :: wp=real64
  integer, parameter :: tp=int64
  real(wp), allocatable :: a(:)
  real(wp) :: v1, v2, sec, secs(5)
  integer  :: N = 2**26, ncase=5, npass=6, pass, k
  integer(tp) :: tic, toc, rate
  character :: case_name(0:5) *16 = [ "initialise      ",  & 
                                      "minval/maxval   ",  &
                                      "2 loops         ",  &
                                      "1 loop          ",  &
                                      "1 loop with if1 ",  &
                                      "1 loop with if2 " ]

  call system_clock (tic,rate)
  allocate( a(N) )
  call random_number(a)
  v1 = minval(a)
  v2 = maxval(a)
  sec = delta_sec ()
  write (*,1001)   case_name(0), v1, v2, "     time=", sec

  do pass = 1,npass
    write (*,1002) ' '
    do k = 1,ncase
      call system_clock (tic)
      select case (k)
           
        case (1)    ! "minval/maxval"
          v1 = minval (a)
          v2 = maxval (a)
          
        case (2)     ! "2 loops"
          call min_max (a,v1,v2)
          
        case (3)     ! "1 loop"
          call minmax (a,v1,v2)
          
        case (4)     ! "1 loop with if1"
          call minmax_if (a,v1,v2)
          
        case (5)     ! "1 loop with if2"
          call minmax_if2 (a,v1,v2)
      end select
      sec = delta_sec ()
      write (*,1001)   case_name(k), v1, v2, "     time=", sec
      secs(k) = secs(k) + sec
    end do ! k
  end do  ! pass

  write (*,1002) ' '
  do k = 1,5
    write (*,1002) case_name(k), secs(k)/(npass)
  end do

 1001 format ( A, 2ES15.6, A, f15.8 )
 1002 format ( A, f15.8 )


contains


   subroutine min_max (a,mi,ma)
   real(wp), intent(in) :: a(:)
   real(wp), intent(out) :: mi, ma
   integer :: n, i
   
   n = size(a)
   mi = a(1) ; ma = a(1)
   do i = 2, n
      mi = min(mi,a(i))
   end do
   do i = 2, n
      ma = max(ma,a(i))
   end do
   end subroutine


   subroutine minmax (a,mi,ma)
   real(wp), intent(in) :: a(:)
   real(wp), intent(out) :: mi, ma
   integer :: n, i
   
   n = size(a)
   mi = a(1) ; ma = a(1)
   do i = 2, n
      mi = min(mi,a(i))
      ma = max(ma,a(i))
   end do
   end subroutine
   
   
   subroutine minmax_if (a,mi,ma)
   real(wp), intent(in) :: a(:)
   real(wp), intent(out) :: mi, ma
   integer :: n, i
   
   n = size(a)
   mi = a(1) ; ma = a(1)
   do i = 2, n
      if (a(i) < mi) mi = a(i)
      if (a(i) > ma) ma = a(i)
   end do
   end subroutine
  
  
   subroutine minmax_if2 (a,mi,ma)
   real(wp), intent(in) :: a(:)
   real(wp), intent(out) :: mi, ma
   integer :: n, i
   
   n = size(a)
   mi = a(1) ; ma = a(1)
   do i = 2, n
      if (a(i) < mi) then
         mi = a(i)
      else if (a(i) > ma) then
         ma = a(i)
      end if
   end do
   end subroutine
   
   real(wp) function delta_sec ()
   call system_clock (toc)
   delta_sec = dble(toc-tic)/dble(rate)
   end function delta_sec
   
end program
..
minval/maxval      2.021972E-08   1.000000E+00     time=     0.09090000
2 loops            2.021972E-08   1.000000E+00     time=     0.29000000
1 loop             2.021972E-08   1.000000E+00     time=     0.13980000
1 loop with if1    2.021972E-08   1.000000E+00     time=     0.08430000
1 loop with if2    2.021972E-08   1.000000E+00     time=     0.08440000

minval/maxval        0.09086667
2 loops              0.29300000
1 loop               0.14276667
1 loop with if1      0.08636667
1 loop with if2      0.08710000

(hopefully no errors from the changes)

1 Like

This is very intresting. I also noticed that the maxloc intrinsic is slower than writing my own code. It’s nice to see this confirmed also here.

Just a quick comment: quite often I have to find not only the max but also the argmax of a vector. To do this, I wrote my own subroutine as

subroutine mymax(max_val,max_ind,x)
real(8), intent(in) :: x(:)
real(8), intent(out) :: max_val
integer, intent(out) :: max_ind
! Locals
integer :: nx, ix
real(8) :: temp

! Execution
nx = size(x)

max_val = -huge(0.d0)
max_ind = -1

do ix = 1,nx
   temp = x(ix)
   if (temp>max_val) then
       max_val = temp
       max_ind = ix
   endif
enddo

end subroutine mymax

Any suggestion for improvement would be really appreciated! Especially regarding to performance, since I call this subroutine within large nested loops

1 Like

This seems like a straightforward implementation in fortran (no loop unrolling, strip mining, multiple threads, etc.), so I wonder why the intrinsic maxloc() would be slower than this code. Is it because of the extra effort required to process the DIM, MASK, KIND, and BACK optional arguments? How much slower, 5x, or 5%?

1 Like

Recently I tested the speed of minloc between compilers. The test was essentially this:

call get_command_argument(1,str)
read(str,*) n

allocate(a(n))
call random_number(a)

call system_clock(t1,count_rate=rate)
idx = minloc(a,dim=1)
call system_clock(t2)
print *, idx, a(idx), "built-in", real(t2 - t1)/rate

For n = 10000000 I got the following results, measured on an Intel(R) Xeon(R) Platinum 8380 CPU,

The flags were the following:

Compiler Time (s) Flags
nagfor 7.2 1.75E-02 -O4 -target=native
ifx 2025.1.0 4.39E-03 -O2 -xHOST
ifx 2025.1.0 3.31E-03 -O2 -xHOST -mprefer-vector-width=512
ifort 18.0.5 3.51E-03 -O2 -xHOST
ifort 18.0.5 3.06E-03 -O2 -xHOST -qopt-zmm-usage=high
ifort 17.0.6 3.16E-03 -O2 -xHOST -qopt-zmm-usage=high
gfortran 12.2 1.76E-02 -O2 -march=native
flang-new 19.1.1 5.76E-02 -O2 -march=native
nvfortran 23.3-0 2.48E-02 -fast -mcpu=native

Interestingly, ifort has a very fast minloc that uses SIMD registers. In ifx the performance was worse, however in the oneAPI 2025 release, ifx once again has a fast minloc. When compiled with -mprefer-vector-width=512 it uses AVX512 registers as seen from the following hot loop:

.LBB0_12:
	vmovups	(%rsi,%rax,4), %zmm7
	vpbroadcastq	%rax, %zmm8
	vcmpltps	%zmm3, %zmm7, %k1
	vpaddq	%zmm4, %zmm8, %zmm1 {%k1}
	kshiftrw	$8, %k1, %k2
	vpaddq	%zmm5, %zmm8, %zmm2 {%k2}
	vpmovqd	%zmm8, %ymm8
	vmovdqa	%ymm8, %ymm9
	vinserti64x4	$1, %ymm8, %zmm9, %zmm8
	vpaddd	%zmm6, %zmm8, %zmm0 {%k1}
	vminps	%zmm3, %zmm7, %zmm3
	addq	$16, %rax
	cmpq	%rdx, %rax
	jb	.LBB0_12

I first noticed that ifort has a fast minloc/maxloc implementation here: Performance of vectorized code in ifort and ifx - #21 by ivanpribec

3 Likes

Very interesting, thanks! Would you say that after the oneapi 2025 release, ifort and ifx have performance parity regarding minloc? I ask this because I use maxloc all the time in my projects, and as of now I am still with ifort

1 Like

Between the data being random and multiple cases not being run timings like this can
be misleading. Merging your subroutine with some of the above examples I can get
any of the four to be the fastest depending on the compiler switches, the compiler, and how many changes to the maximum value occur during the search.

Run the following multiple times and you are likely to get considerably different results each time.

Oddly, even when I ran a loop on the same data I was seeing more variation than I expected.

If you use fpm try it with --profile debug and --profile release as a simple example of how much difference compiler switches make.

Overall, I found using the intrinsic and high compiler optimization the most reliable in this case with the compilers I tried, but timing is a tricky business so particularly with random data run multiple times and track the fastest, slowest, and average for each method; as a single pass can be very deceptive and might even be thrown off by concurrent system load.

Instrumented version of your example
module min_max_mod
implicit none
integer, parameter :: wp = kind(1.0d0)
contains

pure subroutine mymax_temp(max_val,max_ind,x)
real(kind=wp), intent(in)  :: x(:) 
real(kind=wp), intent(out) :: max_val 
integer, intent(out)       :: max_ind 

integer                    :: ix 
real(kind=wp)              :: temp 

max_val = -huge(0.d0)
max_ind = -1

do ix = 1,size(x)
   temp = x(ix)
   if (temp>max_val) then
       max_val = temp
       max_ind = ix
   endif
enddo

end subroutine mymax_temp

pure subroutine mymax_loc(max_val,max_ind,x)
real(kind=wp), intent(in)  :: x(:) 
real(kind=wp), intent(out) :: max_val 
integer, intent(out)       :: max_ind 

integer                    :: ix  

max_val = -huge(0.d0)
max_ind = 1

do ix = 1,size(x)
   if (x(ix)>x(max_ind)) then
       max_ind=ix         
   endif
enddo

max_val=x(max_ind)

end subroutine mymax_loc

pure subroutine mymax_intrinsic(max_val,max_ind,x)
real(kind=wp), intent(in)  :: x(:) 
real(kind=wp), intent(out) :: max_val 
integer, intent(out)       :: max_ind 

max_ind = maxloc(x,dim=1)
max_val = x(max_ind)

end subroutine mymax_intrinsic

pure subroutine mymax_notemp(max_val,max_ind,x)
real(kind=wp), intent(in)  :: x(:) 
real(kind=wp), intent(out) :: max_val 
integer, intent(out)       :: max_ind 

integer                    :: nx, ix 

nx = size(x)

max_val = -huge(0.d0)
max_ind = -1

do ix = 1,nx
   if (x(ix)>max_val) then
       max_val = x(ix)
       max_ind = ix
   endif
enddo

end subroutine mymax_notemp

end module min_max_mod
!
program xmin_max
use min_max_mod
implicit none
integer                      :: n 
real(kind=wp), allocatable   :: x(:) 
real(kind=wp)                :: t(2) 
real(kind=wp)                :: max_val
integer                      :: max_ind 
character(len=:),allocatable :: title
character(len=*),parameter   :: fmt_cr = "(a20,a20,1x,f16.12,i9,f8.4)" 

n = 10**8
allocate(x(n))
call random_number(x)

!do i=1,4

   call setup('temp')
   call mymax_temp(max_val,max_ind,x)
   call printme()
   
   call setup('notemp')
   call mymax_notemp(max_val,max_ind,x)
   call printme()
   
   call setup('intrinsic')
   call mymax_intrinsic(max_val,max_ind,x)
   call printme()

   call setup('loc')
   call mymax_loc(max_val,max_ind,x)
   call printme()

!enddo

contains
subroutine setup(header)
character(len=*),intent(in) :: header
   title=header
   call cpu_time(t(1))
end subroutine setup

subroutine printme()
   call cpu_time(t(2))
   print fmt_cr,title,"max, loc:",max_val, max_ind,t(2)-t(1)
end subroutine printme

end program xmin_max
2 Likes

With MacOS on an M2 arm64 cpu with gfortran -O, I get the following timings.

$ a.out
                temp           max, loc:   0.999999998174 45366188  0.0437
              notemp           max, loc:   0.999999998174 45366188  0.0336
           intrinsic           max, loc:   0.999999998174 45366188  0.0320
                 loc           max, loc:   0.999999998174 45366188  0.0445

I ran it several times and the timings are stable to about 1ms. Without the -O, the times are all about 2x larger (even the intrinsic). It looks like the gfortran intrinsic was implemented well for this case.

1 Like

Can you please post the full code that you used? I would like to run it myself as well.

1 Like

Sure. Notes about my driver:

  • the array size is read from the command-line parameter
  • in the table above, I only report times of the built-in function
  • my original goal was to get a fast SIMD version, but I had no success with !$omp simd directives
  • the lines printing the compiler version and options are commented out because they aren’t unsupported on older ifort versions
test_minloc_reduction
! test_minloc_reduction.f90
module fast_minloc
implicit none
private

public :: minloc_thread
public :: minloc_reduce
public :: minloc_simd

type :: minloc_type
    real :: val = huge(1.0)
    integer :: idx = 0
end type
!$omp declare reduction(min:minloc_type:omp_out=minloc_min(omp_out,omp_in))

contains

    function minloc_min(a,b) result(c)
        type(minloc_type), intent(in) :: a, b
        type(minloc_type) :: c
        c = merge(a,b,a%val <= b%val)
    end function

    function minloc_reduce(a) result(idx)
        real, intent(in) :: a(:)
        integer :: idx

        type(minloc_type) :: t
        integer :: i

        t = minloc_type(huge(a),0)
        !$omp parallel do reduction(min: t)
        do i = 1, size(a)
            t = minloc_min(t,minloc_type(a(i),i))
        end do

        idx = t%idx

    end function

    function minloc_simd(a) result(idx)
        real, intent(in) :: a(:)
        integer :: idx

        ! Type needed for custom reduction
        type :: ridx
           real :: val = huge(1.0)
           integer :: idx = 0
        end type

!$omp declare reduction(min: ridx: &
!$omp& omp_out = merge(omp_out,omp_in,omp_out%val < omp_in%val))

        integer :: i
        type(ridx) :: r

        if (size(a) == 0) then
            idx = 0
            return
        else
            r = ridx(a(1),1)            
            !$omp simd reduction(min: r)
            do i = 2, size(a)
                if (a(i) < r%val) r = ridx(a(i),i)
            end do
            idx = r%idx
        end if

    end function

    function minloc_thread(a) result(idx)
        !$ use omp_lib
        real, intent(in) :: a(:)

        integer :: n, nt, it, nc, lb, ub, mt
        integer :: idx

        real, allocatable :: b(:)
        integer, allocatable :: t(:)

        n = size(a)

        if (n == 0) then
            idx = 0
            return
        else

        mt = 1
        !$ mt = omp_get_max_threads()
        allocate(b(mt),source=huge(1.0))
        allocate(t(mt),source=0)

        !$omp parallel default(private) shared(a, b, t, n, idx)

            nt = 1
            it = 0
        !$  nt = omp_get_num_threads()
        !$  it = omp_get_thread_num()

            ! Divide array into chunks
            nc = n / nt

            lb = it*nc + 1
            ub = min((it+1)*nc, n)

            t(it+1) = minloc(a(lb:ub),dim=1) + lb - 1
            b(it+1) = a(t(it+1))
            !$omp barrier

            !$omp single
            idx = t(minloc(b(1:nt),dim=1))
            !$omp end single nowait

        !$omp end parallel

        end if

    end function

end module


program main
!use, intrinsic :: iso_fortran_env, only: &
!    compiler_version, compiler_options

use, intrinsic :: iso_fortran_env, only: int64

use fast_minloc
implicit none

real, allocatable :: a(:)
integer :: idx, i, n
integer(kind=int64) :: t1, t2, rate
character(len=64) :: str

!print '(A)', compiler_version()
!print '(A)', compiler_options()

call get_command_argument(1,str)
read(str,*) n

allocate(a(n))
call random_number(a)

call system_clock(t1,count_rate=rate)
idx = minloc(a,dim=1)
call system_clock(t2)
print *, idx, a(idx), "built-in", real(t2 - t1)/rate

call system_clock(t1)
idx = minloc_thread(a)
call system_clock(t2)
print *, idx, a(idx), "threaded", real(t2 - t1)/rate

call system_clock(t1)
idx = minloc_reduce(a)
call system_clock(t2)
print *, idx, a(idx), "user-defined reduction", real(t2-t1)/rate

call system_clock(t1)
idx = minloc_simd(a)
call system_clock(t2)
print *, idx, a(idx), "simd user-defined reduction", real(t2 - t1)/rate

end program

In my experiment I also tried to use multi-threading to evaluate minloc, but I found no benefit for N < 5000000.

2 Likes

Here is a version with aligned output, for example

compiler version: GCC version 15.0.0 20241215 (experimental)
compiler options: -march=tigerlake -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mavx2 -mno-sse4a -mno-fma4 -mno-xop -mfma -mavx512f -mbmi -mbmi2 -maes -mpclmul -mavx512vl -mavx512bw -mavx512dq -mavx512cd -mavx512vbmi -mavx512ifma -mavx512vpopcntdq -mavx512vbmi2 -mgfni -mvpclmulqdq -mavx512vnni -mavx512bitalg -mno-avx512bf16 -mavx512vp2intersect -mno-3dnow -madx -mabm -mno-cldemote -mclflushopt -mclwb -mno-clzero -mcx16 -mno-enqcmd -mf16c -mfsgsbase -mfxsr -mno-hle -msahf -mno-lwp -mlzcnt -mmovbe -mmovdir64b -mmovdiri -mno-mwaitx -mno-pconfig -mno-pku -mprfchw -mno-ptwrite -mrdpid -mrdrnd -mrdseed -mno-rtm -mno-serialize -mno-sgx -msha -mshstk -mno-tbm -mno-tsxldtrk -mvaes -mno-waitpkg -mno-wbnoinvd -mxsave -mxsavec -mxsaveopt -mxsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl -mno-widekl -mno-avxvnni -mno-avx512fp16 -mno-avxifma -mno-avxvnniint8 -mno-avxneconvert -mno-cmpccxadd -mno-amx-fp16 -mno-prefetchi -mno-raoint -mno-amx-complex -mno-avxvnniint16 -mno-sm3 -mno-sha512 -mno-sm4 -mno-apxf -mno-usermsr -mno-avx10.2-256 -mno-avx10.2-512 -mno-amx-avx512 -mno-amx-tf32 -mno-amx-transpose -mno-amx-fp8 -mno-movrs -mno-amx-movrs --param=l1-cache-size=48 --param=l1-cache-line-size=64 --param=l2-cache-size=12288 -mtune=tigerlake -O3 -flto
n = 10000000

    minloc        minval                        method      time
    890628    0.00000006                      built-in  0.008685
    890628    0.00000006                      threaded  0.008564
    890628    0.00000006        user-defined reduction  0.015995
    890628    0.00000006   simd user-defined reduction  0.008881
1 Like

With the same option and same gfortran version an an MSWindows platform not intentionally running anything else I see, out of a set of 40 executes, very wide vulnerability. On Linux platforms I see something close to what you see. Not really an MSWindows user so not planning on tracking it down but I have seen huge fluctuations on MSWIndows many times in the past on other platforms, but who knows what MSWindows is doing in the background half the time.

       loc   max, loc:   0.999999983796 57248502  0.0940
       loc   max, loc:   0.999999988597 47456428  0.3750

      temp   max, loc:   0.999999970771 91888470  0.0320
      temp   max, loc:   0.999999983796 57248502  1.1090

    notemp   max, loc:   0.999999995812 55145976  0.0620
    notemp   max, loc:   0.999999983796 57248502  0.2970

 intrinsic   max, loc:   0.999999970771 91888470  0.0780
 intrinsic   max, loc:   0.999999995812 55145976  0.2340

I little bit of self promotion here :grin:.
If you want to time function calls with some precision, you can always use benchmark.f. It has been developed for doing just that.
It should be easy to use but don’t hesitate to reach out if you have questions or issues

5 Likes

I have noticed high CPU usage from Microsoft Compatibility Telemetry, for example.

1 Like

The use of SIMD instructions for finding the index of minimum/maximum location is explained well here: http://0x80.pl/notesen/2018-10-03-simd-index-of-min.html.
Using AVX2 instructions gives a speed-up of factor 10x compared to scalar loops.

I’m not too familiar with ARM64 assembly, but it still looks like a scalar loop to me (Compiler Explorer). I imagine it is possible to do something faster using Arm NEON instructions.

1 Like