Recently I compared the speed of `minloc` across compilers. The test was essentially this:
```fortran
program bench_minloc
  implicit none
  character(len=32) :: str
  integer :: n, idx
  integer(kind=8) :: t1, t2, rate
  real, allocatable :: a(:)
  call get_command_argument(1, str)
  read(str,*) n
  allocate(a(n))
  call random_number(a)
  call system_clock(t1, count_rate=rate)
  idx = minloc(a, dim=1)
  call system_clock(t2)
  print *, idx, a(idx), "built-in", real(t2 - t1)/rate
end program bench_minloc
```
For `n = 10000000` I got the following results, measured on an Intel(R) Xeon(R) Platinum 8380 CPU, with the flags listed in the table:
Compiler | Time (s) | Flags |
---|---|---|
nagfor 7.2 | 1.75E-02 | -O4 -target=native |
ifx 2025.1.0 | 4.39E-03 | -O2 -xHOST |
ifx 2025.1.0 | 3.31E-03 | -O2 -xHOST -mprefer-vector-width=512 |
ifort 18.0.5 | 3.51E-03 | -O2 -xHOST |
ifort 18.0.5 | 3.06E-03 | -O2 -xHOST -qopt-zmm-usage=high |
ifort 17.0.6 | 3.16E-03 | -O2 -xHOST -qopt-zmm-usage=high |
gfortran 12.2 | 1.76E-02 | -O2 -march=native |
flang-new 19.1.1 | 5.76E-02 | -O2 -march=native |
nvfortran 23.3-0 | 2.48E-02 | -fast -mcpu=native |
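For reference, the obvious hand-written baseline that `minloc(a, dim=1)` is competing against is a plain scalar loop. This is not part of the original test, just a sketch one could drop into the timing harness above in place of the intrinsic:

```fortran
module minloc_ref
  implicit none
contains
  ! Straightforward scalar equivalent of minloc(a, dim=1):
  ! returns the index of the first occurrence of the minimum,
  ! or 0 for a zero-sized array, matching the intrinsic.
  pure function my_minloc(a) result(idx)
    real, intent(in) :: a(:)
    integer :: idx, i
    real :: amin
    idx = 0
    amin = huge(0.0)
    do i = 1, size(a)
       if (a(i) < amin) then
          amin = a(i)
          idx = i
       end if
    end do
  end function my_minloc
end module minloc_ref
```

The data-dependent branch and the index bookkeeping are what make this loop hard to vectorize naively, which is why the compiler-generated SIMD versions discussed below are interesting.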
Interestingly, ifort has a very fast `minloc` that uses SIMD registers. In ifx the performance was initially worse; however, as of the oneAPI 2025 release, ifx once again has a fast `minloc`. When compiled with `-mprefer-vector-width=512`, it uses AVX-512 (zmm) registers, as seen in the following hot loop:
```asm
.LBB0_12:
        vmovups      (%rsi,%rax,4), %zmm7
        vpbroadcastq %rax, %zmm8
        vcmpltps     %zmm3, %zmm7, %k1
        vpaddq       %zmm4, %zmm8, %zmm1 {%k1}
        kshiftrw     $8, %k1, %k2
        vpaddq       %zmm5, %zmm8, %zmm2 {%k2}
        vpmovqd      %zmm8, %ymm8
        vmovdqa      %ymm8, %ymm9
        vinserti64x4 $1, %ymm8, %zmm9, %zmm8
        vpaddd       %zmm6, %zmm8, %zmm0 {%k1}
        vminps       %zmm3, %zmm7, %zmm3
        addq         $16, %rax
        cmpq         %rdx, %rax
        jb           .LBB0_12
```
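For readers who don't speak AVX-512: the hot loop above is, roughly, a lane-wise reduction. Each of the 16 single-precision lanes of a zmm register keeps its own running minimum and the index where it occurred, and a final pass reduces across lanes. The following is only a Fortran sketch of that idea, not the compiler's actual code (the constant `nlanes = 16` stands in for the zmm lane count):

```fortran
program lanewise_minloc
  implicit none
  integer, parameter :: nlanes = 16     ! 16 float lanes per zmm register
  integer, parameter :: n = 100000
  real :: a(n), vmin(nlanes)
  integer :: vidx(nlanes), i, l, best

  call random_number(a)
  vmin = huge(0.0)
  vidx = 0

  ! Blocked loop: lane l tracks the running min over a(i+l), block by block
  do i = 0, n - nlanes, nlanes
     do l = 1, nlanes
        if (a(i+l) < vmin(l)) then
           vmin(l) = a(i+l)
           vidx(l) = i + l
        end if
     end do
  end do

  ! Remainder elements (none here, since n is a multiple of nlanes),
  ! folded into lane 1 for simplicity
  do i = (n/nlanes)*nlanes + 1, n
     if (a(i) < vmin(1)) then
        vmin(1) = a(i)
        vidx(1) = i
     end if
  end do

  ! Cross-lane reduction; ties resolve toward the smaller index,
  ! matching minloc's first-occurrence rule
  best = 1
  do l = 2, nlanes
     if (vmin(l) < vmin(best) .or. &
         (vmin(l) == vmin(best) .and. vidx(l) < vidx(best))) best = l
  end do

  print *, vidx(best), minloc(a, dim=1)
end program lanewise_minloc
```

Compiled with any of the compilers above, this should print the same index twice. The masked `vpaddq`/`vpaddd` instructions in the assembly correspond to the conditional index updates, and `vminps` to the per-lane minimum.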
I first noticed that ifort has a fast minloc/maxloc implementation here: Performance of vectorized code in ifort and ifx - #21 by ivanpribec