Speed of array intrinsics

Recently I compared the speed of the minloc intrinsic across compilers. The test was essentially this:

program minloc_bench
   implicit none
   character(len=32) :: str
   integer :: n, idx, t1, t2, rate
   real, allocatable :: a(:)

   ! problem size n is read from the first command-line argument
   call get_command_argument(1,str)
   read(str,*) n

   allocate(a(n))
   call random_number(a)

   ! time only the intrinsic call
   call system_clock(t1,count_rate=rate)
   idx = minloc(a,dim=1)
   call system_clock(t2)
   print *, idx, a(idx), "built-in", real(t2 - t1)/rate
end program minloc_bench
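
The "built-in" label suggests the intrinsic was also timed against an explicit do loop. A minimal sketch of such a reference loop, reusing the variables and timing scheme of the program above (my own sketch, with the extra declarations noted in the comment, not necessarily the exact code from the original test):

! explicit scalar loop for comparison; assumes "integer :: i" and
! "real :: amin" are added to the declarations of the program above
call system_clock(t1,count_rate=rate)
idx = 1
amin = a(1)
do i = 2, n
   if (a(i) < amin) then
      amin = a(i)
      idx = i
   end if
end do
call system_clock(t2)
print *, idx, a(idx), "do loop ", real(t2 - t1)/rate

The benchmark takes the array size on the command line, e.g. ./a.out 10000000.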

For n = 10000000, measured on an Intel(R) Xeon(R) Platinum 8380 CPU, I got the following results with the flags listed for each compiler:

Compiler            Time (s)    Flags
nagfor 7.2          1.75E-02    -O4 -target=native
ifx 2025.1.0        4.39E-03    -O2 -xHOST
ifx 2025.1.0        3.31E-03    -O2 -xHOST -mprefer-vector-width=512
ifort 18.0.5        3.51E-03    -O2 -xHOST
ifort 18.0.5        3.06E-03    -O2 -xHOST -qopt-zmm-usage=high
ifort 17.0.6        3.16E-03    -O2 -xHOST -qopt-zmm-usage=high
gfortran 12.2       1.76E-02    -O2 -march=native
flang-new 19.1.1    5.76E-02    -O2 -march=native
nvfortran 23.3-0    2.48E-02    -fast -mcpu=native

Interestingly, ifort has a very fast minloc that uses SIMD registers. In earlier ifx releases the performance was worse, but with the oneAPI 2025 release ifx once again has a fast minloc. When compiled with -mprefer-vector-width=512 it uses the 512-bit zmm registers (AVX-512), as seen in the following hot loop:

.LBB0_12:
	vmovups	(%rsi,%rax,4), %zmm7
	vpbroadcastq	%rax, %zmm8
	vcmpltps	%zmm3, %zmm7, %k1
	vpaddq	%zmm4, %zmm8, %zmm1 {%k1}
	kshiftrw	$8, %k1, %k2
	vpaddq	%zmm5, %zmm8, %zmm2 {%k2}
	vpmovqd	%zmm8, %ymm8
	vmovdqa	%ymm8, %ymm9
	vinserti64x4	$1, %ymm8, %zmm9, %zmm8
	vpaddd	%zmm6, %zmm8, %zmm0 {%k1}
	vminps	%zmm3, %zmm7, %zmm3
	addq	$16, %rax
	cmpq	%rdx, %rax
	jb	.LBB0_12
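
Conceptually, this hot loop keeps 16 running minima (one per single-precision lane of a zmm register) together with the loop index at which each lane last improved, and only reduces across lanes after the loop. The following Fortran function is my own illustration of that strip-mined pattern (the name minloc_blocked and the fixed lane width of 16 are assumptions for the sketch, not anything generated by or taken from ifx):

function minloc_blocked(a, n) result(idx)
   ! sketch of the pattern behind the vectorized loop: w independent
   ! running minima (one per SIMD lane) plus the index at which each
   ! lane last improved, reduced across lanes only once at the end
   implicit none
   integer, parameter  :: w = 16          ! single-precision lanes per zmm register
   integer, intent(in) :: n
   real,    intent(in) :: a(n)
   integer :: idx, i, j, lane(1)
   real    :: vmin(w)
   integer :: vidx(w)

   vmin = huge(0.0)
   vidx = 1
   do i = 1, n - mod(n, w), w             ! main strip-mined loop
      where (a(i:i+w-1) < vmin)
         vmin = a(i:i+w-1)
         vidx = [(i + j, j = 0, w - 1)]   ! per-lane candidate indices
      end where
   end do
   lane = minloc(vmin)                    ! final cross-lane reduction
   idx  = vidx(lane(1))
   ! a scalar tail loop would handle the remaining mod(n, w) elements
end function minloc_blocked

Because each lane tracks its own minimum and index, the loop body has no cross-iteration dependence beyond the per-lane state, which is what allows the compiler to keep everything in zmm registers.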

I first noticed that ifort has a fast minloc/maxloc implementation here: Performance of vectorized code in ifort and ifx - #21 by ivanpribec
