Speed of array intrinsics

Recently I compared the speed of the minloc intrinsic across compilers. The test was essentially this:

program minloc_bench
   implicit none
   character(len=32) :: str
   integer :: n, idx, t1, t2, rate
   real, allocatable :: a(:)

   ! problem size n is read from the first command-line argument
   call get_command_argument(1,str)
   read(str,*) n

   allocate(a(n))
   call random_number(a)

   ! time only the intrinsic call
   call system_clock(t1,count_rate=rate)
   idx = minloc(a,dim=1)
   call system_clock(t2)
   print *, idx, a(idx), "built-in", real(t2 - t1)/rate
end program minloc_bench
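
The "built-in" label suggests the intrinsic was also timed against an explicit do loop. A minimal sketch of such a reference loop, reusing the variables and timing scheme of the program above (my own sketch, with the extra declarations noted in the comment, not necessarily the exact code from the original test):

! explicit scalar loop for comparison; assumes "integer :: i" and
! "real :: amin" are added to the declarations of the program above
call system_clock(t1,count_rate=rate)
idx = 1
amin = a(1)
do i = 2, n
   if (a(i) < amin) then
      amin = a(i)
      idx = i
   end if
end do
call system_clock(t2)
print *, idx, a(idx), "do loop ", real(t2 - t1)/rate

The benchmark takes the array size on the command line, e.g. ./a.out 10000000.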

For n = 10000000, measured on an Intel(R) Xeon(R) Platinum 8380 CPU, I got the following results with the flags listed for each compiler:

Compiler            Time (s)    Flags
nagfor 7.2          1.75E-02    -O4 -target=native
ifx 2025.1.0        4.39E-03    -O2 -xHOST
ifx 2025.1.0        3.31E-03    -O2 -xHOST -mprefer-vector-width=512
ifort 18.0.5        3.51E-03    -O2 -xHOST
ifort 18.0.5        3.06E-03    -O2 -xHOST -qopt-zmm-usage=high
ifort 17.0.6        3.16E-03    -O2 -xHOST -qopt-zmm-usage=high
gfortran 12.2       1.76E-02    -O2 -march=native
flang-new 19.1.1    5.76E-02    -O2 -march=native
nvfortran 23.3-0    2.48E-02    -fast -mcpu=native

Interestingly, ifort has a very fast minloc that uses SIMD registers. In earlier ifx releases the performance was worse, but with the oneAPI 2025 release ifx once again has a fast minloc. When compiled with -mprefer-vector-width=512 it uses the 512-bit zmm registers (AVX-512), as seen in the following hot loop:

.LBB0_12:
	vmovups	(%rsi,%rax,4), %zmm7
	vpbroadcastq	%rax, %zmm8
	vcmpltps	%zmm3, %zmm7, %k1
	vpaddq	%zmm4, %zmm8, %zmm1 {%k1}
	kshiftrw	$8, %k1, %k2
	vpaddq	%zmm5, %zmm8, %zmm2 {%k2}
	vpmovqd	%zmm8, %ymm8
	vmovdqa	%ymm8, %ymm9
	vinserti64x4	$1, %ymm8, %zmm9, %zmm8
	vpaddd	%zmm6, %zmm8, %zmm0 {%k1}
	vminps	%zmm3, %zmm7, %zmm3
	addq	$16, %rax
	cmpq	%rdx, %rax
	jb	.LBB0_12
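
Conceptually, this hot loop keeps 16 running minima (one per single-precision lane of a zmm register) together with the loop index at which each lane last improved, and only reduces across lanes after the loop. The following Fortran function is my own illustration of that strip-mined pattern (the name minloc_blocked and the fixed lane width of 16 are assumptions for the sketch, not anything generated by or taken from ifx):

function minloc_blocked(a, n) result(idx)
   ! sketch of the pattern behind the vectorized loop: w independent
   ! running minima (one per SIMD lane) plus the index at which each
   ! lane last improved, reduced across lanes only once at the end
   implicit none
   integer, parameter  :: w = 16          ! single-precision lanes per zmm register
   integer, intent(in) :: n
   real,    intent(in) :: a(n)
   integer :: idx, i, j, lane(1)
   real    :: vmin(w)
   integer :: vidx(w)

   vmin = huge(0.0)
   vidx = 1
   do i = 1, n - mod(n, w), w             ! main strip-mined loop
      where (a(i:i+w-1) < vmin)
         vmin = a(i:i+w-1)
         vidx = [(i + j, j = 0, w - 1)]   ! per-lane candidate indices
      end where
   end do
   lane = minloc(vmin)                    ! final cross-lane reduction
   idx  = vidx(lane(1))
   ! a scalar tail loop would handle the remaining mod(n, w) elements
end function minloc_blocked

Because each lane tracks its own minimum and index, the loop body has no cross-iteration dependence beyond the per-lane state, which is what allows the compiler to keep everything in zmm registers.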

I first noticed that ifort has a fast minloc/maxloc implementation here: Performance of vectorized code in ifort and ifx - #21 by ivanpribec
