Code slower than C++ version

lmiq · October 14, 2021, 2:56pm

This is probably deliberate, to keep the safer option as the default. What I don’t understand is that @fastmath (or the equivalent compiler option in Fortran) should skip those checks, shouldn’t it? ~~(it does not)~~

oscardssmith · October 14, 2021, 3:05pm

three_median2 and three_median1 (and the associated assembly) are actually incorrect implementations if you are following strict IEEE rules. three_median2(0.0,-0.0,0.0)==-0.0, which is wrong, and both three_median1 and three_median2 produce NaN for the input (0.0,NaN,0.0).

If you define three_median1 as @fastmath max(min(a,b),min(max(a,b),c)), Julia is able to optimize to the same assembly as three_median2
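To make the IEEE point concrete outside Julia, here is a minimal C++ sketch. The names min_cmp, max_cmp and median3_cmp are made up for illustration and are not the three_median1/three_median2 from this thread. The assumption is that `a < b ? a : b` behaves roughly like a bare minsd: when the comparison is false, including when either operand is NaN or both are zeros, the second operand is returned. The result then depends on where a NaN or a -0.0 sits in the argument list.

```cpp
#include <cmath>

// Comparison-based min/max: if the comparison is false (NaN involved,
// or the two values compare equal), the second argument is returned.
inline double min_cmp(double a, double b) { return a < b ? a : b; }
inline double max_cmp(double a, double b) { return a > b ? a : b; }

// Median of three built from the helpers above (illustrative only).
double median3_cmp(double a, double b, double c) {
    double lo = min_cmp(a, b);            // smaller of the first two
    double hi = max_cmp(a, b);            // larger of the first two
    return max_cmp(lo, min_cmp(hi, c));   // clamp c into [lo, hi]
}
```

Compiled without any fast-math option, median3_cmp(NaN, 1.0, 2.0) returns 1.0 while median3_cmp(1.0, 2.0, NaN) returns NaN, and median3_cmp(0.0, 0.0, -0.0) returns -0.0 (check with std::signbit). A strictly IEEE-aware min/max has to add extra checks to avoid this, which is the cost that @fastmath removes.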

certik · October 14, 2021, 3:25pm

Perfect. Thanks for the feedback @oscardssmith. @lmiq, do you want to check whether you now get the same performance with three_median1?

lmiq · October 14, 2021, 4:11pm

Yes, it does:

julia> function three_median1(a, b, c)
         res = @fastmath max(min(a,b),min(max(a,b),c))
         return res
       end
three_median1 (generic function with 1 method)

julia> @btime test($x,$three_median1)
  10.053 μs (0 allocations: 0 bytes)
5017.136463388665

(I think I had put @fastmath on the call to the function inside the test function, and that didn’t work.)

One curiosity: in this last version I thought I would do one comparison less:

julia> function three_median3(a,b,c)
           a1, a2  = a < b ? (a, b) : (b, a)
           a3 = a2 < c ? a2 : c
           res = a3 > a1 ? a3 : a1
           return res
       end
three_median3 (generic function with 1 method)

but the native code is the same:

julia> @code_native three_median3(1.0,1.0,1.0)
	.text
; ┌ @ REPL[32] within `three_median3'
	vminsd	%xmm1, %xmm0, %xmm3
	vmaxsd	%xmm0, %xmm1, %xmm0
; │ @ REPL[32]:3 within `three_median3'
	vminsd	%xmm2, %xmm0, %xmm0
; │ @ REPL[32] within `three_median3'
	vmaxsd	%xmm3, %xmm0, %xmm0
; │ @ REPL[32]:5 within `three_median3'
	retq
	nopw	%cs:(%rax,%rax)
; └

(the compiler decided that swapping the values is not worth saving one comparison, something like that).
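For comparison, a rough C++ transcription of the two formulations (the names median_minmax and median_cond are mine, and this is only a sketch): on ordinary finite inputs both compute the same value, which is consistent with a compiler mapping them to the same min/max instructions. Whether a given C++ compiler actually emits identical assembly for the two is not something this sketch shows.

```cpp
#include <algorithm>

// min/max form, following the expression used in three_median1
double median_minmax(double a, double b, double c) {
    return std::max(std::min(a, b), std::min(std::max(a, b), c));
}

// conditional form, transcribed from three_median3
double median_cond(double a, double b, double c) {
    double a1 = a < b ? a : b;    // smaller of a and b
    double a2 = a < b ? b : a;    // larger of a and b
    double a3 = a2 < c ? a2 : c;  // min(larger, c)
    return a3 > a1 ? a3 : a1;     // max(smaller, a3)
}
```

Both return 2.0 for every ordering of (1.0, 2.0, 3.0). The two can differ once NaN or signed zeros are involved, so the equivalence only holds for ordinary values.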

But, in the end, Fortran is able to optimize the min/max version to the optimal code, isn’t it? Does it behave like Julia in this regard (doing NaN checks without fastmath and skipping them with it)?
(I don’t know how to check the assembly code in these cases.)

ivanpribec · October 14, 2021, 4:54pm

You can check them with Godbolt:

Here is the output of the latest Intel Fortran Compiler Classic with -O3:

three_median_:
        movsd     xmm2, QWORD PTR [rdi]                         #5.3
        movsd     xmm1, QWORD PTR [rsi]                         #5.3
        movaps    xmm0, xmm2                                    #7.1
        maxsd     xmm2, xmm1                                    #7.1
        minsd     xmm0, xmm1                                    #7.1
        minsd     xmm2, QWORD PTR [rdx]                         #7.1
        maxsd     xmm0, xmm2                                    #7.1
        ret

And here is the gfortran generated assembly from the link in @lkedward’s reply:

three_median_:
        movsd   xmm1, QWORD PTR [rdi]
        movsd   xmm2, QWORD PTR [rsi]
        movapd  xmm0, xmm1
        minsd   xmm1, xmm2
        maxsd   xmm0, xmm2
        minsd   xmm0, QWORD PTR [rdx]
        maxsd   xmm0, xmm1
        ret

lmiq · October 14, 2021, 5:37pm

Nice tool. The version with only the conditionals seems to generate fewer instructions: Compiler Explorer

With only the conditionals there is probably a small performance gain, although as far as I understood from the documentation the max/min functions handle NaNs (I was expecting --fast-math or some other flag to make the assemblies converge, but I couldn’t find such a flag, if it exists).
