Code slower than C++ version

This is probably deliberate, to keep the safer option as the default. What I don’t understand is that @fastmath (or the equivalent compiler option in Fortran) should skip those checks, shouldn’t it? (It does not.)

three_median2 and three_median1 (and the associated assembly) are actually incorrect implementations if you are following strict IEEE rules: three_median2(0.0,-0.0,0.0)==-0.0, which is wrong, and both three_median1 and three_median2 return NaN for the arguments (0.0,NaN,0.0).
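A minimal sketch of these edge cases, for anyone who wants to reproduce them. The names median_minmax and median_cond are hypothetical: the first is the plain min/max formula without @fastmath, and the second copies the comparison-only body of three_median3 posted in this thread, which may differ in detail from the three_median2 defined earlier.

```julia
# Plain min/max version (no @fastmath): Base.min and Base.max
# propagate NaN and order -0.0 below 0.0.
median_minmax(a, b, c) = max(min(a, b), min(max(a, b), c))

# Comparison-only version (hypothetical name, same body as three_median3).
function median_cond(a, b, c)
    a1, a2 = a < b ? (a, b) : (b, a)
    a3 = a2 < c ? a2 : c
    return a3 > a1 ? a3 : a1
end

# Signed zeros: two of the three inputs are +0.0, so the median should be +0.0.
r_minmax_zero = median_minmax(0.0, -0.0, 0.0)
r_cond_zero   = median_cond(0.0, -0.0, 0.0)

# NaN: every comparison with NaN is false, so the comparison-only version
# gives a result that depends on where the NaN sits in the argument list.
r_minmax_nan      = median_minmax(NaN, 0.0, 0.0)
r_cond_nan_first  = median_cond(NaN, 0.0, 0.0)
r_cond_nan_second = median_cond(0.0, NaN, 0.0)

println("signed zero: minmax = ", r_minmax_zero, ", cond = ", r_cond_zero)
println("NaN first:   minmax = ", r_minmax_nan, ", cond = ", r_cond_nan_first)
println("NaN second:  cond = ", r_cond_nan_second)
```

The sign of a zero is easiest to inspect with signbit, since 0.0 == -0.0 is true.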

If you define three_median1 as @fastmath max(min(a,b),min(max(a,b),c)), Julia is able to optimize it to the same assembly as three_median2.

Perfect. Thanks for the feedback, @oscardssmith. @lmiq, do you want to check whether you now get the same performance with three_median1?

Yes, it does:

julia> function three_median1(a, b, c)
         res = @fastmath max(min(a,b),min(max(a,b),c))
         return res
       end
three_median1 (generic function with 1 method)

julia> @btime test($x,$three_median1)
  10.053 μs (0 allocations: 0 bytes)
5017.136463388665

(I think I had put @fastmath on the call to the function inside the test function, and that didn’t work)
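A likely explanation, as far as I understand the macro: @fastmath is a purely syntactic transformation. It rewrites the known arithmetic functions that appear literally in the expression it wraps, and it does not reach into the body of a user-defined function called from that expression. The sketch below uses hypothetical names (plain_min, wrapped) to show the difference.

```julia
# A user-defined function whose body uses the ordinary, NaN-propagating min.
plain_min(a, b) = min(a, b)

# @fastmath applied to the *call*: plain_min is not one of the functions
# the macro knows how to replace, so the call is left unchanged.
wrapped(a, b) = @fastmath plain_min(a, b)

# Inspect what the macro produces in each case (nothing is evaluated here).
ex_direct = @macroexpand @fastmath min(a, b)        # min appears literally
ex_call   = @macroexpand @fastmath plain_min(a, b)  # min is hidden in plain_min

println(ex_direct)
println(ex_call)

# The wrapped call still goes through Base.min, so NaN is propagated.
println(wrapped(NaN, 1.0))
```

So the macro has to be placed where min and max are written, as in the definition of three_median1 above.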

One curiosity: in this last version I thought I would do one comparison fewer:

julia> function three_median3(a,b,c)
           a1, a2  = a < b ? (a, b) : (b, a)
           a3 = a2 < c ? a2 : c
           res = a3 > a1 ? a3 : a1
           return res
       end
three_median3 (generic function with 1 method)

but the native code is the same:

julia> @code_native three_median3(1.0,1.0,1.0)
	.text
; ┌ @ REPL[32] within `three_median3'
	vminsd	%xmm1, %xmm0, %xmm3
	vmaxsd	%xmm0, %xmm1, %xmm0
; │ @ REPL[32]:3 within `three_median3'
	vminsd	%xmm2, %xmm0, %xmm0
; │ @ REPL[32] within `three_median3'
	vmaxsd	%xmm3, %xmm0, %xmm0
; │ @ REPL[32]:5 within `three_median3'
	retq
	nopw	%cs:(%rax,%rax)
; └

(the compiler decided that swapping the values is not worth saving one comparison, something like that).
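Since the native code is the same, the two versions should also agree on results for ordinary finite inputs. Here is a small sanity check, only a sketch: reference_median and count_mismatches are hypothetical helpers, and the check draws random values from rand(), so it says nothing about NaN or signed zeros.

```julia
# The @fastmath min/max version from this thread.
three_median1(a, b, c) = @fastmath max(min(a, b), min(max(a, b), c))

# The comparison-only version from this thread.
function three_median3(a, b, c)
    a1, a2 = a < b ? (a, b) : (b, a)
    a3 = a2 < c ? a2 : c
    return a3 > a1 ? a3 : a1
end

# Hypothetical reference: sort the three values and take the middle one.
reference_median(a, b, c) = sort([a, b, c])[2]

# Count random triples on which either version disagrees with the reference.
function count_mismatches(n)
    mismatches = 0
    for _ in 1:n
        a, b, c = rand(), rand(), rand()
        ref = reference_median(a, b, c)
        if three_median1(a, b, c) != ref || three_median3(a, b, c) != ref
            mismatches += 1
        end
    end
    return mismatches
end

println("mismatches: ", count_mismatches(10_000))
```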

But, in the end, Fortran is able to optimize the min/max version to the optimal code, isn’t it? Does it behave like Julia in this regard (performing NaN checks without fastmath and skipping them with it)?
(I don’t know how to check the assembly code in these cases).

You can check them with Godbolt:

Here is the latest Intel Fortran compiler Classic with -O3:

three_median_:
        movsd     xmm2, QWORD PTR [rdi]                         #5.3
        movsd     xmm1, QWORD PTR [rsi]                         #5.3
        movaps    xmm0, xmm2                                    #7.1
        maxsd     xmm2, xmm1                                    #7.1
        minsd     xmm0, xmm1                                    #7.1
        minsd     xmm2, QWORD PTR [rdx]                         #7.1
        maxsd     xmm0, xmm2                                    #7.1
        ret   

And here is the gfortran generated assembly from the link in @lkedward’s reply:

three_median_:
        movsd   xmm1, QWORD PTR [rdi]
        movsd   xmm2, QWORD PTR [rsi]
        movapd  xmm0, xmm1
        minsd   xmm1, xmm2
        maxsd   xmm0, xmm2
        minsd   xmm0, QWORD PTR [rdx]
        maxsd   xmm0, xmm1
        ret

Nice tool. The version with only conditionals seems to generate fewer instructions: Compiler Explorer

With only conditionals there is probably a small performance gain, although, as far as I understood from the documentation, the max/min functions handle NaNs (I was expecting --fast-math or some other flag to make the assemblies converge, but I couldn’t find such a flag, if it exists).
