Simplicial interpolation with sorting network

Dear all:

I am reading this paper regarding some tricks for faster interpolation on both rectangular and simplex:

https://dl.acm.org/doi/pdf/10.1145/3423184

I cannot say I fully understand this paper, but section 6 talks about using a sorting network to deal with the necessary small-scale sorting that identifies the correct simplex containing the desired point.

Is there any fortran implementation on this “sorting network” thing? I saw that there are discussions on how this great community has implemented numerous sorting algorithm in stdlib, but not yet seen anything regarding this sorting network. One paper cited by the paper above is this one:

https://dl.acm.org/doi/pdf/10.1145/1468075.1468121

Thanks a lot in advance for helping me in the first place!

Hui-Jun Chen

Sorting networks are pretty simple, I am using those (e.g. in quicksort-like sorting for small array sections). Here is an example for arrays of size 4:

subroutine sort_network_arr4_int(x)
   integer, dimension(1:4), intent(inout) :: x

   integer :: z1, z2

   _COMPAREARRSWAP_(1, 2, x, z1)
   _COMPAREARRSWAP_(3, 4, x, z2)

   _COMPAREARRSWAP_(1, 3, x, z1)
   _COMPAREARRSWAP_(2, 4, x, z2)

   _COMPAREARRSWAP_(2, 3, x, z1)
end subroutine sort_network_arr4_int

with macro

#define _COMPAREARRSWAP_(i,j,x,z)      if (x(i)>x(j)) then; z=x(i);x(i)=x(j);x(j)=z; end if

I use a macro to avoid any suprises with non-inlining by compilers.
I have never taken time to look at the assembler code of this and still wonder, whether compilers are able to do the (independent) compare-swap operations in parallel, using some kind of simd-cmov (is there any?)

Is there any better way to formulate such networks in fortran (better=better performance)?

1 Like

A possible simd instruction for the compare would be a variant of VPCMPx or PCMPGTx
The simd instruction for conditional copy could be some blend instruction like PBLENDx. But I doubt that compilers are able to explore this.

Looking at the assembly of your subroutine in Compiler Explorer (compiled with gfortran -O3 -march=skylake), it contains a number of jumps:

__foo_MOD_sort_network_arr4_int:
        vmovq   xmm0, QWORD PTR [rdi]
        vpextrd edx, xmm0, 1
        vmovd   esi, xmm0
        cmp     esi, edx
        jle     .L4
        mov     eax, esi
        vpshufd xmm0, xmm0, 225
        mov     esi, edx
        vmovq   QWORD PTR [rdi], xmm0
        mov     edx, eax
.L4:
        vmovq   xmm0, QWORD PTR [rdi+8]
        vpextrd ecx, xmm0, 1
        vmovd   eax, xmm0
        cmp     eax, ecx
        jle     .L5
        mov     r8d, eax
        vpshufd xmm0, xmm0, 225
        mov     eax, ecx
        vmovq   QWORD PTR [rdi+8], xmm0
        mov     ecx, r8d
.L5:
        cmp     esi, eax
        jle     .L6
        mov     DWORD PTR [rdi+8], esi
        mov     DWORD PTR [rdi], eax
        mov     eax, esi
.L6:
        cmp     edx, ecx
        jle     .L7
        mov     DWORD PTR [rdi+4], ecx
        mov     DWORD PTR [rdi+12], edx
        mov     edx, ecx
.L7:
        cmp     eax, edx
        jge     .L9
        vmovd   xmm1, eax
        vpinsrd xmm0, xmm1, edx, 1
        vmovq   QWORD PTR [rdi+4], xmm0
.L9:
        ret

Following this lecture, I naively converted it to this:

subroutine sort4(x)
   integer, intent(inout) :: x(4)

   integer :: z1, z2

    call cmpswap(x(1),x(2))
    call cmpswap(x(3),x(4))

    call cmpswap(x(1),x(3))
    call cmpswap(x(2),x(4))

    call cmpswap(x(2),x(3))

contains

    subroutine cmpswap(a,b)
        integer, intent(inout) :: a, b
        integer :: tmp
            tmp = min(a,b)
            b = max(a,b)
            a = tmp
    end subroutine

end subroutine sort4

and the assembly produced looks like this:

__foo_MOD_sort4:
        vmovd   xmm0, DWORD PTR [rdi+4]
        vmovd   xmm3, DWORD PTR [rdi]
        vmovd   xmm2, DWORD PTR [rdi+12]
        vpminsd xmm4, xmm3, xmm0
        vpmaxsd xmm3, xmm3, xmm0
        vmovd   xmm0, DWORD PTR [rdi+8]
        vpminsd xmm1, xmm0, xmm2
        vpmaxsd xmm0, xmm0, xmm2
        vpmaxsd xmm5, xmm1, xmm4
        vpminsd xmm2, xmm0, xmm3
        vpminsd xmm1, xmm1, xmm4
        vpmaxsd xmm0, xmm0, xmm3
        vpminsd xmm4, xmm2, xmm5
        vmovd   eax, xmm0
        vpmaxsd xmm2, xmm2, xmm5
        vmovd   edx, xmm4
        vpinsrd xmm2, xmm2, eax, 1
        vpinsrd xmm1, xmm1, edx, 1
        vpunpcklqdq     xmm1, xmm1, xmm2
        vmovdqu XMMWORD PTR [rdi], xmm1
        ret

When called like this:

    a = [3,1,4,2]
    call sort_network_arr4_int(a)
    print *, a

    a = [3,1,4,2]
    call sort4(a)
    print *, a

I get the output

           1           2           3           4
           1           2           3           4
4 Likes

That’s a good idea using min/max (even if not really parallel, at least it has no conditional jumps). However, I would have expected that compilers would use cmov instead of jumps. But maybe there are some data dependency problems as it requires three conditional moves, or compilers are not able to translate the three moves in the if-block into three separate cmov’s.

It looks like ifx (-O3 -march=skylake) does it this way when using the min/max approach:

foo_mp_sort4_:
        mov     eax, dword ptr [rdi]
        mov     ecx, dword ptr [rdi + 4]
        cmp     eax, ecx
        mov     edx, ecx
        cmovl   edx, eax
        cmovg   ecx, eax
        mov     eax, dword ptr [rdi + 8]
        mov     esi, dword ptr [rdi + 12]
        cmp     eax, esi
        mov     r8d, esi
        cmovl   r8d, eax
        cmovg   esi, eax
        cmp     edx, r8d
        mov     eax, r8d
        cmovl   eax, edx
        cmovg   r8d, edx
        mov     dword ptr [rdi], eax
        cmp     ecx, esi
        mov     eax, esi
        cmovl   eax, ecx
        cmovg   esi, ecx
        mov     dword ptr [rdi + 12], esi
        cmp     eax, r8d
        mov     ecx, r8d
        cmovl   ecx, eax
        cmovle  eax, r8d
        mov     dword ptr [rdi + 8], eax
        mov     dword ptr [rdi + 4], ecx
        ret

That was meant with regards to the original compare-swap macro, for which no compiler generates any cmov code. For min/max the cmov or vpmin/vpmax variants are probably equally fast.

I just checked my 3-way quicksort with original and min/max variants and networks for sections from length 2 to 8. The gain was not small, about 3-5% for both compilers gfortran and ifort. I would have expected more.
Maybe I will run perf to see where the hotspots are how much time is actually spent in those networks. Just out of curiosity.

I just remembered I read about sorting networks here:

There was also this article from the DeepMind team (hotly debated on HackerNews),

where they managed to get rid of one instruction in a sorting network, and provided a patch for this to libc++. It’s relatively easy to use the C++ sort in Fortran, as shown here (with the downside it is linked and not instantiated in-place, e.g. for constant arrays).

Interesting links, thanks. In practice, it often does not really matter whether it runs slightly faster or not… But it is fun to dive into these details. Anyway here are some further numbers (only gfortran with large sets):

Sorting many shuffled arrays of size 100 yields a 10% improvement with the minmax variant over swap variant.

Sorting many shuffled arrays of size 8 directly with a sorting network gives a 85% (factor ca. 7) improvement with minmax.

Thus regarding the original question where sorting of small arrays is required, it looks like that a conditional-jump-free version provided by a sorting network can make a huge difference.

3 Likes

May I ask is it possible to save the index of the sorted array with this min-max approach?

In the paper the author wrote:

To leverage this more efficient sorting, we adopt the bit-packing technique described
in Figure 4, encoding the index in the low-order bits.

May I also ask how could we use encoding and bit manipulation to do such thing?