Fastest sorting for a specific genre of unordered numbers

I did misunderstand you initially.

The data are not always sorted in the exponent field. I just ran your program over several thousand of data files, and found quite a few exceptions where the exponent field decreased, with increasing index.

@mecej4 I change the Github benchmark. Now it loads in ~11,000 different arrays that I need to sort from a data file, and benchmarks routines by sorting all these different cases. I also added a routine by @pmk sortedIndex. The stdlib routine still wins.

The stdlib sorting module was an admirable effort from @wclodius. Have a look at the original issue to see all the benchmarking he did: Implement sorting algorithms · Issue #98 · fortran-lang/stdlib · GitHub

I’d also bring your attention to my post in which I listed a bunch of other sorting routines.

An easy way to sort a contiguous Fortran array is to build upon the C++ standard library functions std::sort and std::stable_sort. You do however need a compiler that supports the std::span container introduced in C++20. The code demonstrated below was tested succesfully with gcc 11.1.0

Here are the C++ routines:

#include <algorithm>
#include <numeric>
#include <span>

extern "C" {

// Finds the indices that would sort an array
//
// Performs an indirect sort of the rank-1 array `a` of size `n`. 
// Indices are returned in the array `index`.
// Fortran or 1-based numbering is assumed for the `index` array.
//
void cpp_argsort(int n, const double *a, int *index) {

  std::span inds(index, static_cast<size_t>(n));
  std::iota(inds.begin(),inds.end(),1);

  auto compare = [&a](const int& il, const int& ir) {
    return a[il-1] < a[ir-1];
  };

  std::sort(inds.begin(),inds.end(),compare);

}

// Finds the indices that would sort an array while preserving order
// between equal elements.
//
// Performs an indirect sort of the rank-1 array `a` of size `n`. 
// Indices are returned in the array `index`.
// Fortran or 1-based numbering is assumed for the `index` array.
//
void cpp_stable_argsort(int n, const double *a, int *index) {

  std::span inds(index, static_cast<size_t>(n));
  std::iota(inds.begin(),inds.end(),1);

  auto compare = [&a](const int& il, const int& ir) {
    return a[il-1] < a[ir-1];
  };

  std::stable_sort(inds.begin(),inds.end(),compare);
}


} // extern "C"

First, extern "C" keyword is used to generate external linkage and prevent compiler name mangling. The dummy variables are all C/Fortran compatible. The span container is used to “wrap” the raw contiguous index buffer with the whole C++ iterator machinery. Next we use std::iota to fill the index array with sequentially increasing values starting from 1 due to Fortran indexing. A lambda expression is used to perform the comparison, by referencing into the original array of doubles a with a capture y reference. Finally, we pass the iterators and lambda compare function to the actual sorting routine.

Here’s a Fortran driver that calls the procedure:

program test_cpp_argsort

   use iso_c_binding, only: c_double, c_int
   implicit none

   interface
      subroutine cpp_argsort(n,array,index) bind(C,name="cpp_argsort")
         import c_double, c_int
         integer(c_int), intent(in), value :: n
         real(c_double), intent(in) :: array(*)
         integer(c_int), intent(inout) :: index(*)
      end subroutine
      subroutine cpp_stable_argsort(n,array,index) bind(C, &
           name="cpp_stable_argsort")
         import c_double, c_int
         integer(c_int), intent(in), value :: n
         real(c_double), intent(in) :: array(*)
         integer(c_int), intent(inout) :: index(*)
      end subroutine
   end interface

   integer, parameter :: n = 5
   real(c_double) :: a(n)
   integer(c_int) :: inds(n)

   a = [real(c_double) :: 1., 5., 4., 2., 3.]

   write(*,'(A)') "Before:"
   write(*,*) a

   call cpp_argsort(n,a,inds)

   write(*,'(A)') "After:"
   write(*,*) a(inds)

end program

Finally, the result:

$ g++ -c -std=c++20 -O2 -march=native cpp_argsort.cc 
$ gfortran -Wall -O2 -march=native test_cpp_argsort.f90 cpp_argsort.o -o test -lstdc++
$ ./test
Before:
   1.0000000000000000        5.0000000000000000        4.0000000000000000        2.0000000000000000        3.0000000000000000     
After:
   1.0000000000000000        2.0000000000000000        3.0000000000000000        4.0000000000000000        5.0000000000000000    

At least for the built-in types which are interoperable, using std::span is an easy way to unleash the full potential of the C++ standard library algorithms.

3 Likes

I added the C++ std routines to my benchmark. They are right on par with the fortran-stdlib routines.

2 Likes

Are you using any optimization flags?

Here’s what I got on an Intel® Core™ i7-11700K with Ubuntu and gcc 11.1.0 (note I added -O3 -march=native using the FFLAGS and CXXFLAGS environment variables)

$ ./sort_benchmark 
| algorithm                     | Time for 1 sort (s) |
| ----------------------------- | ------------------- |
| futils argsort                |       2.7011091E-05 |
| stdlib sort_index             |       2.1951818E-06 |
| sorting_module quicksort      |       9.6158182E-06 |
| sorter sortedIndex            |       2.0240000E-06 |
| C++ std::sort                 |       1.8006364E-06 |
| C++ std::stable_sort          |       1.5749091E-06 |
| quicksort_own_2d_swapped      |       3.8420000E-06 |

C++ is consistently on top by a small margin.

Edit: earlier I didn’t apply the Release mode. With Release mode, the C++ sort and Fortran stdlib sort are practically equal:

$ ./sort_benchmark 
| algorithm                     | Time for 1 sort (s) |
| ----------------------------- | ------------------- |
| futils argsort                |       2.6566909E-05 |
| stdlib sort_index             |       1.5939091E-06 |
| sorting_module quicksort      |       8.8872727E-06 |
| sorter sortedIndex            |       1.9198182E-06 |
| C++ std::sort                 |       1.5998182E-06 |
| C++ std::stable_sort          |       1.3950000E-06 |
| quicksort_own_2d_swapped      |       3.7874545E-06 |

2 Likes

For Fortran i am using GNU Fortran (Homebrew GCC 11.2.0_3) 11.2.0, and for C++ I am using Apple clang version 13.0.0 (clang-1300.0.29.30). I have a Apple M1 Macbook Air.

I just added -march=native, which gives a tiny speed boost. But ya std::sort and stdlib are about the same.

I use my Fortran indexed quick sort, a non-recursive sort by keeping a pair-list of the un-sorted sub-lists.
For the pivot value I choose the middle value, which works well for partially sorted lists. (other examples might choose the first value which is ok for random values but might be poor for partially ordered lists)
I use an order test that retains the order of repeated values, which more applies to integer sorts but is less important for real value sorts (assuming fewer duplicates).
I also use a bubble sort when the the sub-list is less than 8 to 16, although “8” can depend on the data type and processor. ( about 10% to 20% improvement )

I have used this sort for many years on lists of real(8) values ( x,y or z or ax+by+cz values for 100k points in a finite element mesh). Unlike the repeated claims of instability, I have found this sort to be robust, not poor for ordered lists.

1 Like

The options you used in your CMake file gave the best overall results for me; tried a variety or sort libraries. The only one that consistently beat the stdlib library was the mrgrnk procedure from orderpack, but only by a few percent; depending on the compiler and options sometimes the sort_quick_rx procedure from M_sort was faster; but the fastest times were orderpack mrgrnk and stdlib sort_index. Since sorting speed can be quite sensitive to partial ordering and array size the orders might change if the datafile is not representative of your general input, but it looks like stdlib with the optimization flags you are using is a good choice based on what I found. I increased the iterations a bit to eliminate some of the noise, but looking at the fastest dozen runs stdlib looks good; the orderpack and sort_quick_rx routines were sometimes faster, mostly with other compilers. It is interesting how much variation I saw in which was fastest depending on which compiler was used, but it was still always those three.

| orderpack_mrgrnk              |       3.3981818E-06 |
| orderpack_mrgrnk              |       3.6169091E-06 |
| stdlib_sort_index             |       3.8764545E-06 |
| stdlib_sort_index             |       3.9332727E-06 |
| orderpack_mrgrnk              |       4.2291818E-06 |
| qsort_inline                  |       4.2962727E-06 |
| M_sort_sort_quick_rx          |       4.4433636E-06 |
| M_sort_sort_quick_rx          |       4.4838182E-06 |
| qsort_inline                  |       4.4886364E-06 |
| M_sort_sort_quick_rx          |       4.6305455E-06 |
| qsort_inline                  |       4.7409091E-06 |
| stdlib_sort_index             |       5.3317273E-06 |

PS: the qsort_inline prodedure from the Fortran Wiki, customized to your data type shows up too; it is a very similiar algorithm to the M_sort sort_quick_rx routine, and the M_sort routines are already generic so easier to use.

4 Likes

I can confirm, mrgrnk from ORDERPACK is quite fast for this problem:

$ ./sort_benchmark 
| algorithm                     | Time for 1 sort (s) |
| ----------------------------- | ------------------- |
| futils argsort                |       2.6560636E-05 |
| stdlib sort_index             |       1.5619091E-06 |
| sorting_module quicksort      |       9.1251818E-06 |
| sorter sortedIndex            |       2.1222727E-06 |
| C++ std::sort                 |       1.5285455E-06 |
| C++ std::stable_sort          |       1.4374545E-06 |
| quicksort_own_2d_swapped      |       3.7542727E-06 |
| mrgrnk                        |       1.0954545E-06 |

It’s good to know that specialized Fortran routines can still outperform the C++ standard library. With mrgrnk we’re looking at a speed-up of ×20 (for the sorting part) already compared to the original futils subroutine.

@nicholaswogan, does this mean the other part of your computation (binning?) becomes the bottleneck?

1 Like

Check if you activated (by removing the comment mark) the lines in the source file mrgrnk.f90 that are preceded by the comment “This line may be activated when the initial set is already close to sorted.”

Doing this may give you an additional speed increase.

The C/C++ standard library sorting routines are usually handicapped by having to call a user-provided routine for comparing pairs of items from the array being sorted. The Fortran sorting routines, by doing these comparisons inline, may turn out to be slightly faster. But then, if we were to write the C sorting routines with inline comparisons (instead of making function calls for comparisons), those would become faster, as well.

1 Like

@urbanjost, Thanks so much for the benchmarks. I added mrgrnk to the Github benchmark. I also find that it barely outperforms stdlib.

@mecej4, uncommenting that line in mrgrnk gives give a little extra performance! thanks.

@ivanpribec, I am still bottlenecked by sorting. Here are some profiling results from XCode Time Profiler, where I have implemented each sorting algorithm into my main code.

futils argsort

stdlib sort_index

mrgrnk (with line uncommented)

The time I care about is the time of clima_climate_MOD_radiative_transfer. So mrgrnk does the best overall. Both sort_index and mrgrnk are massive improvements over argsort from futils.

Seems like mrgrnk is about as fast as it is going to get for sorting. Since this is still ~46% of computation time, there isn’t a ton to gain from optimizing the rebin subroutine (23% computation time). It would be great to optimize rebin some, but I’ve fiddled with it, and can’t imagine a way to make it a lot faster.

I consider this a big success! Thanks so much for all your help so far! My code is over 4x faster! Any further recommendations are very welcome.

The project GitHub - cphyc/Fortran-parallel-sort: A fortran library to perform parallel sorts. uses mrgrnk with OpenMP. Compiling it on WSL2 with GNU Fortran (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 I get

(base) /mnt/c/fortran/public_domain/github/Fortran-parallel-sort$ gfortran -march=native -O3 -fopenmp m_mrgrnk.f90 mod_sort.f90 test.f90
(base) /mnt/c/fortran/public_domain/github/Fortran-parallel-sort$ ./a.out
 mrgrnk took   101.30691999999954      ms/iter
 parallel sort took   43.551079999999729      ms/iter
 speed up factor    2.3261632088113582     
 using            4  threads
 efficiency is    58.154080220283952      percent
 Note this assumes the serial version is unvectorized.
 However it probably is auto-vectorized which contributes to the efficiency appearing low.
           0  of 8 sorting algorithms have errors.
2 Likes

Was just thinking about cloning ORDERPACK2.0 on github as an fpm(1) package and was looking to see if anyone had and as a side-effect just ran across the parallel version; and was just thinking about trying it or basing a coarray version off it and came back to mention it and you already have it running! Nice.

@ivanpribec – you got a much nicer bump than I did with ORDERPACK; might be hardware, etc… but wondering what compiler options you used or what else might be different that might explain that? I was using test data different from the original, so that is a likely reason; but was thinking it might be something else.

This isn’t completely true for the C++ sorting routine. Since std::sort is a template, the compiler will effectively inline the comparison. Using a functor (a struct with the operator()) or lambda expression instead of a global function gives the compiler stronger guarantees which raise the chances for inlining.

The faster operation of std::sort in comparison to qsort is discussed in several places:

The page from Martin Ueding quite clearly shows a factor two speed-up for arrays with more than 10^5 elements.

Modern C++ and Fortran can work very well together IMO. It’s just a shame one always has to flatten all the interfaces down to C level for compatibility which makes it a lot of tedious work.


@urbanjost - I used the default release options specified by the CMake build, following the instructions on Nick’s GitHub repo. The only change I made to the driver, was to add the following section (and time printout):

  call cpu_time(t(8))
  do i = 1,n_dat
    x8 = all_data(:,i)
    call mrgrnk(x8, inds1)
    x8 = x8(inds1)
  enddo
  call cpu_time(t(9))
2 Likes

Thanks, @ivanpribec, for your highlighting the differences between qsort() and std::sort. It would be useful to Fortran programmers to have a recipe for calling std::sort from a Fortran program (including a recipe for specifying the comparison function/functor in C++, if necessary, or in Fortran, if possible).

I read a couple of your references, and I wondered why, when comparing the qsort() of C and the std::sort of C++ they did not note that the two routines/ methods may well use different algorithms. For example, qsort() may use mergesort in one C library, and quicksort in another C library! Furthermore, the scaling law for sort time versus array size may be different for the two implementations, especially if even one of the two uses a hybrid algorithm.

@mecej4, you make several good points. Upon re-reading the Stack Overflow threads, a few less visible comments did remark the differences could be attributed to different algorithms. We would need to look at the C and C++ runtime-library implementations to know for sure. From the pragmatic point of view C++ seems to have a small edge. Both routines are black-boxes from the user perspective.

With regard to a more general recipe, are you interested in sorting also other types than the intrinsic ones? Is an indirect sort sufficient, or would an in-place sort also be desirable? For derived types, the C++ code will be a somewhat more complex because the size of the array elements has to be communicated manually across the language barrier.

One more difference is that the C++ template library makes it very easy to write an indirect sort since the “lambda” comparison function can capture variables from the enclosing scope.

With the standard C qsort you would need to pack your keys and indexes into an array of structures, leading to duplication. The fundamental issue is that C doesn’t support nested functions.

That raises the interesting question whether a Fortran internal subprogram can have the bind(C) attribute and could hence be used to do an indirect sort using C’s qsort?

1 Like

Counter to what I expected, it’s perfectly feasible. Even more surprisingly, host association in internal functions permits using qsort for indirect sorting, which is not possible in pure C without the use of global variables.

Here’s a demonstration:

module sorting
   use, intrinsic :: iso_c_binding
   implicit none
   private
   public :: argsort
contains

   !> Returns the indices that would sort an array
   !>
   !> Arguments:
   !>   n: number of elements
   !>   list: on entry, list of indices 1..(1)..n 
   !>           (for reverse sort, n..(-1)..1)
   !>         on return, the indices such that keys(list) would be sorted
   !>   keys: array to be sorted
   !>
   subroutine argsort(n,list,keys)
      integer, intent(in) :: n
      integer, intent(inout), target :: list(*)
      integer, intent(in) :: keys(*)

      interface
         subroutine qsort(ptr,count,size,comp) bind(c,name="qsort")
            import c_ptr, c_int, c_size_t, c_funptr
            implicit none
            type(c_ptr), intent(in), value :: ptr
               !> pointer to the array to sort
            integer(c_size_t), intent(in), value :: count
               !> number of elements in the array
            integer(c_size_t), intent(in), value :: size
               !> size of each element in the array in bytes
            type(c_funptr), intent(in), value :: comp
               !> comparison function which returns ​a negative integer value 
               !> if the first argument is less than the second, a positive 
               !> integer value if the first argument is greater than the 
               !> second and zero if the arguments are equivalent
         end subroutine
      end interface

      if (n < 2) 
         ! list already sorted
         return
      end if

      call qsort(c_loc(list),int(n,c_size_t),c_sizeof(list(1)), &
         c_funloc(argsort_compare))

   contains
      ! technically, the inputs should be void pointers
      ! but works nevertheless
      integer(c_int) function argsort_compare(a,b) bind(c)
         integer(c_int), intent(in) :: a, b
         argsort_compare = keys(a) - keys(b)
      end function
   end subroutine
end module

program main

   use sorting
   implicit none

   integer :: list(4)
   integer :: keys(4)

   list = [1,2,3,4]
   keys = [7,5,6,3]

   call argsort(4,list,keys)

   ! Supposed to print [ 3, 5, 6, 7]
   print *, keys(list)

end program

Notice the comparison function is encoded in Fortran. So the conclusion in my previous post is curiously enough - wrong. You can perform an indirect sort using qsort without global variables from Fortran. :exploding_head:

Further, we could give argsort the bind(c) attribute, making it a callable from C again. Using Fortran to bypass restrictions of C would make an interesting addition to the Wikipedia entry on qsort.

2 Likes

That’s a very interesting interface - I am not sure if the interfacing plays a role on some overhead, but I’ve compared it against my own implementation of argsort and the results are quite interesting: with gfortran, the Fortran version is always at least some ~20-30% faster than the C counterpart, see for example here:

module sorting_f
   use iso_fortran_env, only: int32,int64
   implicit none
   private

   integer, parameter, public :: IKIND = int32
   integer, parameter, public :: ISIZE = int64
  
   public :: argsort_f
contains


   !> Argsort: Federico's implementation
   subroutine argsort_f(n,list,keys)
      integer(ISIZE), intent(in)    :: n
      integer(ISIZE), intent(inout) :: list(*) ! indices to the sorted array
      integer(IKIND), intent(in)    :: keys(*) ! Array to be sorted

      integer :: sorted(n)

      sorted = keys(1:n)
      call int_quick_sort_andlist(sorted,list(1:n))

   end subroutine argsort_f

   pure recursive subroutine int_quick_sort_andlist(list,ilist,down)

      integer(IKIND), dimension(:), intent(inout) :: list
      integer(ISIZE), dimension(size(list)), intent(inout) :: ilist
      logical, optional, intent(in) :: down

      integer(ISIZE)     :: i, j, n
      integer(ISIZE), parameter :: max_simple_sort_size = 8
      integer(IKIND)     :: chosen
      logical            :: descending

      descending = .false.; if (present(down)) descending = down
      n = size(list)

      choose_sorting_algorithm: if (n <= max_simple_sort_size) then
         ! Use interchange sort for small lists
         do i = 1, size(list) - 1
            do j = i + 1, size(list)
               if (toBeSwapped(list(i),list(j),.false.)) then
                   call swap(list(i),list(j))
                   call swaplist(ilist(i),ilist(j))
               end if
            end do
         end do

      else
         ! Use partition (quick) sort if the list is big
         chosen = list(int(n/2))
         i = 0
         j = n + 1

         scan_lists: do
            ! Scan list from left end
            ! until element >= chosen is found
            scan_from_left: do
              i = i + 1
              if (toBeSwapped(list(i),chosen,.true.) .or. i==n) exit scan_from_left
            end do scan_from_left

            ! Scan list from right end
            ! until element <= chosen is found

            scan_from_right: do
               j = j - 1
               if (toBeSwapped(chosen,list(j),.true.) .or. j==1) exit scan_from_right
            end do scan_from_right

            swap_element: if (i < j) then
                ! Swap two out of place elements
                call swap(list(i),list(j))
                call swaplist(ilist(i),ilist(j))
            else if (i == j) then
                i = i + 1
                exit
            else
                exit
            endif swap_element

         end do scan_lists

         if (1 < j) call int_quick_sort_andlist(list(:j),ilist(:j),down)
         if (i < n) call int_quick_sort_andlist(list(i:),ilist(i:),down)

      end if choose_sorting_algorithm ! test for small array

      contains

         elemental logical function toBeSwapped(a,b,orEqual)
            integer(IKIND), intent(in) :: a,b
            logical, intent(in) :: orEqual

            toBeSwapped = merge(a<b,a>b,descending)
            if (orEqual .and. a==b) toBeSwapped = .true.

         end function toBeSwapped

         elemental subroutine swap(a,b)
            integer(IKIND), intent(inout) :: a,b
            integer(IKIND) :: tmp
            tmp = a
            a = b
            b = tmp
         end subroutine swap
         elemental subroutine swaplist(a,b)
            integer(ISIZE), intent(inout) :: a,b
            integer(ISIZE) :: tmp
            tmp = a
            a = b
            b = tmp
         end subroutine swaplist

   end subroutine int_quick_sort_andlist

end module sorting_f

module sorting
   use, intrinsic :: iso_c_binding
   implicit none
   private
   public :: argsort
contains

   !> Returns the indices that would sort an array
   !>
   !> Arguments:
   !>   n: number of elements
   !>   list: on entry, list of indices 1..(1)..n 
   !>           (for reverse sort, n..(-1)..1)
   !>         on return, the indices such that keys(list) would be sorted
   !>   keys: array to be sorted
   !>
   subroutine argsort(n,list,keys)
      integer, intent(in) :: n
      integer, intent(inout), target :: list(*)
      integer, intent(in) :: keys(*)

      interface
         subroutine qsort(ptr,count,size,comp) bind(c,name="qsort")
            import c_ptr, c_int, c_size_t, c_funptr
            implicit none
            type(c_ptr), intent(in), value :: ptr
               !> pointer to the array to sort
            integer(c_size_t), intent(in), value :: count
               !> number of elements in the array
            integer(c_size_t), intent(in), value :: size
               !> size of each element in the array in bytes
            type(c_funptr), intent(in), value :: comp
               !> comparison function which returns ​a negative integer value 
               !> if the first argument is less than the second, a positive 
               !> integer value if the first argument is greater than the 
               !> second and zero if the arguments are equivalent
         end subroutine
      end interface

      if (n < 2) then
         ! list already sorted
         return
      end if

      call qsort(c_loc(list),int(n,c_size_t),c_sizeof(list(1)), &
         c_funloc(argsort_compare))

   contains
      ! technically, the inputs should be void pointers
      ! but works nevertheless
      integer(c_int) function argsort_compare(a,b) bind(c)
         integer(c_int), intent(in) :: a, b
         argsort_compare = keys(a) - keys(b)
      end function
   end subroutine

end module sorting

program main
   use iso_c_binding, only: c_int
   use sorting
   use sorting_f
   implicit none

   integer(ISIZE), parameter :: N = 100000

   real, allocatable :: x(:)
   integer(ISIZE), allocatable :: listf(:)
   integer(IKIND), allocatable :: keys(:),list(:)
   integer(ISIZE) :: j
   real :: t0,t1,t2

   allocate(list(N),keys(N),x(N))
   call random_number(x)
   keys(:) = nint(N*x)
   list(:) = [(j,j=1,N)]
   listF   = list

   call cpu_time(t0); call argsort(int(N,c_int),list,keys)
   print *, keys(1:5),list(1:5)
   call cpu_time(t1); call argsort_f(N,listf,keys)
   print *, keys(1:5),list(1:5)
   call cpu_time(t2)

   ! Supposed to print [ 3, 5, 6, 7]
   print *, 'C time=',t1-t0
   print *, 'F time=',t2-t1

end program