# Fastest sorting for a specific genre of unordered numbers

I did misunderstand you initially.

The data are not always sorted in the exponent field. I just ran your program over several thousand of data files, and found quite a few exceptions where the exponent field decreased, with increasing index.

@mecej4 I change the Github benchmark. Now it loads in ~11,000 different arrays that I need to sort from a data file, and benchmarks routines by sorting all these different cases. I also added a routine by @pmk `sortedIndex`. The stdlib routine still wins.

The stdlib sorting module was an admirable effort from @wclodius. Have a look at the original issue to see all the benchmarking he did: Implement sorting algorithms · Issue #98 · fortran-lang/stdlib · GitHub

I’d also bring your attention to my post in which I listed a bunch of other sorting routines.

An easy way to sort a contiguous Fortran array is to build upon the C++ standard library functions `std::sort` and `std::stable_sort`. You do however need a compiler that supports the `std::span` container introduced in C++20. The code demonstrated below was tested succesfully with gcc 11.1.0

Here are the C++ routines:

``````#include <algorithm>
#include <numeric>
#include <span>

extern "C" {

// Finds the indices that would sort an array
//
// Performs an indirect sort of the rank-1 array `a` of size `n`.
// Indices are returned in the array `index`.
// Fortran or 1-based numbering is assumed for the `index` array.
//
void cpp_argsort(int n, const double *a, int *index) {

std::span inds(index, static_cast<size_t>(n));
std::iota(inds.begin(),inds.end(),1);

auto compare = [&a](const int& il, const int& ir) {
return a[il-1] < a[ir-1];
};

std::sort(inds.begin(),inds.end(),compare);

}

// Finds the indices that would sort an array while preserving order
// between equal elements.
//
// Performs an indirect sort of the rank-1 array `a` of size `n`.
// Indices are returned in the array `index`.
// Fortran or 1-based numbering is assumed for the `index` array.
//
void cpp_stable_argsort(int n, const double *a, int *index) {

std::span inds(index, static_cast<size_t>(n));
std::iota(inds.begin(),inds.end(),1);

auto compare = [&a](const int& il, const int& ir) {
return a[il-1] < a[ir-1];
};

std::stable_sort(inds.begin(),inds.end(),compare);
}

} // extern "C"
``````

First, `extern "C"` keyword is used to generate external linkage and prevent compiler name mangling. The dummy variables are all C/Fortran compatible. The span container is used to “wrap” the raw contiguous index buffer with the whole C++ iterator machinery. Next we use `std::iota` to fill the index array with sequentially increasing values starting from 1 due to Fortran indexing. A lambda expression is used to perform the comparison, by referencing into the original array of doubles `a` with a capture y reference. Finally, we pass the iterators and lambda compare function to the actual sorting routine.

Here’s a Fortran driver that calls the procedure:

``````program test_cpp_argsort

use iso_c_binding, only: c_double, c_int
implicit none

interface
subroutine cpp_argsort(n,array,index) bind(C,name="cpp_argsort")
import c_double, c_int
integer(c_int), intent(in), value :: n
real(c_double), intent(in) :: array(*)
integer(c_int), intent(inout) :: index(*)
end subroutine
subroutine cpp_stable_argsort(n,array,index) bind(C, &
name="cpp_stable_argsort")
import c_double, c_int
integer(c_int), intent(in), value :: n
real(c_double), intent(in) :: array(*)
integer(c_int), intent(inout) :: index(*)
end subroutine
end interface

integer, parameter :: n = 5
real(c_double) :: a(n)
integer(c_int) :: inds(n)

a = [real(c_double) :: 1., 5., 4., 2., 3.]

write(*,'(A)') "Before:"
write(*,*) a

call cpp_argsort(n,a,inds)

write(*,'(A)') "After:"
write(*,*) a(inds)

end program
``````

Finally, the result:

``````\$ g++ -c -std=c++20 -O2 -march=native cpp_argsort.cc
\$ gfortran -Wall -O2 -march=native test_cpp_argsort.f90 cpp_argsort.o -o test -lstdc++
\$ ./test
Before:
1.0000000000000000        5.0000000000000000        4.0000000000000000        2.0000000000000000        3.0000000000000000
After:
1.0000000000000000        2.0000000000000000        3.0000000000000000        4.0000000000000000        5.0000000000000000
``````

At least for the built-in types which are interoperable, using `std::span` is an easy way to unleash the full potential of the C++ standard library algorithms.

3 Likes

I added the C++ std routines to my benchmark. They are right on par with the fortran-stdlib routines.

2 Likes

Are you using any optimization flags?

Here’s what I got on an Intel® Core™ i7-11700K with Ubuntu and gcc 11.1.0 (note I added `-O3 -march=native` using the `FFLAGS` and `CXXFLAGS` environment variables)

``````\$ ./sort_benchmark
| algorithm                     | Time for 1 sort (s) |
| ----------------------------- | ------------------- |
| futils argsort                |       2.7011091E-05 |
| stdlib sort_index             |       2.1951818E-06 |
| sorting_module quicksort      |       9.6158182E-06 |
| sorter sortedIndex            |       2.0240000E-06 |
| C++ std::sort                 |       1.8006364E-06 |
| C++ std::stable_sort          |       1.5749091E-06 |
| quicksort_own_2d_swapped      |       3.8420000E-06 |
``````

C++ is consistently on top by a small margin.

Edit: earlier I didn’t apply the Release mode. With Release mode, the C++ sort and Fortran stdlib sort are practically equal:

``````\$ ./sort_benchmark
| algorithm                     | Time for 1 sort (s) |
| ----------------------------- | ------------------- |
| futils argsort                |       2.6566909E-05 |
| stdlib sort_index             |       1.5939091E-06 |
| sorting_module quicksort      |       8.8872727E-06 |
| sorter sortedIndex            |       1.9198182E-06 |
| C++ std::sort                 |       1.5998182E-06 |
| C++ std::stable_sort          |       1.3950000E-06 |
| quicksort_own_2d_swapped      |       3.7874545E-06 |

``````
2 Likes

For Fortran i am using `GNU Fortran (Homebrew GCC 11.2.0_3) 11.2.0`, and for C++ I am using `Apple clang version 13.0.0 (clang-1300.0.29.30)`. I have a Apple M1 Macbook Air.

I just added `-march=native`, which gives a tiny speed boost. But ya `std::sort` and `stdlib` are about the same.

I use my Fortran indexed quick sort, a non-recursive sort by keeping a pair-list of the un-sorted sub-lists.
For the pivot value I choose the middle value, which works well for partially sorted lists. (other examples might choose the first value which is ok for random values but might be poor for partially ordered lists)
I use an order test that retains the order of repeated values, which more applies to integer sorts but is less important for real value sorts (assuming fewer duplicates).
I also use a bubble sort when the the sub-list is less than 8 to 16, although “8” can depend on the data type and processor. ( about 10% to 20% improvement )

I have used this sort for many years on lists of real(8) values ( x,y or z or ax+by+cz values for 100k points in a finite element mesh). Unlike the repeated claims of instability, I have found this sort to be robust, not poor for ordered lists.

1 Like

The options you used in your CMake file gave the best overall results for me; tried a variety or sort libraries. The only one that consistently beat the stdlib library was the mrgrnk procedure from orderpack, but only by a few percent; depending on the compiler and options sometimes the sort_quick_rx procedure from M_sort was faster; but the fastest times were orderpack mrgrnk and stdlib sort_index. Since sorting speed can be quite sensitive to partial ordering and array size the orders might change if the datafile is not representative of your general input, but it looks like stdlib with the optimization flags you are using is a good choice based on what I found. I increased the iterations a bit to eliminate some of the noise, but looking at the fastest dozen runs stdlib looks good; the orderpack and sort_quick_rx routines were sometimes faster, mostly with other compilers. It is interesting how much variation I saw in which was fastest depending on which compiler was used, but it was still always those three.

``````| orderpack_mrgrnk              |       3.3981818E-06 |
| orderpack_mrgrnk              |       3.6169091E-06 |
| stdlib_sort_index             |       3.8764545E-06 |
| stdlib_sort_index             |       3.9332727E-06 |
| orderpack_mrgrnk              |       4.2291818E-06 |
| qsort_inline                  |       4.2962727E-06 |
| M_sort_sort_quick_rx          |       4.4433636E-06 |
| M_sort_sort_quick_rx          |       4.4838182E-06 |
| qsort_inline                  |       4.4886364E-06 |
| M_sort_sort_quick_rx          |       4.6305455E-06 |
| qsort_inline                  |       4.7409091E-06 |
| stdlib_sort_index             |       5.3317273E-06 |
``````

PS: the qsort_inline prodedure from the Fortran Wiki, customized to your data type shows up too; it is a very similiar algorithm to the M_sort sort_quick_rx routine, and the M_sort routines are already generic so easier to use.

4 Likes

I can confirm, mrgrnk from ORDERPACK is quite fast for this problem:

``````\$ ./sort_benchmark
| algorithm                     | Time for 1 sort (s) |
| ----------------------------- | ------------------- |
| futils argsort                |       2.6560636E-05 |
| stdlib sort_index             |       1.5619091E-06 |
| sorting_module quicksort      |       9.1251818E-06 |
| sorter sortedIndex            |       2.1222727E-06 |
| C++ std::sort                 |       1.5285455E-06 |
| C++ std::stable_sort          |       1.4374545E-06 |
| quicksort_own_2d_swapped      |       3.7542727E-06 |
| mrgrnk                        |       1.0954545E-06 |
``````

It’s good to know that specialized Fortran routines can still outperform the C++ standard library. With mrgrnk we’re looking at a speed-up of ×20 (for the sorting part) already compared to the original futils subroutine.

@nicholaswogan, does this mean the other part of your computation (binning?) becomes the bottleneck?

1 Like

Check if you activated (by removing the comment mark) the lines in the source file mrgrnk.f90 that are preceded by the comment “This line may be activated when the initial set is already close to sorted.”

Doing this may give you an additional speed increase.

The C/C++ standard library sorting routines are usually handicapped by having to call a user-provided routine for comparing pairs of items from the array being sorted. The Fortran sorting routines, by doing these comparisons inline, may turn out to be slightly faster. But then, if we were to write the C sorting routines with inline comparisons (instead of making function calls for comparisons), those would become faster, as well.

1 Like

@urbanjost, Thanks so much for the benchmarks. I added `mrgrnk` to the Github benchmark. I also find that it barely outperforms stdlib.

@mecej4, uncommenting that line in `mrgrnk` gives give a little extra performance! thanks.

@ivanpribec, I am still bottlenecked by sorting. Here are some profiling results from XCode Time Profiler, where I have implemented each sorting algorithm into my main code.

## mrgrnk (with line uncommented)

The time I care about is the time of `clima_climate_MOD_radiative_transfer`. So `mrgrnk` does the best overall. Both `sort_index` and `mrgrnk` are massive improvements over `argsort` from futils.

Seems like `mrgrnk` is about as fast as it is going to get for sorting. Since this is still ~46% of computation time, there isn’t a ton to gain from optimizing the rebin subroutine (23% computation time). It would be great to optimize `rebin` some, but I’ve fiddled with it, and can’t imagine a way to make it a lot faster.

I consider this a big success! Thanks so much for all your help so far! My code is over 4x faster! Any further recommendations are very welcome.

The project GitHub - cphyc/Fortran-parallel-sort: A fortran library to perform parallel sorts. uses `mrgrnk` with OpenMP. Compiling it on WSL2 with GNU Fortran (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 I get

``````(base) /mnt/c/fortran/public_domain/github/Fortran-parallel-sort\$ gfortran -march=native -O3 -fopenmp m_mrgrnk.f90 mod_sort.f90 test.f90
(base) /mnt/c/fortran/public_domain/github/Fortran-parallel-sort\$ ./a.out
mrgrnk took   101.30691999999954      ms/iter
parallel sort took   43.551079999999729      ms/iter
speed up factor    2.3261632088113582
efficiency is    58.154080220283952      percent
Note this assumes the serial version is unvectorized.
However it probably is auto-vectorized which contributes to the efficiency appearing low.
0  of 8 sorting algorithms have errors.
``````
2 Likes

Was just thinking about cloning ORDERPACK2.0 on github as an fpm(1) package and was looking to see if anyone had and as a side-effect just ran across the parallel version; and was just thinking about trying it or basing a coarray version off it and came back to mention it and you already have it running! Nice.

@ivanpribec – you got a much nicer bump than I did with ORDERPACK; might be hardware, etc… but wondering what compiler options you used or what else might be different that might explain that? I was using test data different from the original, so that is a likely reason; but was thinking it might be something else.

This isn’t completely true for the C++ sorting routine. Since `std::sort` is a template, the compiler will effectively inline the comparison. Using a functor (a struct with the `operator()`) or lambda expression instead of a global function gives the compiler stronger guarantees which raise the chances for inlining.

The faster operation of `std::sort` in comparison to `qsort` is discussed in several places:

The page from Martin Ueding quite clearly shows a factor two speed-up for arrays with more than 10^5 elements.

Modern C++ and Fortran can work very well together IMO. It’s just a shame one always has to flatten all the interfaces down to C level for compatibility which makes it a lot of tedious work.

@urbanjost - I used the default release options specified by the CMake build, following the instructions on Nick’s GitHub repo. The only change I made to the driver, was to add the following section (and time printout):

``````  call cpu_time(t(8))
do i = 1,n_dat
x8 = all_data(:,i)
call mrgrnk(x8, inds1)
x8 = x8(inds1)
enddo
call cpu_time(t(9))
``````
2 Likes

Thanks, @ivanpribec, for your highlighting the differences between qsort() and std::sort. It would be useful to Fortran programmers to have a recipe for calling std::sort from a Fortran program (including a recipe for specifying the comparison function/functor in C++, if necessary, or in Fortran, if possible).

I read a couple of your references, and I wondered why, when comparing the qsort() of C and the std::sort of C++ they did not note that the two routines/ methods may well use different algorithms. For example, qsort() may use mergesort in one C library, and quicksort in another C library! Furthermore, the scaling law for sort time versus array size may be different for the two implementations, especially if even one of the two uses a hybrid algorithm.

@mecej4, you make several good points. Upon re-reading the Stack Overflow threads, a few less visible comments did remark the differences could be attributed to different algorithms. We would need to look at the C and C++ runtime-library implementations to know for sure. From the pragmatic point of view C++ seems to have a small edge. Both routines are black-boxes from the user perspective.

With regard to a more general recipe, are you interested in sorting also other types than the intrinsic ones? Is an indirect sort sufficient, or would an in-place sort also be desirable? For derived types, the C++ code will be a somewhat more complex because the size of the array elements has to be communicated manually across the language barrier.

One more difference is that the C++ template library makes it very easy to write an indirect sort since the “lambda” comparison function can capture variables from the enclosing scope.

With the standard C `qsort` you would need to pack your keys and indexes into an array of structures, leading to duplication. The fundamental issue is that C doesn’t support nested functions.

That raises the interesting question whether a Fortran internal subprogram can have the `bind(C)` attribute and could hence be used to do an indirect sort using C’s `qsort`?

1 Like

Counter to what I expected, it’s perfectly feasible. Even more surprisingly, host association in internal functions permits using `qsort` for indirect sorting, which is not possible in pure C without the use of global variables.

Here’s a demonstration:

``````module sorting
use, intrinsic :: iso_c_binding
implicit none
private
public :: argsort
contains

!> Returns the indices that would sort an array
!>
!> Arguments:
!>   n: number of elements
!>   list: on entry, list of indices 1..(1)..n
!>           (for reverse sort, n..(-1)..1)
!>         on return, the indices such that keys(list) would be sorted
!>   keys: array to be sorted
!>
subroutine argsort(n,list,keys)
integer, intent(in) :: n
integer, intent(inout), target :: list(*)
integer, intent(in) :: keys(*)

interface
subroutine qsort(ptr,count,size,comp) bind(c,name="qsort")
import c_ptr, c_int, c_size_t, c_funptr
implicit none
type(c_ptr), intent(in), value :: ptr
!> pointer to the array to sort
integer(c_size_t), intent(in), value :: count
!> number of elements in the array
integer(c_size_t), intent(in), value :: size
!> size of each element in the array in bytes
type(c_funptr), intent(in), value :: comp
!> comparison function which returns ​a negative integer value
!> if the first argument is less than the second, a positive
!> integer value if the first argument is greater than the
!> second and zero if the arguments are equivalent
end subroutine
end interface

if (n < 2)
return
end if

call qsort(c_loc(list),int(n,c_size_t),c_sizeof(list(1)), &
c_funloc(argsort_compare))

contains
! technically, the inputs should be void pointers
! but works nevertheless
integer(c_int) function argsort_compare(a,b) bind(c)
integer(c_int), intent(in) :: a, b
argsort_compare = keys(a) - keys(b)
end function
end subroutine
end module

program main

use sorting
implicit none

integer :: list(4)
integer :: keys(4)

list = [1,2,3,4]
keys = [7,5,6,3]

call argsort(4,list,keys)

! Supposed to print [ 3, 5, 6, 7]
print *, keys(list)

end program
``````

Notice the comparison function is encoded in Fortran. So the conclusion in my previous post is curiously enough - wrong. You can perform an indirect sort using `qsort` without global variables from Fortran. Further, we could give `argsort` the `bind(c)` attribute, making it a callable from C again. Using Fortran to bypass restrictions of C would make an interesting addition to the Wikipedia entry on `qsort`.

2 Likes

That’s a very interesting interface - I am not sure if the interfacing plays a role on some overhead, but I’ve compared it against my own implementation of `argsort` and the results are quite interesting: with gfortran, the Fortran version is always at least some ~20-30% faster than the C counterpart, see for example here:

``````module sorting_f
use iso_fortran_env, only: int32,int64
implicit none
private

integer, parameter, public :: IKIND = int32
integer, parameter, public :: ISIZE = int64

public :: argsort_f
contains

!> Argsort: Federico's implementation
subroutine argsort_f(n,list,keys)
integer(ISIZE), intent(in)    :: n
integer(ISIZE), intent(inout) :: list(*) ! indices to the sorted array
integer(IKIND), intent(in)    :: keys(*) ! Array to be sorted

integer :: sorted(n)

sorted = keys(1:n)
call int_quick_sort_andlist(sorted,list(1:n))

end subroutine argsort_f

pure recursive subroutine int_quick_sort_andlist(list,ilist,down)

integer(IKIND), dimension(:), intent(inout) :: list
integer(ISIZE), dimension(size(list)), intent(inout) :: ilist
logical, optional, intent(in) :: down

integer(ISIZE)     :: i, j, n
integer(ISIZE), parameter :: max_simple_sort_size = 8
integer(IKIND)     :: chosen
logical            :: descending

descending = .false.; if (present(down)) descending = down
n = size(list)

choose_sorting_algorithm: if (n <= max_simple_sort_size) then
! Use interchange sort for small lists
do i = 1, size(list) - 1
do j = i + 1, size(list)
if (toBeSwapped(list(i),list(j),.false.)) then
call swap(list(i),list(j))
call swaplist(ilist(i),ilist(j))
end if
end do
end do

else
! Use partition (quick) sort if the list is big
chosen = list(int(n/2))
i = 0
j = n + 1

scan_lists: do
! Scan list from left end
! until element >= chosen is found
scan_from_left: do
i = i + 1
if (toBeSwapped(list(i),chosen,.true.) .or. i==n) exit scan_from_left
end do scan_from_left

! Scan list from right end
! until element <= chosen is found

scan_from_right: do
j = j - 1
if (toBeSwapped(chosen,list(j),.true.) .or. j==1) exit scan_from_right
end do scan_from_right

swap_element: if (i < j) then
! Swap two out of place elements
call swap(list(i),list(j))
call swaplist(ilist(i),ilist(j))
else if (i == j) then
i = i + 1
exit
else
exit
endif swap_element

end do scan_lists

if (1 < j) call int_quick_sort_andlist(list(:j),ilist(:j),down)
if (i < n) call int_quick_sort_andlist(list(i:),ilist(i:),down)

end if choose_sorting_algorithm ! test for small array

contains

elemental logical function toBeSwapped(a,b,orEqual)
integer(IKIND), intent(in) :: a,b
logical, intent(in) :: orEqual

toBeSwapped = merge(a<b,a>b,descending)
if (orEqual .and. a==b) toBeSwapped = .true.

end function toBeSwapped

elemental subroutine swap(a,b)
integer(IKIND), intent(inout) :: a,b
integer(IKIND) :: tmp
tmp = a
a = b
b = tmp
end subroutine swap
elemental subroutine swaplist(a,b)
integer(ISIZE), intent(inout) :: a,b
integer(ISIZE) :: tmp
tmp = a
a = b
b = tmp
end subroutine swaplist

end subroutine int_quick_sort_andlist

end module sorting_f

module sorting
use, intrinsic :: iso_c_binding
implicit none
private
public :: argsort
contains

!> Returns the indices that would sort an array
!>
!> Arguments:
!>   n: number of elements
!>   list: on entry, list of indices 1..(1)..n
!>           (for reverse sort, n..(-1)..1)
!>         on return, the indices such that keys(list) would be sorted
!>   keys: array to be sorted
!>
subroutine argsort(n,list,keys)
integer, intent(in) :: n
integer, intent(inout), target :: list(*)
integer, intent(in) :: keys(*)

interface
subroutine qsort(ptr,count,size,comp) bind(c,name="qsort")
import c_ptr, c_int, c_size_t, c_funptr
implicit none
type(c_ptr), intent(in), value :: ptr
!> pointer to the array to sort
integer(c_size_t), intent(in), value :: count
!> number of elements in the array
integer(c_size_t), intent(in), value :: size
!> size of each element in the array in bytes
type(c_funptr), intent(in), value :: comp
!> comparison function which returns ​a negative integer value
!> if the first argument is less than the second, a positive
!> integer value if the first argument is greater than the
!> second and zero if the arguments are equivalent
end subroutine
end interface

if (n < 2) then
return
end if

call qsort(c_loc(list),int(n,c_size_t),c_sizeof(list(1)), &
c_funloc(argsort_compare))

contains
! technically, the inputs should be void pointers
! but works nevertheless
integer(c_int) function argsort_compare(a,b) bind(c)
integer(c_int), intent(in) :: a, b
argsort_compare = keys(a) - keys(b)
end function
end subroutine

end module sorting

program main
use iso_c_binding, only: c_int
use sorting
use sorting_f
implicit none

integer(ISIZE), parameter :: N = 100000

real, allocatable :: x(:)
integer(ISIZE), allocatable :: listf(:)
integer(IKIND), allocatable :: keys(:),list(:)
integer(ISIZE) :: j
real :: t0,t1,t2

allocate(list(N),keys(N),x(N))
call random_number(x)
keys(:) = nint(N*x)
list(:) = [(j,j=1,N)]
listF   = list

call cpu_time(t0); call argsort(int(N,c_int),list,keys)
print *, keys(1:5),list(1:5)
call cpu_time(t1); call argsort_f(N,listf,keys)
print *, keys(1:5),list(1:5)
call cpu_time(t2)

! Supposed to print [ 3, 5, 6, 7]
print *, 'C time=',t1-t0
print *, 'F time=',t2-t1

end program

``````