Fastest sorting for a specific genre of unordered numbers

My code is bottlenecked by sorting. I’m sorting a length ~256 1-D array of real(8) which is partially sorted. The sorting algorithm must return the indexes used to sort the array. Sorting an array of this general size, and partially sorted-ness occurs thousands of times in my code, taking up ~50% of the computation time.

I have made a Github repo which contains a benchmark sort_benchmark.f90, which benchmarks two sorting algorithms on the kind of sequence of numbers that I need to sort.

Current results:

algorithm Time for 1 sort (s)
futils argsort 1.0787695E-05
stdlib sort_index 1.3692000E-06

I’m blown away at how fast Fortran stdlib’s sort_index is, compared to the sorting algorithm in my library of useful utilities (futils)

Wondering if people know if a sorting algorithm that might be better suited to my specific problem?

Thanks for suggestions!

Before one can think of ways to exploit the existence of partial ordering, you need to give a clear description of what sort of partial ordering may be assumed. It is probably not correct to look at the one data file that you show and start searching for algorithms that are suitable.

Do the thousands of clumps of data that you have to sort come in the form of formatted real numbers?

Are the numbers always in proper sort order in the exponent field? If so, bin the data by exponent and sort bins separately.

There are runs (sequences of identical values) in the data. In other places, a number that was previously seen and observed to be in order w.r.t. its neighbors appears again, obviously now out of place. Do you expect the relative orders of equal values to be preserved in the computed sort index (i.e., do you expect stable sorting)?

The algorithm used in Futils is selection sort, a simple algorithm that is short to code and sufficient for sorting short arrays, say with about a hundred entries or fewer. It is not surprising that it is slower than the better sorting algorithms that are used in combinations (such as “introspection”) in Fortran-lang/Stdlib.

1 Like

I’m by no means any sorting expert, but I’ve found the comparison of different algorithms on Wikipedia to be quite comprehensive.

1 Like

Do the thousands of clumps of data that you have to sort come in the form of formatted real numbers?

Yes

Are the numbers always in proper sort order in the exponent field? If so, bin the data by exponent and sort bins separately.

No. The numbers to be sorted always look like this. The below graph “looks” sorted but it isn’t.

Within each flat interface, the numbers are not sorted. Here is a zoom in on the y axis:

There are runs (sequences of identical values) in the data. In other places, a number that was previously seen and observed to be in order w.r.t. its neighbors appears again, obviously now out of place. Do you expect the relative orders of equal values to be preserved in the computed sort index (i.e., do you expect stable sorting)?

I don’t think so but I don’t completely understand

I think that you misunderstood the question, and your graphs and the descriptions related to them answer questions other than the one that I asked. The exponent field begins at column 23 in each line of your data file. For the first line, the exponent field contains “-006”. The following program reads the data file and checks if the lines are in increasing order in this field.

program checkorder
!Given a data file with lines such as
!   2.3551561120185991E-006
!Check if the file is sorted by the exponent field
implicit none
integer iexp,iexp0, i
!
open(11,file='sort_data.txt',status='old')
read(11,'(22X,i4)')iexp0
i = 1
do
   read(11,'(22X,i4)',end=100)iexp
   i = i+1
   if(iexp < iexp0)print '(A,2x,i,2x,A)',' Line',i,'is out of order'
   iexp0 = iexp
end do
100 print *,i,' lines scanned'
close(11)
end program

It would be useful if you can run this program on a number of your data files and report if any of them are out of order in the exponent field, i.e., columns 23-26. (Please note, you will have to arrange for the program to read a file that you designate, rather than the sole example data file that you provided).

I did misunderstand you initially.

The data are not always sorted in the exponent field. I just ran your program over several thousand of data files, and found quite a few exceptions where the exponent field decreased, with increasing index.

@mecej4 I change the Github benchmark. Now it loads in ~11,000 different arrays that I need to sort from a data file, and benchmarks routines by sorting all these different cases. I also added a routine by @pmk sortedIndex. The stdlib routine still wins.

The stdlib sorting module was an admirable effort from @wclodius. Have a look at the original issue to see all the benchmarking he did: Implement sorting algorithms · Issue #98 · fortran-lang/stdlib · GitHub

I’d also bring your attention to my post in which I listed a bunch of other sorting routines.

An easy way to sort a contiguous Fortran array is to build upon the C++ standard library functions std::sort and std::stable_sort. You do however need a compiler that supports the std::span container introduced in C++20. The code demonstrated below was tested succesfully with gcc 11.1.0

Here are the C++ routines:

#include <algorithm>
#include <numeric>
#include <span>

extern "C" {

// Finds the indices that would sort an array
//
// Performs an indirect sort of the rank-1 array `a` of size `n`. 
// Indices are returned in the array `index`.
// Fortran or 1-based numbering is assumed for the `index` array.
//
void cpp_argsort(int n, const double *a, int *index) {

  std::span inds(index, static_cast<size_t>(n));
  std::iota(inds.begin(),inds.end(),1);

  auto compare = [&a](const int& il, const int& ir) {
    return a[il-1] < a[ir-1];
  };

  std::sort(inds.begin(),inds.end(),compare);

}

// Finds the indices that would sort an array while preserving order
// between equal elements.
//
// Performs an indirect sort of the rank-1 array `a` of size `n`. 
// Indices are returned in the array `index`.
// Fortran or 1-based numbering is assumed for the `index` array.
//
void cpp_stable_argsort(int n, const double *a, int *index) {

  std::span inds(index, static_cast<size_t>(n));
  std::iota(inds.begin(),inds.end(),1);

  auto compare = [&a](const int& il, const int& ir) {
    return a[il-1] < a[ir-1];
  };

  std::stable_sort(inds.begin(),inds.end(),compare);
}


} // extern "C"

First, extern "C" keyword is used to generate external linkage and prevent compiler name mangling. The dummy variables are all C/Fortran compatible. The span container is used to “wrap” the raw contiguous index buffer with the whole C++ iterator machinery. Next we use std::iota to fill the index array with sequentially increasing values starting from 1 due to Fortran indexing. A lambda expression is used to perform the comparison, by referencing into the original array of doubles a with a capture y reference. Finally, we pass the iterators and lambda compare function to the actual sorting routine.

Here’s a Fortran driver that calls the procedure:

program test_cpp_argsort

   use iso_c_binding, only: c_double, c_int
   implicit none

   interface
      subroutine cpp_argsort(n,array,index) bind(C,name="cpp_argsort")
         import c_double, c_int
         integer(c_int), intent(in), value :: n
         real(c_double), intent(in) :: array(*)
         integer(c_int), intent(inout) :: index(*)
      end subroutine
      subroutine cpp_stable_argsort(n,array,index) bind(C, &
           name="cpp_stable_argsort")
         import c_double, c_int
         integer(c_int), intent(in), value :: n
         real(c_double), intent(in) :: array(*)
         integer(c_int), intent(inout) :: index(*)
      end subroutine
   end interface

   integer, parameter :: n = 5
   real(c_double) :: a(n)
   integer(c_int) :: inds(n)

   a = [real(c_double) :: 1., 5., 4., 2., 3.]

   write(*,'(A)') "Before:"
   write(*,*) a

   call cpp_argsort(n,a,inds)

   write(*,'(A)') "After:"
   write(*,*) a(inds)

end program

Finally, the result:

$ g++ -c -std=c++20 -O2 -march=native cpp_argsort.cc 
$ gfortran -Wall -O2 -march=native test_cpp_argsort.f90 cpp_argsort.o -o test -lstdc++
$ ./test
Before:
   1.0000000000000000        5.0000000000000000        4.0000000000000000        2.0000000000000000        3.0000000000000000     
After:
   1.0000000000000000        2.0000000000000000        3.0000000000000000        4.0000000000000000        5.0000000000000000    

At least for the built-in types which are interoperable, using std::span is an easy way to unleash the full potential of the C++ standard library algorithms.

3 Likes

I added the C++ std routines to my benchmark. They are right on par with the fortran-stdlib routines.

2 Likes

Are you using any optimization flags?

Here’s what I got on an Intel® Core™ i7-11700K with Ubuntu and gcc 11.1.0 (note I added -O3 -march=native using the FFLAGS and CXXFLAGS environment variables)

$ ./sort_benchmark 
| algorithm                     | Time for 1 sort (s) |
| ----------------------------- | ------------------- |
| futils argsort                |       2.7011091E-05 |
| stdlib sort_index             |       2.1951818E-06 |
| sorting_module quicksort      |       9.6158182E-06 |
| sorter sortedIndex            |       2.0240000E-06 |
| C++ std::sort                 |       1.8006364E-06 |
| C++ std::stable_sort          |       1.5749091E-06 |
| quicksort_own_2d_swapped      |       3.8420000E-06 |

C++ is consistently on top by a small margin.

Edit: earlier I didn’t apply the Release mode. With Release mode, the C++ sort and Fortran stdlib sort are practically equal:

$ ./sort_benchmark 
| algorithm                     | Time for 1 sort (s) |
| ----------------------------- | ------------------- |
| futils argsort                |       2.6566909E-05 |
| stdlib sort_index             |       1.5939091E-06 |
| sorting_module quicksort      |       8.8872727E-06 |
| sorter sortedIndex            |       1.9198182E-06 |
| C++ std::sort                 |       1.5998182E-06 |
| C++ std::stable_sort          |       1.3950000E-06 |
| quicksort_own_2d_swapped      |       3.7874545E-06 |

2 Likes

For Fortran i am using GNU Fortran (Homebrew GCC 11.2.0_3) 11.2.0, and for C++ I am using Apple clang version 13.0.0 (clang-1300.0.29.30). I have a Apple M1 Macbook Air.

I just added -march=native, which gives a tiny speed boost. But ya std::sort and stdlib are about the same.

I use my Fortran indexed quick sort, a non-recursive sort by keeping a pair-list of the un-sorted sub-lists.
For the pivot value I choose the middle value, which works well for partially sorted lists. (other examples might choose the first value which is ok for random values but might be poor for partially ordered lists)
I use an order test that retains the order of repeated values, which more applies to integer sorts but is less important for real value sorts (assuming fewer duplicates).
I also use a bubble sort when the the sub-list is less than 8 to 16, although “8” can depend on the data type and processor. ( about 10% to 20% improvement )

I have used this sort for many years on lists of real(8) values ( x,y or z or ax+by+cz values for 100k points in a finite element mesh). Unlike the repeated claims of instability, I have found this sort to be robust, not poor for ordered lists.

1 Like

The options you used in your CMake file gave the best overall results for me; tried a variety or sort libraries. The only one that consistently beat the stdlib library was the mrgrnk procedure from orderpack, but only by a few percent; depending on the compiler and options sometimes the sort_quick_rx procedure from M_sort was faster; but the fastest times were orderpack mrgrnk and stdlib sort_index. Since sorting speed can be quite sensitive to partial ordering and array size the orders might change if the datafile is not representative of your general input, but it looks like stdlib with the optimization flags you are using is a good choice based on what I found. I increased the iterations a bit to eliminate some of the noise, but looking at the fastest dozen runs stdlib looks good; the orderpack and sort_quick_rx routines were sometimes faster, mostly with other compilers. It is interesting how much variation I saw in which was fastest depending on which compiler was used, but it was still always those three.

| orderpack_mrgrnk              |       3.3981818E-06 |
| orderpack_mrgrnk              |       3.6169091E-06 |
| stdlib_sort_index             |       3.8764545E-06 |
| stdlib_sort_index             |       3.9332727E-06 |
| orderpack_mrgrnk              |       4.2291818E-06 |
| qsort_inline                  |       4.2962727E-06 |
| M_sort_sort_quick_rx          |       4.4433636E-06 |
| M_sort_sort_quick_rx          |       4.4838182E-06 |
| qsort_inline                  |       4.4886364E-06 |
| M_sort_sort_quick_rx          |       4.6305455E-06 |
| qsort_inline                  |       4.7409091E-06 |
| stdlib_sort_index             |       5.3317273E-06 |

PS: the qsort_inline prodedure from the Fortran Wiki, customized to your data type shows up too; it is a very similiar algorithm to the M_sort sort_quick_rx routine, and the M_sort routines are already generic so easier to use.

4 Likes

I can confirm, mrgrnk from ORDERPACK is quite fast for this problem:

$ ./sort_benchmark 
| algorithm                     | Time for 1 sort (s) |
| ----------------------------- | ------------------- |
| futils argsort                |       2.6560636E-05 |
| stdlib sort_index             |       1.5619091E-06 |
| sorting_module quicksort      |       9.1251818E-06 |
| sorter sortedIndex            |       2.1222727E-06 |
| C++ std::sort                 |       1.5285455E-06 |
| C++ std::stable_sort          |       1.4374545E-06 |
| quicksort_own_2d_swapped      |       3.7542727E-06 |
| mrgrnk                        |       1.0954545E-06 |

It’s good to know that specialized Fortran routines can still outperform the C++ standard library. With mrgrnk we’re looking at a speed-up of ×20 (for the sorting part) already compared to the original futils subroutine.

@nicholaswogan, does this mean the other part of your computation (binning?) becomes the bottleneck?

1 Like

Check if you activated (by removing the comment mark) the lines in the source file mrgrnk.f90 that are preceded by the comment “This line may be activated when the initial set is already close to sorted.”

Doing this may give you an additional speed increase.

The C/C++ standard library sorting routines are usually handicapped by having to call a user-provided routine for comparing pairs of items from the array being sorted. The Fortran sorting routines, by doing these comparisons inline, may turn out to be slightly faster. But then, if we were to write the C sorting routines with inline comparisons (instead of making function calls for comparisons), those would become faster, as well.

1 Like

@urbanjost, Thanks so much for the benchmarks. I added mrgrnk to the Github benchmark. I also find that it barely outperforms stdlib.

@mecej4, uncommenting that line in mrgrnk gives give a little extra performance! thanks.

@ivanpribec, I am still bottlenecked by sorting. Here are some profiling results from XCode Time Profiler, where I have implemented each sorting algorithm into my main code.

futils argsort

stdlib sort_index

mrgrnk (with line uncommented)

The time I care about is the time of clima_climate_MOD_radiative_transfer. So mrgrnk does the best overall. Both sort_index and mrgrnk are massive improvements over argsort from futils.

Seems like mrgrnk is about as fast as it is going to get for sorting. Since this is still ~46% of computation time, there isn’t a ton to gain from optimizing the rebin subroutine (23% computation time). It would be great to optimize rebin some, but I’ve fiddled with it, and can’t imagine a way to make it a lot faster.

I consider this a big success! Thanks so much for all your help so far! My code is over 4x faster! Any further recommendations are very welcome.

The project GitHub - cphyc/Fortran-parallel-sort: A fortran library to perform parallel sorts. uses mrgrnk with OpenMP. Compiling it on WSL2 with GNU Fortran (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 I get

(base) /mnt/c/fortran/public_domain/github/Fortran-parallel-sort$ gfortran -march=native -O3 -fopenmp m_mrgrnk.f90 mod_sort.f90 test.f90
(base) /mnt/c/fortran/public_domain/github/Fortran-parallel-sort$ ./a.out
 mrgrnk took   101.30691999999954      ms/iter
 parallel sort took   43.551079999999729      ms/iter
 speed up factor    2.3261632088113582     
 using            4  threads
 efficiency is    58.154080220283952      percent
 Note this assumes the serial version is unvectorized.
 However it probably is auto-vectorized which contributes to the efficiency appearing low.
           0  of 8 sorting algorithms have errors.
2 Likes

Was just thinking about cloning ORDERPACK2.0 on github as an fpm(1) package and was looking to see if anyone had and as a side-effect just ran across the parallel version; and was just thinking about trying it or basing a coarray version off it and came back to mention it and you already have it running! Nice.

@ivanpribec – you got a much nicer bump than I did with ORDERPACK; might be hardware, etc… but wondering what compiler options you used or what else might be different that might explain that? I was using test data different from the original, so that is a likely reason; but was thinking it might be something else.

This isn’t completely true for the C++ sorting routine. Since std::sort is a template, the compiler will effectively inline the comparison. Using a functor (a struct with the operator()) or lambda expression instead of a global function gives the compiler stronger guarantees which raise the chances for inlining.

The faster operation of std::sort in comparison to qsort is discussed in several places:

The page from Martin Ueding quite clearly shows a factor two speed-up for arrays with more than 10^5 elements.

Modern C++ and Fortran can work very well together IMO. It’s just a shame one always has to flatten all the interfaces down to C level for compatibility which makes it a lot of tedious work.


@urbanjost - I used the default release options specified by the CMake build, following the instructions on Nick’s GitHub repo. The only change I made to the driver, was to add the following section (and time printout):

  call cpu_time(t(8))
  do i = 1,n_dat
    x8 = all_data(:,i)
    call mrgrnk(x8, inds1)
    x8 = x8(inds1)
  enddo
  call cpu_time(t(9))
2 Likes