79 Languages speed competition: Can we make Fortran win?

CRquantum · March 24, 2022, 7:40am

Dear all,

There is language speed competition! I just find an interesting video on Youtube,

The guy called Dave Plummer seems legit.

The language speed challenge is simple,
given 5 seconds, from 1 to 10^6, find as many as prime number per unit time as possible (correct me if I was wrong). Anyway, the higher the number the faster the language is.
Initially perhaps there 45 Language in the completion, now it seems there are at least 79.

Correct me if I was wrong. But it seems Fortran’s performance isn’t great, among 79 languages, only at 21th position,

Two columns are the most important, passes, and passes/s/t (pass per second per thread). The number there the higher the better.

I just have one simple question: Can we make Fortran the champion? Can we? Shall we?

PS.
All the information is in that his github,

The result report can be found here,
https://plummerssoftwarellc.github.io/PrimeView/?rc=30&sc=rc&sd=True
One report for example is here,
https://plummerssoftwarellc.github.io/PrimeView/report?id=rbergen-1648101155.json&hi=False&hf=False&hp=True&fi=&fp=&fa=&ff=&fb=&tp=True&sc=pp&sd=True

There are currently two Fortran solutions there,

I believe Fortran should perform better than it is now!

Pap · March 24, 2022, 8:38am

Assembly at position #24? Seriously? And… Lisp at #1? And there is more… Java, the worst language ever, at higher position that Fortran? I could go on, but I think enough is enough… Unless I am reading this table wrong, this is totally nonsense.

msz59 · March 24, 2022, 11:43am

Does not make much sense to me, either. Lisp/C++/D performing 4-5 orders of magnitude (!) faster than the others. But I gave a try the C++ and Fortran solutions and the numbers are similar to those in the table above. So either there is such a difference in the algorithms used or I do not understand what is happening.

gnikit · March 24, 2022, 1:42pm

Having had a very quick look at the C++ (sol 3) and Fortran (sol1 & 2) the solutions are not quite equivalent. Although they implement the same base algorithm C++ has a lot more going on under the hood.

C++ uses unsigned ints, bitwise operators, and most importantly compile time optimisations like precomputed square roots (and it is thread, although the default number of threads is 1). All that in addition to using a fixed size buffer.

On the other hand the Fortran (prime-8bit) implementation uses dynamic storage at every iteration of the sieve and the normal sqrt intrinsic function. Also, I think the use of OOP in this case might be adding overhead and thus hindering performance.

We could probably boost our score if we used the preprocessor and removed some of the OOP.

PS My guess is that the majority of the time is spent in clear_bits followed by calls to allocate and get_bit

Here are the gprof results using ifort -Ofast -pg (gofrtran results were comparable)

certik · March 24, 2022, 6:57pm

Finding primes is not a bad benchmark, but the fast algorithms seem to use bitfields. I wonder if one could use logical arrays and let the Fortran compiler implement it as a bitfield.

For Fortran it’s better to do some more numerical benchmark, like solving some equation.

msz59 · March 24, 2022, 7:01pm

I have modified the type PrimeSieve definition to contain
integer(kind=int8), dimension(500000) :: raw_bits and commented out deallocate in the finalizer but it did not help.
I have also tried to change int64 to int32, also no significant change.

@gnikit’s profiling effort give a hint regarding clear_bits procedure being the major cpu-eater. I checked how many single bytes get zeroed in a benchmark run. The result is astonishing. My home machine executed 4577 passes in 5 seconds, calling clear_bits 768936 times and zeroing total of 3712226197 (3.7G) bytes. There seems to be plenty to improve but how? It seems that the step dummy is never 2, so the loop step is always different than 1. If it were 1, the compiler would arrange a call to memset.

msz59 · March 24, 2022, 7:05pm

Actually there is a version using logical array on GitHub but it performs even worse

certik · March 24, 2022, 7:30pm

I assume Fortran compilers don’t optimize it well, but they could.

Is the idea that bitfield requires 8x less memory than a byte array, and thus even though you have to do bit operations and more complicated indexing to use it, it still ends up faster because you don’t have to copy as much data from memory to registers?

gnikit · March 24, 2022, 7:32pm

If I recall correctly, the existing implementation on GitHub uses normal logicals i.e. 32bit which would explain the performance difference. I suspect that if we were able to use 1bit arrays instead of 8bit we would get similar performance to C++.

Another potential bottleneck is the cache performance. The raw_bits array realistically does not fit in any of the CPU caches so accessing it with non consecutive indices would theoretically result in poorer performance. In the context of Eratosthenes’ sieve I don’t see a straightforward way to avoid that.

gnikit · March 24, 2022, 7:41pm

That is my understanding. However from personal experience having used bitfields in Fortran for wavelet compression the performance gains were not impressive (the savings in memory obviously were, but that is a different matter).

certik · March 24, 2022, 7:46pm

I also couldn’t get bitfields to perform faster than byte (8bit) arrays (in C or Fortran).

Regarding fast prime generation, I think this might be one of the fastest codes out there:

GitHub - kimwalisch/primesieve: 🚀 Fast prime number generator (see also the algorithm description)
And a C implementation: GitHub - TotallyNotChase/cFastSieve: 🚀 Fastest (probably) Sieve of Eratosthenes implementation, in C

msz59 · March 24, 2022, 9:37pm

Calling repeatedly primesieve_count_primes(0,1000000) for 5 seconds gives 35 times less passes done than the C++ winner in the table (flo80-pol-constexpr), the latter requiring C++20 features.

Jweber · March 27, 2022, 11:39am

I am actually the author of the code presented in the video. In the github repository there is actually another, object-oriented solution by another author I would have preferred to be chosen.

Jweber · March 27, 2022, 11:42am

The reason for using a bitfield was to reproduce the algorithm presented in the first video of the comparison series as accurately as possible.

Jweber · March 27, 2022, 11:44am

The reason why
in these solutions the results were computed at compile time so there was not much to do at run time…

ivanpribec · March 27, 2022, 12:22pm

It turns out you can do this in Fortran too! See the recent post by @mecej4.

Jweber · March 27, 2022, 8:58pm

Here my results for the 4 diffent Fortran codes of the comparison project

The original, non tweaked algorithm has been presented (for Python, C# and C++) in What's the FASTEST Computer Language? C++ vs Fortran vs Cobol: E04 - YouTube

The rules are explained in Primes/CONTRIBUTING.md at drag-race · PlummersSoftwareLLC/Primes · GitHub

My non-faithful (it does not use classes) approach has been presented in the Fortran vs. Cobol comparison video, although there exists also faithful implementation by Thomas Jollans (github user tjol).
I do not know, I guess because it has a rather “classic” Fortran look.

The considered compilers are:
gfortran-7
gfortran-8
gfortran-9
gfortran-10
gfortran-11
ifort 21.04
ifx 21.04 beta
aocc 3.1.0 based on flang 12.0

The test have been performed on an AMD Ryzen 7 3700X CPU (base frequency 2200 Mhz,
max frequency 3600 MHz)

The numbers are the number of times the sieve algorithm is completed
for the range up to 1000000 with 5 seconds

Notable results are:

When a bitfiled is used, there are strong variations between different compilers
(and different versions of gfortran). The variations increase for the object-oriented version
The fastest code is generated if one 1-bit integer is used to store the state (prime/not prime)
of a number (gain ~50% over the best bitfield results). There the results are more consistent than for the bitfield
Storing the states of the numbers in an array of logical variables is slower than storing it in an 8bit integer and similar to the performance of a good bitfield implementation.
Disclaimer: I do not believe that I really chose the optimal command line options

solution_1 One bit per digit (stored in int64 variables), not object oriented, shown in video

gfortran-7 -march=native -O3 -o benchmark-gcc-7 PrimesFortran.f08
8905.2

gfortran-8 -march=native -O3 -o benchmark-gcc-8 PrimesFortran.f08
11210

gfortran-9 -march=native -O3 -o benchmark-gcc-9 PrimesFortran.f08
11211

gfortran-10 -march=native -O3 -o benchmark-gcc-10 PrimesFortran.f08
11323

gfortran-11 -march=native -O3 -o benchmark-gcc-11 PrimesFortran.f08
12417

Note: ifort -v results in ifort version 2021.4.0 ifort dors not recognize the .f08 suffix changed to .f90
ifort -Ofast -march=core-avx2 -o benchmark-ifort-21.04 PrimesFortran.f90
6097

ifx -Ofast -march=core-avx2 -o benchmark-ifx-21.04-beta PrimesFortran.f9
7283

flang -march=native -O3 -o benchmark-flang-12.0 PrimesFortran.f90
6659

Object-oriented with bit array, fully meets the ‘faithful’ criteria of the comparison

gfortran-7 -march=native -O3 -oo-bitarray-gcc-7 prime-bitarray.f90
12552

gfortran-8 -march=native -O3 -oo-bitarray-gcc-8 prime-bitarray.f90
8199

gfortran-9 -march=native -O3 -oo-bitarray-gcc-9 prime-bitarray.f90
8138

gfortran-10 -march=native -O3 -oo-bitarray-gcc-10 prime-bitarray.f90
7613

gfortran-11 -march=native -O3 -oo-bitarray-gcc-11 prime-bitarray.f90
12619

ifort -march=core-avx2 -O3 -oo-bitarray-ifort-12.04 prime-bitarray.f90
2591.0

ifx -march=core-avx2 -O3 -oo-bitarray-ifx-12.04 prime-bitarray.f90
2547

flang -march=native -O3 -oo-bitarray-flang-12.0 prime-bitarray.f90
1412

Object-oriented, one 8-bit Integer per number

gfortran-7 -march=native -O3 -oo-8bit-gcc-7 prime-8bit.f90
18155

gfortran-8 -march=native -O3 -oo-8bit-gcc-8 prime-8bit.f90
16597

gfortran-9 -march=native -O3 -oo-8bit-gcc-9 prime-8bit.f90
17870

gfortran-10 -march=native -O3 -oo-8bit-gcc-10 prime-8bit.f90
18029

gfortran-11 -march=native -O3 -oo-8bit-gcc-11 prime-8bit.f90
18247

ifort -march=core-avx2 -O3 -oo-8bit-ifort-12.4 prime-8bit.f90
18086

ifx -march=core-avx2 -O3 -oo-8bit-ifort-12.4 prime-8bit.f90
18361

flang -march=native -O3 -oo-8bit-flang-12.0 prime-8bit.f90
9651
(Note: The values varied strongly between approx- 4000 and approx. 11000)

Object-oriented, one Logical per number

gfortran-7 -march=native -O3 -oo-logical-gcc-7 prime-logical-array.f90
12833

gfortran-8 -march=native -O3 -oo-logical-gcc-8 prime-logical-array.f90
12764

gfortran-9 -march=native -O3 -oo-logical-gcc-9 prime-logical-array.f90
12441

gfortran-10 -march=native -O3 -oo-logical-gcc-10 prime-logical-array.f90
12453

gfortran-11 -march=native -O3 -oo-logical-gcc-11 prime-logical-array.f90
12635.8

ifort -march=core-avx2 -O3 -oo-logical-ifort-12.4 prime-logical-array.f90
13174

ifx -march=core-avx2 -O3 -oo-logical-ifort-12.4 prime-logical-array.f90
13385

flang -march=native -O3 -oo-logical-flang-12.0 prime-logical-array.f90
5573
(Note: strong fluctuations of the results between ~3800 and ~8300)

msz59 · March 27, 2022, 10:04pm

I guess it is a typo and should be 8-bit integer.

Still, as you noticed a few post above, the best C++ solutions does most of the calculations during the compilation (constant expressions), so it is sort of cheating. If this is also the case for the other 2 or 3 leaders (Lisp, D, maybe Rust), that would explain the many-orders-of-magnitude difference between those and the rest of the competitors. It would require to sum compilation and execution times to make a fair comparison.

FortranFan · March 28, 2022, 1:59am

Yes almost all the leading contenders in this “drag race” from Zig to C++ achieve their performance via compile-time computing.

See this thread.

CONSTEXPR as much as possible has been a mantra with C++ and its derivatives such as D, Rust, Zig, etc. Folks such as Stroutsrup and De Rios et al. have presented a fair bit on the benefits of this, especially in systems programming. The results here are but one where their efforts show what is viable with the approach to compute once (during compilation) but to consume voraciously and efficiently i.e., use the values widely with great speed during runtime execution.

Jweber · March 28, 2022, 3:16am

That was indeed a typo. I have corrected it.

Topic		Replies	Views
Computing at compile time	49	2838	March 30, 2022
Fortran only half as fast as C++ finding primes Humor	5	993	September 27, 2022
Improving Fortran Results in the Julia Micro-benchmarks Help	44	3996	June 23, 2022
Julia: Fast as Fortran, Beautiful as Python	184	11793	November 13, 2022
Comparing Fortran and Julia's Bessel function performance	69	4864	October 23, 2022

79 Languages speed competition: Can we make Fortran win?

solution_1 One bit per digit (stored in int64 variables), not object oriented, shown in video

Object-oriented with bit array, fully meets the ‘faithful’ criteria of the comparison

Object-oriented, one 8-bit Integer per number

Object-oriented, one Logical per number

Related topics