1 billion row challenge in Fortran : mmap + OpenMP + branchless scan

sshestov · January 23, 2024, 8:30am

Hi all,

Recently a challenge (The One Billion Row Challenge - Gunnar Morling) has been announced to read and process a simple csv file. The trick is the file has 1e+9 rows thus its size is ~13Gb. Originally the challenge is for Java programmers, but many people have proposed solutions in their favorite languages (see “Show & tell” link there). BTW, best time in Java is currently ~2.5 s, while best C result is ~0.3 s, they rely on bit-magic, SIMD and many similar things.

The problem itself is quite artificial (if we need speed, we just use a different format of file), but nevertheless…

Since in Fortran we have OpenMP almost out-of-the-box, I decided to give it a try. I’ve to spent some time … currently I mmap the file (use c-function), split it into chunks to loop through via OpenMP. In each chunk I find a csv-separator (‘;’ char in this case) using Findloc, and scan the following number without if.

I’ve tried many different things to speed up, but nothing helps anymore. Currently I’m at nearly ~3 s with 16 threads. Significant amount of time is spent at FINDLOC. C-wizards use SIMD instructions to parse strings, I have not fount if it is possible to do in Fortran (making a c-call is possible, but would be unjust IMHO).

Do you have any ideas how to improve further?

!$OMP PARALLEL DO  stuff
do chunk_num=1,Nchunks
  cpos = ch_start(chunk_num)
  do 
    pos_s=findloc(i11(cpos:),scode,DIM=1)-1   
    d2p = merge(1,0,i11(cpos+pos_s+2) == dcode)  ;  d3p = merge(1,0,i11(cpos+pos_s+3) == dcode)  ;  d4p = merge(1,0,i11(cpos+pos_s+4) == dcode)  ;  m1p = merge(1,0,i11(cpos+pos_s+1) == mcode)
    pos_nl = d2p*(pos_s+4) + d3p*(pos_s+5) + d4p*(pos_s+6) + 1    
    tempi = d2p*(                                           (i11(cpos+pos_s+1)-zcode)*10 + (i11(cpos+pos_s+3)-zcode)  ) + &   ! dot is the 2nd symbol => temp has a form of 1.2
            d3p*( (1.-m1p)*((i11(cpos+pos_s+1)-zcode)*100 + (i11(cpos+pos_s+2)-zcode)*10 + (i11(cpos+pos_s+4)-zcode))   + &   ! dot is the 3rd symbol => temp has a form of 12.4 or -4.5 
                      m1p *(                              - (i11(cpos+pos_s+2)-zcode)*10 - (i11(cpos+pos_s+4)-zcode)) ) + &
            d4p*(          -(i11(cpos+pos_s+2)-zcode)*100 - (i11(cpos+pos_s+3)-zcode)*10 - (i11(cpos+pos_s+5)-zcode)  )       
    
    id = get_hash_i( i11(cpos:cpos+pos_s-1) )
    if (stations_cnt(id) == 0) then  
      do i=cpos,cpos+pos_s-1
        stations(id)(1+i-cpos:1+i-cpos) = x11(i)
      end do
    end if
    temp_min_i(id) = min(temp_min_i(id),tempi)
    temp_max_i(id) = max(temp_max_i(id),tempi)
    temp_cum_i(id) = temp_cum_i(id)+(tempi)
    stations_cnt(id) = stations_cnt(id)+1
    cpos = cpos+pos_nl
    if (cpos >= ch_end(chunk_num)) exit
  end do
end do
!$OMP END PARALLEL DO

My code is here: GitHub - sshestov/1brc-fortran: 1 billion row challnge in Fortran

ivanpribec · January 23, 2024, 8:48am

I’ve been thinking about a pipelined implementation using OpenMP tasks or sections:

! r | r r r .. r r r |            read & hash
!   | s s s .. s s s | s          statistics
!
! r r | r r .. r r |              read
!   h | h h .. h h | h            hash
!     | s s .. s s | s s          statistics

Granted, this doesn’t help beyond 2 or 3 threads, and perhaps there is insufficient work in the sections to even exploit this type of parallelism.

I think you’d probably break the single-file rule, if you had to write this C function.

PierU · January 23, 2024, 9:13am

The performance when processing a file a highly dependent on the physical support where the file is. 0.3s in C, but on which kind of disk? If it is on a Nvme SSD, you have absolutely no chance to approach it with a HDD or even a SATA SSD (Nvme SSDs have genuine concurrent accesses to files, while it’s a single queue on a SATA disk).

EDIT: by the way, 0.3s to process a 13GB file means at least a rate of 43GB/s, which is way above the performances any Nvme SSD that I know… So I assume that to get such a timing, the file is fully in RAM cache before the run, and that you have to do the same.

sshestov · January 23, 2024, 9:49am

Given actually the simplicity of the problem and my current performance, I assume the bottleneck is FINDLOC in my case. If I comment out all the rest except FINDLOC and searching through the file, I get like 1.6-1.8 seconds. Of course it would be nice to reduce the timing to them, but it will not allow to go below.

So I was thinking to speed up the FINDLOC. I have checked converting the loop to a fixed number of iterations, and changing FINDLOC into fixed-size sub-array, and merging a part of the array with an 0…63 offset-array and finding a minimum. Nothing helped.

As for the c-call, I’m already using it for mmap, from within the same program. But I’m not going to do official challenge

themos · January 23, 2024, 9:50am

Nobody in their right mind would choose to keep a billion rows in human-readable format (formatted, in Fortran-speak) and make it somebody else’s problem to make a machine process that fast.

Oh, God, give us courage to change what must be altered, serenity to accept what can not be helped, and insight to know the one from the other.

sshestov · January 23, 2024, 9:51am

Of course it would depend; but I suspect most of us have reasonable machines these day. And after the first run the file will be cached in memory (this is stated in the rules already).

BTW, a cold start from HDD takes minutes, while subsequent runs take the same ~3 s. A cold start from NMVE is indistinguishable from a hot start.

PierU · January 23, 2024, 9:53am

If we don’t know what you have tried, we have chances to propose things you have already tried … Random thoughts:

writing your own findloc code
are computations on integer(kind=1) as fast as computations on integer(kind=4)? Can it be a reason why SIMD is not possible (aligment issues)?
I would use if/then/elsif blocks instead of the branchless approach (the latter requires much more computations on average)
OpenMP schedule(dynamic) is very often less efficient than the default static schedule.
…

sshestov · January 23, 2024, 9:57am

writing your own findloc code

I’ve tried and did not succeed. Any further ideas?

are computations on integer(kind=1) as fast as computations on integer(kind=4)? Can it be a reason why SIMD is not possible (aligment issues)?

After mmap I get a contiguous (I suspect) array to which I make a pointer either with a 1-byte integer array or a char array. Doing search and hash on 1-byte integer is a way faster than chars. However, to convert them to 4-byte integers will have some overhead; while I do not see any significant overhead from the hash function.

I would use if/then/elsif blocks instead the branchless approach (the latter require much more computations on average)

I’ve started with that. It was wa-a-a-a-ay slower.

OpenMP schedule(dynamic) is very often less efficient than the default static schedule.

No effect

sshestov · January 23, 2024, 12:52pm

But thank you for suggestions!

ivanpribec · January 24, 2024, 12:17am

It’s possible to use !$omp simd for scanning, but what you get depends on the flags used, how exactly you program things, the quality of implementation of your compiler:

    ! Find semicolon location
    integer function findsc(str)
        character(len=*) :: str  

        integer(1), parameter :: sc = iachar(';',1)
        integer(1) :: r(16)
        logical(1) :: z
        integer :: i, k 

        if (mod(len(str),16) /= 0) then
            error stop "Length should be multiple of 16"
        end if

        do i = 1, len(str), 16
            ! read chars into integer register
            r = transfer(str(i:i+16),r)
        
            ! search for semicolon
            z = .false.
            !$omp simd simdlen(16) reduction(.or.: z)
            do k = 1, 16
                z = z .or. (r(k) == sc)
            end do
            
            if (z) then
               ! semicolon was found
                findsc = i + findloc(r,sc,dim=1) - 1
                return
            end if
        end do

        ! semicolon not found
        findsc = 0
    end function

When I compile this with ifx -O3 -xhost -qopenmp-simd, the outer and inner (SIMD) loops reduce to:

        vpbroadcastb    xmm1, byte ptr [rip + .LCPI1_1]        ; tmp1(1:16) = 59
.LBB1_4:
        vmovdqu xmm0, xmmword ptr [r14 + rax]                  ; r = transfer(str(i:i+16))
        vmovdqa xmmword ptr [rip + simd_mp_findsc_$R], xmm0    ; ??? (store r in static memory?)
        vpcmpeqb        xmm2, xmm0, xmm1                       ; tmp2(1:16) = r == tmp1
        vpmovmskb       ecx, xmm2                              ; mask = (significant bits of tmp2)
        test    ecx, ecx                                       ; 
        jne     .LBB1_7                                        ; if (mask /= 0) goto ...
        add     rax, 16                                        ; i = i + 16
        lea     ecx, [rax + 1]
        cmp     ecx, ebx
        jle     .LBB1_4                                        ; repeat outer loop
.LBB1_6:
        xor     eax, eax.                    ; set result, findsc = 0
        add     rsp, 8                        
        pop     rbx
        pop     r14
        ret                                  ; return

If it were not for the instruction marked with ???, the instructions appear to be the same as in the StackOverflow thread: c - how to break from a loop when using sse intrinsics? - Stack Overflow

When a semi-colon is found, the code jumps to the following block where findloc has been unrolled:

.LBB1_7:
        vmovd   edx, xmm0            ; tmpd = transfer(r,tmpd)
        mov     ecx, 1               ; tmpc = 1
        cmp     dl, 59               ; flag = int(tmpd,1) == 59
        je      .LBB1_23             ; jump if (flag)
        vpextrb edx, xmm0, 1         ; tmpd = transfer(r(1), tmpd)
        mov     ecx, 2               ; tmpc = 2
        cmp     dl, 59               ; flag = int(tmpd,1) == 59
        je      .LBB1_23             ; jump if (flag)
        vpextrb edx, xmm0, 2         ; tmpd = transfer(r(2), tmpd)
        mov     ecx, 3               ; ...
        cmp     dl, 59
        je      .LBB1_23
        vpextrb edx, xmm0, 3
        mov     ecx, 4
        cmp     dl, 59
        je      .LBB1_23
        vpextrb edx, xmm0, 4
        mov     ecx, 5
        cmp     dl, 59
        je      .LBB1_23
...                              ; findloc has been unrolled
...                              ; checking byte by byte for iachar(';') = 59

.LBB1_23:
        mov     edx, ecx         ; tmpd = tmpc
.LBB1_24:
        add     eax, edx         ; findsc := (i-1) + tmpd
        add     rsp, 8
        pop     rbx
        pop     r14
        ret                      ; return

If I add the -recursive flag, the instruction marked with ??? (store values in a static local variable?) disappears, and the load (move) and compare are fused:

        vpbroadcastb    xmm0, byte ptr [rip + .LCPI1_1]       ; set lanes to ";"
.LBB1_4:
        vpcmpeqb        xmm1, xmm0, xmmword ptr [rbx + rax]   ; load and compare for equality
        vpmovmskb       ecx, xmm1                             ; mask = extract significant bits
        test    ecx, ecx
        jne     .LBB1_7                               ; jump if semi-colon found
        add     rax, 16                               ; increment address by 16 bytes
        lea     ecx, [rax + 1]                        ; load index of next element
        cmp     ecx, r14d                             ; flag = i <= len(str)
        jle     .LBB1_4                               ; if (flag) goto .LBB1_4

Neat!

sshestov · January 24, 2024, 2:31pm

Wow! Thanks a lot for such extended analysis. I’ve tried your code, but since I’m with gfortran (-O3 -fopenmp -fopenmp-simd -march=native) it looks like I do not get the same effect. More than that, currently it drops from 3 to >6 seconds.

I will investigate how to install ifort (it asked immediately 14Gb; I will need to sacrifice the input file which is 13 GB:)

BTW, I’m working already with 1-byte integer array, so had to modify your code accordingly. Note r = transfer(str(i:i+16),r) must be r=transfer(str(i:i+15,r)) IMHO

ivanpribec · January 24, 2024, 7:00pm

Good catch!

With gfortran the omp simd directives don’t work well for this particular character searching application. Actually, even with ifx I noted later that the generated instructions are “fragile”, in the sense that flags, minor source-code changes, and compiler versions make big differences.

Here are some performance results obtained on an 11th Gen Intel(R) Core™ i7-11700K @ 3.60GHz. The compilers used are:

ifx version 2022.0.0
ifort version 2021.5.0
gcc version 13.1.0 (Ubuntu 13.1.0-8ubuntu1~20.04.2)
nvfortran 22.7-0 64-bit target on x86-64 Linux -tp haswell

The test files can be found in this repository (GitHub - ivan-pi/fortran-ascii at f96e01c11c9bfe8163e2a9a47f4ffd4c1ff76982; the driver is named benchmark_scan). The bandwidth of the copy and triad were measured using STREAM (MEMORY BANDWIDTH: STREAM BENCHMARK PERFORMANCE RESULTS). The benchmark searches a string of size N (divisible by 32), where only the last element is a semi-colon, and hence is not very representative (but it was easy to create).

A few conclusions I can make:

findloc tends to be slow for character scanning
index is always a tiny bit faster than scan
overall custom-1 appears to be the most performant across all four compilers

Do take these with a grain of salt, as a different compiler flag or even the change in the link order and environment variables could change performance by a large amount.

RonShepard · January 24, 2024, 8:32pm

This is an off-by-one mistake that I often make. I find that if I write it as

r=transfer(str(i+0:i+15,r))

then I don’t make it as often. Of course, I expect the compiler to ignore the “+0” when generating the code, so it is just a note to myself.

bwanakelly · January 25, 2024, 7:49am

Out of curiosity, how long are the CSV lines?

sshestov · January 25, 2024, 9:05am

In the beginning the list of cities included only the real ones, and the file looked like this:

Lhasa;26.5
Washington, D.C.;22.0
Bridgetown;38.8
Nassau;19.1
Willemstad;33.1
Las Palmas de Gran Canaria;21.3
Baku;8.1
Albuquerque;18.1
Tokyo;-3.8
Murmansk;7.2
Austin;29.3
Jayapura;19.2
Gangtok;12.5
Bata;24.3
Las Palmas de Gran Canaria;17.9
Phoenix;42.1
Marrakesh;13.0
Novosibirsk;3.1
Palembang;29.3

There were 413 different names. This is a relatively small number, thus people started to invent trivial hash algorithms, like use only 8 first characters and XOR them. Thus the organizers changed the rule that any random string from 1 to 1000 characters is possible.

sshestov · January 26, 2024, 8:57am

Thanks again for making such detailed analysis. I’ve tried to follow your suggestions (not precisely though) and do not see any encouraging results. Details are below.

I must admit I am even near not as experienced as you; may be detailed inspection of my binaries would help, but they are big and I’m lost.

I have tried nvfortran 21.5-0, freshly downloaded ifx and iforn 2024.0.2, my default gfortran 11.4.0 (I’m on Ubuntu 22.04; CPU is i9-9900X; 64Gb ram).

The compiler options are:

-O3 -fast -tp=skylake -mp=multicore for nv ;
-Ofast -xcore-avx2 -qopenmp-simd -qopenmp for ifx and ifort;
-O3 -fopenmp -fopenmp-simd -march=native for gfortran;

I’ve tried my original version with findloc and with your findsc, which I had to modify. The reason is that in order to read the file I’m using mmap (a C-function) to which I’m making pointers both as integer(1) and character arrays. I never use real strings (I expect conversion to strings will give some overhead), thus can not use directly index() or scan() as in your examples. My modified findsc() is below.

In any case nvfortran is very slow, ~40 s; during compilation it complains that almost neither of the loops could be vectorized. For the others the versions with findloc is normally faster by ~2 s; the winner is gfortran where I get either 3 s (findloc) or 6 s (findsc). ifx and ifort give 12 s and 15 s correspondingly.

I’m afraid there is one important factor that reduces the speed of your findsc: majority of the station names are quite short, and actual cut of several x16 front symbols happens very seldomly, just for very few lines out of 10^9. While the slowest findloc remains and it must be done for every short line.

Do you have any other ideas how to rewrite findloc? I was thinking about comparing 16 positions with an offset array 0…16 and then doing merge or something similar. Could such functions as minloc be extensively vectorized?

integer function findsc(arr)
    integer(kind=1), dimension(:) :: arr  
    integer(1), parameter :: sc = 59 !iachar(';',1)
    integer(1) :: r(16)
    logical(1) :: z
    integer :: i, k 
    do i = 1, size(arr), 16      
        r = arr(i:i+15)                  
        z = .false.
        !$omp simd simdlen(16) reduction(.or.: z)
        do k = 1, 16
            z = z .or. (r(k) == sc)
        end do
        if (z) then
            findsc = i + findloc(r,sc,dim=1) - 1     ! semicolon was found
            return
        end if
    end do
    findsc = 0                                       ! semicolon not found
end function

jkd2022 · January 26, 2024, 11:22am

This appears to be faster on Intel and Nvidia compilers, although slower on GFortran:

integer function findsc2(arr)
implicit none
integer(kind=1), dimension(:) :: arr  
integer(1), parameter :: sc = 59 !iachar(';',1)
integer p,i
p=size(arr)
do i=1,p
  if (arr(i) == sc) exit
end do
if (i > p) then
  findsc2=0
else
  findsc2=i
end if
end function

I also tried adding

!$OMP SIMD EARLY_EXIT

before the loop but, apart from Classic Intel, the compilers either complain or return incorrect results.

ivanpribec · January 26, 2024, 11:35am

This appears to be an Intel-specific extension.

ivanpribec · January 27, 2024, 8:35pm

Attempt 1

Here is a program which checks if eight bytes contain a semi-colon code, by using a technique known as SIMD within a register (SWAR). The idea is you pack the bytes into a variable of the size of the registers and use the available arithmetic and bitwise operations to operate on multiple bytes simultaneously (a kind of SIMD in software).

Edit: the function is broken, I have to go back to the drawing board… sorry for posting before testing.

The function first subtracts a mask composed of bytes containing the value 59 (ascii code for ;). Any byte containing a semi-colon will become zero. The second part I blatantly copied from a blog by Wojciech Muła: http://0x80.pl/articles/mask-zero-nonzero-bytes.html, without attempting to understand it. Essentially, if a byte is zero, it will set the most significant bit to 1. Finally we count the number of trailing zeros and transform the value into a position. (This is for little-endian.)

program swar
implicit none

character(8) :: a, b
integer(1) :: c(8)

integer :: j, k

a = "abcdefgh"
c = [(iachar(a(j:j),1),j=1,8)]
print *,sc_index(c)

do k = 1, 8
    b = a
    b(k:k) = ';'
    c = [(iachar(b(j:j),1),j=1,8)]
    print *, sc_index(c)
end do

contains

integer function sc_index(a)
    integer(1), intent(in) :: a(8)
    integer(8) :: r, m

    r = transfer(a,r) - int(z'3B3B3B3B3B3B3B3B',8)  ! z'35' = 59 = iachar(';')
    m = iand(r,int(z'7f7f7f7f7f7f7f7f',8)) + int(z'7f7f7f7f7f7f7f7f',8)
    r = iand(ior(r,m),int(z'8080808080808080',8))
    r = ieor(r,int(z'8080808080808080',8))
    
    sc_index = merge((trailz(r) - 7) / 8 + 1, 0, r /= 0)

end function

end program

The output of the program is:

The function is inspired by the blog post of Daniel Lemire: Quickly identifying a sequence of digits in a string of characters – Daniel Lemire's blog

Edit: it looks like there is a still a bug lurking in this function, as it fails on the following input:

a = "!<0`-'PO"
c = transfer(a,c)
print *, scindex(c)   ! prints 2 instead of 0

Attempt 2

I found a StackOverflow answer which offered a SWAR solution for byte-comparison: c - How to write a SWAR comparison which puts 0xFF in a lane on matches? - Stack Overflow

Here is my naive port to Fortran:

    integer function find_swar(str) result(pos)
        implicit none
        character(len=*), intent(in) :: str  

        integer :: i
        integer(8), parameter :: sc = int(z'3b3b3b3b3b3b3b3b',8)
        integer(8) :: r

        if (mod(len(str),8) /= 0) then
            write(*,'(A,I0)') "ERROR: String length should be multiple of ",8
            error stop
        end if

        do i = 1, len(str), 8
            r = transfer(str(i:i+7),r)
            pos = t_index(r, sc)
            if (pos > 0) then
                pos = i + pos - 1
                return
            end if
        end do

    contains

        integer function t_index(x, y)
            integer(8), intent(in) :: x, y
            integer(8) :: xored, mask
            integer(8) :: msb = int(z'8080808080808080',8)
            xored = ieor(x,y)
            mask = iand(ior(shiftr(xored,1), msb) - xored, msb)
            if (mask /= 0) then
                t_index = (trailz(mask) - 7) / 8 + 1
            else
                t_index = 0
            end if
        end function

    end function

I get decent results with the options gfortran -O2 -march=native -fwrapv. This time I didn’t get any false positives, even when parsing > 100 MB of characters.

sshestov · January 29, 2024, 11:16am

Thanks again for this detailed analysis! I’ve tried to implement it, and … no, I was not able to gain anything.

First, I had to slightly modify the logic. Before that I was looping through a chunk of my huge array i11 inside a conditional loop:

cpos = ch_start(chunk_num)
do
  pos_s = findloc(i11(cpos:),scode,DIM=1) - 1
  !!! do stuff with temperature scan and NEW_LINE position;
  !!! pos_nl is 3-6 symbols to the right of pos_s
  cpos = cpos + pos_nl
  if (cpos > ch_end(chunk_num)) exit
end

Now I scroll through the whole chunk in any case

  do j=ch_start(chunk_num), ch_end(chunk_num), 8
    r = transfer(i11(j:j+7),r)
    pos_s = t_index(r)              !! sc position within 8-byte sub-array
    if (pos_s > 0) then 
      pos_s = j + pos_s-1 - cpos    !! I actually need sc position with respect to cpos
      !!! do stuff with temperature scan and NEW_LINE position;
      !!! pos_nl is 3-6 symbols to the right of pos_s
      cpos = cpos + pos_nl
    end if
  end do

I also moved 8-byte sc = int(z'3b3b3b3b3b3b3b3b',8) inside t_index.

The whole thing works, but I do not see any improvement. Furthermore, in the beginning it was slower But after I have resolved all the collisions of OpenMP over-the-chunk-number-loop the time reduced to the original ~3 s. Switching to ifx just slows down the whole thing to 7-19 s.

It looks like we are not able to over-perform gfortran

Topic		Replies	Views
Fortran openMP - make code faster	4	477	August 31, 2022
Parallel Programming with Coarrays in Fortran (blog post)	5	763	April 1, 2024
Parallel programming resources	2	600	October 14, 2021
OpenMP and `do concurrent` loop = crash at runtime Help	11	1044	May 5, 2023
LFortran can now build legacy and modern Minpack Announcements	4	1269	May 3, 2023

1 billion row challenge in Fortran : mmap + OpenMP + branchless scan

Attempt 1

Attempt 2

Related topics