1 billion row challenge in Fortran : mmap + OpenMP + branchless scan

sshestov · January 30, 2024, 9:23am

Just tried your version. No, it does not speed up. It looks like with ifort and ifx the execution time is the same ~12 s with both findloc() and findsc2(), as well as gfortran & findsc2(). However, with gfortran and findloc() it immediately drops to <3 s.

emiruz · March 13, 2024, 11:52pm

Hey guys, I learned Fortran today just so that I could write this, so I’m a bit new. I wanted to see how well a binary prefix tree would do: no hash table! It does 10M rows in about 4.2s on my machine (single threaded).

Its much slower than I thought: I was hoping for < 1 sec. I’d be very grateful for any tips regarding performance improvements. I’m sure there’s lots I must be doing sub-optimally.

You can run it like this:

gfortran -O3 -o 1brc 1brc.f90 && time ./1brc

Here is the code:

module trie_types
  implicit none
  
  type :: trie
     character(len=1) :: key = '!'
     real :: min = 100
     real :: max = -100
     real :: sum = 0
     integer :: count = 0
     type(trie), pointer :: child => null()
     type(trie), pointer :: sibling => null()
  end type trie  
end module trie_types

module builder
  use trie_types
  implicit none

contains

  subroutine update(buffer, root)
    character(len=*), intent(in) :: buffer
    type(trie), pointer, intent(inout) :: root
    type(trie), pointer :: current
    integer :: i=1
    character(len=1) :: c
    real :: f

    current => root
    do while (i <= len(buffer))
       c = buffer(i:i)
       if (c == ';') then
          read(buffer(i+1:), *) f
          current%min   = min(current%min, f)
          current%max   = max(current%max, f)
          current%sum   = current%sum + f
          current%count = current%count + 1
          current => root
          i = i + merge(5,6,abs(f)<10) + merge(1,0,sign(1.0,f)<0)
       else if (.not. associated(current%child)) then
          allocate(current%child)
          current => current%child
          current%key = c
          i = i + 1
       else if (current%child%key == c) then
          current => current%child
          i = i + 1
       else
          current => current%child
          do
             if (.not. associated(current%sibling)) then
                allocate(current%sibling)
                current => current%sibling
                current%key = c
                exit
             endif
             current => current%sibling
             if (current%key == c) exit
          end do
          i = i + 1
       end if
    end do

  end subroutine update

  recursive subroutine display(tree, prefix)
    type(trie), pointer, intent(in) :: tree
    character(len=*), intent(in), optional :: prefix
    character(len=:), allocatable :: new_prefix

    if (.not. associated(tree)) return
    if (present(prefix)) then
        new_prefix = prefix // tree%key
    else
        new_prefix = tree%key
    end if
    if (tree%min /= 100) then
       print '(A,A,F5.1,F5.1,F5.1)', new_prefix, ";", &
            tree%min, tree%max, tree%sum / tree%count
    end if
    if (associated(tree%child)) then
        call display(tree%child, new_prefix)
    end if
    if (associated(tree%sibling)) then
        call display(tree%sibling, prefix)
    end if
  end subroutine display
end module builder

program one_brc
  use trie_types
  use builder
  implicit none

  integer, parameter :: buffer_size = 10000
  character(len=:), allocatable :: buffer
  integer :: fd, read_size
  type(trie), pointer :: tree

  open(newunit=fd, file='measurements.txt', access='stream')
  inquire(unit=fd, size=read_size)
  allocate(tree)
  allocate(character(len=read_size) :: buffer)
  read(fd) buffer
  call update(buffer, tree)
  call display(tree%child)
end program one_brc

EDIT: @sshestov I notice your solution uses C calls such as mmap heavily. Is this likely to also make a big difference in my version?

sshestov · March 14, 2024, 10:22am

Hi Emir, nice trial. Yes, I was using mmap from the very beginning and even did not try to read the file by the standard mechanisms. I believe it speeds-up significantly (it was known from the beginning it will be faster and this is the operating system that is responsible for almost everything, you do not need to allocate&whatsoever). It is quite simple to use, I encourage you to use my code for that.

During my trials, I have found many smaller things that were slowing down the performance significantly! I get to the point when parsing the array in memory became the bottleneck (I was checking the numbers, hash-ing does not take a lot of time). Thus I would suggest you to switch from arrays of characters to integer(1) parse the numbers manually, not by read(). Also in my case switching from if-then-else read to a branchless version gave another significant benefit.

emiruz · March 14, 2024, 11:32am

Here is a small modification which has taken it from ~4s to ~1s for 10M rows. I replaced the read(buffer, *) f with a custom str2real function:

  function str2real (str) result(f)
    character(len=*), intent(in) :: str
    integer :: i, off
    integer :: zero
    real :: f
    zero = ichar('0')
    off = merge (2, 1, str(1:1)=='-')
    i = index(str(off:),'.')
    if (i==2) then
       f = ichar(str(off:off)) - zero
    else 
       f = (ichar(str(off:off)) - zero)*10 + ichar(str(off+1:off+1)) - zero
    end if
    f = f + (ichar(str(off+i:off+i)) - zero) / 10.0
    f = merge(-f,f,off==2)
  end function str2real

sshestov · March 14, 2024, 12:37pm

UPDATE: currently it does not go to the end of a 1G file (I see ~160 000 entries / city; while it should be ~2 500 000). That is probably why it gives 34s vs 27s.

BTW, I’ve checked your code (original one, with read()) with the 1G-row file, it took 34s on my computer. While my program run with a single thread takes 27s. Not a huge difference.

I expect the situation could be different if you introduce parallel processing.

emiruz · March 14, 2024, 4:11pm

Hey here is my more polished solution: 136 lines of code. It uses the custom float parsing, and also chunks the file so that it can do the full 1B rows. It is single threaded. It takes 2m8s on my machine – or about 8.5s per 1G. Running it on a laptop: i7-1165G7 @ 2.80GHz, 16G RAM + SSD.

emiruz · March 14, 2024, 9:12pm

Here is a Trie version of the prefix tree code. Its 117 lines of code, completes 1B rows in 1m25s; single threaded. Still looking for a 10x speed up Grateful for concrete tips!

RonShepard · March 14, 2024, 10:21pm

You might experiment with asynchronous i/o. If your compiler supports it, then you can overlap the i/o operations with processing of data within the buffers.

After you read in the first buffer, you initiate the read of the second buffer. After you have processed the first buffer, you wait for the second buffer i/o to complete. When the second buffer is available, you use move_alloc() to swap the buffers (among several possibilities), and initiate the i/o on the new second buffer and begin processing the new first buffer.

The possible problem is that some compilers don’t support actual asynchronous i/o, they just do normal synchronous i/o using the asynchronous syntax, so you won’t actually gain anything.

sshestov · March 15, 2024, 9:29am

UPDATE: just added few links to the leaders, it’s very interesting to read about their solutions.

Hi @RonShepard, thanks for the advice. In fact, all the “participants” are normally comparing warm runs, when the whole 13Gb file is already in memory. The winners clearly use a lot of low-level C-magic (not because of C, but they are finally implemented in C) and include parallel workers, chunks, AVX instructions etc, and do it in ~0.2 s

link1 link2

In my case I’m using mmap to get the file into memory and process it there in chunks via OpenMP. I believe in this case the io and processing goes asynchronously, since my timing is ~4 s, i.e. closer to the leaders and way better then a single process.

emiruz · March 15, 2024, 3:57pm

Here is my hash table version in 134 lines. It completes single threaded in 46s, of which 27s is spent reading and breaking it into lines … My disk is slow time cat measurements > /dev/null takes ~13s. This given @RonShepard , async IO could make a significant difference, at least to my local runs, but I’m not sure if it’d make any difference if the whole file was in memory.

I might try this on a meaty AWS machine just so that I can use tmpfs to put the whole file in memory and see how quickly it can process it then

Regarding the parallel processing, I thought I could get 16 threads to split the key space using mod (ichar(key(1:1)), 16) , all read the whole file, but only hash table the stuff in their key space, and then just output it at the end. Is this a terrible idea? Would it be better to chunk it and then write a reducer to combine the results?

emiruz · March 15, 2024, 4:46pm

gprof says a lot of time is spent in the hash function. Its just my sloppy implementation of FNV1-a. I didn’t know how to get around the unsigned int problem so I calculate a 32-bit hash in a 64-bit space and use modulus to control for overflow, but it seems slow. Do you see some obvious way to make it faster?

  pure function hash(key,m) result(h)
    use, intrinsic :: iso_fortran_env, only: int64
    integer(int64), parameter :: prime = 16777619_int64
    integer(int64), parameter :: basis = 2166136261_int64
    integer(int64) :: h
    character(len=*), intent(in) :: key
    integer, intent(in) :: m
    integer :: i
    h = basis
    do i = 1, len(key)
       h = ieor(h, iachar(key(i:i), int64))
       h = mod(h * prime, 2_int64**32)
    end do
    h = mod (h,m)
  end function hash

EDIT: hmm… maybe I can fix the prefix … It makes sense not consider the whole key but just a prefix long enough to distinguish the cities. Presumably if I halve the number of characters processed, I halve the time spent in this function.

emiruz · March 15, 2024, 7:19pm

I’m struggling to understand why this would be? x= merge(a,b,cond) seems to always make an assignment but situations like calculating the minimum only require an assignment sometimes.

EDIT:

@ivanpribec I tried your SIMD functions and variants thereof. Thank you, it taught me a bit about SIMD! It is much slower that findloc for me. I think mostly because “;” (in my case new line) is not very far away from the start of any given search. I.e. its mostly within 32 characters. Here is my function below. I wrote it slightly backwards. This way there is no need to re-run the scan, and it can handle input of any length. Again, I’d imagine it would work better if the character we are looking for was much more sparse.

  function findnl(ns) result(idx)
    integer(1), intent(in) :: ns(:)
    integer(1) :: r(16)
    integer :: l, i, k, z, idx
    
    l = size(ns)

    i = 1
    if (l >= 16) then
       do i = 1, l, 16
          r = transfer(ns(i:i+16-1),r)
          z = 17
          !$omp simd simdlen(16) reduction(min: z)
          do k = 1, 16
             z = min (z, merge(k, 17, r(k)==10))
          end do
          if (z<17) then
             idx = i + z - 1
             return
          end if
       end do
    end if
    do i = i, mod(l, 16)
       if (ns(i)==10) then
          idx = i
          return
       endif
    end do
    idx = 0  
    end function findnl

@sshestov my hash + mmap version (thank you for the help). Its now down to 38s for 1B rows

sshestov · March 16, 2024, 8:31pm

I’m struggling to understand why this would be? x= merge(a,b,cond) seems to always make an assignment but situations like calculating the minimum only require an assignment sometimes.

I’m not an expert, but a) the whole INTERNET claims it will be like that; b) it was in my case; c) a branchless version requiers 3-4 computations (and I believe the number of instructions will be close), while with if-then-else - the real execution can not be fast.

It looks (to me) like your hash function is very expensive. Can you try another one? BTW, I was also strugling with converting a signed int to an unsigned, and the solution is … to do +32767.

… a-a-a-a-a … you already changed to integer(kind=1) The remaining step is to read the file in chunks and do OpenMP. This is quite easy actually.

Also, a while ago I saw a claim from an experienced guy that inside the mmap one should use not c_ptr but rather c_int_ptr. I did not get it, and it was not working for me.

Beliavsky · March 17, 2024, 12:59pm

If anyone wants to test the Fortran codes without downloading a 13Gb file, I generated a file with 1e+6 rows, uploaded to GitHub - Beliavsky/1brc_data_subsets: subset data file for 1brc, the 1 billion row challenge, with 1 million rows using the Python code with the patch described at No such file in directory & 'charmap' codec can't decode · Issue #715 · gunnarmorling/1brc · GitHub.

emiruz · March 17, 2024, 11:09pm

Here is my hash + mmap + OpenMP version in Fortran:

gfortran -fopenmp -march=native -ffast-math -O3 -o 1brc 1brc_hash_mmap_openmp.f90
time ./1brc | wc -l

8875

real    0m9.522s
user    0m51.326s
sys     0m0.872s

That’s on 4 cores and 8 threads. mmap with OpenMP is wonderful. mmap fully deals with the random access of a file much too big to fit into memory. The problem at this stage is that the single threaded version can’t saturate even a single CPU. Utilisation for the parallel version is even worse for the same reason, so starting with the whole file in RAM is necessary to get to the 1.53s Java benchmark.

Still looking for a more performant hash function if anyone knows one!

sshestov · March 18, 2024, 9:11am

Well, I’ve tried your code on my PC, and get up to 3.3 s (while there are runs with ~20 s and ~5 s). In the case of my version I get 2.9 s.

As for the has functions, I took examples from here1 or here2.

I’ve implemented the first one as follows:

integer function get_hash_i(array) result(res)
  integer(kind=1), dimension(:), intent(in) :: array
  integer :: len, i
  integer(2) :: ha

  ha=21845  ! 0x5555=21845; the 0x5555 is suggested by the author
  do i=1,size(array)
    ha = ieor( ior(ishftc(ha,2),ishftc(ha,-14)) , int(array(i),2) )
  end do 
  !res = merge(int(ha,kind=4),65536-int(not(ha),kind=4)-1,ha>=0)  !! int(2) is signed; converting to unsigned int(4)
  res = 32766+ha
end function get_hash_i

sshestov · March 18, 2024, 9:13am

Thanks! Does it mean we can expect soon a version from you?

emiruz · March 18, 2024, 9:45am

Thank you for running it! What spec is your PC? You’ll get best performance by setting the number of parts to twice the number of cores (assuming hyper threading). For example, for 8 cores it should be integer, parameter :: parts = 16 . I’m super excited about trying the hash functions! Will report back.

Also, can you share what parameters you’re using with gfortan to get the best performance speed?

EDIT : I tried your function above verbatim, it generates a huge amount of conflicts, so you might get much better performance by using mine instead (its FNV1-a).

sshestov · March 18, 2024, 10:28am

In my case it’s Intel i9-9900X 3.50GHz. I’m using either -O3 -march=native or just -O3. I do not see any significant difference is speed. I’ve also tried different compilers (with different parameters) and my conclusion is that gfortran is significantly faster than others. I suspect it’s just a coincidence for this particular problem (gfortran was able to vectorize it better than the others).

If I run your code with parts=8 I get 4.2s, parts=16 => 3.5s, and managed to get 2.8s with parts=20. I must admit though that the execution time jumps from 3.5 to 20 s in a random way. While in my case I’m almost always get very much same results. Does it suggest an un-foreseen behavior in your code?

Also for the hash function: I do not resolve conflicts (intentionally); sufficiently large hash table & algorithm result in no conflicts. At least in my case on a given set of realistic city names. I can not imagine why do you get them. As it was my way to broaden my horizon, I did not want to go to extreme and with “random length city names with random letters”.

emiruz · March 18, 2024, 12:03pm

The crazy variation in time I think is caused by mmap I think , when you overload it with concurrent access it results with threads locking and waiting for the rest to finish before continuing. It only happens after a particular level of concurrency.

Your hash function and code seem to produce lots of conflicts so far as I can see. Try cutting at “;” and doing:

cat measurements.txt | grep -oP “^.+;” | sort -u | wc -l

at the command line and compare it to the number of rows you generate.

Topic		Replies	Views
On the creation of temporary arrays when passing type element arguments	24	3682	September 27, 2021
79 Languages speed competition: Can we make Fortran win?	43	4294	January 15, 2023
Advent of Code 2022 Announcements	91	3827	December 24, 2022
Advent of Code 2021 Announcements	58	2903	December 25, 2021
Advent of Code 2024 Announcements	37	1068	December 27, 2024

1 billion row challenge in Fortran : mmap + OpenMP + branchless scan

Related topics