1 billion row challenge in Fortran : mmap + OpenMP + branchless scan

Just tried your version. No, it does not speed up. It looks like with ifort and ifx the execution time is the same ~12 s with both findloc() and findsc2(), as well as gfortran & findsc2(). However, with gfortran and findloc() it immediately drops to <3 s.

Hey guys, I learned Fortran today just so that I could write this, so Iā€™m a bit new. I wanted to see how well a binary prefix tree would do: no hash table! It does 10M rows in about 4.2s on my machine (single threaded).

Its much slower than I thought: I was hoping for < 1 sec. Iā€™d be very grateful for any tips regarding performance improvements. Iā€™m sure thereā€™s lots I must be doing sub-optimally.

You can run it like this:

gfortran -O3 -o 1brc 1brc.f90 && time ./1brc

Here is the code:

module trie_types
  implicit none
  
  type :: trie
     character(len=1) :: key = '!'
     real :: min = 100
     real :: max = -100
     real :: sum = 0
     integer :: count = 0
     type(trie), pointer :: child => null()
     type(trie), pointer :: sibling => null()
  end type trie  
end module trie_types

module builder
  use trie_types
  implicit none

contains

  subroutine update(buffer, root)
    character(len=*), intent(in) :: buffer
    type(trie), pointer, intent(inout) :: root
    type(trie), pointer :: current
    integer :: i=1
    character(len=1) :: c
    real :: f

    current => root
    do while (i <= len(buffer))
       c = buffer(i:i)
       if (c == ';') then
          read(buffer(i+1:), *) f
          current%min   = min(current%min, f)
          current%max   = max(current%max, f)
          current%sum   = current%sum + f
          current%count = current%count + 1
          current => root
          i = i + merge(5,6,abs(f)<10) + merge(1,0,sign(1.0,f)<0)
       else if (.not. associated(current%child)) then
          allocate(current%child)
          current => current%child
          current%key = c
          i = i + 1
       else if (current%child%key == c) then
          current => current%child
          i = i + 1
       else
          current => current%child
          do
             if (.not. associated(current%sibling)) then
                allocate(current%sibling)
                current => current%sibling
                current%key = c
                exit
             endif
             current => current%sibling
             if (current%key == c) exit
          end do
          i = i + 1
       end if
    end do

  end subroutine update

  recursive subroutine display(tree, prefix)
    type(trie), pointer, intent(in) :: tree
    character(len=*), intent(in), optional :: prefix
    character(len=:), allocatable :: new_prefix

    if (.not. associated(tree)) return
    if (present(prefix)) then
        new_prefix = prefix // tree%key
    else
        new_prefix = tree%key
    end if
    if (tree%min /= 100) then
       print '(A,A,F5.1,F5.1,F5.1)', new_prefix, ";", &
            tree%min, tree%max, tree%sum / tree%count
    end if
    if (associated(tree%child)) then
        call display(tree%child, new_prefix)
    end if
    if (associated(tree%sibling)) then
        call display(tree%sibling, prefix)
    end if
  end subroutine display
end module builder

program one_brc
  use trie_types
  use builder
  implicit none

  integer, parameter :: buffer_size = 10000
  character(len=:), allocatable :: buffer
  integer :: fd, read_size
  type(trie), pointer :: tree

  open(newunit=fd, file='measurements.txt', access='stream')
  inquire(unit=fd, size=read_size)
  allocate(tree)
  allocate(character(len=read_size) :: buffer)
  read(fd) buffer
  call update(buffer, tree)
  call display(tree%child)
end program one_brc

EDIT: @sshestov I notice your solution uses C calls such as mmap heavily. Is this likely to also make a big difference in my version?

Hi Emir, nice trial. Yes, I was using mmap from the very beginning and even did not try to read the file by the standard mechanisms. I believe it speeds-up significantly (it was known from the beginning it will be faster and this is the operating system that is responsible for almost everything, you do not need to allocate&whatsoever). It is quite simple to use, I encourage you to use my code for that.

During my trials, I have found many smaller things that were slowing down the performance significantly! I get to the point when parsing the array in memory became the bottleneck (I was checking the numbers, hash-ing does not take a lot of time). Thus I would suggest you to switch from arrays of characters to integer(1) parse the numbers manually, not by read(). Also in my case switching from if-then-else read to a branchless version gave another significant benefit.

1 Like

Here is a small modification which has taken it from ~4s to ~1s for 10M rows. I replaced the read(buffer, *) f with a custom str2real function:

  function str2real (str) result(f)
    character(len=*), intent(in) :: str
    integer :: i, off
    integer :: zero
    real :: f
    zero = ichar('0')
    off = merge (2, 1, str(1:1)=='-')
    i = index(str(off:),'.')
    if (i==2) then
       f = ichar(str(off:off)) - zero
    else 
       f = (ichar(str(off:off)) - zero)*10 + ichar(str(off+1:off+1)) - zero
    end if
    f = f + (ichar(str(off+i:off+i)) - zero) / 10.0
    f = merge(-f,f,off==2)
  end function str2real

UPDATE: currently it does not go to the end of a 1G file (I see ~160 000 entries / city; while it should be ~2 500 000). That is probably why it gives 34s vs 27s.

BTW, Iā€™ve checked your code (original one, with read()) with the 1G-row file, it took 34s on my computer. While my program run with a single thread takes 27s. Not a huge difference.

I expect the situation could be different if you introduce parallel processing.

Hey here is my more polished solution: 136 lines of code. It uses the custom float parsing, and also chunks the file so that it can do the full 1B rows. It is single threaded. It takes 2m8s on my machine ā€“ or about 8.5s per 1G. Running it on a laptop: i7-1165G7 @ 2.80GHz, 16G RAM + SSD.

Here is a Trie version of the prefix tree code. Its 117 lines of code, completes 1B rows in 1m25s; single threaded. Still looking for a 10x speed up :slight_smile: Grateful for concrete tips!

You might experiment with asynchronous i/o. If your compiler supports it, then you can overlap the i/o operations with processing of data within the buffers.

After you read in the first buffer, you initiate the read of the second buffer. After you have processed the first buffer, you wait for the second buffer i/o to complete. When the second buffer is available, you use move_alloc() to swap the buffers (among several possibilities), and initiate the i/o on the new second buffer and begin processing the new first buffer.

The possible problem is that some compilers donā€™t support actual asynchronous i/o, they just do normal synchronous i/o using the asynchronous syntax, so you wonā€™t actually gain anything.

1 Like

UPDATE: just added few links to the leaders, itā€™s very interesting to read about their solutions.

Hi @RonShepard, thanks for the advice. In fact, all the ā€œparticipantsā€ are normally comparing warm runs, when the whole 13Gb file is already in memory. The winners clearly use a lot of low-level C-magic (not because of C, but they are finally implemented in C) and include parallel workers, chunks, AVX instructions etc, and do it in ~0.2 s

link1 link2

In my case Iā€™m using mmap to get the file into memory and process it there in chunks via OpenMP. I believe in this case the io and processing goes asynchronously, since my timing is ~4 s, i.e. closer to the leaders and way better then a single process.

2 Likes

Here is my hash table version in 134 lines. It completes single threaded in 46s, of which 27s is spent reading and breaking it into lines ā€¦ My disk is slow time cat measurements > /dev/null takes ~13s. This given @RonShepard , async IO could make a significant difference, at least to my local runs, but Iā€™m not sure if itā€™d make any difference if the whole file was in memory.

I might try this on a meaty AWS machine just so that I can use tmpfs to put the whole file in memory and see how quickly it can process it then :slight_smile:

Regarding the parallel processing, I thought I could get 16 threads to split the key space using mod (ichar(key(1:1)), 16) , all read the whole file, but only hash table the stuff in their key space, and then just output it at the end. Is this a terrible idea? Would it be better to chunk it and then write a reducer to combine the results?

gprof says a lot of time is spent in the hash function. Its just my sloppy implementation of FNV1-a. I didnā€™t know how to get around the unsigned int problem so I calculate a 32-bit hash in a 64-bit space and use modulus to control for overflow, but it seems slow. Do you see some obvious way to make it faster?

  pure function hash(key,m) result(h)
    use, intrinsic :: iso_fortran_env, only: int64
    integer(int64), parameter :: prime = 16777619_int64
    integer(int64), parameter :: basis = 2166136261_int64
    integer(int64) :: h
    character(len=*), intent(in) :: key
    integer, intent(in) :: m
    integer :: i
    h = basis
    do i = 1, len(key)
       h = ieor(h, iachar(key(i:i), int64))
       h = mod(h * prime, 2_int64**32)
    end do
    h = mod (h,m)
  end function hash

EDIT: hmmā€¦ maybe I can fix the prefix ā€¦ It makes sense not consider the whole key but just a prefix long enough to distinguish the cities. Presumably if I halve the number of characters processed, I halve the time spent in this function.

Iā€™m struggling to understand why this would be? x= merge(a,b,cond) seems to always make an assignment but situations like calculating the minimum only require an assignment sometimes.

EDIT:

@ivanpribec I tried your SIMD functions and variants thereof. Thank you, it taught me a bit about SIMD! It is much slower that findloc for me. I think mostly because ā€œ;ā€ (in my case new line) is not very far away from the start of any given search. I.e. its mostly within 32 characters. Here is my function below. I wrote it slightly backwards. This way there is no need to re-run the scan, and it can handle input of any length. Again, Iā€™d imagine it would work better if the character we are looking for was much more sparse.

  function findnl(ns) result(idx)
    integer(1), intent(in) :: ns(:)
    integer(1) :: r(16)
    integer :: l, i, k, z, idx
    
    l = size(ns)

    i = 1
    if (l >= 16) then
       do i = 1, l, 16
          r = transfer(ns(i:i+16-1),r)
          z = 17
          !$omp simd simdlen(16) reduction(min: z)
          do k = 1, 16
             z = min (z, merge(k, 17, r(k)==10))
          end do
          if (z<17) then
             idx = i + z - 1
             return
          end if
       end do
    end if
    do i = i, mod(l, 16)
       if (ns(i)==10) then
          idx = i
          return
       endif
    end do
    idx = 0  
    end function findnl

@sshestov my hash + mmap version (thank you for the help). Its now down to 38s for 1B rows :slight_smile:

Iā€™m struggling to understand why this would be? x= merge(a,b,cond) seems to always make an assignment but situations like calculating the minimum only require an assignment sometimes.

Iā€™m not an expert, but a) the whole INTERNET claims it will be like that; b) it was in my case; c) a branchless version requiers 3-4 computations (and I believe the number of instructions will be close), while with if-then-else - the real execution can not be fast.

It looks (to me) like your hash function is very expensive. Can you try another one? BTW, I was also strugling with converting a signed int to an unsigned, and the solution is ā€¦ to do +32767.

ā€¦ a-a-a-a-a ā€¦ you already changed to integer(kind=1) :slight_smile: The remaining step is to read the file in chunks and do OpenMP. This is quite easy actually.

Also, a while ago I saw a claim from an experienced guy that inside the mmap one should use not c_ptr but rather c_int_ptr. I did not get it, and it was not working for me.

If anyone wants to test the Fortran codes without downloading a 13Gb file, I generated a file with 1e+6 rows, uploaded to GitHub - Beliavsky/1brc_data_subsets: subset data file for 1brc, the 1 billion row challenge, with 1 million rows using the Python code with the patch described at No such file in directory & 'charmap' codec can't decode Ā· Issue #715 Ā· gunnarmorling/1brc Ā· GitHub.

1 Like

Here is my hash + mmap + OpenMP version in Fortran:

gfortran -fopenmp -march=native -ffast-math -O3 -o 1brc 1brc_hash_mmap_openmp.f90
time ./1brc | wc -l

8875

real    0m9.522s
user    0m51.326s
sys     0m0.872s

Thatā€™s on 4 cores and 8 threads. mmap with OpenMP is wonderful. mmap fully deals with the random access of a file much too big to fit into memory. The problem at this stage is that the single threaded version canā€™t saturate even a single CPU. Utilisation for the parallel version is even worse for the same reason, so starting with the whole file in RAM is necessary to get to the 1.53s Java benchmark.

Still looking for a more performant hash function if anyone knows one!

Well, Iā€™ve tried your code on my PC, and get up to 3.3 s (while there are runs with ~20 s and ~5 s). In the case of my version I get 2.9 s.

As for the has functions, I took examples from here1 or here2.

Iā€™ve implemented the first one as follows:

integer function get_hash_i(array) result(res)
  integer(kind=1), dimension(:), intent(in) :: array
  integer :: len, i
  integer(2) :: ha

  ha=21845  ! 0x5555=21845; the 0x5555 is suggested by the author
  do i=1,size(array)
    ha = ieor( ior(ishftc(ha,2),ishftc(ha,-14)) , int(array(i),2) )
  end do 
  !res = merge(int(ha,kind=4),65536-int(not(ha),kind=4)-1,ha>=0)  !! int(2) is signed; converting to unsigned int(4)
  res = 32766+ha
end function get_hash_i
1 Like

Thanks! Does it mean we can expect soon a version from you? :wink:

Thank you for running it! What spec is your PC? Youā€™ll get best performance by setting the number of parts to twice the number of cores (assuming hyper threading). For example, for 8 cores it should be integer, parameter :: parts = 16 . Iā€™m super excited about trying the hash functions! Will report back.

Also, can you share what parameters youā€™re using with gfortan to get the best performance speed?

EDIT : I tried your function above verbatim, it generates a huge amount of conflicts, so you might get much better performance by using mine instead (its FNV1-a).

In my case itā€™s Intel i9-9900X 3.50GHz. Iā€™m using either -O3 -march=native or just -O3. I do not see any significant difference is speed. Iā€™ve also tried different compilers (with different parameters) and my conclusion is that gfortran is significantly faster than others. I suspect itā€™s just a coincidence for this particular problem (gfortran was able to vectorize it better than the others).

If I run your code with parts=8 I get 4.2s, parts=16 => 3.5s, and managed to get 2.8s with parts=20. I must admit though that the execution time jumps from 3.5 to 20 s in a random way. While in my case Iā€™m almost always get very much same results. Does it suggest an un-foreseen behavior in your code?

Also for the hash function: I do not resolve conflicts (intentionally); sufficiently large hash table & algorithm result in no conflicts. At least in my case on a given set of realistic city names. I can not imagine why do you get them. As it was my way to broaden my horizon, I did not want to go to extreme and with ā€œrandom length city names with random lettersā€.

1 Like

The crazy variation in time I think is caused by mmap I think , when you overload it with concurrent access it results with threads locking and waiting for the rest to finish before continuing. It only happens after a particular level of concurrency.

Your hash function and code seem to produce lots of conflicts so far as I can see. Try cutting at ā€œ;ā€ and doing:

cat measurements.txt | grep -oP ā€œ^.+;ā€ | sort -u | wc -l

at the command line and compare it to the number of rows you generate.