Just tried your version. No, it does not speed up. It looks like with ifort
and ifx
the execution time is the same ~12 s with both findloc()
and findsc2()
, as well as gfortran & findsc2()
. However, with gfortran
and findloc()
it immediately drops to <3 s.
Hey guys, I learned Fortran today just so that I could write this, so Iām a bit new. I wanted to see how well a binary prefix tree would do: no hash table! It does 10M rows in about 4.2s on my machine (single threaded).
Its much slower than I thought: I was hoping for < 1 sec. Iād be very grateful for any tips regarding performance improvements. Iām sure thereās lots I must be doing sub-optimally.
You can run it like this:
gfortran -O3 -o 1brc 1brc.f90 && time ./1brc
Here is the code:
module trie_types
implicit none
type :: trie
character(len=1) :: key = '!'
real :: min = 100
real :: max = -100
real :: sum = 0
integer :: count = 0
type(trie), pointer :: child => null()
type(trie), pointer :: sibling => null()
end type trie
end module trie_types
module builder
use trie_types
implicit none
contains
subroutine update(buffer, root)
character(len=*), intent(in) :: buffer
type(trie), pointer, intent(inout) :: root
type(trie), pointer :: current
integer :: i=1
character(len=1) :: c
real :: f
current => root
do while (i <= len(buffer))
c = buffer(i:i)
if (c == ';') then
read(buffer(i+1:), *) f
current%min = min(current%min, f)
current%max = max(current%max, f)
current%sum = current%sum + f
current%count = current%count + 1
current => root
i = i + merge(5,6,abs(f)<10) + merge(1,0,sign(1.0,f)<0)
else if (.not. associated(current%child)) then
allocate(current%child)
current => current%child
current%key = c
i = i + 1
else if (current%child%key == c) then
current => current%child
i = i + 1
else
current => current%child
do
if (.not. associated(current%sibling)) then
allocate(current%sibling)
current => current%sibling
current%key = c
exit
endif
current => current%sibling
if (current%key == c) exit
end do
i = i + 1
end if
end do
end subroutine update
recursive subroutine display(tree, prefix)
type(trie), pointer, intent(in) :: tree
character(len=*), intent(in), optional :: prefix
character(len=:), allocatable :: new_prefix
if (.not. associated(tree)) return
if (present(prefix)) then
new_prefix = prefix // tree%key
else
new_prefix = tree%key
end if
if (tree%min /= 100) then
print '(A,A,F5.1,F5.1,F5.1)', new_prefix, ";", &
tree%min, tree%max, tree%sum / tree%count
end if
if (associated(tree%child)) then
call display(tree%child, new_prefix)
end if
if (associated(tree%sibling)) then
call display(tree%sibling, prefix)
end if
end subroutine display
end module builder
program one_brc
use trie_types
use builder
implicit none
integer, parameter :: buffer_size = 10000
character(len=:), allocatable :: buffer
integer :: fd, read_size
type(trie), pointer :: tree
open(newunit=fd, file='measurements.txt', access='stream')
inquire(unit=fd, size=read_size)
allocate(tree)
allocate(character(len=read_size) :: buffer)
read(fd) buffer
call update(buffer, tree)
call display(tree%child)
end program one_brc
EDIT: @sshestov I notice your solution uses C calls such as mmap heavily. Is this likely to also make a big difference in my version?
Hi Emir, nice trial. Yes, I was using mmap from the very beginning and even did not try to read the file by the standard mechanisms. I believe it speeds-up significantly (it was known from the beginning it will be faster and this is the operating system that is responsible for almost everything, you do not need to allocate&whatsoever). It is quite simple to use, I encourage you to use my code for that.
During my trials, I have found many smaller things that were slowing down the performance significantly! I get to the point when parsing the array in memory became the bottleneck (I was checking the numbers, hash-ing does not take a lot of time). Thus I would suggest you to switch from arrays of characters to integer(1) parse the numbers manually, not by read(). Also in my case switching from if-then-else read to a branchless version gave another significant benefit.
Here is a small modification which has taken it from ~4s to ~1s for 10M rows. I replaced the read(buffer, *) f with a custom str2real function:
function str2real (str) result(f)
character(len=*), intent(in) :: str
integer :: i, off
integer :: zero
real :: f
zero = ichar('0')
off = merge (2, 1, str(1:1)=='-')
i = index(str(off:),'.')
if (i==2) then
f = ichar(str(off:off)) - zero
else
f = (ichar(str(off:off)) - zero)*10 + ichar(str(off+1:off+1)) - zero
end if
f = f + (ichar(str(off+i:off+i)) - zero) / 10.0
f = merge(-f,f,off==2)
end function str2real
UPDATE: currently it does not go to the end of a 1G file (I see ~160 000 entries / city; while it should be ~2 500 000). That is probably why it gives 34s vs 27s.
BTW, Iāve checked your code (original one, with read()
) with the 1G-row file, it took 34s on my computer. While my program run with a single thread takes 27s. Not a huge difference.
I expect the situation could be different if you introduce parallel processing.
Hey here is my more polished solution: 136 lines of code. It uses the custom float parsing, and also chunks the file so that it can do the full 1B rows. It is single threaded. It takes 2m8s on my machine ā or about 8.5s per 1G. Running it on a laptop: i7-1165G7 @ 2.80GHz, 16G RAM + SSD.
Here is a Trie version of the prefix tree code. Its 117 lines of code, completes 1B rows in 1m25s; single threaded. Still looking for a 10x speed up Grateful for concrete tips!
You might experiment with asynchronous i/o. If your compiler supports it, then you can overlap the i/o operations with processing of data within the buffers.
After you read in the first buffer, you initiate the read of the second buffer. After you have processed the first buffer, you wait for the second buffer i/o to complete. When the second buffer is available, you use move_alloc()
to swap the buffers (among several possibilities), and initiate the i/o on the new second buffer and begin processing the new first buffer.
The possible problem is that some compilers donāt support actual asynchronous i/o, they just do normal synchronous i/o using the asynchronous syntax, so you wonāt actually gain anything.
UPDATE: just added few links to the leaders, itās very interesting to read about their solutions.
Hi @RonShepard, thanks for the advice. In fact, all the āparticipantsā are normally comparing warm runs, when the whole 13Gb file is already in memory. The winners clearly use a lot of low-level C-magic (not because of C, but they are finally implemented in C) and include parallel workers, chunks, AVX instructions etc, and do it in ~0.2 s
In my case Iām using mmap to get the file into memory and process it there in chunks via OpenMP. I believe in this case the io and processing goes asynchronously, since my timing is ~4 s, i.e. closer to the leaders and way better then a single process.
Here is my hash table version in 134 lines. It completes single threaded in 46s, of which 27s is spent reading and breaking it into lines ā¦ My disk is slow time cat measurements > /dev/null
takes ~13s. This given @RonShepard , async IO could make a significant difference, at least to my local runs, but Iām not sure if itād make any difference if the whole file was in memory.
I might try this on a meaty AWS machine just so that I can use tmpfs to put the whole file in memory and see how quickly it can process it then
Regarding the parallel processing, I thought I could get 16 threads to split the key space using mod (ichar(key(1:1)), 16)
, all read the whole file, but only hash table the stuff in their key space, and then just output it at the end. Is this a terrible idea? Would it be better to chunk it and then write a reducer to combine the results?
gprof says a lot of time is spent in the hash function. Its just my sloppy implementation of FNV1-a. I didnāt know how to get around the unsigned int problem so I calculate a 32-bit hash in a 64-bit space and use modulus to control for overflow, but it seems slow. Do you see some obvious way to make it faster?
pure function hash(key,m) result(h)
use, intrinsic :: iso_fortran_env, only: int64
integer(int64), parameter :: prime = 16777619_int64
integer(int64), parameter :: basis = 2166136261_int64
integer(int64) :: h
character(len=*), intent(in) :: key
integer, intent(in) :: m
integer :: i
h = basis
do i = 1, len(key)
h = ieor(h, iachar(key(i:i), int64))
h = mod(h * prime, 2_int64**32)
end do
h = mod (h,m)
end function hash
EDIT: hmmā¦ maybe I can fix the prefix ā¦ It makes sense not consider the whole key but just a prefix long enough to distinguish the cities. Presumably if I halve the number of characters processed, I halve the time spent in this function.
Iām struggling to understand why this would be? x= merge(a,b,cond)
seems to always make an assignment but situations like calculating the minimum only require an assignment sometimes.
EDIT:
@ivanpribec I tried your SIMD functions and variants thereof. Thank you, it taught me a bit about SIMD! It is much slower that findloc for me. I think mostly because ā;ā (in my case new line) is not very far away from the start of any given search. I.e. its mostly within 32 characters. Here is my function below. I wrote it slightly backwards. This way there is no need to re-run the scan, and it can handle input of any length. Again, Iād imagine it would work better if the character we are looking for was much more sparse.
function findnl(ns) result(idx)
integer(1), intent(in) :: ns(:)
integer(1) :: r(16)
integer :: l, i, k, z, idx
l = size(ns)
i = 1
if (l >= 16) then
do i = 1, l, 16
r = transfer(ns(i:i+16-1),r)
z = 17
!$omp simd simdlen(16) reduction(min: z)
do k = 1, 16
z = min (z, merge(k, 17, r(k)==10))
end do
if (z<17) then
idx = i + z - 1
return
end if
end do
end if
do i = i, mod(l, 16)
if (ns(i)==10) then
idx = i
return
endif
end do
idx = 0
end function findnl
@sshestov my hash + mmap version (thank you for the help). Its now down to 38s for 1B rows
Iām struggling to understand why this would be?
x= merge(a,b,cond)
seems to always make an assignment but situations like calculating the minimum only require an assignment sometimes.
Iām not an expert, but a) the whole INTERNET claims it will be like that; b) it was in my case; c) a branchless version requiers 3-4 computations (and I believe the number of instructions will be close), while with if-then-else - the real execution can not be fast.
It looks (to me) like your hash function is very expensive. Can you try another one? BTW, I was also strugling with converting a signed int to an unsigned, and the solution is ā¦ to do +32767.
ā¦ a-a-a-a-a ā¦ you already changed to integer(kind=1) The remaining step is to read the file in chunks and do OpenMP. This is quite easy actually.
Also, a while ago I saw a claim from an experienced guy that inside the mmap one should use not c_ptr
but rather c_int_ptr
. I did not get it, and it was not working for me.
If anyone wants to test the Fortran codes without downloading a 13Gb file, I generated a file with 1e+6 rows, uploaded to GitHub - Beliavsky/1brc_data_subsets: subset data file for 1brc, the 1 billion row challenge, with 1 million rows using the Python code with the patch described at No such file in directory & 'charmap' codec can't decode Ā· Issue #715 Ā· gunnarmorling/1brc Ā· GitHub.
Here is my hash + mmap + OpenMP version in Fortran:
gfortran -fopenmp -march=native -ffast-math -O3 -o 1brc 1brc_hash_mmap_openmp.f90
time ./1brc | wc -l
8875
real 0m9.522s
user 0m51.326s
sys 0m0.872s
Thatās on 4 cores and 8 threads. mmap with OpenMP is wonderful. mmap fully deals with the random access of a file much too big to fit into memory. The problem at this stage is that the single threaded version canāt saturate even a single CPU. Utilisation for the parallel version is even worse for the same reason, so starting with the whole file in RAM is necessary to get to the 1.53s Java benchmark.
Still looking for a more performant hash function if anyone knows one!
Well, Iāve tried your code on my PC, and get up to 3.3 s (while there are runs with ~20 s and ~5 s). In the case of my version I get 2.9 s.
As for the has functions, I took examples from here1 or here2.
Iāve implemented the first one as follows:
integer function get_hash_i(array) result(res)
integer(kind=1), dimension(:), intent(in) :: array
integer :: len, i
integer(2) :: ha
ha=21845 ! 0x5555=21845; the 0x5555 is suggested by the author
do i=1,size(array)
ha = ieor( ior(ishftc(ha,2),ishftc(ha,-14)) , int(array(i),2) )
end do
!res = merge(int(ha,kind=4),65536-int(not(ha),kind=4)-1,ha>=0) !! int(2) is signed; converting to unsigned int(4)
res = 32766+ha
end function get_hash_i
Thanks! Does it mean we can expect soon a version from you?
Thank you for running it! What spec is your PC? Youāll get best performance by setting the number of parts to twice the number of cores (assuming hyper threading). For example, for 8 cores it should be integer, parameter :: parts = 16
. Iām super excited about trying the hash functions! Will report back.
Also, can you share what parameters youāre using with gfortan to get the best performance speed?
EDIT : I tried your function above verbatim, it generates a huge amount of conflicts, so you might get much better performance by using mine instead (its FNV1-a).
In my case itās Intel i9-9900X 3.50GHz. Iām using either -O3 -march=native
or just -O3
. I do not see any significant difference is speed. Iāve also tried different compilers (with different parameters) and my conclusion is that gfortran
is significantly faster than others. I suspect itās just a coincidence for this particular problem (gfortran
was able to vectorize it better than the others).
If I run your code with parts=8
I get 4.2s, parts=16
=> 3.5s, and managed to get 2.8s with parts=20
. I must admit though that the execution time jumps from 3.5 to 20 s in a random way. While in my case Iām almost always get very much same results. Does it suggest an un-foreseen behavior in your code?
Also for the hash function: I do not resolve conflicts (intentionally); sufficiently large hash table & algorithm result in no conflicts. At least in my case on a given set of realistic city names. I can not imagine why do you get them. As it was my way to broaden my horizon, I did not want to go to extreme and with ārandom length city names with random lettersā.
The crazy variation in time I think is caused by mmap I think , when you overload it with concurrent access it results with threads locking and waiting for the rest to finish before continuing. It only happens after a particular level of concurrency.
Your hash function and code seem to produce lots of conflicts so far as I can see. Try cutting at ā;ā and doing:
cat measurements.txt | grep -oP ā^.+;ā | sort -u | wc -l
at the command line and compare it to the number of rows you generate.