Excessive main memory usage by stdlib_hashmaps routines

I am using the stdlib_hashmaps module in an application that needs a large hash table (ideally several hundred million entries), where each entry has a small fixed size: a few default integers for key and data combined. The total memory usage I expected would be several GB, which would pose no problem on my desktop with 64 GB RAM. But my code is using much more RAM than that.
Is there a work-around, like maybe some way to induce the hashmap routines to clear up space?
To make this concrete, I produced a simple test program that creates a hashmap of n items, with n a power of 2, in which each key is a single default integer and there is no other data. Here is that program.

program main
   use stdlib_hashmaps, only : chaining_hashmap_type
   implicit none
   integer, parameter :: n = 2**28
   call new_map (n)
contains
   subroutine new_map (n)
      integer, intent (in) :: n
      ! Produce a hash map containing n items, report the number of entries
      type (chaining_hashmap_type) :: map
      integer :: i
      call map%init()
      do i = 0, n-1
         call map%map_entry ([i])
      end do
      write (*,*) 'entries:', map%entries()
   end subroutine new_map
end program main

With the given value n = 2^28 the peak RAM usage is 47.0 GiB (reported by top), which works out to about 188 bytes per entry. It is the same for chaining_hashmap_type and open_hashmap_type. But if the hashmap for this instance had 2^29 slots and each slot used the minimal 4 bytes for this application, the total RAM usage would be only 2 GiB.
Please advise!
(I have noticed the earlier post "Stdlib Hashmap error" by Chuckyvt on this forum from December 2023. That concerns the stack size, not the RAM usage. I work around the stack size issue with ulimit -s unlimited or a suitably large value for the limit.)

Doesn’t chaining_hashmap_type imply a hash table that uses linked lists to manage collisions? That means the initial array is whatever size the hashmap decides, and each array element holds the head of a linked list that stores the actual elements. Thus for 2^28 entries you are likely to have lots of linked-list nodes allocated, at much more than 4 bytes each.
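
To see the per-node cost concretely, here is a minimal sketch (not the stdlib implementation) of such a chained node; even before any heap bookkeeping, the pointer and alignment padding already quadruple the 4-byte payload:

program chain_node_demo
   implicit none
   ! Minimal chained-bucket node: one 4-byte payload plus one pointer.
   type :: node_type
      integer :: key                               ! the 4-byte payload
      type(node_type), pointer :: next => null()   ! 8-byte link
   end type node_type
   type(node_type) :: node
   ! Expect 16 bytes on x86-64 (4 + 4 padding + 8), before the malloc
   ! bookkeeping added for each individually allocated node.
   print *, 'bytes per node:', storage_size(node)/8
end program chain_node_demo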

I have tried to understand the large memory usage by looking at the stdlib source code, but I cannot figure it out. My experiment (reported above) suggests that about a dozen 64-bit pointers are being allocated per map entry or per allocated slot. But I can't find anything like that in the source code, nor can I make sense of it. Why more than three pointers per map entry or per slot? One for the key, one for the value (absent in my test, but the library code would not know that), and one to point to the chain.

Some recent work has been devoted to improving the hashmaps' memory handling on Windows. Are you using a Windows machine to run this code? The author of that work, @hkvzjal, is active on this forum; he may have more hints to solve this issue.

Thanks for your interest and for pinging the package author. I am in a RedHat Linux environment (no root privileges) with gfortran 14.2.1. I used the fortran-lang Fortran Package Manager (fpm 0.12.0) for the test described at the top of this thread. I created a new project with fpm new, entered the posted main.f90 into the app subdirectory, and created an fpm.toml file; then fpm build and fpm run. I append the possibly relevant bits of my fpm.toml.

name = "my_stdlib_test"

[build]
auto-executables = true
auto-tests = true
auto-examples = true
module-naming = false

[fortran]
implicit-typing = false
implicit-external = false
source-form = "free"

[dependencies.stdlib]
git = "https://github.com/fortran-lang/stdlib"
branch = "stdlib-fpm"

[[executable]]
name = "sampling"
source-dir = "app"
main = "main.f90"

I'm not the author of the module; I just helped with an issue found on Windows because of the stack size, and the solution was to increase the stack memory allocation. This doesn't address the OP's original issue of why the memory usage gets blown out.

I did see a few spots in the code where stack arrays are used instead of heap arrays (allocatables). I wonder if, because of the recursive nature of creating the hash map, this might get blown out fast.

There is indeed also a stack size issue with the hashmaps module. It was raised by @Chuckyvt in the post "Stdlib Hashmap error" on this forum in December 2023. I ran into it, and it held me back for some time until I found that I needed to increase ulimit -s to something like 1 GB from the default 8 MB (or just set it to unlimited). The very large stack usage seems to be localized to the finalization routines in the module. (In the earlier posted main.f90, if one has not done a ulimit -s operation, the stack size error happens after the final print statement.)
But the issue raised in the present thread is RAM usage, not stack usage. It seems to me (with the above main.f90 as evidence) that the module uses something like 24 32-bit words of overhead per hash-table entry or slot. For a large hash table with small table entries that poses a problem.

1 Like

The chaining map entry code is here starting at line 320.

type :: chaining_map_entry_type
   private
   integer(int_hash)  :: hash_val
   type(key_type)     :: key
   class(*), allocatable :: other
   integer(int_index) :: inmap
   type(chaining_map_entry_type), pointer :: next => null()
end type chaining_map_entry_type

Here is my swag at overhead per entry in bytes, based on some feedback from Grok on typical overhead for pointers, allocatable arrays, and derived types.

hash_val   4
key       32
other     16
inmap      4
pointer   12

That sums up to 68 bytes per entry, which is in the ballpark of your 96-byte estimate. You could create a similar derived type in your test program and measure it with the sizeof() extension (or the storage_size() intrinsic) to perhaps get a more accurate number.
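
For instance, something along these lines (a sketch; the component kinds below are my guesses and should be adjusted to match the stdlib sources you actually build against):

program entry_size
   use iso_fortran_env, only: int8, int64
   implicit none
   ! Replica of chaining_map_entry_type; kinds are assumptions.
   type :: key_type
      integer(int8), allocatable :: value(:)
   end type key_type
   type :: entry_type
      integer(int64) :: hash_val
      type(key_type) :: key
      class(*), allocatable :: other
      integer(int64) :: inmap
      type(entry_type), pointer :: next => null()
   end type entry_type
   type(entry_type) :: e
   ! Reports the per-entry overhead before any key/other payload is
   ! allocated (and before per-allocation heap bookkeeping).
   print *, 'bytes per entry:', storage_size(e)/8
end program entry_size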

I am sure there is room for improvement. The key is an int_16 allocatable array wrapped in a derived type. This could almost certainly be changed to a bare allocatable int_16 array. One could also consider changing the allocatable array to a pointer.

Honestly though, hashmaps are probably never going to be very memory efficient when dealing with large numbers of entries holding relatively small data packets.

Very clear, thank you. Ballpark indeed; I was unsure whether to count overhead per entry or per allocated slot, and there is a factor of two right there. And Grok and other LLMs assure me that my application, a large hash table with small individual entries, is not uncommon: in each key-value pair the key describes some simple combinatorial object and the value is a count of that object in a very large database.

I don't think it would change anything. The memory overhead is not due to the derived-type encapsulation, but to the internal descriptor that comes with a pointer or allocatable array. From what I can see, the descriptor of a 1D array is 64 bytes with gfortran and 72 bytes with ifx.
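
A quick way to see the descriptor cost, for anyone who wants to reproduce these numbers (a sketch; the exact values are compiler-dependent):

program descriptor_size
   use iso_c_binding, only: c_ptr
   use iso_fortran_env, only: int16
   implicit none
   type :: with_descriptor
      integer(int16), allocatable :: value(:)  ! carries a full 1D array descriptor
   end type with_descriptor
   type :: with_cptr
      type(c_ptr) :: p                         ! a bare address, 8 bytes
   end type with_cptr
   type(with_descriptor) :: a
   type(with_cptr) :: b
   print *, 'descriptor bytes:', storage_size(a)/8   ! expect 64 (gfortran) / 72 (ifx)
   print *, 'c_ptr bytes:     ', storage_size(b)/8   ! expect 8
end program descriptor_size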

Using a type(c_ptr) would reduce the memory footprint to 8 bytes, but the code would be more complex and probably less efficient (with many calls to c_loc() and c_f_pointer()). Alternatively, the whole hashmap library could be written in C, with wrapper routines for Fortran.
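
A hedged sketch of what that would look like for a single key (the names are illustrative, not an existing API): the entry keeps only an 8-byte address, and the length has to be tracked separately.

program cptr_key_demo
   use iso_c_binding, only: c_ptr, c_loc, c_f_pointer
   use iso_fortran_env, only: int8
   implicit none
   integer(int8), allocatable, target :: key_store(:)  ! owns the bytes
   integer(int8), pointer :: view(:)
   type(c_ptr) :: key_ptr   ! what an entry would store: 8 bytes
   integer     :: key_len   ! ... plus the length, tracked separately

   key_store = [1_int8, 2_int8, 3_int8, 4_int8]
   key_len = size(key_store)
   key_ptr = c_loc(key_store)

   ! Every access needs a Fortran view rebuilt via c_f_pointer().
   call c_f_pointer(key_ptr, view, [key_len])
   print *, 'key bytes:', view
end program cptr_key_demo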

PS: the estimates from Grok look fairly wrong.

Edit: the whole chaining_map_entry_type storage size is 112 bytes with gfortran / 224 bytes with ifx:

hash_val:  8 /   8
key:      64 /  72
other:    24 / 128
inmap:     8 /   8
next:      8 /   8

Sorry if I’m misreading, but…

Isn’t the issue here that the load factor is being ignored?

If you have 2**28 elements, which require 112 * 2**28 bytes = 28 GiB (using @PierU's value for gfortran), then at some point before that the load factor's threshold was reached and the hashmap doubled its capacity (maybe around the 23 GiB mark).

No, it is not being ignored. We understand that with 2^28 items stored in the hash table the number of slots will be 2^29.

Then, the problem is in expecting zero-metadata and zero-sized recursion?

Note that the memory I have reported is just the overhead, without actually allocating the key and the entry.

Any chance you could rerun and replace the key_type with an int_16 allocatable array? Just curious how much (if any) the derived type is costing.

Zero

The impossibility of creating pure procedures would be another unfortunate drawback of using type(c_ptr) instead of a Fortran array descriptor.

So what might be an approach toward a memory-efficient hash library in Fortran catering to large hash tables in which each key-value pair has a small memory footprint? (Imagine that the user's keys and values are fixed-size small arrays of int32.) Could it involve a hashmap type in which the key and value are of an abstract type and the user must create the appropriate type extension? This is just a half-baked guess; I have not tried to develop the thought.
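
To make that half-baked guess slightly more concrete, here is a rough sketch of the direction (hypothetical names throughout; nothing like this exists in stdlib): the user extends an abstract slot type with fixed-size components, so the table can be stored without per-entry descriptors.

module abstract_slot_demo
   implicit none

   ! Hypothetical abstract slot type; extensions add their own
   ! fixed-size key/value components.
   type, abstract :: map_slot_t
   contains
      procedure(hash_if), deferred :: hash
   end type map_slot_t

   abstract interface
      pure function hash_if(self) result(h)
         import :: map_slot_t
         class(map_slot_t), intent(in) :: self
         integer :: h
      end function hash_if
   end interface

   ! Example user extension: small fixed-size key, one count as value.
   type, extends(map_slot_t) :: count_slot_t
      integer :: key(3) = 0
      integer :: count  = 0
   contains
      procedure :: hash => count_hash
   end type count_slot_t

contains

   pure function count_hash(self) result(h)
      class(count_slot_t), intent(in) :: self
      integer :: h
      h = sum(self%key)   ! placeholder, not a real hash function
   end function count_hash

end module abstract_slot_demo

The map itself could then hold something like class(map_slot_t), allocatable :: slots(:), allocated once with the user's dynamic type via allocate(count_slot_t :: slots(n)), giving a single descriptor for the whole table and no per-entry allocations.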

The type(c_ptr) approach is a candidate for a more memory efficient approach. Losing the purity is not a big deal IMO.

Another approach would be more F77-style, with integer indices pointing into flat arrays that contain all the keys and entries of a given hashmap. These flat arrays would be allocatable, and reallocated as needed when the hashmap grows (with a strategy similar to C++ vectors; see the growth sketch further below).

integer(int16), allocatable :: key(:)
class(*), allocatable :: other(:)

type :: chaining_map_entry_type
   private
   integer(int_hash)  :: hash_val
   integer(int64)     :: key_index   ! position in the key(:) array
   integer(int64)     :: other_index ! position in the other(:) array
   integer(int_index) :: inmap
   type(chaining_map_entry_type), pointer :: next => null()
end type chaining_map_entry_type

I guess the two indices are actually the same… But the number of pairs should be stored as well.

The memory overhead would be 40 bytes per entry.
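
A minimal sketch of the reallocation strategy mentioned above, assuming 4-word int16 keys appended to a flat store; doubling the capacity gives amortized O(1) appends, as with C++ std::vector:

program grow_demo
   use iso_fortran_env, only: int16
   implicit none
   integer(int16), allocatable :: key_store(:)
   integer :: used, i

   allocate (key_store(16))
   used = 0
   do i = 1, 1000
      if (used + 4 > size(key_store)) call grow(key_store)
      key_store(used+1:used+4) = int(i, int16)   ! append one 4-word key
      used = used + 4
   end do
   write (*,*) 'stored', used, 'words in capacity', size(key_store)

contains

   ! Double the capacity, preserving the existing contents.
   subroutine grow(a)
      integer(int16), allocatable, intent(inout) :: a(:)
      integer(int16), allocatable :: tmp(:)
      allocate (tmp(2*size(a)))
      tmp(1:size(a)) = a
      call move_alloc (tmp, a)
   end subroutine grow

end program grow_demo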

Or one could deliberately use hash values with a "small" range, so that many collisions happen. The downside is that searching for an element will be slower, but on the bright side a node of the linked list will store many elements on average, so the memory overhead would be less critical.
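
For illustration, the slot index would simply be masked down to a few bits (a sketch; the trade-off is roughly n / 2**nbits elements per chain on average):

program small_range_demo
   implicit none
   integer, parameter :: nbits = 16       ! only 2**16 slots in the table
   integer :: hash, slot
   hash = 123456789                       ! stand-in for a full-width hash value
   slot = iand(hash, 2**nbits - 1)        ! keep the low nbits bits only
   ! With n stored items, chains hold about n / 2**nbits items each.
   print *, 'slot =', slot
end program small_range_demo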