Simple hash table implementation

Dear all,

I want to implement a simple hash table in Fortran - in the first place, to learn a little bit about hash tables and their organization, second place, to flex my Fortran muscles a little bit.

My first guess was the Fortran Wiki: Hash tables in Fortran Wiki, which I personally didn’t find helpful - except the hint at Kernighan’s Practice of Programming and the transfer procedure.

I ended for reference with the MUSL implementation of the POSIX hash table management: hsearch.c\search\src - musl - musl - an implementation of the standard library for Linux-based systems, search.h\include - musl - musl - an implementation of the standard library for Linux-based systems.

!! use iso_fortran_env, only: i64 => INT64
integer(i64) function hash_key(key) result (hash)
   character(len=*), intent(in) :: key
   integer :: i
   hash = 0
   do i = 1, len(key)
      hash = 31 * hash + ichar(key(i), i64)
   end do
end function hash_key

 subroutine hash_table_insert (table, key, data)
    class(hash_table_t), target, intent(inout) :: table
    character(len=*), intent(in) :: key
    class(*), pointer, intent(in) :: data
    integer(i64) :: hash
    hash = hash_key(key)
    block
      type(hash_table_entry_t), pointer :: table_entry
      integer(i64) :: i, j, table_index
      i = hash
      j = 1
      do
         table_index = iand(i, int(table%n_entries, i64)) + 1
         table_entry => table%entries(table_index)
         if (.not. allocated(table_entry%key) .or. table_entry%key == key) exit
         j = j + 1
         i = i + j
      end do
      print *, "FOUND EMPTY BUCKET", table_index
      table_entry%key = key
      table_entry%data => data
    end block
  end subroutine hash_table_insert

Here my question, they tend towards a more general hash implementation, maybe you can hint me to some nice sources:

  1. Instead of using character(len=*), intent(in) :: key could I use unlimited polymorphism and transfer it to an integer array? To my understanding, I would not transfer the actual content of the target, but only the unlimited polymorphism pointer and make hash out of the pointer itself?
  2. If I would use unlimited polymorphism, how do I infer the correct size for transfer? storage_size divided by the bit-size of an specific integer-kind?
  3. Integer arithmetic: The standard does not define how an overflow has to be handled. I know howto to “emulate” unsigned integer arithmetic with either larger integer kinds or using a module 2^30, but I don’t see how this may help me in this case.

Thanks for your help! The forum is a great source :).

Cheers,
Simon

2 Likes
  1. You need to use the value of the argument, not it’s memory location. Two strings that are stored at different locations but have the same contents should be considered equal for the purposes of a key.
  2. I’m pretty sure you can’t. Since a class(*) variable doesn’t know anything about what it’s pointing at, there’s no way of determining its size.
  3. I think you’re stuck “emulating it” with a higher kinded (larger size) integer. You could in theory implement some sort of arbitrary precision math library, but that would kill the performance of your hash table.

This is I think illustrative of a key reason that the current work on generics by the standards committee is so important. This is the kind of thing you’d expect any language to be able to do easily.

Also, FYI, Fortran doesn’t short circuit boolean operations, so if (.not. allocated(table_entry%key) .or. table_entry%key == key) exit will crash.

3 Likes

Thanks for the nice and illustrative answer.
I will think a little bit more about the design and how much I can in principle reuse from MUSL, but I think it is not as a simple task as I originally thought.

I always forget about the short-circuit behavior! Thanks for pointing that out - I started many years ago with C and it still sticks.

Hi Simon @sbrass, welcome to the Discourse!

If your intent is to support keys of many different types, an alternative approach is to have a generic procedure interface that generates a custom type(key) that is then passed to your hash_table_insert method. This is how I implemented my own hash table implementation (link below). The key type can be abstract and only needs to support a hash() method to generate the hash and and the .equals. operator to test for equality. This gives a lot of flexibility to add different key types and doesn’t require using unlimited polymorphism.

A demo from fhash:

program fhash_demo1
  use fhash, only: fhash_tbl_t, key=>fhash_key
  implicit none
  type(fhash_tbl_t) :: tbl
  integer :: val

  call tbl%set(key('my_key_1'), value=10)
  call tbl%set(key('my_key_2'), value=1.0)
  call tbl%set(key(123456), value='a string value')
  call tbl%set(key([1,2,3,4,5]), value=.false.)

  call tbl%get(key('my_key_1'),val)

end program fhash_demo1

github: https://github.com/LKedward/fhash
Documentation: https://lkedward.github.io/fhash/index.html

1 Like

fhash looks great.
The design pattern with the overloading the different types is interesting (and I didn’t know that you could it this, i.e. using different modules and aliasing the name for the extended types).

But, for the first, I came to the conclusion that it is sufficient for me to support only character keys, as I want to only access the table by string keys (e.g. reading in a configuration table and accessing the values by their respective key).

The used hash algorithm: Fowler–Noll–Vo hash function - Wikipedia, however, still requires unsigned integers.

So, let me reformulate the question: What is the most forgiving way of computing a hash in Fortran, allowed by the standard? Or should I let this be done by C, just export the string to C and let C handle the hashing?

Cheers,
Simon

PS: I would think that this would be discussed somehow with regard to the stdlib?

I see. Yes you’re right, FNV does need unsigned arithmetic; in fhash I do as Brad wrote above and simply use a larger integer kind (int64) to compute a 32bit hash. Hence unfortunately I am limited to hashing 32bit inputs only. Using c for 64bit hashing is a reasonable solution I think.

There’s a hidden-away stdlib thread here that discusses hashing; it looks like William Clodius has implemented several hash functions in both 32bit and 64bit, however I’m not yet sure how he’s done the latter in Fortran signed arithmetic.

1 Like

FWIW, I have a module that provides a way to do unsigned arithmetic in Fortran. I have not put it to much use, I admit, but it gives results that are in agreement with C (as far as the test program is concerned). It defines a new division operator, because that is the one operation that is really different. (See flibs - a collection of Fortran modules / SVN / [r427] /trunk/src/computing/unsigned_integers.f90)

2 Likes

@sbrass ,

I recommend using a C function for hashing (or C++ with extern C attribute): it’s so much more straightforward given Fortran 2003 and standard interoperability.

Yes, you can jump through hoops in Fortran with bitwise arithmetic with a higher bitness type that will end up being “clever programming” with all its travails. And for this one would be inclined to make use of the additional language support introduced starting Fortran 2008 standard, however robust compiler support was not uniform making the code less portable i.e., until enough compilers had caught up to the standard.

1 Like

For clarification - I was a little bit puzzled with:

And for this one would be inclined to make use of the additional language support introduced starting Fortran 2008 standard, …

You refer to introduction of storage_size and additional bit operations?
I checked against Fortran 2008 in Fortran Wiki, for completeness.

There are hash tables in the GLib library (C language):
https://developer.gnome.org/glib/stable/glib-Hash-Tables.html
They should be accessible via gtk-fortran (I never tried):
https://raw.githubusercontent.com/vmagnin/gtk-fortran/gtk3/src/glib-auto.f90

See this tutorial: Manage C data using the GLib collections – IBM Developer

1 Like

Hi everyone,

I followed some of your hints and instructions and build together my simple hash table implementation based upon hsearch in the musl C library.

You can find the implementation on github: GitHub - sbrass/simple_hash_table

Do you have any further ideas for improvements? For example, documentation system: Doxygen, Ford,…
The hash table “library” is rather smallish, would you recommend on using submodules for separating interfaces and implementation details?

Cheers,
Simon

PS: Many thanks to everyone!!!

1 Like

All current F90+ processors use two’s complement arithmetic. I believe that, provided you avoid the more complicated operations, e.g., division and its relatives (e.g. modulus) and exponentiation, and provided you use the same bitsize for both sides of binary operations (+, -, *, ieor, ior, iand) the bit results for two’s complement and unsigned are identical. In hashing smaller bitsize arguments to a larger bitsize hash value, you need to be careful to use transfer and not assignment to widen the smaller bitsize argument, and the details of the transfer need to respect the endianness of the processor, i.e., to transfer an int8 arg on a little endian processor
int32_value = transfer( [ arg, 0_int8, 0_int8, 0_int8], int32_value )
while on a big endian processor you need
int32_value = transfer( [ 0_int8, 0_int8, 0_int8, arg], int32_value )

3 Likes

A question for readers, cc: @lkedward , @wclodius :

Recently there was a link posted to a project online (GitHub?) that had various hash functions (DJB2, MD5, SHA*, and more) coded up in Fortran. I’ve misplaced the link, can someone who may recall please post it here? Thanks,

While I have coded up a number of hash functions, none of them is a cryptographic hash such as MD5 and SHA*. I am not aware of anyone else that has coded them up.

Here’s a SHA-256 module: GitHub - leetmaa/bitsy: A fortran module for SHA-256 hashing.

1 Like

Apologies, I do not recall such a collection of Fortran hash implementations – I would certainly be interested in it if found!

Thank you very much @wclodius , @milancurcic , @lkedward for your responses.

I might have been thinking of this GitHub - dbartilson/fh_table: Modern Fortran hash functions and hash tables but, as I mentioned, I have unfortunately misplaced/lost my notes and my memory fails me. However fh_table is close enough to what I wanted to review so that gives me a start.

Regards,

2 Likes

For the sake of completeness, I want to point to a MD5 implementation, which resides inside a High-Energy Physics Monte Carlo program written in (mostly) modern Fortran (I am affiliated with the program, but forgot about its MD5 implementation until it popped up here):

https://whizard.hepforge.org/downloads/

You can find the MD5 implementation (stripped of comments) under src/combinatorics/md5.f90, and the documented source code in share/doc/whizard.pdf.

2 Likes

ftlHashMap
An associative containers that stores elements formed by the combination of a key value and a mapped value, and which allows for fast retrieval of individual elements based on their keys. It’s basically a dictionary that internally uses a hash table to allow constant time retrieval of elements. ftlHashMap is very similar to C++'s std::unordered_map (though its interface is a bit less awkward).

2 Likes

I had the same question and problem about 6 years ago. So my 2 cents here - relatively simple implementation of the non-generic hash table with related modules (linked list, and simple hash functions): zmi_sll_hashtable_module

2 Likes