I want to implement a simple hash table in Fortran - in the first place, to learn a little bit about hash tables and their organization, second place, to flex my Fortran muscles a little bit.
My first guess was the Fortran Wiki: Hash tables in Fortran Wiki, which I personally didn’t find helpful - except the hint at Kernighan’s Practice of Programming and the transfer procedure.
!! use iso_fortran_env, only: i64 => INT64
integer(i64) function hash_key(key) result (hash)
character(len=*), intent(in) :: key
integer :: i
hash = 0
do i = 1, len(key)
hash = 31 * hash + ichar(key(i), i64)
end do
end function hash_key
subroutine hash_table_insert (table, key, data)
class(hash_table_t), target, intent(inout) :: table
character(len=*), intent(in) :: key
class(*), pointer, intent(in) :: data
integer(i64) :: hash
hash = hash_key(key)
block
type(hash_table_entry_t), pointer :: table_entry
integer(i64) :: i, j, table_index
i = hash
j = 1
do
table_index = iand(i, int(table%n_entries, i64)) + 1
table_entry => table%entries(table_index)
if (.not. allocated(table_entry%key) .or. table_entry%key == key) exit
j = j + 1
i = i + j
end do
print *, "FOUND EMPTY BUCKET", table_index
table_entry%key = key
table_entry%data => data
end block
end subroutine hash_table_insert
Here my question, they tend towards a more general hash implementation, maybe you can hint me to some nice sources:
Instead of using character(len=*), intent(in) :: key could I use unlimited polymorphism and transfer it to an integer array? To my understanding, I would not transfer the actual content of the target, but only the unlimited polymorphism pointer and make hash out of the pointer itself?
If I would use unlimited polymorphism, how do I infer the correct size for transfer? storage_size divided by the bit-size of an specific integer-kind?
Integer arithmetic: The standard does not define how an overflow has to be handled. I know howto to “emulate” unsigned integer arithmetic with either larger integer kinds or using a module 2^30, but I don’t see how this may help me in this case.
Thanks for your help! The forum is a great source :).
You need to use the value of the argument, not it’s memory location. Two strings that are stored at different locations but have the same contents should be considered equal for the purposes of a key.
I’m pretty sure you can’t. Since a class(*) variable doesn’t know anything about what it’s pointing at, there’s no way of determining its size.
I think you’re stuck “emulating it” with a higher kinded (larger size) integer. You could in theory implement some sort of arbitrary precision math library, but that would kill the performance of your hash table.
This is I think illustrative of a key reason that the current work on generics by the standards committee is so important. This is the kind of thing you’d expect any language to be able to do easily.
Also, FYI, Fortran doesn’t short circuit boolean operations, so if (.not. allocated(table_entry%key) .or. table_entry%key == key) exit will crash.
Thanks for the nice and illustrative answer.
I will think a little bit more about the design and how much I can in principle reuse from MUSL, but I think it is not as a simple task as I originally thought.
I always forget about the short-circuit behavior! Thanks for pointing that out - I started many years ago with C and it still sticks.
If your intent is to support keys of many different types, an alternative approach is to have a generic procedure interface that generates a custom type(key) that is then passed to your hash_table_insert method. This is how I implemented my own hash table implementation (link below). The key type can be abstract and only needs to support a hash() method to generate the hash and and the .equals. operator to test for equality. This gives a lot of flexibility to add different key types and doesn’t require using unlimited polymorphism.
fhash looks great.
The design pattern with the overloading the different types is interesting (and I didn’t know that you could it this, i.e. using different modules and aliasing the name for the extended types).
But, for the first, I came to the conclusion that it is sufficient for me to support only character keys, as I want to only access the table by string keys (e.g. reading in a configuration table and accessing the values by their respective key).
So, let me reformulate the question: What is the most forgiving way of computing a hash in Fortran, allowed by the standard? Or should I let this be done by C, just export the string to C and let C handle the hashing?
Cheers,
Simon
PS: I would think that this would be discussed somehow with regard to the stdlib?
I see. Yes you’re right, FNV does need unsigned arithmetic; in fhash I do as Brad wrote above and simply use a larger integer kind (int64) to compute a 32bit hash. Hence unfortunately I am limited to hashing 32bit inputs only. Using c for 64bit hashing is a reasonable solution I think.
There’s a hidden-away stdlib thread here that discusses hashing; it looks like William Clodius has implemented several hash functions in both 32bit and 64bit, however I’m not yet sure how he’s done the latter in Fortran signed arithmetic.
FWIW, I have a module that provides a way to do unsigned arithmetic in Fortran. I have not put it to much use, I admit, but it gives results that are in agreement with C (as far as the test program is concerned). It defines a new division operator, because that is the one operation that is really different. (See flibs - a collection of Fortran modules / SVN / [r427] /trunk/src/computing/unsigned_integers.f90)
I recommend using a C function for hashing (or C++ with extern C attribute): it’s so much more straightforward given Fortran 2003 and standard interoperability.
Yes, you can jump through hoops in Fortran with bitwise arithmetic with a higher bitness type that will end up being “clever programming” with all its travails. And for this one would be inclined to make use of the additional language support introduced starting Fortran 2008 standard, however robust compiler support was not uniform making the code less portable i.e., until enough compilers had caught up to the standard.
Do you have any further ideas for improvements? For example, documentation system: Doxygen, Ford,…
The hash table “library” is rather smallish, would you recommend on using submodules for separating interfaces and implementation details?
All current F90+ processors use two’s complement arithmetic. I believe that, provided you avoid the more complicated operations, e.g., division and its relatives (e.g. modulus) and exponentiation, and provided you use the same bitsize for both sides of binary operations (+, -, *, ieor, ior, iand) the bit results for two’s complement and unsigned are identical. In hashing smaller bitsize arguments to a larger bitsize hash value, you need to be careful to use transfer and not assignment to widen the smaller bitsize argument, and the details of the transfer need to respect the endianness of the processor, i.e., to transfer an int8 arg on a little endian processor
int32_value = transfer( [ arg, 0_int8, 0_int8, 0_int8], int32_value )
while on a big endian processor you need
int32_value = transfer( [ 0_int8, 0_int8, 0_int8, arg], int32_value )
Recently there was a link posted to a project online (GitHub?) that had various hash functions (DJB2, MD5, SHA*, and more) coded up in Fortran. I’ve misplaced the link, can someone who may recall please post it here? Thanks,
While I have coded up a number of hash functions, none of them is a cryptographic hash such as MD5 and SHA*. I am not aware of anyone else that has coded them up.
For the sake of completeness, I want to point to a MD5 implementation, which resides inside a High-Energy Physics Monte Carlo program written in (mostly) modern Fortran (I am affiliated with the program, but forgot about its MD5 implementation until it popped up here):
ftlHashMap
An associative containers that stores elements formed by the combination of a key value and a mapped value, and which allows for fast retrieval of individual elements based on their keys. It’s basically a dictionary that internally uses a hash table to allow constant time retrieval of elements. ftlHashMap is very similar to C++'s std::unordered_map (though its interface is a bit less awkward).
I had the same question and problem about 6 years ago. So my 2 cents here - relatively simple implementation of the non-generic hash table with related modules (linked list, and simple hash functions): zmi_sll_hashtable_module