Fpm digest Hashing handling (fnv_1a algorithm)

Hello everyone,

I am currently working on package validation part on fpm-registry, for package verification part we are implementing set of checks that will be checked on each uploaded package, one of the checks is to verify the hash (digest) of the files by recomputing them on the registry servers.

fpm uses fnv_1a algorithm (fpm/src/fpm_strings.f90 at main · fortran-lang/fpm · GitHub) for generating the digests of the files, but before generating the digests it converts the files using dilate function (fpm/src/fpm_strings.f90 at main · fortran-lang/fpm · GitHub) , I have implemented its equivalent in python, it passes the conversion checks of all the tests other than for M_strings by @urbanjost ( which on debugging , I discovered only on the file M_strings/test/test_suite_M_strings.f90 at master · urbanjost/M_strings · GitHub has been failing the tests.) . I would be grateful for any help in debugging the issue.

# python code
def hash(input_string):
    hash_val = np.int64(2166136261)
    FNV_PRIME = np.int64(16777619)
    for char in input_string:
        hash_val ^= np.int64(ord(char))
        hash_val *= FNV_PRIME
    return hash_val

def dilate(instr):
    outstr = ''
    for i,char in enumerate(instr):
        if char == '\t':
            outstr += ' ' * (8 - i%8)
        elif ord(char) in [10,13]:
            continue  # Ignore carriage returns and line feeds
        else:
            outstr += char
    return outstr


def check_digests(file_path: str) -> Tuple[int, bool]:
    """
    Check the digests of files specified in the fpm model.

    This function iterates through the sources specified in the provided data,
    computes the digest of each file, and compares it with the expected digest.
    If the computed digest does not match the expected digest, it prints an error
    message and increments the error count and return the total number of errors and a flag.

    Args:
    - file_path (str): A String containing the path to the fopm model file to be checked.

    Returns:
    - error_count (int): The total number of errors encountered while checking digests.
    - Flag (Bool): True if no errors were encountered, False otherwise.
    """

    # Initialize error count
    error_count: int = 0
    with open(f'{file_path}model.json', 'r') as file:
        model = json.load(file)

    src_data: dict = model['packages'][model['package-name']] 
    s = set()

    # Iterate over each item in the sources dictionary
    for _, source_info in src_data['sources'].items():
        # Extract digest and file name for the current source
        expected_digest: str = source_info['digest']
        file_name: str = source_info['file-name']

        # Read the content of the file
        with open(file_name, 'r',newline='') as file:
            file_content: str = file.read()

        # Compute the digest of the file content
        computed_digest: int = hash(dilate(file_content))

        # Print computed digest, expected digest, and file name
        print(computed_digest == expected_digest, file_name)

        # Check if computed digest matches the expected digest
        if computed_digest != expected_digest:
            # Print error message if digests don't match
            print(f"Computed digest {computed_digest} does not match expected digest {expected_digest} for file {file_name}")
            print("ERROR")
            error_count += 1
            

    return (error_count, True if error_count == 0 else False)

Thanks to @urbanjost for the M_strings package, it has been helping us a lot for testing with the registry. :slight_smile:

This hash function relies on (silent) integer wraparound occurring, AFAIK. Does that happen with Python/Numpy integers? Perhaps using uint64 would be safer?

Numpy integers support the integer wraparound, as the hash value might be negative (fortran only has signed integers).

Great progress @henilp105, that only 1 file does not return the same digest as fpm.

This suggests that there is some edge case that has to be considered, e.g.:

  • end of line characters / trailing spaces?
  • non-ascii characters?
  • integer overflow in the hashing algorithm?

Here is what I would do to debug:

  • Create 2 standalone programs, Python and Fortran, by only extracting the necessary routines to compute the digest code.
  • Test it on that one file, and keep removing some lines until you find the point that causes the different digest
  • Now try to understand the root cause by looking at the actual numbers and characters at that point.
1 Like

Thanks @FedericoPerini , I will try doing that. I have considered eol , trailing spaces and int overflow, I haven’t considered non ASCII chars (this might be a possible cause).