Fortran code with a large and complex dataset

Hi folks,

I have a question that may have been addressed before, but I cannot trace it.

In a code I’m writing, there is a huge dataset (think of a catalogue of molecules, so each record contains various information: strings, arrays of strings, numerical values, numerical arrays, logical values…). And the same dataset may be used in various codes.

For different reasons, I am trying to avoid having the catalogue in an external file (to avoid possible troubles when reading it if it is in something like h5, to prevent users from randomly changing numbers in the file, etc.).

What would be your advice? Should I hard-code it in a special module? Or maybe compile it as a library? Any other ideas from similar experiences?

More than anything, I’d like to avoid spending more time formatting the bloody thing. :dizzy_face:

Many thanks!

You say it is a huge dataset; if you were to store it as static data in code, then your program would be huge as well. Not to mention that it might be difficult to maintain that dataset, because you would need to intersperse the actual data with actual code.

Can you indicate the size of the dataset?

If the intended “data library” is a shared object or dll, then that’s an option. Otherwise, having the data in a module intended for static linking might dramatically increase the size of the final executable.

But the shared-object/dll approach is similar to just storing the data as an unformatted stream known only to your program, with your own *.my.obscure.format extension if necessary.

Just keep in mind that with the “unformatted stream” approach, the first entry in the file should be a “file format version” indicator.
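A minimal sketch of the “unformatted stream with a version header” idea; the file name and payload here are invented for illustration:

```fortran
program versioned_stream
  implicit none
  integer, parameter :: FORMAT_VERSION = 1   ! bump whenever the layout changes
  integer :: u, ver
  real :: values(3)

  ! Write: the version indicator goes first, then the payload.
  open(newunit=u, file='catalogue.my.obscure.format', access='stream', &
       form='unformatted', status='replace', action='write')
  write(u) FORMAT_VERSION
  write(u) [1.0, 2.0, 3.0]
  close(u)

  ! Read back: check the version before trusting the rest of the file.
  open(newunit=u, file='catalogue.my.obscure.format', access='stream', &
       form='unformatted', status='old', action='read')
  read(u) ver
  if (ver /= FORMAT_VERSION) stop 'unsupported catalogue format version'
  read(u) values
  close(u)
  print *, values
end program versioned_stream
```

The version check lets a newer reader refuse (or convert) files written by an older layout instead of silently misreading them.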


You could define a derived type that stores the data, create a subroutine that writes the data as an unformatted stream, and create another subroutine that reads the data as an unformatted stream. If your derived type has allocatable components, you will need to write and read their dimensions along with the data. An example of doing this for a dataframe (matrix of data with column names) is here.
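A sketch of that pattern for a hypothetical record type with one allocatable component (the type and component names are made up):

```fortran
module catalogue_io
  implicit none
  type :: record_t
    character(len=32) :: name = ''
    real, allocatable :: levels(:)
  end type record_t
contains
  subroutine write_record(u, rec)
    integer, intent(in)        :: u    ! unit opened as unformatted stream
    type(record_t), intent(in) :: rec
    write(u) rec%name
    write(u) size(rec%levels)          ! store the dimension first...
    write(u) rec%levels                ! ...then the data itself
  end subroutine write_record

  subroutine read_record(u, rec)
    integer, intent(in)         :: u
    type(record_t), intent(out) :: rec
    integer :: n
    read(u) rec%name
    read(u) n
    allocate(rec%levels(n))            ! allocate before reading the payload
    read(u) rec%levels
  end subroutine read_record
end module catalogue_io
```

Writing the size before the array is what makes the read side able to allocate correctly without any out-of-band knowledge.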


I think the usual approach to this problem would be to use the file system permissions to prevent users from modifying the file. In Fortran, the OPEN statement has the ACTION='READ' specifier, which restricts access to read-only within that program unit, overriding any write access that would be allowed by the file system.
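For example (file name hypothetical), any WRITE on a unit opened this way fails at run time, regardless of the file system permissions:

```fortran
program readonly_open
  implicit none
  integer :: u
  open(newunit=u, file='catalogue.dat', status='old', action='read')
  ! write(u, *) 42   ! uncommenting this triggers a run-time error:
  !                  ! the unit was connected for reading only
  close(u)
end program readonly_open
```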


For that approach to work in real life, a checksum of the file would also be needed, to guarantee integrity.
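A toy sketch of such a check, assuming stream access so that INQUIRE reports the size in bytes; a real application would use a proper cryptographic hash rather than this naive byte sum, which only catches accidental corruption:

```fortran
program check_integrity
  use iso_fortran_env, only: int8, int64
  implicit none
  print *, file_checksum('catalogue.dat')   ! hypothetical data file
contains
  function file_checksum(path) result(chk)
    character(len=*), intent(in) :: path
    integer(int64) :: chk
    integer(int8), allocatable :: bytes(:)
    integer :: u, n
    open(newunit=u, file=path, access='stream', form='unformatted', &
         status='old', action='read')
    inquire(unit=u, size=n)        ! file size in file storage units (bytes)
    allocate(bytes(n))
    read(u) bytes                  ! slurp the whole file
    close(u)
    chk = sum(int(bytes, int64))   ! naive sum; compare against a stored value
  end function file_checksum
end program check_integrity
```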

One shouldn’t rely on the user not having some sort of admin access to the data. E.g., on Linux and macOS, the first actual user tends to belong to the sudo group; on Windows, the user might have annoyed the IT department so much that they ended up granting some admin privileges.

(I actually did the “annoy IT until the problem is solved” thing once, but on macOS, because some software I needed to install required DNS-related privileges.)

So just declaring the values in code is not an option? I have seen character arrays containing NAMELIST input used as substitutes for files, and I have seen binary data compressed and encoded as text as well, but those were different use cases. If the dataset is so large, or there are so many cases, that those options are not even on the table, it is also not clear how high the risk is that users will want to alter the data; a fixed path to a read-only file is usually sufficient to prevent someone from accidentally shooting themselves in the foot. If you expect people to intentionally try to alter the data, a checksum and encrypting the data are probably the next step. But to even begin making reasonable suggestions, the size of the dataset needs to be disclosed.
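For small datasets, the character-array NAMELIST trick can look something like this (the group and variable names are invented; internal-file namelist reads require Fortran 2003 or later):

```fortran
program namelist_from_code
  implicit none
  real    :: mass   = 0.0
  integer :: charge = 0
  namelist /molecule/ mass, charge
  ! The "file" lives in the source code itself, so there is nothing
  ! external for a user to edit.
  character(len=*), parameter :: data_nml = &
      '&molecule mass = 18.015, charge = 0 /'
  read(data_nml, nml=molecule)   ! internal-file namelist read (F2003+)
  print *, mass, charge
end program namelist_from_code
```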


Also consider the expected use patterns: i.e., is access to the data inside a loop or nest of loops (that makes exclusively reading from a file a no-go for me)? Is all of the data in the file used in a given application, or just a few of the data items, selected based on other user input? Also, is the data to be used in a parallel program? If you have the memory (which most people do for what some would consider “large” datasets), it’s always better to read the data once into some data structure (array, user-derived type, etc.) and have routines to access the memory, instead of trying to manipulate the external file.

The first user is (and has to be) an administrator of the machine. That can be changed afterwards, once other admin users are created. If users are administrators of their own machine, there’s no way to fully prevent them from breaking stuff on it.

First of all, I apologize for the late reply. I’ve been distracted by some personal matters in the last couple of days.

Thank you all for great suggestions. You gave me some ideas to work and experiment with.

The size of the dataset may vary and I can make some compromises. The part that I would like to “hide” from users is of the order of 1 GB.

@jwmwalrus, yes, creating a customized obscure format may be a good solution. The structure of the data is relatively complex, but it certainly does not need all the complexity and overhead of h5.

The root user is the administrator (or, actually, the owner) of a Unix-like machine and must always exist.

But Mac OS X (as it was called back then) popularized the idea of disabling root access and granting sudo privileges to the first /Users user. Ubuntu mimicked that (for the user whose UID is 1000), and other Linux distributions followed.

(hmm… it seems I’m old, :laughing:)

Agreed.