Metadata labels for simulation data

juannaviap · February 3, 2022, 10:58am

Hi!

What good practices are recommended for labeling data generated in Fortran simulations?

I am working on a problem that depends on many parameters, and I don’t know whether it is a good idea to have the program itself write a header with the current parameters, or whether there is a better strategy.

Thanks!

Arjen · February 3, 2022, 11:34am

Is the purpose of these metadata or labelling that you can reproduce the simulation at some time in the future? Is it possible to reconstruct the values of the parameters from a reduced set of information? I mean, just as an example, your simulation involves some substance and you need to supply a dozen values representing the properties, then of course identification of the substance might be sufficient. Otherwise the full storage of the input, along with version information on your program would be necessary.
What you could do, though, is store the input in a version control system and use the commit associated with the input as a concise way to characterise the input. Via the version control system you can retrieve the actual input. Just a thought, mind you.

juannaviap · February 3, 2022, 12:15pm

Yes, the idea is to use the metadata to reproduce the simulation in the future and also to compare with other techniques based on the same parameters. My question is about the best way to do this in Fortran.

As a header in the dataset itself
As an external file linked in some way to the data
…

Beliavsky · February 3, 2022, 12:15pm

That is what I have done. In addition to program parameters you can consider adding
(1) name of executable via call get_command_argument(0,executable_name), preferably with path
(2) when the program was run by processing the output of date_and_time()
(3) compiler_version()
(4) compiler_options()
(5) names of parameter and data files used
(6) at the end, the time taken for the code to run, via calls to cpu_time() or system_clock()
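
A minimal sketch of how items (1) to (4) and (6) might be collected with standard intrinsics. The module name run_provenance, its procedure names, and the "key: value" line layout are assumptions made for illustration; they are not taken from any post in this thread.

```fortran
module run_provenance
  use, intrinsic :: iso_fortran_env, only: compiler_version, &
                                           compiler_options, int64, real64
  implicit none
  private
  public :: write_provenance, timestamp, elapsed_seconds

contains

  ! Current local date and time as "YYYY-MM-DD hh:mm:ss".
  function timestamp() result(stamp)
    character(len=19) :: stamp
    integer :: v(8)
    call date_and_time(values=v)
    write(stamp, '(i4.4,"-",i2.2,"-",i2.2," ",i2.2,":",i2.2,":",i2.2)') &
      v(1), v(2), v(3), v(5), v(6), v(7)
  end function timestamp

  ! Wall-clock seconds elapsed since the tick count "start",
  ! which the caller obtained earlier from system_clock.
  function elapsed_seconds(start) result(seconds)
    integer(int64), intent(in) :: start
    real(real64) :: seconds
    integer(int64) :: now, rate
    call system_clock(now, rate)
    seconds = 0.0_real64
    if (rate > 0) seconds = real(now - start, real64) / real(rate, real64)
  end function elapsed_seconds

  ! Write one "key: value" line per item to an already open unit.
  ! "prefix" is put in front of every line, e.g. "# " for a comment header.
  subroutine write_provenance(unit, prefix)
    integer, intent(in) :: unit
    character(len=*), intent(in) :: prefix
    character(len=:), allocatable :: exe
    integer :: n

    ! Query the length first so the name is not truncated.
    call get_command_argument(0, length=n)
    allocate(character(len=n) :: exe)
    if (n > 0) call get_command_argument(0, exe)

    write(unit, '(a)') prefix // 'executable: ' // exe
    write(unit, '(a)') prefix // 'run_started: ' // timestamp()
    write(unit, '(a)') prefix // 'compiler: ' // compiler_version()
    write(unit, '(a)') prefix // 'compiler_options: ' // compiler_options()
  end subroutine write_provenance

end module run_provenance
```

A program would call system_clock once at startup, call write_provenance on the output unit before writing any data, and append elapsed_seconds at the end of the run. Whether argument 0 includes the full path depends on how the program was launched.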

juannaviap · February 3, 2022, 12:18pm

I’m interested in your workflow. Do you add the information to the generated dataset or use an external file to save the metadata?

Beliavsky · February 3, 2022, 12:41pm

I put it in the same file as the generated data, writing both the parameters and simulation data as CSV.
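
One way this single-file layout could look, as a sketch: parameters first as "# name,value" rows, then a column header, then the data rows. The "#" prefix, the two-column x,y layout, and the names csv_with_header and write_csv are assumptions for illustration, not a description of the actual files.

```fortran
module csv_with_header
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  private
  public :: write_csv

contains

  ! Write the parameters as "# name,value" rows, then the column
  ! header "x,y", then one row per data point.
  subroutine write_csv(filename, names, values, x, y)
    character(len=*), intent(in) :: filename
    character(len=*), intent(in) :: names(:)   ! parameter names
    real(real64), intent(in) :: values(:)      ! parameter values
    real(real64), intent(in) :: x(:), y(:)     ! simulation output
    integer :: u, i

    open(newunit=u, file=filename, status='replace', action='write')
    do i = 1, size(names)
      write(u, '("# ",a,",",es23.15)') trim(names(i)), values(i)
    end do
    write(u, '(a)') 'x,y'
    do i = 1, size(x)
      write(u, '(es23.15,",",es23.15)') x(i), y(i)
    end do
    close(u)
  end subroutine write_csv

end module csv_with_header
```

Many CSV readers can be told to skip lines starting with a comment character, so the parameter rows need not get in the way of loading the data.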

MarDie · February 3, 2022, 2:43pm

I recommend HDF5 for this purpose, at least if you are developing a larger application. Using the native HDF5 bindings requires a lot of boilerplate code, but @scivision has written a high-level interface called h5fortran.

Note: I have not tried h5fortran, but it makes a good impression.

certik · February 3, 2022, 5:54pm

@juannaviap excellent question!

Yeah, HDF5 is nice in that it produces just one file and the file is self-contained: it can hold both data and metadata. The problem with HDF5 is that the library is quite complicated and hard (and slow) to build. The format itself is also very complicated.

Another idea is to use some human-readable “header” (such as JSON, YAML, namelist, etc.) and a binary “data” file with all the arrays; the structure of the data is described in the header.

There is a new format called ASDF:

It is trying to address precisely the above issue, but I think they currently only have a Python implementation, so we would have to create a Fortran implementation. It is also just one file, containing a human-readable header and binary data.
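
The two-file variant of the header-plus-binary idea can be sketched in plain Fortran with a namelist as the header and a stream-access file for the array. This is not ASDF. The field names alpha and n, the .nml/.bin file suffixes, and the procedure names are assumptions for illustration.

```fortran
module header_plus_binary
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  private
  public :: save_run, load_run

  ! Header fields: "alpha" is a hypothetical model parameter and
  ! "n" is the number of values stored in the binary file.
  real(real64) :: alpha
  integer :: n
  namelist /header/ alpha, n

contains

  ! Write <basename>.nml (text header) and <basename>.bin (raw array).
  subroutine save_run(basename, alpha_value, values)
    character(len=*), intent(in) :: basename
    real(real64), intent(in) :: alpha_value
    real(real64), intent(in) :: values(:)
    integer :: u

    alpha = alpha_value
    n = size(values)
    open(newunit=u, file=basename // '.nml', status='replace', action='write')
    write(u, nml=header)
    close(u)

    open(newunit=u, file=basename // '.bin', access='stream', &
         form='unformatted', status='replace', action='write')
    write(u) values
    close(u)
  end subroutine save_run

  ! Read the header first; it tells us how many values to read.
  subroutine load_run(basename, alpha_value, values)
    character(len=*), intent(in) :: basename
    real(real64), intent(out) :: alpha_value
    real(real64), allocatable, intent(out) :: values(:)
    integer :: u

    open(newunit=u, file=basename // '.nml', status='old', action='read')
    read(u, nml=header)
    close(u)
    alpha_value = alpha

    allocate(values(n))
    open(newunit=u, file=basename // '.bin', access='stream', &
         form='unformatted', status='old', action='read')
    read(u) values
    close(u)
  end subroutine load_run

end module header_plus_binary
```

The header stays readable and editable in any text editor, while the binary file is a bare sequence of values that is only meaningful together with the header, so the two files have to be kept and archived as a pair.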

samharrison7 · February 3, 2022, 6:23pm

It’s great to see this attention being paid to reproducibility! It’s a complex topic and there are lots of different ways you can go about it.

I tend to go with the approach of outputting data in NetCDF (based on HDF5) format, with metadata embedded, but also include a human readable yaml sidecar/header file with the important metadata that will allow me to re-run the simulation. This includes the information @Beliavsky mentions above, plus Git commit ID (or some other kind of version info) and the path to the config and input data. Note this also means you have to make sure the input data remains available at that path. If it’s an important model run (e.g. it will be used for a publication), then archiving all of this (including input data) on somewhere like Zenodo is a great idea (if your data are small enough to do so).
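
A flat YAML sidecar of this kind needs no YAML library to write. The sketch below uses plain write statements. The key names, the module name yaml_sidecar, and the idea of passing the commit ID in as a string (for example one captured from git at build time) are assumptions for illustration, not a description of the files actually used.

```fortran
module yaml_sidecar
  implicit none
  private
  public :: write_sidecar

contains

  ! Write a flat YAML mapping next to the data file.
  ! Values are single-quoted; this sketch assumes they contain
  ! no single quotes themselves.
  subroutine write_sidecar(filename, commit, config_path, input_path, data_file)
    character(len=*), intent(in) :: filename     ! sidecar file to create
    character(len=*), intent(in) :: commit       ! version-control commit ID
    character(len=*), intent(in) :: config_path  ! path to the config file
    character(len=*), intent(in) :: input_path   ! path to the input data
    character(len=*), intent(in) :: data_file    ! output file this describes
    integer :: u

    open(newunit=u, file=filename, status='replace', action='write')
    write(u, '(a)') '# Metadata needed to re-run this simulation'
    write(u, '(a)') "git_commit: '" // trim(commit) // "'"
    write(u, '(a)') "config: '" // trim(config_path) // "'"
    write(u, '(a)') "input_data: '" // trim(input_path) // "'"
    write(u, '(a)') "output_data: '" // trim(data_file) // "'"
    close(u)
  end subroutine write_sidecar

end module yaml_sidecar
```

Single quotes are used so that backslashes in Windows paths are kept literally, which would not be the case inside double-quoted YAML strings.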

@certik thanks for introducing us to ASDF. I hadn’t come across that before but it looks great. I frequently find that difficulties in installing NetCDF/HDF are the main barrier to getting other people to use my models (especially if they’re on Windows), so an alternative is very welcome.

certik · February 3, 2022, 6:40pm

Yes, although relevant to us are these two issues:

juannaviap · February 5, 2022, 10:56am

Thank you all for the valuable information

nicholaswogan · February 5, 2022, 3:21pm

@certik I’ve made a Fortran interface to libYAML via iso_c_binding

It can read and emit yaml.

I want to set it up as an fpm package, but I think I will need to wait for fpm to support custom build scripts.

msz59 · February 5, 2022, 6:36pm

How can it emit YAML? The README says it is a parser, and the only public entities in the module are parse and error_length.

nicholaswogan · February 5, 2022, 7:17pm

All YAML node types (type_list, type_dictionary, etc.) have the dump procedure, which will dump all YAML to a given output unit. So, for example, this program reads test.yaml, then emits it to emitted.yaml:

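program test
  use fortran_yaml_c, only: parse, error_length
  use yaml_types, only: type_node
  implicit none

  class(type_node), pointer :: root
  character(len=error_length) :: error
  integer :: unit

  ! Parse the file into a tree of nodes; a non-empty error string means failure.
  root => parse("test.yaml", error = error)
  if (error /= '') then
    print *, trim(error)
    stop 1
  end if

  ! Emit the whole tree to a new file.
  open(newunit=unit, file="emitted.yaml")
  call root%dump(unit=unit, indent=0)
  close(unit)

  ! Release the memory held by the tree.
  call root%finalize()
  deallocate(root)
end program test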

You can also put together some YAML using the various yaml_types, then dump it. For example:

program test
  use yaml_types, only: type_node, type_dictionary, type_scalar
  implicit none

  class(type_node), pointer :: root
  class(type_node), pointer :: val1, val2, val3
  character(len=1024) :: key1, key2, key3
  integer :: unit

  ! Each value is a scalar node holding its content as a string.
  key1 = "pi"
  allocate(type_scalar::val1)
  select type (val1)
  class is (type_scalar)
    val1%string = "3.14159"
  end select

  key2 = "happy-today"
  allocate(type_scalar::val2)
  select type (val2)
  class is (type_scalar)
    val2%string = "true"
  end select

  key3 = "to-do"
  allocate(type_scalar::val3)
  select type (val3)
  class is (type_scalar)
    val3%string = "run a mile"
  end select

  ! The root node is a dictionary that maps the keys to the scalar nodes.
  allocate(type_dictionary::root)
  select type (root)
  class is (type_dictionary)
    call root%set(key1, val1)
    call root%set(key2, val2)
    call root%set(key3, val3)
  end select

  ! Emit the dictionary as YAML.
  open(newunit=unit, file="emitted.yaml")
  call root%dump(unit=unit, indent=0)
  close(unit)

  ! Release the memory held by the tree.
  call root%finalize()
  deallocate(root)

end program test

msz59 · February 5, 2022, 8:19pm

Thanks a lot for the explanation.

msz59 · February 5, 2022, 8:42pm

Sorry for bothering, but could you comment on the way the ‘node’ types are finalized? I’ve noticed a call to root%finalize() before deallocation. This is something that can be achieved with a final procedure. Actually, you do use final in just one extended type (type_list) but an explicit finalize() elsewhere.

Edit: sorry for this OT, I mistakenly believed we were in a private channel with @nicholaswogan

nicholaswogan · February 5, 2022, 9:13pm

Ya, it could all be done with a final procedure. I should implement that.
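
For reference, a generic sketch of the final-procedure approach. The type node_t here is hypothetical and is not the node type from fortran-yaml; finalized_count exists only so the effect can be observed.

```fortran
module node_with_final
  implicit none
  private
  public :: node_t, finalized_count

  ! Counts how often the final procedure has run (demonstration only).
  integer :: finalized_count = 0

  type :: node_t
    real, pointer :: payload(:) => null()
  contains
    final :: node_cleanup
  end type node_t

contains

  ! Invoked automatically when a node_t object is deallocated
  ! or otherwise ceases to exist.
  subroutine node_cleanup(self)
    type(node_t), intent(inout) :: self
    if (associated(self%payload)) deallocate(self%payload)
    finalized_count = finalized_count + 1
  end subroutine node_cleanup

end module node_with_final
```

With this in place, a plain deallocate of a node_t pointer also frees the payload, so the caller does not need a separate finalize() call before deallocating.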

I didn’t write this type system. It’s from this GitHub repo: https://github.com/BoldingBruggeman/fortran-yaml . The drawback of the linked code is that it cannot parse all of YAML. This is why I made an interface to the parser in libYAML.

Ya sorry, this should be a totally different thread!
