Metadata labels for simulation data

Hi!

What good practices are recommended for labeling data generated in Fortran simulations?

I am working on a problem that depends on many parameters, and I don’t know whether it is a good idea to have the program itself write a header with the current parameters, or whether there is a better strategy.

Thanks!

Is the purpose of this metadata or labelling to let you reproduce the simulation at some time in the future? Is it possible to reconstruct the values of the parameters from a reduced set of information? As an example, if your simulation involves some substance and you need to supply a dozen values representing its properties, then identifying the substance might be sufficient. Otherwise you would need to store the full input, along with version information on your program.
What you could do, though, is store the input in a version control system and use the commit associated with the input as a concise way to characterise it. Via the version control system you can then retrieve the actual input. Just a thought, mind you.
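
A minimal sketch of the commit idea, assuming the inputs live in a Git repository, that git is on the PATH, and that the program is started from inside that repository. The scratch file name commit_id.txt and the label input_commit are made up for illustration.

```fortran
program record_commit
  implicit none
  character(len=64) :: commit
  integer :: u, ios, exitstat, cmdstat

  commit = "unknown"
  exitstat = -1

  ! Assumption: git is installed and the working directory is inside
  ! the repository that holds the input files.
  call execute_command_line("git rev-parse HEAD > commit_id.txt", &
                            exitstat=exitstat, cmdstat=cmdstat)

  if (cmdstat == 0 .and. exitstat == 0) then
    open(newunit=u, file="commit_id.txt", status="old", &
         action="read", iostat=ios)
    if (ios == 0) then
      read(u, "(a)", iostat=ios) commit
      if (ios /= 0) commit = "unknown"
      close(u, status="delete")   ! remove the scratch file again
    end if
  end if

  ! One line that can be placed in the header of any output file.
  print "(a)", "# input_commit: " // trim(commit)
end program record_commit
```

If git is unavailable the label simply stays "unknown", so the simulation itself is not affected.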

Yes, the idea is to use the metadata to reproduce the simulation in the future and also to compare with other techniques based on the same parameters. My question is about the best way to do this in Fortran.

  • As a header in the dataset itself
  • As an external file linked in some way to the data

That is what I have done. In addition to program parameters you can consider adding
(1) name of executable via call get_command_argument(0,executable_name), preferably with path
(2) when the program was run by processing the output of date_and_time()
(3) compiler_version()
(4) compiler_options()
(5) names of parameter and data files used
(6) at the end, the time taken for the code to run, via calls to cpu_time() or system_clock()
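
The items in this list can be gathered with standard intrinsics only. The sketch below prints them as comment-style lines; the "# key: value" layout and the labels are assumptions chosen for illustration, not a convention taken from this thread.

```fortran
program provenance_demo
  use, intrinsic :: iso_fortran_env, only: compiler_version, &
                                           compiler_options, int64
  implicit none
  character(len=1024) :: exe_name
  character(len=8)    :: date   ! CCYYMMDD
  character(len=10)   :: time   ! hhmmss.sss
  character(len=5)    :: zone   ! +hhmm or -hhmm
  integer(int64)      :: t_start, t_end, rate

  call system_clock(count=t_start, count_rate=rate)

  ! (1) executable name; whether a path is included depends on how
  !     the program was invoked.
  call get_command_argument(0, exe_name)
  print "(a)", "# executable: " // trim(exe_name)

  ! (2) start time, rearranged into an ISO-8601-like string
  call date_and_time(date=date, time=time, zone=zone)
  print "(a)", "# started: " // date(1:4) // "-" // date(5:6) // "-" // &
               date(7:8) // "T" // time(1:2) // ":" // time(3:4) // ":" // &
               time(5:6) // zone

  ! (3) and (4) compiler identification and flags
  print "(a)", "# compiler: " // compiler_version()
  print "(a)", "# options: "  // compiler_options()

  ! ... the simulation itself would run here ...

  ! (6) wall-clock time taken, written at the end
  call system_clock(count=t_end)
  if (rate > 0) then
    print "(a,f0.3)", "# elapsed_seconds: ", &
          real(t_end - t_start) / real(rate)
  end if
end program provenance_demo
```

system_clock measures wall-clock time; cpu_time could be used in the same way if CPU time is what matters.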

I’m interested in your workflow. Do you add the information to the generated dataset or use an external file to save the metadata?

I put it in the same file as the generated data, writing both the parameters and simulation data as CSV.
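
As a rough illustration of that layout, the sketch below writes a few parameters followed by the data into one CSV file. The parameter names, the file name decay.csv and the leading "#" on the parameter lines are assumptions made for this example; the "#" is there only so that tools which skip comment lines can read the data part directly.

```fortran
program csv_with_header
  implicit none
  ! Hypothetical parameters of a toy exponential-decay "simulation".
  integer, parameter :: n_steps = 5
  real,    parameter :: dt = 0.1, decay = 2.0
  integer :: u, i
  real    :: t

  open(newunit=u, file="decay.csv", status="replace", action="write")

  ! Parameter block: one "name,value" pair per line.
  write(u, "(a,i0)")     "# n_steps,", n_steps
  write(u, "(a,es12.5)") "# dt,", dt
  write(u, "(a,es12.5)") "# decay,", decay

  ! Column names, then the simulation data.
  write(u, "(a)") "t,y"
  do i = 0, n_steps
    t = i * dt
    write(u, "(es12.5,',',es12.5)") t, exp(-decay * t)
  end do

  close(u)
end program csv_with_header
```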

I recommend HDF5 for this purpose, at least if you are developing a larger application. Using the native HDF5 bindings requires a lot of boilerplate code, but @scivision has written a high-level interface called h5fortran.

Note: I have not tried h5fortran, but it makes a good impression.

@juannaviap excellent question!

Yeah, HDF5 is nice in that it produces just one file and the file is self-contained: it can hold both data and metadata. The problem with HDF5 is that the library is quite complicated and hard (and slow) to build. The format itself is also very complicated.

Another idea is to use a human-readable “header” (such as JSON, YAML, namelist, etc.) and a binary “data” file with all the arrays; the structure of the data is described in the header.
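
A minimal sketch of the header-plus-binary idea using only standard Fortran, with a namelist as the header and stream access for the data. The file names, the namelist group name layout and its fields are all hypothetical.

```fortran
program header_plus_binary
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  ! Everything named here is made up for illustration.
  integer           :: nx = 4, ny = 3
  character(len=16) :: dtype = "real64"
  character(len=32) :: data_file = "field.bin"
  real(real64), allocatable :: field(:,:), copy(:,:)
  integer :: u
  namelist /layout/ nx, ny, dtype, data_file

  allocate(field(nx, ny))
  call random_number(field)

  ! Human-readable header; delim="quote" keeps strings readable on input.
  open(newunit=u, file="field.nml", status="replace", action="write", &
       delim="quote")
  write(u, nml=layout)
  close(u)

  ! Raw array bytes in Fortran (column-major) order, no record markers.
  open(newunit=u, file=trim(data_file), status="replace", action="write", &
       access="stream", form="unformatted")
  write(u) field
  close(u)

  ! Reading back: the header tells us how to size the array.
  nx = 0
  ny = 0
  open(newunit=u, file="field.nml", status="old", action="read")
  read(u, nml=layout)
  close(u)

  allocate(copy(nx, ny))
  open(newunit=u, file=trim(data_file), status="old", action="read", &
       access="stream", form="unformatted")
  read(u) copy
  close(u)

  print *, "round trip matches: ", all(copy == field)
end program header_plus_binary
```

Because the data file has no record markers, other tools can read it as a flat array once they know the shape, type and column-major ordering from the header.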

There is a new format called ASDF that tries to address precisely the above issue, but I think there is currently only a Python implementation, so we would have to create a Fortran one. It is also just one file, containing a human-readable header and binary data.

It’s great to see this attention being paid to reproducibility! It’s a complex topic and there are lots of different ways you can go about it.

I tend to go with the approach of outputting data in NetCDF (based on HDF5) format, with metadata embedded, but I also include a human-readable YAML sidecar/header file with the important metadata that will allow me to re-run the simulation. This includes the information @Beliavsky mentions above, plus the Git commit ID (or some other kind of version info) and the path to the config and input data. Note this also means you have to make sure the input data remains available at that path. If it’s an important model run (e.g. it will be used for a publication), then archiving all of this (including input data) somewhere like Zenodo is a great idea (if your data are small enough to do so).
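
A sidecar like this does not need a YAML library to write, since a flat key/value file can be produced with plain write statements. In the sketch below every key, file name and value is a placeholder, and values are assumed to contain no single quotes.

```fortran
program yaml_sidecar
  implicit none
  integer :: u

  ! Placeholder file name; a real run would derive it from the output file.
  open(newunit=u, file="run_001.yaml", status="replace", action="write")

  ! Placeholder values; a real run would fill these in at run time.
  call put("output_file", "run_001.nc")
  call put("git_commit",  "unknown")
  call put("config_path", "config/run_001.nml")
  call put("input_data",  "data/forcing.nc")

  close(u)

contains

  ! Write one "key: 'value'" line. Single quotes keep colons and
  ! backslashes in paths from being interpreted by a YAML parser.
  subroutine put(key, value)
    character(len=*), intent(in) :: key, value
    write(u, "(a)") key // ": '" // trim(value) // "'"
  end subroutine put

end program yaml_sidecar
```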

@certik thanks for introducing us to ASDF. I hadn’t come across that before but it looks great. I frequently find that difficulty installing NetCDF/HDF is the main barrier to getting other people to use my models (especially if they’re on Windows), so an alternative is very welcome.

Yes, although relevant to us are these two issues:

Thank you all for the valuable information

@certik I’ve made a Fortran interface to libYAML via iso_c_binding

It can read and emit yaml.

I want to set it up as an fpm package, but I think I will need to wait for fpm to support custom build scripts.

How can it emit yaml? The Readme says it is a parser, and the only public entities in the module are parse and error_length.

All yaml node types (type_list, type_dictionary, etc.) have the dump procedure, which will dump all yaml to a given output unit. So, for example, this program reads test.yaml, then emits it to emitted.yaml

program test
  use fortran_yaml_c, only: parse, error_length
  use yaml_types, only: type_node
  implicit none

  class(type_node), pointer :: root
  character(len=error_length) :: error
  integer :: unit

  ! Parse the input file; error is non-blank if parsing failed
  root => parse("test.yaml", error = error)
  if (error/='') then
    print*,trim(error)
    stop 1
  endif

  ! Emit the parsed tree to a new file
  open(newunit=unit,file="emitted.yaml")
  call root%dump(unit=unit,indent=0)
  close(unit)

  ! Release the tree
  call root%finalize()
  deallocate(root)
end program

You can also put together some yaml using the various yaml_types, then dump it. For example:

program test
  use yaml_types, only: type_node, type_dictionary, type_scalar
  implicit none
  
  class(type_node), pointer :: root
  
  class(type_node), pointer :: val1, val2, val3
  character(len=1024) :: key1, key2, key3
  
  key1 = "pi"
  allocate(type_scalar::val1)
  select type (val1)
  class is(type_scalar)
    val1%string = "3.14159"
  end select
  
  key2 = "happy-today"
  allocate(type_scalar::val2)
  select type (val2)
  class is(type_scalar)
    val2%string = "true"
  end select
  
  key3 = "to-do"
  allocate(type_scalar::val3)
  select type (val3)
  class is(type_scalar)
    val3%string = "run a mile"
  end select
  
  allocate(type_dictionary::root)
  select type (root)
  class is (type_dictionary)
    call root%set(key1, val1)
    call root%set(key2, val2)
    call root%set(key3, val3)
  end select
  
  open(unit=1,file="emitted.yaml")
  call root%dump(unit=1,indent=0)
  close(1)

  call root%finalize()
  deallocate(root)

end program

Thanks a lot for the explanation.

Sorry for bothering, but could you comment on the way the ‘node’ types are finalized? I’ve noticed a call to root%finalize() before deallocation. This is something that can be achieved with a final procedure. Actually, you do use final in just one extended type (type_list) but an explicit finalize() elsewhere.

Edit: sorry for this OT, I mistakenly believed we were on private channel with @nicholaswogan

Ya, it could all be done with a final procedure. I should implement that.
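
For reference, a generic sketch of what a final procedure looks like. The type node_t and its payload component are invented for this example and are not the types from fortran-yaml.

```fortran
module node_demo
  implicit none
  private
  public :: node_t

  ! Hypothetical type that owns a pointer component.
  type :: node_t
    real, pointer :: payload(:) => null()
  contains
    final :: node_cleanup
  end type node_t

contains

  ! Called automatically when a node_t object is finalized,
  ! for example when a pointer to it is deallocated.
  subroutine node_cleanup(self)
    type(node_t), intent(inout) :: self
    if (associated(self%payload)) deallocate(self%payload)
    print "(a)", "node_cleanup called"
  end subroutine node_cleanup

end module node_demo

program final_demo
  use node_demo, only: node_t
  implicit none
  type(node_t), pointer :: root

  allocate(root)
  allocate(root%payload(10))

  ! No explicit finalize call: deallocation triggers node_cleanup.
  deallocate(root)
end program final_demo
```

With this in place the caller only needs the deallocate, and the cleanup cannot be forgotten.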

I didn’t write this type system. It’s from this GitHub repo: https://github.com/BoldingBruggeman/fortran-yaml . The drawback of the linked code is that it cannot parse all of yaml. This is why I made an interface to the parser in libYAML.

Ya sorry, this should be a totally different thread!