Reading YAML files

Does anybody know of a library to read YAML files? I admit I have not searched myself yet. The question popped up in a discussion just now ;).

2 Likes

The Fortran package index says we have

(Search - Fortran Programming Language)

3 Likes

Ah thanks, being in a meeting does not help my search muscles.

I wrote Fortran interfaces to libyaml and yaml-cpp:

The benefit here is that these packages will parse all of YAML: block style, flow style, etc. Right now, there isn’t a pure Fortran YAML parser that handles all of YAML.

1 Like

@nicholaswogan, @awvwgk, thanks, I will have a look at both. The discussion was about the use of YAML for storing timeseries. In our simulation programs we need timeseries of multiple variables - sometimes 20 to 40. Over the years many different formats have been used, differing in the amount of metadata, precise details for the date and time, use of missing values, etc. YAML may or may not be useful, but at the very least it is a format that is widely recognised.
The one problem I do see here is that it is intended for heterogeneous data. The metadata fit that well, but for the more regular block of times + values it feels a trifle awkward. Well, let’s try it.

Have you looked into TOML for this purpose? TOML: Tom's Obvious Minimal Language

@Arjen I know what you mean. YAML might not be the right tool for the job. I do modeling which involves time series outputs. I use unformatted binary Fortran files. It’s easy to write or read records in both Fortran and Python (scipy.io.FortranFile).
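For illustration, here is a minimal sketch of the writing side (the file name, record length and loop count are made up):

program write_records
  implicit none
  integer :: iu, i
  real :: values(7)

  ! Each WRITE produces one unformatted, sequential-access record
  open(newunit=iu, file='series.dat', form='unformatted', &
       access='sequential', status='replace')
  do i = 1, 5
     values = real(i)          ! placeholder data for one time level
     write(iu) values
  end do
  close(iu)
end program write_records

On the Python side, scipy.io.FortranFile('series.dat').read_reals(dtype='float32') then returns one such record per call.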

The problem with unformatted data is the (potential) lack of metadata, etc. If the structure is not trivially obvious, it will be difficult to share/distribute the data.

Reasonably standardised data formats for timeseries data that, in my experience, are worth looking into include:

  • HDF5
  • UNV-58, old and clunky, but can be read by everyone and everything :wink:

3 Likes

I had not heard of UNV-58, but Google finds an amusing description:

Back in the days of Fortran IV when this file format was invented, you could just read the binary data into real or integer variables with A format, and everything worked fine. The “quiche-eating police” have probably made that illegal in Fortran 90/95 though. The 58B data format is highly machine dependent, but at least it includes some flags in the header to tell you which binary format the data is in. If you need to convert from one machine to another (e.g. big endian to little endian, or IBM System/360 floating point to IEEE) that’s a different question!

2 Likes

Hi @lars, welcome to the forum!

On the topic of saving files — I think the compiler could (and should!) help when you save any variables (arrays, scalars, etc.) into any widely used format, such as HDF5, NumPy, etc.

2 Likes

The question I posed concerns the format of files containing timeseries that users can edit with a text editor, not the output results. So I am really looking for a suitable text file format ;).

I think more information about the contents of this text file would help people recommend options. Maybe you could link an example on github, or you could just describe it in more detail in a post.

There are time series of 20 - 40 variables. How many points typically? 100? 1,000,000? Does your Fortran program need to both parse and emit these files? Or is just parsing good enough?

@nicholaswogan, it is so easy to assume your own world view is crystal clear to everyone else … I only realised that this is not necessarily the case when the replies started to diverge. Here goes:

  • In a typical application of the water quality model I am referring to, we deal with substances like ammonia, nitrate, phosphate, the odd algae or two, etc.
  • Boundary conditions and waste load data may span any time period - from a few days to a few years, with the number of times running from just a handful to many thousands (the latter mainly because one of the formats I need to work with requires all parameters to be defined at the same time and some things can be measured at a high frequency)
  • Likewise, all manner of environmental conditions are presented as detailed timeseries.
  • Users have to supply such data, coming from databases or spreadsheets or you name it.
  • Such text files need to be understandable by human beings as well as by the program. So metadata have to be supplied: the meaning of each column, date and time in Gregorian format, location, …

Here is an example, slightly adapted:

Date-and-time         Cl OXY    BOD Chlfa    KjdN OPO3  SPM
2014/01/01-00:00:00 1000 7      1       1     0.5 0.2   50
2014/03/01-00:00:00 -999 6      -999    30    0.6 0.2   60
2014/04/01-00:00:00 -999 4      -999    100   1.7 0.03  75
2014/06/01-00:00:00 -999 4      -999    80    1.2 0.03  80
2014/08/01-00:00:00 -999 3      -999    20    0.4 0.03  42
2014/10/15-00:00:00 -999 5      -999    4     0.6 0.1   22
2014/12/31-00:00:00 1000 7      1       1     1   0.2   30

(The value -999 indicates a missing value, in which case interpolation may be applied.)

It is an example of one format and I am quite content with the way this can be used. But it is not the only one around and we are looking for some kind of standardisation. In the discussion that led me to write my original post, YAML was mentioned, but TOML is another format that may or may not be suitable. Some people here prefer netCDF files, because that is a standard format quite suitable for large amounts of data, but they forget that it is rather unsuitable for the uninitiated.

I have looked at the specifications of YAML and TOML and while apt for their design goals, they may not be a comfortable match for the above type of data. For the moment I am merely exploring options - perhaps even to have arguments as to why these file formats are unsuitable :slight_smile:
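For reference, reading the tabular format above in plain Fortran is straightforward; a minimal sketch (the file name and the seven-value rows simply follow the example):

program read_table
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan
  implicit none
  integer :: iu, ios, ipos
  character(len=200) :: line
  character(len=19)  :: stamp
  real :: vals(7)

  open(newunit=iu, file='timeseries.txt', status='old', action='read')
  read(iu, '(a)') line                  ! skip the header line
  do
     read(iu, '(a)', iostat=ios) line
     if (ios /= 0) exit
     ! Split off the date/time token by hand: a list-directed read
     ! would stop at the slashes in the date
     ipos  = index(line, ' ')
     stamp = line(1:ipos-1)
     read(line(ipos:), *) vals          ! list-directed read of the seven values
     where (vals == -999.0) vals = ieee_value(0.0, ieee_quiet_nan)   ! mark gaps as NaN
     print '(a,1x,7g12.4)', stamp, vals
  end do
  close(iu)
end program read_table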

This is the best I can think of via YAML.

metadata:
  thisthing: true
  otherthing: "waterisgood"
  
time-series:  
  - [Date-and-time,         Cl, OXY,    BOD, Chlfa,    KjdN, OPO3,  SPM]
  - [2014/01/01-00:00:00, 1000, 7,      1,       1,     0.5, 0.2,   50 ]
  - [2014/03/01-00:00:00, null, 6,      null,    30,    0.6, 0.2,   60 ]
  - [2014/04/01-00:00:00, null, 4,      null,    100,   1.7, 0.03,  75 ]
  - [2014/06/01-00:00:00, null, 4,      null,    80,    1.2, 0.03,  80 ]
  - [2014/08/01-00:00:00, null, 3,      null,    20,    0.4, 0.03,  42 ]
  - [2014/10/15-00:00:00, null, 5,      null,    4,     0.6, 0.1,   22 ]
  - [2014/12/31-00:00:00, 1000, 7,      1,       1,     1,   0.2,   30 ]

Thousands of time points is probably OK. For millions this might feel a bit slow. This can be parsed and processed by either of the packages I linked earlier. However, those packages won’t be able to emit YAML in this specific nice format… They would emit this as

metadata:
  thisthing: true
  otherthing: waterisgood
time-series:
  - - Date-and-time
    - Cl
    - OXY
    - BOD
    - Chlfa
    - KjdN
    - OPO3
    - SPM
  - - 2014/01/01-00:00:00
    - 1000
    - 7
    - 1
    - 1
    - 0.5
    - 0.2
    - 50
  - - 2014/03/01-00:00:00
    - null
    - 6
    - null
    - 30
    - 0.6
    - 0.2
    - 60
  - - 2014/04/01-00:00:00
    - null
    - 4
    - null
    - 100
    - 1.7
    - 0.03
    - 75
  - - 2014/06/01-00:00:00
    - null
    - 4
    - null
    - 80
    - 1.2
    - 0.03
    - 80
  - - 2014/08/01-00:00:00
    - null
    - 3
    - null
    - 20
    - 0.4
    - 0.03
    - 42
  - - 2014/10/15-00:00:00
    - null
    - 5
    - null
    - 4
    - 0.6
    - 0.1
    - 22
  - - 2014/12/31-00:00:00
    - 1000
    - 7
    - 1
    - 1
    - 1
    - 0.2
    - 30

My first thought after seeing the table of numbers was: how about the simple CSV format? The way I understand it, CSV was designed to store tabular data. But would CSV be too limiting for this application?

The emitted form is simply not acceptable, I am afraid, as you would lose any structure and overview - for instance you cannot easily determine the change over time of Chlfa, whereas the input form has these values nicely in a column.
The formatting/meta characters in the input form are a bit of a nuisance, but probably doable. The “null” instead of a reserved value is an interesting aspect of the example.

My concern with regard to CSV is that it is not a well-defined format, as witnessed by this description. For numerical data that may be less of a problem, but when it comes to text, you have various flavours, and even the separator (not necessarily a comma!) is something to be guessed. It is relatively easy to produce (via a text editor or the spreadsheet program of choice), but it is troublesome when it comes to combining several tables in one file - it is not uncommon to have dozens or even hundreds of such tables, and putting them in different files is, eh, well, cumbersome.

1 Like

You could write a custom YAML emitter for your specific file format. I do not think this would be that challenging.
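Something along these lines might do, assuming the values already sit in arrays and keeping the -999 convention (the names and sample data are purely illustrative):

program emit_yaml_table
  implicit none
  integer :: iu, i, j
  character(len=19) :: stamps(2) = ['2014/01/01-00:00:00', '2014/03/01-00:00:00']
  real :: table(2,3) = reshape([1000.0, -999.0, 7.0, 6.0, 1.0, -999.0], [2,3])
  character(len=16) :: field

  open(newunit=iu, file='series.yaml', status='replace', action='write')
  write(iu, '(a)') 'time-series:'
  do i = 1, size(stamps)
     write(iu, '(3a)', advance='no') '  - [', stamps(i), ','
     do j = 1, size(table, 2)
        if (table(i,j) == -999.0) then
           field = 'null'                 ! emit the sentinel as YAML null
        else
           write(field, '(g0)') table(i,j)
        end if
        if (j < size(table, 2)) then
           write(iu, '(2a)', advance='no') ' '//trim(adjustl(field)), ','
        else
           write(iu, '(2a)') ' '//trim(adjustl(field)), ']'
        end if
     end do
  end do
  close(iu)
end program emit_yaml_table

A real version would pad the fields so the columns line up as in the hand-written file, but the point stands: a purpose-built emitter is only a few dozen lines.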

No, you’re right - as we know what kind of data we are dealing with, a specific layout is much easier to implement (otherwise you would have to build in some smartness to decide which layout would be best).

I’ve been through similar issues quite a few times and share your reservations about CSV files, @Arjen. But at the end of the day, for timeseries data that requires non-technical user input, I always come back to CSV! As you’ve explored here, YAML and the like don’t really work for timeseries data, and HDF/NetCDF are no good for non-technical users.

I think the key to making CSV work is being explicit about the format required from users (give them a template?).
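For example, a template along these lines (entirely illustrative, with empty fields meaning missing values) already answers most of the questions a user could have:

datetime,Cl,OXY,BOD,Chlfa,KjdN,OPO3,SPM
2014-01-01T00:00:00,1000,7,1,1,0.5,0.2,50
2014-03-01T00:00:00,,6,,30,0.6,0.2,60

Any metadata (location, units, the meaning of each column) would then have to live in agreed-upon comment lines at the top or in a small companion file, since CSV itself has no place for it.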

Slightly off topic, but I’m curious, since I’m also involved in WQ modelling: what’s the model you are developing?

1 Like