Reading YAML files

Does anybody know of a library to read YAML files? I admit I have not searched myself yet. The question popped up in a discussion just now ;).

2 Likes

The Fortran package index says we have

(Search - Fortran Programming Language)

3 Likes

Ah thanks, being in a meeting does not help my search muscles.

I wrote Fortran interfaces to libyaml and yaml-cpp:

The benefit here is that these packages will parse all of YAML: block style, flow style, etc. Right now, there isn’t a pure Fortran YAML parser that handles all of YAML.

1 Like

@nicholaswogan, @awvwgk, thanks, I will have a look at both. The discussion was about the use of YAML for storing timeseries. In our simulation programs we need timeseries of multiple variables - sometimes 20 to 40. Over the years many different formats have been used, differing in the amount of metadata, precise details for the date and time, use of missing values, etc. YAML may or may not be useful, but at the very least it is a format that is widely recognised.
The one problem I do see here is that it is intended for heterogeneous data. The metadata fit that well, but for the more regular block of times + values it feels a trifle awkward. Well, let’s try it.

Have you looked into TOML for this purpose? TOML: Tom's Obvious Minimal Language

@Arjen I know what you mean. YAML might not be the right tool for the job. I do modeling which involves time series outputs. I use unformatted binary Fortran files. It’s easy to write or read records in both Fortran and Python (scipy.io.FortranFile).
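For illustration, here is a minimal sketch of the writing side (the file name, record length and loop count are made up):

program write_records
  implicit none
  integer :: iu, i
  real :: values(7)

  ! Each WRITE produces one unformatted, sequential-access record
  open(newunit=iu, file='series.dat', form='unformatted', &
       access='sequential', status='replace')
  do i = 1, 5
     values = real(i)          ! placeholder data for one time level
     write(iu) values
  end do
  close(iu)
end program write_records

On the Python side, scipy.io.FortranFile('series.dat').read_reals(dtype='float32') then returns one such record per call.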

The problem with unformatted data is the (potential) lack of metadata, etc. If the structure is not trivially obvious, it will be difficult to share/distribute the data.

Reasonably standardised data formats for timeseries data that, in my experience, are worth looking into include:

  • HDF5
  • UNV-58, old and clunky, but can be read by everyone and everything :wink:

3 Likes

I had not heard of UNV-58, but Google finds an amusing description:

Back in the days of Fortran IV when this file format was invented, you could just read the binary data into real or integer variables with A format, and everything worked fine. The “quiche-eating police” have probably made that illegal in Fortran 90/95 though. The 58B data format is highly machine dependent, but at least it includes some flags in the header to tell you which binary format the data is in. If you need to convert from one machine to another (e.g. big endian to little endian, or IBM System/360 floating point to IEEE) that’s a different question!

2 Likes

Hi @lars, welcome to the forum!

On the topic of saving files — I think the compiler could (and should!) help when you save any variables (arrays, scalars, etc.) into any widely used format, such as HDF5, NumPy, etc.

2 Likes

The question I posed concerns the format of files containing timeseries that users can edit with a text editor, not the output results. So I am really looking for a suitable text file format ;).

I think more information about the contents of this text file would help people recommend options. Maybe you could link an example on github, or you could just describe it in more detail in a post.

There are time series of 20 - 40 variables. How many points typically? 100? 1,000,000? Does your Fortran program need to both parse and emit these files? Or is just parsing good enough?

@nicholaswogan, it is so easy to assume your own world view is crystal clear to everyone else … I only realised that this is not necessarily the case when the replies started to diverge. Here goes:

  • In a typical application of the water quality model I am referring to, we deal with substances like ammonia, nitrate, phosphate, the odd algae or two, etc.
  • Boundary conditions and waste load data may span any time period - from a few days to a few years, with the number of times running from just a handful to many thousands (the latter mainly because one of the formats I need to work with requires all parameters to be defined at the same time and some things can be measured at a high frequency)
  • Likewise, all manner of environmental conditions are presented as detailed timeseries.
  • Users have to supply such data, coming from databases or spreadsheets or you name it.
  • Such text files need to be understandable by human beings as well as by the program. So metadata have to be supplied: the meaning of each column, date and time in Gregorian format, location, …

Here is an example, slightly adapted:

Date-and-time         Cl OXY    BOD Chlfa    KjdN OPO3  SPM
2014/01/01-00:00:00 1000 7      1       1     0.5 0.2   50
2014/03/01-00:00:00 -999 6      -999    30    0.6 0.2   60
2014/04/01-00:00:00 -999 4      -999    100   1.7 0.03  75
2014/06/01-00:00:00 -999 4      -999    80    1.2 0.03  80
2014/08/01-00:00:00 -999 3      -999    20    0.4 0.03  42
2014/10/15-00:00:00 -999 5      -999    4     0.6 0.1   22
2014/12/31-00:00:00 1000 7      1       1     1   0.2   30

(The value -999 indicates a missing value, in which case interpolation may be applied.)

It is an example of one format and I am quite content with the way this can be used. But it is not the only one around and we are looking for some kind of standardisation. In the discussion that led me to write my original post, YAML was mentioned, but TOML is another format that may or may not be suitable. Some people here prefer netCDF files, because that is a standard format quite suitable for large amounts of data, but they forget that it is rather unsuitable for the uninitiated.

I have looked at the specifications of YAML and TOML and while apt for their design goals, they may not be a comfortable match for the above type of data. For the moment I am merely exploring options - perhaps even to have arguments as to why these file formats are unsuitable :slight_smile:
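For reference, reading the tabular format above in plain Fortran is straightforward; a minimal sketch (the file name and the seven-value rows simply follow the example):

program read_table
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan
  implicit none
  integer :: iu, ios, ipos
  character(len=200) :: line
  character(len=19)  :: stamp
  real :: vals(7)

  open(newunit=iu, file='timeseries.txt', status='old', action='read')
  read(iu, '(a)') line                  ! skip the header line
  do
     read(iu, '(a)', iostat=ios) line
     if (ios /= 0) exit
     ! Split off the date/time token by hand: a list-directed read
     ! would stop at the slashes in the date
     ipos  = index(line, ' ')
     stamp = line(1:ipos-1)
     read(line(ipos:), *) vals          ! list-directed read of the seven values
     where (vals == -999.0) vals = ieee_value(0.0, ieee_quiet_nan)   ! mark gaps as NaN
     print '(a,1x,7g12.4)', stamp, vals
  end do
  close(iu)
end program read_table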

This is the best I can think of via YAML.

metadata:
  thisthing: true
  otherthing: "waterisgood"
  
time-series:  
  - [Date-and-time,         Cl, OXY,    BOD, Chlfa,    KjdN, OPO3,  SPM]
  - [2014/01/01-00:00:00, 1000, 7,      1,       1,     0.5, 0.2,   50 ]
  - [2014/03/01-00:00:00, null, 6,      null,    30,    0.6, 0.2,   60 ]
  - [2014/04/01-00:00:00, null, 4,      null,    100,   1.7, 0.03,  75 ]
  - [2014/06/01-00:00:00, null, 4,      null,    80,    1.2, 0.03,  80 ]
  - [2014/08/01-00:00:00, null, 3,      null,    20,    0.4, 0.03,  42 ]
  - [2014/10/15-00:00:00, null, 5,      null,    4,     0.6, 0.1,   22 ]
  - [2014/12/31-00:00:00, 1000, 7,      1,       1,     1,   0.2,   30 ]

Thousands of time points is probably OK. For millions this might feel a bit slow. This can be parsed and processed by either of the packages I linked earlier. However, those packages won’t be able to emit YAML in this specific nice format… They would emit this as

metadata:
  thisthing: true
  otherthing: waterisgood
time-series:
  - - Date-and-time
    - Cl
    - OXY
    - BOD
    - Chlfa
    - KjdN
    - OPO3
    - SPM
  - - 2014/01/01-00:00:00
    - 1000
    - 7
    - 1
    - 1
    - 0.5
    - 0.2
    - 50
  - - 2014/03/01-00:00:00
    - null
    - 6
    - null
    - 30
    - 0.6
    - 0.2
    - 60
  - - 2014/04/01-00:00:00
    - null
    - 4
    - null
    - 100
    - 1.7
    - 0.03
    - 75
  - - 2014/06/01-00:00:00
    - null
    - 4
    - null
    - 80
    - 1.2
    - 0.03
    - 80
  - - 2014/08/01-00:00:00
    - null
    - 3
    - null
    - 20
    - 0.4
    - 0.03
    - 42
  - - 2014/10/15-00:00:00
    - null
    - 5
    - null
    - 4
    - 0.6
    - 0.1
    - 22
  - - 2014/12/31-00:00:00
    - 1000
    - 7
    - 1
    - 1
    - 1
    - 0.2
    - 30

My first thought after seeing the table of numbers was: how about the simple CSV format? The way I understand it, CSV was designed to store tabular data. But would CSV be too limiting for this application?

The emitted form is simply not acceptable, I am afraid, as you would lose any structure and overview - for instance you cannot easily determine the change over time of Chlfa, whereas the input form has these values nicely in a column.
The formatting/meta characters in the input form are a bit of a nuisance, but probably doable. The “null” instead of a reserved value is an interesting aspect of the example.

My concern with regard to CSV is that it is not a well-defined format, as witnessed by this description. For numerical data that may be less of a problem, but when it comes to text, you have various flavours, and even the separator (not necessarily a comma!) is something to be guessed. It is relatively easy to produce (via a text editor or the spreadsheet program of choice), but it is troublesome when it comes to combining several tables in one file - it is not uncommon to have dozens or even hundreds of such tables, and putting them in different files is, eh, well, cumbersome.

1 Like

You could write a custom YAML emitter for your specific file format. I do not think this would be that challenging.
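Something along these lines might do, assuming the values already sit in arrays and keeping the -999 convention (the names and sample data are purely illustrative):

program emit_yaml_table
  implicit none
  integer :: iu, i, j
  character(len=19) :: stamps(2) = ['2014/01/01-00:00:00', '2014/03/01-00:00:00']
  real :: table(2,3) = reshape([1000.0, -999.0, 7.0, 6.0, 1.0, -999.0], [2,3])
  character(len=16) :: field

  open(newunit=iu, file='series.yaml', status='replace', action='write')
  write(iu, '(a)') 'time-series:'
  do i = 1, size(stamps)
     write(iu, '(3a)', advance='no') '  - [', stamps(i), ','
     do j = 1, size(table, 2)
        if (table(i,j) == -999.0) then
           field = 'null'                 ! emit the sentinel as YAML null
        else
           write(field, '(g0)') table(i,j)
        end if
        if (j < size(table, 2)) then
           write(iu, '(2a)', advance='no') ' '//trim(adjustl(field)), ','
        else
           write(iu, '(2a)') ' '//trim(adjustl(field)), ']'
        end if
     end do
  end do
  close(iu)
end program emit_yaml_table

A real version would pad the fields so the columns line up as in the hand-written file, but the point stands: a purpose-built emitter is only a few dozen lines.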

No, you’re right - as we know what kind of data we are dealing with, a specific layout is much easier to implement (otherwise you would have to build in some smartness to decide which layout would be best).

I’ve been through similar issues quite a few times and share your reservations about CSV files, @Arjen. But at the end of the day, for timeseries data that requires non-technical user input, I always come back to CSV! As you’ve explored here, YAML and the like don’t really work for timeseries data, and HDF/NetCDF are no good for non-technical users.

I think the key to making CSV work is being explicit about the format required from users (give them a template?).
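For example, a template along these lines (entirely illustrative, with empty fields meaning missing values) already answers most of the questions a user could have:

datetime,Cl,OXY,BOD,Chlfa,KjdN,OPO3,SPM
2014-01-01T00:00:00,1000,7,1,1,0.5,0.2,50
2014-03-01T00:00:00,,6,,30,0.6,0.2,60

Any metadata (location, units, the meaning of each column) would then have to live in agreed-upon comment lines at the top or in a small companion file, since CSV itself has no place for it.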

Slightly off topic, but I’m curious, since I’m also involved in WQ modelling: what’s the model you are developing?

1 Like