Proposal: Add optional arguments to savetxt

Hello, I’ve been away from the stdlib project for a long time, and I could not find any discussion of some savetxt features that I find useful, mostly following the behavior of NumPy’s savetxt.

I’ve opened an issue on GitHub to discuss the idea. Looking forward to your opinions!

Proposal

I’d like to propose a change to the interface of savetxt that adds some optional arguments to make it more flexible, following NumPy’s function quite closely. Please let me know if there is any interest. As I said, I could not find out whether this was already discussed and discarded.

I am proposing to add several arguments (see proposed signature below)

  1. Allowing a unit number in place of a filename gives the flexibility to add partial tables to a file or to write to stdout. One of the two must always be present.
  2. fmt makes it possible to customize the output format. In many cases complex calculations give only a few significant figures, and I would like the saved numbers to express that. It should work in a similar way as it does in loadtxt().
  3. header and footer make it possible to add information to the data files (column names, parameters used in calculations or measurements, etc.). I find this very important because it keeps the information together with the data.
  4. Following NumPy, I find it extremely useful for non-data information to be commented or marked in some form. This would be the role of the argument comments.
  5. I would propose to change the length of delimiter from len=1 to arbitrary length. For instance, we may want the columns of a *.csv file to be separated by a comma and one or more spaces ', '.
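As a rough illustration of items 2 to 5, here is a minimal sketch in standard Fortran. The helper save_table and its argument list are hypothetical names made up for this example; it is not the stdlib implementation and it skips footer and all error handling. It writes to stdout through a unit number (item 1), formats each value with a user-supplied fmt, prefixes the header with the comments marker, and joins columns with a two-character delimiter.

```fortran
program savetxt_options_sketch
    use, intrinsic :: iso_fortran_env, only: output_unit
    implicit none
    real :: x(3, 2)

    x = reshape([1.0, 2.0, 3.0, 10.0, 20.0, 30.0], shape(x))

    ! Write to stdout; any unit opened for formatted sequential output would do.
    call save_table(output_unit, x, delimiter=', ', fmt='(f8.3)', &
                    header='t, signal', comments='#')

contains

    ! Hypothetical helper for illustration only, not the stdlib routine.
    subroutine save_table(unit, array, delimiter, fmt, header, comments)
        integer, intent(in) :: unit
        real, intent(in) :: array(:, :)
        character(len=*), intent(in) :: delimiter, fmt, header, comments
        character(len=64) :: field
        integer :: i, j

        ! The header line is marked as a comment.
        write (unit, '(a)') comments // ' ' // header

        do i = 1, size(array, 1)
            do j = 1, size(array, 2)
                ! Format one value into a string, then emit it without ending the record.
                write (field, fmt) array(i, j)
                if (j > 1) write (unit, '(a)', advance='no') delimiter
                write (unit, '(a)', advance='no') trim(adjustl(field))
            end do
            write (unit, '(a)') ''   ! end the current record
        end do
    end subroutine save_table

end program savetxt_options_sketch
```

Formatting each value into a string first is just one way to support delimiters of arbitrary length; building a single row format would work as well.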

Some of these changes will probably trigger changes in loadtxt in order to keep compatibility (argument comments).

Also, I am not sure whether a badly written data file, with some text instead of numbers, breaks the current version of loadtxt.

The spec would be something like:


savetxt - save a 2D array into a text file

Status

Experimental

Description

Saves a rank-2 array into a text file.

Syntax

call [[stdlib_io(module):savetxt(interface)]] (array [, filename | unit] [, delimiter] [, fmt] [, header] [, footer] [, comments])

Arguments

array: Shall be a rank-2 array of type real, complex or integer.

filename (optional): Shall be a character expression containing the name of the file that will contain the 2D array. It is an error if filename and unit are both present.

unit (optional): Shall be an integer containing the unit of an already opened file. It is an error if unit and filename are both present.

delimiter (optional): Shall be a character expression of any length that contains the delimiter used to separate the columns. The default is a single space ' '.

fmt (optional): Fortran format specifier for the text save. Defaults to the write format for the data type.

header (optional): Shall be a character expression that will be written at the beginning of the file.

footer (optional): Shall be a character expression that will be written at the end of the file.

comments (optional): Shall be a character expression of length 1 that will be prepended to the header and footer strings to mark them as comments. Default: # .
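The rule that exactly one of filename and unit must be given can be checked with present(). The sketch below is only an illustration; valid_target is a hypothetical helper, and a real routine would presumably report an error instead of returning a flag.

```fortran
program target_check_sketch
    implicit none

    print '(l1)', valid_target(filename='data.txt')            ! T
    print '(l1)', valid_target(unit=10)                        ! T
    print '(l1)', valid_target()                               ! F: neither given
    print '(l1)', valid_target(filename='data.txt', unit=10)   ! F: both given

contains

    ! Hypothetical helper: true when exactly one of the two targets is present.
    logical function valid_target(filename, unit)
        character(len=*), intent(in), optional :: filename
        integer, intent(in), optional :: unit

        valid_target = present(filename) .neqv. present(unit)
    end function valid_target

end program target_check_sketch
```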


Prior Art

Hi @fiolj, thanks for the proposal. I think these would be great additions. I would model them closely on NumPy, to be compatible. The purpose of loadtxt is indeed to have a useful function to load tabular data from a file, with the most common cases covered. So I think all of the above would work well.

If you want, go ahead and send a PR against stdlib with the above changes and tests, that would make it a lot easier to evaluate all the details.

Thanks @certik, I’ve sent a PR with changes to savetxt.

Excellent, thanks! Here is the PR: Savetxt unit by fiolj · Pull Request #1085 · fortran-lang/stdlib · GitHub.

In terms of modelling closely on NumPy, should we arrange the arguments in exactly the same order?

Even if it breaks compatibility with current uses?

Thoughts?

I would lean towards going with the NumPy order, to be compatible, even if it means breaking the current order (I think we are still experimental, so this is expected). However, I would investigate NumPy’s commit history to see how the order was created, whether they ever changed it, and whether there are any complaints about it. I would use a different order only if it is clearly better than NumPy’s.

I find optional parameters only being accessible via keyword appealing, in which case the order does not matter. That is particularly the case when there are more than one or two optional parameters.

I’ve completed the changes and, in my opinion, it is ready to be merged. It would be great if we could get more opinions, in case something has escaped me.

I think it would be useful for the description to state explicitly how the output file is positioned, or to state explicitly that the file is always rewound by default and add an append option. It should also say whether appending (if allowed) is meant for output-only files that loadtxt cannot read back, since it would not be clear whether loadtxt could read multiple tables if headers were used.

Thanks @urbanjost

I am not sure I follow. In this version of savetxt(filename, array) the file is always opened and closed.

I did not think of adding an append option. I was mostly following the behavior of NumPy’s savetxt (and loadtxt). My original idea was that the user would manage that directly. In order to write several chunks of a table (or maybe several tables) one could use the savetxt(unit, array) routine.

In this latter case it is true that we don’t explicitly state the final position of unit; the current behavior is to leave it at the end, so that more data can be appended.

The use in this case would be something like:

  open (newunit=unit, file=fname)
  call savetxt(unit, y1, header= header)
  call savetxt(unit, y2)
  close (unit)

This is, in my opinion, a separate discussion, but so closely related that, as you point out, it should be had now. In the original issue we started a conversation on modifications to loadtxt.

Currently my thought is to make it follow NumPy’s closely. In that case, the header would not separate tables, as commented lines are completely ignored (regardless of whether they are at the beginning of the file or anywhere else). Reading multiple tables would then have to be managed with the arguments skiprows and maxrows, which I see as straightforward but cumbersome.
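To make the point about ignored comment lines concrete, here is a small sketch in standard Fortran. It does not call loadtxt; it only mimics the "skip every commented line" rule by hand, assuming '#' as the comment marker. The file name two_tables.txt is made up for the example.

```fortran
program skip_comments_sketch
    implicit none
    integer :: unit, ios, nrows
    character(len=256) :: line

    ! Build a small file with two "tables", each preceded by a commented header.
    open (newunit=unit, file='two_tables.txt', status='replace', action='readwrite')
    write (unit, '(a)') '# table 1', '1.0 2.0', '3.0 4.0', '# table 2', '5.0 6.0'
    rewind (unit)

    ! Count data rows, ignoring blank lines and lines starting with '#'.
    nrows = 0
    do
        read (unit, '(a)', iostat=ios) line
        if (ios /= 0) exit
        line = adjustl(line)
        if (len_trim(line) == 0) cycle
        if (line(1:1) == '#') cycle
        nrows = nrows + 1
    end do
    close (unit, status='delete')

    ! The second header does not split the data: all 3 rows look like one table.
    print '(a, i0)', 'data rows seen: ', nrows
end program skip_comments_sketch
```

A reader built this way sees one table of three rows, which is why separating the tables would need row offsets supplied by the caller.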

Possibly it is a good time to open a new thread to have a discussion on loadtxt(). It should be modified to make it easy

To a new user, even one familiar with savetxt() in other incarnations in other languages, it was not clear whether using the unit value would append or overwrite, which you have clarified. But to my knowledge the Python incarnation always overwrites(?), so I was just stating it should overwrite by default, or the documentation should clarify what the behavior is. Using a new append option would mean the default behavior conforms to the Python behavior, and you could be consistent whether a name or a unit number was specified. As-is, using the name always overwrites and using the unit always appends unless you do an explicit REWIND(). To me the current behavior is non-intuitive, so (if not changed) you need the document to clarify what the non-pythonic “new” option does. Otherwise, you got my question completely. Thanks for the thorough response too.

Thanks @urbanjost for clarifying. Yes, currently it will overwrite when used with filename and append when used with unit. This is also the behavior of NumPy’s savetxt: when given a file handle it appends, and when given a filename it overwrites (I just checked, I didn’t know before).

I did not state it explicitly earlier, but this was my intent (use an open file unit for appending), mainly because I could not think of other use cases that would need a different behaviour.

For this reason, the current behavior would be my preference but I think we should discuss it.

It is true that we should make this behavior explicit (the NumPy docs don’t state it).

If the goal is the least surprise for a python programmer and that is what numpy does as well, it looks like the current behavior is good, although I think the documentation (for both) should state that.

If the goal is to make a procedure as intuitive to use as possible I think both name and unit overwriting by default and both appending when an optional argument is supplied is better.

It looks like a major thrust of these modifications is to be more like Python.

When I started these modifications I was mainly following my own workflow (possibly shaped by NumPy), and the use of unit was introduced to complement filename.

However, your point is clear: an additional argument append would give consistent behavior for filenames and units.

I’ve posted the same question in the pull request thread. We could change the spec to something like:


Syntax

call savetxt(filename, array [, delimiter] [, fmt] [, header] [, footer] [, comments] [, append])

call savetxt(unit, array [, delimiter] [, fmt] [, header] [, footer] [, comments] [, append])

Arguments

append (optional): Shall be a logical flag. If .true., data will be appended at the end of the file or unit. If .false., the file will be overwritten. Default: .false.
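For the filename form, one possible way to map such a flag onto open is sketched below. This is only an assumption about how it could be done; save_rows is a hypothetical stand-in that writes bare rows, not the routine in the pull request, and the unit form is left out because positioning an already open unit is the contentious part.

```fortran
program append_flag_sketch
    implicit none
    real :: a(2, 2)

    a = reshape([1.0, 2.0, 3.0, 4.0], shape(a))

    call save_rows('append_sketch.txt', a)                  ! default: overwrite
    call save_rows('append_sketch.txt', a, append=.true.)   ! adds two more rows

contains

    ! Hypothetical stand-in for savetxt(filename, array [, append]).
    subroutine save_rows(filename, array, append)
        character(len=*), intent(in) :: filename
        real, intent(in) :: array(:, :)
        logical, intent(in), optional :: append
        logical :: append_
        integer :: unit, i

        append_ = .false.                       ! default proposed in the spec text
        if (present(append)) append_ = append

        if (append_) then
            ! Keep existing content and start writing after the last record.
            open (newunit=unit, file=filename, status='unknown', &
                  position='append', action='write')
        else
            ! Discard any existing content.
            open (newunit=unit, file=filename, status='replace', action='write')
        end if

        do i = 1, size(array, 1)
            write (unit, '(*(g0, :, 1x))') array(i, :)
        end do
        close (unit)
    end subroutine save_rows

end program append_flag_sketch
```

After both calls the file holds four rows: the two from the first call followed by the two appended ones.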


The only problem that I can see is that, when used with a unit number, the function will always modify the position of the file (to the beginning or to the end). In the previous version the user could in principle position it arbitrarily.

Thoughts, preferences?

Is the primary purpose of savetxt to create a file to be read by another program, reread by the program that generated it, for human inspection, postprocessing, or printing?

My impression is that, since it is sequential formatted data and I do not see anything to tell a load to start or stop on a particular set of data, it is not intended for loading anything but a single array per file, and that outputting multiple arrays would be to generate a file for printing/records or human inspection, not for reloading.

If that is correct, then I see very few use cases where I would not be appending or overwriting the entire file on output, and just loading a single array from a single file. If I wanted to go to particular cases I think you would need to add something to delimit the arrays à la NAMELIST groups or some other more elaborate data format.

So if almost any likely use is going to be intending to overwrite or append I think the append option as you describe it is the clearest as well as the most intuitive. If it is envisioned that savetxt/loadtxt would be used with different arrays of different shapes in an input file, possibly accessed in non-sequential order, or that it would be common to overwrite particular datasets in these multi-array files I might have a different answer.

That does seem like an odd choice for output to a formatted sequential file. Normally, one would think it would begin writing at the current position by default, and then only with specific options would it rewind+overwrite or skip to the end to append. Of course, if you are mimicking behavior in another language, then you might want to follow that convention, even if odd in a Fortran context.

I fully agree with @RonShepard, I don’t see an issue with the function having different behaviors when given either a filename or a unit. If anything, I see it as normal that both options are made available because one expects different behaviors.

With a unit input, I would expect that by default the next call picks up at the “current position” and appends from that point. So basically, for the unit case, I would expect that append is .true. by default and does nothing internally. And only if given append=.false. shall it rewind.

There are three possible actions, not just two. The file could be rewound and overwritten with new data from the beginning, the position could be left “as is” and written/overwritten from that point forward, or the file could be repositioned to the end and the new data appended. I think the confusing thing is when two of those options are merged together to have only two actions (i.e. a single logical dummy argument has only .true. and .false. values).

It seems like there are several possible solutions. One could pass in an integer argument telling the routine which of those three options to take. Or, one could have three different arguments saying which to take, with the convention that only one of them is allowed to be true on any given call. Or one could just have a simple routine that always writes its output from the current position, and then require the programmer to position the file correctly beforehand. That last option is straightforward to implement by the programmer with, for example, rewind unit and with open(unit,position='append') on the file. The best choice might also depend on the behavior in the other language that is being mimicked.
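The last of these options can be sketched in standard Fortran as follows. The routine write_rows is a hypothetical stand-in that always writes from the current position, and the caller chooses between the three actions with open and rewind. The file name is made up for the example.

```fortran
program caller_positions_sketch
    implicit none
    integer :: unit

    ! Overwrite: open a fresh file, so writing starts at the beginning.
    open (newunit=unit, file='pos_sketch.txt', status='replace')
    call write_rows(unit, [1.0, 2.0])
    ! "As is": no repositioning, so this continues after the rows above.
    call write_rows(unit, [3.0])
    close (unit)

    ! Append: reopen the existing file positioned at its end.
    open (newunit=unit, file='pos_sketch.txt', status='old', position='append')
    call write_rows(unit, [4.0])

    ! Rewind and overwrite: in a sequential file the record written next
    ! becomes the last record, so the four earlier rows are discarded.
    rewind (unit)
    call write_rows(unit, [9.0])
    close (unit)

contains

    ! Hypothetical stand-in: writes one value per record from the current position.
    subroutine write_rows(unit, values)
        integer, intent(in) :: unit
        real, intent(in) :: values(:)
        integer :: i

        do i = 1, size(values)
            write (unit, '(f6.2)') values(i)
        end do
    end subroutine write_rows

end program caller_positions_sketch
```

With this design the writing routine needs no positioning argument at all; the trade-off is that the caller must know the open and rewind idioms.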

IMHO:

Shouldn’t that be the ONLY option?

The savetxt subprogram should do only what its name says, and leave rewinding and closing to the user:

rewind (unit)
call savetxt(unit, ...)
close (unit)

The savetxt(file, ...) version, on the other hand, shouldn’t even bother with providing unit numbers/handles (the file should be opened and closed internally).

That append option may not even make sense for the savetxt(filename, ...) version, since the user can simply open the file themselves with the appropriate option(s) and then use either the filename or the unit version. The savetxt(filename, ...) version should just stick to the default position behavior if the file is already open, or rewind if it actually opens the file.

Maybe the savetxt interface shouldn’t try to mimic the open statement? Otherwise, besides append, one might think adding an asynchronous|decimal|encoding|round|… argument is also a good thing.
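One way a filename version could tell the two cases apart is inquire by file name. The sketch below is an assumption about how that might look, not the behavior of any existing routine; write_line is a hypothetical helper, and whether inquire recognizes a file opened under a differently spelled path depends on the compiler.

```fortran
program reuse_open_file_sketch
    implicit none
    integer :: u

    ! File not open yet: the helper opens it, writes from the start, and closes it.
    call write_line('reuse_sketch.txt', 'written after an internal open')

    ! File already open (here positioned at its end): the helper writes
    ! at the current position of the existing connection.
    open (newunit=u, file='reuse_sketch.txt', status='old', position='append')
    call write_line('reuse_sketch.txt', 'written through the existing connection')
    close (u)

contains

    ! Hypothetical helper illustrating the "already open?" decision.
    subroutine write_line(filename, text)
        character(len=*), intent(in) :: filename, text
        logical :: already_open
        integer :: unit

        inquire (file=filename, opened=already_open, number=unit)
        if (already_open) then
            write (unit, '(a)') text
        else
            open (newunit=unit, file=filename, status='replace', action='write')
            write (unit, '(a)') text
            close (unit)
        end if
    end subroutine write_line

end program reuse_open_file_sketch
```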