Reading data: format of variables

Hi,

I am reading data from a text file. The format of the data might vary from one text file to another. I wanted to know if there is any intrinsic Fortran function that can identify the format of variables (integer, exponential notation, number of digits, number of characters, etc.). Once we know the format of the first variable, we can keep that format to read the whole text file.

Another, longer solution would be to split each line on a delimiter and then convert the strings to double. But I doubt the efficiency of this second method, especially when we deal with big tabulated data.

Thanks in advance for your help,
Regards,
Mary


Hi!
Could you provide an example?

This is my subroutine based on splitting and converting string to double which works fine:

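    subroutine inputs(text_file,nlines,vec_x,vec_y)
      implicit none
      character(*), intent(in)      :: text_file
      integer, intent(in)           :: nlines
      double precision, intent(out) :: vec_x(nlines),vec_y(nlines)
      integer                       :: i, ios, unit, pos

      ! Long enough for one line of the file; longer lines are truncated.
      character(256)          :: instring
      character(1), parameter :: delim = ' '

      open (newunit=unit, file = text_file, status = 'old', action = 'read')
      do i=1,nlines
         read(unit, '(A)', iostat=ios) instring
         if (ios /= 0) error stop 'inputs: could not read line'
         ! Drop leading blanks so the first delimiter found follows field 1.
         instring = adjustl(instring)

         ! Split at the first delimiter (pos avoids shadowing intrinsic INDEX).
         pos = scan(instring,delim)

         ! Convert each field with a list-directed internal read.
         read(instring(1:pos-1),fmt=*,iostat=ios) vec_x(i)
         if (ios /= 0) error stop 'inputs: bad first value'
         read(instring(pos+1:),fmt=*,iostat=ios) vec_y(i)
         if (ios /= 0) error stop 'inputs: bad second value'
      end do
      close (unit)
    end subroutine inputs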

Now, if we know the format of the variables, this long code can be reduced to:

      open (2, file = text_file, status = 'old')
      do i=1,nlines
         read(2, '(2(E11.2))') vec_x(i), vec_y(i)
      end do
      close (2)

But the format varies from one text file to another, so this '(2(E11.2))' has to be determined automatically. I am wondering if there is any intrinsic function in Fortran that does so. Otherwise, I will write one myself; it should not be difficult.

And what do the text files look like?

The usual approach that is taken when formats change from file to file is to write the format as part of the data file. You read the format string, then you read the data that goes with that format, then the next format string, then the data that goes with it, and so on.

If you check the format string for values like '(*)', then you can also branch internally between explicit formats and list-directed i/o. Fortran makes that harder than necessary for the programmer, but at least it can be done, so the functionality is there.
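
A minimal sketch of that approach, assuming a hypothetical file layout in which the first line holds the format (for example `(2E11.2)` or `(*)`) and every following line holds one x, y pair. The module name `format_header_reader` and the routine `read_table` are made up for illustration:

```fortran
module format_header_reader
  implicit none
  private
  public :: read_table
contains
  ! Read nlines (x, y) pairs from a file whose first line is the format.
  ! Assumed layout: line 1 is e.g. (2E11.2) or (*), then one pair per line.
  subroutine read_table(text_file, nlines, vec_x, vec_y, ios)
    character(*), intent(in)      :: text_file
    integer, intent(in)           :: nlines
    double precision, intent(out) :: vec_x(nlines), vec_y(nlines)
    integer, intent(out)          :: ios     ! 0 on success
    character(80) :: fmt
    integer :: i, unit

    open (newunit=unit, file=text_file, status='old', action='read', iostat=ios)
    if (ios /= 0) return

    ! The format string travels with the data.
    read (unit, '(A)', iostat=ios) fmt
    do i = 1, nlines
       if (ios /= 0) exit
       if (adjustl(fmt) == '(*)') then
          ! '(*)' is our own marker meaning "use list-directed input".
          read (unit, *, iostat=ios) vec_x(i), vec_y(i)
       else
          ! Any other string is passed on as an explicit format.
          read (unit, fmt, iostat=ios) vec_x(i), vec_y(i)
       end if
    end do
    close (unit)
  end subroutine read_table
end module format_header_reader
```

A caller would use `call read_table('data.txt', n, x, y, ios)` and check `ios` before trusting the arrays. The same idea extends to several format/data sections in one file by reading a new format line before each section.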


As @RonShepard said, the best way is to know the format, whether by reading it from somewhere (perhaps you have a master input file where you specify the format) or in some other way. For a 2D real array (matrix) you can use the loadtxt function from stdlib: loadtxt – Fortran-lang/stdlib, similar to NumPy’s loadtxt.

That being said, it is possible to write a parser that determines the type of the value, but it will be some work and I would not recommend that approach, unless you have no other choice.


@mary, do you have any control over the format of the file? For example, can you go with comma-separated values in, say, a CSV file? In that case you can use a library solution, e.g., a nice one by @jacobwilliams: Fortran-csv-module.

In terms of intrinsic facilities in Fortran, an option you may have looked into is list-directed IO, as mentioned by @RonShepard. As you would know, it works in a lot of simple cases, and if that suffices for your needs, nothing can beat list-directed IO in terms of basic simplicity.
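
To illustrate, here is a small sketch of list-directed input applied to one line of text. The same `read (..., *)` accepts integers, fixed-point and exponent notation, and blank or comma separators, so no edit descriptor has to be known in advance. The module name `list_directed_demo` and the routine `parse_pair` are hypothetical:

```fortran
module list_directed_demo
  implicit none
  private
  public :: parse_pair
contains
  ! Parse two reals from one line of text with list-directed input.
  subroutine parse_pair(line, x, y, ios)
    character(*), intent(in)      :: line
    double precision, intent(out) :: x, y
    integer, intent(out)          :: ios   ! 0 on success, nonzero otherwise

    ! fmt=* lets the runtime work out the layout of each value.
    read (line, *, iostat=ios) x, y
  end subroutine parse_pair
end module list_directed_demo
```

With this, lines such as `1 2`, `1.5e3, -2.0d-1` and `   3.25    4` all parse into two doubles, while a line holding only one value comes back with a nonzero `ios`.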


I posted the inputs/outputs here: Multidimensional data interpolation (table lookup) - #5 by mary

Thank you for your thorough explanation. I was not aware of this usual approach.

Thank you @RonShepard for pointing out the loadtxt function.

I have full control over how these text files are written. I write these files and they will be used for interpolation purposes (I posted my test case in Multidimensional data interpolation (table lookup) - #5 by mary). I will read about this Fortran CSV module. Does that mean that by structuring data in a comma-separated style, there is no more need to know the format of variables?

You can’t always use the character string in the input file to tell you what kind of variable it should be read into with list-directed input. 666 is valid input to an integer or real variable, as well as to a character one if the file was opened with delim='none', which is the default.

Of course, if you have control over the contents of the input file, you could require that its character strings contain a character that cannot be part of a number, that its real input contain at least one of +-.deinftynaDEINFTYNA (because Infinity and NaN are possible inputs to a real variable), and that input containing only +-0123456789 is to an integer.
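
One possible reading of that convention, as a rough sketch: a token made only of signs and digits is treated as an integer, a token made only of characters that can occur in a real (including the letters of Infinity and NaN) is treated as a real, and anything else is a character string. The names `token_kind` and `classify_token` are hypothetical, and this is only a first filter: it does not prove the token is a well-formed number, so the actual `read` should still be checked with `iostat`.

```fortran
module token_kind
  implicit none
  private
  public :: classify_token
  ! Characters allowed in an integer token under the assumed convention.
  character(*), parameter :: int_chars  = '+-0123456789'
  ! Characters that may appear in a real token, incl. Infinity and NaN.
  character(*), parameter :: real_chars = int_chars // '.deinftyaDEINFTYA'
contains
  ! Return 'empty', 'integer', 'real' or 'character' for one token.
  function classify_token(token) result(kind_name)
    character(*), intent(in)  :: token
    character(:), allocatable :: kind_name
    character(:), allocatable :: t

    t = trim(adjustl(token))           ! ignore surrounding blanks
    if (len(t) == 0) then
       kind_name = 'empty'
    else if (verify(t, int_chars) == 0) then
       kind_name = 'integer'           ! only signs and digits
    else if (verify(t, real_chars) == 0) then
       kind_name = 'real'              ! only characters a real may contain
    else
       kind_name = 'character'         ! something a number cannot contain
    end if
  end function classify_token
end module token_kind
```

The result can then drive a `select case` that picks the variable to read into. Tokens such as `1e` pass this filter without being valid reals, which is why the sketch should not be mistaken for a parser.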

I doubt you have a fair assessment of “efficiency”. How long do you think it would take to split each line, convert the numbers, and produce an error report on invalid numbers along with statistics for each column? 1 second, 10 seconds, perhaps 100 seconds? It is probably generated at about 500 MBytes per second.

Now consider how efficient it would be to perform a less rigorous data extraction and have no idea of the errors in your data from text files of varying format. 1 day, 10 days, perhaps 100 days to recover from an incorrect report?

I always try to generate statistics of the data and a report of possible errors, especially where the data format is variable or comes from a third party. With third-party data, it is beneficial to do multiple passes and try to improve the “incorrect data” assessment rules.
And why would you have varying data formats if you are controlling the data generation?
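
As a rough sketch of the kind of report described above, assuming a two-column table of reals: each line is read as text, parsed with list-directed input, and either folded into per-column minima and maxima or reported as rejected. The names `table_report`, `column_stats` and `scan_table` are hypothetical, and real rules for "incorrect data" would be stricter than a failed read.

```fortran
module table_report
  implicit none
  private
  public :: column_stats, scan_table

  type :: column_stats
     integer          :: n_ok  = 0               ! lines parsed successfully
     integer          :: n_bad = 0               ! lines that failed to parse
     double precision :: vmin(2) =  huge(1d0)    ! per-column minimum
     double precision :: vmax(2) = -huge(1d0)    ! per-column maximum
  end type column_stats
contains
  ! One pass over the file: collect statistics and print rejected lines.
  subroutine scan_table(text_file, stats)
    character(*), intent(in)        :: text_file
    type(column_stats), intent(out) :: stats     ! reset to defaults on entry
    character(256)   :: line                     ! longer lines are truncated
    double precision :: v(2)
    integer :: unit, ios, line_no

    open (newunit=unit, file=text_file, status='old', action='read', iostat=ios)
    if (ios /= 0) return
    line_no = 0
    do
       read (unit, '(A)', iostat=ios) line
       if (ios /= 0) exit                        ! end of file or read error
       line_no = line_no + 1
       read (line, *, iostat=ios) v              ! try to parse two reals
       if (ios /= 0) then
          stats%n_bad = stats%n_bad + 1
          print '(A,I0,2A)', 'line ', line_no, ' rejected: ', trim(line)
       else
          stats%n_ok = stats%n_ok + 1
          stats%vmin = min(stats%vmin, v)
          stats%vmax = max(stats%vmax, v)
       end if
    end do
    close (unit)
  end subroutine scan_table
end module table_report
```

After `call scan_table('data.txt', stats)`, a nonzero `stats%n_bad` or an implausible range in `stats%vmin` and `stats%vmax` is a cue to inspect the file before using it for interpolation.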
