Allocation of array for reading strings from a text file

I want to fill an array with strings of varying lengths, while keeping the array’s length fixed. These strings are the entries of a text file. Since I fill this array element by element while traversing the text file line by line, I don’t have any prior information regarding the longest string or the number of elements. The solution I have in mind is to iterate through all the elements to determine their total number and find the longest string. Then, I will allocate the array. Afterward, I will revisit the text file and fill MyArray accordingly. However, this approach has the drawback of requiring two passes through the text file (which can become time-consuming if one deals with long text files). I am uncertain if there is an alternative approach where I can simultaneously read the text file, fill the array, and adjust the length to accommodate the maximum length found so far (comparing the current element with the previous elements). So the question is whether my approach is efficient and if not, what is the alternative?

So, for the approach in my mind, an example would be:

program allocate_array
  implicit none
  
  character(len=:), allocatable :: myArray(:)
  character(100)                :: tempString
  integer                       :: temp_strLen
  integer                       :: numElements, maxLength, i, iostatus

  ! Open the file
  open(1, file="input.txt", status="old", action="read")

  ! Initialize variables
  numElements = 0
  maxLength = 0
  i = 1
  ! Read the file and determine the maximum length and number of elements
  do
    read(1, '(A)', iostat=iostatus) tempString
    if (iostatus < 0) exit
    print*, 'line(', i, ') = ', tempString
    i = i + 1
    numElements = numElements + 1
    maxLength = max(maxLength, len(trim(tempString)))
 end do
 
  print*, 'maxLength = ', maxLength 
  ! Allocate the array with the maximum length
  allocate(character(len=maxLength) :: myArray(numElements))
  
  ! Rewind the file to the beginning
  rewind(1)

  ! Read the file again and fill the array
  do i = 1, numElements
     read(1, '(A)') tempString
     temp_strLen = len(trim(tempString))
     myArray(i)(1:temp_strLen) = tempString(1:temp_strLen) 
  end do

  ! Close the file
  close(1)

  ! Print the array
  do i = 1, numElements
    print*, myArray(i)
  end do
  
end program allocate_array

with input.text containing:

a
ab
abcd
abc
de

You could read the entire text file into a string using unformatted stream, as done in the code below, and then use the positions of the newlines to determine the length of the longest line.

program read_file_into_string
implicit none
character (len=*), parameter :: fname = "file.txt"
integer :: i,ipos,iu,len_text
integer, parameter :: n = 10
character (len=:), allocatable :: text
open (newunit=iu,file=fname,action="read")
inquire(unit=iu,size=len_text) ! get file size
close (iu)
! allocate string large enough to hold file if possible
allocate (character (len=len_text) :: text)
open (newunit=iu,file=fname,action="read", &
      form="unformatted",access="stream")
read (iu) text ! read file into variable text
ipos = index(text,new_line("")) ! pos of first newline
print "('first line: ',a)","'" // text(1:ipos-1) // "'"
! find position of next-to-last newline character in text
ipos = index(text(:len_text-1),new_line(""),back=.true.)
print "('last line: ',a)", &
      "'" // text(ipos+1:len_text-1) // "'"
end program read_file_into_string

Storing the text you gave in file.txt and running, the output is

first line: 'a'
last line: 'de'
1 Like

This is the kind of common operation that should be part of standard fortran, but it isn’t.

One workaround is to read the strings one by one and put them into a linked list. The members of the linked list can be allocated upon demand. When the linked list is complete, the array can be allocated to the correct size, the data copied from the linked list to the array, and the linked list can be deallocated.

That is basically what the fortran i/o library is doing already anyway, so when a programmer does this, it is duplicating effort that was already done.

1 Like

Thank you @Beliavsky. One very minor remark: the line

integer, parameter :: n = 10

was superfluous. Without that line the program still compiled, ran and gave the same output with gfortran and ifort.

Instead of using an allocatable array of deferred length strings which you would have to eventually allocate to a fixed length, I think it would be better to put the deferred len string in a derived type and allocate that to the number of lines of text.

ie.

type :: vlstring_t
  character(len=:), allocatable :: string
end type
integer :: iun, nrecs, istat
type(vlstring_t), allocatable :: text_file(:)
character(len=132) :: buffer
character(len=1) :: dummy
open(newunit=iun, file="yourfile.txt", form="formatted", status="unknown")
! count number of records (note you don't have to read the line of text just the first char)
nrecs = 0
count_loop: do
    read(iun, '(a1)', iostat=istat) dummy
    if (istat /= 0) exit count_loop
    nrecs = nrecs+1
 end do count_loop

allocate(text_file(nrecs))
rewind(iun)
read_loop: do i=1,nrecs
    buffer=repeat(" ", 132)
    read(iun,'(a)', iostat=istat) buffer
    if (istat /= 0) then
      print *,' error while reading text file'
      exit read_loop
    end if
   text_file(i)%string = trim(buffer)
end do read_loop
close(iun)

I’ve been using this code fragment to read text files ever since deferred length string became a thing in Fortran. This has the advantage of taking up less memory than allocating the deferred length string to a fixed length. Note that counting the number of lines by just reading the first character of a record is an old trick I’ve seen in Fortran code for many years. I’m not sure if it’s faster than reading the entire line (with modern compilers) but I think it was in the old days.
Also, you probably don’t want to use 1 for a unit number. Using the newunit option on open is a much better approach.

Edit: It just occured to me that having an option for the INQUIRE statement that returns the number of records (lines of text) in a formatted text file would be a Good Thing. You can get file size in “storage units” but I don’t think that translates to records of variable length. Maybe something to consider for Fortran 2200.

2 Likes

Thank you very much, @Beliavsky. I didn’t know that the entire text could be read into a string.
I have two questions:

  • Is it possible that while reading the text, we ignore lines starting with # (comments) or put the lines starting with special characters into another string?

  • in your code, len_text gives the size of the file and not the number of lines. So, “text” is ONE string, please correct me if I am mistaken. Is it possible to have information about the number of lines (and possibly lines starting without special characters such as #) in the first part of the code (lines 7-10 in your code) and then declare :

allocate (character (len=len_text) :: text(numElemenrts) )

Thanks @RonShepard. I am not familiar with “linked list”. I will search for it.

Thanks @rwmsu for your answer. It gives me lots of insights.
Could you please let me know if my understanding of your code is correct?
So, you define

type :: vlstring_t
  character(len=:), allocatable :: string
end type

type(vlstring_t), allocatable :: text_file(:)

Is this equivalent to the following:

  character(len=:), allocatable :: myArray(:)

so when you write:

text_file(i)%string == trim(buffer)

without doing

allocate(character(len=**maxLength**) :: text_file(nrecs))

Does it mean that each array of text_file has a different length? or it somehow does something as

maxLength = max(maxLength, len(trim(tempString)))

at runtime?

PS: I am grateful that both you and @Beliavsky have shown me another way to use the unit number. It has always been annoying for me to ensure that the same unit is not used twice in the code

@mary,

character(:), allocatable :: string(:)

is not equivalent to embedding a single deferred length string in a derived type. All Fortran arrays
irrespecitve of type must have a fixed length or dimension(s). In your approach you would have to eventually do a

allocate(character(len=maxlen) :: strings(nrecs))

in other words you can’t have a Fortran array with variable length dimensions (sometimes called a ragged array). In your approach you would eventually have an array of size nrecs with each array component of length maxlen. If your text files has records that vary for one character per line to 132 characters per line you are wasteing a lot of memory if you have several thousand lines of text. Basically with your approach you are ending up with an array that is equivalent to doing

character(len=maxlen) :: strings(maxsize)  ! where maxlen and maxsize are set to some fixed values. 

By embedding the string in a derived type array, you have an true array in which each component has a variable length (at the expense of having to deal with derived type syntax). You now have a array that mirrors the way the text is stored in the file. ie. the length of each line of text is the same as in the text file. Note this is also why you can’t have an array of pointers in Fortran and have to do a similar approach (embedding in a derived type).

I have proposed in the past a syntax that would allow you to do what you are trying to do using curly brackets to signal an array like object with variable size storage of components.

ie.

character(len=:), allocatable :: strings{:}

that would effectively do the same thing as the derived type but this proposal has been met with resounding silence.

2 Likes

One does not need to read even a single character from each line. Just the statement read(iun, '(a)', iostat=istat) is enough to skip to the next record and to set istat correctly when the end-of-file has been reached. In fact, if there are embedded empty records, then attempting to read a single character might trigger an end-of-record condition, so that situation would be treated differently with the two approaches. In any case, testing for istat < 0 would narrow the termination conditions to just the end-of-file case.

Yes, once you have read the line (or the full file), you can parse it any way you want. You can have zero, one, or multiple outputs for each input line.

If you do read the full file into a string, then there might be some nonportable machine-dependence involved in how the end-of-records are identified. Maybe someone with more experience with Windows operating systems can fill in the details. Specifically, on POSIX machines with POSIX file systems, an end-of-record is identified by a single newline character, which in fortran is given by the intrinsic function new_line(c) where c is a character of the correct KIND. This covers linux, MacOS, SunOS, BSD, and any other unix-based or unix-like computer system. However, Microsoft operating systems, including the various MS-DOS and Windows versions, use instead a two-character sequence <charriage-return><new-line> to designate end of records. I believe a fortran i/o library can either read that two-character sequence into memory or it can convert it into the one-character sequence. This gotcha only involves the case where you use stream access to read the full file into memory. If you read the lines one at a time with the usual formatted i/o, then the fortran i/o library handles things correctly on any operating system. I also do not know what happens if you are running on a POSIX machine, but you are reading a nonstandard file system (e.g. a Windows file mounted over a network), or visa versa. One would need to read the standard very closely to see what is supposed to happen in these cases, and then there is also the possibility that the fortran i/o library does not handle everything consistently.

So I might suggest that the more portable approach might be to read the lines one at a time, but I have little experience with Microsoft operating systems, so I’m not certain about these potential problems. Maybe others can fill in these details.

1 Like

@RonShepard ,

One can also test against the intrinsic parameters iostat_end and iostat_eor in iso_fortran_env but like a lot of new Fortran features I forget they exist and resort to old (and sometimes bad) habits.

@rwmsu Thank you very much for your explanation. Now, everything is clear to me.

Thanks a lot, @RonShepard for your response. I thought maybe in “inquire”, there might be an option that can automatically identify lines starting with or without some characters. Otherwise, I think that I can identify those lines by using “index”.

I tried to replace the last part of your code with following:

  do while (.not. end_file)
     ipos = index(text(i_start:len_text),new_line("")) ! pos of first newline
     if (ipos == 0) then
        end_file = .true.
     else
        print*, 'line(', i, ') = ', text(i_start:i_start+ipos-1)
        i_start = ipos + 1
        i = i + 1
     end if
  end do

with the following variables declared at the beginning of the program:

  integer              :: i_start = 1, i = 1
  logical              :: end_file = .false. 

But it ends up in an infinite loop, and it prints only the first line.
So, I assume that I did not fully understand how this new_line is working.
In the text file:

a
ab
abcd
abc
de

So, this means that the text variable in Fortran is not considered as a" “ab” “abcd” “abc” "de. Is my understanding correct?

To say what the “best” method is depends on whether you are reading from stdin or just
files, whether you need to operate on the data before hitting EOF, what the biggest file
is you need to read, and how many times you need to do it. That being said, here is
one self-contained example. A bit long, but it is extracted from something relatively
general.

Note that a simple but perhaps non-obvious way to handle files like stdin you cannot
read twice is to write it into a Fortran scratch file so you have an actual file you can rewind
and read multiple times; one of the few things not mentioned so far that can be useful
depending on the method you use. For small files I often do not need to know the original
line lengths and for simplicity just read everything into an array of standard strings.

Here is a module that defines a procedure called fileread() that works for simple cases
where that is OK. There are several Open Source versions of routines that do this using
various methods. This one is a cut-down version of one
from General Purpose Fortran.

If you use fpm(1) you can just use that or the module M_io.

An abridged version that is a self-contained module that creates the readfile(3f) procedure:

M_slurp module
MODULE M_slurp
use, intrinsic :: iso_fortran_env, only : stdin=>input_unit, stdout=>output_unit, stderr=>error_unit
implicit none
private
public fileread
integer,parameter,private:: sp=kind(1.0), dp=kind(1.0d0)

character(len=*),parameter,private :: gen='(*(g0,1x))'

CONTAINS

subroutine fileread(FILENAME,pageout)
implicit none
character(*),intent(in)                      :: FILENAME   ! file to read
character(len=:),allocatable,intent(out) :: pageout(:) ! page to hold file in memory
character(len=1),allocatable             :: text(:)    ! array to hold file in memory

   call filebyte(FILENAME,text) ! allocate character array and copy file into it

   if(.not.allocated(text))then
       write(*,*)'*fileread* failed to load file '//FILENAME
   else  ! convert array of characters to array of lines
      pageout=page(text)
      deallocate(text)     ! release memory
   endif

contains
function page(array)  result (table)

!@(#) page(3fp): function to copy char array to page of text

character(len=1),intent(in)  :: array(:)
character(len=:),allocatable :: table(:)
integer                      :: i
integer                      :: linelength
integer                      :: length
integer                      :: lines
integer                      :: linecount
integer                      :: position
integer                      :: sz
character(len=1),parameter   :: nl = char(10)
character(len=1),parameter   :: cr = char(13)
   lines = 0
   linelength = 0
   length = 0
   sz=size(array)
   do i = 1,sz
      if( array(i) == nl )then
         linelength = max(linelength,length)
         lines = lines + 1
         length = 0
      else
         length = length + 1
      endif
   enddo
   if( sz > 0 )then
      if( array(sz) /= nl )then
         lines = lines+1
      endif
   endif

   if(allocated(table))deallocate(table)
   allocate(character(len=linelength) :: table(lines))
   table(:) = ' '

   linecount = 1
   position = 1
   do i = 1,sz
      if( array(i) == nl )then
         linecount=linecount+1
         position=1
      elseif( array(i) == cr )then
      elseif( linelength /= 0 )then
         table(linecount)(position:position) = array(i)
         position = position+1
      endif
   enddo
end function page
end subroutine fileread

subroutine filebyte(filename,text,length,lines)
implicit none

!@(#) M_slurp::filebyte(3f): allocate text array and read file filename into it

character(*),intent(in)                      :: filename    ! filename to shlep
character(len=1),allocatable,intent(out) :: text(:)     ! array to hold file
integer,intent(out),optional             :: length      ! length of longest line
integer,intent(out),optional             :: lines       ! number of lines
integer :: nchars=0             ! holds size of file
integer :: igetunit             ! use newunit=igetunit in f08
integer :: ios=0                ! used for I/O error status
integer :: length_local
integer :: lines_local
integer :: i
integer :: icount
character(len=256)  :: message
character(len=4096) :: label
character(len=:),allocatable :: line
   length_local=0
   lines_local=0
   message=''
   open(newunit=igetunit, file=trim(filename), action="read", iomsg=message,&
   &form="unformatted", access="stream",status='old',iostat=ios)
   label=filename
   if(ios == 0)then  ! if file was successfully opened
      inquire(unit=igetunit, size=nchars)
      if(nchars <= 0)then
         call stderr_local( '*filebyte* empty file '//trim(label) )
         return
      endif
      ! read file into text array
      if(allocated(text))deallocate(text) ! make sure text array not allocated
      allocate ( text(nchars) )           ! make enough storage to hold file
      read(igetunit,iostat=ios,iomsg=message) text      ! load input file -> text array
      if(ios /= 0)then
         call stderr_local( '*filebyte* bad read of '//trim(label)//':'//trim(message) )
      endif
   else
      call stderr_local('*filebyte* '//message)
      allocate ( text(0) )           ! make enough storage to hold file
   endif

   close(iostat=ios,unit=igetunit)            ! close if opened successfully or not

   if(present(lines).or.present(length))then  ! get length of longest line and number of lines
      icount=0
      do i=1,nchars
         if(text(i) == NEW_LINE('A'))then
            lines_local=lines_local+1
            length_local=max(length_local,icount)
            icount=0
         endif
         icount=icount+1
      enddo
      if(nchars /= 0)then
         if(text(nchars) /= NEW_LINE('A'))then
            lines_local=lines_local+1
            length_local=max(length_local,icount)
         endif
      endif
      if(present(lines))lines=lines_local
      if(present(length))length=length_local
   endif
end subroutine filebyte

subroutine stderr_local(message)
character(len=*) :: message
   write(stderr,'(a)')trim(message)    ! write message to standard error
end subroutine stderr_local

end module M_slurp

A little program to test the module reads the file “testin” in and prints the
lines surrounded with braces and then removes lines beginning with “#”;
except I either made a typo or hit a gfortran bug (works with ifx):kkkkkk

program testit
use M_slurp, only : fileread
implicit none
character(len=:),allocatable :: pageout(:) ! page to hold file in memory
integer :: i
logical,allocatable :: keep(:)

   call fileread('testin',pageout)

   write(*,*)'write file enclosed in braces'
   write(*,*)'size=',size(pageout),',len=',len(pageout)
   write(*,'(*(g0))')('[',trim(pageout(i)),']',new_line('A'),i=1,size(pageout))

   ! remove lines starting with "#"
   ! A bug inn gfortran? works with ifx
   keep=index(adjustl(pageout),'#').ne.1
   pageout=pack(pageout,keep)

   write(*,*)'remove lines starting with "#"'
   write(*,*)'size=',size(pageout),',len=',len(pageout)
   write(*,'(*(i5,": ",a,/))')(i,trim(pageout(i)),i=1,size(pageout))

end program testit
1 Like

I think the problem is how you are using the index() function. You will need something like the following:

Thanks @RonShepard. But, I have the same issue when I replace “ipos = index(text(i_start:len_text),new_line(""))” by “ipos = index(text(i_start:len_text),new_line('a'))”.

I think you might have missed the important part of my correction.

The index() function returns the position within its dummy argument of the located character. The returned value is not the position within the original character string.

If you examine the values of i_start and ipos before the print statement, you will see this.

The other change, which you did notice, was the argument to the new_line() intrinsic function. I know absolutely for certain that new_line('a') works because that is the example given in the standard document itself. It is unclear to me if a null string (a zero length string) is an acceptable argument. I can see some reasons that it should work, since even a null string has a KIND value, but the standard document does not explicitly state that as being allowed, so I think it is safer to use a nonnull string as the argument. BTW, you can also use achar(10) as that argument in a portable way if your compiler uses ascii as its default KIND (which all compilers I know of right now do) or if the ascii <newline> character is mapped to the equivalent local character. Of all fortran compilers, the chances are probably 99.99% that one or the other of those conditions is true.

1 Like

Thank you, @RonShepard. I corrected this part of the code, but I still encountered a problem. As you mentioned, the issue was that the value returned by index() was relative to the dummy argument and not the original string. Below, I am sharing the corrected version in case someone else makes the same mistake and is looking for a solution:

  do while (ipos.ne.0)
     ipos = index(text(i_start:len_text),new_line("")) ! note that ipos is relative to (i_start:len_text) and not absolute
     if (ipos == 0) exit
     print*, 'ipos = ', ipos
     print*, 'line(', i, ') = ', text(i_start:i_start+ipos-1)
     i = i + 1
     i_start = ipos + i_start ! istart is an absolute-value w.r.t to the entire original text
  end do

Thank you @urbanjost . I am going to study the solution you proposed.