Moving back and forth while reading a file

Dear All,

I need to read a file that looks like this:

* Surface_1
1 3 90
37 272 92
30 494 48
...
272 99 100
302 87 65
* Surface_2
11 22 33
55 61 20
...
98 93 15

Currently I am reallocating the arrays to fit the new values read at each line:

```
read_loop_nodes: do
   read (mesh_unit_file, '(A)', iostat=retcode) line
   if ( (retcode /= iostat_end) .and. (line(1:1) /= "*") ) then
      read(line,*) n1_tmp, n2_tmp, n3_tmp
      nodes_n1 = [nodes_n1, n1_tmp]   ! grows (reallocates) the array at every line
      nodes_n2 = [nodes_n2, n2_tmp]
      nodes_n3 = [nodes_n3, n3_tmp]
   else
      ...
   end if
end do read_loop_nodes
```

But of course this is terribly inefficient for a large number of lines…

So I thought that an improved version could

  1. get the total number N of values (lines) to read, by reading each line and discarding the values
  2. allocate the arrays of size N
  3. get back to the start of the data records to read and store the values, thus saving N-1 allocations/deallocations

How can I implement this? How can I move back to a certain point in a file?

Also, any suggestions for an even better solution to this problem?

Thank you so much

As I understand it, you need to read the whole file once to count the number of values N, then come back to the beginning of the file to read the values? If so, you have two options:

  • close the file after the first pass and reopen it
  • use the rewind statement (see the sketch below)
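
For the rewind option, here is a minimal self-contained sketch of the two passes (the file name mesh.txt, the line length, and the integer element type are assumptions; the variable names follow your snippet):

```
program two_pass_read
   use, intrinsic :: iso_fortran_env, only: iostat_end
   implicit none
   integer :: mesh_unit_file, retcode, N
   character(len=256) :: line
   integer, allocatable :: nodes_n1(:), nodes_n2(:), nodes_n3(:)

   open(newunit=mesh_unit_file, file='mesh.txt', status='old', action='read')

   ! Pass 1: count the data lines (skip "*" header lines, stop at EOF)
   N = 0
   count_loop: do
      read (mesh_unit_file, '(A)', iostat=retcode) line
      if (retcode == iostat_end) exit count_loop
      if (line(1:1) /= "*") N = N + 1
   end do count_loop

   ! Allocate once, then go back to the beginning of the file
   allocate( nodes_n1(N), nodes_n2(N), nodes_n3(N) )
   rewind (mesh_unit_file)

   ! Pass 2: actually store the values
   N = 0
   read_loop: do
      read (mesh_unit_file, '(A)', iostat=retcode) line
      if (retcode == iostat_end) exit read_loop
      if (line(1:1) /= "*") then
         N = N + 1
         read(line,*) nodes_n1(N), nodes_n2(N), nodes_n3(N)
      end if
   end do read_loop

   close (mesh_unit_file)
end program two_pass_read
```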

Another option is to overallocate the arrays. If you can decide on an upper bound Nmax, just allocate the arrays with Nmax elements, and once all the values are read, reallocate them:

```
allocate( nodes_n1(Nmax), nodes_n2(Nmax), nodes_n3(Nmax) )
N = 0
read_loop_nodes: do
   read (mesh_unit_file, '(A)', iostat=retcode) line
   if ( (retcode /= iostat_end) .and. (line(1:1) /= "*") ) then
      N = N + 1
      read(line,*) nodes_n1(N), nodes_n2(N), nodes_n3(N)
   else
      ...
   end if
   ...
end do
! shrink to the actual size (automatic reallocation on assignment)
nodes_n1 = nodes_n1(1:N)
nodes_n2 = nodes_n2(1:N)
nodes_n3 = nodes_n3(1:N)
```

Note that all commonly used systems today use virtual memory: you can allocate very large amounts of memory, but it remains “virtual” as long as the elements are not accessed. A physical page is assigned in RAM only when an element of the page is first written. This means that you can allocate a 100 GB array with a very large Nmax, and if you write only the first element it will occupy only 4 kB of physical RAM (4 kB being the typical size of a page).
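
As a quick illustration (a sketch that assumes a 64-bit OS with memory overcommit enabled, such as a default Linux setup; the allocation may fail where overcommit is disabled):

```
program overcommit_demo
   use, intrinsic :: iso_fortran_env, only: int64
   implicit none
   real, allocatable :: big(:)
   integer(int64), parameter :: n = 25_int64 * 10_int64**9   ! ~100 GB of 4-byte reals

   allocate( big(n) )   ! reserves virtual address space only
   big(1) = 1.0         ! touching one element maps a single physical page
   print *, big(1)
end program overcommit_demo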

The other possibility is to create a linked list to temporarily store the elements during the reading phase, but this is a bit cumbersome IMO.
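
For completeness, a minimal sketch of the linked-list idea (all names here are mine): push appends one value per line read, and to_array copies the list into a contiguous array at the end, freeing the nodes. The caller initializes head and tail as null pointers.

```
module int_list
   implicit none
   type :: node
      integer :: val
      type(node), pointer :: next => null()
   end type node
contains

   ! Append a value at the tail of the list
   subroutine push(head, tail, val)
      type(node), pointer, intent(inout) :: head, tail
      integer, intent(in) :: val
      type(node), pointer :: p
      allocate(p)
      p%val = val
      if (.not. associated(head)) then
         head => p
      else
         tail%next => p
      end if
      tail => p
   end subroutine push

   ! Copy the list into an array, freeing the nodes along the way
   subroutine to_array(head, arr)
      type(node), pointer, intent(inout) :: head
      integer, allocatable, intent(out) :: arr(:)
      type(node), pointer :: p, q
      integer :: n
      n = 0
      p => head
      do while (associated(p))
         n = n + 1
         p => p%next
      end do
      allocate( arr(n) )
      n = 0
      p => head
      do while (associated(p))
         n = n + 1
         arr(n) = p%val
         q => p%next
         deallocate(p)
         p => q
      end do
      head => null()
   end subroutine to_array

end module int_list
```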

Dear @PierU,

Thank you for all these great suggestions!

I’ll be considering all of the approaches you mentioned.

Execution wall-clock time is the primary reason for refactoring and is mainly influenced by disk-read speed and allocation/deallocation speed.

So, allocating once but reading the file twice might not be the fastest strategy after all (disk reads are really slow on the server I work on).

Need to run some tests…

Thank you

There is a possible mix between overallocation and reallocation, if you don’t want to allocate a very large Nmax from scratch. This mimics what the C++ std::vector does. I illustrate it for nodes_n1 only:

```
Nmax = 1000 ! start with a reasonable Nmax
allocate( nodes_n1(Nmax), nodes_n2(Nmax), nodes_n3(Nmax) )
N = 0
read_loop_nodes: do
   read (mesh_unit_file, '(A)', iostat=retcode) line
   if ( (retcode /= iostat_end) .and. (line(1:1) /= "*") ) then
      N = N + 1
      if (N > Nmax) then
         ! reallocations occur only when N gets over Nmax
         allocate( tmparray(2*Nmax) )
         tmparray(1:Nmax) = nodes_n1(:)
         call move_alloc( tmparray, nodes_n1 )
         Nmax = size( nodes_n1 )
      end if
      read(line,*) nodes_n1(N), ...
   else
      ...
   end if
   ...
end do
nodes_n1 = nodes_n1(1:N)
nodes_n2 = nodes_n2(1:N)
nodes_n3 = nodes_n3(1:N)
```
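
The doubling step can also be factored into a small helper so the read loop stays clean; a minimal sketch for integer arrays (the name grow is mine), using the standard move_alloc intrinsic:

```
! Double the capacity of an allocatable integer array, preserving its contents
subroutine grow(a)
   integer, allocatable, intent(inout) :: a(:)
   integer, allocatable :: tmp(:)
   allocate( tmp(2*size(a)) )
   tmp(1:size(a)) = a
   call move_alloc( tmp, a )   ! a takes over tmp's storage, tmp is deallocated
end subroutine grow
```

In the loop, the whole if block then reduces to call grow(nodes_n1) followed by Nmax = size(nodes_n1).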

Great solution! Overall, this seems the most balanced approach.

I also wondered: once a file is opened, is it loaded entirely into RAM, or just a chunk of it?

If the former is true, then using backspace or rewind and re-reading shouldn’t take much time…

but I believe this is compiler-dependent, so better not to make assumptions!

This is not specified by the standard; it is managed by the OS. With the commonly used OS’s, just opening a file does not load any of its content into RAM. The parts of the file that are explicitly read afterwards are cached in RAM, at least as long as:

  • the file is not closed (so the rewind option is actually better than closing/reopening the file)
  • the OS does not need to reclaim the space occupied by the cache

If your file is not too big, it is reasonable to assume that it will stay in the cache between the two passes, and therefore that the second pass will be much faster than the first.

How many times do you need to reuse the file, and how locked in are you to the file format? If you can change the format, other alternatives are available, from HDF5 to plain binary files. If the files are used many times, reading them once and converting them to a format that can be processed more efficiently may be desirable. If you cannot change the format and only read each file a few times, there might not be as much of an advantage to such an approach.

“Big” is a relative term. Are these files in the gigabytes or larger, or a few megabytes at most? How much time does it currently take to read a file, and how often do you have to read one? Your current approach might be inefficient, but improving it is only an academic exercise unless it currently takes (or is likely to take in the future) significant time or resources. In that case, moving to a higher-performing, more easily consumed file format may well be worth the effort, as the root cause of the problem is that the file format is not easy to access efficiently.
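
For instance, a one-time conversion to Fortran unformatted stream access would make later reads trivial; a minimal sketch (the file name mesh.bin and the layout, a leading count followed by the three arrays, are assumptions):

```
integer :: u, N
integer, allocatable :: nodes_n1(:), nodes_n2(:), nodes_n3(:)

! One-time conversion (after the text file has been parsed as before):
! write a count header followed by the three arrays
open(newunit=u, file='mesh.bin', access='stream', form='unformatted', status='replace')
write(u) N
write(u) nodes_n1, nodes_n2, nodes_n3
close(u)

! Every later run: two reads and a single allocation, no line parsing at all
open(newunit=u, file='mesh.bin', access='stream', form='unformatted', status='old')
read(u) N
allocate( nodes_n1(N), nodes_n2(N), nodes_n3(N) )
read(u) nodes_n1, nodes_n2, nodes_n3
close(u)
```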