How to temporarily bunch data

I have very long serial (time) data. I sequentially scan the data once to count how many values satisfy various “complicated” criteria. After that, I allocate the storage data structures (arrays) with those sizes. Then I have to re-scan the data a second time to place the values into the corresponding arrays.

Is there any smart way to bunch the data (ordered or unordered) temporarily and dynamically during the first scan, so that the second scan can be avoided? I understand that allocating arrays of exactly the right size is very efficient for memory handling and for the later calculations. Note that while the data are being bunched, no calculations are done at all. Once the required sizes are known, the data could be transferred permanently (in order or not) to the allocated arrays, releasing the “abused memory” used for bunching. I believe that transferring the data is less expensive than scanning them a second time, isn’t it?

Any suggestions? TIA

Note (added later for clarification purposes): The data are already read from the file and stored in a 1d array. The matched data found during the first parse/loop/scan of the array need to be stored temporarily and then transferred to arrays once the parse is complete and the array sizes are known, to avoid a second parse of the array. The first answers mostly deal with reading and storing the data from files.

I understand that by sequential scanning you mean reading from an external file. If so, you should consider how much memory is available, how big the file being scanned is, and what percentage of the data is expected to end up in the working arrays.

So if your computer memory is enough to hold all the data, and the predicted percentage is high, you could probably read all the data into dynamically growing allocatable array(s) (which should be done in a clever way) and then simply store indices (or pointers) to the data that are to be processed.
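
For instance, a rough sketch of that index idea, with a trivial positive-value test standing in for the real “complicated” criteria, could look like this:

program index_pass
   implicit none
   real :: data(10) = [3., -1., 7., 0., 5., -2., 9., 4., -8., 6.]
   integer, allocatable :: idx(:)
   integer :: i, nmatch
   allocate( idx(size(data)) )        ! worst case: every element matches
   nmatch = 0
   do i = 1, size(data)
      if ( data(i) > 0.0 ) then       ! placeholder for the real criteria
         nmatch = nmatch + 1
         idx(nmatch) = i              ! remember only the position of the match
      end if
   end do
   idx = idx(1:nmatch)                ! shrink to the exact number of matches
   write(*,'(a,*(1x,f4.1))') 'matched values:', data(idx)
end program index_pass

Only the integer positions are kept; a packed copy of the values can later be gathered with data(idx) if you really need one.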

If the percentage is small, you would rather transfer the needed data to new array(s) and then deallocate the whole set.

In any case, you can just rewind the file and read it once again. If its size is significantly less than the memory size, it will probably be cached in memory anyway, so the reading will take place mostly between cache buffers and your program memory.

If your file is bigger than the memory, you are out of luck and will have to read it twice from the disk (or whatever external device it is located on). Unless you need only part of the data (say, a few of many columns).


In addition to your current method and those subsequently proposed, you can also create binary scratch files for the selected data as you go during the first pass, which can then be read back efficiently or stored into a pre-allocated array; or create a direct-access scratch copy of the data and then build index arrays into it. Heck, you can even import the data into a database or sqlite file and then make queries on the data.
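
As a rough sketch of the scratch-file idea (again with a placeholder criterion, since I do not know your actual tests), the first pass can park the matches in an unformatted stream scratch file and read them back once the count is known:

program scratch_pass
   implicit none
   real :: raw(10) = [3., -1., 7., 0., 5., -2., 9., 4., -8., 6.]
   real, allocatable :: matched(:)
   integer :: u, i, nmatch
   ! an unformatted stream scratch file serves as temporary storage for the matches
   open(newunit=u, status='scratch', access='stream', form='unformatted')
   nmatch = 0
   do i = 1, size(raw)
      if ( raw(i) > 0.0 ) then        ! placeholder for the real selection criteria
         write(u) raw(i)              ! park the match in the scratch file
         nmatch = nmatch + 1
      end if
   end do
   allocate( matched(nmatch) )        ! the exact size is now known
   rewind(u)
   read(u) matched                    ! pull all the matches back in one read
   close(u)                           ! a scratch file is deleted automatically on close
   write(*,'(a,*(1x,f4.1))') 'matched:', matched
end program scratch_pass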

It really depends on the original input file, how big it is, whether it is initially flat text or binary, how often you need to create and use the data subsets, and what platform you are running on (how much memory you have, …) and how big your biggest input may be in the future. Any additional parameters you can supply would help narrow it down; but for relatively small files your current solution is direct and not overly complicated, so I am wondering what is prompting you to look for alternatives? Does the current method take an excessive amount of time, or …(?).

One general approach to this problem of reading data of unknown length is to read the items one at a time and append them to a linked list data structure. Then once the file has been read, you can transfer the linked list to a normal linear array of exactly the right size. This requires reading the data only a single time. The downside is that at least temporarily you need to store two copies of the data, one in the linked list form and the other in the array.

Read it twice. There is a good chance the data file will still be cached in I/O file buffers and so the reread will be quicker anyway.

The other approach is to have a config file that provides an estimate of the count of data values for each criterion/list you want to use.

If you generate the data yourself or have some control over this process, you could write the config file at the end of the data generation phase.
I find that data from a third party are rarely read only once, so an initial scan for statistics (written to the config file) and for unrecognised data is a good first pass toward understanding the data anyway. How many times do you read this data anyway? Usually you have to revise the searching criteria as you better understand the source data; what at first looks like a data error could turn out to be extremely significant.
You may have to do this 5, 10, ?? times anyway.

I have learnt to never correct the raw data, but apply correcting rules in the search program.

A simple way to allow your arrays to grow is:

real, allocatable :: data(:), chunk(:)
allocate( data(0), chunk(100) )
do i = 1,...
    ... read additional data into chunk ...
    data = [data, chunk]
enddo

This is not efficient, because the data get copied time and again into the now larger array, but it is effective. Alternatives as suggested by @RonShepard are more efficient, but require more code. You could also use an array of arrays like:

type list_t                          ! assumed derived type for the "array of arrays"
   real, pointer :: pdata(:) => null()
end type list_t
type(list_t) :: list(nmax)           ! nmax: assumed upper bound on the number of chunks
real, pointer :: chunk(:)
allocate( chunk(100) )
list(idx)%pdata => chunk             ! hand chunk number idx over to the list
nullify( chunk )

so that you can access the data via:

write(*,*) list(i)%pdata(j)

Just some ideas.

I may add that I have been working on a general solution for this in the context of “Mathematics of Arrays” (see the thread by that name). It is an operation called catenation in MoA.

I understand that by sequential scanning you mean reading from an external file.

No, I do not. The data are read from an external file (csv, json or toml) once; that is something I cannot avoid, because the data come from external databases.

I do mean sequentially looping over data that are already in memory. I apologize for any confusion due to my poor/unfortunate description!

If so, you should…

Thanks for the opportunity to consider huge files. In fact, I could read the files partly, in chunks. Command-line arguments such as qty (quantity of data to be read) and oqty (overlapping quantity, to avoid losing matches) could be added! I need to check whether existing fpm libraries support partial reads of data from files.
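
What I have in mind is something like the sketch below; qty and oqty are hard-coded here, and a file series.txt with one value per line is just an assumption for illustration:

program chunked_scan
   implicit none
   integer, parameter :: qty = 1000, oqty = 10
   real :: buf(qty), val
   integer :: u, n, nkeep, ios
   open(newunit=u, file='series.txt', status='old', action='read')
   n = 0
   do
      nkeep = min(oqty, n)
      if ( nkeep > 0 ) buf(1:nkeep) = buf(n-nkeep+1:n)   ! carry the overlap forward
      n = nkeep
      do while ( n < qty )            ! fill the rest of the buffer
         read(u, *, iostat=ios) val
         if ( ios /= 0 ) exit
         n = n + 1
         buf(n) = val
      end do
      ! ... apply the matching criteria to buf(1:n) here ...
      write(*,'(a,i0,a)') 'scanned a chunk of ', n, ' values'
      if ( ios /= 0 ) exit            ! end of file reached
   end do
   close(u)
end program chunked_scan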

So if your computer memory …

Thanks again! I could also add a command-line argument for the minimum free memory margin to be maintained. Based on that, the code could choose a strategy for reading and storing the data, etc. I need to check whether there is an fpm library which returns the memory size, file size, etc. Of course, this assumes that no other program is started while this one is running.

For the time being, the data are spread in series (rows or columns), but I can handle them as 1d arrays since the (sub-)data are not related. And the size of the file is small compared to the memory.

In future, the data will be 2d (e.g. huge matrices or pictures), where the size is bigger but still manageable and small compared to the memory.

I shall need the above considerations when the data become 3d (e.g. astrophysical data, etc.).

Note: Don’t ask me for more info about the 2d or 3d data, since I do not yet have personal knowledge/experience about the why, what and how. Talfyr

you can even import the data into a database or sqlite file and then make queries on the data.

Thanks for the idea to use a database. I need to check if there are fpm libraries which support the use of in-memory databases.

The home page mentions that

SQLite reads and writes directly to ordinary disk files.

It would really be a solution in case of huge 3d data.

It really depends on the original input file, how big it is,

Estimated:
1d: 50KB - 1MB
2d: 3MB - 20MB
3d: 100MB - ?GB

Note: Don’t ask me for more info about the 2d or 3d data, since I do not yet have personal knowledge/experience about the why, what and how.

whether it is initially flat text or binary,

1d: flat text
2d: No idea
3d: No idea

how often you need to create and use the data subsets,

I need to create them only once.

and what platform you are running on (how much memory you have, …) and how big your biggest input may be in the future.

See the last part of my response to @msz59

… so I am wondering what is prompting you to look for alternatives? Does the current method take an excessive amount of time, or …(?).

The first (personal) part of the project (SPA of 1d time series) is over. I am trying to improve it in terms of code size, run time and beauty. Until I get more info about the 2d and 3d data SPA and its requirements, I will just investigate what is available, etc. Talfyr

I use existing fpm libraries to read the data from the file. In fact, I do not know what kind of techniques they use and how.

What I have observed is:

  • size(.csv) < size(.toml) < size(.json)

mainly because of code beautification.

  • time(.csv) < time(.toml), time(.json)

Note: I have not carried out any serious benchmarks. It’s just an observation!

One general approach to this problem of reading data of unknown length is to read the items one at a time and append them to a linked list data structure.

Thanks for the idea!

The downside is that at least temporarily you need to store two copies of the data, one in the linked list form and the other in the array.

As long as both are stored in memory, the cost is less than looping over the data a second time and applying the “complicated” criteria again.

@JohnCampbell

How many times do you read this data anyway?

The data are read from the file once, but they are scanned/parsed/looped twice: once to calculate the sizes of the (sub-)data arrays, and a second time to store the matched data in the allocated arrays. The latter loop is what I want to avoid.

The potential solutions are to temporarily store the (sub-)data (set):

  • in separate files (create, open, close, delete)
  • database, preferably in-memory
  • linked list data structure
  • using pointers

and then transfer them to the arrays once those have been allocated.
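
Whatever the temporary container is, the overall scheme would be the same. Just to fix ideas, here is a small self-contained sketch that uses a plain allocatable buffer, grown by doubling, as the temporary storage (the positive-value test is only a placeholder for the real criteria):

program grow_by_doubling
   implicit none
   real :: raw(10) = [3., -1., 7., 0., 5., -2., 9., 4., -8., 6.]
   real, allocatable :: tmp(:), matched(:)
   integer :: i, nmatch
   allocate( tmp(4) )                     ! small initial temporary buffer
   nmatch = 0
   do i = 1, size(raw)
      if ( raw(i) > 0.0 ) then            ! placeholder for the real criteria
         nmatch = nmatch + 1
         if ( nmatch > size(tmp) ) tmp = [tmp, tmp]   ! double the buffer when it is full
         tmp(nmatch) = raw(i)
      end if
   end do
   allocate( matched(nmatch) )            ! exact size known after the single pass
   matched = tmp(1:nmatch)                ! final transfer to the permanent array
   deallocate( tmp )                      ! release the temporarily "abused" memory
   write(*,'(a,*(1x,f4.1))') 'matched:', matched
end program grow_by_doubling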

Usually you have to revise the searching criteria …

There is no such need.

I have learnt to never correct the raw data, but apply correcting rules in the search program.

So far, there are no correcting rules, but it is something I need to keep in mind.

@Arjen Thanks for the ideas!

You can check out Brad Richardson / SQLite for Fortran · GitLab. There are examples of creating in-memory databases in the test suite, e.g. test/sqliteff_test.f90 · main · Brad Richardson / SQLite for Fortran · GitLab

@everythingfunctional Thanks for the info! I have also found jacobwilliams / flist, which also provides an fpm manifest file.

I don’t think this applies to this situation, but another consideration is whether the input file can be rewound and read again at all. This is usually allowed for a normal file, of course, but when you consider the other possibilities of interactive terminal input, pipes, and here documents, then those are sometimes not seekable, and the program must be prepared to read them just once.

Here is a toy linked list code:

program linkedlist
   implicit none
   type ll_type
      integer :: i = -1                   ! stored value; -1 also marks the unused tail node
      type(ll_type), allocatable :: next  ! recursive allocatable component
   end type ll_type
   type(ll_type), allocatable, target :: ll
   type(ll_type), pointer :: current
   integer :: n, i
   integer, allocatable :: array(:)
   allocate( ll )
   current => ll
   n = 0
   do                                     ! build the list, counting the values as they are read
      read(*,*) current%i
      if ( current%i < 0 ) exit           ! a negative value terminates the input
      n = n + 1
      allocate( current%next )
      current => current%next
   enddo
   allocate ( array(n) )                  ! the exact size is now known
   current => ll
   do i = 1, n                            ! walk the list again, copying into the array
      array(i) = current%i
      current => current%next
   enddo
   deallocate( ll )                       ! release the whole list (recursively)
   write(*,'(a,i0,a,*(1x,i0))') 'array(1:',n,')=', array
end program linkedlist

This reads an arbitrary number of nonnegative integers, allocates the array of the correct size, transfers the linked list to the array, and deallocates the linked list. To make this more useful, the read statement should be generalized, maybe using non-advancing I/O, to allow more than one integer per input line.
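
As a sketch of one such generalization (not literal non-advancing I/O: it reads each line into a string, counts the blank- or comma-separated tokens, and parses them with an internal read; the parsed values would be appended to the linked list at the indicated point):

program multiread
   implicit none
   character(len=4096) :: line
   integer, allocatable :: vals(:)
   integer :: ios, nvals, j
   logical :: intok
   do
      read(*, '(a)', iostat=ios) line
      if ( ios /= 0 ) exit                ! end of input
      nvals = 0                           ! count the tokens on this line
      intok = .false.
      do j = 1, len_trim(line)
         if ( line(j:j) /= ' ' .and. line(j:j) /= ',' ) then
            if ( .not. intok ) nvals = nvals + 1
            intok = .true.
         else
            intok = .false.
         end if
      end do
      if ( nvals == 0 ) cycle
      allocate( vals(nvals) )
      read(line, *, iostat=ios) vals      ! parse all the integers on the line at once
      if ( ios == 0 ) write(*,'(a,*(1x,i0))') 'parsed:', vals   ! append to the list here instead
      deallocate( vals )
   end do
end program multiread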


If you are the producer of the data you may use NetCDF or HDF5.
Personally, I have written NetCDF files from Fortran in a format compatible with the Python xarray library. This way I could easily read them back in Python with a single instruction.

I would use a standard binary format for large binary files (2d, 3d).

In the Fortran stdlib there is a module to read and write NumPy “.npy” files, which you may use if you prefer.
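
For example, assuming a stdlib version that provides the stdlib_io_npy module, saving and re-loading a 2d array could be as simple as:

program npy_example
   use stdlib_io_npy, only: save_npy, load_npy
   implicit none
   real :: a(3, 4)
   real, allocatable :: b(:, :)
   call random_number(a)
   call save_npy('matrix.npy', a)     ! write the 2d array to a .npy file
   call load_npy('matrix.npy', b)     ! read it back into an allocatable array
   print *, 'shape read back:', shape(b)
end program npy_example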


I would start out keeping it simple. Scan the input, creating and writing separate files as needed. When that’s done, rewind the files (or close them and open them for reading), read and process the data, and see how it goes.

Efficiency will depend on how you create the files. I believe that opening the files with ACCESS=STREAM and FORM=UNFORMATTED will create files that look like arrays, but I could be wrong, and I’m ready to stand corrected.

You might already be doing something just like this; if you are, my apologies.

If the program isn’t fast enough, I would consider looking at your hardware first. If you’re using a hard drive, does it have a decent transfer rate, does it have a decent amount of available space, and does it need to be defragmented? Could you use a solid state disk instead? If I’ve been reading your posts correctly, one terabyte should be enough, and you can get a 1TB SSD for less than $100.

If you’re already using an SSD and things are slow, is the drive healthy? Does it still have lots of available space, or has it accumulated a lot of bad blocks (or whatever it is that SSDs accumulate as they age)?

If all of that fails, then I would see if a software solution would work better.
