fortranDF: A simple data frame library for Fortran

Out of the needs of another project I am working on, I have created a library that implements a data frame data structure, called fortranDF (fortran Data Frames):

The idea is to be somewhat similar to the data frames found in Pandas and R, where the data is column-based. Each column can be a different intrinsic type, but all data within a column must be of the same type. Each column can also have a header, which can be used to retrieve the associated column.

Here is some example code that builds a data frame with an integer column and a real column and prints the whole data frame to the screen:

program example
    use iso_fortran_env,only: OUTPUT_UNIT
    use df_precision,only: rk
    use df_fortranDF,only: data_frame
    implicit none

    type(data_frame) :: df

    call df%new()
    call df%append([1.0_rk,2.0_rk,3.0_rk],"real_data")
    call df%append([1,2,3],"integer_data")

    call df%write(OUTPUT_UNIT)

end program example

Right now the library is just the data container/structure, with some very limited IO procedures (write to screen and read from a file with a very specific format). I hope to change this in the future (ideally adding at least CSV support), but for now IO is delegated to the user via extended types.

As an example of extending the data_frame type, I have also made fortranMR (fortran MESA reader) public, which is a very bare-bones and unpolished library for reading in a type of file that the stellar evolution program MESA outputs. fortranMR is really for my own use, but it happens to also serve as a decent example of making user-defined extensions to the data_frame type.
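As a rough illustration of that extension pattern (only the data_frame type name comes from fortranDF; the extended type and its procedure are hypothetical), a user-defined reader might look like:

```fortran
! Hypothetical sketch of extending data_frame for custom IO.
module my_reader_mod
    use df_fortranDF, only: data_frame
    implicit none

    type, extends(data_frame) :: my_reader
    contains
        procedure :: read_my_format   ! user-defined reader for a custom format
    end type
contains
    subroutine read_my_format(self, filename)
        class(my_reader), intent(inout) :: self
        character(*), intent(in) :: filename
        ! open the file, parse it, and fill the frame using the
        ! inherited new/append procedures shown in the example above
    end subroutine
end module
```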

I am interested in any feedback on this project, including feature ideas, code criticism, documentation, or even whether you think it would be useful.


Thanks for your project. As a heavy user of Python pandas, I have wondered how closely Fortran could replicate a pandas or R data frame. For my own use I wrote a “data frame” type in Fortran restricted to double precision values, character column headers, and type(date) (defined by me) index values, and it was useful.

If a data frame has n1 rows and n2 columns, you could add an allocatable component valid(:,:) with shape [n1, n2] that is .true. where data exists. You could set non-existent real data to NaN, but there is no equivalent for data of other types. If valid is not allocated it means all data exists.
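A minimal sketch of that idea, assuming a simplified frame holding only real data (all names here are hypothetical, not part of fortranDF):

```fortran
module masked_frame_mod
    use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan
    implicit none

    type :: masked_frame
        real, allocatable    :: rdata(:,:)  ! n1 rows by n2 real columns
        logical, allocatable :: valid(:,:)  ! .true. where data exists;
                                            ! unallocated means all data exists
    end type
contains
    subroutine mark_missing(df, i, j)
        type(masked_frame), intent(inout) :: df
        integer, intent(in) :: i, j
        ! allocate the mask lazily, defaulting to "all data exists"
        if (.not. allocated(df%valid)) then
            allocate(df%valid(size(df%rdata,1), size(df%rdata,2)), source=.true.)
        end if
        df%valid(i,j) = .false.
        ! for real data, also flag the value itself with a quiet NaN
        df%rdata(i,j) = ieee_value(df%rdata(i,j), ieee_quiet_nan)
    end subroutine
end module
```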

Instead of having columns (1-D arrays) of arbitrary type, another implementation of a data frame would have it be composed of 2-D arrays of arbitrary type, since it is more efficient to deal with a 2-D array than an array of 1-D arrays, and since data frames will often be composed of groups of columns of the same type.
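One possible shape for that layout, assuming only real and integer columns (the names are illustrative): columns of the same type are stored side by side in one 2-D block, and a lookup table ties each header to its block and position.

```fortran
type :: block_frame
    real,    allocatable :: rblock(:,:)          ! all real columns
    integer, allocatable :: iblock(:,:)          ! all integer columns
    character(len=:), allocatable :: headers(:)  ! one entry per logical column
    integer, allocatable :: col_kind(:)          ! which block each column lives in
    integer, allocatable :: col_index(:)         ! its position within that block
end type
```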


I haven’t had a chance to look at your library closely yet, but one of the things I used to really like about Pandas was the ability to read in a file like

City, State, Mayor, Population
Nowhere, CA, Person, N/A
Somewhere, AL, , 2000

and be able to do something like (I may have the indexing backwards)

print(data["Nowhere"]["Mayor"]) # prints "Person"
print(data["Somewhere"]["Population"]) # prints 2000

and have things not explode. Part of what makes that possible is Python’s duck-typing. Not sure how close you can get to that in Fortran, but it’s what made data exploration in Python so easy.


If a data frame has n1 rows and n2 columns, you could add an allocatable component valid(:,:) with shape [n1, n2] that is .true. where data exists.

Originally, I had thought about letting each column have its own length, but at some point I decided to enforce that all columns be the same length, for ease of implementation. With my current implementation, it is possible for the user to append columns all of the same length but where the ‘good’ data ends early (and then keep track of the size of the ‘good’ data on their own). Currently I have a class member called nrows that is just a scalar, but it might be worth adding a max_nrows scalar and turning nrows into a rank-1 array to keep track of the ‘good’ data. But I do like your idea of a mask array as well.
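That bookkeeping change might look something like this (hypothetical names, not the current fortranDF type):

```fortran
! Sketch: track the allocated length separately from the 'good' length
! of each column.
type :: data_frame_v2
    integer :: max_nrows              ! allocated length of every column
    integer, allocatable :: nrows(:)  ! length of the 'good' data, per column
    ! ... data storage as before ...
end type
```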

Instead of having columns (1-D arrays) of arbitrary type, another implementation of a data frame would have it be composed of 2-D arrays of arbitrary type, since it is more efficient to deal with a 2-D array than an array of 1-D arrays, and since data frames will often be composed of groups of columns of the same type.

This is also a very good idea. I think I had thought about something similar, but again thought it would be too difficult initially (mainly due to wanting to associate headers with columns). I have some new ideas on how to deal with the headers though, so I think I will revisit this.

Thanks for your input @Beliavsky.

I didn’t know that this was possible in Python. I will admit that I haven’t used data frames in Python that much but I did really like the idea of that type of container.

I think there would be a way to get similar functionality to your code snippet (albeit in more lines). I will assume that

City, State, Mayor, Population
Nowhere, CA, Person, N/A
Somewhere, AL, , 2000

is already stored in a data_frame object called df and where City, State, Mayor, Population are the headers.

We could then have something along the lines of

index = findloc(df%getch("City"),"Nowhere",dim=1)
print*, df%getch("Mayor",index)   ! prints Person

index = findloc(df%getch("City"),"Somewhere",dim=1)
print*, df%getch("Population",index)   ! prints 2000

where index is an integer, and the value 2000 must be a character string. This functionality could be hidden away in a function and overloaded alongside the getch functions. It seems to me that the "City" column is being treated as a second header, but I would want to look more into the specific behaviour that Python allows before implementing this.
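For instance, the two lookups above could be folded into a single helper. This is only a sketch: it assumes getch returns a rank-1 character array when given just a header and a scalar when given a header plus row index, as in the snippet above, and it does no error handling for a key that is not found.

```fortran
! Hypothetical wrapper: find the row where key_header equals key,
! then return the value of target_header in that row.
function lookup(df, key_header, key, target_header) result(val)
    type(data_frame), intent(in) :: df
    character(*), intent(in) :: key_header, key, target_header
    character(:), allocatable :: val
    integer :: idx
    idx = findloc(df%getch(key_header), key, dim=1)
    val = df%getch(target_header, idx)
end function lookup
```

Usage would then mirror the Python one-liners: `print *, lookup(df, "Nowhere", "City", "Mayor")` becomes `print *, lookup(df, "City", "Nowhere", "Mayor")`.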

This is a very interesting suggestion @everythingfunctional, thanks for the input.

In case this project grows more complex at some point, it can be worth keeping the Apache Arrow project in mind when thinking about the in-memory storage format.


Great work @jaiken!

It definitely will be useful! I’m convinced that the lack of data analysis libraries like Pandas is holding people back from moving over to Fortran. If you couple this with interactive notebooks via LFortran, then we’re onto a winner.

Coincidentally, creating a Fortran dataframe library has been on my to-do list for far too long, so I’m very pleased to see this!

It might be worth keeping Polars in mind too (which follows the Arrow model). I think that is written in Rust. It would be nice if we have a Fortran Arrow-like library that outperforms Polars (because Polars boasts about how fast it is!).

Thanks @samharrison7 and @scottza for the input.

So far, I haven’t worried much about performance as I really just wanted something that worked and wouldn’t take too long to build. But as I progress with the project, I will keep performance in mind and will definitely look more closely into Polars and Apache Arrow.

1 Like