fortranDF: A simple data frame library for Fortran

Out of the needs of another project I am working on, I have created a library that implements a data frame data structure, called fortranDF (fortran Data Frames):

The idea is to be somewhat similar to the data frames found in Pandas and R, where the data is column-based. Each column can be a different intrinsic type, but all data within a column must be of the same type. Each column can also have a header, which can be used to retrieve the associated column.

Here is some example code that builds a data frame with an integer column and a real column and prints the whole data frame to the screen:

program example
    use iso_fortran_env,only: OUTPUT_UNIT
    use df_precision,only: rk
    use df_fortranDF,only: data_frame
    implicit none

    type(data_frame) :: df

    call df%new()
    call df%append([1.0_rk,2.0_rk,3.0_rk],"real_data")
    call df%append([1,2,3],"integer_data")

    call df%write(OUTPUT_UNIT)

end program example

Right now the library is just the data container/structure, with some very limited IO procedures (write to screen and read from a file with a very specific format). I hope to change this in the future (ideally adding at least CSV support), but for now IO is delegated to the user via extended types.

As an example of extending the data_frame type, I have also made fortranMR (fortran MESA reader) public, which is a very bare-bones and unpolished library for reading in a type of file that the stellar evolution program MESA outputs. fortranMR is really for my own use, but it happens to also serve as a decent example of making user-defined extensions to the data_frame type.
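As a rough illustration of that extension pattern (only the data_frame type name comes from fortranDF; the extended type and its procedure are hypothetical), a user-defined reader might look like:

```fortran
! Hypothetical sketch of extending data_frame for custom IO.
module my_reader_mod
    use df_fortranDF, only: data_frame
    implicit none

    type, extends(data_frame) :: my_reader
    contains
        procedure :: read_my_format   ! user-defined reader for a custom format
    end type
contains
    subroutine read_my_format(self, filename)
        class(my_reader), intent(inout) :: self
        character(*), intent(in) :: filename
        ! open the file, parse it, and fill the frame using the
        ! inherited new/append procedures shown in the example above
    end subroutine
end module
```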

I am interested in any feedback on this project, including feature ideas, code criticism, documentation, or even whether you think it would be useful.


Thanks for your project. As a heavy user of Python pandas, I have wondered how closely Fortran could replicate a pandas or R data frame. For my own use I wrote a “data frame” type in Fortran restricted to double precision values, character column headers, and type(date) (defined by me) index values, and it was useful.

If a data frame has n1 rows and n2 columns, you could add an allocatable component valid(:,:) with shape [n1, n2] that is .true. where data exists. You could set non-existent real data to NaN, but there is no equivalent for data of other types. If valid is not allocated it means all data exists.
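A minimal sketch of that idea, assuming a simplified frame holding only real data (all names here are hypothetical, not part of fortranDF):

```fortran
module masked_frame_mod
    use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan
    implicit none

    type :: masked_frame
        real, allocatable    :: rdata(:,:)  ! n1 rows by n2 real columns
        logical, allocatable :: valid(:,:)  ! .true. where data exists;
                                            ! unallocated means all data exists
    end type
contains
    subroutine mark_missing(df, i, j)
        type(masked_frame), intent(inout) :: df
        integer, intent(in) :: i, j
        ! allocate the mask lazily, defaulting to "all data exists"
        if (.not. allocated(df%valid)) then
            allocate(df%valid(size(df%rdata,1), size(df%rdata,2)), source=.true.)
        end if
        df%valid(i,j) = .false.
        ! for real data, also flag the value itself with a quiet NaN
        df%rdata(i,j) = ieee_value(df%rdata(i,j), ieee_quiet_nan)
    end subroutine
end module
```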

Instead of having columns (1-D arrays) of arbitrary type, another implementation of a data frame would have it be composed of 2-D arrays of arbitrary type, since it is more efficient to deal with a 2-D array than an array of 1-D arrays, and since data frames will often be composed of groups of columns of the same type.
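One possible shape for that layout, assuming only real and integer columns (the names are illustrative): columns of the same type are stored side by side in one 2-D block, and a lookup table ties each header to its block and position.

```fortran
type :: block_frame
    real,    allocatable :: rblock(:,:)          ! all real columns
    integer, allocatable :: iblock(:,:)          ! all integer columns
    character(len=:), allocatable :: headers(:)  ! one entry per logical column
    integer, allocatable :: col_kind(:)          ! which block each column lives in
    integer, allocatable :: col_index(:)         ! its position within that block
end type
```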


I haven’t had a chance to look at your library closely yet, but one of the things I used to really like about Pandas was the ability to read in a file like

City, State, Mayor, Population
Nowhere, CA, Person, N/A
Somewhere, AL, , 2000

and be able to do something like (I may have the indexing backwards)

print(data["Nowhere"]["Mayor"]) # prints "Person"
print(data["Somewhere"]["Population"]) # prints 2000

and have things not explode. Part of what makes that possible is Python’s duck-typing. Not sure how close you can get to that in Fortran, but it’s what made data exploration in Python so easy.


If a data frame has n1 rows and n2 columns, you could add an allocatable component valid(:,:) with shape [n1, n2] that is .true. where data exists.

Originally, I had thought about letting each column have its own length, but at some point I decided to enforce that all columns be the same length, for ease of implementation. With my current implementation, it is possible for the user to append columns all of the same length but where the ‘good’ data ends early (and then keep track of the size of the ‘good’ data on their own). Currently I have a class member called nrows that is just a scalar, but it might be worth adding a max_nrows scalar and turning nrows into a rank-1 array to keep track of the ‘good’ data. But I do like your idea of a mask array as well.
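That bookkeeping change might look something like this (hypothetical names, not the current fortranDF type):

```fortran
! Sketch: track the allocated length separately from the 'good' length
! of each column.
type :: data_frame_v2
    integer :: max_nrows              ! allocated length of every column
    integer, allocatable :: nrows(:)  ! length of the 'good' data, per column
    ! ... data storage as before ...
end type
```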

Instead of having columns (1-D arrays) of arbitrary type, another implementation of a data frame would have it be composed of 2-D arrays of arbitrary type, since it is more efficient to deal with a 2-D array than an array of 1-D arrays, and since data frames will often be composed of groups of columns of the same type.

This is also a very good idea. I think I had thought about something similar, but again thought it would be too difficult initially (mainly due to wanting to associate headers with columns). I have some new ideas on how to deal with the headers though, so I think I will revisit this.

Thanks for your input @Beliavsky.

I didn’t know that this was possible in Python. I will admit that I haven’t used data frames in Python that much but I did really like the idea of that type of container.

I think there would be a way to get similar functionality to your code snippet (albeit in more lines). I will assume that

City, State, Mayor, Population
Nowhere, CA, Person, N/A
Somewhere, AL, , 2000

is already stored in a data_frame object called df and where City, State, Mayor, Population are the headers.

We could then have something along the lines of

index = findloc(df%getch("City"),"Nowhere",dim=1)
print*, df%getch("Mayor",index)   ! prints Person

index = findloc(df%getch("City"),"Somewhere",dim=1)
print*, df%getch("Population",index)   ! prints 2000

where index is an integer, and the value 2000 must be a character string. This functionality could be hidden away in a function and overloaded alongside the getch functions. It seems to me that the "City" column is being treated as a second header, but I would want to look more into the specific behaviour that Python allows before implementing this.
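For instance, the two lookups above could be folded into a single helper. This is only a sketch: it assumes getch returns a rank-1 character array when given just a header and a scalar when given a header plus row index, as in the snippet above, and it does no error handling for a key that is not found.

```fortran
! Hypothetical wrapper: find the row where key_header equals key,
! then return the value of target_header in that row.
function lookup(df, key_header, key, target_header) result(val)
    type(data_frame), intent(in) :: df
    character(*), intent(in) :: key_header, key, target_header
    character(:), allocatable :: val
    integer :: idx
    idx = findloc(df%getch(key_header), key, dim=1)
    val = df%getch(target_header, idx)
end function lookup
```

Usage would then mirror the Python one-liners: `print *, lookup(df, "Nowhere", "City", "Mayor")` becomes `print *, lookup(df, "City", "Nowhere", "Mayor")`.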

This is a very interesting suggestion @everythingfunctional, thanks for the input.

In case this project grows more complex at some point, it can be worth keeping the Apache Arrow project in mind when thinking about the in-memory storage format.


Great work @jaiken!

It definitely will be useful! I’m convinced that the lack of data analysis libraries like Pandas is holding people back from moving over to Fortran. If you couple this with interactive notebooks via LFortran, then we’re onto a winner.

Coincidentally, creating a Fortran dataframe library has been on my to-do list for far too long, so I’m very pleased to see this!

It might be worth keeping Polars in mind too (which follows the Arrow model). I think that is written in Rust. It would be nice if we have a Fortran Arrow-like library that outperforms Polars (because Polars boasts about how fast it is!).

Thanks @samharrison7 and @scottza for the input.

So far, I haven’t worried much about performance as I really just wanted something that worked and wouldn’t take too long to build. But as I progress with the project, I will keep performance in mind and will definitely look more closely into Polars and Apache Arrow.

1 Like