Fortran libraries to work with parquet files

Does anybody know of a Fortran interface to the Apache Arrow project (Apache Arrow v11.0.0)? I am particularly interested in any Fortran tool that can read and write parquet files, if one exists.

In this technical report from NASA on cloud-optimised formats, parquet is rated as a good complement to cloud-optimised GeoTIFF for handling Earth observation data in the cloud. On page 14 they say that, of the five languages they consider (Python, R, C, C++, Fortran), Fortran is the only one without a parquet library, although the C library can be used through C interoperability.

While I’d like to look into it and see whether writing the interfaces is within my reach, I don’t have the right knowledge to get started. Would the group here have any advice on how to get started with something like this?

I gather Apache Arrow is C++ and the C bindings depend on GLib (GNOME?). One option is Fortran bindings to the DuckDB C API, which has parquet read/write. Extending the existing fortran bindings for the parquet sqlite extension is another option I guess.

Thanks for your reply @freevryheid. Indeed Arrow is a C++ project, and you are right that they have C libraries through GLib that provide access to the project from various languages. Additionally, Arrow does a lot more than just handle parquet, so maybe it’s not the best choice for a dependency?

With the additional digging I have done in the last few days, I have found this documentation for the C (GLib) API: Apache Arrow GLib (C) | Apache Arrow, which also covers parquet specifically. My plan, based on what I’ve gathered so far, was to explore whether that can be used directly from Fortran. So far I have managed to install Arrow from one of the distributed builds, take an example from their repository for a C++ parquet roundtrip (https://github.com/apache/arrow/blob/main/cpp/examples/parquet/parquet_arrow/reader_writer.cc), link it against the library and run it. That was relatively easy.
Next I want to try to read/write that same parquet file in C with the GLib (C) API and then try from Fortran. Maybe along the way I’ll find some blocker.

I should also look into the DuckDB and SQLite options you mentioned, thanks. I am not familiar with either project. I see DuckDB is also a C++ project, but I guess it exposes a C API that would allow binding only the parquet read/write functionality.

Extending the existing fortran bindings for the parquet sqlite extension is another option I guess.

I didn’t easily find a reference to these. Could you point me in the right direction?

I’ve included links in my original post.

Since we already have sqlite bindings for fortran you could read parquet files through the extension. To write I’m guessing you’ll need to export/convert from csv.

Started working on duckdb fortran bindings - just the basics to run SQL queries, which can be used to read/write parquet files. I’ll flesh this out as I find time but we could work together on this if you’d like.
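For readers who want to see what "run SQL queries to read/write parquet" could look like, here is a minimal sketch. It does not use the bindings mentioned above: the interfaces are hand-written for this post from the DuckDB C API (duckdb_open, duckdb_connect, duckdb_query, duckdb_disconnect, duckdb_close), duckdb_state is mapped to integer(c_int) as an assumption, and the table and file names are made up. It assumes a DuckDB build with parquet support and must be linked against libduckdb.

```fortran
program duckdb_parquet_sketch
  use, intrinsic :: iso_c_binding, only: c_ptr, c_null_ptr, c_char, &
       c_int, c_null_char
  implicit none

  interface
     ! duckdb_state duckdb_open(const char *path, duckdb_database *out_database);
     function duckdb_open(path, out_database) bind(c)
       import :: c_ptr, c_char, c_int
       implicit none
       integer(c_int) :: duckdb_open
       character(kind=c_char), dimension(*) :: path
       type(c_ptr) :: out_database
     end function
     ! duckdb_state duckdb_connect(duckdb_database database, duckdb_connection *out_connection);
     function duckdb_connect(database, out_connection) bind(c)
       import :: c_ptr, c_int
       implicit none
       integer(c_int) :: duckdb_connect
       type(c_ptr), value :: database
       type(c_ptr) :: out_connection
     end function
     ! duckdb_state duckdb_query(duckdb_connection connection, const char *query, duckdb_result *out_result);
     function duckdb_query(connection, query, out_result) bind(c)
       import :: c_ptr, c_char, c_int
       implicit none
       integer(c_int) :: duckdb_query
       type(c_ptr), value :: connection
       character(kind=c_char), dimension(*) :: query
       type(c_ptr), value :: out_result   ! NULL here: the result is not kept
     end function
     ! void duckdb_disconnect(duckdb_connection *connection);
     subroutine duckdb_disconnect(connection) bind(c)
       import :: c_ptr
       implicit none
       type(c_ptr) :: connection
     end subroutine
     ! void duckdb_close(duckdb_database *database);
     subroutine duckdb_close(database) bind(c)
       import :: c_ptr
       implicit none
       type(c_ptr) :: database
     end subroutine
  end interface

  type(c_ptr) :: db, con

  db = c_null_ptr
  con = c_null_ptr

  ! Open an in-memory database and connect to it (0 means DuckDBSuccess).
  if (duckdb_open(':memory:'//c_null_char, db) /= 0) error stop 'duckdb_open failed'
  if (duckdb_connect(db, con) /= 0) error stop 'duckdb_connect failed'

  ! Build a tiny table, write it to parquet, and read it back into a new table.
  call run("CREATE TABLE dates AS SELECT 1 AS days, 1 AS months, 1990 AS years")
  call run("COPY dates TO 'dates.parquet' (FORMAT PARQUET)")
  call run("CREATE TABLE dates_copy AS SELECT * FROM read_parquet('dates.parquet')")

  call duckdb_disconnect(con)
  call duckdb_close(db)

contains

  ! Hypothetical helper: run one SQL statement and stop on failure.
  subroutine run(sql)
    character(len=*), intent(in) :: sql
    if (duckdb_query(con, sql//c_null_char, c_null_ptr) /= 0) then
       print '(2a)', 'query failed: ', sql
       error stop 1
    end if
  end subroutine run

end program duckdb_parquet_sketch
```

Getting the column values back into Fortran arrays would need the duckdb_result part of the C API, which this sketch deliberately leaves out.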

Awesome. Yes, I’d definitely like to contribute. I will take a look at the repository in a bit.

If ever you are ready for that adventure, you may look at:

I have made a test with the header files in the directories /usr/include/arrow-glib and /usr/include/parquet-glib and, with some minimal tweaking (adding some missing GLib types in the wrapper and modifying a regex), obtained 1019 Fortran interfaces like:

!GParquetArrowFileWriter * gparquet_arrow_file_writer_new_path(GArrowSchema *schema, const gchar *path, GParquetWriterProperties *writer_properties, GError **error);
function gparquet_arrow_file_writer_new_path(schema, path, writer_properties,&
& error) bind(c)
  import :: c_ptr, c_char
  implicit none
  type(c_ptr) :: gparquet_arrow_file_writer_new_path
  type(c_ptr), value :: schema
  character(kind=c_char), dimension(*) :: path
  type(c_ptr), value :: writer_properties
  type(c_ptr), value :: error
end function

If it is those interfaces (arrow-glib and parquet-glib) that you need, I can send you the interfaces for testing.

Sure, I’d like to test your auto-generated interfaces. I have a couple of basic tests that I could run, and I can add more if it looks promising.

I have sent the two files in a private message. Let me know if you succeed in compiling them and using the functions you need.

@vmagnin I’m testing the automatic interfaces here: GitHub - lucin81/arrow-fortran. I’ve also shared the development environment I am using (vscode, docker, the Modern Fortran extension and fpm), which should make it very easy for anyone using docker and vscode to fully reproduce this.

I think all that is amazing and in my mind it was unthinkable until a few years ago for a Fortran project. So very cool to use fpm and the modern fortran extension for vscode!

This is all basic getting-started material. I’ve been following the Arrow C++ getting started guide and implementing the same steps with the Fortran bindings created by your cfwrapper script. The idea is to add data into an Arrow table and then do I/O operations in the next chapter. If I manage to get it to work properly, I will look into organising the interfaces to make them more user friendly.

I’ve managed to work through the first functions and I’m now seeing an issue that maybe someone here knows about:

The function garrow_schema_new, which creates a schema from a list of fields and is defined in the API reference, was translated by cfwrapper as follows:

!GArrowSchema *garrow_schema_new (GList *fields);
function garrow_schema_new(fields) bind(c)
  import :: c_ptr
  implicit none
  type(c_ptr) :: garrow_schema_new
  type(c_ptr), value :: fields
end function

Here the fields argument is a GList (a doubly linked list) in C, and cfwrapper turns it into type(c_ptr), value :: fields. I’m trying to use it as:

  type(c_ptr) :: field_days, field_months, field_years
  type(c_ptr) :: fields_array(3)
  type(c_ptr) :: schema 

  field_days = garrow_field_new('days'//c_null_char, garrow_int8_data_type_new())
  field_months = garrow_field_new('months'//c_null_char, garrow_int8_data_type_new())
  field_years = garrow_field_new('years'//c_null_char, garrow_int16_data_type_new())
  fields_array = [field_days, field_months, field_years]
  schema = garrow_schema_new(fields_array)

But that gives a memory error:

invalid uninstantiatable type 'void' in cast to 'GArrowField'
(process:77109): GLib-GObject-WARNING **: 16:52:26.625: invalid uninstantiatable type 'void' in cast to 'GArrowField'

(process:77109): GLib-GObject-WARNING **: 16:52:26.625: invalid uninstantiatable type 'void' in cast to 'GArrowField'

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0xffffa4f1d08b in ???
#1  0xffffa4f1c047 in ???
#2  0xffffa520c78f in ???
#3  0xffffa5170ac0 in ???
#4  0xffffa5110f27 in ???
#5  0xffffa513131b in ???
#6  0xaaaae3180e13 in MAIN__
        at app/main.f90:75
#7  0xaaaae3180e53 in main
        at app/main.f90:5
Segmentation fault
<ERROR> Execution failed for object " parquet-glib "
<ERROR>*cmd_run*:stopping due to failed executions
STOP 1

Any idea how to deal with glibc GList object from Fortran?

Writing the Fortran interfaces is half the work (or less than half the work)… When you use basic C types, it’s easy. But with derived types and elaborate structures, things become harder.

On the C side I have that PDF on GTK development which describes GLists:
http://devernay.free.fr/cours/IHM/GTK/GGAD.pdf
It’s not fresh (1999) but I guess that that part of the GLib must be stable.

I have also GLib/GLib Data Types — Wikiversité but sorry it’s in French.

Personally, I have never used GLists but J. Tappin (the author of the High Level library) used some g_list or g_slist functions in those files of gtk-fortran:
src/gdk-pixbuf-hl.f90
src/gtk-hl-button.f90
src/gtk-hl-chooser.f90
src/gtk-hl-tree.f90 (maybe the most interesting)
Maybe we could find some interesting code there…

I will try to have a look at your GitHub code.

Remark: do not confuse GLib and glibc. GLib is a library used by GNOME applications and GTK.

And if you need GList functions, you will need to copy the module src/glib-auto.f90 from gtk-fortran into your src directory.

You may also be interested in the GLib documentation:

struct GList {
  gpointer data;
  GList* next;
  GList* prev;
}

So I guess a GList would be in Fortran something like:

  type, bind(c) :: GList
     type(c_ptr) :: data
     type(c_ptr) :: next
     type(c_ptr) :: prev
  end type GList
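To make the layout concrete, here is a small sketch that uses that derived type without GLib at all: two nodes are linked by hand and then walked the way C code would walk a GList. The program and variable names are made up for illustration, and in real code the nodes would be allocated by GLib and only handled through type(c_ptr).

```fortran
program glist_layout_sketch
  use, intrinsic :: iso_c_binding, only: c_ptr, c_null_ptr, c_loc, &
       c_f_pointer, c_associated, c_int
  implicit none

  ! Mirrors the C struct shown in the GLib documentation.
  type, bind(c) :: GList
     type(c_ptr) :: data
     type(c_ptr) :: next
     type(c_ptr) :: prev
  end type GList

  type(GList), target :: node1, node2
  integer(c_int), target :: a, b
  type(c_ptr) :: cursor
  type(GList), pointer :: node
  integer(c_int), pointer :: payload

  a = 10
  b = 20

  ! Link the two nodes by hand: node1 <-> node2, NULL at both ends.
  node1 = GList(c_loc(a), c_loc(node2), c_null_ptr)
  node2 = GList(c_loc(b), c_null_ptr, c_loc(node1))

  ! Walk the list from its head until the next pointer is NULL.
  cursor = c_loc(node1)
  do while (c_associated(cursor))
     call c_f_pointer(cursor, node)          ! view the C pointer as a GList
     call c_f_pointer(node%data, payload)    ! view the data as an integer
     print '(a,i0)', 'payload = ', payload
     cursor = node%next
  end do

end program glist_layout_sketch
```

Note that the list as a whole is identified only by the pointer to its first node, which is why a plain type(c_ptr) is enough on the Fortran side.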

Thanks for mentioning. That was confusing to me.

Indeed, I was wondering whether that is what’s needed here. Why does the cfwrapper script convert the GList argument to type(c_ptr)? Is it possible to just ignore the details of the GLib data types and let C deal with them, or do we need to have all the GLib types available in Fortran and therefore, I guess, modify the auto-generated bindings to use the GLib types?

From a quick look at how some of the functions operating on GList are used in some of the gtk-fortran sources you mentioned they seem to operate on type(c_ptr) objects. I don’t see a fortran derived type that represents a GList. But maybe I need to take a closer look, thanks for pointing at those.

Yes, you are right, most of the time you don’t have to implement the GLib type. You call a GLib function that creates an “object” with that type and returns a C pointer to that object, and you pass that pointer to another function, etc.

Looking at the GList API in the doc, I see no g_list_new() function, contrary to what I expected. But it seems you could use g_list_alloc() to create an empty GList (and obtain a pointer to that list), and then g_list_append() to append data to that list, and so on.

Then you could pass the pointer to garrow_schema_new()

To be more precise, I think that if you want to respect the GPL license and don’t mean to distribute your work under the same license, what you can do is just pick from glib-auto.f90 the interfaces you need for your project.

We can consider that a Fortran interface is a generic code (no creativity) and can therefore not be protected by a license: there is nearly only one way to write an interface to a C function (no choice on the data types, and you will generally use the same names).

But a file containing 4000 interfaces can be licensed as it implies either creativity (an automatic wrapper) or a considerable amount of work. That same idea is expressed in the databases protection laws: “You or the maker of the database can prevent the extraction and/or reuse of the whole or a substantial part of the database’s content”. The important word being substantial.

I’m actually very happy to share this with the appropriate license. I’ve added the GPL license file, hope that works.


As you want!
Have you made some progress with the GLists?

I did try it, but unfortunately it does not solve the issue: there is still a segmentation fault when calling the garrow_schema_new function.

type(c_ptr) :: fields_list
type(c_ptr) :: schema

fields_list = g_list_alloc()
fields_list = g_list_append(fields_list, field_days)
fields_list = g_list_append(fields_list, field_months)
fields_list = g_list_append(fields_list, field_years)
schema = garrow_schema_new(fields_list)
Output:
root@9ebf5bb08910:/workspaces/parquet-glib# fpm run 
main.f90                               done.
parquet-glib                           done.
[100%] Project compiled successfully.
 days: int8
 months: int8
 years: int16

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0xffff83a3d08b in ???
#1  0xffff83a3c047 in ???
#2  0xffff83e7378f in ???
#3  0xffff83d80ee0 in ???
#4  0xffff83da131b in ???
#5  0xaaaad75e13cf in MAIN__
        at app/main.f90:94
#6  0xaaaad75e1413 in main
        at app/main.f90:5
Segmentation fault
<ERROR> Execution failed for object " parquet-glib "
<ERROR>*cmd_run*:stopping due to failed executions
STOP 1

Edit: added valgrind output

valgrind output
root@9ebf5bb08910:/workspaces/parquet-glib# fpm run --runner 'valgrind --track-origins=yes'
parquet-glib                           done.
[100%] Project compiled successfully.
==12293== Memcheck, a memory error detector
==12293== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==12293== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==12293== Command: build/gfortran_2A42023B310FA28D/app/parquet-glib
==12293== 
==12293== Conditional jump or move depends on uninitialised value(s)
==12293==    at 0x50D8178: arrow::ArrayData::Make(std::shared_ptr<arrow::DataType>, long, std::vector<std::shared_ptr<arrow::Buffer>, std::allocator<std::shared_ptr<arrow::Buffer> > >, long, long) (in /usr/lib/aarch64-linux-gnu/libarrow.so.1100.0.0)
==12293==    by 0x51AB4D3: ??? (in /usr/lib/aarch64-linux-gnu/libarrow.so.1100.0.0)
==12293==    by 0x50C0F27: arrow::ArrayBuilder::Finish(std::shared_ptr<arrow::Array>*) (in /usr/lib/aarch64-linux-gnu/libarrow.so.1100.0.0)
==12293==    by 0x48DC0D3: garrow_array_builder_finish (in /usr/lib/aarch64-linux-gnu/libarrow-glib.so.1100.0.0)
==12293==    by 0x1090AB: MAIN__ (main.f90:39)
==12293==    by 0x109413: main (main.f90:5)
==12293==  Uninitialised value was created by a stack allocation
==12293==    at 0x109014: MAIN__ (main.f90:1)
==12293== 
 days: int8
 months: int8
 years: int16
==12293== Invalid read of size 8
==12293==    at 0x4910EE0: garrow_field_get_raw(_GArrowField*) (in /usr/lib/aarch64-linux-gnu/libarrow-glib.so.1100.0.0)
==12293==    by 0x493131B: garrow_schema_new (in /usr/lib/aarch64-linux-gnu/libarrow-glib.so.1100.0.0)
==12293==    by 0x1093CF: MAIN__ (main.f90:94)
==12293==    by 0x109413: main (main.f90:5)
==12293==  Address 0xffffffffffffffe0 is not stack'd, malloc'd or (recently) free'd
==12293== 

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x4b3d08b in ???
#1  0x4b3c047 in ???
#2  0x580b917f in ???
#3  0x4910ee0 in ???
#4  0x493131b in ???
#5  0x1093cf in MAIN__
        at app/main.f90:94
#6  0x109413 in main
        at app/main.f90:5
==12293== 
==12293== Process terminating with default action of signal 11 (SIGSEGV)
==12293==    at 0x4D2F200: __pthread_kill_implementation (pthread_kill.c:44)
==12293==    by 0x4CEA67B: raise (raise.c:26)
==12293==    by 0x580B917F: ??? (in /usr/libexec/valgrind/memcheck-arm64-linux)
==12293== 
==12293== HEAP SUMMARY:
==12293==     in use at exit: 335,520 bytes in 2,330 blocks
==12293==   total heap usage: 4,003 allocs, 1,673 frees, 563,066 bytes allocated
==12293== 
==12293== LEAK SUMMARY:
==12293==    definitely lost: 0 bytes in 0 blocks
==12293==    indirectly lost: 0 bytes in 0 blocks
==12293==      possibly lost: 432 bytes in 1 blocks
==12293==    still reachable: 330,816 bytes in 2,286 blocks
==12293==         suppressed: 0 bytes in 0 blocks
==12293== Rerun with --leak-check=full to see details of leaked memory
==12293== 
==12293== For lists of detected and suppressed errors, rerun with: -s
==12293== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
Segmentation fault
<ERROR> Execution failed for object " parquet-glib "
<ERROR>*cmd_run*:stopping due to failed executions
STOP 1

Same problem here in a Fedora Rawhide virtual machine.

But replacing this line:

  fields_list = g_list_alloc()

by:

  fields_list = c_null_ptr

It runs. I have just mimicked C code I found in a GLib example.

So it does not crash. Does it really do what you want? I don’t know, but there is hope.
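A likely explanation, for anyone finding this later: in GLib an empty list is simply a NULL pointer, while g_list_alloc() allocates one element whose data pointer is NULL. Starting from g_list_alloc() would therefore give a four-element list whose first element holds no field, which would fit the invalid read in garrow_field_get_raw in the valgrind output. Below is a minimal sketch that only compares list lengths. The interfaces are hand-written for this post (not taken from glib-auto.f90), guint is mapped to integer(c_int) as an assumption that is fine for small counts, and the program needs to be linked against GLib, for instance with the flags reported by pkg-config --libs glib-2.0.

```fortran
program glist_null_vs_alloc
  use, intrinsic :: iso_c_binding, only: c_ptr, c_null_ptr, c_loc, c_int
  implicit none

  interface
     ! GList *g_list_alloc (void);
     function g_list_alloc() bind(c)
       import :: c_ptr
       implicit none
       type(c_ptr) :: g_list_alloc
     end function
     ! GList *g_list_append (GList *list, gpointer data);
     function g_list_append(list, data) bind(c)
       import :: c_ptr
       implicit none
       type(c_ptr) :: g_list_append
       type(c_ptr), value :: list
       type(c_ptr), value :: data
     end function
     ! guint g_list_length (GList *list);
     function g_list_length(list) bind(c)
       import :: c_ptr, c_int
       implicit none
       integer(c_int) :: g_list_length
       type(c_ptr), value :: list
     end function
     ! void g_list_free (GList *list);
     subroutine g_list_free(list) bind(c)
       import :: c_ptr
       implicit none
       type(c_ptr), value :: list
     end subroutine
  end interface

  integer(c_int), target :: x
  type(c_ptr) :: from_null, from_alloc

  x = 1

  ! Same single append, two different starting points.
  from_null  = g_list_append(c_null_ptr, c_loc(x))
  from_alloc = g_list_append(g_list_alloc(), c_loc(x))

  ! If the explanation above holds, the second list is one element longer,
  ! the extra element being the one created by g_list_alloc() with NULL data.
  print '(a,i0)', 'length when starting from c_null_ptr:     ', g_list_length(from_null)
  print '(a,i0)', 'length when starting from g_list_alloc(): ', g_list_length(from_alloc)

  call g_list_free(from_null)
  call g_list_free(from_alloc)

end program glist_null_vs_alloc
```

If the two lengths differ by one, then building fields_list from c_null_ptr is not just a workaround but the intended way to start a GList.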