Container for array with missing data

Beliavsky · August 29, 2024, 2:53pm

Often some of the data in an array is missing. For real data you can represent the missing data as NaN, although missing and NaN are distinct concepts. For arrays of other types NaN is not available. Would it be possible to create a container representing a multidimensional array of any type along with a logical array of the same shape signifying where data is present? Functions such as sum, product, minloc etc. would be defined appropriately. Maybe this could become part of stdlib?

There are derived types to store sparse matrices, but I am interested in the case where only a small fraction of the data is missing, so that storing the full array of data and a parallel logical array is convenient.
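To make the idea concrete, here is a minimal sketch of such a container. It is only an illustration, not an existing stdlib type: the names `masked_real_2d`, `values`, `valid` and the type-bound procedures are all hypothetical, and it is fixed to rank-2 default-real data rather than "any type". Each reduction just forwards to the corresponding intrinsic with `mask=`.

```fortran
module masked_array_mod
   implicit none
   private
   public :: masked_real_2d

   ! Hypothetical container: the data plus a same-shape presence mask.
   type :: masked_real_2d
      real,    allocatable :: values(:,:)
      logical, allocatable :: valid(:,:)    ! .true. where data is present
   contains
      procedure :: total        => masked_sum
      procedure :: prod         => masked_product
      procedure :: n_valid      => masked_count
      procedure :: min_location => masked_minloc
   end type masked_real_2d

contains

   pure function masked_sum(self) result(s)
      class(masked_real_2d), intent(in) :: self
      real :: s
      s = sum(self%values, mask=self%valid)
   end function masked_sum

   pure function masked_product(self) result(p)
      class(masked_real_2d), intent(in) :: self
      real :: p
      p = product(self%values, mask=self%valid)
   end function masked_product

   pure function masked_count(self) result(n)
      class(masked_real_2d), intent(in) :: self
      integer :: n
      n = count(self%valid)
   end function masked_count

   pure function masked_minloc(self) result(loc)
      class(masked_real_2d), intent(in) :: self
      integer :: loc(2)
      ! Returns (0, 0) if every element is missing.
      loc = minloc(self%values, mask=self%valid)
   end function masked_minloc

end module masked_array_mod

program demo_masked_container
   use masked_array_mod, only: masked_real_2d
   implicit none
   type(masked_real_2d) :: a

   ! Column-major fill: a(1,1)=1, a(2,1)=2, a(1,2)=3, a(2,2)=4
   a%values = reshape([1.0, 2.0, 3.0, 4.0], [2, 2])
   ! The smallest value, a(1,1), is marked as missing.
   a%valid  = reshape([.false., .true., .true., .true.], [2, 2])

   print *, 'sum     =', a%total()         ! 2 + 3 + 4
   print *, 'product =', a%prod()          ! 2 * 3 * 4
   print *, 'n_valid =', a%n_valid()       ! 3
   print *, 'minloc  =', a%min_location()  ! position of 2.0, i.e. (2, 1)
end program demo_masked_container
```

Supporting other types and ranks would mean repeating this for each combination (or generating it from a template), which is presumably where most of the work for a library version would go.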

In a Python pandas dataframe, a column of floats can have NaNs to represent missing data, but for columns of integers the user must designate and keep track of a sentinel value representing missing data. It would be nice if Fortran had a mature dataframe derived type. A dataframe with missing data could be represented as a collection of arrays with missing data if a container for such arrays existed.

AniruddhaDas · August 29, 2024, 3:01pm

Would it make sense to expand the use of .nil. to cover missing values in Fortran? It could simply indicate that the value is missing. I think its current usage is limited to the ternary operator.

cmaapic · August 29, 2024, 3:02pm

When working with the Met Office historic data (Historic station data - Met Office) we treated the integer data as real and used NaNs throughout. The other option is to use flag values. This is what the statistical packages we’ve used do. Choose a value (-99 or -999) that can’t occur in the data and program around it.
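A small sketch of both options, under some assumptions: the flag value -999 and the sample numbers are made up for illustration, and the NaN variant relies on the compiler supporting IEEE NaNs for default reals (and on not compiling with options that disable NaN checks).

```fortran
program sentinel_and_nan
   use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan, ieee_is_nan
   implicit none
   integer, parameter :: missing = -999   ! assumed flag value, must not occur in real data
   integer :: rain(6)
   logical :: valid(6)
   real    :: x(6)

   rain = [12, missing, 7, 0, missing, 31]   ! made-up sample data

   ! Option 1: keep integers, derive a mask from the flag value.
   valid = rain /= missing
   print *, 'valid count:', count(valid)                                ! 4
   print *, 'total      :', sum(rain, mask=valid)                       ! 50
   print *, 'mean       :', real(sum(rain, mask=valid)) / count(valid)  ! 12.5

   ! Option 2: convert to real and store NaN where the flag was.
   x = merge(ieee_value(1.0, ieee_quiet_nan), real(rain), rain == missing)
   print *, 'total (NaN):', sum(x, mask=.not. ieee_is_nan(x))           ! 50.0
end program sentinel_and_nan
```

Either way the flag or NaN ends up being turned into a logical mask before any reduction, since a plain `sum` over the raw data would be wrong in both cases.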

cmaapic · August 29, 2024, 3:04pm

As a follow-up, support for the SQL NULL would be nice, but I don’t know of many languages that offer support for null, other than standard SQL.

Machalot · August 29, 2024, 4:36pm

Maybe your “data missing” array could be sparse.

Isn’t this already supported by the optional mask input of these intrinsic functions?
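For reference, a short sketch of what the `mask` argument already gives you on plain arrays, including per-column reductions with `dim`. The array contents are made up; only standard intrinsics are used.

```fortran
program mask_intrinsics
   implicit none
   integer :: a(3,2)
   logical :: valid(3,2)

   ! Column 1 holds 4, 9, 2 and column 2 holds 7, 5, 1.
   a = reshape([4, 9, 2, 7, 5, 1], [3, 2])

   valid = .true.
   valid(3,1) = .false.   ! treat the 2 as missing
   valid(3,2) = .false.   ! treat the 1 as missing

   print *, sum(a, mask=valid)          ! 4 + 9 + 7 + 5 = 25
   print *, sum(a, dim=1, mask=valid)   ! per column: 13 12
   print *, minval(a, mask=valid)       ! 4
   print *, minloc(a, mask=valid)       ! 1 1
   print *, pack(a, valid)              ! present values only: 4 9 7 5
end program mask_intrinsics
```

What the intrinsics do not do is carry the mask along with the data, so every call site has to pass it explicitly.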

davidpfister · August 29, 2024, 8:32pm

DBNull is a convenient value to signify the absence of data. The whole .NET family of languages supports it. Alternatively, you can use nullable values. This means that all types, including intrinsic types, can be set to null/Nothing. You can mimic nullables in Fortran using allocatable scalars, but you may sacrifice performance.
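A minimal sketch of the allocatable-scalar approach, assuming a hypothetical wrapper type `nullable_int`: an unallocated component plays the role of null, and `allocated` is the null check.

```fortran
program nullable_scalar
   implicit none

   ! Hypothetical wrapper: an unallocated component stands for "null".
   type :: nullable_int
      integer, allocatable :: value
   end type nullable_int

   type(nullable_int) :: col(4)
   integer :: i, total

   col(1)%value = 10   ! intrinsic assignment allocates the scalar (Fortran 2003+)
   col(2)%value = 20
   col(4)%value = 5
   ! col(3)%value is never assigned, so it stays unallocated ("null").

   total = 0
   do i = 1, size(col)
      if (allocated(col(i)%value)) total = total + col(i)%value
   end do

   print *, 'sum of non-null values:', total   ! 35
end program nullable_scalar
```

The performance concern is visible here: each element is a separate allocation and the reduction is an explicit loop with a branch, rather than a single array intrinsic over contiguous data.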

davidpfister · August 29, 2024, 8:40pm

One idea would be to create a sort of data frame type with a sqlite3 backend. Since the database can be created in memory, there should not be any data access issues. Standard functions like sum and others also exist in SQLite.

wspector · August 30, 2024, 4:50pm

Indeed, this technique has been used for ages. Associate a logical array with your data array. Then set the array elements to indicate which of the data values are valid or invalid. The Fortran WHERE statement and construct can be used to only compute valid results. And as mentioned, a number of intrinsics accept a mask argument for the same reason.

You’d probably want to set the logical kind value to use one-byte logicals to save space. This would also be an ideal use of a bit data type - if one were available in Fortran. (A few Fortran compilers in the distant past have had a bit data type for this very purpose. Notably the CDC compilers for the STAR-100/CYBER-203/CYBER-205 of the 1970s and early 1980s. Those machines supported bit vectors as optional “control vectors” for many of their hardware vector operations. In fact, this is one of the places where the WHERE construct came from.)
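A small sketch of the WHERE construct with a compact mask. One assumption to flag: `c_bool` from `iso_c_binding` is used here as a stand-in for a one-byte logical kind; it is one byte on common compilers, but the standard does not guarantee that, which is why the program prints the actual storage size.

```fortran
program where_with_mask
   use, intrinsic :: iso_c_binding, only: c_bool
   implicit none
   real            :: x(5), y(5)
   logical(c_bool) :: valid(5)   ! usually one byte per element (assumption, see storage_size below)

   x     = [4.0, 9.0, -1.0, 16.0, 25.0]
   valid = [.true., .true., .false., .true., .true.]   ! third value is invalid

   where (valid)
      y = sqrt(x)    ! evaluated only where valid is .true., so sqrt(-1.0) is never computed
   elsewhere
      y = -1.0       ! arbitrary placeholder for invalid elements
   end where

   print *, y                                           ! 2, 3, -1, 4, 5
   print *, storage_size(valid), 'bits per mask element'
end program where_with_mask
```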

MarDie · August 30, 2024, 8:48pm

NumPy has something like that: Masked arrays (NumPy v2.1 Manual)
