Container for array with missing data

Often some of the data in an array is missing. For real data you can represent the missing data as NaN, although missing and NaN are distinct concepts. For arrays of other types NaN is not available. Would it be possible to create a container representing a multidimensional array of any type along with a logical array of the same shape signifying where data is present? Functions such as sum, product, minloc etc. would be defined appropriately. Maybe this could become part of stdlib?
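To make the idea concrete, here is a minimal sketch of such a container, restricted to rank-1 default real arrays. The names `masked_real_1d`, `valid`, `masked_sum` and `masked_minloc` are hypothetical, not an existing stdlib API. A general version covering any type and rank would need generated or templated code.

```fortran
module masked_array_mod
  implicit none
  private
  public :: masked_real_1d, masked_sum, masked_minloc

  ! Hypothetical container: the values plus a same-shape presence mask.
  type :: masked_real_1d
    real,    allocatable :: values(:)
    logical, allocatable :: valid(:)   ! .true. where data is present
  end type masked_real_1d

contains

  ! Sum over the elements that are present.
  pure function masked_sum(a) result(s)
    type(masked_real_1d), intent(in) :: a
    real :: s
    s = sum(a%values, mask=a%valid)
  end function masked_sum

  ! Index of the smallest element that is present.
  pure function masked_minloc(a) result(i)
    type(masked_real_1d), intent(in) :: a
    integer :: i
    i = minloc(a%values, dim=1, mask=a%valid)
  end function masked_minloc

end module masked_array_mod

program masked_array_demo
  use masked_array_mod
  implicit none
  type(masked_real_1d) :: a

  ! The second element is missing; its stored value is a placeholder.
  a%values = [3.0, -100.0, 1.0, 2.0]
  a%valid  = [.true., .false., .true., .true.]

  print *, 'sum    =', masked_sum(a)      ! 3.0 + 1.0 + 2.0 = 6.0
  print *, 'minloc =', masked_minloc(a)   ! 3, since -100.0 is masked out
end program masked_array_demo
```

The wrappers only forward to the intrinsics with `mask=`, so the main benefit of the type is that the data and its mask travel together.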

There are derived types to store sparse matrices, but I am interested in the case where only a small fraction of the data is missing, so that storing the full array of data and a parallel logical array is convenient.

In a Python pandas dataframe, a column of floats can have NaNs to represent missing data, but for columns of integers the user must designate and keep track of a sentinel value representing missing data. It would be nice if Fortran had a mature dataframe derived type. A dataframe with missing data could be represented as a collection of arrays with missing data if a container for such arrays existed.

Would it make sense to expand the use of .nil. to include missing values in Fortran? It could simply indicate that the value is missing. I think its current usage is only in the ternary operator.

When working with the Met Office historic data (Historic station data - Met Office) we treated the integer data as real and used NaNs throughout. The other option is to use flag values, which is what the statistical packages we've used do: choose a value (-99 or -999) that can't occur in the data and program around it.
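For the flag-value approach, a small sketch might look like this. The sentinel `-999` and the array `rain_days` are made-up illustrations, not taken from the Met Office files.

```fortran
program sentinel_demo
  implicit none
  ! Assumed sentinel: a value that cannot occur in the real data.
  integer, parameter :: missing = -999
  integer :: rain_days(6)
  integer :: n_valid, total

  rain_days = [12, 9, missing, 15, missing, 10]

  ! Build the mask on the fly by comparing against the sentinel.
  n_valid = count(rain_days /= missing)
  total   = sum(rain_days, mask=rain_days /= missing)

  print *, 'valid count =', n_valid                 ! 4
  print *, 'total       =', total                   ! 46
  print *, 'mean        =', real(total) / n_valid   ! 11.5
end program sentinel_demo
```

The drawback is that every reduction has to remember the comparison. An unmasked `sum(rain_days)` silently folds the sentinel into the result.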

As a follow-up, support for the SQL null would be nice, but I don't know many languages that offer support for null, other than standard SQL.

Maybe your “data missing” array could be sparse.

Isn’t this already supported by the optional mask input of these intrinsic functions?

DBNull is a convenient value to signify the absence of data. The whole .NET family of languages supports it. Alternatively you can use nullable values, which means that all types, including intrinsic types, can be set to null/Nothing. You can mimic nullables in Fortran using allocatable scalars, but you may sacrifice performance.
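As a rough sketch of the allocatable-scalar idea, an unallocated component plays the role of null. The type name `nullable_int` is hypothetical.

```fortran
program nullable_demo
  implicit none

  ! Hypothetical nullable integer: "null" means the component is unallocated.
  type :: nullable_int
    integer, allocatable :: v
  end type nullable_int

  type(nullable_int) :: col(4)
  integer :: i, total, n_valid

  ! Assignment allocates the component; col(3) is left unset (missing).
  col(1)%v = 10
  col(2)%v = 20
  col(4)%v = 5

  total   = 0
  n_valid = 0
  do i = 1, size(col)
    if (allocated(col(i)%v)) then
      total   = total + col(i)%v
      n_valid = n_valid + 1
    end if
  end do

  print *, 'valid count =', n_valid   ! 3
  print *, 'total       =', total     ! 35
end program nullable_demo
```

Each element is a separate allocation and the values are no longer contiguous, which is where the performance cost comes from. The array intrinsics cannot be applied to `col` directly either.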

An idea could be to create a sort of data frame type with a sqlite3 backend. Since a database can be created in memory, there should not be any data access issues. Standard functions like sum also exist in SQLite.

Indeed, this technique has been used for ages. Associate a logical array with your data array. Then set the array elements to indicate which of the data values are valid or invalid. The Fortran WHERE statement and construct can be used to only compute valid results. And as mentioned, a number of intrinsics accept a mask argument for the same reason.
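A minimal illustration of the technique, with made-up data:

```fortran
program where_demo
  implicit none
  real    :: x(5), y(5)
  logical :: valid(5)

  x     = [4.0, 9.0, -1.0, 16.0, 0.0]
  valid = [.true., .true., .false., .true., .false.]

  ! Only elements flagged as valid are computed; the rest keep the 0.0 default.
  y = 0.0
  where (valid)
    y = sqrt(x)
  end where

  print *, y                       ! 2.0 3.0 0.0 4.0 0.0
  print *, maxval(x, mask=valid)   ! 16.0
  print *, count(valid)            ! 3
end program where_demo
```

Note that `sqrt` is never applied to the invalid `-1.0`, because elemental operations inside `WHERE` are evaluated only where the mask is true.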

You’d probably want to set the logical kind value to use one-byte logicals to save space. This would also be an ideal use of a bit data type - if one were available in Fortran. (A few Fortran compilers in the distant past have had a bit data type for this very purpose. Notably the CDC compilers for the STAR-100/CYBER-203/CYBER-205 of the 1970s and early 1980s. Those machines supported bit vectors as optional “control vectors” for many of their hardware vector operations. In fact, this is one of the places where the WHERE construct came from.)
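One way to ask for a small logical kind is sketched below. It assumes `c_bool` from `iso_c_binding` is one byte wide, which is typical but depends on the compiler and platform.

```fortran
program logical_kind_demo
  use, intrinsic :: iso_c_binding, only: c_bool
  implicit none
  integer, parameter :: n = 1000
  logical         :: default_mask(n)   ! default logical kind
  logical(c_bool) :: small_mask(n)     ! assumed to be one byte per element
  real            :: x(n)

  ! storage_size reports the size in bits of one element.
  print *, 'default logical bits:', storage_size(default_mask)
  print *, 'c_bool logical bits :', storage_size(small_mask)

  x = 1.0
  small_mask = .true.
  small_mask(1:100) = .false.

  ! The mask argument accepts a logical array of any kind.
  print *, 'masked sum =', sum(x, mask=small_mask)   ! 900.0
end program logical_kind_demo
```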

NumPy has something like that: Masked arrays — NumPy v2.1 Manual