FSML v0.1.0 (alpha) - initial release

Ahoy and Servus, Everyone.

I’m happy to announce the first release (v0.1.0-alpha) of FSML (Fortran Statistics and Machine Learning). Some notes below:

Summary

FSML is a toolkit for statistical and machine learning (ML) procedures, including basic statistics (e.g., correlation), hypothesis tests (e.g., Mann–Whitney U, ANOVA), linear parametric methods and models (e.g., multiple OLS regression, discriminant analysis), and non-linear statistical and ML procedures (e.g., k-means clustering). There are more procedures (esp. on the ML side) that I plan to rework and include, but integrating those requires a little more care and thought.

To make it more accessible to contributors, students, researchers, and other users, it is kept relatively simple (KISS), overengineering is avoided, requirements are minimal (stdlib for linalg, and fpm for building/distributing), and much of the API should look familiar to those used to popular packages for other languages (Python and R). Several of my students with no prior Fortran experience were able to use it already (with only the FSML docs pages and the fortran-lang.org tutorials). Ultimately, I hope LFortran’s interactivity will lower the barrier for adoption of the lib and, in turn, of Fortran, because I see a lot of potential for Fortran in ML.

Some Background

After discovering the growing online Fortran community, stdlib, fpm, LFortran and other great projects, I decided to rework parts of my personal statistics and ML Fortran research library (which grew organically over the past ~15 years), clean it up, and open it up to everyone. I replaced standalone LAPACK with stdlib, moved from Makefiles to fpm, added, improved and homogenised some things here and there, and now also test it with LFortran (hence the couple of bug reports).

Code and Docs

The code can be found here, and its documentation (with Handbook and API doc pages) can be found on fsml.mutz.science. It’s MIT licenced, so it’s compatible with Fortran-lang projects, and the API documentation follows a similar style to Fortran-lang stdlib (so it would be less work to add some FSML code to new stdlib stats modules, for example).

JOSS Paper

The associated paper was just published in JOSS (Journal of Open Source Software). A “thank you” goes out to the editor (J. Atkinson) and reviewers (@ivanpribec and M. A. Kowalski) for their time and effort. Generally, having Fortran better represented in the publication landscape may help with its promotion, so why not. :slight_smile:

Blog Post

I also published a blog post to provide a little more context, notes on some design choices, and where I see it fit into the Fortran statistics/machine learning ecosystem. (I migrated my academic website to quarto and decided to include a blog - thanks for giving me the idea, @loiseaujc! I also took the liberty to cite you, @certik et al. about LFortran, and @jorgeg.)

21 Likes

Hej,

Congrats on the JOSS paper. I currently have one under review as well for LightKrylov and plan to wrap one up during vacations for Modern QuadProg.

I quickly looked at the code (although I starred it quite some time ago). It looks pretty cool and very readable. I can definitely see how easy it can be for students to read the code and relate it to the course material. That is definitely a big plus.

I do have some questions though. Take the ordinary least-squares estimator. The algorithm is very close to what the derivation might look like in a stats class, e.g.

  • Construct the covariance matrix $X^\top X$.
  • Compute the precision matrix $\left( X^\top X \right)^{-1}$ (which you do in the code by first computing the eigendecomposition of $X^\top X$).
  • Get the coefficients of the least-squares estimator as $w = \left( X^\top X \right)^{-1} X^\top y$.
  • Compute the variance-covariance matrix of the parameters as $\Sigma = \sigma^2 \left( X^\top X \right)^{-1}$.
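For readers following along, the four steps above can be sketched in a few lines. This is a minimal NumPy illustration and not FSML's actual code: the function name `ols_normal_equations` is hypothetical, `X` is assumed to already contain an intercept column, and the residual variance is assumed to be estimated with `n - p` degrees of freedom (the post does not say how $\sigma^2$ is obtained).

```python
import numpy as np

def ols_normal_equations(X, y):
    """OLS via the normal equations (hypothetical helper, not FSML's API)."""
    n, p = X.shape
    gram = X.T @ X                         # step 1: X^T X
    evals, evecs = np.linalg.eigh(gram)    # symmetric eigendecomposition of X^T X
    gram_inv = (evecs / evals) @ evecs.T   # step 2: V diag(1/lambda) V^T
    w = gram_inv @ (X.T @ y)               # step 3: coefficients
    resid = y - X @ w
    sigma2 = resid @ resid / (n - p)       # assumed estimator of sigma^2
    cov_w = sigma2 * gram_inv              # step 4: covariance of the coefficients
    return w, cov_w

# Small synthetic example: y is roughly 1 + 2 x plus noise.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = 1.0 + 2.0 * X[:, 1] + 0.1 * rng.normal(size=50)
w, cov_w = ols_normal_equations(X, y)
```

The square roots of the diagonal of `cov_w` would then be the standard errors of the coefficients.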

This is all good, but if you’ve had a class on numerical linear algebra (or convex programming), you’d probably do everything with the QR factorization of X or its SVD. Sure enough, it deviates a bit from the stats class material, but it is closer to good practices in numerical linear algebra, while also leading to a more robust and potentially faster implementation.
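As a rough illustration of the QR route, here is a minimal NumPy sketch under the same assumptions as before (hypothetical function name `ols_qr`, intercept column already in `X`, residual variance estimated with `n - p` degrees of freedom). With the reduced factorization $X = QR$, the normal equations collapse to $R w = Q^\top y$, and $\left( X^\top X \right)^{-1} = R^{-1} R^{-\top}$, so $X^\top X$ is never formed. That matters because the condition number of $X^\top X$ is the square of that of $X$.

```python
import numpy as np

def ols_qr(X, y):
    """OLS via the reduced QR factorization of X (hypothetical helper)."""
    n, p = X.shape
    Q, R = np.linalg.qr(X)               # Q is n x p, R is p x p upper triangular
    # np.linalg.solve is a general solver; a triangular solver such as
    # scipy.linalg.solve_triangular would exploit the structure of R.
    w = np.linalg.solve(R, Q.T @ y)      # solves R w = Q^T y
    R_inv = np.linalg.solve(R, np.eye(p))
    resid = y - X @ w
    sigma2 = resid @ resid / (n - p)     # assumed estimator of sigma^2
    cov_w = sigma2 * (R_inv @ R_inv.T)   # sigma^2 (X^T X)^{-1} = sigma^2 R^{-1} R^{-T}
    return w, cov_w

# Same kind of synthetic data as a quick check.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(40), rng.normal(size=(40, 2))])
y = X @ np.array([0.5, -1.0, 2.0]) + 0.05 * rng.normal(size=40)
w, cov_w = ols_qr(X, y)
```

On well-conditioned data like this, both routes should agree to rounding error; the difference shows up when the columns of `X` are nearly collinear.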

Both approaches are equally good (one being closer to a stats class, the other being more robust/faster and closer to a numerical linear algebra class). As someone who also teaches, I am curious about what’s your take on this?

PS: If you want lasso as a linear model, I have a small implementation relying on ADMM available. I’d be happy to adapt it to your conventions and send a PR. Just like fsml, it relies only on stdlib_linalg.
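For context, a textbook ADMM iteration for the lasso can be sketched as follows. This is a generic illustration and not the implementation mentioned above: the names `lasso_admm` and `soft_threshold`, the objective scaling $\tfrac{1}{2}\lVert Xw - y \rVert^2 + \lambda \lVert w \rVert_1$, the fixed penalty `rho`, and the fixed iteration count are all assumptions made for the example.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Elementwise soft-thresholding, the proximal operator of kappa * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def lasso_admm(X, y, lam, rho=1.0, n_iter=500):
    """Minimise 0.5 * ||X w - y||^2 + lam * ||w||_1 with ADMM (illustrative sketch)."""
    p = X.shape[1]
    A = X.T @ X + rho * np.eye(p)    # system matrix of the w-update, fixed across iterations
    Xty = X.T @ y
    w = np.zeros(p)
    z = np.zeros(p)                  # sparse copy of w
    u = np.zeros(p)                  # scaled dual variable
    for _ in range(n_iter):
        w = np.linalg.solve(A, Xty + rho * (z - u))   # ridge-like least-squares step
        z = soft_threshold(w + u, lam / rho)          # shrinkage step
        u = u + w - z                                 # dual update
    return z

# With an identity design the lasso solution is soft_threshold(y, lam),
# which gives a simple sanity check.
y = np.array([3.0, -0.5, 1.5, 0.2])
w_hat = lasso_admm(np.eye(4), y, lam=1.0)
```

A practical version would factorize `A` once and stop on primal and dual residual tolerances instead of running a fixed number of iterations.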

6 Likes

Aww thanks for citing me!! That means a lot. I really enjoy this community and now you’ve inspired me to write my own JOSS paper.

5 Likes

Thanks for the congrats and for engaging with the lib, @loiseaujc. :slight_smile:

Great point.

“I am curious about what’s your take on this?”

The honest answer is that this is something I annotated with a mental question mark and am open to changing. When reworking the code, I decided to stick more to elements that my former students would have been more exposed to and comfortable with. Hence some code deviates a little from more standard implementations (as well as from my own former implementations in some places; the clustering is a good example of that). However, I do not know how merited this approach is, esp. since they would be exposed to linear algebra (albeit not as much as in a class on numerical linear algebra, for sure). After all, I could still use different implementations in a class outside this library, and I think it’s not such a big leap, as long as it’s clearly structured, commented, and referenced.

Edit: Since I plan to implement GLMs (and GAMs), another thought was to keep the standalone OLS as it is and take a numerically sounder approach in a separate implementation of GLMs.

This sounds great and LASSO was on my to-do list anyway, so a PR of an implementation that also just relies on stdlib_linalg is very welcome!! :slight_smile:

P.S. I noticed LightKrylov at JOSS already. I hope you get both wrapped up soon. I like the approach JOSS takes.

1 Like

Yes! We need more Fortran papers(!), and JOSS’s approach makes it relatively hassle-free. I see lots of larger, high-quality Fortran projects that get little such visibility, while relatively small Python projects, for example, get it through publication.

3 Likes

Nice. Great work on getting coverage for the effort as well. Your description is nearly a template for how to advertise such efforts. Your description or a link to it would be welcome on the Fortran Wiki as well.

3 Likes

Thanks, @urbanjost. I could modify it slightly and create an entry for it here on the Fortran Wiki if that’s what you mean?

In 2004 the prolific Alan Miller published A Collection of Mathematical and Statistical Routines in FORTRAN 90 in the Journal of Statistical Software, which is another outlet for papers on open-source statistics software. Other papers with Fortran in the title are:

  • A Fortran 90 Program for the Generalized Order-Restricted Information Criterion
    Rebecca M. Kuiper, Herbert Hoijtink
  • A Fortran 90 Program for Confirmatory Analysis of Variance
    Rebecca M. Kuiper, Irene Klugkist, Herbert Hoijtink
  • BIEMS: A Fortran 90 Program for Calculating Bayes Factors for Inequality and Equality Constrained Models
    Joris Mulder, Herbert Hoijtink, Christiaan de Leeuw
  • REGCMPNT – A Fortran Program for Regression Models with ARIMA Component Errors
    William R. Bell
  • A Fortran 90 Program for Evaluation of Multivariate Normal and Multivariate t Integrals Over Convex Regions
    Paul N. Somerville
  • FORTRAN 90 and SAS-IML Programs for Computation of Critical Values for Multiple Testing and Simultaneous Confidence Intervals
    Paul N. Somerville, Frank Bretz
  • Modern Fortran: Style and Usage (book review)
    Jan de Leeuw
  • Developing Statistical Software in FORTRAN 95 (book review)
    Robert Gentleman

3 Likes

Thanks for the pointer. I know of the Journal of Statistical Software, but I hadn’t considered it here. I just had a closer look at the scope, organisation and information for authors. This looks really good.

Yes, you can put a link there to an external resource or to a new name. If you make a new name for a local article, it will then appear with a ? in the name. Click on that and it opens up a new document for you to place your description in. The description you provided here looked like one you could place there.

Congratulations on the package release and the JOSS article! It is absolutely a “labor of love”.

Since @Beliavsky has brought up other statistical software, it is worth mentioning that a number of the R stats base library procedures are in Fortran:

At one point I found a thread (it may have been a link or sub-page from the R Contributor site or the R Homepage; I can’t find it right now) about documenting the history of some of the FORTRAN codes and statistical algorithms used in R (or S?). As you may already know, the S statistical programming language from Bell Labs was initially written in FORTRAN (or RATFOR), until it was rewritten in C in 1988 (New S); more details in A Brief History of S (PDF, 151 KB).

2 Likes

Done. I’ve added a page here. I’ve modified the text slightly to make it less contextual/discourse specific. Also, I’m adding FortranWiki to my recommended resources (I admit I hadn’t looked at it this closely before); great project!

1 Like

Thanks, @ivanpribec.

Yes; it’s my understanding that quickr takes advantage of that to speed up parts of R code.

I’m browsing r-source now; seems to be GPL licenced, creating some barriers in using it for other projects.

My understanding was it is more like a transpiler (source-to-source translator). It could perhaps call R built-in functions (like the stats ones) directly by their C or Fortran name, but I guess that wouldn’t make a big runtime difference. It’s the switch from interpreted to compiled that brings the big boost in speed.

Yes, it has been a GPL project for a long time. From History and Overview of R:

“In 1995, Martin Mächler made an important contribution by convincing Ross and Robert to use the GNU General Public License to make R free software. This was critical because it allowed for the source code for the entire R system to be accessible to anyone who wanted to tinker with it (more on free software later).”

Some procedures in stats appear to be taken from netlib or journals (where other licenses may apply).

1 Like

Dataplot (NIST) is public domain and, last I looked, contains a vast collection of statistical software in Fortran 77. At least one version of DATAPAC is on GitHub. It might be dated (or not). I used the interpreter heavily in the past, though not in recent years, and as I remember it, it came with a very complete set of documentation. Not sure if that is of interest or not, but it was all public domain.

3 Likes