GSoC: improve nonlinear least squares in R

On GitHub I see the project

GSoC21-improveNLS

Efforts to improve the functioning of the R nls() function for nonlinear least squares estimation. This is part of the Google Summer of Code program for 2021.

The guts of the package is old FORTRAN and C code, and looking at the project write-up, it appears that neither the GSoC student or the original programmer completely understands how it works. Quoting the report RefactoringNLS:

As the nls() man page states, when the “port” algorithm is used with the trace argument TRUE, the
iterations display the objective function value which is 1/2 the sum of squares (or deviance). It is likely that the trace display is embedded in the Fortran of the nlminb routine that is called to execute the “port” algorithm, but the discrepancy is nonetheless unfortunate for users.

Issue: code structure
The nls() code is structured in a way that inhibits both maintenance and improvement. In particular, the iterative setup is such that introduction of Marquardt stabilization is not easily available.
To obtain performance, a lot of the code is in C with consequent calls and returns that complicate the code. Over time, R has become much more efficient on modern computers, and the need to use compiled C and Fortran is less critical. Moreover, the burden for maintenance could be much reduced by moving code entirely to R.

Issue: code documentation for maintenance
setPars() – explain weaknesses. Only used by profile.nls()
The paucity of documentation is exacerbated by the mixed R/C/Fortran code base.
Following is an email to John Nash from Doug Bates. This is NOT a criticism of Prof. Bates work, but reflection on how difficult it is to develop code in this subject area and to keep it maintainable. We have experienced similar loss of understanding for some of our own codes.

As stdlib matures maybe its code can be used not only directly in Fortran but in packages for Python and R.

Descriptions of GSoC R projects are here.

2 Likes

R’s many basic stats functions are written in F77 and seem difficult to read. Hopefully, someone will modernize them otherwise they probably would be replaced by C codes by the core members.