If anyone wants to dig into the details of what the LANL researchers did:
Whatever results the second paper presents for automatic Fortran-to-C++ translation underestimate what can be achieved, because the authors do not take the obvious step of using LLMs to fix code that does not compile. The bolding of text below is mine:
> Compilation accuracy of the translated C++ measures how many translations successfully compile without errors (Wen et al., 2022b). We compiled each translated C++ using the g++ v5.3.0 compiler on Red Hat Enterprise Linux Workstation release 7.9. If a C++ translation failed to compile, we recorded the compiler output and **did not proceed further with that translation** (Figure 1). We reviewed the compiler output and categorized each error as shown in Table 2.
I have a C++ agent to fix C++ compilation errors, and I'm sure there are much more powerful tools for this.
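The missing step is a simple retry loop. Here is a minimal, hypothetical sketch of such a compile-and-repair loop; `compile_fn` stands in for invoking g++ and `fix_fn` stands in for an LLM call, neither of which appears in the paper:

```python
# Hypothetical compile-and-repair loop (not from the paper).
# `compile_fn` returns (ok, diagnostics); `fix_fn` stands in for an LLM
# call that rewrites the source given the compiler diagnostics.

def repair_loop(source, compile_fn, fix_fn, max_attempts=3):
    """Retry a translation until it compiles or attempts run out."""
    for _ in range(max_attempts):
        ok, diagnostics = compile_fn(source)
        if ok:
            return source
        source = fix_fn(source, diagnostics)
    return None  # give up; count it as a failed translation

# Toy demonstration with a stubbed-out "compiler" and "LLM":
def fake_compile(src):
    return (";" in src, "error: expected ';'")

def fake_fix(src, diagnostics):
    return src + ";"

print(repair_loop("int x = 1", fake_compile, fake_fix))  # int x = 1;
```

In a real setup `compile_fn` would shell out to the same g++ invocation the authors used, so the loop measures exactly how much of their "failed to compile" category is recoverable.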
I see several issues with translating Fortran to C++. The most fundamental is that Fortran has concepts and capabilities that C++ lacks (of course, the reverse is true too). A significant example is single-program, multiple-data (SPMD) parallelism with a partitioned global address space (PGAS), both of which work in distributed memory. The closest analogous C++ concept might be multithreaded programming, but that only works in shared memory, and one would need to fork all threads at the beginning of execution, handle several setup tasks (e.g., establishing non-allocatable coarrays), not join the threads until the end of execution, and prevent the spawning of additional threads by individual loops, and that's just a small sampling of the issues that would need to be addressed. One could translate the SPMD and PGAS features to one-sided MPI, but that is going to be challenging to get right, less readable, and likely to hurt performance.
Then there's a lot of information loss involved. Think about the long list of constraints that apply to `pure` procedures. Unless there's a similar C++ concept, the reader of the translated code will have to read through each translated `pure` procedure to rediscover all the information that the single keyword `pure` provides in one fell swoop. Such rediscovery becomes especially important if the procedure gets called inside a parallel loop when translated to C++. By contrast, every procedure called inside Fortran's `do concurrent` construct must be `pure` according to the Fortran standard.
Moreover, even though C++ now has multidimensional arrays, C++ still lacks array statements as far as I know. So are all array statements being converted to nested loops? If so, there again is a loss of information unless those become C++ `parallel_for` loops that retain the information that there is no implied ordering of iterations. Even then, it's likely to lead to code bloat: what was one line in Fortran could become many more lines in C++.
And the languages are only continuing to diverge. Fortran 2028 templates will be type-safe, something that is not easily expressible in C++ because C++ doesn't allow for specifying template requirements (relationships between types, procedures, and combinations thereof), so the loss of information could also lead to a loss in type safety if Fortran programmers take full advantage of the upcoming template feature.
I've only scratched the surface above. How about the fact that C++ allows overloading operators but does not facilitate user definitions of new operators? A common response is that user-defined operators are syntactic sugar, but such statements ignore the additional semantic constraints involved, such as the requirement that the operands have the `intent(in)` property. That's information the reader immediately knows when seeing the use of a user-defined operator in Fortran, whereas one would have to inspect the signature of every C++ function that replaces a Fortran operator to discover the same information about the arguments. And then there's the argument that syntactic sugar can be exceptionally powerful in its communicative value.
There's so much more that can be said about such topics as the differences between Fortran pointers and C++ pointers; e.g., the `target` attribute communicates important information to both the reader and the compiler. How will this information be communicated to the C++ compiler or developer?
Bottom line: the two languages are equivalent only in a superficial way that ignores a lot and accepts a considerable amount of information loss, extreme restrictions, code bloat, and potential loss of safety and performance.
I struggle to comprehend how LLMs, trained on material that is likely not relevant to the context at hand, are superior to using inductive logic programming for specification recovery.
This was an ancient line of CS research from when logic and rigor were considered important (they don't seem to be very important today, given the claims people with technical sophistication accept).
Clearly, specification recovery is less precise, in the sense of being able to verify that a program meets its spec, than deriving a program from a spec from first principles, since a recovered spec is only an approximation based on the observable behavior of the program over finite inputs. But I can't see how anyone could put more confidence (in the sense that frequentist statisticians use the word) in any program transformation derived from an LLM.
I would say that they are diametrically opposed. Fortran was always about computation, C was always about hardware control. In Fortran, the states that the machine goes through between I/O are unobservable. In C, they are of supreme importance because observable hardware is being changed.
On a cheerier note, Lawrence Berkeley National Lab recognized Computing Sciences' Damian Rouson @rouson as Developer of the Year. Damian, who is in the Applied Mathematics and Computational Research Division, led the development of new software tools for testing and correctness checking and a library that supports Fortran/C/C++ language interoperability.
Some relevant projects are:
Following on from what Themos said, I find the following publications worth a read for some of the history and development of C and C++.
ACM SIGPLAN Notices, Volume 28, Number 3, March 1993
History of Programming Languages Conference (HOPL II):

- Dennis M. Ritchie, "The Development of the C Language", pp. 201-208
- Bjarne Stroustrup, "A History of C++: 1979-1991"

And the book:

- Bjarne Stroustrup, *The Design and Evolution of C++*, Addison-Wesley, ISBN 0201543303, March 2007
Ian Chivers
I found this working paper recently, which provides an analysis of the shortcomings of C++ array libraries compared to Fortran:
Abstract:

> As a language for scientific computing, C++ is at a disadvantage compared to many other languages due to its lack of a well-designed standard for multi-dimensional arrays supporting efficient whole-array expressions, expressive array-subsetting syntax and linear algebra. To support the development of such a standard, in this paper I review the interface, capabilities and weaknesses of a number of free C++ array libraries (Adept, Armadillo, Blaze, Blitz++, Eigen, MTL4, ra-ra, uBLAS and Xtensor) as well as other languages supporting multi-dimensional arrays (particularly Fortran, Python, Matlab, IDL and Julia). These are contrasted with the verbose and limited whole-array capabilities in the C++20 Standard Template Library. To help ensure the standard meets the needs of large-scale scientific applications, I also present an analysis of array use in an Earth-system model for operational weather forecasting (2.2 million lines of code). I argue that an unlimited number of dimensions should be supported, not the limit of two imposed by many libraries focusing on linear algebra, and propose a solution to the lack of a matrix-multiplication operator in C++. A detailed investigation is presented of the problem that most C++ libraries cannot simply and efficiently pass a subset of an array to a function, and a solution is proposed. A total of 25 specific recommendations are made that will hopefully contribute to a discussion leading to the formulation of a standard.
LLMs can help people understand large code bases and change them. There is a project, Search-Engine-Integrated Multi-Expert Inference (SEIMEI), that claims to "optimize reasoning steps (with agents) and achieve SOTA results on tasks requiring deep reasoning". The documentation briefly describes (p. 10) the use of an AI agent on a nuclear fusion simulation, which can, for example, convert the coordinates of a simulation and explain what happens when running the GyroKinetic Vlasov simulation code.
LLMs can do a lot, but they are also hyped. I have not used SEIMEI. Maybe the author would be willing to apply the tool to other large codes. The code is provided, and it can be run on a local GPU or a rented server GPU.
I think there are a few things that would dramatically lower the barrier to entry for new scientific programmers. In no particular order:
- Interactive Fortran environments (a la Jupyter) to quickly start workshopping code. This is really nice because it dramatically simplifies the workflow and reduces the amount of knowledge needed to write and execute code. It could also reduce the burden of installing the various software tools needed to build a project, and it makes the develop → test iteration more visually intuitive.
- Better IO. This is a place where Fortran is really lacking, in my opinion. I understand that there is a lot of history here, but reading a data file into an array takes many lines of code, and the error messages are confusing or unhelpful. New scientists often have to load data and then manipulate it in some way, and Fortran doesn't have great utilities for either. Utilities like pandas.read_csv, numpy.loadtxt, h5py, or xarray are all extremely useful and move the burden from loading data to analyzing it.
- Expansion of stdlib array operations. I know that this is in progress and is limited by people power. Python (and other languages) have so many utilities for altering and manipulating data. Common operations like rolling averages/convolutions, histogramming, FFTs, curve fitting, interpolation, etc. would benefit the community and make Fortran more attractive to newer programmers.
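As an example of the kind of one-liner newcomers reach for in Python (and which a comparable stdlib routine could mirror), here is a centered rolling average via NumPy convolution:

```python
import numpy as np

# The kind of one-liner that pulls new scientists toward Python: a moving
# average via convolution. A similar stdlib routine would remove a common
# reason to leave Fortran for the analysis step.
def rolling_mean(x, window):
    """Moving average over `window` points (valid region only)."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(rolling_mean(data, 3))  # [2. 3. 4.]
```

The equivalent hand-rolled Fortran is a do-loop with edge handling; it is not hard, but it is exactly the friction being described.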
@certik The Python Fortran Rosetta Stone is very helpful! I have looked at it many times, especially when I was first learning. I have had aspirations to help improve it but keep getting dragged into other projects.
Text format:

```fortran
program demo
   use stdlib_io, only: loadtxt
   implicit none
   real, allocatable :: x(:, :)
   call loadtxt('example.csv', x, delimiter=',')
end program demo
```
NumPy binary array format:

```fortran
program demo
   use stdlib_io_npy, only: load_npy
   implicit none
   real, allocatable :: x(:, :)
   call load_npy('example.npy', x)
end program demo
```
stdlib_io: io - Fortran-lang/stdlib
```fortran
program demo
   use h5fortran
   implicit none
   ! example arrays (declarations added to make the snippet complete)
   real :: x(2, 2) = 1.0, y(2, 2)
   call h5write('my.h5', '/x', x)
   call h5read('my.h5', '/y', y)
end program demo
```
h5fortran: GitHub - geospace-code/h5fortran: Lightweight HDF5 polymorphic Fortran: h5write() h5read()
Compatible with netCDF as far as I can understand. A good tutorial can be found here: NetCDF | Programming in Modern Fortran.
I copied the example shown there into a file, opened a shell on my MacBook, and ran:

```shell
> brew install netcdf-fortran
...
> export NETCDF_ROOT=`brew --prefix netcdf-fortran`
> gfortran -I$NETCDF_ROOT/include -L$NETCDF_ROOT/lib example.f90 -lnetcdff
> ./a.out
```
```
Data to be written to NetCDF file:
----------------------------------------------------------------
 0  1  2  3  4  5  6  7  8  9 10 11
 6  7  8  9 10 11 12 13 14 15 16 17
12 13 14 15 16 17 18 19 20 21 22 23
18 19 20 21 22 23 24 25 26 27 28 29
24 25 26 27 28 29 30 31 32 33 34 35
30 31 32 33 34 35 36 37 38 39 40 41
----------------------------------------------------------------
Data read from NetCDF file:
----------------------------------------------------------------
 0  1  2  3  4  5  6  7  8  9 10 11
 6  7  8  9 10 11 12 13 14 15 16 17
12 13 14 15 16 17 18 19 20 21 22 23
18 19 20 21 22 23 24 25 26 27 28 29
24 25 26 27 28 29 30 31 32 33 34 35
30 31 32 33 34 35 36 37 38 39 40 41
----------------------------------------------------------------
```
So I think the basic needs are there.
IMO, Fortran does pretty well on structured and binary data. I see bigger issues with text formats like JSON, XML, TOML, and YAML. There are Fortran libraries for each of those, but the interfaces are not very consistent and usage tends to be very verbose. Most of them are volunteer projects maintained by a single person.
What Fortran projects absolutely lack is good documentation.
Interactive fortran environments (a la Jupyter)
Such as LFortran + Jupyter-Lab?
... but there's some way to go to make it suitable for classrooms, and plotting in this environment is key.
`stdlib` already includes some procedures like `loadtxt`. Ongoing development of stdlib on the IO end will help cross more IO-related barriers. NetCDF and HDF are fairly easy to handle with Fortran and fairly well documented. The main barrier I can see here is for people to get all their tools together with minimal hassle.
Agree. Expanding the stats modules would be a priority for the sort of things I teach.
So ... we're not quite there yet, but well on the way, I think.
Maybe worthwhile to write a proposal for one of these? NumPy, xarray and related formats have strong ties with the NumFOCUS world, so I think it would be mutually beneficial.
Note that Fortran-Lang is member of NumFOCUS: Fortran-lang - NumFOCUS
Documentation that can be part of the code is supported in many languages, but Fortran lacks even a block-text capability. I use preprocessors to make up for that lack.
One useful mode is to allow code to be contained in markdown, just as it is on Discourse: a `` ```fortran `` line starts a code section. This allows placing documentation, links to external resources, and C code alongside the Fortran code. The file is then both the source and a GitHub-compatible document.
The other mode is to allow for a free-format block of text in the input that the preprocessor can turn into comments and/or write to a file for further post-processing.
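The first mode, treating a markdown file as the source of truth, can be sketched in a few lines. This is a hypothetical illustration of the idea, not the actual implementation of any existing preprocessor:

```python
# Minimal sketch of the "file is both source and document" idea: pull
# every fortran-tagged code fence out of a markdown file so the same
# file can be rendered on GitHub and handed to the compiler.

def extract_fortran(text, fence="`" * 3):
    """Return the concatenated bodies of all fortran fenced blocks."""
    blocks, in_block, current = [], False, []
    for line in text.splitlines():
        if line.strip() == fence + "fortran":     # opening fence
            in_block, current = True, []
        elif line.strip() == fence and in_block:  # closing fence
            blocks.append("\n".join(current))
            in_block = False
        elif in_block:
            current.append(line)
    return "\n".join(blocks)

# Fence strings are assembled from pieces to avoid nesting literal fences.
doc = "\n".join([
    "# Demo",
    "Some prose explaining the routine.",
    "`" * 3 + "fortran",
    "print *, 'hello'",
    "`" * 3,
    "More prose.",
])
print(extract_fortran(doc))  # print *, 'hello'
```

A build step that runs this extraction before compilation is all it takes for the .md file to serve as both documentation and source.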
I used to put the code into HTML documents in an `<XMP>` ... `</XMP>` section, but technically that element is deprecated.
But I think Fortran desperately needs some method of block text that adds similar functionality. Even using cpp's `#if` to skip over text is better than nothing, but if a standard preprocessor is ever agreed upon, it could be used to add a similar capability.
I find it far more maintainable if the documentation is right in the code.
To leverage the code itself as part of the available information, Doxygen and FORD provide another approach.
fpm(1) would benefit dramatically from a standard form of documentation, particularly if the fpm command itself could locate and display it. I think the most natural approach would be for fpm to compile .md files instead of just .f/.f90/.c files; combined with a tool that could display and search the files in a CLI environment or convert them to HTML (which is what the original markdown Perl script did), this would be appealing and could be done now, without waiting for a change to the standard.
Interactive Fortran might bring IMPLICIT statements back into vogue. LFortran will take that to a new level, but even a small interpreter that allowed use of stdlib functions and included help text would be useful. There used to be several F77 Fortran interpreters, but I cannot find any now.
M_matrix is not quite suitable, but it can be used to explore the concept. You can call it stand-alone or as a procedure from your code, and it provides a minimal embedded-language environment. It lets you explore two fpm packages, M_sets and M_orderpack, including built-in help for the procedures.
Perhaps a similar program that supported minimal Fortran interpretation and provided searchable help for the stdlib procedures and included stdlib procedures as built-in functions would help promote stdlib usage. gnuplot support similar to what @Beliavsky added to his expression parser would be a nice bonus.
M_matrix shows a model for what an interpreter/demonstrator/documentation tool might look like; prep is a preprocessor that can read from a markdown file, convert text blocks to Fortran code or comments, or extract them into a file; and fman shows what a terminal-based markdown viewer might look like, although only a small subset of Discourse markdown might be possible (images, formulas, multimedia, etc. would be hard to display using just a terminal emulator) ...
PS: single-file versions of lala, prep, and fman are at mars/bootstrap at main · lockstockandbarrel/mars · GitHub, which would be an easier way for non-fpm users in particular to explore tools similar to the proposed one.
That is one of the ways that I solve this problem too.
```c
#if 0
...arbitrary lines of text...
#endif
```
However, a downside of this approach is that modifying that text block can trigger a sequence of unnecessary recompilations, or, if the file is a low-level file, say in a library, it can trigger several unnecessary entire program rebuilds.
Awesome, thank you, I am glad it was useful. It was useful even for me to realize that anything numerical you can do in Python/NumPy, you can do in Fortran, just as easily.
Yes, we'll add plotting.
> Requirements for modern software: HPC software relies on the broader software landscape (e.g., programming languages, vendor tools, runtimes). Hence, the HPC stack must evolve accordingly and alongside the broader software ecosystem to meet the post-Moore era's requirements, including trustworthiness (e.g., memory-safety, robustness) [20], reproducibility, maintainability, and energy-efficiency, and to align with national interests. Not meeting these requirements puts the field at risk in mission-critical scenarios, and the clearest example of vulnerability is the overwhelming reliance on Fortran [37,25] in legacy HPC codes.
headscratch
I'm confused. They list criteria that are, for the most part, easily met by Fortran (even old Fortran) and then go on to say that reliance on Fortran is a problem. Or are they specifically talking about a lack of expertise in F77? Regardless, this seems poorly phrased at best.
It is just that the people who wrote this article have a misconception of how Fortran works, both the legacy and the modern language.
As we say in French: "When you want to kill your dog, you accuse it of having rabies." In the present case it's also: "When you want to shine with your brand-new rifle, you look for a dog to kill."