At Codee, we recently shared two comparative analyses exploring how current AI coding assistants perform on Fortran code, compared to Codee’s compiler-based tools.
1 - Code Formatting
We assessed how the Codee Formatter and AI assistants like ChatGPT, Claude, and Gemini handle the modernization of legacy Fortran 77 code, focusing on improving formatting and readability:
AI assistants often struggled with large source files, sometimes introducing unintended semantic changes or breaking compilation.
In contrast, the Codee Formatter processed Fortran files almost instantly, ensuring the original logic and structure were preserved by relying on its compiler-based technology.
2 - Code Optimization
We also reviewed findings from the paper “Comprehensive Evaluation of LLMs in HPC Code Performance Optimization” by B. Cui, T. Ramesh, and K. Zhou (George Mason University) and O. Hernandez (Oak Ridge National Laboratory). The authors compared the Codee Analyzer with AI assistants such as ChatGPT, Claude, and Llama in HPC code optimization:
AI assistants were able to suggest meaningful optimizations and achieve performance speedups. However, they also fell short in several benchmarks, producing code that failed to compile, crashed, or even generated incorrect results.
On the other hand, the deterministic static analysis of the Codee Analyzer consistently generated correct and compilable optimizations.
As a general takeaway, AI assistants are valuable for creative and exploratory tasks, such as prototyping new code. However, when code correctness and reproducibility are essential, such as in scientific computing, they can pose risks if not carefully supervised by experienced developers. That’s where deterministic, compiler-grade tools remain a reliable foundation for development workflows.
We’d be interested to hear your thoughts:
How do you see AI assistants fitting into Fortran development?
Have you tried using AI tools for Fortran development?
Isn’t this exactly what one would expect? LLMs are not reasoning; they are “stochastic parrots”, so unless trained exclusively on Fortran-specific data, one would expect all sorts of nonsense to get mixed in. That they work at all is the amazing thing; that they cheerfully produce crap isn’t.
Another very public example: ChatGPT gets the law wrong.
LLMs are not reasoning; they are “stochastic parrots”, so unless trained exclusively on Fortran-specific data, one would expect all sorts of nonsense to get mixed in.
Indeed! Given the rapid adoption of AI coding assistants in recent years, our goal was to just contextualize the continued importance of deterministic and specialized coding tools to assist developers in certain activities.
On that note, has anyone seen any ongoing efforts to tune LLMs specifically for Fortran code?
Having retired (at least for now), I don’t have a sizable code base to train one on. There are enough self-hosted LLMs that I’d have thought the usual players (government labs, etc.) would have plenty of code to play with.
The “obvious” next step is to tie the toolchains together: the LLM-based assistant could propose, and the semantically aware toolchain could validate and send the proposal back for a rewrite until it is at least acceptable. No doubt that still leaves room for terrible numerics (existing code bases are generally going to use standard floating point, rather than intervals or unums, so automated proofs of numerical equivalence are probably infeasible).
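To make the propose/validate idea above a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `propose` stands in for whatever LLM assistant is used, `validate` stands in for a compiler-based checker, and the toy versions below only look for `implicit none`. No real tool's API is shown.

```python
from typing import Callable, Optional

# Hypothetical roles: `propose` stands in for the LLM assistant,
# `validate` for the semantically aware toolchain.
Proposer = Callable[[str, str], str]        # (source, feedback) -> candidate
Validator = Callable[[str], Optional[str]]  # candidate -> complaint, or None if OK


def refine(source: str, propose: Proposer, validate: Validator,
           max_rounds: int = 3) -> Optional[str]:
    """Return the first candidate the validator accepts, or None if none is."""
    feedback = ""
    for _ in range(max_rounds):
        candidate = propose(source, feedback)
        complaint = validate(candidate)
        if complaint is None:
            return candidate
        feedback = complaint  # send it back for a rewrite
    return None


# Toy stand-ins so the sketch runs without any real LLM or compiler.
def toy_propose(source: str, feedback: str) -> str:
    # Only "fixes" the code once the validator has complained about it.
    if "implicit none" in feedback:
        return source.replace("program demo", "program demo\n  implicit none", 1)
    return source


def toy_validate(candidate: str) -> Optional[str]:
    if "implicit none" not in candidate:
        return "missing implicit none"
    return None


legacy = "program demo\n  x = 1.0\nend program demo\n"
accepted = refine(legacy, toy_propose, toy_validate)
```

The bounded `max_rounds` matters: the loop gives up and returns `None` rather than accepting something the validator never approved. As noted above, passing such a check says nothing about numerical equivalence.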
I kinda doubt it. I suspect that a lot of that code is NOT something that would be released for training. It feels more like Claude in particular was trained on Stack Overflow, on all the questions with examples from people who had flaws in their code.
I pretty much spent 2 months and >3 billion Claude tokens, and almost everything it wanted to do to speed up code made things slower. It did do well on unit tests and doing mixed C/F90, as well as documentation.
I almost exclusively use LLMs with Fortran, and the more I use them and the more I refine my markdown files for what I like/want, the fewer errors they make. See here, which is the base of my “good Fortran” markdown.
Whenever I optimize code I do profiler-based optimization, so I know where my hot loops are, and using the output from the profiler (cache misses, flop rate, etc.) it is able to do some good optimizations, but it is limited. The more you know, the more you can get it to do. For example, you get more from asking “The profiler says that we have a lot of thread divergence due to branching, can you explore some optimization tricks that would reduce branching?”, with tests that validate everything, than from asking “can you optimize my code?”
There are tips and tricks for everything; I learn new things every day. I am exclusively writing GPU-accelerated code using OpenACC and DO CONCURRENT, and it is pretty good at it tbh.
I believe that if you went back to a 1960 university and asked to see a “computer scientist”, you would most likely be pointed to someone working on numerical analysis (the “study of algorithms for the problems of continuous mathematics” in Nick Trefethen’s memorable definition). That field has not exploded (in terms of numbers of practitioners) like the rest of computer-related endeavours. From O(1/2), in terms of the fraction of “computer-scientists”, it has become minuscule. I suspect that the reliability of AI contributions is proportional to the quantity of high-quality material that the AI system has trained on, and for numerical analysis that quantity is simply not there. Fortran being the natural home of “number-crunching”, in the brutal terminology that is prevalent, you may think twice before handing over responsibilities to an AI system for tasks requiring numerical analysis.
Would you even know if the AI suggestion destroyed the carefully built stability or convergence of your algorithm?
Having some end-to-end confidence tests, from before that was fashionable, does allow for knowing when the AI has destroyed things.
Almost every attempt to “improve things” was a disaster for the AI. But I now have a bevy of unit tests to augment the previous end to end tests, and documentation thanks to the AI.
In terms of convergence, I did spend days battling the AI. It was solid before, just a bit slow. I got a bit of a speedup and a testing framework to tune it up (but it’s still an MCMC approach). The AI wanted to do spline fits and gawd knows what… which I knew was not going to work.
I pretty much learned after the 3.2 billion tokens that Claude is most helpful for code organisation, testing frameworks, documentation, and helping with the CI/CD pipeline. And NOT to use it for complicated .F90 algorithms.
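For what it’s worth, the kind of end-to-end confidence test mentioned above can be very small. The following is only a sketch under assumptions: the function name and the tolerances are made up for illustration, and the numbers being compared would come from running the trusted code and the AI-modified code on the same inputs.

```python
import math
from typing import Sequence


def outputs_match(reference: Sequence[float], candidate: Sequence[float],
                  rel_tol: float = 1e-10, abs_tol: float = 1e-12) -> bool:
    """True if both runs produced the same number of values, all within tolerance."""
    if len(reference) != len(candidate):
        return False
    # math.isclose never treats NaN as close to anything, so a run that
    # blew up into NaN fails the check instead of slipping through.
    return all(math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
               for r, c in zip(reference, candidate))
```

The tolerances are the judgment call: too tight, and harmless reordering of floating-point operations fails the test; too loose, and a genuine loss of stability or convergence goes unnoticed.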
This is very similar to my experience. In my daily workflow, AI agents are tireless collaborators who improve day by day (in all fields: code review, optimization, documentation, etc.).
“I believe that if you went back to a 1960 university and asked to see a ‘computer scientist’, you would most likely be pointed to someone working on numerical analysis.” In 1960 I went from a university with no computer to one where some people were writing an operating system easier to use than IBM’s horrible JCL and others were doing numerical analysis.
What we have found echoes the comments in this thread. The compiler-based tools, PlusFort, fpt, Codee … understand the language and can make systematic changes safely. AI doesn’t and won’t. But AI can make well specified changes and can save a lot of work.
There has been mention of code optimisation by AI. I would like to know what automated code optimisations people use or would like to use, and what contribution AI can make to this. I will start another thread to discuss this.
One example of my daily use of AI collaboration is to “parse, analyze, and summarize” the compilation, build, and profiling logs generated by the NVIDIA SDK (these kinds of logs are often very long and time-consuming to read and understand). My AI assistant quickly points me to the most critical issues, sorting them with reasonable understanding (AI is very good at recognizing patterns); the workflow speedup is real and huge.
Another example of my daily AI usage is planning and brainstorming new features: in the image below, there is a screenshot of a small brainstorm I am having right now with my AI assistant concerning refactoring/new API design. The AI assistant is not able (for now) to replace me, but it is a great collaborator that instantly drafts my ideas into something very close to real code, from which I can better judge whether my ideas are worthwhile or not (more often not, sadly).
I am looking forward to the assistance of AI to help me with my new codes, but what I find strange is why anyone would want to update legacy Fortran77 to newer code just for the sake of updating it.
I understand writing new code in the newer standards (especially the parallel features), but I have personally found that, the majority of the time, the original Fortran77 code results in a faster executable than any of the newer standards (OOP, DO CONCURRENT, FORALL, etc.), which basically trade speed for flexibility, but the reason we use Fortran over other languages is for its speed.
For legacy software that is never updated, known to be correct, and runs fast, why mess with it just to make it object oriented or to use a “newer” standard that just results in slower code? I do write a lot of Fortran OOP, but it’s always in non-speed-critical areas of projects (e.g., file I/O or procedures that are infrequently used). Otherwise, it’s highly optimized Fortran95/77 code to make use of the faster execution.
This has been a pet peeve of mine because so many projects go down the rabbit hole of rewriting entire code bases in full OOP, only to spend another decade having to rewrite, debug, and optimize the code to match the performance of the original Fortran77 (and usually doing it by inserting a Fortran77 routine very deep in the call stack).
Fortran 77 code should be updated in stages. Converting it to free source form, replacing do ... continue with do ... end do, adding argument intents, and putting procedures in modules should not change the speed. If speed is important, the impact of more substantive changes should be measured before they are added to the production version of a code.
I think Fortran also has many other advantages over other languages. As for using legacy f77 codes, there are also many reasons why a programmer might want to update to modern standards. The ability to specify IMPLICIT NONE, explicit interfaces, argument intents, and allocatable arrays is just the beginning. How about using more than one integer kind? How about using more than one complex kind? How about accessing a command line argument, or an environment variable? How about the C interoperability features? I think people forget just how limited and restrictive f77 is/was.
As @RonShepard points out, there are many advantages to the language, and there’s more to modern Fortran.
I think we have to separate moving to modern Fortran from moving to OOP. Modern Fortran allows for OOP, but to write (or rewrite) a project in modern Fortran does not necessarily mean it should adopt OOP. Whether that’s reasonable or not depends on the project.
This statement couldn’t be more wrong. There are a lot of old FORTRAN codes whose development required huge effort and resources, and whose algorithms are too useful (and sometimes even still unsurpassed) to be abandoned.
This is because significant improvements in (mathematical) algorithm efficiency are usually achieved only on time scales that are longer than Fortran’s evolutionary time scale.
Some examples:
The Adams-Bashforth-Moulton methods for the solution of ODEs essentially stem from the 19th and 20th centuries, but are still competitive for a number of problems and accuracy requirements.
The Piecewise Parabolic Method of Colella & Woodward for solving the compressible Euler equations in CFD is about four decades old, yet is still a competitive method and a standard tool in astrophysics. Etc.
I have myself refactored FORTRAN 77 codes into modern Fortran OO versions (always monitoring performance along the way) where the OO version ended up being faster. This myth that FORTRAN 77 code is the fastest, and that OO code is necessarily slower needs to stop.
As always, one needs to know what one is doing when embarking on such a modernization project.
Those are valid reasons to use modern Fortran for something that needs to be interoperable, but not for rewriting a library in it. I am just saying that everyone is jumping on the idea of rewriting code bases as a means of just putting their name on a project and feeling a sense of ownership, rather than developing additional, new code (new algorithms, new features, new connections). If you need to link to a C library, then you write newer wrapper code that connects to the f77 rather than rewriting the f77.
If you go to the extent of rewriting the entire code, you might as well rewrite it in something more useful and flexible, such as C or Rust. The reason people hold onto Fortran is their inability to learn a new language (and I feel the same, as I have spent the last 25 years exclusively coding in large Fortran projects and I personally hate C/C++).
Most people who jump on this, saying “I rewrote it in OOP and got the same speed”, are usually referring to a specific case of a subroutine that runs in nanoseconds, where the speed difference is just noise or an artifact of their specific test case and compiler optimizations. When I was making that generalization, I was speaking not from my perspective, but from that of my users, who then report that their model ran in 70 hours vs 90 hours (where you see an actual time difference).
Almost every project I have seen that got a few million dollars to rewrite/update the Fortran had its entire proposal revolve around how great it would be to rewrite it in full OOP; to me, that money would be better spent enhancing the existing code with new features (or porting it to another language). In fact, I know of one of those projects that is now submitting a new proposal to rewrite the code to remove all the OOP because it suffers from too many runtime issues.