AI Coding Assistants vs. Codee — Insights on Fortran Correctness and Modernization

Hi everyone,

At Codee, we recently shared two comparative analyses exploring how current AI coding assistants perform on Fortran code, compared to Codee’s compiler-based tools.

1 - Code Formatting

We assessed how the Codee Formatter and AI assistants like ChatGPT, Claude, and Gemini handle the modernization of legacy Fortran 77 code, focusing on improving formatting and readability:

  • AI assistants often struggled with large source files, sometimes introducing unintended semantic changes or breaking compilation.
  • In contrast, the Codee Formatter processed Fortran files almost instantly, ensuring the original logic and structure were preserved by relying on its compiler-based technology.

Read the full article for more details: “Codee Formatter vs. AI Coding Assistants: A Focus on Fortran Modernization”.

2 - Performance Optimization

We also reviewed findings from the paper “Comprehensive Evaluation of LLMs in HPC Code Performance Optimization” by B. Cui, T. Ramesh, and K. Zhou (George Mason University) and O. Hernandez (Oak Ridge National Laboratory). The authors compared the Codee Analyzer with AI assistants such as ChatGPT, Claude, and Llama in HPC code optimization:

  • AI assistants were able to suggest meaningful optimizations and achieve performance speedups. However, in several benchmarks they produced code that failed to compile, crashed, or even generated incorrect results.
  • On the other hand, the deterministic static analysis of the Codee Analyzer consistently generated correct and compilable optimizations.

Read the full article for more details: “Codee Analyzer vs. AI Coding Assistants: A Focus on Correctness in Fortran/C/C++”.

As a general takeaway, AI assistants are valuable for creative and exploratory tasks, such as prototyping new code. However, when code correctness and reproducibility are essential, such as in scientific computing, they can pose risks if not carefully supervised by experienced developers. That’s where deterministic, compiler-grade tools remain a reliable foundation for development workflows.

We’d be interested to hear your thoughts:

  • How do you see AI assistants fitting into Fortran development?
  • Have you tried using AI tools for Fortran development?

— The Codee Team


Isn’t this exactly what one would expect? LLMs are not reasoning; they are “stochastic parrots”, so unless trained exclusively on Fortran-specific data, one would expect all sorts of nonsense to get mixed in. That they work at all is the amazing thing; that they cheerfully produce crap isn’t.

Another very public example: ChatGPT gets the law wrong.


> LLMs are not reasoning, they are “stochastic parrots” so unless trained exclusively on Fortran specific data, one would expect all sorts of nonsense to get mixed in.

Indeed! Given the rapid adoption of AI coding assistants in recent years, our goal was simply to highlight the continued importance of deterministic, specialized coding tools for assisting developers in certain activities.

On that note, has anyone seen any ongoing efforts to tune LLMs specifically for Fortran code?

Having retired (at least for now) I don’t have a sizable code base to train one on. There are enough self hosted LLMs that I’d have thought the usual players (government labs, etc.) would have plenty of code to play with.

The “obvious” next step is to tie the toolchains together, the LLM based assistant could propose, and the semantically aware toolchain could validate, and should be able to send it back for rewrite until it is at least acceptable. No doubt that still leaves room for terrible numerics (existing code bases are generally going to be standard floating point, rather than intervals or unums, so automated proofs of numerical equivalence are probably infeasible).
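The propose-and-validate loop described above can be sketched in a few lines. This is only an illustration of the idea, not a real integration: `ask_llm` is a hypothetical stand-in for any assistant API, and the validator here is just a gfortran syntax check, which says nothing about numerical equivalence.

```python
# Sketch of an LLM-propose / compiler-validate loop. `ask_llm` is a
# HYPOTHETICAL callable standing in for any assistant API; the default
# validator only checks that gfortran accepts the source (syntax-only),
# not that the rewrite is semantically or numerically equivalent.
import os
import subprocess
import tempfile

def compiles(fortran_source: str) -> bool:
    """Return True if gfortran accepts the source (syntax-only check)."""
    with tempfile.NamedTemporaryFile("w", suffix=".f90", delete=False) as f:
        f.write(fortran_source)
        path = f.name
    try:
        result = subprocess.run(["gfortran", "-fsyntax-only", path],
                                capture_output=True, text=True)
        return result.returncode == 0
    finally:
        os.unlink(path)

def refine(original: str, ask_llm, validate=compiles, max_rounds=3):
    """Ask the assistant for a rewrite; send it back until it validates."""
    candidate = ask_llm(f"Optimize this Fortran code:\n{original}")
    for _ in range(max_rounds):
        if validate(candidate):
            return candidate      # accepted by the toolchain
        candidate = ask_llm(
            f"That version failed validation; try again:\n{candidate}")
    return original               # give up: keep the known-good code
```

The key design point is the fallback: if no candidate ever validates, the loop returns the original code rather than the assistant's last guess.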


I kinda doubt it. I suspect that a lot of that code is NOT something that would be released for training. It feels more like Claude in particular was trained on Stack Overflow, with all the questions and examples from people whose code had flaws.

I pretty much spent 2 months and over 3 billion Claude tokens, and almost everything it wanted to do to speed up code made things slower. It did do well on unit tests, mixed C/F90 interoperability, and documentation.

I almost exclusively use LLMs with Fortran, and the more I use them, and the more I refine my markdown files describing what I like and want, the fewer errors they make. See here for the base of my “good Fortran” markdown.

Whenever I optimize code I do profiler-based optimization, so I know where my hot loops are. Using the profiler’s output (cache misses, flop rate, etc.), the LLM is able to make some good optimizations, but it is limited. The more you know, the more you can get it to do. For example, you get more from “The profiler says we have a lot of thread divergence due to branching; can you explore some optimization tricks that would reduce branching?” than from “Can you optimize my code?”, and having tests that validate everything makes it simple.
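As a toy illustration of the branch-reduction transformation mentioned above (shown here in Python just to keep the sketch short; in Fortran the analogous tool would be the `merge()` intrinsic, which gives every loop iteration the same instruction path on a GPU):

```python
# Toy sketch of branch reduction, assuming the branchy version is the
# original hot loop. Note the branchless form evaluates both
# coefficients on every iteration, so this only pays off when the two
# arms are cheap and divergence is the real cost.
def branchy(x, a, b):
    # original form: each element takes one of two paths
    return [a * v if v > 0.0 else b * v for v in x]

def branchless(x, a, b):
    # select the coefficient arithmetically: (v > 0.0) is 0 or 1,
    # so every element executes the same instructions
    return [(b + (a - b) * (v > 0.0)) * v for v in x]
```

Both produce identical results; the point is only that the second form removes the data-dependent branch, which is the kind of rewrite one can ask the assistant to explore once the profiler has identified divergence as the bottleneck.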

There are tips and tricks for everything; I learn new things every day. I am exclusively writing GPU-accelerated code using OpenACC and do concurrent, and it is pretty good at that, tbh.


I believe that if you went back to a 1960 university and asked to see a “computer scientist”, you would most likely be pointed to someone working on numerical analysis (the “study of algorithms for the problems of continuous mathematics” in Nick Trefethen’s memorable definition). That field has not exploded (in terms of numbers of practitioners) like the rest of computer-related endeavours. From O(1/2), in terms of the fraction of “computer scientists”, it has become minuscule. I suspect that the reliability of AI contributions is proportional to the quantity of high-quality material that the AI system has trained on, and for numerical analysis that quantity is simply not there. Fortran being the natural home of “number-crunching”, in the brutal terminology that is prevalent, you may think twice before handing over responsibilities to an AI system for tasks requiring numerical analysis.

Would you even know if the AI suggestion destroyed the carefully built stability or convergence of your algorithm?
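As a concrete instance of that risk (a Python sketch, with `math.fsum` standing in for a carefully built compensated summation): replacing careful accumulation with a naive loop is exactly the kind of “simplification” an assistant might propose, and the tests would still pass on benign inputs.

```python
# A "faster, simpler" naive sum can quietly destroy accuracy that
# careful summation preserved. math.fsum stands in here for the
# carefully built numerics; the input is chosen to expose cancellation.
import math

def naive_sum(xs):
    total = 0.0
    for v in xs:
        total += v    # each 1.0 vanishes next to 1e16 (ulp there is 2.0)
    return total

xs = [1e16, 1.0, -1e16] * 1000
print(naive_sum(xs))   # prints 0.0 -- every 1.0 was lost to rounding
print(math.fsum(xs))   # prints 1000.0 -- the exact answer
```

On a test set without large cancellations the two functions agree to the last bit, which is precisely why one might not know the stability was gone.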

100%

Having some end-to-end confidence tests, from before that was fashionable, does allow for knowing when the AI has destroyed things.

Almost every attempt to “improve things” was a disaster for the AI. But I now have a bevy of unit tests to augment the previous end to end tests, and documentation thanks to the AI.

In terms of convergence I did spend days battling the AI. The code was solid before, just a bit slow; I got a bit of a speedup and a testing framework to tune it up (but it’s still an MCMC approach). The AI wanted to do spline fits and gawd knows what, which I knew was not going to work.

I pretty much learned, after the 3.2 billion tokens, that Claude is most helpful for code organisation, testing frameworks, documentation, and helping with the CI/CD pipeline, and NOT for complicated .F90 algorithms.


Hi Jorge,

This is very similar to my experience. In my daily workflow, AI agents are tireless collaborators who improve day by day, in all areas: code review, optimization, documentation, etc.

Thank you for sharing your experience and AI knowledge memory (mine is here: GitHub - szaghi/dotfiles).

My best regards
Stefano

> I believe that if you went back to a 1960 university and asked to see a “computer scientist”, you would most likely be pointed to someone working on numerical analysis

In 1960 I went from a university with no computer to one where some people were writing an operating system easier to use than IBM’s horrible JCL and others were doing numerical analysis.