Benchmarking Large Language Models

The LLMs that can generate Fortran code from a natural language prompt include at least

ChatGPT
GitHub Copilot
Perplexity
Groq
Claude
Mistral
Vertex AI with Gemini from Google
Llama 3 on Meta AI

(I have tried all of these except Copilot, which works only within specific IDEs.) Are there others? It would be interesting to assemble a benchmark of English prompts asking an LLM to generate Fortran code and to compare the LLMs on speed and code correctness. Maybe tasks from Project Euler, Rosetta Code, or Advent of Code? Comparisons could also be made across programming languages; in the future it may matter more that an LLM can write correct code in a language than that humans can. “One-shot” tests, where the LLM must produce correct code from a single prompt, could be supplemented by multi-shot tests, where the LLM is shown compiler error messages, informed of run-time errors, given unit tests that verify the program, and allowed to iterate. Ideally the LLM would also incorporate a search engine, so that it can link to existing online code for a task when it exists.
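To make the multi-shot idea concrete, here is a minimal sketch of such a loop in Python. The `ask_llm` and `check_output` helpers are hypothetical placeholders for whichever model API and per-task answer check are being benchmarked, and gfortran is just one possible compiler choice:

```python
import pathlib
import subprocess
import tempfile

MAX_ROUNDS = 3  # one initial attempt plus two repair rounds

def ask_llm(prompt: str) -> str:
    """Hypothetical wrapper around whichever LLM is being benchmarked."""
    raise NotImplementedError

def check_output(stdout: str) -> bool:
    """Hypothetical check of the program output against the task's expected answer."""
    raise NotImplementedError

def multi_shot(task_prompt: str) -> bool:
    """Return True if the generated Fortran program compiles, runs, and passes its check."""
    prompt = task_prompt
    for _ in range(MAX_ROUNDS):
        source = ask_llm(prompt)
        workdir = pathlib.Path(tempfile.mkdtemp())
        src = workdir / "main.f90"
        exe = workdir / "main"
        src.write_text(source)
        build = subprocess.run(["gfortran", str(src), "-o", str(exe)],
                               capture_output=True, text=True)
        if build.returncode != 0:
            # Multi-shot step 1: feed the compiler diagnostics back and retry.
            prompt = task_prompt + "\nThe previous attempt failed to compile:\n" + build.stderr
            continue
        run = subprocess.run([str(exe)], capture_output=True, text=True, timeout=30)
        if run.returncode != 0:
            # Multi-shot step 2: report the run-time failure and retry.
            prompt = task_prompt + "\nThe previous attempt failed at run time:\n" + run.stderr
            continue
        return check_output(run.stdout)
    return False
```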

The ability of LLMs to translate code from Python/NumPy or Matlab to other languages could also be compared.

HumanEval (GitHub: openai/human-eval, code for the paper "Evaluating Large Language Models Trained on Code") is one of the commonly cited (Python) code benchmarks used to evaluate LLMs. A Fortran version of this benchmark may be a good starting point - I’d need to look a bit closer to see how feasible this is.
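For reference, HumanEval reports pass@k, and its unbiased estimator from the paper is short enough to quote. The Fortran task record below is only a hypothetical sketch of how the JSONL fields might be adapted; unlike Python, the completion cannot simply be executed, so a real harness would have to compile the completed module and link it against a test driver:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    n samples generated for a task, c of them passed, budget of k."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical Fortran task record mirroring HumanEval's JSONL fields.
fortran_task = {
    "task_id": "FortranEval/0",
    "prompt": ("module mean_mod\ncontains\n"
               "  ! Return the arithmetic mean of a real array.\n"
               "  pure function mean(x) result(m)\n"
               "    real, intent(in) :: x(:)\n"
               "    real :: m\n"),
    "canonical_solution": ("    m = sum(x) / size(x)\n"
                           "  end function mean\nend module mean_mod\n"),
    "test": ("program test_mean\n  use mean_mod\n"
             "  if (abs(mean([1.0, 2.0, 3.0]) - 2.0) > 1.0e-6) error stop\n"
             "end program test_mean\n"),
}

print(pass_at_k(n=20, c=7, k=5))  # e.g. 7 of 20 samples passed
```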

I can help with this if you want to discuss.

Edit: a benchmark based on HumanEval has been done for COBOL (GitHub: BloopAI/COBOLEval, "Evaluate LLM-generated COBOL").


From my personal perspective, Claude 3 Sonnet seems superior to ChatGPT 3.5 and Gemini for Fortran tasks. I haven’t done any systematic benchmarks though. I only prompt in Polish, never in English, and Claude appears to generate the best Fortran code even though its Polish is noticeably the weakest of these models.

By the way, I really recommend trying https://lmstudio.ai/. There is a vast selection of models that can be downloaded and run locally from LM Studio. Edit: if a downloaded model doesn’t load in LM Studio, you may need to deactivate GPU Acceleration in the Hardware Settings.
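For anyone who wants to script benchmarks against a locally hosted model, LM Studio can expose a local server that speaks the OpenAI chat-completions protocol. A minimal sketch follows; the port (1234 is the default), the placeholder API key, and the model name all depend on what is configured and loaded locally:

```python
from openai import OpenAI

# Point the standard OpenAI client at LM Studio's local server.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="local-model",  # whichever model is loaded in LM Studio
    messages=[{
        "role": "user",
        "content": "Write a Fortran function that returns the dot product of two real arrays.",
    }],
    temperature=0.2,
)
print(response.choices[0].message.content)
```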

A recent preprint says that translating Fortran to Python/JAX to exploit GPUs can greatly increase speed.

[Submitted on 13 Feb 2024]

Proof-of-concept: Using ChatGPT to Translate and Modernize an Earth System Model from Fortran to Python/JAX

by Anthony Zhou, Linnia Hawkins, Pierre Gentine

Earth system models (ESMs) are vital for understanding past, present, and future climate, but they suffer from legacy technical infrastructure. ESMs are primarily implemented in Fortran, a language that poses a high barrier of entry for early career scientists and lacks a GPU runtime, which has become essential for continued advancement as GPU power increases and CPU scaling slows. Fortran also lacks differentiability - the capacity to differentiate through numerical code - which enables hybrid models that integrate machine learning methods. Converting an ESM from Fortran to Python/JAX could resolve these issues. This work presents a semi-automated method for translating individual model components from Fortran to Python/JAX using a large language model (GPT-4). By translating the photosynthesis model from the Community Earth System Model (CESM), we demonstrate that the Python/JAX version results in up to 100x faster runtimes using GPU parallelization, and enables parameter estimation via automatic differentiation. The Python code is also easy to read and run and could be used by instructors in the classroom. This work illustrates a path towards the ultimate goal of making climate models fast, inclusive, and differentiable.
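The paper’s translated photosynthesis code is not reproduced here, but a toy JAX sketch illustrates the two properties the abstract emphasizes: jit compilation, which runs on a GPU when one is available, and automatic differentiation for parameter estimation. The response function and its single parameter below are made up for illustration and are not the CESM model:

```python
import jax
import jax.numpy as jnp

# Toy stand-in for a translated model component: a smooth light-response
# curve with one tunable parameter (purely illustrative).
@jax.jit
def response(vmax, light):
    return vmax * light / (light + 1.0)

def loss(vmax, light, observed):
    return jnp.mean((response(vmax, light) - observed) ** 2)

light = jnp.linspace(0.1, 10.0, 100)
observed = 2.0 * light / (light + 1.0)   # synthetic "observations" with vmax = 2.0

grad_loss = jax.grad(loss)               # differentiate through the numerical code
vmax = 0.5
for _ in range(200):                     # simple gradient-descent parameter estimation
    vmax = vmax - 0.1 * grad_loss(vmax, light, observed)
print(vmax)                              # converges toward the true value 2.0
```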

The only comparable result in that paper with respect to Fortran is that Numba and JAX are about one order of magnitude slower than Fortran.


A preprint from 21 May 2024 is

Evaluating AI-generated code for C++, Fortran, Go, Java, Julia, Matlab, Python, R, and Rust
by Patrick Diehl, Noujoud Nader, Steve Brandt, Hartmut Kaiser

This study evaluates the capabilities of ChatGPT versions 3.5 and 4 in generating code across a diverse range of programming languages. Our objective is to assess the effectiveness of these AI models for generating scientific programs. To this end, we asked ChatGPT to generate three distinct codes: a simple numerical integration, a conjugate gradient solver, and a parallel 1D stencil-based heat equation solver. The focus of our analysis was on the compilation, runtime performance, and accuracy of the codes. While both versions of ChatGPT successfully created codes that compiled and ran (with some help), some languages were easier for the AI to use than others (possibly because of the size of the training sets used). Parallel codes – even the simple example we chose to study here – [were] also difficult for the AI to generate correctly.

…

6 Discussion and Conclusion

In this work we have conducted an evaluation of three computational problems using ChatGPT versions 3.5 and 4.0 for code generation using a range of programming languages. We evaluated the compilation, runtime errors, and accuracy of the codes that were produced. We tested their accuracy, first with a basic numerical integration, the[n] with a conjugate gradient solver, and finally with a 1D stencil-based heat equation solver.

For the numerical integration example, codes generated by both versions compiled successfully in all languages except Fortran, and executed without any runtime errors. However, the accuracy of the outputs from the ChatGPT 4.0-generated codes was incorrect, possibly due to the misinterpretation of the keyword “area” in the prompt. In the case of the conjugate gradient solver, all generated codes compiled successfully with the exceptions of Fortran and Rust. Despite these compilation issues, the resultant codes from all other languages produced correct results, except for R. The parallel 1D heat problem proved to be the most challenging for the AI. Compilation errors were noted in the codes for Fortran, Rust, and C++. Furthermore, a majority of the generated codes encountered runtime errors, and most failed to produce correct results, indicating substantial issues with the implementation logic or the handling of parallel computing constructs by the AI code generator models.

We then analyzed the lines of code for all the generated codes, and the code quality using the COCOMO metric. The analysis of lines of code across all examples showed that Matlab and R consistently produced the lowest lines-of-code values, followed by Python, Julia, and Fortran (Section 5). In terms of code quality, C++ and Java consistently demonstrated robustness across all the examples tested, followed by Matlab. These languages appear to offer a balance between code quality and complexity, making them suitable choices for more complex computational tasks.
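For context, the basic COCOMO model estimates development effort from lines of code alone; the constants below are the textbook organic-mode values, and the paper may use a different variant or coefficients:

```python
def cocomo_basic_effort(kloc: float, a: float = 2.4, b: float = 1.05) -> float:
    """Basic COCOMO, organic mode: estimated effort in person-months from
    thousands of lines of code. a and b are the textbook organic-mode
    constants; other project classes (and the paper) may use other values."""
    return a * kloc ** b

print(cocomo_basic_effort(0.2))  # a 200-line solver: roughly 0.44 person-months
```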


Hi there, do you know of any new benchmarks for Fortran specifically?