Benchmarking Large Language Models

The LLMs that can generate Fortran code from a natural language prompt include at least

GitHub Copilot
Vertex AI with Gemini from Google
Llama 3 on Meta AI

(I have tried all except Copilot, which works with specific IDEs.) Are there others?

It would be interesting to put together a benchmark of English prompts asking an LLM to generate Fortran code, and to compare the LLMs on speed and code correctness. The tasks could come from Project Euler, Rosetta Code, or Advent of Code. Comparisons could also be made across programming languages. In the future it may matter more that an LLM can write correct code in a language than that humans can.

“One-shot” tests of whether an LLM can generate correct code from a single prompt can be supplemented by multi-shot tests, in which the LLM is given compiler error messages, is informed of run-time errors, is given unit tests that verify the program, and is allowed to iterate. Ideally an LLM would incorporate a search engine, so that it can link to existing online code for a task.
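To make the one-shot versus multi-shot idea concrete, here is a rough sketch of how such a loop might look in a Python test harness. Everything in it is an assumption for illustration: `run_multi_shot`, `Outcome`, `fake_generate`, and `fake_check` are hypothetical names, and the two `fake_*` functions only stand in for a real LLM call and a real compile-and-test step so that the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Outcome:
    solved: bool    # did any attempt pass the check?
    attempts: int   # number of generations actually used
    feedback: str   # last error text ("" if solved)


def run_multi_shot(
    prompt: str,
    generate: Callable[[str, str], str],
    check: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 3,
) -> Outcome:
    """Ask for code, check it, and feed any errors back until it passes.

    generate(prompt, feedback) returns source code; feedback is "" on the
    first attempt. check(code) returns (ok, error_text). In a real harness,
    check would compile the code and run unit tests (assumption).
    """
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        code = generate(prompt, feedback)
        ok, feedback = check(code)
        if ok:
            return Outcome(True, attempt, "")
    return Outcome(False, max_attempts, feedback)


# Stand-ins so the sketch runs without an LLM or a Fortran compiler.
def fake_generate(prompt: str, feedback: str) -> str:
    if not feedback:
        # First try: forgets the closing statement.
        return 'program hello\n  print *, "hi"\n'
    # Retry after seeing the error message.
    return 'program hello\n  print *, "hi"\nend program hello\n'


def fake_check(code: str) -> Tuple[bool, str]:
    if "end program" in code:
        return True, ""
    return False, "Error: expected END PROGRAM statement"


task = "Write a Fortran program that prints hi"
one_shot = run_multi_shot(task, fake_generate, fake_check, max_attempts=1)
multi_shot = run_multi_shot(task, fake_generate, fake_check, max_attempts=3)
print(one_shot.solved, multi_shot.solved, multi_shot.attempts)
```

With `max_attempts=1` this reduces to the one-shot test, so the same harness could report both numbers for each model.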

The ability to translate code in Python/NumPy or Matlab to other languages can also be compared.

HumanEval (GitHub: openai/human-eval, code for the paper "Evaluating Large Language Models Trained on Code") is one of the commonly cited Python code benchmarks used to evaluate LLMs. A Fortran version of this benchmark may be a good starting point. I’d need to look a bit closer to see how feasible this is.
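For reference, the HumanEval paper scores models with pass@k: generate n samples per task, count the c samples that pass the unit tests, and estimate the chance that at least one of k randomly drawn samples passes as 1 - C(n-c, k)/C(n, k). A Fortran port could presumably reuse the same metric. Below is a minimal sketch of that formula; the function name `pass_at_k` and the argument checks are my own, and this is not the code from the openai/human-eval repository.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n samples, c of which passed.

    Implements 1 - C(n - c, k) / C(n, k): one minus the probability that
    k samples drawn without replacement from the n are all failures.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed,
    # in which case every draw of k contains a passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over tasks; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts for three tasks, 20 samples each.
example = [(20, 5), (20, 0), (20, 20)]
print(mean_pass_at_k(example, k=1))
```

For k = 1 the estimate is just the fraction of passing samples, c/n, which makes a quick sanity check.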

I can help with this if you want to discuss.

Edit: a benchmark based on HumanEval has been done for COBOL (GitHub: BloopAI/COBOLEval, "Evaluate LLM-generated COBOL").

From my personal perspective, Claude 3 Sonnet seems superior to ChatGPT 3.5 and Gemini for Fortran tasks. I haven’t done any systematic benchmarks though. I only prompt in Polish, never in English, and Claude appears to generate the best Fortran code even though its Polish is quite evidently the weakest of these models.

By the way, I really recommend trying LM Studio: it offers a vast selection of models that can be downloaded and run locally. Edit: If a downloaded model doesn’t load in LM Studio, you may need to deactivate GPU Acceleration in the Hardware Settings.

A recent preprint says that translating Fortran to Python/JAX to exploit GPUs can greatly increase speed.

[Submitted on 13 Feb 2024]

Proof-of-concept: Using ChatGPT to Translate and Modernize an Earth System Model from Fortran to Python/JAX

by Anthony Zhou, Linnia Hawkins, Pierre Gentine

Earth system models (ESMs) are vital for understanding past, present, and future climate, but they suffer from legacy technical infrastructure. ESMs are primarily implemented in Fortran, a language that poses a high barrier of entry for early career scientists and lacks a GPU runtime, which has become essential for continued advancement as GPU power increases and CPU scaling slows. Fortran also lacks differentiability - the capacity to differentiate through numerical code - which enables hybrid models that integrate machine learning methods. Converting an ESM from Fortran to Python/JAX could resolve these issues. This work presents a semi-automated method for translating individual model components from Fortran to Python/JAX using a large language model (GPT-4). By translating the photosynthesis model from the Community Earth System Model (CESM), we demonstrate that the Python/JAX version results in up to 100x faster runtimes using GPU parallelization, and enables parameter estimation via automatic differentiation. The Python code is also easy to read and run and could be used by instructors in the classroom. This work illustrates a path towards the ultimate goal of making climate models fast, inclusive, and differentiable.

The only comparable result in that paper w.r.t. Fortran is that Numba and JAX are about one order of magnitude slower than Fortran.
