ChatGPT for Programming Numerical Methods

A preprint ChatGPT for Programming Numerical Methods by Ali Kashefi and Tapan Mukerji

examine[s] the capability of ChatGPT for generating codes for numerical algorithms in different programming languages, for debugging and improving written codes by users, for completing missed parts of numerical codes, rewriting available codes in other programming languages, and for parallelizing serial codes. Additionally, we assess if ChatGPT can recognize if given codes are written by humans or machines. To reach this goal, we consider a variety of mathematical problems such as the Poisson equation, the diffusion equation, the incompressible Navier-Stokes equations, compressible inviscid flow, eigenvalue problems, solving linear systems of equations, storing sparse matrices, etc. Furthermore, we exemplify scientific machine learning such as physics-informed neural networks and convolutional neural networks with applications to computational physics.

The languages considered are C, C++, Python, MATLAB, and Julia. The prompts used are listed in the paper; it would be interesting to generate Fortran programs from the same prompts and compare their readability, correctness, and performance with those of the other languages.


Interesting read. There is also this other paper I came across a while back that assesses finite element codes generated via ChatGPT (in Python):

Assessing ChatGPT for coding finite element methods by Giuseppe Orlando

If we had an easy-to-use FEM framework in Fortran, it would be worthwhile to try to replicate that work.

PS I am actually writing a mini FEM Fortran framework.


Just a naive question (because I have no knowledge of how ChatGPT works): to what extent does ChatGPT “understand” the meaning of the requested programming? Does it often make critical mistakes (like a sign error), making it more suited to producing typical or illustrative code examples from a user’s request?

More specifically, I am wondering if ChatGPT is pretty useful for getting “zeroth-order” samples (template-like codes) as a starter for making a working program of interest, particularly when one is not familiar with that language or the problem-specific coding?

Yes, it often makes critical mistakes. I wrote a Python program that repeatedly asks it to code one or more tasks in Fortran, and it stochastically gives right or wrong answers. For example, in response to the prompt

Write a Fortran program to compute Euler’s number using a Taylor series with the number of terms
nterms equal to 1000. Set integer, parameter :: dp = kind(1.0d0) and declare real variables as real(kind=dp). Use :: in declarations. Use implicit none and make sure to declare
all variables.

It came up with 2.718 four times but 2.586 twice, as shown here. Maybe you could have it code a problem many times and keep the version of the code that produces the most common answer? I wonder if having it first describe the algorithm for a problem, and/or having it solve the problem in Python and then asking it to translate the algorithm or the Python code to Fortran, would improve accuracy. It also often fails to declare all variables. Is there a Python script to add missing declarations to a Fortran code? Currently through the API I only have access to gpt-3.5-turbo. When gpt-4 is available in the API, results should improve a lot.
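For reference, the series in that prompt converges to 2.7182818…, so the occasional 2.586 is simply wrong. A minimal Python sketch of the same computation (nterms = 1000 is far more than double precision needs; about 20 terms already suffice):

```python
import math

def euler_number(nterms):
    """Sum the Taylor series e = sum of 1/n! for n = 0 .. nterms-1."""
    total, term = 0.0, 1.0          # term holds 1/n!, starting at 1/0! = 1
    for n in range(nterms):
        total += term
        term /= n + 1               # advance to 1/(n+1)!
    return total

e_approx = euler_number(1000)
# e_approx agrees with math.e to full double precision
```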


It does not understand anything about any programming language. It only knows what the most probable next symbol should be, given the prompt and the set of previous symbols. It is possible it will give you the correct answer, but it’s more likely it will write something that is close to the final answer but needs some manual tweaking. It will depend on how much of the training set included similar problems to what you are running, the more it could train on similar problems the better the final result will be. So common code that gets copied everywhere will work better than code no one has written before solving some problem no one has ever thought about.
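The “most probable next symbol” idea can be illustrated with a toy bigram model, a deliberately crude sketch: real LLMs condition on the whole context with a neural network, but the prediction principle is the same.

```python
from collections import Counter, defaultdict

# A tiny "training corpus" of tokenized Fortran-like code
corpus = "do i = 1 , n ; s = s + a ( i ) ; end do".split()

# Count, for each token, which token follows it
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict(token):
    """Return the continuation of token seen most often in training."""
    return follows[token].most_common(1)[0][0]

# predict("end") returns "do": the model "knows" that "end" is usually
# followed by "do" in the training data, with no notion of what a loop is
```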

You should definitely not assume that it won’t have any sign errors or deeper logic issues. It may not have any errors but it may also have errors.


What it means for an AI system to understand something is a philosophical question. I wrote some small Fortran codes with bugs and asked ChatGPT-4 to identify the problems here. In many cases it answered correctly. You would say that a human that gave those answers has at least some understanding of Fortran.

Some time ago, I tested AI with some simple questions about writing a Fortran subroutine to do something I asked for. What I asked was deliberately somewhat obscure, to see how it would react. It failed miserably (I actually posted an example about that before.) It didn’t ask for more details (as it should have.) It just gave me a ridiculously wrong answer.

Two weeks or so ago, I tested it with Numerical Methods. This time it failed even more. I will give an example here, but trust me, it wasn’t the only one.
Specifically, I asked if the Cash-Karp method has the FSAL (First Same As Last) property. It happily answered that yes, the Cash-Karp method has the FSAL property. This is wrong; it doesn’t. Not only that, but it also added that the property was first introduced in the Fehlberg (RKF) method which, again, is wrong (but I guess I can somehow forgive that, since a modified version of RKF with FSAL was introduced well after the original, although it definitely wasn’t the first method to introduce the FSAL property.)
I asked “Are you sure the RKF method has the FSAL property?” This time it replied with apologies that no, it doesn’t, but there is a modified version which does (although I can safely say that nobody uses it.)

Anyway, back to Cash-Karp: as proof of its claim, it gave me two links. The first one was about the Dormand-Prince method (which does indeed have the FSAL property, but that’s not the method I asked about,) and the second one didn’t mention Cash-Karp at all.
I insisted, asking again: “Are you sure the Cash-Karp method has the FSAL property?” It replied “Yes, I’m sure!” (sic). Then it added some text, basically explaining what the property means. I kept “pushing” by asking it to give me more proof of its claim. This time it gave me three links, two of them dead, and one with a text that mentioned Cash-Karp. The wording was not very good, and a careless reader could interpret it as saying the Cash-Karp method has the FSAL property (although if you read the text carefully, it clearly says it doesn’t.)
I finally replied with “I think you are wrong. The Cash-Karp method does not have the FSAL property, at least not without heavy custom modifications”. This time it replied that “Yes, you are right, I apologize for my misunderstood answer (sic), the Cash-Karp method does not have the FSAL property.” I misunderstood nothing; it clearly gave wrong answers.

Now, suppose I was a student who didn’t know much about those methods and asked AI about them. I would get a wrong answer which, unaware as I would probably be, I could trust to be correct, and I might further ask for a Fortran subroutine implementing the method, which I doubt would be correct or optimal, because, like I said, it failed many times on rather simple Fortran programs.
Oh, and the cherry on top: two weeks later, I asked the same questions as above, assuming the AI learns from its mistakes. It doesn’t seem that it does. I got the same answers as before.

I don’t know about you, but my conclusion was that whoever trusts AI for serious work is shooting his/her own foot with a rocket launcher: even if they miss a bit, the damage will still be devastating. Now, I didn’t use ChatGPT (because it asks to create an account, which I refuse to do,) but as far as I know, the AI I used has the same algorithms behind the scenes. My overall impression was that it acted like a careless fast reader: it collected information from the Internet and combined it to make an answer, but the process was careless, and the result may or may not be a correct answer.

Last but not least, maybe one day AI will be “smart” enough to give better answers, but even then, I wouldn’t trust it blindly. And frankly, I hope this will not happen anytime soon. If the time ever comes when AI is good enough to be trusted, I certainly hope I won’t be around anymore to see it.


GPT-4 now seems to answer the question about Cash-Karp having the FSAL property correctly. My questions are bolded.

**What is the Cash-Karp method?**

The Cash-Karp method is a specific adaptation of the Runge-Kutta method, which is used to solve ordinary differential equations (ODEs).

Here’s a quick primer on the basic concept of Runge-Kutta methods: They’re numerical methods to solve ordinary differential equations, which involve approximating the solutions by iteratively stepping forward in time and estimating the derivative at different points within the time step.

The Cash-Karp method is a specific version of the Runge-Kutta method that employs adaptive step size control. This means that the method adjusts the step size (the increments of the independent variable, often time) based on the local behavior of the function.

This particular method was introduced by J. R. Cash and Alan H. Karp in a 1990 paper. It uses six function evaluations to calculate fourth- and fifth-order accurate solutions. The difference between these solutions is then used as part of the error estimate, which in turn is used to adapt the size of the next step.

By dynamically adapting the step size, the Cash-Karp method can often compute the solution more efficiently, by taking larger steps when the solution is changing slowly, and smaller steps when it’s changing quickly. This can often lead to a more accurate solution with less computational work than methods that use a fixed step size.

As of my last update in September 2021, the Cash-Karp method is widely used in scientific computing, alongside other adaptive Runge-Kutta methods like the related Dormand-Prince method.

**Does the Cash-Karp method has the FSAL (First Same As Last) property?**

The Cash-Karp method does not have the FSAL (First Same As Last) property.

The FSAL property means that the first stage of the method in a given step is the same as the last stage of the method in the previous step. This property can be used to eliminate one function evaluation per step, improving computational efficiency.

Cash-Karp method does not have this property, and thus does not benefit from this potential efficiency. However, an extension of the Cash-Karp method, called the Dormand-Prince method, does have the FSAL property. The Dormand-Prince method modifies the Cash-Karp coefficients to incorporate the FSAL property while maintaining the same order of accuracy.

So, in scenarios where computational efficiency is paramount, the Dormand-Prince method may be preferable to the Cash-Karp method. However, in other scenarios, the differences may be negligible or other considerations may be more important.
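As an aside, the FSAL property discussed in the transcript is easy to demonstrate with a method that does have it. Here is a sketch using the Bogacki-Shampine third-order scheme (not Cash-Karp): the last stage of each step is recycled as the first stage of the next, so only three new function evaluations are needed per step instead of four.

```python
import math

def integrate_fsal(f, t, y, h, nsteps):
    """Fixed-step Bogacki-Shampine RK3 with FSAL reuse.
    Returns the final y and the number of function evaluations."""
    nevals = 1
    k1 = f(t, y)                        # computed once, then recycled
    for _ in range(nsteps):
        k2 = f(t + h / 2, y + h / 2 * k1)
        k3 = f(t + 3 * h / 4, y + 3 * h / 4 * k2)
        y = y + h * (2 / 9 * k1 + 1 / 3 * k2 + 4 / 9 * k3)
        t += h
        k1 = f(t, y)                    # last stage of this step is the
        nevals += 3                     # first stage of the next (FSAL)
    return y, nevals

# integrate y' = -y from 0 to 1; 301 evaluations instead of the 401 a
# non-FSAL four-stage loop would need
y, nevals = integrate_fsal(lambda t, y: -y, 0.0, 1.0, 0.01, 100)
```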

@Beliavsky Thank you for looking it up. As you can see, GPT-4 actually answered correctly. Still, that was just one of the tests I did. And even if it is able to give correct answers to all of them, none should trust AI blindly.

I will go even further. Let’s say the AI becomes so “smart” that it gives correct answers 99.9% of the time. Do we really want that to happen? How many students will study a given topic and use their brains, instead of taking the easy way and letting a machine do their homework? And in general, how many people will use such a “smart” AI to do good rather than bad things?
Now, one could argue that if you throw a rock at somebody’s head, it’s not the rock that’s to blame. Well, yes, but in this case we are just giving everybody a machine gun loaded with dumdum bullets and waiting to see what happens. I can easily predict what will happen. Humans are not angels.

On the other hand, a lazy student who won’t do his/her homework won’t go far just relying on AI, even if the AI becomes good enough that they could rely on it with a very good rate of success. Still, my intuition tells me this thing will do more bad than good. I definitely don’t trust it.

I have been using chatGPT4 for a number of Fortran, Matlab, and Python programming tasks. A few high-level thoughts:

  • It’s good in Fortran, but it excels in Python. From second-hand knowledge, it also excels in JS. Essentially, if a language has a good amount of public code, GPT will perform.
  • It’s amazingly good sometimes, even when it’s wrong.
  • Prompts matter; the prompts in the paper left me wanting (they are very basic).
  • In addition to prompts, pacing and problem decomposition are extremely important; both left me wanting in the paper (they are seriously naive, i.e., nonexistent).

I did a test to see for myself. I only considered GPT4. I did my best prompting, pacing, and directing it.
I tried the following example, page 39 of the paper:
“Please write a Python code for solving the 2D diffusion equation using the Alternating-direction implicit
(ADI) method.”


  • It did stop once (I had to use the “Continue” button).
  • At runtime, it encountered two out-of-bound issues in the tridiagonal_solver implementation, which it solved.
  • I asked it to choose a test; it chose a “Gaussian hill” in the middle of the grid.


I’m happy to elaborate if someone is interested.
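The tridiagonal_solver mentioned above is the usual building block of ADI. For reference, a standalone sketch of the Thomas algorithm in Python (hypothetical code with my own naming, not the generated version):

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system by the Thomas algorithm.
    a = sub-diagonal, b = main diagonal, c = super-diagonal, d = right-hand side;
    a[0] and c[n-1] are unused.  O(n), no pivoting (assumes diagonal dominance)."""
    n = len(b)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                       # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):              # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# a 3x3 system with the classic 1D Laplacian stencil; solution is approx [1, 2, 3]
x = thomas([0.0, -1.0, -1.0], [2.0, 2.0, 2.0], [-1.0, -1.0, 0.0], [0.0, 0.0, 4.0])
```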


Are the wrong answers indeed stochastic (in terms of their appearance in the sequence)? From my (pretty sparse) experience with ChatGPT I have an impression that if you ask it the very same question again, it tries to change something. That could mean that it implicitly assumes that the previous answer was considered wrong by the user. If this is really so, it could easily lead from an essentially good answer to a next being wrong.

My Python program starts a new chat each time with the same prompt, in which case the generated programs are stochastic. I think there would be dependence if you repeated a question within a chat. Something I want to figure out is how to use the API to continue a chat. A simple improvement on the current Python script would be to ask ChatGPT, after it generated a code, “Have you declared all the variables in your program? If not, please do so.”
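Continuing a chat through the API amounts to resending the accumulated message history with each request, since the chat-completions endpoint is stateless. A minimal sketch, with the actual network call replaced by a stub (fake_completion and Chat are hypothetical names; in real code the full self.messages list would be sent to the chat-completions endpoint):

```python
def fake_completion(messages):
    # stand-in for the real API call, so the example is self-contained
    return "reply to: " + messages[-1]["content"]

class Chat:
    """Keep the message history so each new question continues the same chat."""
    def __init__(self, system_prompt):
        self.messages = [{"role": "system", "content": system_prompt}]

    def ask(self, user_text):
        self.messages.append({"role": "user", "content": user_text})
        reply = fake_completion(self.messages)   # real code: send self.messages
        self.messages.append({"role": "assistant", "content": reply})
        return reply

chat = Chat("You are a Fortran programming assistant.")
chat.ask("Write a program to compute Euler's number.")
chat.ask("Have you declared all the variables? If not, please do so.")
# the second request carries the whole history, so the model sees its own code
```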

Now I have a Python function

def debug_fortran(code, compiler="gfortran", compiler_options=["-std=f2018","-Wall"],
    niter_fix=1, gpt_model="gpt-3.5-turbo-0613", print_query=False,
    print_code_sent=False, print_gpt_output=False, run=True):
    """ get a Fortran code to compile by iteratively sending compiler error messages
    to ChatGPT and asking for fixes """
    new_code = code
    if print_code_sent:
        print("\ncode sent to debug_fortran:\n" + code)
    errmsg = ""
    for iter in range(1, niter_fix+1):
        print("\niteration ", iter, ":", sep="")
        errmsg = compiler_output(new_code, compiler, compiler_options)
        if errmsg:
            print("error message:\n", errmsg)
            new_output = fix_fortran(new_code, errmsg)
            if print_gpt_output:
                print("\nresult from fix_fortran:\n" + new_output)
            new_code = markdown_to_fortran(new_output, strip_blanks=True)
            print("\nnew code:\n" + new_code)
        else:
            print("no compiler error messages\n")
            break
    if run:
        compile_and_run_fortran(new_code, compiler=compiler)
    return new_code, errmsg

that takes a Fortran source file, compiles it with gfortran (by default), passes the error messages (if any) and source code to ChatGPT asking for a fix, and repeats this process until the code compiles or the maximum number of iterations is reached. Common errors such as undeclared variables are typically fixed this way. Any warning message produced by the compiler is also considered an error. The project is here.

Another function

def run_exec_debug_fortran(exec_name, code, print_stderr=True, print_stdout=True,
    print_gpt_output=False):
    """ run an executable and debug code if it crashes """
    result = subprocess.run([exec_name], text=True, capture_output=True)
    new_code = code
    if result.returncode != 0 and print_stderr:
        run_time_errmsg = truncated_string(result.stderr, "Error termination").strip()
        print(f'Execution failed with the following output:\n{run_time_errmsg}')
        new_output = fix_fortran(code, run_time_errmsg)
        if print_gpt_output:
            print("\nresult from fix_fortran():\n" + new_output)
        new_code = markdown_to_fortran(new_output, strip_blanks=True)
    elif print_stdout:
        print("\nprogram output:\n" + result.stdout)
    return result.returncode, new_code

tries to fix a code based on the run-time error message produced when the executable crashes.

Compilation errors and run-time errors cannot be ignored, but logic errors are harder to detect. For this ChatGPT can be combined with test-driven development.
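A minimal sketch of that combination: run the generated routine against a small test suite and turn any failures into a message that could be sent back to ChatGPT (failure_report and the buggy routine are hypothetical):

```python
def failure_report(func, cases):
    """Run func on (args, expected) pairs; return a message describing failures,
    suitable for feeding back to the model as a fix-it prompt."""
    lines = []
    for args, expected in cases:
        got = func(*args)
        if got != expected:
            lines.append(f"for input {args} expected {expected} but got {got}")
    return "\n".join(lines)

# a deliberately buggy "generated" floor-division routine (wrong for negative a)
def buggy_floor_div(a, b):
    return a // b if a >= 0 else -(-a // b)

cases = [((7, 2), 3), ((-7, 2), -4)]
report = failure_report(buggy_floor_div, cases)
# report describes the failing negative-input case; it is empty for a correct routine
```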

In this preprint Fortran is one of the languages studied:

Evaluation of OpenAI Codex for HPC Parallel Programming Models Kernel Generation
by William F. Godoy, Pedro Valero-Lara, Keita Teranishi, Prasanna Balaprakash, Jeffrey S. Vetter

We evaluate AI-assisted generative capabilities on fundamental numerical kernels in high-performance computing (HPC), including AXPY, GEMV, GEMM, SpMV, Jacobi Stencil, and CG. We test the generated kernel codes for a variety of language-supported programming models, including (1) C++ (e.g., OpenMP [including offload], OpenACC, Kokkos, SyCL, CUDA, and HIP), (2) Fortran (e.g., OpenMP [including offload] and OpenACC), (3) Python (e.g., numba, Numba, cuPy, and pyCUDA), and (4) Julia (e.g., Threads, CUDA.jl, AMDGPU.jl, and KernelAbstractions.jl). We use the GitHub Copilot capabilities powered by OpenAI Codex available in Visual Studio Code as of April 2023 to generate a vast amount of implementations given simple <kernel> + <programming model> + <optional hints> prompt variants. To quantify and compare the results, we propose a proficiency metric around the initial 10 suggestions given for each prompt. Results suggest that the OpenAI Codex outputs for C++ correlate with the adoption and maturity of programming models. For example, OpenMP and CUDA score really high, whereas HIP is still lacking. We found that prompts from either a targeted language such as Fortran or the more general-purpose Python can benefit from adding code keywords, while Julia prompts perform acceptably well for its mature programming models (e.g., Threads and CUDA.jl). We expect for these benchmarks to provide a point of reference for each programming model’s community. Overall, understanding the convergence of large language models, AI, and HPC is crucial due to its rapidly evolving nature and how it is redefining human-computer interactions.


Fortran is of particular interest in this analysis owing to its importance in HPC and scientific computing. Despite not being a mainstream language in terms of code availability, Copilot can provide some good results because of Fortran’s domain-specific nature and legacy.

As shown in Table 3, using an “optimized” prompt and the subroutine keyword is particularly beneficial in this case. Not using it leads to very poor results, with the AXPY OpenMP case being the only exception due to its simplicity and availability. We observe a trend similar to the one we saw in the C++ case: the more mature solutions such as OpenMP and OpenACC provide better results for parallel codes that use Fortran.

The generated codes are in the GitHub repository keitaTN/Copilot-hpc-kernels.


I’ve just come across this page:

I’ve recently created an account for chatGPT (very late :sweat_smile:) so will try to ask something :slight_smile:

Yes, in my Python scripts setting gpt_model = "gpt-4" in the main program now works. If there are tasks that people want ChatGPT to code, please post. I can post a transcript on GitHub.

ChatGPT, like a human programmer, makes many trivial mistakes, such as not declaring variables or declaring variables that are never used. My scripts get it to correct its mistakes by feeding it the error and warning messages from gfortran, but even the shortest prompt to ChatGPT can take a few seconds to be answered. I think the most efficient way to use ChatGPT for coding at scale will be to have external programs that fix mistakes in code it writes based on compiler messages.

I wonder how well it will implement a parallel 2D Laplace equation solver with boundary conditions using MPI, specifically prompting it to perform minimal communication (only sending/receiving ghost regions). This is like the simplest meaningful parallel programming exercise, as opposed to an MPI “Hello, World” code or calculating pi using a simplified Monte Carlo-like method.
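The communication pattern can be mimicked serially, which also gives a reference answer to check a generated MPI code against: split the grid into two strips, give each strip one ghost row, and copy only those rows between “ranks” before every Jacobi sweep. A sketch without actual MPI (with mpi4py, the two copies would become send/receive calls between neighbouring ranks):

```python
N, niters = 10, 25

def sweep(v):
    """One Jacobi sweep of the 2D Laplace equation on the interior of v."""
    rows, cols = len(v), len(v[0])
    vn = [row[:] for row in v]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            vn[i][j] = 0.25 * (v[i-1][j] + v[i+1][j] + v[i][j-1] + v[i][j+1])
    return vn

# reference: whole grid on one "rank"; top edge held at 1.0 (Dirichlet BC)
ref = [[1.0] * N] + [[0.0] * N for _ in range(N - 1)]
for _ in range(niters):
    ref = sweep(ref)

# decomposed: "rank 0" owns rows 0..4, "rank 1" owns rows 5..9, each plus one ghost row
top = [[1.0] * N] + [[0.0] * N for _ in range(N // 2)]   # 5 owned rows + ghost at the end
bot = [[0.0] * N for _ in range(N // 2 + 1)]             # ghost at the start + 5 owned rows
for _ in range(niters):
    top[-1] = bot[1][:]     # "receive" the neighbour's first owned row into the ghost
    bot[0] = top[-2][:]     # "receive" the neighbour's last owned row into the ghost
    top, bot = sweep(top), sweep(bot)

# dropping the ghost rows and stacking the strips reproduces the reference result
assembled = top[:-1] + bot[1:]
```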


The Code Interpreter plug-in is now available to all paid accounts ($20/month). It integrates a Python interpreter with ChatGPT, so if you ask it to code X in Python, it will then run the code it creates and iterate until it succeeds. You can then ask it to translate the code to Fortran.

I’ve been using Code Interpreter since Friday and I think it’s amazing.

I haven’t tried doing the Numerical Methods exercise though, just fed it whole files of Python, and it was happy to digest and critique in full.