A preprint from 21 May 2024:
Evaluating AI-generated code for C++, Fortran, Go, Java, Julia, Matlab, Python, R, and Rust
by Patrick Diehl, Noujoud Nader, Steve Brandt, Hartmut Kaiser
This study evaluates the capabilities of ChatGPT versions 3.5 and 4 in generating code across a diverse range of programming languages. Our objective is to assess the effectiveness of these AI models for generating scientific programs. To this end, we asked ChatGPT to generate three distinct codes: a simple numerical integration, a conjugate gradient solver, and a parallel 1D stencil-based heat equation solver. The focus of our analysis was on the compilation, runtime performance, and accuracy of the codes. While both versions of ChatGPT successfully created codes that compiled and ran (with some help), some languages were easier for the AI to use than others (possibly because of the size of the training sets used). Parallel codes – even the simple example we chose to study here – were also difficult for the AI to generate correctly.
…
6 Discussion and Conclusion
In this work we have conducted an evaluation of three computational problems using ChatGPT versions 3.5 and 4.0 for code generation using a range of programming languages. We evaluated the compilation, runtime errors, and accuracy of the codes that were produced. We tested their accuracy, first with a basic numerical integration, then with a conjugate gradient solver, and finally with a 1D stencil-based heat equation solver.
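The first of these tasks, basic numerical integration, is the kind of program the models were prompted to produce. The paper does not reproduce the generated code, so as context here is a minimal hand-written sketch in Python using the composite trapezoidal rule; the function name `trapezoid` and the test integrand are our own choices, not the paper's prompt.

```python
import math

def trapezoid(f, a, b, n):
    """Approximate the integral of f over [a, b] with n trapezoids."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))  # endpoints carry half weight
    for i in range(1, n):
        total += f(a + i * h)    # interior points carry full weight
    return total * h

# Example: integrating sin(x) over [0, pi]; the exact value is 2.
approx = trapezoid(math.sin, 0.0, math.pi, 1000)
```

Even a task this small exercises the full pipeline the paper measures: the code must compile (or parse), run without error, and return a numerically accurate result.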
For the numerical integration example, codes generated by both versions compiled successfully in all languages except Fortran, and executed without any runtime errors. However, the outputs of the ChatGPT 4.0-generated codes were inaccurate, possibly due to the misinterpretation of the keyword “area” in the prompt. In the case of the
conjugate gradient solver, all generated codes compiled successfully with the exceptions of Fortran and Rust. Despite these compilation issues, the resultant codes from all other languages produced correct results, except for R. The parallel 1D heat problem proved to be the most challenging for the AI. Compilation errors were noted in the codes for Fortran, Rust, and C++. Furthermore, a majority of the generated codes encountered runtime errors, and most failed to produce correct results, indicating substantial issues with the implementation logic or the handling of parallel computing constructs by the AI code generator models.
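To illustrate what the most challenging task involves, here is a minimal sketch of an explicit 1D heat equation solver with thread-based domain decomposition. This is our own illustrative code, not the paper's generated output: the names (`heat_step`, `solve_heat`) and the choice of `ThreadPoolExecutor` are assumptions, and Python's GIL limits true CPU parallelism here; the point is the decomposition and the per-step synchronization that the generated codes often got wrong.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def heat_step(u, u_new, alpha, lo, hi):
    # Update interior points in [lo, hi) with the explicit 3-point stencil.
    # Reads come from the old buffer u; writes go to u_new, so chunks
    # can run concurrently without races.
    u_new[lo:hi] = u[lo:hi] + alpha * (u[lo-1:hi-1] - 2 * u[lo:hi] + u[lo+1:hi+1])

def solve_heat(u0, alpha, steps, workers=4):
    """Explicit 1D heat equation with fixed (Dirichlet) boundaries.

    alpha = k*dt/dx**2 must satisfy alpha <= 0.5 for stability.
    """
    u, u_new = u0.copy(), u0.copy()
    n = len(u)
    # Split the interior indices [1, n-1) into one contiguous chunk per worker.
    bounds = np.linspace(1, n - 1, workers + 1, dtype=int)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(steps):
            futures = [pool.submit(heat_step, u, u_new, alpha, lo, hi)
                       for lo, hi in zip(bounds[:-1], bounds[1:])]
            for f in futures:
                f.result()          # barrier: every chunk must finish
            u, u_new = u_new, u     # swap buffers only after the barrier
    return u
```

The double buffering and the barrier before the swap are exactly the parallel-computing constructs the paper found the models mishandling: skipping either one silently corrupts the result.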
We then analyzed the lines of code for all the generated codes, and the code quality using the COCOMO metric. The analysis of lines of code across all examples showed that Matlab and R consistently produced the lowest lines-of-code counts, followed by Python, Julia, and Fortran (Section 5). In terms of code quality, C++ and Java consistently demonstrated robustness across all the examples tested, followed by Matlab. These languages appear to offer a balance between code quality and complexity, making them suitable choices for more complex computational tasks.
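For readers unfamiliar with the metric mentioned above: the basic COCOMO model estimates development effort and schedule directly from lines of code. The paper does not specify which COCOMO variant it applied, so the sketch below shows Boehm's basic model as general context; the coefficients are the standard published ones, and the function name `cocomo_basic` is ours.

```python
def cocomo_basic(kloc, mode="organic"):
    """Basic COCOMO estimate from thousands of lines of code (KLOC).

    Returns (effort in person-months, schedule in months) using
    Boehm's coefficients: effort = a * KLOC**b, schedule = c * effort**d.
    """
    coeffs = {
        "organic":      (2.4, 1.05, 2.5, 0.38),
        "semidetached": (3.0, 1.12, 2.5, 0.35),
        "embedded":     (3.6, 1.20, 2.5, 0.32),
    }
    a, b, c, d = coeffs[mode]
    effort = a * kloc ** b
    schedule = c * effort ** d
    return effort, schedule
```

Because effort grows superlinearly with KLOC, the languages that needed fewer lines for the same task (Matlab, R) score better on this kind of size-driven estimate, which is why lines of code and COCOMO are discussed together.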