Large Language Models and The End of Programming

Yes, such a “very highly intelligent species,” designed by the very same cadre of “leaders” now in charge, who are all responsible for corruption and injustice! You don’t need AI at all to guess the outcome…

I think we should try to avoid general social commentary here.

4 Likes

I believe that we Fortran programmers should consider how we can adapt to this rapidly changing world. In early 2023, when the AI subject became so popular thanks to ChatGPT, I thought that such discussions were somewhat like science fiction. I still maintain that ChatGPT is merely a tool, and, quite evidently, a simple text completion tool at that. However, my perspective has shifted. If all goes as planned, starting from October, I won’t need to work for more than 15 minutes a day for the rest of my life because my Fortran-based AI assistant will handle the workload for me. Fortran users find themselves in a privileged position; even on a personal laptop, you can run AI models that would require a small supercomputer in the case of Python. Let’s take advantage of this opportunity!

P.S. This post’s English has been enhanced by ChatGPT.

I have just read the following article regarding AI-assisted programming; quite interesting. Interestingly, the programming task used in the experiment conducted at MIT was to solve a problem in the Fortran language, which none of the participants knew.

Sorry for reviving an old topic but I did not want to start a new one.

From the article

“This is an important educational lesson,” said Klopfer. “Working hard and struggling is actually an important way of learning. When you’re given an answer, you’re not struggling and you’re not learning. And when you get more of a complex problem, it’s tedious to go back to the beginning of a large language model and troubleshoot it and integrate it.”

I recall the famous mathematician R.L. Moore (known for the so-called “Moore Method” or “Inquiry Based Learning” approach to teaching mathematics that focuses on having the student develop his or her own proofs of key results) described his philosophy as “The student who learns best, is told the least.”

I fear that children will take this AI hype all too seriously, and not learn to use programming as an aid to critical thinking.

Here is a good video that is a very reasonable and honest test of what these LLMs can do. The technology performs much worse than even a novice developer would. Complicated code for numerical methods in engineering and science looks pretty safe for the foreseeable future.

1 Like

Indeed, programming is a school of rigor, and one of the most demanding (one wrong character and everything fails…). That alone is sufficient reason to teach and learn programming.

2 Likes

As someone who has looked into this problem more than I want to admit (Morton orderings, cycle-chasers), I don’t think there is a good answer to this question at all.

I have tried recursive and blocked transposes before, and one version was in SciPy until I removed it again. It is unnecessarily tricky to come up with a scheme that works properly across all hardware unless you want to dig down into architecture detection in your code, which is beyond my pay grade, unfortunately. Cache-oblivious algorithms are a different type of beast, and I don’t know how to do them properly either, again for all the hardware out there.
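To make the idea concrete, here is a minimal sketch of the simpler, out-of-place blocked case (not the in-place, cycle-chasing version referred to above). The block size of 64 is an arbitrary, untuned guess; choosing it well per architecture is exactly the hard part being described.

    ! Minimal sketch of an out-of-place blocked (tiled) transpose.
    ! The block size is an arbitrary, untuned choice for illustration.
    subroutine blocked_transpose(a, b)
      real, intent(in)  :: a(:,:)
      real, intent(out) :: b(:,:)   ! must be size(a,2) x size(a,1)
      integer, parameter :: blk = 64
      integer :: i, j, ii, jj
      do jj = 1, size(a, 2), blk
        do ii = 1, size(a, 1), blk
          do j = jj, min(jj + blk - 1, size(a, 2))
            do i = ii, min(ii + blk - 1, size(a, 1))
              b(j, i) = a(i, j)
            end do
          end do
        end do
      end do
    end subroutine blocked_transpose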

Considering the whole world still needs to link to an F77 library to do linalg, I think LLMs are the least of our worries.

Relevant to the discussion: this conference paper presents some testing that was done with ChatGPT and several programming languages, including Fortran. Even for small programs and common scientific problems, it makes basic mistakes, such as using dot_product as both a function and a variable, and it struggles with even simple parallel codes.
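For illustration only (a hypothetical reconstruction, not an example taken from the paper), the dot_product mistake looks something like this: declaring a variable with the intrinsic’s name hides the intrinsic, so the call no longer compiles.

    ! Hypothetical reconstruction of the reported mistake: the local
    ! declaration hides the dot_product intrinsic, so the reference
    ! below is rejected (dot_product is now a scalar variable).
    real :: x(3), y(3)
    real :: dot_product
    x = [1.0, 2.0, 3.0]
    y = [4.0, 5.0, 6.0]
    dot_product = dot_product(x, y)   ! compile-time error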

Looking at GitHub, C++, Python, and Java were among the top ten most used languages in 2022. Maybe the larger training data set explains why C++ and Java had good results. However, Python is an outlier: it is the second most used language on GitHub, but the generated code quality was low.

That’s interesting. I wonder if that’s due to a larger ratio of poor to good code in the training dataset, or due to the fact that Python scripts don’t implement these problems from scratch as often, so the relevant part of the Python training dataset isn’t as big.

I often make mistakes such as forgetting the contains statement or the comma needed after a format string in a print statement, but if you hired me to write Fortran, those errors would not show up in the code I produce, since the compiler will force me to fix those errors. Arguably it is unfair to judge an LLM based on syntax errors, except to the extent that iteratively fixing those syntax errors, which can be done by an agent, slows them down. It’s the logic errors in codes that compile and run that are insidious.
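For readers less familiar with Fortran, these are the two slips being described. In the correct version below, the comments mark the pieces that are easy to drop and that the compiler immediately complains about:

    program demo
      implicit none
      call greet()
    contains                     ! forgetting this line is one common slip
      subroutine greet()
        print '(a)', 'hello'     ! dropping the comma after the format string is the other
      end subroutine greet
    end program demo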

3 Likes

Absolutely.

I also often forget things like the contains statement. I suppose the expectation that LLMs should not get syntax like that wrong stems from (a) the fact that it’s relatively straightforward to check (or would be if LLMs worked differently), and (b) the hype around LLMs as all-rounder coding tools.

“The end of Programming” is a challenging statement.

My first response is that this can’t be!
How could AI replace the computationally dense analysis of large, complex data sets?

But my experience is that, rather than AI, it is the marketplace that has reduced the need for the computational approaches I used from the ’80s to the ’10s.
Or is it that the marketplace has changed and wants a different solution approach for different problems? (Perhaps one the client, rather than the expert, can control.)

Perhaps it will be a combination of the two, with the marketplace wanting solutions more aligned with what AI can produce.

I have to disagree with this statement. We only apply this kind of reasoning to AI; for anything else we would be furious. Have you ever said, “Well, this screwdriver does not work, but my fingers are no better at driving this screw, so I guess I cannot be mad”?
And looking at it from the probabilistic perspective, if they based the training of the model on good code, how come the generated code has so many things that would not pass a compilation step?

(with this I’m not saying AI should not be adopted, but before creating the market for the tool I would like to have the tool finished and working as expected)

There are some deterministic tools to generate Fortran code (and many more such tools to generate C code). If such a tool generates invalid code, it’s a bug. LLMs, like humans, are not deterministic. Give ChatGPT 4o the same coding prompt many times, and it will sometimes generate code that

  • has syntax errors
  • compiles but crashes at run-time
  • compiles and runs to completion but generates wrong results due to a logic error
  • gives correct results

One can see this as a flaw, but this non-determinism means that if you give the LLM a coding task N times, possibly in parallel, the probability of getting at least one correct program rises as N increases.
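To spell out the arithmetic behind that claim (under the simplifying assumptions that attempts are independent and that a correct program can be recognized, e.g. by a test suite): if a single attempt is correct with probability p, then

    P(at least one correct out of N) = 1 - (1 - p)^N

which approaches 1 as N grows. For example, with p = 0.5, five attempts already give about a 97% chance of at least one correct program.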

A large fraction of Fortran codes declare integer variables i and j, so LLMs will sometimes declare such variables even when they are never used. LLMs generate Fortran code based not only on the Fortran code in their training set but also on code from other languages. Since pi is a built-in constant in some languages, such as Matlab, LLMs can generate invalid Fortran code that uses pi without defining it. In C and C++ you can intersperse declarations and executable statements more freely than in Fortran, so LLMs generate such code, which is invalid in Fortran unless a block construct is used.
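As a minimal sketch of what valid Fortran looks like for those two cases (the numbers are arbitrary, chosen purely for illustration):

    program pi_and_block
      implicit none
      integer, parameter :: dp = kind(1.0d0)
      ! pi is not intrinsic in Fortran; it must be defined explicitly
      ! (acos(-1.0_dp) also works as an initializer on Fortran 2008+ compilers).
      real(dp), parameter :: pi = 3.14159265358979323846_dp

      print *, 'pi =', pi
      ! Unlike C/C++, new declarations cannot follow executable statements
      ! unless they are wrapped in a block construct (Fortran 2008).
      block
        real(dp) :: circumference
        circumference = 2.0_dp * pi * 1.5_dp   ! radius of 1.5 chosen arbitrarily
        print *, 'circumference =', circumference
      end block
    end program pi_and_block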

I think the ideal LLM coding agent will have an LLM generate code, use deterministic tools to fix the errors LLMs commonly make (for example, compilation errors from an undefined pi), and then have the LLM try to fix the remaining errors.

1 Like

If this is actually true, it’s a major design flaw.
Let me use an LLM to generate code, but let me check all the array and loop boundaries, because they could start at either 0 or 1 depending on which language had the largest bias in the model.

I’ll admit that I know very little about LLMs, so this is probably a dumb question: do they have a way to filter out what I and most people would consider bad code from their training set? Some old codes have so many GO TOs, EQUIVALENCEs, and other evil coding practices (like a(1) for dummy argument dimensions with the actual argument being a 2D array) that I wouldn’t trust anything that thought those things were appropriate in modern Fortran.
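For context, here is the kind of contrast being described, as a generic illustration rather than code from any particular library: the legacy a(1) trick relies on sequence association and defeats shape checking, while the modern assumed-shape form keeps the bounds visible to the compiler.

    ! Legacy idiom being criticized: a(1) declares an array of size 1,
    ! but old codes index past it, relying on sequence association.
    ! A 2-D actual argument is silently flattened, and run-time bounds
    ! checking (if enabled) will flag the accesses below.
    subroutine old_scale(a, n, s)
      integer :: n
      real :: a(1), s
      integer :: i
      do i = 1, n
        a(i) = a(i) * s
      end do
    end subroutine old_scale

    ! Modern equivalent: an assumed-shape dummy whose extent is known
    ! to the compiler, so checking and whole-array operations work.
    subroutine new_scale(a, s)
      real, intent(inout) :: a(:)
      real, intent(in)    :: s
      a = a * s
    end subroutine new_scale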

Do they have a way to filter out what I and most people would consider bad code from their training set?

Definitely, if careful thought were put into that in the pre-training (or fine-tuning) phase. However, as an end user of already-trained models, you’re much more limited. You’re basically left with having to manage the models’ outputs through prompts.

If this is actually true, it’s a major design flaw.

I wouldn’t call it a design flaw. Just as for humans, it makes sense for these models to draw on similar “experiences” (with other programming languages in this case). This approach is a strength of these models in many instances, but it comes with downsides for specific applications. That’s also the reason pure probabilistic models like these won’t be the all-rounders some hype them up to be.

I have played with LLMs a lot for Fortran coding and don’t recall seeing code with goto or equivalence. Arrays are typically declared as assumed shape x(:) or explicit shape x(n) with n an argument. They put procedures in modules if I ask them to and may do so without being asked. ChatGPT will learn about your coding style preferences over time.

1 Like

I have also used LLMs extensively for Python and Modern Fortran development. I have never seen a goto or equivalence. Simply modifying the system prompt to include best practices will make them modularize preemptively. Asking about 2003+ features will make them use OOP if it makes sense.

Newer models are not the GPT4 or 4o of old. They have an impressive grasp of Modern Fortran and best practices.

The prompts continue to make a difference in my experience. LLMs do like good contextualization, problem constraints, and task chunking.

If this is actually true, it’s a major design flaw.

What do we actually optimize for? I remember early journal papers using GPT3.5 and sometimes GPT4 to one-shot problems. Is that it? Is that their best use case? I’d argue not.

Their best use case is iterative design and process. And for this use case, being able to consider concepts that don’t necessarily exist in Fortran codebases is actually brilliant. If the trade-off is syntax errors or missing bounds checking, so be it.

Anecdotally, I’m seeing a sharp decrease in all these issues with newer models. They can sometimes one-shot a couple hundred lines of (simpler) code.

o3 is good at Fortran. What are your favorite newer models for Fortran coding?

Hmm, I don’t use o3 at all.

I use and mix a number of them, including o4-mini-high, Gemini 2.5-pro, Grok3, and Claude Sonnet 3.7 and 4.0. I do check out and use the latest available DeepSeek (V3.1 and R1 0528) for small and contained tasks.

All of them do fairly consistently well with Fortran.

Gemini has been the dark horse, unexpectedly becoming a top-2 model since its release. The huge context window helps a lot, and I enjoy its high-level planning.

Claude 4.0 is also in the top 2, though not unexpectedly so for those who followed the Sonnet 3.5/3.7 updates.

For me, every other model is within a 10% margin of the top 2, and that’s at their worst.

I didn’t distinguish between thinking and non-thinking models. It does affect the output, but it’s tricky to quantify since it depends a lot on the prompt.