How to create a "Copilot" for Fortran?

You have probably seen this: https://copilot.github.com/

I would like to have something like that for Fortran. Does anyone here have machine learning (ML) experience and would be willing to help?

I don’t have ML experience myself, but I can help on the data-preparation side; my understanding is that having high-quality data is essential and takes the most effort. LFortran can now parse almost any Fortran code to an AST.

Intuitively, I feel it would be better to also do the semantic phase: in conjunction with fpm, we would compile the whole project with LFortran to ASR (Abstract Semantic Representation) and then transform the ASR into whatever form is best for ML. The ASR is canonical, and we can include as much or as little semantic information as needed. We are getting better at compiling more codes, so we can start with something simple that we can already compile. I can help with all of this.
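To make the data-preparation step concrete, here is a minimal sketch of the kind of input such a pipeline could consume. The module is illustrative only; assuming LFortran’s --show-ast and --show-asr command-line flags, each source file (or a whole fpm project) would be turned into a (source, ASR) pair that serves as one training example.

module toy
   implicit none
contains
   pure function square(x) result(y)
      real, intent(in) :: x
      real :: y
      y = x*x
   end function square
end module toy

One would then run, for example, lfortran --show-asr toy.f90 and serialize the dump into whatever representation the ML side prefers.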

Copyright-wise we would simply list all the codes that we used for training and copy their licenses into some file that we distribute with the ML model.

Here are some questions:

  • How do you construct an ML model for something like this?
  • How much training data (Fortran code) would be needed to produce something useful?


2 Likes

First, let me thank you for what you are doing to revive Fortran. I have been a Fortran programmer since the WATFOR/WATFIV days with keypunch machines in the early 70s, and in the last few years I have adopted Julia as my main programming language, but I would love to see Fortran continue as a robust language, as it has some definite advantages because of its strong typing, static compilation, and ability to produce small, stand-alone executables.

Regarding Copilot, I think using Copilot is very problematic for open-source projects. It grabs and copies code indiscriminately regardless of the licensing and doesn’t inform the user as to the license status. I would anticipate that using it would lead to lots of legal problems.

2 Likes

Thanks @PeterSimon for the encouragement and welcome to the forum!

Regarding this, as suggested above, we would only use BSD/MIT-licensed code, which only requires copying the license and including it in the documentation or code somewhere. So I agree that it might not be very practical for end users to have to copy a large file with the 1000+ licenses of all the code we used, but everything should be perfectly legal.

Alternatively, we can just provide the tools, and people train their own models.

I believe Copilot already includes Fortran code (See this tweet).

However, I am a bit concerned about what Fortran code might be in the training data set. I hope it’s not going to start causing lots of new FORTRAN 77 code to be written. Should we reach out to Copilot and ask them about it?

1 Like

How many lines of code are needed to get useful results from Copilot? I wonder if it could be useful for restricted codebases at the personal or research group level. I have written about 600K lines of code. Would Copilot make useful suggestions to me if trained on my code? Or just help me write the same buggy code faster :)

Can machine learning be used to spot code that is legal but may be buggy, as in the two examples below? (Implicit none might catch the latter, but not if both yy and y are declared variables.)

do i1=1,n1
   do i2=1,n2
      do i3=1,n2 ! should be n3
      end do
   end do
end do

select case (dist)
   case ("normal")  ; yy = one_over_sqrt_two_pi * exp(-0.5*xx**2)
   case ("Laplace") ; yy = 1/(sqrt_two)*exp(-sqrt_two*abs(xx))
   case ("sech")    ; yy = sech(pi_over_2*xx)/2
! line below should have yy
   case ("t5")      ; y = student_t_density(xx,dof=5.0_dp,mu=0.0_dp,xsd=1.0_dp) 
   case ("t10")     ; yy = student_t_density(xx,dof=10.0_dp,mu=0.0_dp,xsd=1.0_dp)
   case default     ; yy = bad_real
end select
2 Likes

I’d think ML would be more useful once there was a clear specification of the problem, either formally, or indirectly via a massive test case DB:

ML and AI have had their greatest successes in high signal:noise situations, e.g., visual and sound recognition, language translation, and playing games with concrete rules. What distinguishes these is quick feedback while training, and availability of the answer. Things are different in the low signal:noise world of medical diagnosis and human outcomes. A great use of ML is in pattern recognition to mimic radiologists’ expert image interpretations. For estimating the probability of a positive biopsy given symptoms, signs, risk factors, and demographics, not so much.

and

so I’m not fully convinced AI programmers are actually going to work out so well.

1 Like

Is there a shortage of developers who don’t know what they are doing but can copy stuff from the internet?

1 Like

I think you will need the implementation of this proposal.

3 Likes

Gfortran is already a mind-reader :). For my second code above, placed in a module, gfortran says

temp_pdf.f90:95:23:

   95 |    case ("t5")      ; y = student_t_density(xx,dof=5.0_dp,mu=0.0_dp,xsd=1.0_dp)
      |                       1
Error: Symbol 'y' at (1) has no IMPLICIT type; did you mean 'yy'?

One could write a fixit program that corrects certain types of errors by reading compiler error messages. When refactoring code, I often get warnings about unused variables that could be fixed automatically. Of course, fixing code with logic errors is a much harder problem.
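As a toy illustration (purely hypothetical; the log file name and the message text are assumptions based on the gfortran output quoted above), such a fixit helper could scan a saved compiler log for the "did you mean" hint and report the suggested substitution:

program fixit_sketch
   ! Hypothetical sketch: scan a saved gfortran log for the
   ! "did you mean" hint and print the compiler's suggested name.
   implicit none
   character(len=*), parameter :: key = "has no IMPLICIT type; did you mean '"
   character(len=512) :: line
   integer :: ios, i, j
   open(unit=10, file="gfortran.log", status="old", action="read")
   do
      read(10, '(A)', iostat=ios) line
      if (ios /= 0) exit
      i = index(line, key)
      if (i > 0) then
         j = index(line(i+len(key):), "'")
         if (j > 1) print *, "compiler suggests: ", line(i+len(key):i+len(key)+j-2)
      end if
   end do
   close(10)
end program fixit_sketch

Actually applying the fix in place would additionally need the file name and line number from the message, but the pattern is the same.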

Yes, I have seen that sort of message from gfortran; pleasingly helpful indeed. Unfortunately it will not help with the n2/n3 mistake in your example, and that is where said proposal would be really helpful :)

GitHub Copilot is meant to reduce the time spent on tedious boilerplate, like making HTTP requests and parsing their responses, or routine DB queries. These are fun to do for the first or second time, but they soon get old. I think Copilot will be most useful in languages like JavaScript and Python, where such boilerplate is common. I think it will be less so in Fortran, where domain expertise is more important (e.g. which finite difference scheme to use to approximate this partial derivative?).

Copilot is not meant to replace programmers, but to allow programmers to spend their time on more creative, innovative tasks.

Do you like intellisense stuff like autocomplete and docstrings on hover? Copilot is a further step in that direction.

I see many responses on the internet get hung up on buzzwords like AI or ML. That’s a distraction. A line fit through two data points is machine learning. GPT-3 is glorified linear regression. Who cares what it’s called or what’s under the hood. It’s a tool, and it’s useful for some things. It’s not the be-all and end-all.

5 Likes

Now I want to try it for writing my CMake build files, which are usually 90% boilerplate. I wonder whether this is going to improve the user experience with CMake or deteriorate it further.

2 Likes

As Milan said, Copilot is autocomplete, just better. I can imagine all kinds of things I would like help with in Fortran, such as calling a subroutine with lots of arguments: I would love it if autocomplete could fill in all the arguments as best it can figure out, and I just fix them up.
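Something like the self-contained sketch below is what I have in mind (the routine relax and all its arguments are made up for illustration): the call line with all the keyword arguments is exactly what I would want the tool to draft, so that I only have to fix up the values.

program call_sketch
   implicit none
   real :: u(10, 10)
   u = 0.0
   ! The line an intelligent autocomplete would ideally draft for me:
   call relax(field=u, nx=10, ny=10, tol=1.0e-6, max_iter=500, omega=1.5, verbose=.false.)
contains
   subroutine relax(field, nx, ny, tol, max_iter, omega, verbose)
      real, intent(inout) :: field(:, :)
      integer, intent(in) :: nx, ny, max_iter
      real, intent(in)    :: tol, omega
      logical, intent(in) :: verbose
      ! Actual iteration body elided; only the interface matters for this sketch.
      if (verbose) print *, "relaxing", nx, ny, tol, max_iter, omega
   end subroutine relax
end program call_sketch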

I don’t use finite differences much, but a lot of those expressions are repetitive, especially in 3D, so I think I would use an autocomplete that can actually almost write them for me.
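For instance, the second-order central-difference Laplacian below (array names and the uniform grid spacing h are illustrative only) is the kind of repetitive 3D stencil code I mean; an autocomplete that drafts the three nearly identical terms would already save typing and typo hunting.

subroutine laplacian3d(u, lap, n, h)
   ! Illustrative sketch: 7-point Laplacian stencil on a uniform grid.
   implicit none
   integer, intent(in) :: n
   real, intent(in)    :: u(n, n, n), h
   real, intent(out)   :: lap(n, n, n)
   integer :: i, j, k
   lap = 0.0
   do k = 2, n-1
      do j = 2, n-1
         do i = 2, n-1
            lap(i, j, k) = ( u(i+1, j, k) - 2.0*u(i, j, k) + u(i-1, j, k) &
                           + u(i, j+1, k) - 2.0*u(i, j, k) + u(i, j-1, k) &
                           + u(i, j, k+1) - 2.0*u(i, j, k) + u(i, j, k-1) ) / h**2
         end do
      end do
   end do
end subroutine laplacian3d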

In some sense, Fortran might be ideal for such intelligent autocompletion: imagine just depending on a package with fpm, then going into your code, starting to call some subroutine, and having autocomplete finish the call to the third-party API. Yes, for this kind of usage you almost don’t need ML. But I would not be surprised if people figure out how to use ML for all kinds of useful help for Fortran that I can’t think of right now.

I still think fpm will be a better user experience than CMake + Copilot. In some sense, fpm is the Copilot for CMake.