AI Coding Assistants vs. Codee — Insights on Fortran Correctness and Modernization

A lot of the rewrites are ill inspired but have good intentions. I think it is worth to rewrite codes to adopt GPUs, since a lot of the original algorithms and paradigms might scale and work poorly on GPUs. I have been part of rewrites and ports and porting a codebase to GPUs without willing to rewrite the architecture ends up producing a difficult to work with result. It doesn’t result in the 20x speedup that you’d want and it might make the CPU slower or basically duplicate the codebase.

I feel that a good rewrite focuses on keeping the algorithms as much as possible while creating new APIs that use these algorithms in a safe way so that you can spend more time designing an architecture that will use the algorithms. For example, this bit for some PPM fluxes:

      do concurrent(j=1:ny, i=3:nx - 2) &
         local(dh_m1, dh_0, dh_p1, h_left, h_right)
         call ppm_limited_slope(bs%h(i - 2, j), bs%h(i - 1, j), bs%h(i, j), dh_m1)
         call ppm_limited_slope(bs%h(i - 1, j), bs%h(i, j), bs%h(i + 1, j), dh_0)
         call ppm_limited_slope(bs%h(i, j), bs%h(i + 1, j), bs%h(i + 2, j), dh_p1)
         h_left = 0.5_wp*(bs%h(i - 1, j) + bs%h(i, j)) - (dh_0 - dh_m1)/6.0_wp
         h_right = 0.5_wp*(bs%h(i, j) + bs%h(i + 1, j)) - (dh_p1 - dh_0)/6.0_wp
         call ppm_cell_limiter(bs%h(i, j), h_left, h_right)
         if (do_pos) call ppm_limit_pos(bs%h(i, j), h_left, h_right, h_min_pos)
         this%h_face_right_x%data(i, j, 1) = h_left
         this%h_face_left_x%data(i + 1, j, 1) = h_right
      end do

Before, the code for ppm_limited_slope was inline in the do concurrent, repeated 3 times like:

      dh_left = h_i - h_im1
      dh_right = h_ip1 - h_i
      dh_centered = 0.5_wp*(dh_left + dh_right)
      if (dh_left*dh_right > 0.0_wp) then
         dh = sign(min(abs(dh_centered), &
                       2.0_wp*abs(dh_left), &
                       2.0_wp*abs(dh_right)), &
                   dh_centered)
      else
         dh = 0.0_wp
      end if

So I spent the time first refactoring so that I could create pure subroutines that are callable from a do concurrent to get to the first bit of code. Then all the data needed to compute the fluxes can be just passed in by: pure subroutine continuity_compute_fluxes_barotropic(grid, metrics, this, bs)

The extent of object orientation for me is data structures that hold all necessary stuff for a computation. Their only type bound procedures are initialization, data transfer, finalization, data deletion:

   type :: continuity_t
      logical :: is_init = .false.
         !! True between `init` and `destroy`. 
      integer :: ppm_variant = PPM_VARIANT_H3_MONO
         !! Active PPM scheme variant.
      logical :: monotone = .true.
         !! Enforce monotonicity on the reconstructed face values.
      real(wp) :: h_min = 1.0e-6_wp
         !! Lower clip on cell-centred thickness during the update.
         !! Also used as the floor for `ppm_limit_pos` 
      real(wp) :: cfl_max = 0.5_wp
         !! Soft cap on per-face CFL before falling back to upwind.
      logical :: use_ppm_limit_pos = .false.
         !! Positivity limiter

      ! ---- Face-reconstruction workspace ----
      !
      !   h_face_left_x(i, j, k)  — value AT east face i extrapolated
      !                             from the LEFT-side cell (i-1, j, k),
      !                             i.e. h_R of cell i-1.
      !   h_face_right_x(i, j, k) — value AT east face i extrapolated
      !                             from the RIGHT-side cell (i, j, k),
      !                             i.e. h_L of cell i.
      !
      ! Upwind: the kernel picks left if u >= 0 (left cell donates),
      ! right if u < 0.
      type(scratch_3d_buffer_t) :: h_face_left_x
         !! Left-state thickness at east faces.
      type(scratch_3d_buffer_t) :: h_face_right_x
         !! Right-state thickness at east faces.
      type(scratch_3d_buffer_t) :: h_face_left_y
         !! Left-state thickness at north faces.
      type(scratch_3d_buffer_t) :: h_face_right_y
         !! Right-state thickness at north faces.
   contains
      procedure :: init => continuity_init
      procedure :: destroy => continuity_destroy
      procedure :: enter_data => continuity_enter_data
      procedure :: exit_data => continuity_exit_data
   end type continuity_t

So that I can then, for example use the scratch_3d_buffer_t like:

      ! East-face shapes: (nx+1, ny, nz)
      call this%h_face_left_x%init(nx + 1, ny, nz, "continuity_h_face_left_x")
      call this%h_face_right_x%init(nx + 1, ny, nz, "continuity_h_face_right_x")
      ! North-face shapes: (nx, ny+1, nz)
      call this%h_face_left_y%init(nx, ny + 1, nz, "continuity_h_face_left_y")
      call this%h_face_right_y%init(nx, ny + 1, nz, "continuity_h_face_right_y")

      this%is_init = .true.

All the simplifications I’ve done make it so that my RK2 step looks like:

      do stage = 1, 2
         call run_stage_split(grid, metrics, dyn, eos, cor, ct, pgf, hv, bd, ss, &
                              va, hd, vd, vmix, ms, dt, n_inner, &
                              sf=sf, stage=stage, vcoord=vcoord, bc=bc, t=t, &
                              lateral_mix=lateral_mix, epbl=epbl, kshear=kshear)
      end do

      call rk2_average(ms)
      if (allocated(ms%tracers)) then
         do it = 1, size(ms%tracers)
            call rk2_average_field_3d(ms%tracers(it)%hTr0, ms%tracers(it)%hTr, &
                                      size(ms%tracers(it)%hTr, 1), &
                                      size(ms%tracers(it)%hTr, 2), &
                                      size(ms%tracers(it)%hTr, 3))
         end do
      end if

It is a bit useless to expose a public API for a continuity PPM solver, since it is quite specific. But then being able to expose a do_one_dynamics_step(handle) API so that someone can call it from Python and have it run on the GPU? That is what a million dollar refactor should get you. All of this results in me being able to write an integration test that is super easy to read:

  pure subroutine ocean_dyn_step_barotropic(grid, metrics, dyn, cor, ct, bs, dt)
      !! Public only for the unit-test suite (no production module imports it);
      !! ignore when developing production code in other modules.

      type(hgrid_t), intent(in) :: grid
      type(ocean_metrics_t), intent(in) :: metrics
      type(ocean_dyn_t), intent(inout) :: dyn
      type(coriolis_adv_t), intent(inout) :: cor
      type(continuity_t), intent(inout) :: ct
      type(barotropic_cgrid_state_t), intent(inout) :: bs
      real(wp), intent(in) :: dt

      integer :: i, j, nx, ny, nx_face, ny_uface, nx_vface, ny_face

      nx = grid%nx_total
      ny = grid%ny_total
      nx_face = size(bs%u_face_x, 1)
      ny_uface = size(bs%u_face_x, 2)
      nx_vface = size(bs%v_face_y, 1)
      ny_face = size(bs%v_face_y, 2)

      ! ---- 1. Save u^n into h0 / u_face_x0 / v_face_y0 ----
      do concurrent(j=1:ny, i=1:nx)
         bs%h0(i, j) = bs%h(i, j)
      end do
      do concurrent(j=1:ny_uface, i=1:nx_face)
         bs%u_face_x0(i, j) = bs%u_face_x(i, j)
      end do
      do concurrent(j=1:ny_face, i=1:nx_vface)
         bs%v_face_y0(i, j) = bs%v_face_y(i, j)
      end do

      ! ---- 2. Stage 1: tendencies at u^n, FE step -> u^(1) ----
      call continuity_compute_fluxes_barotropic(grid, metrics, ct, bs)
      call coriolis_adv_compute_tendencies_barotropic(grid, metrics, cor, bs)
      call continuity_apply_fluxes_barotropic(bs, dt)
      call coriolis_adv_apply_tendencies_barotropic(cor, bs, dt)

      ! ---- 3. Stage 2: tendencies at u^(1), FE step -> u^(1) + dt*L(u^(1)) ----
      call continuity_compute_fluxes_barotropic(grid, metrics, ct, bs)
      call coriolis_adv_compute_tendencies_barotropic(grid, metrics, cor, bs)
      call continuity_apply_fluxes_barotropic(bs, dt)
      call coriolis_adv_apply_tendencies_barotropic(cor, bs, dt)

      ! ---- 4. RK2 average: u^(n+1) = 1/2 * (u^n + (u^(1) + dt*L(u^(1)))) ----
      do concurrent(j=1:ny, i=1:nx)
         bs%h(i, j) = 0.5_wp*(bs%h0(i, j) + bs%h(i, j))
      end do
      do concurrent(j=1:ny_uface, i=1:nx_face)
         bs%u_face_x(i, j) = 0.5_wp*(bs%u_face_x0(i, j) + bs%u_face_x(i, j))
      end do
      do concurrent(j=1:ny_face, i=1:nx_vface)
         bs%v_face_y(i, j) = 0.5_wp*(bs%v_face_y0(i, j) + bs%v_face_y(i, j))
      end do

      dyn%outer_step_count = dyn%outer_step_count + 1
   end subroutine ocean_dyn_step_barotropic

Which to me is the magic of Fortran, if you had gone the full object oriented way in C++ would you get (AI generated)…because a lot of the new rewrites focus on migrating to C++:

template <class Op>
concept BarotropicOperator = requires(Op op, const HGrid& g, const OceanMetrics& m,
                                      BarotropicCGridState& s, wp dt) {
    op.compute(g, m, s);
    op.apply(s, dt);
};

template <BarotropicOperator Cont, BarotropicOperator Cor>
class BarotropicRK2Stepper {
public:
    BarotropicRK2Stepper(Cont& cont, Cor& cor) : cont_(cont), cor_(cor) {}

    void step(const HGrid& grid, const OceanMetrics& metrics,
              OceanDyn& dyn, BarotropicCGridState& s, wp dt) {
        s.saveSnapshot();                              // 1. save u^n
        forwardEulerStage(grid, metrics, s, dt);       // 2. u* = u^n + dt L(u^n)
        forwardEulerStage(grid, metrics, s, dt);       // 3. u* + dt L(u*)
        s.averageWithSnapshot();                       // 4. Heun average
        ++dyn.outerStepCount;
    }
private:
    void forwardEulerStage(const HGrid& grid, const OceanMetrics& metrics,
                           BarotropicCGridState& s, wp dt) {
        cont_.compute(grid, metrics, s);
        cor_ .compute(grid, metrics, s);
        cont_.apply(s, dt);
        cor_ .apply(s, dt);
    }
    Cont& cont_;
    Cor&  cor_;
};

But now, unless you’re very familiar with C++ and objects this might read like a foreign language. Because each object abstracts away what compute is doing you get that indirection that is very nice for generality but a bit difficult to cope if you’re starting out. I think heavily from the academic perspective, new students often without experience at all.

I feel that the Fortran implementation has a simplicity to the reader that is difficult to get with C++ unless you write C++ that looks more akin to C. I feel that the Fortran can look like how someone would write a python code for this to prototype: (also AI generated)

def ocean_dyn_step_barotropic(grid, metrics, dyn, cor, ct, bs, dt):
    bs.h0[:]        = bs.h
    bs.u_face_x0[:] = bs.u_face_x
    bs.v_face_y0[:] = bs.v_face_y

    continuity_compute_fluxes_barotropic(grid, metrics, ct, bs)
    coriolis_adv_compute_tendencies_barotropic(grid, metrics, cor, bs)
    continuity_apply_fluxes_barotropic(bs, dt)
    coriolis_adv_apply_tendencies_barotropic(cor, bs, dt)
    # ... stage 2 ...

    bs.h[:]        = 0.5 * (bs.h0 + bs.h)
    bs.u_face_x[:] = 0.5 * (bs.u_face_x0 + bs.u_face_x)
    bs.v_face_y[:] = 0.5 * (bs.v_face_y0 + bs.v_face_y)
    dyn.outer_step_count += 1

So, basically my idea of a refactor/port/modernization is hide the complexity away through nice interfaces without abstracting too much that the algorithm gets lost. I like the idea of data structures carrying data used in computation and then use procedural style calls for this. I am sure you could rewrite my dynamics step to look nicer. My main goal is: write your code such that a new student can read it and compare it to the algorithm they see in a book/paper

6 Likes

You are painting with a broad brush. I learned Python, C++ (to some extent) and R after Fortran and still prefer Fortran for some codes.

6 Likes

I agree — I learned Fortran after I knew C, C++ and Python well. @ScottBoyce I think Fortran still has its place, even with AI. Whether it’s now more important or less important than before, that I don’t know, I could argue both sides well. :slight_smile: Ask me in 2 to 3 years, I think it will become more clear then.

4 Likes

You do realize that in the modern world everything is written in terms of C (ELF, PE, etc.), don’t you?

(Debugging information formats like DWARF, have added module- and OOP-related stuff over the years, but the fundamental format of libraries and executables has remained C-centric.)

The OOP implementations are just abstractions that get lost once the linker has done its job (e.g., C++ compilers erase those abstractions to the point it sometimes becomes difficult to understand errors). So it all comes down to whether a Fortran rewrite using OOP correctly used “structs of arrays” instead of “arrays of structs” in the right places (since being contiguous and taking advantage of the CPU cache is all that matters).

And in the same vein, Rust’s advantage is only in what can be done during the compilation (i.e., the whole memory safety is mostly a compilation-related thing). But once the LLVM stack is done with it, it’s all the same.

One of the reasons why I like the newer programming languages over Fortran is that the standards committee also develop the compiler so that the two are in step with each other. The problem with Fortran is the standards committee develops what they think is useful and then throw it out into the wild with the hopes that someone will develop a compiler that can implement it. That leaves the developer at the whims of the companies that make the compilers and their interpretation of the standard.

I’ve had code that ran perfectly with one version of Intel Fortran and then they update it and then either it either fails to compile, doubles the runtime, or flat won’t work with a specific version of visual studio. Nvidia’s getting a little into the game to port more GPU programming, but again we’re still at the whim of their interpretation of the standard.

For other languages you can build your own compiler, but at the same time the standards committee provides a base compiler that you know is stable and will always honor the standard.

The other issue is that most the compilers now just convert the syntax to LLVM IR, then optimize that. So to me it would seem better to use a newer programming language that was developed after LLVM IR came about so that the syntax and the compilers are designed around optimally creating the intermediate representation.

It’s a similar issue with C and C++ not having their own standard compiler (well I would probably argue GCC) and a lot of the newer compilers for them are just translating over to IR, but I’m thinking more in terms of the newer compiled languages (Zig/Rust, which provided dedicated compiler that doesn’t create as much chaos every time there’s a version update).

For me the memory safety is more the sales pitch for Rust over C++ (in a way their ownership model is similar to Fortran allocatable variables), What is nice about Rust is what tends to be easy syntax in others, it makes it more difficult to guide you torwards faster code. For example, it makes static dispatch code easier to write over dynamic dispatch, where other languages make dynamic easier and thus resulting the creation of a vtable and less optimized code. Another benefit is that newer languages try to optimize the abstraction away (their equivalent to a type bound procedure becomes an inline standard function at compile time). Also, you get as part of the standard macros, preprocessing, documentation, a package manager, custom compilation, and something akin to cmake. Basically, they take all the best features of all the legacy languages and all the add on steps that we do (cmake, preprocessing) and build it into the standard.

I am digressing, I did not want my post to be able the benefit of a newer language, but rather the main power of Fortran is that, once something is stable and well designed for speed, it never needs to be modified again and given the promise of backward compatibility should forever be compliable and linkable to future code enhancements (adding new features, not modifying existing libraries).

Completely agree with you/your entire post :wink:

One thing that might make your code a few nanoseconds faster is to add to the type bound procedures the NON_OVERRIDABLE binding attribute.

That will flag to the compiler that you’re not using any inheritance and at compile time it should do a zero cost abstraction and substitute the tight bound procedures for traditional ones. Sometimes the compiler figures it out on its own, but this ensures that the compiler knows.

Also, this will require some timing to check, and may change with compiler versions, but you can force the vectorization by changing


do concurrent(j=1:ny, i=1:nx)
  bs%h0(i, j) = bs%h(i, j)
end do

! and changing to (best option because each loop is parallelized and vectorized)

do concurrent(j=1:n)
  bs%h0(:, j) = bs%h(:, j)
end do

! or if they are known to be dimensionaly identical always
! (this sometimes can cause a stack overflow due to compiler optimizer bugs)
! (Or at least I have had that happen to me for very large arrays over a standard do loop)

bs%h0 = bs%h

Do concurrent will then try to parallelize the loops on the outer index while vectorizing the inner index.

I have to admit, most of my experience with DO CONCURRENT versus a traditional DO, the concurrent version results in slower code. It’s probably more an issue with the compiler being better at peeling the arrays with old style DOs versus handing multiple indices in the CONCURRENT version (often I have found when its given more then one index to iterate over it chooses the worse one to move along first).

1 Like

Most of the programming languages that were born this century, are the product of a single body (e.g., C#, Go, Rust, Zig, etc.), and therefore there’s only one implementation that matters.

Fortran, C and C++, on the contrary, were standardized only after multiple implementations popped up. So even agreeing on trivial things like source file extension has been difficult (.f and .f90 are a given for Fortran, but I think I’ve seen .f95, .f03, .f08 and .f18 around somewhere).

And I think the LFortran project is intended to become the reference implementation —thus replacing NAG as the first compiler that always interprets the standard correctly.

So you’re leaning towards freezing the code and never adapting it to further uses beyond its original MVP status. That’s understandable (although it reminds me of a Perl-related meme).

But when taking the adaptation route, one must take into account that even though the implementation is fast, it might not have been very well written in the first place, and therefore must at least be cleaned up to ease further maintenance.

At least in the ifort/ifx case, do concurrent translates to OpenMP, which means that, unless the loop is guaranteed to handle thousands of elements, the overhead will be significant.

And ifort/ifx also uses the indices in the given order, so you still have to apply the reverse ordering in the do concurrent construct.

In the late 1980’s and early 1990’s I worked on the development of the ADSIM simulation language. The practice, in adding new features was:

  1. Develop and write a standard for the new feature;
  2. Ask the users for a review;
  3. Add it to an unreleased version of the compiler;
  4. Run it across a large library of existing programs to prove that it didn’t break anything.
  5. Let it loos on the users.

I still think that this is the only workable way of extending a language standard.

3 Likes

I think the order should actually be:

  1. Develop a new feature, add it to an unreleased version of the compiler
  2. Ask the users for a review
  3. Run it across a large library of existing programs to prove that it didn’t break anything
  4. Let it loose on the users, collect feedback, iterate 2.-4. until satisfied
  5. Write a standard for the new feature

Probably this is not in conflict with your list, but I made it very explicit.

4 Likes

I’m happy with this, but we tried to avoid the iteration - we thought users would get bent-out-of-shape if there were different versions of the compiler with different specifications. Extending features was not a problem and this happened.

What are the prospects of Fortran development working this way? I think that earlier in this thread LFortran was proposed as a test compiler.

1 Like

I think we are already almost there with LFortran: for example we implemented the new generics, people can try it out and the committee is still iterating on it. Although I am still hoping to achieve wider participation of users on this, but as LFortran matures, I think it will happen. We plan to implement the traits generics proposal as well and iterate.

My guess is that AI will be writing most of the Fortran code going forward, so that changes things a bit, so as a community we likely have to adjust to this new world. I don’t know myself how things will turn out. But my guess is that compilers and language development is still important (I can’t even tell if it is more important, or less important). So I am personally not changing any plans radically.

2 Likes

What is interesting is the way Rust does is something similar to Python PEPs, where they make a proposal and suggested compiler code (since Rust is a self-hosted compiler), then if approved makes it in the experimental compiler for a year, then if it passes that step, is moved to the beta compiler for six months, then finally the stable compiler (cant remember the exact time frames, but those are the steps). The idea is when you install the compiler you can run rustup and the toolchain version you want, and then if you want to go with the experimental, beta or stable compiler. So then you are testing for at least a year in the wild any new experimental features. The also upside is you get a rolling release of compiler updates that slowly include features/options that actually result in an improvement of the standard. Then you also can put into the folder that contains your code a rust-toolchain.toml that specifies the specific compiler version you want to use fore that folder.

1 Like

@ScottBoyce yes, I think that’s a very good way to do it.

I’ve never seen codee before but putting all you material behind a registration form is a sure way to turn people away. Also, not having a pricing policy is a red flag. I understand to contact for group subscribers or teams, but a contact us for even a single user? Maybe I’m just weird…

I have refactored FORTRAN 77 codes to be much faster by introducing simd and OpenMP and also better use of contiguous memory. I don’t see OO to offer this improvement, especially in computationally intensive codes.
What sort of “faster” is OO achieving ?

I was not referring to the substantial kind of speedups that one can achieve with the techniques you mentioned.

I simply observed that several of the refactored codes ended up being faster than their FORTRAN 77 originals in the few per cent range. I considered this interesting enough to share, given some of the preconceived notions against OOP.

Hence, having both efficient and modern, maintainable, code is no contradiction.

I have “refactored” old F77 codes by introducing mainly F90/95 features, including;

  • converting COMMON to modules and introducing derived type data structures to better document data structures.
  • modifying f77 memory management to ALLOCATE arrays
  • with cleaner array descriptions, modify subscript order to improve memory access.
  • move disk based data structures into derived type memory data structures.

I have limited my use of CONTAINS to routines that I consider utility routines that support the data structures, while for computational routines, I try to limit the scope for local computational intensive arrays.

I find these refactoring approaches I have adopted have lead to robust new codes with minimal introduced errors.

For computational intensive codes i have not identified what contribution an OO approach provides. If it produces only minimal improvement in performance, why try your approach ?

Object-orientation is not about trying to enhance performance. It is about managing rigid source code dependencies, and enabling the programmer to realize software plugin architectures.

This decoupling of code leads to code locality, and hence to far better code reuse, maintainability, extensibility, flexibility, readability, and reliability. The fact that with a well-optimizing compiler (e.g. ifx) one doesn’t have to take a performance hit to enjoy all of these benefits is just the icing on the cake.

Performance is not the only relevant code metric. Maintainability and extensibility count, too.

If I were to make a pie chart I’d say performance is 50% maintainability is 25% and extensibility is 25%. It also depends on what you want the code to do and its reach.

For example, in quantum chemistry there is this code called pyscf, which is written in python with a C/CUDA backend for accelerated computing. It did not exist very long ago and now it is one of the most popular and widely used programs in the world. They went the route of usability, maintainability, extensibility early on - they have EVERYTHING you’d wish to do, albeit, slowish. But being able to do everything and just pip install pyscf won many people over. Older codes which prioritized performance at the cost of maintainability suffered because adding a new method or upgrading something turns into half a PhD project! in pyscf I could write a new method in a day. Once they nailed things down, they added performance improvements and now they are fast where it matters.

I am mostly interested in the GPU world but a growing part of me looks into the CPU side. If I want GPU code that is fast and usable I need to abandon certain abstractions that would make the code much easier to maintain/extend. I hope that this gets better in Fortran as we progress in time.

My hope is that by developing GPU native Fortran codes in usable and modern lingo it will make life easier for vendors and compilers to do some cool innovations.

1 Like