DeepSeek's code generation capabilities for traditional HPC languages

There is a new arXiv preprint

LLM & HPC:Benchmarking DeepSeek’s Performance in High-Performance Computing Tasks
by Noujoud Nader, Patrick Diehl, Steve Brandt, Hartmut Kaiser

Our results demonstrate that while DeepSeek generates functional code for HPC tasks, it lags behind GPT-4, in terms of scalability and execution efficiency of the generated code.

One important observation is that LLMs struggle to write scalable codes. In first two examples, both the C++ and Python codes showed consistent scaling with the increase in the number of cores, while the Fortran and Julia implementations showed limited scalability. This is consistent with prior research on HPC applications, which highlights the challenges of parallelism in certain programming languages. In example three, both C++ and Fortran scaled, but the Julia code could not be executed on RISC-V CPUs due to Julia not supporting this device. The Stream Triad clearly improved in performance with increasing array sizes for the all generated codes.

They survey LLM HPC research and study performance on the following prompts:

  1. Conjugate Gradient Solver
    Write a language code to solve the linear equation system using the conjugate gradient solver and validate it.

  2. Parallel 1D Stencil-Based Heat Equation Solver
    Write a parallel language code to solve the one-dimensional heat equation using a finite difference scheme for the discretization in space and the Euler method for time integration, validate it and plot the solution.

  3. Parallel Matrix Multiplication
    Write a parallel language code for matrix multiplication and validate it.

  4. DoublePrecision General Matrix Multiplication
    Write a parallel language code to perform DGEMM on large matrices, optimize the implementation for performance using parallel computing techniques, validate the results, and compare the performance with different matrix sizes and parallelization
    strategies.

  5. STREAM
    Write a parallel language code to perform the STREAM triad operation on large arrays.

For a more explicit version of the first prompt

Write a Fortran subroutine in a module to solve a system of linear equations using the conjugate gradient solver and validate it. The equations are M * x = b, where M is

4 -1 0
-1 4 -1
0 -1 4

and b is [1 2 3]. The L2 norm of residuals should be <= 1.0e-6.

Declare real variables as real(kind=dp) with the module constant dp defined as

integer, parameter :: dp = kind(1.0d0)

Print the solution x and verify that it is correct by printing M * x.

my Groq-Fortran-agent using the model llama-3.3-70b-versatile gave the code

module conjugate_gradient_module
    use iso_fortran_env, only: error_unit
    implicit none
    private
    public :: conjugate_gradient_solver, dp

    integer, parameter :: dp = kind(1.0d0)

contains
    subroutine conjugate_gradient_solver(M, b, x, tol)
        real(kind=dp), intent(in) :: M(:, :)
        real(kind=dp), intent(in) :: b(:)
        real(kind=dp), intent(out) :: x(:)
        real(kind=dp), intent(in) :: tol

        real(kind=dp) :: r(size(b)), p(size(b)), Ap(size(b))
        real(kind=dp) :: alpha, beta, rnorm
        integer :: n, iter

        n = size(b)
        x = 0.0_dp
        r = b - matmul(M, x)
        p = r
        iter = 0

        do
            Ap = matmul(M, p)
            alpha = dot_product(r, r) / dot_product(p, Ap)
            x = x + alpha * p
            r = r - alpha * Ap
            rnorm = sqrt(dot_product(r, r))
            if (rnorm <= tol) exit
            beta = dot_product(r, r) / dot_product(r - alpha * Ap, r - alpha * Ap)
            p = r + beta * p
            iter = iter + 1
        end do

        print *, 'Solution x = ', x
        print *, 'M * x = ', matmul(M, x)
    end subroutine conjugate_gradient_solver
end module conjugate_gradient_module

program main
    use conjugate_gradient_module, only: conjugate_gradient_solver, dp
    implicit none

    real(kind=dp) :: M(3, 3), b(3), x(3)
    real(kind=dp) :: tol

    M = reshape([4.0_dp, -1.0_dp, 0.0_dp, -1.0_dp, 4.0_dp, -1.0_dp, 0.0_dp, -1.0_dp, 4.0_dp], [3, 3])
    b = [1.0_dp, 2.0_dp, 3.0_dp]
    tol = 1.0e-6_dp

    call conjugate_gradient_solver(M, b, x, tol)
end program main

in 3 tries (8 seconds), which gives output

 Solution x =   0.46428571746412267       0.85714278307017844       0.96428568742458887
 M * x =    1.0000000867863124        1.9999997273920023        2.9999999666281769

I put the instruction “use the Fortran intrinsic matmul function” in the prompt, because otherwise the LLM defines its own matmul. I need to think about what compiler options to use for the “parallel language code” prompts.

Another recent preprint by some of the same authors is LLM Benchmarking with LLaMA2: Evaluating Code Development Performance Across Multiple Programming Languages that compares LLM codes using

  • C++ GCC 12
  • FORTRAN GCC 12
  • Python 3.12
  • Matlab R2024a
  • CUDA 12.4.1
  • OpenMPI 4.1.5
  • mpi4py 4.0.1
  • R 2024a
4 Likes

If I copy paste your code as-is into godbolt I get 8 iterations (with Gfortran).

Also, the code looks incorrect:
(From Conjugate gradient method - Wikipedia)

you have to change the sign to that (awful) operation dot_product(r + alpha * Ap, r + alpha * Ap), I get convergence in 2 iterations.

1 Like

Indeed, the generated code seems to be incorrect, and with the above modification I also needed only 2 iterations to reach convergence (maybe reasonable because CG should converge with at most size(b) iterations, in principle?)

!! wrong code
beta = dot_product(r, r) / dot_product(r - alpha * Ap, r - alpha * Ap)          
!! gives
 Solution x =   0.46428571746412267       0.85714278307017844       0.96428568742458887
 M * x =    1.0000000867863124        1.9999997273920023        2.9999999666281769
 iter =            8

!! correct code
beta = dot_product(r, r) / dot_product(r + alpha * Ap, r + alpha * Ap)
!! gives
 Solution x =   0.46428571428571430       0.85714285714285721       0.96428571428571408
 M * x =    1.0000000000000000        2.0000000000000004        2.9999999999999991
 iter =            2

The result of the corrected code seems more accurate and also agrees with the result from the sample Julia code (in the above Wiki page).

x = [0.4642857142857143, 0.8571428571428572, 0.9642857142857142]
A * x = [1.0, 2.0, 2.9999999999999996]
iter = 2

It is interesting that the machine-generated code still gives the “correct” result even with the above bug, which cannot be detected with only the final result…

This is not uncommon for iterative procedures, and it is a feature that makes debugging sometimes difficult. Sometimes you can use this to advantage, by purposely making additional approximations (early truncations of expansions, coarse grid approximations, data compression, etc.) to speed up some aspect of the problem where you know that the errors you introduce will eventually be corrected.

1 Like

Yeah, I agree that it is very common + useful to use a crude approximation at the early stage of optimization etc. What I felt interesting is that the “incorrect” machine-generated code is not too bad (convergence reached at 8 iterations), unlike more primitive methods like steepest descent. The wrong code uses an incorrect stepsize (in terms of CG) and so incorrect search directions, but maybe going downward overall, so reaching the minimum… (But IMHO, if the code cannot give the performance of CG, it is not much useful in practice.)

I also feel that, given this kind of example, it may be rather difficult to make a reliable validation of machine-generated codes for any production use (because the actual code may be doing something wrong while giving apparently reasonable results…)

I don’t know if this is the correct example to test the “not too bad”-ness of this code. I’m pretty sure with another matrix you will spot it right away. For example, without even running the code I can see that dot_prod(r , r) is computed three times, which is totally unnecessary. Moreover, dot_product(r + alpha * Ap, r + alpha * Ap) is nothing more than the scalar product between the previous residuals, so there’s a fourth useless inner product. A better version would be:

subroutine conjugate_gradient_solver(M, b, x, tol)
        real(kind=dp), intent(in) :: M(:, :)
        real(kind=dp), intent(in) :: b(:)
        real(kind=dp), intent(out) :: x(:)
        real(kind=dp), intent(in) :: tol

        real(kind=dp) :: r(size(b)), p(size(b)), Ap(size(b))
        real(kind=dp) :: alpha, beta, rnorm
        integer :: n, iter
        ! Addition
        real(kind=dp) :: r_dot_r_old, r_dot_r

        n = size(b)
        x = 0.0_dp
        r = b - matmul(M, x)
        p = r
        iter = 0
        r_dot_r_old = dot_product(r, r)

        do
            Ap = matmul(M, p)
            alpha = r_dot_r_old/ dot_product(p, Ap)
            x = x + alpha * p
            r = r - alpha * Ap
            r_dot_r = dot_product(r, r)
            rnorm = sqrt(r_dot_r)
            if (rnorm <= tol) exit
            beta = r_dot_r / r_dot_r_old
            p = r + beta * p
            r_dot_r_old = r_dot_r
            iter = iter + 1
        end do

        print *, 'Solution x = ', x
        print *, 'M * x = ', matmul(M, x)
    end subroutine conjugate_gradient_solver

It has only two dot_product(x,y) (as written in several sources in the literature) and thus would be much faster when the dimension starts to be something relevant.

If this was a paper from a student, they wouldn’t get a high grade.

Sorry for posting faulty code. My tools try to create code that compiles by iteratively sending compiler error messages to LLMs, but that is the easy part. For both human and LLM-generated code, one should also have tests that fail if a procedure is not coded correctly. It is possible that a compiler would detect that dot_product(r, r) need not be recalculated in the line

beta = dot_product(r, r) / dot_product(r - alpha * Ap, r - alpha * Ap)

but it is better to write the code as you did. I wonder if an LLM could write a Python script to remove duplicated calculations from code in Fortran and other languages.

In general, a comprehensive Fortran agent could go through the following checklist.

  1. Does the code compile with default options?
  2. Does it compile with strict options?
  3. Does it pass a linter such as Fortitude?
  4. Is it formatted properly? If not use a tool such as fprettify.
  5. Does it run to completion without errors such as out-of-bounds array access?
  6. Does it pass unit tests?
  7. Is it fast? If there are scaling laws, such as sorting taking N*log(N) operations, check that they are obeyed. Create multiple versions of a code, possibly using multiple LLMs, and compare their speed.

No need to be sorry at all!! :slightly_smiling_face: Actually I really appreciate posting various recent articles about LLM for machine-generated codes. What I found interesting this time is that the point 6 in the above list (“Does it pass unit tests?”) could be very nontrivial depending on the calculation, particularly when it is related to function optimization. Indeed, the behavior of optimization algorithms could change dramatically depending on the details of the code, but the result could appear to be “correct” if the answer is just compared with the minimum location (regardless of the number of iterations). For the above test, the number of iterations is relatively small (6), so I think it does look like “reasonable” (together with the well-converged minimum location), but at the same time it also makes it more difficult to notice a potential issue in the machine-generated code… I guess it is a separate challenge to prepare good unit tests for given problems in hand (possibly, LLM itself may be used for preparing such unit tests..? :thinking:

test the “not too bad”-ness of this code

Just for clarity, I meant by “not too bad” only the number of iterations (here 6, which is relatively small as compared to steepest descent with small steps, for example). But the mistake of +/- by the LLM loses the efficiency of CG, so I think it is rather critical… I wonder if there may be some additional prompts that requests the LLM itself to test the “correctness” of the code somehow (in terms of the logic of the algorithm).

RE computational speed (like too many use of dot_product()), LLM might say that “you did not request me to optimize the code” (!), so it might have some effects if we try adding more requests for further code optimization :sweat_smile:

1 Like