Fortran to C++ translation

LFortran has a C++ backend (besides its LLVM backend) that can take Fortran code and translate to C++. For now only very simple things work:

But eventually all of Fortran will be supported.

@lkedward what are your needs / requirements for such a feature? For example, right now we target Kokkos, but we can target any other C++ library for arrays.

1 Like

I’m not yet sure on a complete set of requirements. Currently the idea is to add a minimal set of language keywords that allow the generation of an OpenCL kernel from a Fortran subroutine. Eventually I’d like to be able to use assumed-shape arguments, multi-dimensional arrays, derived-types, optional args and operator overloading in OpenCL, just using existing Fortran syntax.
Memory management and device synchronization can be handled purely through the OpenCL API, but incorporating this into the Fortran syntax could also be useful.

Kokkos is a higher abstraction level than OpenCL C, but it looks like an OpenCL backend could be made from your C++ backend. There’s also the SYCL framework based on OpenCL which is more similar to Kokkos in abstraction level. Very exciting!

1 Like

Why don’t we collaborate on this. We can add the option to generate OpenCL directly. Indeed, we want to use the existing Fortran syntax whenever possible, and for things that current Fortran doesn’t allow we want to investigate these options (are there any other options?):

  • using pragmas / directives
  • adding new keywords to the language
  • specify the extra information in a separate config file / script, or on a command line (e.g., function A, loop number 2 should have a simd length 16).

And yes, we want to support all of Fortran.

It turns out you can just use AST (Abstract Syntax Tree) to get something going, that’s the current translation in LFortran, but you find that you really need to know the semantics (types). You can collect it of course from the AST, bu it quickly turns tedious. It turns out LFortran already translates AST to ASR (Abstract Semantic Representation), which has all the semantics built in. So ASR is a better starting point to do this translation, which then becomes a routine (local) task of going over all nodes and spitting out C++ code. I am finishing up a first demo using the ASR, I need probably another week or two (you can follow my progress at https://gitlab.com/lfortran/lfortran/-/merge_requests/402). (The ASR is also used as a starting point to the LLVM translation, which is the normal compilation pipeline – we don’t need that for this particular project.)

1 Like

I discovered the RefactorF4Acc project today. It includes a backend to transform Fortran to C/OpenCL.

1 Like

Just found this – hip-fortran : A Fortran layer for AMD HIP to enable GPU acceleration on AMD GPUs

@lkedward The concepts look kind of similar to your OpenCL library, with the nuisance that the kernel is launched on the C++ side.

2 Likes

Interesting that RefactorF4Acc is written in Perl.

Regarding HIP, we have a work in progress translator from Fortran to C++/HIP also, here is an example:

1 Like

Interesting, I hadn’t come across RefactorF4Acc before - this is close to what I want to do, but with modern Fortran using the LFortran frontend.

I did see hip-fortran, but I’ve so far avoided AMD HIP because of it’s relative immaturity and complete lack of support for Windows. It’s essentially a CUDA clone/subset that supports AMD devices as well, so still a vendor-specific solution.

1 Like

Just thought I’d share, Fluid Numerics has teamed up with AMD to support “hipfort” - https://github.com/ROCmSoftwarePlatform/hipfort

1 Like

@fluidnumerics_joe Thanks for the tip, and welcome!

In hipfort, you have to write the kernels in C++/HIP, but then you can call them from Fortran, correct?

That’s correct. hipfort exposes the HIP API through ISO_C_BINDING. GPU memory management can be handled directly in Fortran with this. However, GPU kernels (global and device types) are written in C++/HIP and we currently recommend writing wrapper routines in C++ that call hipLaunchKernelGGL; those wrapper routines can then be exposed to Fortran through ISO_C_Binding.

I’m currently working on a few videos that will be posted to our “Maintaining Scientific Software” playlist on youtube. Coincidentally, I’m looking for folks who would like to share their stories on maintaining scientific software…

I’ve gotten a few questions about whether or not their are plans to support a “CUDA-Fortran” style interface, where routines can be written in fortran syntax and offloaded to GPUs. This type of implementation would need to be supported through a compiler, whereas hipfort is a library interface. Implementing a something similar to CUDA-Fortran would be a good question for folks working on LLVM (Flang), GNU, etc.

@lkedward to get things moving, why don’t I target this kernel to start: https://github.com/LKedward/focal/blob/a5235fef89fd0909689618e22fd68347d2d6c1d6/examples/nbody.cl. If I can generate this from Fortran, would you be interested in exploring how to integrate this with Focal somehow? It will take time to get everything working robustly, but one has to start somewhere and then keep improving it.

1 Like

This is a good idea to get started!
Yep I’ll start thinking about integrating with Focal - I think ideally LFortran would also generate a Fortran subroutine that launches the kernel through the desired backend.
I will also try to jot down my ideas about syntax for you later.

1 Like

Perfect. In fact we can just start with the sum kernel and then go to more complicated ones.

@lkedward can you write Fortran code that you would like to translate to your sum kernel? Here is the sum kernel:

__kernel void sum(const int size, const __global float * vec1, __global float * vec2){
  int ii = get_global_id(0);
  if(ii < size) vec2[ii] += vec1[ii];
}

So a direct Fortran counterpart would be:

kernel subroutine sum(size, vec1, vec2)
integer, intent(in) :: size
real, global, intent(in) :: vec1(:)  ! or vec1(size)
real, global, intent(in) :: vec2(:)  ! or vec2(size)
integer :: ii
ii = get_global_id(0)
if(ii < size) vec2(ii) = vec2(ii) + vec1(ii)
end subroutine

However, why cannot this kernel be simply generated from do concurrent (it seems it would work for other similar kernels also):

real, allocatable :: vec1(:), vec2(:)
integer :: ii
...
do concurrent (ii = 1:size(vec2))
    vec2(ii) = vec2(ii) + vec1(ii)
end do

Isn’t it semantically equivalent? It seems do concurrent would be simpler for this case and does not need any extra extensions to the Fortran language, it is already part of it, and LFortran simply generates the correct kernel out of it.

2 Likes

The if guard if(ii < size) in the OpenCL code is only required because more threads are launched than work items - this is the kind of thing that I would like to abstract out in the Fortran code (but still needs to be generated).

I agree you’re right; perhaps we should first focus on what we can do with do concurrent to start with, and then see what limitations arise?

1 Like

I think both your sum and nbody kernels can be expressed using do concurrent, and so I think we should start with that. It maps nicely to Kokkos also. The only thing missing from Kokkos as I can see is that we need to enable specifying a layout of the array, so that can be an extra attribute at its declaration. Looking at CUDA examples, the only kernels that might be harder to translate are those that do something fancy with blockDim, blockIdx and threadIdx. I found one example here:

__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;
__global__ void sum(const float* array, unsigned int N,
                    volatile float* result)
{
    float partialSum = calculatePartialSum(array, N);
    if (threadIdx.x == 0) {
        result[blockIdx.x] = partialSum;
        __threadfence();
        unsigned int value = atomicInc(&count, gridDim.x);
        isLastBlockDone = (value == (gridDim.x - 1));
    }
    __syncthreads();
    if (isLastBlockDone) {
        float totalSum = calculateTotalSum(result);
        if (threadIdx.x == 0) {
            result[0] = totalSum;
            count = 0;
        }
    }
}

So it seems to me that do concurrent is a large subset of all possible kernels, but perhaps not all of them. So I think if we start with do concurrent, it will get us pretty far, and I think most users would prefer using that if it can do the job. And then later we can see how to support writing more general kernels directly.

1 Like

That example shows how convoluted it can be to do reductions on GPUs efficiently. Reductions will have to be treated separately - was there ever a proposal to add a reduction clause to do concurrent or is this not allowed by the standard?

:+1:

1 Like

Yes, the current plan is that Fortran 202X will have reduce in do concurrent. And just this week we implemented it in LFortran (!432, !433 and !434, although I had to temporarily disable it in the parser in !435 as my current implementation slowed things down, but that will get fixed).

I asked a colleague who uses Kokkos a lot, and he said that most CUDA kernels that people would like to write are possible to do in Kokkos. So what Fortran is currently missing from Kokkos are things that almost for sure we have to add to Fortran one way or another: parallel reduce (already planned), parallel scan, memory layout of arrays, location of an array (host, device, hybrid). There might be a few more things, but it seems do concurrent gets us far.

3 Likes

I just noticed kokkos has published some Fortran interoperability examples.

1 Like

Yes, they are developed by LANL. I plan to use them with LFortran’s C++ translation where it makes sense.