LFortran now supports all intrinsic functions

We are very happy to announce that LFortran now supports all intrinsic functions.

Blog post: LFortran supports all intrinsic functions -

X: x.com

Big thanks to @certik and all the contributors for their hard work!

35 Likes

Thank you Harshita, @Parth1211, @Pranavchiku and the rest of the team! Great job on this. It was a big effort and I am very happy how it turned out, we have a very solid implementation and design of all intrinsics, see the blog post for all the details.

7 Likes

Fortran array intrinsic functions such as sum are very flexible, working on arrays of any rank, with a dim argument allowing a sum along a specified axis, and a mask argument specifying which elements to sum over.

A user can write array reduction functions such as mean or sumsq (sum of squares) and use an interface with module procedures such to overload them to various ranks, but this is tedious. One can write sum(x**2) instead of sumsq(x), but maybe it is faster to access the elements of x once instead of first squaring the elements and then calling sum.

Would it be possible for LFortran to define extra array intrinsics, perhaps in a module lfortran_intrinsics_mod? Given a C or Fortran function that handles the 1D case of an array reduction function without a mask, it would be nice if the machinery LFortran has to implement sum could be applied to other functions, allowing them to have array arguments of any rank and to have a mask argument.

3 Likes

There are several array intrinsics in MATLAB that I would love to see in Fortran (and am puzzled why they haven’t already been implemented). An example would be the set operations. I’ve written my own work-alikes for several of these functions to help define the support for least-square operators in finite volume CFD codes. It would be great to have these as intrinsic functions. I’m sure others can identify several other MATLAB functions they would like to see implemented. Maybe this is something the community could do as part of stdlib.

1 Like

was made just to look like a subset of Matlab set functions. They were originally a one-off for a specific project but others have used them since with no reports of issues. They are available via fpm(1) as a project file. So far, I do not know that they have been used at large scale so not sure how they perform as-is against Matlab if used in large CFD problems but there is plenty of room to speed them up. Do you have a simple benchmark in Matlab that can be used for comparison? I do not currently have access to Matlab.

1 Like

Nice progress!

Such functions (e.g., mean, variance, …) are implemented in the Fortran stdlib, and could be a good starting point IMO.

3 Likes

My uses are primarily sets of around 16 to 100 integers. Mostly I’m using functions like union, unique and intersect to filter out repeated node and cell indicies. My code supports the following. unique, union, intersect, setdiff, ismember, setxor. I use an optional tolerance argument for floating point functions to avoid the additional functions MATLAB has for that purpose. For testing I just used the examples on the MATLAB site. I no longer have access to MATLAB but I’ll see if Octave supports the set functions. I’ve not benchmark for speed but I doubt that the Fortran code would be slower than the MATLAB code.
See the MATLAB link above for the examples. If you want I’ll package up my code and post it on Github. It will take me a few days to get everything in one package and get it to work with fpm

1 Like

@Beliavsky we can. Can you write an example? Are you thinking of using some new attribute or pragma?

1 Like

I was concerned about performance if called with large arrays and/or millions and up invocations so that should not be a problem. It is fine as-is although any additional versions packaged as fpm packages is always welcome from my perspective.

I am asking if given a C++ function that takes an array argument and length and returns a value of the same type of the array, for example sumsq below, can LFortran generalize it to various ranks and to have an optional mask, so that the user has access to a sumsq function with the flexibility of the intrinsic sum?

#include <stddef.h> // For size_t

double sumsq(const double *array, size_t length) {
	if (length == 0) {
		return 0.0;
	}
	double total = 0.0;
	for (size_t i = 0; i < length; i++) {
		total += array[i] * array[i];
	}
	return total;
}
1 Like

@Beliavsky we can implement it, but I would like to see more details how it can work.

So you write sumsq in Fortran:

function sumsq(array, length) result(total)
integer, intent(in) :: length
real(dp), intent(in) :: array(length)
real(dp) :: total
integer :: i
total = 0
do i = 1, length
    total = total + array(i) * array(i)
end do
end function sumsq

Now how do you tell the compiler to use it like the intrinsic sum with the dim attribute and any rank?

Some options are a pragma !$lf anyrank, or anyrank attribute. How would the compiler figure out how to insert the dim argument structure in the loop? Are you thinking that it should “pattern match” a single loop inside the function?

In this case it’s a reduction, so maybe using a do concurrent with reduce we can make the compiler to automatically make it work for any rank and dim structure.

If you can write a full example of code that you are hoping to compile, with the function declaration and usage and a pragma or a language extension, that would be helpful. We should propose it on the proposals repository as well.

I think programmers would also expect an interface like

real(dp), intent(in) :: array(:)

to allow the actual argument to be an array slice of some kind. For functions like this where each element is referenced a single time, a copy-in(/copy-out) kind of argument association would basically double (or even 3x or 4x) the overall memory references. This is, after all, the same kind of unnecessary memory overhead that these functions are trying to avoid when written as, for example, sum(x*x). There is nothing in the language that prevents the expression sum(x*x) from being evaluated with a single memory reference for each array element, but this is not at present under direct control of the programmer.

With the present language features, the programmer would need to write separate functions for the various ranks, and include all of them within a generic interface. Another approach would be to use an assumed rank declaration for the dummy argument, and inside the routine a select rank construct would be used to separate out the various rank code blocks. So that part is at least possible now within the language. Another useful feature would be for the compiler to recognize the argument rank at compile time, and to somehow branch directly to that code within the select rank construct, avoiding any runtime branching overhead in the same way that the generic interface approach avoids that overhead. Perhaps the proposed generic features will streamline this process?

I think a good Fortran compiler should figure out this type of optimization itself. If you create a function:

real function sumsq(x)
    implicit none
    real, intent(in), contiguous :: x(:)
    sumsq = sum(x**2)
end function

and compile it with gfortran -O3 -march=skylake -mprefer-vector-width=256, the compiler evaluates the square on the fly:

        vxorps  xmm0, xmm0, xmm0           ; sum = 0
        shr     rsi, 3
.L5:
        mov     r8, rcx
        inc     rcx
        sal     r8, 5
        vmovups ymm1, YMMWORD PTR [rax+r8] ; load 8 element into ymm1
        vmulps  ymm1, ymm1, ymm1           ; ymm1 := ymm1**2 
        vaddss  xmm0, xmm0, xmm1           ; add
        vshufps xmm3, xmm1, xmm1, 85
        vshufps xmm2, xmm1, xmm1, 255
        vaddss  xmm0, xmm0, xmm3           ; add
        vunpckhps       xmm3, xmm1, xmm1
        vextractf128    xmm1, ymm1, 0x1
        vaddss  xmm0, xmm0, xmm3           ; add
        vaddss  xmm0, xmm0, xmm2           ; add
        vshufps xmm2, xmm1, xmm1, 85
        vaddss  xmm0, xmm0, xmm1           ; add
        vaddss  xmm0, xmm0, xmm2           ; add
        vunpckhps       xmm2, xmm1, xmm1
        vshufps xmm1, xmm1, xmm1, 255
        vaddss  xmm0, xmm0, xmm2           ; add
        vaddss  xmm0, xmm0, xmm1           ; add
        cmp     rsi, rcx
        jne     .L5                        ; jmp to .L5 if rsi /= rcx

If you compile with ifx -O3 -xSKYLAKE, the body of the loop contains

        vxorps  xmm0, xmm0, xmm0                ; tmp(1:8) = 0
        xor     edi, edi
.LBB0_7:
        vmovups ymm1, ymmword ptr [rcx + 4*rdi] ; xtmp(1:8) = x(i:i+8), load 8 elements 
        vfmadd231ps     ymm0, ymm1, ymm1        ; tmp = tmp + xtmp**2
        add     rdi, 8
        cmp     rdi, rsi
        jle     .LBB0_7                         ; jmp to .LBB0_7 if rdi <= rsi
        vextractf128    xmm1, ymm0, 1
        vaddps  xmm0, xmm0, xmm1                ; tmp(1:4) = tmp(1:4) + tmp(5:8)
        vshufpd xmm1, xmm0, xmm0, 1   
        vaddps  xmm0, xmm0, xmm1                ; tmp(1:2) = tmp(1:2) + tmp(3:4)
        vmovshdup       xmm1, xmm0
        vaddss  xmm0, xmm0, xmm1                ; tmp(1) = tmp(1) + tmp(2)

Obviously, there are peel loops (not shown here) to take care of the remainder. So the compilers are clearly capable of doing the squared summation without creating a temporary array.

2 Likes

Quite a few mentions of sum upthread. I wonder whether there is any optimization preventing

  integer, parameter :: ten8=1e8
  real :: tab(ten8)
  call random_number(tab)
  print *, 'sum= ', sum(tab), '  expected= ',0.5*ten8

to give (gfortran (up to 14.2) and ifx (2025.0.0)):
sum= 1.6777216E+07 expected= 5.0000000E+07
Only ifort (2021.13.1) gives the proper
sum= 4.9997040E+07 expected= 5.0000000E+07

I remember that topic. Just curious if lfortran attempts to be as smart as Intel compiler was (ifort is now EOL)

1 Like

The above large sum example segfaults with LFortran for me as I think we are storing the array on stack, I reported it at Local (stack) array with 1e7 elements segfaults on macOS · Issue #5427 · lfortran/lfortran · GitHub.

Regarding the accumulation, our current implementation does:

real(4) function Sum_4_1_0(array) result(result)
    integer(4) :: __1_i
    real(4), dimension(:), intent(in) :: array
    result = 0.00000000e+00
    do __1_i = lbound(array, 1), ubound(array, 1)
        result = result + array(__1_i)
    end do
end function Sum_4_1_0

But we could change it to use double precision for accumulation. I don’t know if we should always do it, or only with some compiler option? I made this Consider accumulating `sum()` with higher accuracy · Issue #5428 · lfortran/lfortran · GitHub.

Regarding sum(x**2) we currently do not “inline” it, but we’ll add such an optimization later — we’ll focus on such performance after we reach beta. I made this Add optimization to "inline" sum(x**2) · Issue #5429 · lfortran/lfortran · GitHub.

1 Like

Does the compiler perform the same type of optimization without the contiguous attribute? As discussed upthread, being able to compute this summation with a single memory reference to each array element is important, and the contiguous attribute for the dummy argument would require an extra memory allocation and copy-in argument association if the actual argument is not contiguous.

1 Like

Yes. gfortran (but I assume so do other compilers) does multi-versioning based on contiguity:

if (is_contiguous(x)) then
   ! vectorized version
else
   ! scalar version
end if

Also the scalar version uses a single memory reference:

        vxorps  xmm0, xmm0, xmm0           ; sum = 0
        xor     ecx, ecx
.L5:
        vmovss  xmm1, DWORD PTR [rax]      ; load x(i)
        mov     rdi, rcx
        add     rax, rsi
        inc     rcx
        vfmadd231ss     xmm0, xmm1, xmm1   ; sum = sum + x*x
        cmp     rdi, rdx
        jne     .L5

The loop multi-versioning only kicks in at -O3. At -O2 you get only the scalar loop.

1 Like

@certik if you are interested, maybe take a look at this implementation fast_math/src/fast_sum.f90 at main · jalvesz/fast_math · GitHub, there are two chunks versions which are rather interesting. The pure chunked one has on my side proven faster and at worst, one order of magnitude more accurate than gfortran’s intrinsic. The chunked kahan version is not necessarily faster but it is for sure more accurate. Some reference results are in the readme of the project.

2 Likes