We are very happy to announce that LFortran now supports all intrinsic functions.
Blog post: LFortran supports all intrinsic functions -
Big thanks to @certik and all the contributors for their hard work!
Thank you Harshita, @Parth1211, @Pranavchiku and the rest of the team! Great job on this. It was a big effort and I am very happy how it turned out, we have a very solid implementation and design of all intrinsics, see the blog post for all the details.
Fortran array intrinsic functions such as sum are very flexible, working on arrays of any rank, with a dim argument allowing a sum along a specified axis and a mask argument specifying which elements to sum over. A user can write array reduction functions such as mean or sumsq (sum of squares) and use an interface with module procedures to overload them for various ranks, but this is tedious. One can write sum(x**2) instead of sumsq(x), but it may be faster to access the elements of x once instead of first squaring the elements and then calling sum.
Would it be possible for LFortran to define extra array intrinsics, perhaps in a module lfortran_intrinsics_mod? Given a C or Fortran function that handles the 1D case of an array reduction function without a mask, it would be nice if the machinery LFortran has to implement sum could be applied to other functions, allowing them to have array arguments of any rank and to have a mask argument.
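For concreteness, here is a small illustration (added for this writeup, not part of the original post) of the flexibility of the intrinsic sum being described:
program demo_sum
    implicit none
    real :: a(3, 4)
    call random_number(a)
    print *, sum(a)                 ! scalar: sum over all elements
    print *, sum(a, dim=1)          ! rank-1 result: sum along the first axis (size 4)
    print *, sum(a, mask=a > 0.5)   ! sum only the elements where the mask is true
    print *, sum(a**2)              ! "sum of squares" written as an expression
end program demo_sum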
There are several array intrinsics in MATLAB that I would love to see in Fortran (and am puzzled why they haven’t already been implemented). An example would be the set operations. I’ve written my own work-alikes for several of these functions to help define the support for least-square operators in finite volume CFD codes. It would be great to have these as intrinsic functions. I’m sure others can identify several other MATLAB functions they would like to see implemented. Maybe this is something the community could do as part of stdlib.
The package mentioned above was made just to look like a subset of MATLAB set functions. They were originally a one-off for a specific project, but others have used them since with no reports of issues. They are available via fpm(1) as a project file. So far, I do not know that they have been used at large scale, so I am not sure how they perform as-is against MATLAB in large CFD problems, but there is plenty of room to speed them up. Do you have a simple benchmark in MATLAB that can be used for comparison? I do not currently have access to MATLAB.
Nice progress!
Such functions (e.g., mean, variance, …) are implemented in the Fortran stdlib, and could be a good starting point IMO.
My uses are primarily sets of around 16 to 100 integers. Mostly I'm using functions like union, unique, and intersect to filter out repeated node and cell indices. My code supports the following: unique, union, intersect, setdiff, ismember, setxor. I use an optional tolerance argument for floating-point functions to avoid the additional functions MATLAB has for that purpose. For testing I just used the examples on the MATLAB site. I no longer have access to MATLAB, but I'll see if Octave supports the set functions. I haven't benchmarked for speed, but I doubt that the Fortran code would be slower than the MATLAB code.
See the MATLAB link above for the examples. If you want, I'll package up my code and post it on GitHub. It will take me a few days to get everything in one package and get it to work with fpm.
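To make the shape of such a work-alike concrete, here is a minimal hypothetical sketch of an integer unique (it keeps first-occurrence order rather than sorting as MATLAB does, and it is not the code from the package discussed above):
pure function unique_int(a) result(u)
    integer, intent(in)  :: a(:)
    integer, allocatable :: u(:)
    integer :: i
    allocate(u(0))
    do i = 1, size(a)
        ! keep only the first occurrence of each value (O(n**2), fine for small sets)
        if (.not. any(u == a(i))) u = [u, a(i)]
    end do
end function unique_int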
@Beliavsky we can. Can you write an example? Are you thinking of using some new attribute or pragma?
I was concerned about performance if called with large arrays and/or millions of invocations, so that should not be a problem. It is fine as-is, although any additional versions packaged as fpm packages are always welcome from my perspective.
I am asking if, given a C++ function that takes an array argument and length and returns a value of the same type as the array, for example sumsq below, LFortran can generalize it to various ranks and give it an optional mask, so that the user has access to a sumsq function with the flexibility of the intrinsic sum.
#include <stddef.h> // For size_t
double sumsq(const double *array, size_t length) {
if (length == 0) {
return 0.0;
}
double total = 0.0;
for (size_t i = 0; i < length; i++) {
total += array[i] * array[i];
}
return total;
}
@Beliavsky we can implement it, but I would like to see more details on how it could work.
So you write sumsq in Fortran:
function sumsq(array, length) result(total)
integer, intent(in) :: length
real(dp), intent(in) :: array(length)
real(dp) :: total
integer :: i
total = 0
do i = 1, length
total = total + array(i) * array(i)
end do
end function sumsq
Now how do you tell the compiler to use it like the intrinsic sum with the dim attribute and any rank? Some options are a pragma !$lf anyrank, or an anyrank attribute. How would the compiler figure out how to insert the dim argument structure in the loop? Are you thinking that it should "pattern match" a single loop inside the function?
In this case it's a reduction, so maybe using a do concurrent with reduce we can make the compiler automatically handle any rank and dim structure.
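A hedged sketch of that do concurrent formulation (the reduce clause requires Fortran 2023, and dp is assumed to be defined as in the example above); whether a compiler could then generalize the rank and dim handling automatically is exactly the open question:
function sumsq(array) result(total)
    real(dp), intent(in) :: array(:)   ! ideally the compiler would accept any rank
    real(dp) :: total
    integer :: i
    total = 0
    do concurrent (i = 1:size(array)) reduce(+:total)
        total = total + array(i)**2
    end do
end function sumsq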
If you can write a full example of code that you are hoping to compile, with the function declaration and usage and a pragma or a language extension, that would be helpful. We should propose it on the proposals repository as well.
I think programmers would also expect an interface like
real(dp), intent(in) :: array(:)
to allow the actual argument to be an array slice of some kind. For functions like this, where each element is referenced a single time, a copy-in (/copy-out) kind of argument association would basically double (or even 3x or 4x) the overall memory references. This is, after all, the same kind of unnecessary memory overhead that these functions are trying to avoid when written as, for example, sum(x*x). There is nothing in the language that prevents the expression sum(x*x) from being evaluated with a single memory reference for each array element, but this is not at present under direct control of the programmer.
With the present language features, the programmer would need to write separate functions for the various ranks, and include all of them within a generic interface. Another approach would be to use an assumed-rank declaration for the dummy argument, and inside the routine a select rank construct would be used to separate out the various rank code blocks. So that part is at least possible now within the language. Another useful feature would be for the compiler to recognize the argument rank at compile time, and to somehow branch directly to that code within the select rank construct, avoiding any runtime branching overhead in the same way that the generic interface approach avoids that overhead. Perhaps the proposed generic features will streamline this process?
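A minimal sketch of that assumed-rank / select rank approach (Fortran 2018), handling only ranks 1 and 2 for brevity:
function sumsq_anyrank(x) result(total)
    real, intent(in) :: x(..)      ! assumed-rank dummy argument
    real :: total
    select rank (x)
    rank (1)
        total = sum(x**2)
    rank (2)
        total = sum(x**2)
    rank default
        error stop "sumsq_anyrank: rank not supported"
    end select
end function sumsq_anyrank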
I think a good Fortran compiler should figure out this type of optimization itself. If you create a function:
real function sumsq(x)
implicit none
real, intent(in), contiguous :: x(:)
sumsq = sum(x**2)
end function
and compile it with gfortran -O3 -march=skylake -mprefer-vector-width=256, the compiler evaluates the square on the fly:
vxorps xmm0, xmm0, xmm0 ; sum = 0
shr rsi, 3
.L5:
mov r8, rcx
inc rcx
sal r8, 5
vmovups ymm1, YMMWORD PTR [rax+r8] ; load 8 elements into ymm1
vmulps ymm1, ymm1, ymm1 ; ymm1 := ymm1**2
vaddss xmm0, xmm0, xmm1 ; add
vshufps xmm3, xmm1, xmm1, 85
vshufps xmm2, xmm1, xmm1, 255
vaddss xmm0, xmm0, xmm3 ; add
vunpckhps xmm3, xmm1, xmm1
vextractf128 xmm1, ymm1, 0x1
vaddss xmm0, xmm0, xmm3 ; add
vaddss xmm0, xmm0, xmm2 ; add
vshufps xmm2, xmm1, xmm1, 85
vaddss xmm0, xmm0, xmm1 ; add
vaddss xmm0, xmm0, xmm2 ; add
vunpckhps xmm2, xmm1, xmm1
vshufps xmm1, xmm1, xmm1, 255
vaddss xmm0, xmm0, xmm2 ; add
vaddss xmm0, xmm0, xmm1 ; add
cmp rsi, rcx
jne .L5 ; jmp to .L5 if rsi /= rcx
If you compile with ifx -O3 -xSKYLAKE, the body of the loop contains
vxorps xmm0, xmm0, xmm0 ; tmp(1:8) = 0
xor edi, edi
.LBB0_7:
vmovups ymm1, ymmword ptr [rcx + 4*rdi] ; xtmp(1:8) = x(i:i+8), load 8 elements
vfmadd231ps ymm0, ymm1, ymm1 ; tmp = tmp + xtmp**2
add rdi, 8
cmp rdi, rsi
jle .LBB0_7 ; jmp to .LBB0_7 if rdi <= rsi
vextractf128 xmm1, ymm0, 1
vaddps xmm0, xmm0, xmm1 ; tmp(1:4) = tmp(1:4) + tmp(5:8)
vshufpd xmm1, xmm0, xmm0, 1
vaddps xmm0, xmm0, xmm1 ; tmp(1:2) = tmp(1:2) + tmp(3:4)
vmovshdup xmm1, xmm0
vaddss xmm0, xmm0, xmm1 ; tmp(1) = tmp(1) + tmp(2)
Obviously, there are peel loops (not shown here) to take care of the remainder. So the compilers are clearly capable of doing the squared summation without creating a temporary array.
Quite a few mentions of sum upthread. I wonder whether there is any optimization preventing
integer, parameter :: ten8=1e8
real :: tab(ten8)
call random_number(tab)
print *, 'sum= ', sum(tab), ' expected= ',0.5*ten8
to give (gfortran (up to 14.2) and ifx (2025.0.0)):
sum= 1.6777216E+07 expected= 5.0000000E+07
(1.6777216E+07 is 2**24: once a naive single-precision running sum reaches that value, adding terms of order 0.5 no longer changes it.)
Only ifort (2021.13.1) gives the proper
sum= 4.9997040E+07 expected= 5.0000000E+07
I remember that topic. Just curious if lfortran attempts to be as smart as the Intel compiler was (ifort is now EOL).
The above large sum example segfaults with LFortran for me; I think we are storing the array on the stack. I reported it at Local (stack) array with 1e7 elements segfaults on macOS · Issue #5427 · lfortran/lfortran · GitHub.
Regarding the accumulation, our current implementation does:
real(4) function Sum_4_1_0(array) result(result)
integer(4) :: __1_i
real(4), dimension(:), intent(in) :: array
result = 0.00000000e+00
do __1_i = lbound(array, 1), ubound(array, 1)
result = result + array(__1_i)
end do
end function Sum_4_1_0
But we could change it to use double precision for accumulation. I don't know if we should always do it, or only with some compiler option. I opened Consider accumulating `sum()` with higher accuracy · Issue #5428 · lfortran/lfortran · GitHub to track it.
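For illustration, one possible shape of that change (a sketch of the idea, not the actual LFortran code) would be to accumulate in real(8) and round once at the end:
real(4) function Sum_4_1_0(array) result(res)
    real(4), dimension(:), intent(in) :: array
    real(8) :: acc                  ! accumulate in double precision
    integer(4) :: i
    acc = 0.0d0
    do i = lbound(array, 1), ubound(array, 1)
        acc = acc + real(array(i), 8)
    end do
    res = real(acc, 4)              ! single rounding back to real(4)
end function Sum_4_1_0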
Regarding sum(x**2), we currently do not "inline" it, but we'll add such an optimization later; we'll focus on such performance after we reach beta. I made this issue: Add optimization to "inline" sum(x**2) · Issue #5429 · lfortran/lfortran · GitHub.
Does the compiler perform the same type of optimization without the contiguous attribute? As discussed upthread, being able to compute this summation with a single memory reference to each array element is important, and the contiguous attribute for the dummy argument would require an extra memory allocation and copy-in argument association if the actual argument is not contiguous.
Yes. gfortran (but I assume so do other compilers) does multi-versioning based on contiguity:
if (is_contiguous(x)) then
! vectorized version
else
! scalar version
end if
Also the scalar version uses a single memory reference:
vxorps xmm0, xmm0, xmm0 ; sum = 0
xor ecx, ecx
.L5:
vmovss xmm1, DWORD PTR [rax] ; load x(i)
mov rdi, rcx
add rax, rsi
inc rcx
vfmadd231ss xmm0, xmm1, xmm1 ; sum = sum + x*x
cmp rdi, rdx
jne .L5
The loop multi-versioning only kicks in at -O3. At -O2 you get only the scalar loop.
@certik if you are interested, maybe take a look at this implementation: fast_math/src/fast_sum.f90 at main · jalvesz/fast_math · GitHub. There are two chunked versions which are rather interesting. The purely chunked one has, on my side, proven faster and at worst one order of magnitude more accurate than gfortran's intrinsic. The chunked Kahan version is not necessarily faster, but it is for sure more accurate. Some reference results are in the readme of the project.
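For readers who have not seen it, the chunked idea is roughly this (a hand-written sketch, not the fast_math code itself): accumulate block-wise partial sums and combine them at the end, which vectorizes well and limits round-off growth compared with a single running sum.
pure function sum_chunked(x) result(s)
    real, intent(in) :: x(:)
    real :: s
    integer, parameter :: chunk = 64
    real :: acc(chunk)
    integer :: i, n, r
    n = size(x)
    r = mod(n, chunk)
    acc = 0.0
    do i = 1, n - r, chunk
        acc = acc + x(i:i+chunk-1)   ! block-wise partial sums
    end do
    s = sum(acc) + sum(x(n-r+1:n))   ! combine partials and the remainder
end function sum_chunked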