# Vectorizing a conditional product

Unlike C, which allows us to inline some assembly, Fortran sometimes requires second-guessing the compiler. Here is an operation I’d like to vectorize or impliment with SIMD instructions

``````                                        real(4) :: ZT(4,2)
.
(code that determines ZT)
.
T = 1.0
if( Zinv > ZT(1,1)) T = T * ZT(1,2)
if( Zinv > ZT(2,1)) T = T * ZT(2,2)
if( Zinv > ZT(3,1)) T = T * ZT(3,2)
if( Zinv > ZT(4,1)) T = T * ZT(4,2)
``````

I could use Merge() instead, but it is quite slow.

Any ideas?

1 Like

I’m not sure if a single-precision array with length 4 is long enough to allow efficient vectorization, but it looks like both Intel and gfortran compilers do add SIMD instructions to your calculation if done this way:

``````program test_simd_product
implicit none

real(4) :: ZT(4,2),ZINV

call random_number(ZT)
call random_number(ZINV)

print *, get_T(ZT,ZINV)
print *, get_T_orig(ZT,ZINV)

contains

! Use more modern syntax
pure real function get_T(ZT,ZINV) result(T)
real(4), intent(in) :: ZT(4,2),ZINV
end function get_T

! from the post
pure real function get_T_orig(ZT,ZINV) result(T)
real(4), intent(in) :: ZT(4,2),ZINV
T = 1.0
if( Zinv > ZT(1,1)) T = T * ZT(1,2)
if( Zinv > ZT(2,1)) T = T * ZT(2,2)
if( Zinv > ZT(3,1)) T = T * ZT(3,2)
if( Zinv > ZT(4,1)) T = T * ZT(4,2)
end function get_T_orig

end program test_simd_product
``````

You can check yourself what several compilers do at this link

2 Likes

By the looks at compiler explorer both the solution from Federico with the `product` function (a transformational functions to reduce arrays) and the explicit OpenMP SIMD version (see below) appear to generate vectorized instructions with the Intel Fortran compilers (make sure to use `-qopen-simd` and `-x`code, where code specifies the intstruction set you’d like to target.

``````pure function condprod(a,b,bmax) result(t)
real, intent(in) :: a(:), b(:), bmax
real :: t
integer :: i
t = 1.0
!\$omp simd reduction(*: t)
do i = 1, size(a)
if (bmax > b(i)) t = t * a(i)
end do
end function

pure function condprod2(a,b,bmax) result(t)
real, intent(in) :: a(:), b(:), bmax
real :: t
end function
``````

Some scholarly papers that may be of assistance re vectorization of a conditional product

Could you indicate what is the probability of “( Zinv > ZT(i,1) )”

You could approach by first using “if( Zinv <= ZT(i,1) ) ZT(i,2) = 1.0”
then T = product ( ZT(1:4,2) ), which should be SIMD

John, (et alia)

Here is something we’re come up with here. It relies on Intel’s compiler’s form of boolean result: it can be used as an integer with value of 0,-1 for F, T. So, It may not be portable across compilers.

T = product( ( -(Zinv <= ZT(:,1)) - (Zinv > ZT(:,1))*ZT(:,2)) )

There are no code branchings. Pure SIMD. If floats are involved then the integer result of the boolean evaluations are converted to float 0 or 1. If not, the entire process is integer. But in either case, it’s SIMD.

There are theological arguments against such coding, since nothing in the Fortran Standard specifies how a compiler has to manifest a boolean result outside of it being interpreted a ‘logical.’ Nevertheless, if you’re doing this > billion times, it’s an interesting option.

David

I just added this to compiler explorer specializing the three demonstrated approaches to arrays of 4 elements. Both your latest suggestion and the explicit OpenMP SIMD version do not involve branching: Compiler Explorer. The OpenMP versions even needs two instructions less than the non-standard mixed boolean/real expressions. Note I didn’t check the functional equivalence of the three calls.

With the GNU Fortran compiler it fails with

``````    Output of x86-64 gfortran 12.1 (Compiler #1)

/app/example.f90:21:45:

21 |     t = product( ( -(bmax <= b) - (bmax > b)*a) )
|                                             1
Error: Operands of binary numeric operator '*' at (1) are LOGICAL(4)/REAL(4)
``````

The first two versions however continue to work also with the GNU Fortran compiler: Compiler Explorer. The SIMD version also produces branchless instructions that are very similar looking to Intel Fortran.

Relying on the OpenMP standard seems like a much better option to me than convoluted “hacks”, which fail when ported to other compilers. The intentions are also made clearer to fellow coders, unlike the mixed boolean-real expression where it takes some effort to see what’s going on. (Big applaud everyone in the OpenMP committee for their work on the SIMD vectorization constructs. cc @mklemm )

I wonder if OpenMP could widen the semantics of the `!\$omp simd` construct to include also Fortran’s array reduction routines like `product` and `sum`?

TL;DR: OpenMP SIMD vectorization gives functional and performance portable results, while keeping the code intentions clear

The OpenMP `workshare` construct does support that kind of pattern. A good implementation would exploit the fact that work distribution my involve adding an implicit `simd` construct while parallelizing across OpenMP threads. Having said that, this question is not about executing the above code in parallel, but SIMD only.

One thing that we are working on for OpenMP API version 6.0 is to allow all OpenMP loop features with implicit Fortran loops. We are still debating the exact semantics of this and waht kind of implicit loops to allow for. We are certain that we can do this for simple Fortran array statements, but to make it really useful, we need to apply the additional patterns like the `PRODUCT` function.

Once we have this, programmers will be able to explicit vectorize also using the `simd` construct.

Thanks @mklemm for this insight.

The problem I see with this feature is that there is a considerable overhead in initiating a !\$OMP DO PARALLEL region (of about 20,000 processor cycles for the first) I presume / hope a lot of this time is for the first region initiation, but the idea of applying subsequent !\$OMP regions to many loops needs to address this problem.
There is also the issue of how many threads are available, especially if these loops are in a secondary thread.

The overhead of entering an OMP region many times needs better documentation.

I also wonder about the applicability of “OpenMP SIMD” efficiency due to startup delays, although I have no experience of this usage.

That clearly depends on the number of threads. The main cost of this is how long it takes to broadcast the change of a view pointer-like values from one cache to all the other caches of the system to inform the waiting worker threads that the main thread want to fork (effectively the release operation of a barrier). Then, the other major cost is to do the join operation (corresponds to a gather operation in barriers).

Well, you can factor out the `!\$OMP PARALLEL` and have individual `!\$OMP FOR`-style directives inside to parallelize the loops, without facing repeated fork/join operations. You can even then avoid barriers between the parallel loops via the `NOWAIT` clause under certain conditions.

Well, that clearly depends on the implementation, the broadcast algorithm used to implement fork (simply flipping a bit in a cache line means that the cache is sequentially updated for all the waiting threads/cores), and the cache and coherency latencies of the underlying machine. It also depends on the number of threads and may mean that for different thread counts, different algorithms can be used. That makes it very hard to document that.

SIMD does have much less startup overhead. There may be some additional checks for data alignment and for loop iteration spaces that do not evenly divide by the SIMD width, in which case a remainder loop needs to process the excess loop iterations. These are all simple `if` expressions, but of course they add a bit of overhead. But in total, you’ll save quite a bit on the number of evaluations of the loop termination condition.

1 Like

I really do not see the clarity in 20,000 processor cycles to initiate a !\$OMP PARALLEL DO, which I have observed as independent of the number of threads; this is an incredibly slow process.
My limited OpenMP experience is using I7-4790K (8 threads), I7-8700K (12 threads) and Ryzen 5900X (24 threads), they all take roughly .01 miliseconds using gfortran on Windows OS and this startup delay appears to be independent of the number of threads.

What is significant about this delay is the loop that is being converted to multi-thread OMP must be taking of the order of 20,000 processor cycles of computation, before any gain might be achieved.
This computation is not a trivial DO i = 1,10 dot_product loop, an array syntax or simd instruction of perhaps size 4 or 8 which can be suggested as being replaced by an OMP region.

This significant OMP initiation overhead does limit the types of calculations that are suitable for OMP replacement. I don’t see how SIMD instructions would qualify, although again, I have no experience of this usage on other hardware and OS.

The general principal for multi-thread success has been to apply !\$OMP to the outer loop and vectorise the inner loop (and hopefully there are loops in between to make the computation substantial).

I think there is more improvement to be achieved before these approaches become a reality, unless this is the case with other OpenMP implementations ?