Avoiding if-statements inside loops with procedural pointers

Hi everyone,

I’m working with a code that uses if-statements inside big loops to call certain procedures. In many cases, the procedure should be called if a certain condition is met, or not called otherwise. These conditions are defined prior to the start of the loop so the results of the if-statements called in each iteration are the same. My goal is to eliminate the computation load imposed by these conditional structures.

In cases where a procedure A is called if a a given condition is verified, and a procedure B is called otherwise, I’ve managed to define a procedural pointer prior to the beginning of the loop to the adequate procedure, and hence avoid the use of if-statements in each iteration. However, in cases where a procedure is called if a condition is met, but not called otherwise, I’ve not been so successful. My idea was to set the pointer to a “do nothing” procedure with the same number and type of inputs and outputs, but this does not seem to be a good approach.

I was wondering if anyone has any suggestion on how to implement this type of construct, assuming it is possible to do so in Fortran. I suspect that part of the solution might be related to Object Oriented Programming, of which I know little about when it comes to its use with Fortran.

Edit1: grammar

I would try something simple first, such as setting the pointer to null(), and calling it only when it is associated. That test would probably be less expensive than setting up the stack and calling a dummy procedure.

If you have only a few possible procedures to call, then another approach would be to call them within a select case, where the integer case value is determined outside the loop. That compiles to a simple table lookup, so it might be even cheaper than an if() test. The no-call case would be just one of the cases, and with this approach the argument lists need not be all identical.

Thank you @RonShepard !

The approach I’m using at the moment is to use an if construct with a logical variable as its condition, like so:

...
logical, intent(in) :: ical
...
do i=1, bignumber
    if (ical) then
        call sub(a,b,c)
    end if
...
end do

I don’t know how the compiler treats this, but I thought that using a logical variable directly in the the if construct would be better than using an expression. So in cases where I can only have two possible outcomes, is the construct

logical, intent(in) :: ical
procedure (sub), pointer :: p1 => null()
...
if (ical) p1 => sub

do i=1, bignumber
    if(associated(p1)) call p1(a,b,c)
...
end do

better than the first?

Edit 1: typo

@ffff ,

With compiler advances over the years and with most recent Fortran compilers, you will likely not notice any performance difference, assuming the code snippets capture the essence of this code.

What you may end up benefitting from is the readability and maintainability of your code, what approach you and your colleagues (who may be you yourself coming back to the code after a period of time!) will find easier to understand and enhance and maintain.

Going with code you can read and work with more easily is what will pay off best.

1 Like

Have you done any performance measurements or analysis to actually verify that the if checks are actually a problem? My understanding is that the branch prediction on modern processors is sufficiently good, especially in the case the condition doesn’t change for the duration of the loop, that it’s probably not worth worrying about. evaluating the conditional expression prior to the loop and then using that variable in the if statement is probably as performant as you can get. Anything else will involve extra indirection, preventing certain optimisations.

2 Likes

I think your first goal should be to identify the computation load imposed by these conditional structures.

I would expect that the time difference between “if (ical) then” and “if(associated(p1))” would be insignificant when compared to the time taken to process either “call sub(a,b,c)” or “call p1(a,b,c)”

My preference would be the clearest representation, which to me, appears to be “if (ical) call sub(a,b,c)”

If you can identify a significant performance difference due to the IF, I would like to see a documented example.

Would the following provide a significant difference ?

 if (ical) then
    do i=1, bignumber
        call sub(a,b,c)
     ...
    end do
 else
    do i=1, bignumber
     ...
    end do
 end if

I think you would struggle to find a timer that shows a reliable measure of difference.

1 Like

Thank you for your input @everythingfunctional !

I understand now that, as @RonShepard pointed out, setting up a dummy procedure would be more expensive than just evaluating the if statement in each iteration.

Unfortunately, I believe that I can’t easily make some meaningful performance measurements since this would require restructuring hundreds of lines of code. I created this topic so as to be sure of what would be the best approach before diving into a (probably needless) full code restructuring process. The current version of the code is using something like this

do i=1, bignumber
    if (ical) call sub(a,b,c)
    ...
end do

which, considering the present discussion, I believe to be the best approach.

Thank you @JohnCampbell !

The initial focus of my question was to understand which of the following snippets would be the most performant

if (ical) then
   p1 => sub
else
   p1 => sub_do_nothing
end if
do i=1, bignumber
    call p1(a,b,c)
end do

or

do i=1, bignumber
    if (ical) call sub(a,b,c)
end do

meaning that the question is about choosing between an if-statement inside the loop or avoiding this nested conditional statement altogether with the use of a procedural pointer. As I understand now, it seems that the second snipped might be the most performant.

Yet, I agree with you that probably there would not be much of a difference between the “if (ical)” and the “if (associate(p1))” statements in terms of performance and, of course, my preferred one is the “if (ical)” construct.

Regarding the question of nesting the loop inside the if-statement, this would not be ideal since I have a significant number of conditional calls to various procedures inside the loop which would lead to a big number of combinations and, hence, of if branches.

The version of code @JohnCampbell has shown is what would be expected following loop invariant code motion. One form of such code motion is also loop unswitching.

If ical does not change for the duration of the loop, the compiler can create two versions of the loop, one where ical is true, and the other where it is false. For an insightful article on branches and performance I would suggest: How branches influence the performance of your code and what can you do about it?

Here’s an example I’ve copied from the article above:

subroutine average(a,include_negative,avg)
implicit none
real, intent(in) :: a(:)
logical, intent(in) :: include_negative
real, intent(out) :: avg

integer :: i, k

avg = 0
k = 0

do i = 1, size(a)
  if (include_negative) then
    avg = avg + a(i)
  else
    if (a(i) > 0) then
      avg = avg + a(i)
      k = k + 1
    end if
  end if
end do

if (include_negative) then
  avg = avg / size(a)
else
  avg = avg / k
end if

end subroutine

You can ask your compiler to emit an optimization report. With GCC, the option is -fopt-info or -fopt-info-<option> for specific optimizations. For the subroutine above:

$ gfortran -O3 -c -fopt-info-loop average.f90 
average.f90:14:6: optimized: Unswitching loop on condition: if (pretmp_76 != 0)

average.f90:14:6: optimized: versioned this loop for when certain strides are 1
average.f90:14:6: optimized: versioned this loop for when certain strides are 1

You can see that unswitching was performed. At -O2 on the other hand, this optimization was not performed.

You could inspect the generated intermediate representation with the flag -fdump-tree-optimized, however for the untrained eye it’s difficult to understand what’s going on.

Dump of optimized version
$ cat average.f90.252t.optimized 

;; Function average (average_, funcdef_no=0, decl_uid=4239, cgraph_uid=1, symbol_order=0)

Removing basic block 3
Removing basic block 27
Removing basic block 28
Removing basic block 29
Removing basic block 30
Removing basic block 31
Removing basic block 32
Removing basic block 33
Removing basic block 34
Removing basic block 35
Removing basic block 36
Removing basic block 37
Removing basic block 38
__attribute__((fn spec (". r r w ")))
void average (struct array01_real(kind=4) & restrict a, logical(kind=4) & restrict include_negative, real(kind=4) & restrict avg)
{
  unsigned long ivtmp.40;
  unsigned long ivtmp.34;
  unsigned long ivtmp.28;
  unsigned long ivtmp.21;
  integer(kind=4) k;
  integer(kind=4) i;
  real(kind=4)[0:D.4276] * restrict a.0;
  integer(kind=8) ubound.0;
  integer(kind=8) _1;
  integer(kind=8) _2;
  integer(kind=8) _3;
  integer(kind=8) _5;
  real(kind=4) prephitmp_11;
  real(kind=4) _13;
  real(kind=4) cstore_14;
  unsigned long _15;
  real(kind=4) _16;
  real(kind=4) prephitmp_20;
  real(kind=4) prephitmp_21;
  real(kind=4) pretmp_24;
  real(kind=4) _25;
  real(kind=4) _26;
  real(kind=4) _28;
  real(kind=4) _29;
  integer(kind=8) iftmp.6_33;
  real(kind=4) prephitmp_35;
  real(kind=4) prephitmp_36;
  integer(kind=8) _39;
  real(kind=4) pretmp_40;
  integer(kind=4) _47;
  real(kind=4) prephitmp_60;
  unsigned int _64;
  unsigned long _66;
  real(kind=4) prephitmp_73;
  real(kind=4) prephitmp_74;
  real(kind=4) pretmp_75;
  logical(kind=4) pretmp_76;
  real(kind=4) _77;
  real(kind=4) prephitmp_82;
  void * _85;
  real(kind=4) prephitmp_87;
  real(kind=4) pretmp_94;
  real(kind=4) _95;
  unsigned int _98;
  unsigned long _99;
  unsigned long _100;
  unsigned long _103;
  integer(kind=8) _106;
  unsigned long _107;
  void * _109;
  unsigned int _110;
  unsigned int _111;
  integer(kind=4) _112;
  void * _116;
  unsigned int _117;
  unsigned int _118;
  unsigned long _119;
  unsigned long _120;
  unsigned long _123;
  integer(kind=8) _126;
  unsigned long _127;
  void * _129;
  unsigned int _130;
  unsigned int _131;
  integer(kind=4) _132;

  <bb 2> [local count: 118111598]:
  _39 = *a_38(D).dim[0].stride;
  if (_39 != 0)
    goto <bb 4>; [50.00%]
  else
    goto <bb 3>; [50.00%]

  <bb 3> [local count: 59055799]:

  <bb 4> [local count: 118111598]:
  # iftmp.6_33 = PHI <_39(2), 1(3)>
  a.0_42 = *a_38(D).data;
  _1 = *a_38(D).dim[0].ubound;
  _2 = *a_38(D).dim[0].lbound;
  _3 = _1 - _2;
  ubound.0_43 = _3 + 1;
  *avg_45(D) = 0.0;
  _5 = MAX_EXPR <ubound.0_43, 0>;
  _47 = (integer(kind=4)) _5;
  pretmp_76 = *include_negative_48(D);
  if (_47 <= 0)
    goto <bb 22>; [11.00%]
  else
    goto <bb 5>; [89.00%]

  <bb 5> [local count: 105119322]:
  if (pretmp_76 != 0)
    goto <bb 16>; [50.00%]
  else
    goto <bb 6>; [50.00%]

  <bb 6> [local count: 52559661]:
  if (iftmp.6_33 != 1)
    goto <bb 7>; [20.00%]
  else
    goto <bb 12>; [80.00%]

  <bb 7> [local count: 10511932]:
  _106 = iftmp.6_33 * 4;
  _107 = (unsigned long) _106;
  ivtmp.28_108 = (unsigned long) a.0_42;
  _110 = (unsigned int) _5;
  _111 = _110 + 1;
  _112 = (integer(kind=4)) _111;

  <bb 8> [local count: 95563020]:
  # i_58 = PHI <i_41(10), 1(7)>
  # k_54 = PHI <k_7(10), 0(7)>
  # prephitmp_36 = PHI <prephitmp_21(10), 0.0(7)>
  # ivtmp.28_104 = PHI <ivtmp.28_105(10), ivtmp.28_108(7)>
  _109 = (void *) ivtmp.28_104;
  pretmp_40 = MEM[(real(kind=4) *)_109];
  if (pretmp_40 > 0.0)
    goto <bb 9>; [59.00%]
  else
    goto <bb 10>; [41.00%]

  <bb 9> [local count: 28191091]:
  _16 = prephitmp_36 + pretmp_40;
  *avg_45(D) = _16;
  k_18 = k_54 + 1;

  <bb 10> [local count: 95563020]:
  # k_7 = PHI <k_54(8), k_18(9)>
  # prephitmp_21 = PHI <prephitmp_36(8), _16(9)>
  i_41 = i_58 + 1;
  ivtmp.28_105 = ivtmp.28_104 + _107;
  if (i_41 == _112)
    goto <bb 11>; [11.00%]
  else
    goto <bb 8>; [89.00%]

  <bb 11> [local count: 52559662]:
  # k_57 = PHI <k_7(10), k_80(15)>
  # prephitmp_35 = PHI <prephitmp_21(10), prephitmp_82(15)>
  goto <bb 24>; [100.00%]

  <bb 12> [local count: 42047729]:
  ivtmp.21_65 = (unsigned long) a.0_42;
  _64 = (unsigned int) _5;
  _98 = _64 + 4294967295;
  _99 = (unsigned long) _98;
  _100 = _99 * 4;
  _66 = ivtmp.21_65 + 4;
  _103 = _66 + _100;

  <bb 13> [local count: 382252093]:
  # k_50 = PHI <0(12), k_80(15)>
  # prephitmp_11 = PHI <0.0(12), prephitmp_82(15)>
  # ivtmp.21_90 = PHI <ivtmp.21_65(12), ivtmp.21_19(15)>
  _85 = (void *) ivtmp.21_90;
  pretmp_24 = MEM[(real(kind=4) *)_85];
  if (pretmp_24 > 0.0)
    goto <bb 14>; [59.00%]
  else
    goto <bb 15>; [41.00%]

  <bb 14> [local count: 112764368]:
  _77 = prephitmp_11 + pretmp_24;
  *avg_45(D) = _77;
  k_79 = k_50 + 1;

  <bb 15> [local count: 382252093]:
  # k_80 = PHI <k_50(13), k_79(14)>
  # prephitmp_82 = PHI <prephitmp_11(13), _77(14)>
  ivtmp.21_19 = ivtmp.21_90 + 4;
  if (ivtmp.21_19 == _103)
    goto <bb 11>; [11.00%]
  else
    goto <bb 13>; [89.00%]

  <bb 16> [local count: 52559661]:
  if (iftmp.6_33 != 1)
    goto <bb 17>; [20.00%]
  else
    goto <bb 20>; [80.00%]

  <bb 17> [local count: 10511932]:
  _126 = iftmp.6_33 * 4;
  _127 = (unsigned long) _126;
  ivtmp.40_128 = (unsigned long) a.0_42;
  _130 = (unsigned int) _5;
  _131 = _130 + 1;
  _132 = (integer(kind=4)) _131;

  <bb 18> [local count: 95563020]:
  # i_63 = PHI <i_52(18), 1(17)>
  # prephitmp_60 = PHI <_13(18), 0.0(17)>
  # ivtmp.40_124 = PHI <ivtmp.40_125(18), ivtmp.40_128(17)>
  _129 = (void *) ivtmp.40_124;
  pretmp_75 = MEM[(real(kind=4) *)_129];
  _13 = prephitmp_60 + pretmp_75;
  i_52 = i_63 + 1;
  ivtmp.40_125 = ivtmp.40_124 + _127;
  if (i_52 == _132)
    goto <bb 19>; [11.00%]
  else
    goto <bb 18>; [89.00%]

  <bb 19> [local count: 52559662]:
  # prephitmp_20 = PHI <_13(18), _95(21)>
  goto <bb 23>; [100.00%]

  <bb 20> [local count: 42047729]:
  ivtmp.34_115 = (unsigned long) a.0_42;
  _117 = (unsigned int) _5;
  _118 = _117 + 4294967295;
  _119 = (unsigned long) _118;
  _120 = _119 * 4;
  _15 = ivtmp.34_115 + 4;
  _123 = _15 + _120;

  <bb 21> [local count: 382252093]:
  # prephitmp_87 = PHI <0.0(20), _95(21)>
  # ivtmp.34_113 = PHI <ivtmp.34_115(20), ivtmp.34_114(21)>
  _116 = (void *) ivtmp.34_113;
  pretmp_94 = MEM[(real(kind=4) *)_116];
  _95 = prephitmp_87 + pretmp_94;
  ivtmp.34_114 = ivtmp.34_113 + 4;
  if (ivtmp.34_114 == _123)
    goto <bb 19>; [11.00%]
  else
    goto <bb 21>; [89.00%]

  <bb 22> [local count: 12992276]:
  if (pretmp_76 != 0)
    goto <bb 23>; [50.00%]
  else
    goto <bb 24>; [50.00%]

  <bb 23> [local count: 59055800]:
  # prephitmp_73 = PHI <0.0(22), prephitmp_20(19)>
  _25 = (real(kind=4)) _47;
  _26 = prephitmp_73 / _25;
  goto <bb 25>; [100.00%]

  <bb 24> [local count: 59055800]:
  # k_72 = PHI <0(22), k_57(11)>
  # prephitmp_74 = PHI <0.0(22), prephitmp_35(11)>
  _28 = (real(kind=4)) k_72;
  _29 = prephitmp_74 / _28;

  <bb 25> [local count: 118111600]:
  # cstore_14 = PHI <_26(23), _29(24)>
  *avg_45(D) = cstore_14;
  return;

}

With the Intel Fortran compilers you can also check such things, e.g. with

$ ifx -O2 -qopt-report=2 -qopt-report-phase=loop -qopt-report-file=stdout average.f90
Global optimization report for : average_

LOOP BEGIN at /app/example.f90 (15, 7)
<Predicate Optimized v2>
    remark #15436: loop was not vectorized:  
    remark #25439: Loop unrolled with remainder by 8
LOOP END

LOOP BEGIN at /app/example.f90 (23, 1)
<Remainder loop>
    remark #25585: Loop converted to switch
LOOP END

LOOP BEGIN at /app/example.f90 (15, 7)
<Predicate Optimized v1>
    remark #25423: Invariant If condition at line 15 hoisted out of this loop
    remark #15344: Loop was not vectorized: vector dependence prevents vectorization
    remark #15346: vector dependence: assumed FLOW dependence between (19:17) and (19:17) 
    remark #25439: Loop unrolled with remainder by 8
LOOP END

LOOP BEGIN at /app/example.f90 (15, 7)
<Remainder loop>
LOOP END

From this report we can see there are two loops for the two branches of the predicate. Both of these loops were unrolled. It can be difficult to read these optimization reports because without some previous experience it can be difficult to comprehend what kind of transformations were performed. Maybe @greenrongreen or @whuhn can give us a hint how to correctly use the reports?

In the article I linked above from Johnny’s Software Lab, they used C++ templates to guarantee branch optimization/loop unswitching:

float avg;
bool should_include_negatives = get_should_include_negatives();
if (should_include_negatives) {
    avg = average<true>(array, len);
} else {
    avg = average<false>(array, len);
}

In Fortran you could in principle also do such a template substitution with a preprocessor but it would be tedious. Naively, I thought to myself, why not reach for an internal procedure? The compiler will see the logical literal value and eliminate the dead code branches.

subroutine average(a,include_negative,avg)
  implicit none
  real, intent(in) :: a(:)
  logical, intent(in) :: include_negative
  real, intent(out) :: avg

  if (include_negative) then
    call average_impl(a,.true.,avg)
  else
    call average_impl(a,.false.,avg)
  end if

contains
  subroutine average_impl(a,include_negative,avg)
  implicit none
  real, intent(in) :: a(:)
  logical, intent(in) :: include_negative
  real, intent(out) :: avg
  ! ...
  ! same routine as earlier
  ! ...
  end subroutine
end subroutine

It turns out that with gfortran and -O2 this didn’t inline (meaning no advantage compared to the original code). With the additional option -finline-limit=80 the procedures were inlined and the dead branches eliminated. The same happened with -O3. So in the end there was no real benefit, we just switched from one optimization (loop unswitching) to another one (inlining and dead branch elimination). But looking at the output of -fopt-info and -fopt-info-missed at different optimization levels can be useful to spot missed opportunities.

My conclusion is to reiterate what has been said before (both in the current and past threads) with a quote from Matt Godbolt:

Don’t compromise readability … Unless you can absolutely and unconditionally prove it beyond any reasonable doubt, don’t compromise readability. … Trust your compiler to take readable, clear code that you can reason about. Turns out if you can reason about it, the compiler can reason about it and it can do the right thing.

1 Like

@JohnCampbell, an if could impede vectorization of the rest of loop. In the example I posted with calculating the average, once the branch is hoisted out, the loop can be vectorized. It’s easy to see the potential if the subroutine were written like this:

subroutine average_v3(a,include_negative,avg)
  implicit none
  real, intent(in) :: a(:)
  logical, intent(in) :: include_negative
  real, intent(out) :: avg
  integer :: i, k

  if (include_negative) then
    avg = sum(a) / size(a)
  else
    avg = sum(a,a > 0) / count(a > 0)
  end if
end subroutine

I admit the average calculation is just a convoluted toy example. It should have been written from the start like this:

pure real function average(a)
  implicit none
  real, intent(in) :: a(:)
  average = sum(a) / size(a)
end function
pure real function average_only_positive(a)
  implicit none
  real, intent(in) :: a(:)
  average_only_positive = sum(a, a > 0) / count(a > 0)
end function

Probably @ffff’s problem does not lend itself to such simple fixes.

+1

Excellent points by @ivanpribec for OP to consider.

To me, this obsession with removing an IF statement from a loop to enhance vectorisation appears to be overdone in the initial example.
The example was not:

do i=1, bignumber
    if (ical) avg = avg + a(i)
end do

but effectively

do i=1, bignumber
    if (ical) call sub(a,b,c)
    ...
end do

There is, at a minimum, a lot of calculation overhead in “call sub(a,b,c)” than in a potential vectorisation.