The version of code @JohnCampbell has shown is what would be expected following loop invariant code motion. One form of such code motion is also loop unswitching.
If ical
does not change for the duration of the loop, the compiler can create two versions of the loop, one where ical
is true, and the other where it is false. For an insightful article on branches and performance I would suggest: How branches influence the performance of your code and what can you do about it?
Here’s an example I’ve copied from the article above:
subroutine average(a,include_negative,avg)
implicit none
real, intent(in) :: a(:)
logical, intent(in) :: include_negative
real, intent(out) :: avg
integer :: i, k
avg = 0
k = 0
do i = 1, size(a)
if (include_negative) then
avg = avg + a(i)
else
if (a(i) > 0) then
avg = avg + a(i)
k = k + 1
end if
end if
end do
if (include_negative) then
avg = avg / size(a)
else
avg = avg / k
end if
end subroutine
You can ask your compiler to emit an optimization report. With GCC, the option is -fopt-info
or -fopt-info-<option>
for specific optimizations. For the subroutine above:
$ gfortran -O3 -c -fopt-info-loop average.f90
average.f90:14:6: optimized: Unswitching loop on condition: if (pretmp_76 != 0)
average.f90:14:6: optimized: versioned this loop for when certain strides are 1
average.f90:14:6: optimized: versioned this loop for when certain strides are 1
You can see that unswitching was performed. At -O2
on the other hand, this optimization was not performed.
You could inspect the generated intermediate representation with the flag -fdump-tree-optimized
, however for the untrained eye it’s difficult to understand what’s going on.
Dump of optimized version
$ cat average.f90.252t.optimized
;; Function average (average_, funcdef_no=0, decl_uid=4239, cgraph_uid=1, symbol_order=0)
Removing basic block 3
Removing basic block 27
Removing basic block 28
Removing basic block 29
Removing basic block 30
Removing basic block 31
Removing basic block 32
Removing basic block 33
Removing basic block 34
Removing basic block 35
Removing basic block 36
Removing basic block 37
Removing basic block 38
__attribute__((fn spec (". r r w ")))
void average (struct array01_real(kind=4) & restrict a, logical(kind=4) & restrict include_negative, real(kind=4) & restrict avg)
{
unsigned long ivtmp.40;
unsigned long ivtmp.34;
unsigned long ivtmp.28;
unsigned long ivtmp.21;
integer(kind=4) k;
integer(kind=4) i;
real(kind=4)[0:D.4276] * restrict a.0;
integer(kind=8) ubound.0;
integer(kind=8) _1;
integer(kind=8) _2;
integer(kind=8) _3;
integer(kind=8) _5;
real(kind=4) prephitmp_11;
real(kind=4) _13;
real(kind=4) cstore_14;
unsigned long _15;
real(kind=4) _16;
real(kind=4) prephitmp_20;
real(kind=4) prephitmp_21;
real(kind=4) pretmp_24;
real(kind=4) _25;
real(kind=4) _26;
real(kind=4) _28;
real(kind=4) _29;
integer(kind=8) iftmp.6_33;
real(kind=4) prephitmp_35;
real(kind=4) prephitmp_36;
integer(kind=8) _39;
real(kind=4) pretmp_40;
integer(kind=4) _47;
real(kind=4) prephitmp_60;
unsigned int _64;
unsigned long _66;
real(kind=4) prephitmp_73;
real(kind=4) prephitmp_74;
real(kind=4) pretmp_75;
logical(kind=4) pretmp_76;
real(kind=4) _77;
real(kind=4) prephitmp_82;
void * _85;
real(kind=4) prephitmp_87;
real(kind=4) pretmp_94;
real(kind=4) _95;
unsigned int _98;
unsigned long _99;
unsigned long _100;
unsigned long _103;
integer(kind=8) _106;
unsigned long _107;
void * _109;
unsigned int _110;
unsigned int _111;
integer(kind=4) _112;
void * _116;
unsigned int _117;
unsigned int _118;
unsigned long _119;
unsigned long _120;
unsigned long _123;
integer(kind=8) _126;
unsigned long _127;
void * _129;
unsigned int _130;
unsigned int _131;
integer(kind=4) _132;
<bb 2> [local count: 118111598]:
_39 = *a_38(D).dim[0].stride;
if (_39 != 0)
goto <bb 4>; [50.00%]
else
goto <bb 3>; [50.00%]
<bb 3> [local count: 59055799]:
<bb 4> [local count: 118111598]:
# iftmp.6_33 = PHI <_39(2), 1(3)>
a.0_42 = *a_38(D).data;
_1 = *a_38(D).dim[0].ubound;
_2 = *a_38(D).dim[0].lbound;
_3 = _1 - _2;
ubound.0_43 = _3 + 1;
*avg_45(D) = 0.0;
_5 = MAX_EXPR <ubound.0_43, 0>;
_47 = (integer(kind=4)) _5;
pretmp_76 = *include_negative_48(D);
if (_47 <= 0)
goto <bb 22>; [11.00%]
else
goto <bb 5>; [89.00%]
<bb 5> [local count: 105119322]:
if (pretmp_76 != 0)
goto <bb 16>; [50.00%]
else
goto <bb 6>; [50.00%]
<bb 6> [local count: 52559661]:
if (iftmp.6_33 != 1)
goto <bb 7>; [20.00%]
else
goto <bb 12>; [80.00%]
<bb 7> [local count: 10511932]:
_106 = iftmp.6_33 * 4;
_107 = (unsigned long) _106;
ivtmp.28_108 = (unsigned long) a.0_42;
_110 = (unsigned int) _5;
_111 = _110 + 1;
_112 = (integer(kind=4)) _111;
<bb 8> [local count: 95563020]:
# i_58 = PHI <i_41(10), 1(7)>
# k_54 = PHI <k_7(10), 0(7)>
# prephitmp_36 = PHI <prephitmp_21(10), 0.0(7)>
# ivtmp.28_104 = PHI <ivtmp.28_105(10), ivtmp.28_108(7)>
_109 = (void *) ivtmp.28_104;
pretmp_40 = MEM[(real(kind=4) *)_109];
if (pretmp_40 > 0.0)
goto <bb 9>; [59.00%]
else
goto <bb 10>; [41.00%]
<bb 9> [local count: 28191091]:
_16 = prephitmp_36 + pretmp_40;
*avg_45(D) = _16;
k_18 = k_54 + 1;
<bb 10> [local count: 95563020]:
# k_7 = PHI <k_54(8), k_18(9)>
# prephitmp_21 = PHI <prephitmp_36(8), _16(9)>
i_41 = i_58 + 1;
ivtmp.28_105 = ivtmp.28_104 + _107;
if (i_41 == _112)
goto <bb 11>; [11.00%]
else
goto <bb 8>; [89.00%]
<bb 11> [local count: 52559662]:
# k_57 = PHI <k_7(10), k_80(15)>
# prephitmp_35 = PHI <prephitmp_21(10), prephitmp_82(15)>
goto <bb 24>; [100.00%]
<bb 12> [local count: 42047729]:
ivtmp.21_65 = (unsigned long) a.0_42;
_64 = (unsigned int) _5;
_98 = _64 + 4294967295;
_99 = (unsigned long) _98;
_100 = _99 * 4;
_66 = ivtmp.21_65 + 4;
_103 = _66 + _100;
<bb 13> [local count: 382252093]:
# k_50 = PHI <0(12), k_80(15)>
# prephitmp_11 = PHI <0.0(12), prephitmp_82(15)>
# ivtmp.21_90 = PHI <ivtmp.21_65(12), ivtmp.21_19(15)>
_85 = (void *) ivtmp.21_90;
pretmp_24 = MEM[(real(kind=4) *)_85];
if (pretmp_24 > 0.0)
goto <bb 14>; [59.00%]
else
goto <bb 15>; [41.00%]
<bb 14> [local count: 112764368]:
_77 = prephitmp_11 + pretmp_24;
*avg_45(D) = _77;
k_79 = k_50 + 1;
<bb 15> [local count: 382252093]:
# k_80 = PHI <k_50(13), k_79(14)>
# prephitmp_82 = PHI <prephitmp_11(13), _77(14)>
ivtmp.21_19 = ivtmp.21_90 + 4;
if (ivtmp.21_19 == _103)
goto <bb 11>; [11.00%]
else
goto <bb 13>; [89.00%]
<bb 16> [local count: 52559661]:
if (iftmp.6_33 != 1)
goto <bb 17>; [20.00%]
else
goto <bb 20>; [80.00%]
<bb 17> [local count: 10511932]:
_126 = iftmp.6_33 * 4;
_127 = (unsigned long) _126;
ivtmp.40_128 = (unsigned long) a.0_42;
_130 = (unsigned int) _5;
_131 = _130 + 1;
_132 = (integer(kind=4)) _131;
<bb 18> [local count: 95563020]:
# i_63 = PHI <i_52(18), 1(17)>
# prephitmp_60 = PHI <_13(18), 0.0(17)>
# ivtmp.40_124 = PHI <ivtmp.40_125(18), ivtmp.40_128(17)>
_129 = (void *) ivtmp.40_124;
pretmp_75 = MEM[(real(kind=4) *)_129];
_13 = prephitmp_60 + pretmp_75;
i_52 = i_63 + 1;
ivtmp.40_125 = ivtmp.40_124 + _127;
if (i_52 == _132)
goto <bb 19>; [11.00%]
else
goto <bb 18>; [89.00%]
<bb 19> [local count: 52559662]:
# prephitmp_20 = PHI <_13(18), _95(21)>
goto <bb 23>; [100.00%]
<bb 20> [local count: 42047729]:
ivtmp.34_115 = (unsigned long) a.0_42;
_117 = (unsigned int) _5;
_118 = _117 + 4294967295;
_119 = (unsigned long) _118;
_120 = _119 * 4;
_15 = ivtmp.34_115 + 4;
_123 = _15 + _120;
<bb 21> [local count: 382252093]:
# prephitmp_87 = PHI <0.0(20), _95(21)>
# ivtmp.34_113 = PHI <ivtmp.34_115(20), ivtmp.34_114(21)>
_116 = (void *) ivtmp.34_113;
pretmp_94 = MEM[(real(kind=4) *)_116];
_95 = prephitmp_87 + pretmp_94;
ivtmp.34_114 = ivtmp.34_113 + 4;
if (ivtmp.34_114 == _123)
goto <bb 19>; [11.00%]
else
goto <bb 21>; [89.00%]
<bb 22> [local count: 12992276]:
if (pretmp_76 != 0)
goto <bb 23>; [50.00%]
else
goto <bb 24>; [50.00%]
<bb 23> [local count: 59055800]:
# prephitmp_73 = PHI <0.0(22), prephitmp_20(19)>
_25 = (real(kind=4)) _47;
_26 = prephitmp_73 / _25;
goto <bb 25>; [100.00%]
<bb 24> [local count: 59055800]:
# k_72 = PHI <0(22), k_57(11)>
# prephitmp_74 = PHI <0.0(22), prephitmp_35(11)>
_28 = (real(kind=4)) k_72;
_29 = prephitmp_74 / _28;
<bb 25> [local count: 118111600]:
# cstore_14 = PHI <_26(23), _29(24)>
*avg_45(D) = cstore_14;
return;
}
With the Intel Fortran compilers you can also check such things, e.g. with
$ ifx -O2 -qopt-report=2 -qopt-report-phase=loop -qopt-report-file=stdout average.f90
Global optimization report for : average_
LOOP BEGIN at /app/example.f90 (15, 7)
<Predicate Optimized v2>
remark #15436: loop was not vectorized:
remark #25439: Loop unrolled with remainder by 8
LOOP END
LOOP BEGIN at /app/example.f90 (23, 1)
<Remainder loop>
remark #25585: Loop converted to switch
LOOP END
LOOP BEGIN at /app/example.f90 (15, 7)
<Predicate Optimized v1>
remark #25423: Invariant If condition at line 15 hoisted out of this loop
remark #15344: Loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed FLOW dependence between (19:17) and (19:17)
remark #25439: Loop unrolled with remainder by 8
LOOP END
LOOP BEGIN at /app/example.f90 (15, 7)
<Remainder loop>
LOOP END
From this report we can see there are two loops for the two branches of the predicate. Both of these loops were unrolled. It can be difficult to read these optimization reports because without some previous experience it can be difficult to comprehend what kind of transformations were performed. Maybe @greenrongreen or @whuhn can give us a hint how to correctly use the reports?