There is, and indeed it can lead more efficient code. But this applies irrespective of OpenMP, which I mentioned in my original post.
The aliasing issue can be overcome in C with the standard restrict
type qualifier (since C99). Here’s an example (Compiler Explorer), for an x86-64 processor:
void multiply_scalar(float *a, float *b) {
*b = 3.0f*(*a);
}
void multiply_stream(float *a, float *b) {
for (int i = 0; i < 4; i++)
b[i] = 3.0f*a[i];
}
void multiply_stream_noalias(float * restrict a, float * restrict b)
{
for (int i = 0; i < 4; i++)
b[i] = 3.0f*a[i];
}
multiply_scalar(float*, float*):
vmovss xmm0, DWORD PTR .LC0[rip]
vmulss xmm0, xmm0, DWORD PTR [rdi]
vmovss DWORD PTR [rsi], xmm0
ret
multiply_stream(float*, float*):
vmovss xmm1, DWORD PTR .LC0[rip]
xor eax, eax
.L4:
vmulss xmm0, xmm1, DWORD PTR [rdi+rax]
vmovss DWORD PTR [rsi+rax], xmm0
add rax, 4
cmp rax, 16
jne .L4
ret
multiply_stream_noalias(float*, float*):
vbroadcastss xmm0, DWORD PTR .LC0[rip]
vmulps xmm0, xmm0, XMMWORD PTR [rdi]
vmovups XMMWORD PTR [rsi], xmm0
ret
.LC0:
.long 1077936128
As you can see, in multiply_stream
there is a loop implementing scalar multiplication (vmulss
). Once the restrict
qualifier is added (or __restrict
extension in C++), the compiler performs 4 multiplications in parallel with the vmulps
instruction. The xmm
registers are part of the Streaming SIMD Extension (SSE). They are 128-bit wide, meaning they can fit four floats, or two doubles. When you multiply in scalar mode, you are essentially wasting three out of four SIMD lanes.
(Note, at -O3
optimization some C and C++ compilers may insert runtime checks for aliasing with separate vectorized and scalar codepaths depending on the aliasing result.)
In Fortran, because of the default aliasing rules, the vector instruction is produced directly:
subroutine multiply_stream(a,b)
real, intent(in) :: a(4)
real, intent(out) :: b(4)
do i = 1, 4
b(i) = 3.*a(i)
end do
end subroutine
multiply_stream_:
vbroadcastss xmm0, DWORD PTR .LC0[rip]
vmulps xmm0, xmm0, XMMWORD PTR [rdi]
vmovups XMMWORD PTR [rsi], xmm0
ret
.LC0:
.long 1077936128
Addendum: using Intel Intrinsics (SSE), the function to multiply four floats with a constant value can be written as,
// Multiply elements of a by 3 and store result in b
void multiply_sse(float *a, float *b) {
__m128 r = _mm_set_ps1(3.0f); // Broadcast 3.0 to all elements of r
r = _mm_mul_ps (r, _mm_loadu_ps(a)); // Multiply packed single-precision operands
// (for second operand unaligned load is used)
_mm_storeu_ps(b, r); // Store result in b
}