C Is Not a Low-level Language

ivanpribec · August 13, 2023, 11:22pm

There is, and indeed it can lead more efficient code. But this applies irrespective of OpenMP, which I mentioned in my original post.

The aliasing issue can be overcome in C with the standard restrict type qualifier (since C99). Here’s an example (Compiler Explorer), for an x86-64 processor:

void multiply_scalar(float *a, float *b) {
    *b = 3.0f*(*a);
}

void multiply_stream(float *a, float *b) {
    for (int i = 0; i < 4; i++)
        b[i] = 3.0f*a[i];
}

void multiply_stream_noalias(float * restrict a, float * restrict b) 
{
    for (int i = 0; i < 4; i++)
        b[i] = 3.0f*a[i];
}

multiply_scalar(float*, float*):
        vmovss  xmm0, DWORD PTR .LC0[rip]
        vmulss  xmm0, xmm0, DWORD PTR [rdi]
        vmovss  DWORD PTR [rsi], xmm0
        ret
multiply_stream(float*, float*):
        vmovss  xmm1, DWORD PTR .LC0[rip]
        xor     eax, eax
.L4:
        vmulss  xmm0, xmm1, DWORD PTR [rdi+rax]
        vmovss  DWORD PTR [rsi+rax], xmm0
        add     rax, 4
        cmp     rax, 16
        jne     .L4
        ret
multiply_stream_noalias(float*, float*):
        vbroadcastss    xmm0, DWORD PTR .LC0[rip]
        vmulps  xmm0, xmm0, XMMWORD PTR [rdi] 
        vmovups XMMWORD PTR [rsi], xmm0
        ret
.LC0:
        .long   1077936128

As you can see, in multiply_stream there is a loop implementing scalar multiplication (vmulss). Once the restrict qualifier is added (or __restrict extension in C++), the compiler performs 4 multiplications in parallel with the vmulps instruction. The xmm registers are part of the Streaming SIMD Extension (SSE). They are 128-bit wide, meaning they can fit four floats, or two doubles. When you multiply in scalar mode, you are essentially wasting three out of four SIMD lanes.

(Note, at -O3 optimization some C and C++ compilers may insert runtime checks for aliasing with separate vectorized and scalar codepaths depending on the aliasing result.)

In Fortran, because of the default aliasing rules, the vector instruction is produced directly:

subroutine multiply_stream(a,b)
   real, intent(in) :: a(4)
   real, intent(out) :: b(4)
   do i = 1, 4
    b(i) = 3.*a(i)
   end do
end subroutine

multiply_stream_:
        vbroadcastss    xmm0, DWORD PTR .LC0[rip]
        vmulps  xmm0, xmm0, XMMWORD PTR [rdi]
        vmovups XMMWORD PTR [rsi], xmm0
        ret
.LC0:
        .long   1077936128

Addendum: using Intel Intrinsics (SSE), the function to multiply four floats with a constant value can be written as,

// Multiply elements of a by 3 and store result in b
void multiply_sse(float *a, float *b) {
    __m128 r = _mm_set_ps1(3.0f);        // Broadcast 3.0 to all elements of r
    r = _mm_mul_ps (r, _mm_loadu_ps(a)); // Multiply packed single-precision operands
                                         // (for second operand unaligned load is used)
    _mm_storeu_ps(b, r);                 // Store result in b
}

Topic		Replies	Views
Is Fortran or C++ code better optimized? Advocacy	8	1543	February 27, 2024
FORTRAN is faster than C++	4	3107	January 20, 2023
Discussion of Fortran on Hacker News	29	2723	January 6, 2024
New Fortran programmer saying hi (and some thoughts on advocacy)	18	1066	June 2, 2024
What is the superiority of Fortran over alternative languages like Chapel or Julia?	49	9483	August 13, 2021

C Is Not a Low-level Language

Related topics