Not to mention that this (auto-)vectorizes nicely. With, say, gfortran -O3 -march=skylake-avx512 -mprefer-vector-width=512, the bulk of the work is done in this hot loop:
.L5:
vmovups zmm0, ZMMWORD PTR [r15+rax]
vfmadd213ps zmm0, zmm1, ZMMWORD PTR [rdx+rax]
vmovups ZMMWORD PTR [rdx+rax], zmm0
add rax, 64
cmp rdi, rax
jne .L5
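
For anyone not fluent in AVX-512, here is a rough scalar model in C of what one pass through that loop appears to do. It is a sketch under assumptions, not the original source: I am reading the listing as Intel syntax (destination first), guessing that zmm1 holds a loop-invariant scalar broadcast to all 16 lanes, and treating [r15+rax] and [rdx+rax] as two single-precision arrays. The names axpy_hot_loop, a, x, y and LANES are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One 512-bit zmm register holds 16 single-precision floats (64 bytes). */
enum { LANES = 16 };

/* Hypothetical scalar model of the hot loop; all names are illustrative.
 * Assumes x and y do not overlap and that n is a multiple of LANES,
 * since the listing shows only the full-vector loop, not any tail code. */
void axpy_hot_loop(size_t n, float a, const float *x, float *y)
{
    assert(n % LANES == 0);

    /* Outer loop: one iteration per 64-byte step (add rax, 64 / cmp / jne). */
    for (size_t i = 0; i < n; i += LANES) {
        /* Inner loop: the 16 lanes that a single zmm instruction handles.
         *   vmovups     zmm0, [x]        load 16 floats of x
         *   vfmadd213ps zmm0, zmm1, [y]  zmm0 = zmm1 * zmm0 + y
         *   vmovups     [y], zmm0        store the result back into y
         * The hardware FMA rounds once; a * x + y written in C may round
         * twice unless the compiler contracts it, so the last bit can differ. */
        for (size_t lane = 0; lane < LANES; ++lane) {
            y[i + lane] = a * x[i + lane] + y[i + lane];
        }
    }
}
```

So each trip through .L5 would update 16 elements with one load, one fused multiply-add that takes its third operand straight from memory, and one store, which is about as tight as this kind of update gets.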