How about using Parallel BLAS (PBLAS), which ships as part of ScaLAPACK?
! Standard BLAS
call zgemv(trans, m, n, alpha, a, lda, &
           x, incx, &
           beta, y, incy)

! Parallel BLAS
call pzgemv(trans, m, n, alpha, a, ia, ja, desca, &
            x, ix, jx, descx, incx, &
            beta, y, iy, jy, descy, incy)
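You will need to set up the distributed arrays and their descriptors first, but otherwise your code can remain mostly the same: the call takes the same arguments as zgemv, except that each leading dimension is replaced by a global starting index (ia, ja and so on) plus an array descriptor.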
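As a rough sketch of that setup step, the program below creates a BLACS process grid, builds the descriptors with descinit, and then calls pzgemv. The global size n, the block size nb, the near-square grid shape, and the all-ones test data are illustrative assumptions and not something taken from your code; it also assumes an MPI-based ScaLAPACK/BLACS installation, and the compile and link details depend on your system.

```fortran
program pzgemv_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n  = 8   ! global matrix order (assumed for illustration)
  integer, parameter :: nb = 2   ! square block size (assumed for illustration)
  complex(dp), parameter :: one  = (1.0_dp, 0.0_dp)
  complex(dp), parameter :: zero = (0.0_dp, 0.0_dp)
  integer :: iam, nprocs, ictxt, nprow, npcol, myrow, mycol
  integer :: mloc, nloc, lld, info
  integer :: desca(9), descx(9), descy(9)
  complex(dp), allocatable :: a(:,:), x(:), y(:)
  integer, external :: numroc

  ! Pick a process grid that uses every process (a simple heuristic).
  call blacs_pinfo(iam, nprocs)
  nprow = int(sqrt(real(nprocs)))
  do while (mod(nprocs, nprow) /= 0)
     nprow = nprow - 1
  end do
  npcol = nprocs / nprow

  call blacs_get(-1, 0, ictxt)
  call blacs_gridinit(ictxt, 'Row-major', nprow, npcol)
  call blacs_gridinfo(ictxt, nprow, npcol, myrow, mycol)

  ! Local extents of the block-cyclically distributed n x n matrix.
  mloc = numroc(n, nb, myrow, 0, nprow)
  nloc = numroc(n, nb, mycol, 0, npcol)
  lld  = max(1, mloc)
  allocate(a(lld, max(1, nloc)), x(lld), y(lld))

  ! Descriptors: A is n x n; x and y are n x 1 column vectors.
  call descinit(desca, n, n, nb, nb, 0, 0, ictxt, lld, info)
  call descinit(descx, n, 1, nb, nb, 0, 0, ictxt, lld, info)
  call descinit(descy, n, 1, nb, nb, 0, 0, ictxt, lld, info)

  ! All-ones test data, so every entry of y = A*x should equal n.
  a = one
  x = one
  y = zero

  call pzgemv('N', n, n, one, a, 1, 1, desca, &
              x, 1, 1, descx, 1, &
              zero, y, 1, 1, descy, 1)

  ! The result vector lives on process column 0 of the grid.
  if (mycol == 0 .and. mloc > 0) print *, 'rank', iam, real(y(1:mloc))

  call blacs_gridexit(ictxt)
  call blacs_exit(0)
end program pzgemv_sketch
```

In a real code you would fill only the locally owned part of a and x from your own data. The key point is that each process allocates and touches just its numroc-sized local piece, while the descriptors tell PBLAS how those pieces fit together into the global matrix and vectors.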
More info: