Why is the performance poorer after using OpenMP?

The update

c(i,j,thread) = c(i,j,thread) + 0.052*lap_c(i,j,thread)

takes lap_c from

lap_c(i,j,thread) = … + c(im,j,thread) + …

but c(im,j,thread) was already updated in the previous iteration (i-1) of do i = 1,nc, and the row c(:,jm,thread) was already updated in the previous iteration of do j = 1,nc.

It may appear acceptable, but the updated values are then not all based on estimates from the previous step: when c(i,j,thread) is updated, c(im,j,thread) and c(i,jm,thread) already hold new values, while c(ip,j,thread) and c(i,jp,thread) (away from the periodic wrap) still hold old ones.

It is necessary to carry forward the unchanged value of “c(i,j,thread)” before it is overwritten, so that the next i iteration can use it as c(im,j,thread), plus the unchanged row c(:,jm,thread) from the previous j iteration, with appropriate consideration of the periodic edge conditions.

C could be declared as allocate ( c(0:nc,0:nc,threads) ), so that c(:,0,threads) can hold the previous row c(:,jm,threads) and c(0,j,threads) the previous value c(im,j,threads) at the appropriate time in the iteration; alternatively, use local work variables such as real(kind=8) :: c_im and c_jm(nc) (a sketch of that variant follows the code below).

The following code suggests a possible solution, using local work arrays c_jm and c_jn to hold the previous-step values:

  real(kind=wp) :: c_jm(nc), c_jn(nc), lap_c

        c_jm(:) = c(:,nc,thread)        ! old last row: the jm neighbour for j = 1 (periodic)
        do j = 1,nc
            jp = j+1
            if (j == nc) jp = 1
!           note: at j = nc, c(i,jp,thread) is row 1, which has already been updated
!           this sweep; save c(:,1,thread) as well if that edge must also use
!           previous-step values
            c_jn(:) = c(:,j,thread)     ! keep the unmodified values of the row being updated
            im = nc                     ! periodic im neighbour for i = 1
            do i = 1,nc
                ip = i+1
                if (i == nc) ip = 1
!old            lap_c(i,j,thread) = c(ip,j,thread) + c(im,j,thread) + c(i,jm,thread) + c(i,jp,thread) - 4.0*c(i,j,thread)
                lap_c             = c_jn(ip)       + c_jn(im)       + c_jm(i)        + c(i,jp,thread) - 4.0*c_jn(i)
                c(i,j,thread) = c(i,j,thread) + 0.052*lap_c
                im = i
            end do
            c_jm(:) = c_jn(:)           ! this row's old values become the jm neighbour for the next row
        end do
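
For the variant using just a scalar c_im and the row c_jm(nc) mentioned above, the c_jn copy can be dropped and c_jm updated element by element as each point is finished. The following is a sketch of the inner loop only, assuming the same surrounding do j loop as above with c_jm(:) = c(:,nc,thread) set before it; c_i1 is introduced here (it is not in the original) to keep the old value of c(1,j,thread) for the periodic wrap at i = nc:

  real(kind=wp) :: c_jm(nc), c_im, c_i1, c_old, lap_c

            c_im = c(nc,j,thread)       ! old value of c(nc,j): the im neighbour for i = 1
            c_i1 = c(1,j,thread)        ! old value of c(1,j): the ip neighbour for i = nc
            do i = 1,nc
                ip = i+1
                if (i == nc) ip = 1
                c_old = c(i,j,thread)   ! unmodified value of the point being updated
                if (i < nc) then
                    lap_c = c(ip,j,thread) + c_im + c_jm(i) + c(i,jp,thread) - 4.0*c_old
                else
                    lap_c = c_i1           + c_im + c_jm(i) + c(i,jp,thread) - 4.0*c_old
                end if
                c(i,j,thread) = c_old + 0.052*lap_c
                c_jm(i) = c_old         ! becomes the jm neighbour when row j+1 is processed
                c_im    = c_old         ! becomes the im neighbour for i+1
            end do

Compared with the c_jn version this saves one array copy per row, at the cost of a branch in the inner loop (which could be removed by peeling the i = nc iteration); the same j = nc caveat for c(i,jp,thread) applies.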

In the OpenMP case, it is the edge rows of each thread's sub-domain that need to be exchanged between the threads at each step, as sketched below.
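
How that exchange is organised depends on how the domain is split; the following is only an illustration, assuming each thread owns a slab c(1:nc,1:nc_local,it) of the domain, adjacent and periodic in j, with shared work arrays edge_lo(nc,nthreads) and edge_hi(nc,nthreads) and a hypothetical update_slab routine that applies the update above to one slab using the copied edge rows. None of these names come from the original code; "use omp_lib" and the integer declarations are assumed.

        !$omp parallel private(it, it_lo, it_hi, step)
            it    = omp_get_thread_num() + 1
            it_lo = it - 1 ; if (it_lo == 0)       it_lo = nthreads   ! neighbour below (periodic)
            it_hi = it + 1 ; if (it_hi > nthreads) it_hi = 1          ! neighbour above (periodic)
            do step = 1, nsteps
                !$omp barrier                           ! every slab has finished the previous step
                edge_lo(:,it) = c(:,nc_local,it_lo)     ! old last row of the lower neighbour
                edge_hi(:,it) = c(:,1,it_hi)            ! old first row of the upper neighbour
                !$omp barrier                           ! all edge copies taken before any slab is updated
                call update_slab ( c(:,:,it), edge_lo(:,it), edge_hi(:,it) )
            end do
        !$omp end parallel

The two barriers keep the copies consistent: no slab is overwritten while a neighbour is still reading its old edge row.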