Do concurrent: not seeing any speedup

I am trying three different Poisson solvers:

  1. Do loops:
program poisson_solver
  implicit none
  integer, parameter :: nx=512, ny=512, max_iter=10000
  real, parameter :: tol=1.0e-6, dx=1.0/(nx-1), dy=1.0/(ny-1)
  real :: phi_old(nx,ny), phi_new(nx,ny), residual(nx,ny)
  real :: diff, maxdiff
  integer :: i, j, iter
  real :: start_time, end_time

  ! Initialize with random guess
  call random_seed()
  call random_number(phi_old)
  phi_new = phi_old

  ! Apply Dirichlet BCs: zero on edges
  phi_old(1,:) = 0.0; phi_old(nx,:) = 0.0
  phi_old(:,1) = 0.0; phi_old(:,ny) = 0.0
  phi_new(1,:) = 0.0; phi_new(nx,:) = 0.0
  phi_new(:,1) = 0.0; phi_new(:,ny) = 0.0

  print *, "Start solving..."

  ! Start timer
  call cpu_time(start_time)

  ! Jacobi Iteration
  do iter = 1, max_iter
    maxdiff = 0.0

    do j = 2, ny - 1
      do i = 2, nx - 1
        phi_new(i,j) = 0.25 * (phi_old(i+1,j) + phi_old(i-1,j) + phi_old(i,j+1) + phi_old(i,j-1))
      end do
    end do

    do j = 2, ny - 1
      do i = 2, nx - 1
        residual(i,j) = 0.25*(phi_new(i+1,j) + phi_new(i-1,j) + phi_new(i,j+1) + phi_new(i,j-1)) - phi_new(i,j)
      end do
    end do
    maxdiff = maxval(abs(residual(2:nx-1,2:ny-1)))
    phi_old = phi_new

    if (mod(iter,100)==0) print *, 'Iter:', iter, ' Maxdiff:', maxdiff
    if (maxdiff < tol) exit
  end do

  ! End timer
  call cpu_time(end_time)

  print *, 'Converged after', iter, 'iterations with maxdiff =', maxdiff
  print *, 'Time taken (seconds):', end_time - start_time

end program poisson_solver
  2. Do concurrent:
program poisson_solver
## same as before

    do concurrent (i=2:nx-1, j=2:ny-1)
      phi_new(i,j) = 0.25 * (phi_old(i+1,j) + phi_old(i-1,j) + phi_old(i,j+1) + phi_old(i,j-1))
    end do
    do concurrent (i=2:nx-1, j=2:ny-1)
      residual(i,j)=0.25*(phi_new(i+1,j) + phi_new(i-1,j) + phi_new(i,j+1) + phi_new(i,j-1))-phi_new(i,j)
    end do
## same as before
end program poisson_solver
  3. Do with OpenMP:
program poisson_solver
  use omp_lib
## same as before
    !$omp parallel do private(i,j) shared(phi_old, phi_new)
    do j = 2, ny - 1
      do i = 2, nx - 1
        phi_new(i,j) = 0.25 * (phi_old(i+1,j) + phi_old(i-1,j) + phi_old(i,j+1) + phi_old(i,j-1))
      end do
    end do
    !$omp end parallel do

    !$omp parallel do private(i,j) shared(phi_new, residual)
    do j = 2, ny - 1
      do i = 2, nx - 1
        residual(i,j) = 0.25*(phi_new(i+1,j) + phi_new(i-1,j) + phi_new(i,j+1) + phi_new(i,j-1)) - phi_new(i,j)
      end do
    end do
    !$omp end parallel do
## same as before

Times using ifx: ifx -qopenmp -o poisson_solver do_omp.f90

  1. Do: 2.570228 s
  2. Do concurrent: 69.89281 s (I don't think the time is being measured correctly here)
  3. OMP: 1.08 s

Using gfortran: gfortran -O3 -fopenmp -o poisson_solver do.f90 && ./poisson_solver

  1. Do: 1.96368110 s
  2. Do concurrent: 2.00398302 s
  3. OMP: 0.87 s

Using flang (amd): flang -O3 -fopenmp -o poisson_solver do.f90 && ./poisson_solver

  1. Do: 1.97 s,
  2. Do concurrent: 1.96 s,
  3. Do openmp: 0.96 s

I would have expected do concurrent to be as fast as openmp.
What am I doing wrong here?
Caution: code was partly generated using genAI

1 Like

Can you try timing with system_clock or omp_get_wtime instead of cpu_time?
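For what it's worth, cpu_time typically reports CPU time summed over all threads, so a multithreaded run can look far slower than it really is (which may explain the 69 s figure above). A minimal wall-clock sketch with system_clock, with a placeholder loop standing in for the solver:

program timing_example
  implicit none
  integer(8) :: t0, t1, rate
  integer :: i
  real :: x

  call system_clock(t0, count_rate=rate)

  ! Placeholder work; put the solver here
  x = 0.0
  do i = 1, 100000000
    x = x + 1.0e-8
  end do

  call system_clock(t1)
  print *, 'Result:', x
  print *, 'Wall time (s):', real(t1 - t0) / real(rate)
  ! With OpenMP enabled, omp_get_wtime() from omp_lib returns wall time directly.
end program timing_example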

For gfortran you will need an additional flag to generate multithreaded do concurrent. I’m not sure if flang has something similar (yet). You also may want to check the optimization reports.

Some details can be found here: DO CONCURRENT: compiler flags to enable parallelization - #6 by ivanpribec

1 Like

So I tried -ftree-parallelize-loops=n, and for n=1 I get the same speed as the serial code. For n>1 it slows down.

What if you add the locality specifiers with do concurrent? (You’ll need gfortran 15 for that, ifx already supports them)
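For reference, the first update loop with locality specifiers might look like this (a sketch using the arrays from the program above):

    do concurrent (j=2:ny-1, i=2:nx-1) default(none) shared(phi_old, phi_new)
      phi_new(i,j) = 0.25 * (phi_old(i+1,j) + phi_old(i-1,j) + phi_old(i,j+1) + phi_old(i,j-1))
    end do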

1 Like

It might be worth trying to rotate the index order to do concurrent (j=2:ny-1, i=2:nx-1).

4 Likes

So I think the problem is my PC at this point. I fixed some bugs in the OMP code and it's slow as well. I will try this on our institute's HPC system and report back.
I am actually doing j and then i now.

1 Like

Not sure if it's your PC. I had the same problem a while ago: do concurrent was slower than the serial loop. OpenMP works much better. Maybe some compilers have yet to implement a good do concurrent.

The do concurrent implementation is pretty much the same as OpenMP. The scope of variables should be defined the same way as in OpenMP; the difference is that do concurrent uses the LOCAL keyword while OpenMP uses PRIVATE.
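For example, with a hypothetical temporary tmp (declared as a real in the host scope), the two forms would look like this sketch:

    ! do concurrent: each iteration gets its own copy of tmp via LOCAL
    do concurrent (j=2:ny-1, i=2:nx-1) local(tmp) shared(phi_old, phi_new)
      tmp = 0.25 * (phi_old(i+1,j) + phi_old(i-1,j) + phi_old(i,j+1) + phi_old(i,j-1))
      phi_new(i,j) = tmp
    end do

    ! OpenMP: the same loop spelled with PRIVATE
    !$omp parallel do private(i, j, tmp) shared(phi_old, phi_new)
    do j = 2, ny - 1
      do i = 2, nx - 1
        tmp = 0.25 * (phi_old(i+1,j) + phi_old(i-1,j) + phi_old(i,j+1) + phi_old(i,j-1))
        phi_new(i,j) = tmp
      end do
    end do
    !$omp end parallel do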

1 Like

gfortran 15 does not seem to show any speedup. It does not report any locality-specifier error, though!

GNU Fortran 15.1 has been released - Compilers / GNU - Fortran Discourse

A few months ago I checked ifx on Windows. Do concurrent was not showing a speedup on Windows 10. So it seems that only ifx on Linux parallelizes do concurrent.

1 Like

A tangent (to the matter at hand) but it might interest you:

In OpenMP, this structure

do iter = 1, max_iter
    !$omp parallel do private(i,j) shared(phi_old, phi_new)
    do j = 2, ny - 1
      do i = 2, nx - 1
        phi_new(i,j) = 0.25 * (phi_old(i+1,j) + phi_old(i-1,j) + phi_old(i,j+1) + phi_old(i,j-1))
      end do
    end do
    !$omp end parallel do

    !$omp parallel do private(i,j) shared(phi_new, residual)
    do j = 2, ny - 1
      do i = 2, nx - 1
        residual(i,j) = 0.25*(phi_new(i+1,j) + phi_new(i-1,j) + phi_new(i,j+1) + phi_new(i,j-1)) - phi_new(i,j)
      end do
    end do
    !$omp end parallel do
...

has a non-trivial overhead from the creation, work assignment, and destruction of the parallel threads, and that's on every iter.

The solution is to extend the parallel region outside the iter loop, always paying attention to shared and private variables.

That way, you only create and destroy the parallel region once.

The parallel region is just !$omp parallel private() shared(), and then you can use the threads by !$omp do.

One thing that changes is that now you need an explicit barrier between the two nested inner loops.

1 Like

Hi. Thanks! This makes a lot of sense! Do you have an example code for this?

1 Like

Tried the flags. No progress. I converted the code to MPI and OpenMP and both are incredibly fast.

I strongly concur… This is very likely the reason why the do concurrent is so slow in the first place.

This is at least dependent on the implementation. AFAIK, smart OpenMP implementations do not destroy/recreate the thread pool between two parallel regions.

No, you don’t. !$OMP END DO is also an implicit barrier.
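Putting those pieces together, a sketch of the hoisted parallel region for the original program (assuming the variables from the OP's code; the serial bookkeeping sits in an !$omp single block, and the implicit barriers at !$omp end do and !$omp end single keep the sweeps in order):

  !$omp parallel private(i, j)
  do iter = 1, max_iter        ! every thread runs the outer loop redundantly
                               ! (iter is private here, so don't rely on it after the region)
    !$omp do
    do j = 2, ny - 1
      do i = 2, nx - 1
        phi_new(i,j) = 0.25 * (phi_old(i+1,j) + phi_old(i-1,j) + phi_old(i,j+1) + phi_old(i,j-1))
      end do
    end do
    !$omp end do               ! implicit barrier

    !$omp do
    do j = 2, ny - 1
      do i = 2, nx - 1
        residual(i,j) = 0.25*(phi_new(i+1,j) + phi_new(i-1,j) + phi_new(i,j+1) + phi_new(i,j-1)) - phi_new(i,j)
      end do
    end do
    !$omp end do               ! implicit barrier

    !$omp single               ! one thread does the serial bookkeeping
    maxdiff = maxval(abs(residual(2:nx-1,2:ny-1)))
    phi_old = phi_new
    if (mod(iter,100)==0) print *, 'Iter:', iter, ' Maxdiff:', maxdiff
    !$omp end single           ! implicit barrier; every thread sees the updated maxdiff

    if (maxdiff < tol) exit    ! shared maxdiff, so all threads exit on the same iteration
  end do
  !$omp end parallel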

Assuming your problem is big enough, the Jacobi kernel tends to become memory bound. This is because the kernel has a very low arithmetic intensity. For instance, in the first loop you read 4 * 4 bytes and write 4 bytes while performing just 4 floating-point operations, so the intensity is only 4 / 20 = 0.2 FLOP / byte.

It would be nicer to have something more accurate than “incredibly fast”, which is subjective. It is worthwhile to compare your performance with the peak values of your machine (the so-called “roofline”). For this set of kernels we can look at the effective bandwidth:

    !  read: (nx*ny)*4
    ! write: (nx-2)*(ny-2)*4
    do j = 2, ny - 1
      do i = 2, nx - 1
        phi_new(i,j) = 0.25 * (phi_old(i+1,j) + phi_old(i-1,j) + phi_old(i,j+1) + phi_old(i,j-1))
      end do
    end do

    !  read: (nx*ny)*4
    ! write: (nx-2)*(ny-2)*4
    do j = 2, ny - 1
      do i = 2, nx - 1
        residual(i,j) = 0.25*(phi_new(i+1,j) + phi_new(i-1,j) + phi_new(i,j+1) + phi_new(i,j-1)) - phi_new(i,j)
      end do
    end do

    !  read: (nx-2)*(ny-2)*4
    maxdiff = maxval(abs(residual(2:nx-1,2:ny-1)))

    !  read: (nx*ny)*4
    ! write: (nx*ny)*4
    phi_old = phi_new

You can sum these values and divide by the time per outer loop iteration to give the effective bandwidth:

  print *, 'Effective BW (GB/s): ', (4*(nx*ny)*4 + 3*(nx-2)*(ny-2)*4) / (elapsed/iter) / (1024.0)**3

What I get in the serial case (Apple M2 Pro):

 Converged after       10001 iterations with maxdiff =   4.91848588E-03
 Time taken (seconds):   4.10800219
 Effective BW (GB/s):    16.5865917

This CPU has LPDDR5-6400 memory, and has a peak bandwidth of ~200 GB/s. So there is still some potential for improvement.

Luckily, the analysis also suggests easy ways of improving the performance by eliminating redundant memory operations. For example you can get rid of the residual array entirely, and just do the maxval reduction in the loop:

    maxdiff = 0.0
    do j = 2, ny - 1
      do i = 2, nx - 1
        tmp = 0.25*(phi_new(i+1,j) + phi_new(i-1,j) + phi_new(i,j+1) + phi_new(i,j-1)) - phi_new(i,j)
        maxdiff = max(maxdiff,abs(tmp))
        !residual(i,j) = 0.25*(phi_new(i+1,j) + phi_new(i-1,j) + phi_new(i,j+1) + phi_new(i,j-1)) - phi_new(i,j)
      end do
    end do

The next idea would be to get rid of the assignment,

phi_old = phi_new

Instead you can use the simple flip-flop technique (alternative options involve pointers, or move_alloc):

  flip = 0

  ! Jacobi Iteration
  do iter = 1, max_iter

    select case(flip)
    case(0)
      call jacobi(nx,ny,phi_old,phi_new,maxdiff)
    case(1)
      call jacobi(nx,ny,phi_new,phi_old,maxdiff)
    end select
    flip = 1 - flip

    if (mod(iter,100)==0) print *, 'Iter:', iter, ' Maxdiff:', maxdiff
    if (maxdiff < tol) exit
    
  end do
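For completeness, the jacobi routine called above could look like this (a sketch; the interface is only implied by the calls, and the maxval reduction is folded into the loop as in the earlier snippet):

subroutine jacobi(nx, ny, phi_in, phi_out, maxdiff)
  integer, intent(in)    :: nx, ny
  real,    intent(in)    :: phi_in(nx,ny)
  real,    intent(inout) :: phi_out(nx,ny)
  real,    intent(out)   :: maxdiff
  integer :: i, j
  real :: tmp

  ! One Jacobi sweep: update the interior of phi_out from phi_in
  do j = 2, ny - 1
    do i = 2, nx - 1
      phi_out(i,j) = 0.25 * (phi_in(i+1,j) + phi_in(i-1,j) + phi_in(i,j+1) + phi_in(i,j-1))
    end do
  end do

  ! Residual of the updated field, reduced on the fly (no residual array)
  maxdiff = 0.0
  do j = 2, ny - 1
    do i = 2, nx - 1
      tmp = 0.25*(phi_out(i+1,j) + phi_out(i-1,j) + phi_out(i,j+1) + phi_out(i,j-1)) - phi_out(i,j)
      maxdiff = max(maxdiff, abs(tmp))
    end do
  end do
end subroutine jacobi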

With these two changes I get,

 Converged after       10001 iterations with maxdiff =   4.64811921E-03
 Time taken (seconds):   1.61683095

The BW metric has to be updated, however, to reflect the actual volume of data read and written.

💡 Another idea you can try is fusing the update loop and the maxdiff calculation.
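One way to do that (a sketch; it uses the fact that the Jacobi residual at phi_in(i,j) is exactly phi_out(i,j) - phi_in(i,j), so the reported maxdiff lags the current field by one sweep):

    maxdiff = 0.0
    do j = 2, ny - 1
      do i = 2, nx - 1
        tmp = 0.25 * (phi_in(i+1,j) + phi_in(i-1,j) + phi_in(i,j+1) + phi_in(i,j-1))
        maxdiff = max(maxdiff, abs(tmp - phi_in(i,j)))   ! residual of the previous field
        phi_out(i,j) = tmp
      end do
    end do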

2 Likes

This is a personal view: you should probably not expect a DO CONCURRENT to be enough to manage parallel execution. I think it makes more sense to see it as human-to-human annotation. Consider it a way for one programmer to signal to another that a piece of code can be re-written with the OpenMP extensions. The OpenMP folks have signalled that they would be horrified if DO CONCURRENT and OpenMP both tried to manage the execution resources of the processor. Compiler writers have clearly put a lot more effort in OpenMP than DO CONCURRENT implementations. Users have invested heavily in OpenMP. This feedback situation is not likely to change soon.

8 Likes

Some compilers will try to convert DO CONCURRENT to OpenMP, but you have to ask. For flang, the switch is -fdo-concurrent-to-openmp=host. For ifx you still need to pass -fopenmp to get DO CONCURRENT to parallelize, see https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2025-1/do-concurrent.html

2 Likes

That construct is designed so that the index order is irrelevant. If a compiler distinguishes between the two orders, then I would say that is a shortcoming, maybe even a major one, of the compiler. One expects the compiler to make some reasonable effort to order the operations to optimize memory references, vector lengths, and so on, and that chosen order should be the same regardless of the index order specified in the construct.

Regarding DO vs. DO CONCURRENT, the left hand side of the assignment is obviously independent of anything on the right hand side, so the compiler should know that even the DO loops can be executed in any order, including concurrently, and in a single thread or in multiple threads. Maybe some minimal level of optimization might need to be specified to get the optimal ordering and the optimal instructions, but speaking as someone who used f77 compilers in the 1980s and achieved optimal performance on the hardware that was available then, I would be disappointed if a fortran compiler could not produce at least near-optimal single-thread code for those do loops. As discussed in MFE, when the programmer uses DO CONCURRENT, the compiler might make some further optimistic assumptions about vector lengths and so on, but even the DO loop code should result in good performance for simple loops like this.

One possible pitfall of adding DO CONCURRENT to the language is that compilers might stop trying to optimize the legacy DO loop constructs. We, as users of the compilers, should try to avoid that from happening. That concern also extends to the use of GPUs, where most of the focus seems to be directed toward DO CONCURRENT.

1 Like

In practice this is not the case.

According to this Intel blogpost on Auto-Vectorization of DO CONCURRENT with Multiple Indices:

The Fortran standard allows the iterations of a DO CONCURRENT construct to be executed in any order and the index variables and associated ranges may be specified in any order. However, in the Intel ® Fortran implementation, the order of the index variables matters [emphasis added] and the DO CONCURRENT is interpreted as nested loops in the order specified by the initial statement from the outermost to innermost.


For the GCC implementation, old discussions suggest the construct was simply translated to regular do loops, but marked with an ivdep directive. I think it is still done that way in the single-threaded case (I happened to submit a bug report indirectly related to this).

I take it that optimal means reaching a machine bottleneck. The reality is that compilers regularly fail to do so already on 2-D stencils like the one in this thread, not to mention in more complicated cases. This is one of the first cases studied in Hager & Wellein’s node-level performance engineering course, specifically the roofline study of the Jacobi method.

Personally, I think Levesque’s observations remain valid to this day,

Often, compilers are inhibited from doing the best optimization because they do not know enough about the code. The solution is to involve the programmer in the analysis. Even a programmer does not know all of the answers needed for optimization; but if assisted by run-time statistics, he or she can usually provide the necessary information.
The authors believe that it will be a significant time before a compiler can automatically optimize the dusty-deck Fortran program well. So, for today, the right approach is for the programmer to become involved in the analysis, aided by run-time statistics.

Source: A Guidebook to Fortran on Supercomputers, pg. 92

2 Likes

Would the solution be to use a combination of OpenMP and do concurrent, i.e. use the OMP features for CPUs and do concurrent for NVIDIA GPUs? I would ideally like code which works on all platforms.

I find the comparison to be similar for do concurrent and OpenMP do loops. Here is my understanding, along with testing results.

First I replaced call cpu_time with call system_clock, then made two programs: one for dc and the other for OpenMP.

  1. Do Concurrent (dc) code
!   Concurrent do loop program
!
!   windows 10, intel classic compiler
!   =======================================
!        
!   ifort main_dc.f90 /Qopenmp /F5000000 /Qopt-report-phase:openmp

program poisson_solver
  implicit none
  integer, parameter :: nx=512, ny=512, max_iter=10000
  real, parameter :: tol=1.0e-6, dx=1.0/(nx-1), dy=1.0/(ny-1)
  real :: phi_old(nx,ny), phi_new(nx,ny), residual(nx,ny)
  real :: maxdiff
  integer :: i, j, iter
  integer(8) :: rate, start_time, end_time 
  
  ! Initialize with random guess
  call random_seed()
  call random_number(phi_old)
  phi_new = phi_old
  
  ! Apply Dirichlet BCs: zero on edges
  phi_old(1,:) = 0.0; phi_old(nx,:) = 0.0
  phi_old(:,1) = 0.0; phi_old(:,ny) = 0.0
  phi_new(1,:) = 0.0; phi_new(nx,:) = 0.0
  phi_new(:,1) = 0.0; phi_new(:,ny) = 0.0
  
  print *, "Start solving..."
  
  ! Start timer
  call system_clock (count=start_time , count_rate=rate)
  
  ! Jacobi Iteration
  time_loop:do iter = 1, max_iter
    maxdiff = 0.0
    
    do concurrent (j=2:ny-1,i=2:nx-1) default (none) shared(phi_old, phi_new)
      phi_new(i,j) = 0.25 * (phi_old(i+1,j) + phi_old(i-1,j) + phi_old(i,j+1) + phi_old(i,j-1))
    end do
    
    do concurrent (j=2:ny-1,i=2:nx-1) default (none) shared(phi_new, residual)
      residual(i,j) = 0.25*(phi_new(i+1,j) + phi_new(i-1,j) + phi_new(i,j+1) + phi_new(i,j-1)) - phi_new(i,j)
    end do
    
    maxdiff = maxval(abs(residual(2:nx-1,2:ny-1)))
    phi_old = phi_new
    
    if (mod(iter,1000)==0) print *, 'Iter:', iter, ' Maxdiff:', maxdiff
    if (maxdiff < tol) exit
    
  end do time_loop
  
  ! End timer
  call system_clock (count=end_time)
  
  print *, 'Converged after', iter, 'iterations with maxdiff =', maxdiff
  print *, 'Time taken (seconds):', real(max(end_time - start_time , 1_8 )) /real(rate),' seconds'

end program poisson_solver
  2. OpenMP code is
!   OpenMP do loop program
!
!   windows 10, intel classic compiler
!   =======================================
!        
!   ifort main_omp.f90 /Qopenmp /F5000000 /Qopt-report-phase:openmp

program poisson_solver
  implicit none
  integer, parameter :: nx=512, ny=512, max_iter=10000
  real, parameter :: tol=1.0e-6, dx=1.0/(nx-1), dy=1.0/(ny-1)
  real :: phi_old(nx,ny), phi_new(nx,ny), residual(nx,ny)
  real :: maxdiff
  integer :: i, j, iter
  integer(8) :: rate, start_time, end_time 
  
  ! Initialize with random guess
  call random_seed()
  call random_number(phi_old)
  phi_new = phi_old
  
  ! Apply Dirichlet BCs: zero on edges
  phi_old(1,:) = 0.0; phi_old(nx,:) = 0.0
  phi_old(:,1) = 0.0; phi_old(:,ny) = 0.0
  phi_new(1,:) = 0.0; phi_new(nx,:) = 0.0
  phi_new(:,1) = 0.0; phi_new(:,ny) = 0.0
  
  print *, "Start solving..."
  
  ! Start timer
  call system_clock (count=start_time , count_rate=rate)
  
  ! Jacobi Iteration
  time_loop:do iter = 1, max_iter
    maxdiff = 0.0
    
    !$omp parallel do private(i,j) shared(phi_old, phi_new)
    do j = 2, ny - 1
      do i = 2, nx - 1
        phi_new(i,j) = 0.25 * (phi_old(i+1,j) + phi_old(i-1,j) + phi_old(i,j+1) + phi_old(i,j-1))
      end do
    end do
    !$omp end parallel do
    
    !$omp parallel do private(i,j) shared(phi_new, residual)
    do j = 2, ny - 1
      do i = 2, nx - 1
        residual(i,j) = 0.25*(phi_new(i+1,j) + phi_new(i-1,j) + phi_new(i,j+1) + phi_new(i,j-1)) - phi_new(i,j)
      end do
    end do
    !$omp end parallel do
    
    maxdiff = maxval(abs(residual(2:nx-1,2:ny-1)))
    phi_old = phi_new
    
    if (mod(iter,1000)==0) print *, 'Iter:', iter, ' Maxdiff:', maxdiff
    if (maxdiff < tol) exit
    
  end do time_loop
  
  ! End timer
  call system_clock (count=end_time)
  
  print *, 'Converged after', iter, 'iterations with maxdiff =', maxdiff
  print *, 'Time taken (seconds):', real(max(end_time - start_time , 1_8 )) /real(rate),' seconds'

end program poisson_solver

Testing

Concurrent do loop

 D:\testing_gfortran15Version\comparison>ifort main_dc.f90 /Qopenmp /F5000000 /Qopt-report-phase:openmp
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.13.0 Build 20240602_000000
Copyright (C) 1985-2024 Intel Corporation.  All rights reserved.

ifort: remark #10448: Intel(R) Fortran Compiler Classic (ifort) is now deprecated and will be discontinued late 2024. Intel recommends that customers transition now to using the LLVM-based Intel(R) Fortran Compiler (ifx) for continued Windows* and Linux* support, new language support, new language features, and optimizations. Use '/Qdiag-disable:10448' to disable this message.
Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:main_dc.exe
-subsystem:console
-stack:5000000
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
main_dc.obj

D:\testing_gfortran15Version\comparison>main_dc
 Start solving...
 Iter:        1000  Maxdiff:  2.9246032E-02
 Iter:        2000  Maxdiff:  1.8029779E-02
 Iter:        3000  Maxdiff:  1.4946640E-02
 Iter:        4000  Maxdiff:  1.2880355E-02
 Iter:        5000  Maxdiff:  1.1320204E-02
 Iter:        6000  Maxdiff:  1.0075003E-02
 Iter:        7000  Maxdiff:  9.0442300E-03
 Iter:        8000  Maxdiff:  8.1699193E-03
 Iter:        9000  Maxdiff:  7.4144900E-03
 Iter:       10000  Maxdiff:  6.7545176E-03
 Converged after       10001 iterations with maxdiff =  6.7545176E-03
 Time taken (seconds):   2.525000      seconds

OpenMP do loop

D:\testing_gfortran15Version\comparison>ifort main_omp.f90 /Qopenmp /F5000000 /Qopt-report-phase:openmp
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.13.0 Build 20240602_000000
Copyright (C) 1985-2024 Intel Corporation.  All rights reserved.

ifort: remark #10448: Intel(R) Fortran Compiler Classic (ifort) is now deprecated and will be discontinued late 2024. Intel recommends that customers transition now to using the LLVM-based Intel(R) Fortran Compiler (ifx) for continued Windows* and Linux* support, new language support, new language features, and optimizations. Use '/Qdiag-disable:10448' to disable this message.
Microsoft (R) Incremental Linker Version 14.32.31332.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:main_omp.exe
-subsystem:console
-stack:5000000
-defaultlib:libiomp5md.lib
-nodefaultlib:vcomp.lib
-nodefaultlib:vcompd.lib
main_omp.obj

D:\testing_gfortran15Version\comparison>main_omp
 Start solving...
 Iter:        1000  Maxdiff:  1.7876625E-02
 Iter:        2000  Maxdiff:  1.2360156E-02
 Iter:        3000  Maxdiff:  9.5547438E-03
 Iter:        4000  Maxdiff:  7.8075826E-03
 Iter:        5000  Maxdiff:  6.8114996E-03
 Iter:        6000  Maxdiff:  6.3546896E-03
 Iter:        7000  Maxdiff:  6.0142875E-03
 Iter:        8000  Maxdiff:  5.7539642E-03
 Iter:        9000  Maxdiff:  5.5487156E-03
 Iter:       10000  Maxdiff:  5.3788126E-03
 Converged after       10001 iterations with maxdiff =  5.3788126E-03
 Time taken (seconds):   2.260000      seconds

Compiler reports

Concurrent do loop

Intel(R) Advisor can now assist with vectorization and show optimization
  report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.


Begin optimization report for: POISSON_SOLVER

    Report from: OpenMP optimizations [openmp]

OpenMP Construct at D:\testing_gfortran15Version\comparison\main_dc.f90(37,5)
remark #16200: OpenMP DEFINED LOOP WAS PARALLELIZED
OpenMP Construct at D:\testing_gfortran15Version\comparison\main_dc.f90(37,5)
remark #16200: OpenMP DEFINED LOOP WAS PARALLELIZED
OpenMP Construct at D:\testing_gfortran15Version\comparison\main_dc.f90(41,5)
remark #16200: OpenMP DEFINED LOOP WAS PARALLELIZED
OpenMP Construct at D:\testing_gfortran15Version\comparison\main_dc.f90(41,5)
remark #16200: OpenMP DEFINED LOOP WAS PARALLELIZED
===========================================================================

OpenMP do loop

Intel(R) Advisor can now assist with vectorization and show optimization
  report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.


Begin optimization report for: POISSON_SOLVER

    Report from: OpenMP optimizations [openmp]

OpenMP Construct at D:\testing_gfortran15Version\comparison\main_omp.f90(37,11)
remark #16200: OpenMP DEFINED LOOP WAS PARALLELIZED
OpenMP Construct at D:\testing_gfortran15Version\comparison\main_omp.f90(45,11)
remark #16200: OpenMP DEFINED LOOP WAS PARALLELIZED
===========================================================================
1 Like