TL;DR: with hybrid do concurrent + OpenMP/OpenACC parallelization, the transformations applied appear to depend heavily on the implementation (and even on the compiler flags).
I think the best option here would be the OpenMP 6.0 amendments, which allow using the target [teams] loop construct on do concurrent blocks. Quoting the Intel Fortran documentation:
OpenMP 6.0 specifications added support for specifying “loop” constructs on DO CONCURRENT. This enables you to offload DO CONCURRENT loops using OpenMP target construct on DO CONCURRENT loops. It also enables you to control the data transfers using OpenMP mapping clauses.
In this case you could also use the num_teams clause on the omp teams loop directive for program control. I’m not sure whether the Nvidia compilers support the use of loop on do concurrent loops.
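As an illustration, here is a minimal sketch of what this could look like on one of your loops (assuming a compiler that implements the OpenMP 6.0 behavior; num_teams(8) is just an arbitrary example value):
! Sketch only: needs a compiler that accepts "loop" on do concurrent
! (OpenMP 6.0). num_teams(8) is an arbitrary illustrative value, and
! the map clause shows how data transfers could be controlled.
!$omp target teams loop num_teams(8) map(tofrom: a)
do concurrent (i=1:10, j=1:15, k=1:27)
   a(i,j,k) = 0.0
end do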
Just to have an MWE, I took your loops and put them into a subroutine:
subroutine foo(a,b)
real, contiguous, intent(inout) :: a(:,:,:), b(:,:,:)
integer :: i, j, k
!$omp target nowait ! L6
do concurrent (i=1:10, j=1:15, k=1:27) ! L7
a(i,j,k) = 0.0
end do
!$omp end target
!$omp target nowait ! L12
do concurrent (i=1:10, j=4:24, k=1:2) ! L13
b(i,j,k) = 77.0
end do
!$omp end target
!$omp taskwait ! L18
end subroutine
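(If you want to actually run it rather than just compile it: the assumed-shape dummies need an explicit interface, so a hypothetical driver might look like the sketch below; it is not required for the compile-only commands that follow.)
program main
   implicit none
   interface
      subroutine foo(a, b)
         real, contiguous, intent(inout) :: a(:,:,:), b(:,:,:)
      end subroutine
   end interface
   real :: a(10,15,27), b(10,24,2) ! sized to cover the loop bounds
   call foo(a, b)
   print *, a(1,1,1), b(1,4,1)
end program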
When compiled with the command: ifx -c -O2 -fiopenmp -fopenmp-targets=spir64 -fopenmp-target-do-concurrent -qopt-report=3 -qopt-report-phase=openmp -qopt-report-stdout foo.f90, I get the following report:
Begin optimization report for: foo_
OMP TASK BEGIN at foo.f90 (6, 7)
remark #30011: task construct ignored
OMP TASK END
OMP TASK BEGIN at foo.f90 (12, 7)
remark #30011: task construct ignored
OMP TASK END
OMP TASKWAIT BEGIN at foo.f90 (18, 7)
remark #30008: taskwait construct transformed
OMP TASKWAIT END
OMP DIR.OMP.TARGET BEGIN at foo.f90 (6, 7)
remark #30012: "A" has an implicit clause "map(map(tofrom: A))" because it is a non-scalar variable referenced at line:[8:3]
remark #30020: OpenMP: PRIVATE clause item foo_$I$_1 is made work-item-local
remark #30020: OpenMP: PRIVATE clause item foo_$J$_1 is made work-item-local
remark #30020: OpenMP: PRIVATE clause item foo_$K$_1 is made work-item-local
remark #30008: target construct transformed
OMP LOOP BEGIN at foo.f90 (7, 1)
remark #30015: OpenMP: loop construct was transformed into parallel do
remark #30014: OpenMP: 3 loops collapsed
remark #30008: parallel do construct transformed
OMP LOOP END
OMP DIR.OMP.TARGET END
OMP DIR.OMP.TARGET BEGIN at foo.f90 (12, 7)
remark #30012: "B" has an implicit clause "map(map(tofrom: B))" because it is a non-scalar variable referenced at line:[14:2]
remark #30020: OpenMP: PRIVATE clause item foo_$I$_2 is made work-item-local
remark #30020: OpenMP: PRIVATE clause item foo_$J$_2 is made work-item-local
remark #30020: OpenMP: PRIVATE clause item foo_$K$_2 is made work-item-local
remark #30008: target construct transformed
OMP LOOP BEGIN at foo.f90 (13, 1)
remark #30015: OpenMP: loop construct was transformed into parallel do
remark #30014: OpenMP: 3 loops collapsed
remark #30008: parallel do construct transformed
OMP LOOP END
OMP DIR.OMP.TARGET END
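As the Intel quote above mentions, the mapping clauses let you control these transfers. Since both loops only write the arrays, one could presumably replace the implicit map(tofrom: ...) with an explicit map(from: ...) to skip the host-to-device copy; a sketch for the first loop (the array section must cover exactly what the kernel writes):
! Sketch: explicit mapping instead of the implicit tofrom transfer.
! Safe here only because the kernel writes every element copied back.
!$omp target map(from: a(1:10,1:15,1:27)) nowait
do concurrent (i=1:10, j=1:15, k=1:27)
   a(i,j,k) = 0.0
end do
!$omp end target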
With the command nvfortran -O2 -mp=gpu -stdpar=gpu -Minfo=mp,stdpar I get the following report:
6, !$omp target
6, Generating "nvkernel_foo__F1L6_2" GPU kernel
7, Generating NVIDIA GPU code
7, Loop run sequentially
Loop parallelized across CUDA thread blocks, CUDA threads(32) blockidx%x threadidx%x
7, Generating implicit copyout(a(:10,:15,:27)) [if not already present]
12, !$omp target
12, Generating "nvkernel_foo__F1L12_4" GPU kernel
18, Taskwait
For some reason it looks like the first loop is only partially parallelized (one loop level runs sequentially, and only 32 threads are used), while the second loop might not be parallelized at all (?).
If I remove the -mp=gpu flag, the report changes to:
7, Generating NVIDIA GPU code
7, ! blockidx%x threadidx%x auto-collapsed
Loop parallelized across CUDA thread blocks, CUDA threads(128) collapse(3) ! blockidx%x threadidx%x collapsed-innermost
7, Generating implicit copyout(a(:10,:15,:27)) [if not already present]
13, Generating NVIDIA GPU code
13, ! blockidx%x threadidx%x auto-collapsed
Loop parallelized across CUDA thread blocks, CUDA threads(128) collapse(3) ! blockidx%x threadidx%x collapsed-innermost
13, Generating implicit copyout(b(:10,4:24,:2)) [if not already present]
which shows that the code is now fully parallel, but it probably isn’t asynchronous anymore because the OpenMP directives are ignored.
The snippet above is also available on Compiler Explorer.
If I replace the OpenMP directives with OpenACC ones (ChatGPT drafted this):
!$acc parallel loop collapse(3) async(1) present(a)
do concurrent (i=1:10, j=1:15, k=1:27)
a(i,j,k) = 0.0
end do
!$acc parallel loop collapse(3) async(2) present(b)
do concurrent (i=1:10, j=4:24, k=1:2)
b(i,j,k) = 77.0
end do
!$acc wait
and use the flags -O2 -acc=gpu -stdpar=gpu -Minfo=acc,stdpar, the report is:
6, Generating present(a(:,:,:))
Generating NVIDIA GPU code
7, !$acc loop seq
!$acc loop vector(32) ! threadidx%x
7, Loop is parallelizable
11, Generating present(b(:,:,:))
Generating NVIDIA GPU code
12, !$acc loop seq
!$acc loop vector(32) ! threadidx%x
12, Loop is parallelizable
If I remove -stdpar=gpu, the report is:
6, Generating present(a(:,:,:))
Generating NVIDIA GPU code
7, ! blockidx%x threadidx%x collapsed
!$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
11, Generating present(b(:,:,:))
Generating NVIDIA GPU code
12, ! blockidx%x threadidx%x collapsed
!$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
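As an aside, the present(a)/present(b) clauses only assert that the arrays already live on the device, so in a complete program the data movement would have to happen in the caller; a sketch, with foo standing in for a hypothetical OpenACC variant of the routine:
!$acc enter data create(a, b)   ! allocate device copies once
call foo(a, b)                  ! the compute regions now see a and b as present
!$acc exit data copyout(a, b)   ! copy results back and free device memory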
If I change the code to use the !$acc kernels directive,
!$acc data present(a,b)
!$acc kernels async(1)
do concurrent (i=1:10, j=1:15, k=1:27)
a(i,j,k) = 0.0
end do
!$acc end kernels
!$acc kernels async(2)
do concurrent (i=1:10, j=4:24, k=1:2)
b(i,j,k) = 77.0
end do
!$acc end kernels
!$acc wait
!$acc end data
the same report is generated, with or without -stdpar=gpu:
5, Generating present(b(:,:,:),a(:,:,:))
7, Loop is parallelizable
Generating NVIDIA GPU code
7, ! blockidx%x threadidx%x auto-collapsed
!$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x collapsed-innermost
13, Loop is parallelizable
Generating NVIDIA GPU code
13, ! blockidx%x threadidx%x auto-collapsed
!$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x collapsed-innermost
Edit 2: the combination of kernels and do concurrent is governed by the following paragraph from the OpenACC 3.3 spec (page 91, section 2.17.2),
This section refers to the Fortran do concurrent construct that is a form of do construct. When do concurrent appears without a loop construct in a kernels construct it is treated as if it is annotated with loop auto. If it appears in a parallel construct or an accelerator routine then it is treated as if it is annotated with loop independent.
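So the kernels version above should behave as if each do concurrent carried an explicit loop auto, roughly like this sketch; auto leaves the parallelization decision to the compiler, which matches the “Loop is parallelizable” lines in the report:
!$acc kernels async(1)
!$acc loop auto ! implied by the spec for do concurrent inside kernels
do concurrent (i=1:10, j=1:15, k=1:27)
   a(i,j,k) = 0.0
end do
!$acc end kernels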
On a tangent, I’m a bit confused about the right form of the OpenACC clauses here. With the code:
!$acc parallel loop async present(a)
do concurrent (i=1:10, j=1:15, k=1:27)
a(i,j,k) = 0.0
end do
I get the report:
5, Generating present(a(:,:,:))
Generating NVIDIA GPU code
6, !$acc loop gang ! blockidx%x
!$acc loop vector(32) ! threadidx%x
!$acc loop seq
Adding the collapse(3) clause, as in
!$acc parallel loop collapse(3) async present(a)
do concurrent (i=1:10, j=1:15, k=1:27)
a(i,j,k) = 0.0
end do
gives the report:
5, Generating present(a(:,:,:))
Generating NVIDIA GPU code
6, ! blockidx%x threadidx%x collapsed
!$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
Edit: I believe the use of collapse here is technically forbidden by the OpenACC 3.3 spec (page 58, line 2023), which states: “A tile and collapse clause may not appear on loop that is associated with do concurrent.” It seems the nvfortran compiler accepts it anyway.
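A spec-conforming way to keep the collapsed schedule would be to fall back to ordinary nested do loops, where collapse(3) is allowed; a sketch (unlike do concurrent, the independence of the iterations is now asserted by parallel loop itself):
!$acc parallel loop collapse(3) async present(a)
do k = 1, 27      ! ordinary do loops: collapse(3) is legal here
   do j = 1, 15
      do i = 1, 10
         a(i,j,k) = 0.0
      end do
   end do
end do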

This case of using async is documented in the Nvidia HPC SDK manual (NVIDIA HPC Compilers User's Guide, 25.7 documentation):
Use of OpenACC async clause on the compute constructs can be utilised to perform computations in DC-loop asynchronously.
But there is no “Interoperability [of stdpar] with OpenMP” section in their manual, so I’m not convinced that use case is allowed/tested.
Anyway, I believe OpenMP 6.0 supports do concurrent, but compiler support is still lacking. In the last discussion of this topic on Discourse, we didn’t reach a conclusion on whether OpenMP 5.2 specified the interaction well enough: OpenMP and `do concurrent` loop = crash at runtime - #11 by PierU. It seemed to fall into the cases where you are at the mercy of your compiler vendor (i.e., likely to be unportable).