Clarification on DO CONCURRENT

I’ve encountered a behavior of DO CONCURRENT when using the Intel compiler that I find unexpected. There is already a long thread on the Intel board (https://community.intel.com/t5/Intel-Fortran-Compiler/do-concurrent-broken-with-openmp/m-p/1304354) without a conclusion so far. I summarize here my current understanding.

The issue can be demonstrated with the help of the following example.

program test_do_concurrent

  implicit none

  print*, b([1,2])


  contains

  function b(a)

    integer, dimension(2,2) :: b
    integer, intent(in), dimension(2) :: a
    integer :: i,j

    do concurrent(i=1:2, j=1:2)
      b(i,j) = a(2) * i * j
    enddo

  end function b

end program test_do_concurrent

I would expect that it prints

2 4 4 8

and that is what I get from gfortran and ifort, unless I compile with ifort and -qopenmp, in which case I get

0 0 0 0

According to the Intel support, this is ok because my code contains unspecified behavior if a (more specifically a(2)) is not defined in the loop.

The relevant aspects of the standard are

11.1.7.5 Additional semantics for DO CONCURRENT constructs

  1. The locality of a variable that appears in a DO CONCURRENT construct is LOCAL, LOCAL_INIT, SHARED, or unspecified. A construct or statement entity of a construct or statement within the DO CONCURRENT construct has SHARED locality if it has the SAVE attribute. If it does not have the SAVE attribute, it is a different entity in each iteration, similar to LOCAL locality.
  2. A variable that has LOCAL or LOCAL_INIT locality is a construct entity with the same type, type parameters, and rank as the variable with the same name in the innermost executable construct or scoping unit that includes the DO CONCURRENT construct, and the outside variable is inaccessible by that name within the construct. The construct entity has the ASYNCHRONOUS, CONTIGUOUS, POINTER, TARGET, or VOLATILE attribute if and only if the outside variable has that attribute; it does not have the BIND, INTENT, PROTECTED, SAVE, or VALUE attribute, even if the outside variable has that attribute. If it is not a pointer, it has the same bounds as the outside variable. At the beginning of execution of each iteration,
  3. If a variable has unspecified locality,
    • if it is referenced in an iteration it shall either be previously defined during that iteration, or shall not be defined or become undefined during any other iteration; if it is defined or becomes undefined by more than one iteration it becomes undefined when the loop terminates;

C1128: A variable-name that appears in a LOCAL or LOCAL_INIT locality-spec shall not have the ALLOCATABLE, INTENT (IN), or OPTIONAL attribute, shall not be of finalizable type, shall not be a nonpointer polymorphic dummy argument, and shall not be a coarray or an assumed-size array. A variable-name that is not permitted to appear in a variable definition context shall not appear in a LOCAL or LOCAL_INIT locality-spec.

The reasoning seems to be a has unspecified locality, therefore it should behave similar to a variable with LOCAL locality, but this ignores that a variable with INTENT(in) cannot have LOCAL locality. I agree with the Intel support that the standard is not 100% clear here, but the behavior is in my opinion against the principle of least astonishment. 11.1.7.5.p3 also does not say that unspecified locality requires the variable to become defined in the iteration, it just says that it should become defined at most once.

2 Likes

I tested gfortran -fopenmp a.f90 on this example and it still produces 2 4 4 8, even with optimizations options such as -O3 -march=native -ffast-math.

As a user, I can’t see anything wrong with your code, and as such I would personally consider this a bug in the Intel compiler. As a user, at the very least I would like the compiler to give me an error (or at least a warning) if it is going to return 0, because the code is somehow not conforming. However, as a user, I would like this code to just work — unless there is some technical reason why this cannot be done.

However, from the standards point of view, they might technically be right. In which case the standard should be improved.

I would be interested what @sblionel thinks on this one.

2 Likes

Just because the description of unspecified locality bears some similarity to LOCAL locality, that doesn’t make them the same. When DO CONCURRENT was first introduced in F2008, there was no concept of locality in the standard. This got added in F2018. The current words about “unspecified locality” applied, in F2008, to all variables in an iteration. At the time, the thought was that compilers could figure out on their own whether a variable was local or shared, and compilers tried to do that but didn’t always get it right. Mirroring OpenMP, F2018 added locality clauses to allow the programmer to specify what was meant.

That INTENT(IN) dummy arguments can’t be LOCAL is irrelevant.

However… I find myself disagreeing with Intel regarding this case. The standard says “if it is referenced in an iteration it shall either be previously defined during that iteration, or shall not be defined or become undefined during any other iteration;”. That “or” is significant. a is not “previously defined” in any iteration, but neither is it “defined or become defined during any other iteration.” Therefore, the reference to a meets the requirements for unspecified locality, the example is conforming, and should produce the result you want.

I find it interesting that adding -qopenmp changes the behavior, because this source does not use any OpenMP features. Yes, ifort uses OpenMP to parallelize a DO CONCURRENT, but if I build with -qparallel, I get the “correct” answer. (With this example, -qparallel doesn’t parallelize because the compiler thinks there is insufficient work - reasonable.)

Note that the following modification does produce the desired result:

  function b(a)

    integer, dimension(2,2) :: b
    integer, intent(in), dimension(2) :: a
    integer, dimension(2) :: a2
    integer :: i,j

    a2 = a
    do concurrent(i=1:2, j=1:2)
      b(i,j) = a2(2) * i * j
    enddo

  end function b

I don’t see any functional difference between these two cases, and see no reason why simply adding -qopenmp should change the result. (In the past, one could sometimes blame that on this implicitly making all procedures recursive, but in the 2021 compiler, RECURSIVE is the default (since that’s F2018.)

6 Likes

only ifort shows this behavior (I’ve clarified the description)

1 Like

Thank you very much Dr. Fortran! @sblionel .
I am curious, so, just say in order to make do concurrent work, what flag should I use really? -qparallel or -qopenmp?

I am a little confused, like when do we use -qparallel and when do we use qopenmp?

Is it that -qparallel just automatically make some part of code parallelized (we do not need to do anything, the compiler will handle it automatically), like on a desktop/laptop, it can use multiples core and threads automatically?

While -qopenmp is basically we need to put !$OMP somewhere in the code and we need to really do the parallelization job for the code, right?

But do concurrent I am a little confused, if my code only have do concurrent and do not have any other openMP part, in this case, ``-qparallelandqopenmp``` which flag should I use?

Thank you very much in advance Dr. Fortran!

I have a question.
What is the difference between a variable with SHARED locality compared with one with unspecified locality (in a do concurrent of course)?
Are there examples of different behaviour, or is the SHARED locality just a guarantee given by the programmer to the compiler?
Or examples where without the SHARED locality the code is non conforming.

Thanks

-parallel (not -qparallel) directs the compiler to automatically parallelize Fortran loops (ordinary DO and DO CONCURRENT) as well as some array operations. It is rather conservative in its approach, though you can use compiler-specific directives to give the compiler more information. Intel Fortran uses OpenMP “under the covers” for this parallelization. See Automatic Parallelization (intel.com) for more information.

-qopenmp says that you will be using OpenMP directives and procedures to “direct” parallelization of your code. The Intel compiler will ALSO do parallelization of DO CONCURRENT if you say -qopenmp, even if you are not using OpenMP otherwise.

2 Likes

SHARED tells the compiler that the variable is shared across iterations, so it doesn’t need to try to figure it out by itself. Here is what the standard says about it (11.1.7.5p3):

3 If a variable has SHARED locality, appearances of the variable within the DO CONCURRENT construct refer to the variable in the innermost executable construct or scoping unit that includes the DO CONCURRENT construct. If it is defined or becomes undefined during any iteration, it shall not be referenced, defined, or become undefined during any other iteration. If it is allocated, deallocated, nullified, or pointer-assigned during an iteration it shall not have its allocation or association status, dynamic type, array bounds, shape, or a deferred type parameter value inquired about in any other iteration. A noncontiguous array with SHARED locality shall not be supplied as an actual argument corresponding to a contiguous INTENT (INOUT) dummy argument.

3 Likes

I have tried a do concurrent with LOCAL_INIT with ifort 2021.5.0 20211109.

program main
    implicit none
    integer :: a, i
    a = 5
    do concurrent(i = 1:5) local_init(a)
        print "(I5,I7)", i, a 
        a = a*2
        print "(A5, I7)",'     ', a
    end do
end program

Compiled in WSL2 with -parallel. It writes:

    1      5
          10
    2     10
          20
    3     20
          40
    4     40
          80
    5     80
         160

But I would have expected:

    1      5
          10
    2      5
          10
    3      5
          10
    4      5
          10
    5      5
          10

I’m getting your expected output if I compile with -qopenmp option instead of -parallel.
But, if I enlarge the limit of iterations to 20: do concurrent (i=1:20) local_init(a), the output gets messy (note: I modified the code to output both i and a in both print statements):

Output
    1      5
   17      5
   17     10
   18      5
    7      5
   15      5
   13      5
    9      5
    9     10
   10     10
   10     20
   19      5
   20      5
    5      5
    5     10
    6     10
    6     20
   11      5
    1     10
    3      5
   18     10
    7     10
   15     10
   13     10
   14     10
   14     20
   19     10
   20     10
   11     10
    2     10
    2     20
    3     10
    8     10
   16     10
   12     10
   12     20
    4     10
    4     20
    8     20
   16     20

If the results depend on the number of iterations vs. number of available processors (as it seems), the whole thing is no good.

I think this is just an ifort bug. You should report it.

As you can see below, the shared clause fix this.

jhammond@nuclear:~$ ifort -fopenmp dcbug.F90 -DFOO="" && ./a.out
           0           0           0           0
jhammond@nuclear:~$ ifort -fopenmp dcbug.F90 -DFOO="shared(a)" && ./a.out
           2           4           4           8
jhammond@nuclear:~$ ifort -fopenmp dcbug.F90 -DFOO="local(a)" && ./a.out
           0           0           0           0
jhammond@nuclear:~$ ifort -fopenmp dcbug.F90 -DFOO="local_init(a)" && ./a.out
           2           4           4           8
jhammond@nuclear:~$ cat dcbug.F90
program test_do_concurrent

  implicit none

  print*, b([1,2])


  contains

  function b(a)

    integer, dimension(2,2) :: b
    integer, intent(in), dimension(2) :: a
    integer :: i,j

    do concurrent(i=1:2, j=1:2) FOO
      b(i,j) = a(2) * i * j
    enddo

  end function b

end program test_do_concurrent

I don’t understand why intent(in) precludes the use of local or local_init.

I’m not sure, but by analogy with OpenMP, I guess that local or local_init means local to a thread, not local to an iteration. If for some reason the compiler decides not to parallelize the loop, then a single thread is used, with the same a all along the iterations. So your "unexpected output` looks perfectly right to me.

It makes sense to me. In practice, what would be the benefit of using local or local_init for a variable that shall not be modified?

Well, nowhere in the standard there is the word thread.
I’d like other comments but at:

11.1.7.5 Additional semantics for DO CONCURRENT constructs

is written:

At the beginning of execution of each iteration,
• if a variable with LOCAL locality is a pointer it has undefined pointer association status, and otherwise it is undefined except for any subobjects that are default-initialized;
• a variable with LOCAL_INIT locality has the pointer association status and definition status of the out side variable with that name; the outside variable shall not be an undefined pointer or a nonallocatable nonpointer variable that is undefined.

What is the correct interpretation of definition status?
That the local_init variable will have the same value of the outside variable with the same name?

Otherwise I don’t understand how you can use local_init variables.

Trying the example as in the OP (using ifort 2021.7.1, -qopenmp flag on), I am getting runtime error forrtl: error (76): Abort trap signal.

I then tried to reproduce the equivalent OpenMP explicit implementation:

...
!$omp parallel do collapse(2) default(private) shared(b, a) num_threads(4)
do j = 1, 2
   do i = 1, 2
      id    = (j-1)*2 + i
      b(id) = a(2) * i * j
   enddo
enddo
!$omp end parallel do

which generates

OMP: Error #179: Function Can't open SHM2 failed:
OMP: System error #30: Read-only file system
forrtl: severe (174): SIGSEGV, segmentation fault occurred

Might be something related with Intel’s OpenMP implementation ?

Gives the same error with just:

!$omp parallel do PRIVATE(I, J, id) num_threads(2)
do j = 1, 2
   do i = 1, 2
      id    = (j-1)*2 + i
      !$omp critical
      print *, id, i * j
      !$omp end critical
   enddo
enddo
!$omp end parallel do

which seems strange to me…

Looks like you’re right, and that the behavior of local / local_init is different from OpenMP private / firstprivate

1 Like

Totally agree.

I think you can see local_init as the firstprivate OpenMP equivalent. For each thread the compiler created for executing the loop, it does a copy of the value (definition status) held by that variable (right before the start) to all threads local instances, assuming that if it is a pointer variable, it is associated with a target (which by itself has been defined previously, i.e. does not hold UB).

EDIT:

actually, true, but… you could translate the two at least at the very first iteration in the do concurrent case, isn’t it ?

What is happening is clear, but there is no concept of thread in the standard.

A variable that has LOCAL or LOCAL_INIT locality is a construct entity with the same type, type parameters, and rank as the variable with the same name in the innermost executable construct or scoping unit that includes the DO CONCURRENT construct, and the outside variable is inaccessible by that name within the construct.

I.e. there is no sharing of those variables between iterations. For local variables, they must be defined before being used within an iteration. For local_init, they start with the value of the variable outside the DO CONCURRENT construct.

The bottom line is that the operations in one iteration of a DO CONCURRENT loop shall not affect the operations in any other iteration of a DO CONCURRENT loop or you’ve violated the standard. That’s the whole point of the construct is to give the programmer a way to say “these iterations don’t depend on each other”.

P.S. Even for SHARED:

If it is defined or becomes undefined during any iteration, it shall not be referenced, defined, or become undefined during any other iteration.

1 Like

Just to finish testing Intel compilers, I run this program:

program main
    implicit none
    integer, parameter :: NP = 100
    integer :: a, i, b(NP)
    a = 1000
    do concurrent(i = 1:NP) local_init(a)
        a = a*2
        b(i) = a + i
    end do 

    do i=1, NP
        print *,i, b(i)
    enddo
end program

ifort needed -qopenmp or -parallel compiler options, but the output was wrong in both cases.
Instead ifx, while still needing -qopenmp, printed the correct output.

In this example, does “do concurrent” imply a multi-thread activity ?

Also, for this to work in !$OMP, you would require the following (although 2**NP will fail):

!$OMP parallel private(i,aa) shared (NP,a,b)
    do i = 1,NP
        aa = a*2**i
        b(i) = aa + i
    end do 
!$OMP end parallel

do concurrent also supports multiple executables, so does “local_init(a)” refer to variables with this shared status between executables, rather than local threads ?