Clarification on DO CONCURRENT

I’ve encountered a behavior of DO CONCURRENT when using the Intel compiler that I find unexpected. There is already a long thread on the Intel board (Re:do concurrent broken with openmp - Intel Community) without a conclusion so far. I summarize here my current understanding.

The issue can be demonstrated with the help of the following example.

program test_do_concurrent

  implicit none

  print*, b([1,2])


  contains

  function b(a)

    integer, dimension(2,2) :: b
    integer, intent(in), dimension(2) :: a
    integer :: i,j

    do concurrent(i=1:2, j=1:2)
      b(i,j) = a(2) * i * j
    enddo

  end function b

end program test_do_concurrent

I would expect that it prints

2 4 4 8

and that is what I get from gfortran and ifort, unless I compile with ifort and -qopenmp, in which case I get

0 0 0 0

According to the Intel support, this is ok because my code contains unspecified behavior if a (more specifically a(2)) is not defined in the loop.

The relevant aspects of the standard are

11.1.7.5 Additional semantics for DO CONCURRENT constructs

  1. The locality of a variable that appears in a DO CONCURRENT construct is LOCAL, LOCAL_INIT, SHARED, or unspecified. A construct or statement entity of a construct or statement within the DO CONCURRENT construct has SHARED locality if it has the SAVE attribute. If it does not have the SAVE attribute, it is a different entity in each iteration, similar to LOCAL locality.
  2. A variable that has LOCAL or LOCAL_INIT locality is a construct entity with the same type, type parameters, and rank as the variable with the same name in the innermost executable construct or scoping unit that includes the DO CONCURRENT construct, and the outside variable is inaccessible by that name within the construct. The construct entity has the ASYNCHRONOUS, CONTIGUOUS, POINTER, TARGET, or VOLATILE attribute if and only if the outside variable has that attribute; it does not have the BIND, INTENT, PROTECTED, SAVE, or VALUE attribute, even if the outside variable has that attribute. If it is not a pointer, it has the same bounds as the outside variable. At the beginning of execution of each iteration,
  3. If a variable has unspecified locality,
    • if it is referenced in an iteration it shall either be previously defined during that iteration, or shall not be defined or become undefined during any other iteration; if it is defined or becomes undefined by more than one iteration it becomes undefined when the loop terminates;

C1128: A variable-name that appears in a LOCAL or LOCAL_INIT locality-spec shall not have the ALLOCATABLE, INTENT (IN), or OPTIONAL attribute, shall not be of finalizable type, shall not be a nonpointer polymorphic dummy argument, and shall not be a coarray or an assumed-size array. A variable-name that is not permitted to appear in a variable definition context shall not appear in a LOCAL or LOCAL_INIT locality-spec.

The reasoning seems to be a has unspecified locality, therefore it should behave similar to a variable with LOCAL locality, but this ignores that a variable with INTENT(in) cannot have LOCAL locality. I agree with the Intel support that the standard is not 100% clear here, but the behavior is in my opinion against the principle of least astonishment. 11.1.7.5.p3 also does not say that unspecified locality requires the variable to become defined in the iteration, it just says that it should become defined at most once.

1 Like

I tested gfortran -fopenmp a.f90 on this example and it still produces 2 4 4 8, even with optimizations options such as -O3 -march=native -ffast-math.

As a user, I can’t see anything wrong with your code, and as such I would personally consider this a bug in the Intel compiler. As a user, at the very least I would like the compiler to give me an error (or at least a warning) if it is going to return 0, because the code is somehow not conforming. However, as a user, I would like this code to just work — unless there is some technical reason why this cannot be done.

However, from the standards point of view, they might technically be right. In which case the standard should be improved.

I would be interested what @sblionel thinks on this one.

1 Like

Just because the description of unspecified locality bears some similarity to LOCAL locality, that doesn’t make them the same. When DO CONCURRENT was first introduced in F2008, there was no concept of locality in the standard. This got added in F2018. The current words about “unspecified locality” applied, in F2008, to all variables in an iteration. At the time, the thought was that compilers could figure out on their own whether a variable was local or shared, and compilers tried to do that but didn’t always get it right. Mirroring OpenMP, F2018 added locality clauses to allow the programmer to specify what was meant.

That INTENT(IN) dummy arguments can’t be LOCAL is irrelevant.

However… I find myself disagreeing with Intel regarding this case. The standard says “if it is referenced in an iteration it shall either be previously defined during that iteration, or shall not be defined or become undefined during any other iteration;”. That “or” is significant. a is not “previously defined” in any iteration, but neither is it “defined or become defined during any other iteration.” Therefore, the reference to a meets the requirements for unspecified locality, the example is conforming, and should produce the result you want.

I find it interesting that adding -qopenmp changes the behavior, because this source does not use any OpenMP features. Yes, ifort uses OpenMP to parallelize a DO CONCURRENT, but if I build with -qparallel, I get the “correct” answer. (With this example, -qparallel doesn’t parallelize because the compiler thinks there is insufficient work - reasonable.)

Note that the following modification does produce the desired result:

  function b(a)

    integer, dimension(2,2) :: b
    integer, intent(in), dimension(2) :: a
    integer, dimension(2) :: a2
    integer :: i,j

    a2 = a
    do concurrent(i=1:2, j=1:2)
      b(i,j) = a2(2) * i * j
    enddo

  end function b

I don’t see any functional difference between these two cases, and see no reason why simply adding -qopenmp should change the result. (In the past, one could sometimes blame that on this implicitly making all procedures recursive, but in the 2021 compiler, RECURSIVE is the default (since that’s F2018.)

6 Likes

only ifort shows this behavior (I’ve clarified the description)

1 Like

Thank you very much Dr. Fortran! @sblionel .
I am curious, so, just say in order to make do concurrent work, what flag should I use really? -qparallel or -qopenmp?

I am a little confused, like when do we use -qparallel and when do we use qopenmp?

Is it that -qparallel just automatically make some part of code parallelized (we do not need to do anything, the compiler will handle it automatically), like on a desktop/laptop, it can use multiples core and threads automatically?

While -qopenmp is basically we need to put !$OMP somewhere in the code and we need to really do the parallelization job for the code, right?

But do concurrent I am a little confused, if my code only have do concurrent and do not have any other openMP part, in this case, ``-qparallelandqopenmp``` which flag should I use?

Thank you very much in advance Dr. Fortran!

I have a question.
What is the difference between a variable with SHARED locality compared with one with unspecified locality (in a do concurrent of course)?
Are there examples of different behaviour, or is the SHARED locality just a guarantee given by the programmer to the compiler?
Or examples where without the SHARED locality the code is non conforming.

Thanks

-parallel (not -qparallel) directs the compiler to automatically parallelize Fortran loops (ordinary DO and DO CONCURRENT) as well as some array operations. It is rather conservative in its approach, though you can use compiler-specific directives to give the compiler more information. Intel Fortran uses OpenMP “under the covers” for this parallelization. See Automatic Parallelization (intel.com) for more information.

-qopenmp says that you will be using OpenMP directives and procedures to “direct” parallelization of your code. The Intel compiler will ALSO do parallelization of DO CONCURRENT if you say -qopenmp, even if you are not using OpenMP otherwise.

1 Like

SHARED tells the compiler that the variable is shared across iterations, so it doesn’t need to try to figure it out by itself. Here is what the standard says about it (11.1.7.5p3):

3 If a variable has SHARED locality, appearances of the variable within the DO CONCURRENT construct refer to the variable in the innermost executable construct or scoping unit that includes the DO CONCURRENT construct. If it is defined or becomes undefined during any iteration, it shall not be referenced, defined, or become undefined during any other iteration. If it is allocated, deallocated, nullified, or pointer-assigned during an iteration it shall not have its allocation or association status, dynamic type, array bounds, shape, or a deferred type parameter value inquired about in any other iteration. A noncontiguous array with SHARED locality shall not be supplied as an actual argument corresponding to a contiguous INTENT (INOUT) dummy argument.

1 Like

I have tried a do concurrent with LOCAL_INIT with ifort 2021.5.0 20211109.

program main
    implicit none
    integer :: a, i
    a = 5
    do concurrent(i = 1:5) local_init(a)
        print "(I5,I7)", i, a 
        a = a*2
        print "(A5, I7)",'     ', a
    end do
end program

Compiled in WSL2 with -parallel. It writes:

    1      5
          10
    2     10
          20
    3     20
          40
    4     40
          80
    5     80
         160

But I would have expected:

    1      5
          10
    2      5
          10
    3      5
          10
    4      5
          10
    5      5
          10

I’m getting your expected output if I compile with -qopenmp option instead of -parallel.
But, if I enlarge the limit of iterations to 20: do concurrent (i=1:20) local_init(a), the output gets messy (note: I modified the code to output both i and a in both print statements):

Output
    1      5
   17      5
   17     10
   18      5
    7      5
   15      5
   13      5
    9      5
    9     10
   10     10
   10     20
   19      5
   20      5
    5      5
    5     10
    6     10
    6     20
   11      5
    1     10
    3      5
   18     10
    7     10
   15     10
   13     10
   14     10
   14     20
   19     10
   20     10
   11     10
    2     10
    2     20
    3     10
    8     10
   16     10
   12     10
   12     20
    4     10
    4     20
    8     20
   16     20

If the results depend on the number of iterations vs. number of available processors (as it seems), the whole thing is no good.