On the creation of temporary arrays when passing type element arguments

Dear community:

I would like to hear your recommendation on how to optimize the scenario in where I need to pass a column of a type array element as a parameter to a function while avoiding the creation of temporary arrays. The situation emerges in legacy code that processes business-like data structures from data blobs that I cant modify.

In the original simplified code below, The main TEST program reads an array of complex types, and uses the VSRCH to find the location of a string in one of the string colums of the complex type:

INTEGER FUNCTION VSRCH(STR, LST, L)
   IMPLICIT NONE
   INTEGER L
   CHARACTER(*) STR
   CHARACTER(*) LST(L)
   INTEGER I
   VSRCH = -1
   DO I=1,L
       IF (LST(I) == STR) THEN
            VSRCH = I
            EXIT
       END IF
   END DO
END FUNCTION VSRCH

PROGRAM TEST

IMPLICIT NONE
INTEGER SIZ, LNA, LNB
PARAMETER (LNA = 10)
PARAMETER (LNB = 20)
PARAMETER (SIZ = 100000)

TYPE :: TableA
  CHARACTER(LEN=LNA) :: A
  CHARACTER(LEN=LNB) :: B
  INTEGER D
  REAL C
END TYPE

TYPE(TableA), DIMENSION(SIZ) :: list
INTEGER VSRCH
INTEGER IDX,I,N
real T1,T2 
CHARACTER(100) :: num

list%A = 'AA'
list(5)%A = 'A'

N=1
IF(COMMAND_ARGUMENT_COUNT().GE.1)THEN
   CALL GET_COMMAND_ARGUMENT(1,num)
   READ(num,*)N 
END IF

call cpu_time(T1)
DO I=1,N
   IDX = VSRCH("A", list%A, SIZ)
END DO
call cpu_time(T2)
PRINT *, "Found ", I, SIZ, IDX, T2-T1

END

When compiled with -check all a warning appears telling us that pasing the column list%A to the VSRCH method creates a temporary array:

*forrtl: warning (406): fort: (1): In call to VSRCH, an array temporary was created for argument #2*

This is understandable as the shape of the list array type is not the same as the expected array of strings argument, and as a result the list%A column is a non stride-1 array, so the compiler decides to make a copy of list%A before passing it to the method.

After some research, Arjen Markus suggested several improvements to modernize the old code and use accessor auxiliary functions that circumvected the problem.

Further research indicated that this scenario can be fixed without resorting to auxiliary functions and that it can be fixed by only introducing a minimum set of some of the new language features like MODULE, INTENT and DIMENSION that change the way the arguments and the signature of the function are defined.

The following code is almost the same, but when compiler with -check all it does not produce a warning anymore:

MODULE SRCH
CONTAINS
INTEGER FUNCTION VSRCH(STR, LST, L)
   IMPLICIT NONE
   INTEGER, INTENT(IN) :: L
   CHARACTER(*), INTENT(IN):: STR
   CHARACTER(*), DIMENSION(:),INTENT(IN) :: LST
   INTEGER :: I
   VSRCH = -1
   DO I=1,L
       IF (LST(I) == STR) THEN
            VSRCH = I
            EXIT
       END IF
   END DO
END FUNCTION VSRCH
END MODULE

PROGRAM TEST

USE SRCH

IMPLICIT NONE
INTEGER SIZ, LNA, LNB
PARAMETER (LNA = 10)
PARAMETER (LNB = 20)
PARAMETER (SIZ = 100000)

TYPE :: TableA
  CHARACTER(LEN=LNA) :: A
  CHARACTER(LEN=LNB) :: B
  INTEGER D
  REAL C
END TYPE

TYPE(TableA), DIMENSION(SIZ) :: list

INTEGER IDX,I,N
REAL T1,T2 
CHARACTER(10) :: num

list%A = 'AA'
list(5)%A = 'A'

N=1
IF(COMMAND_ARGUMENT_COUNT().GE.1)THEN
   CALL GET_COMMAND_ARGUMENT(1,num)
   READ(num,*)N 
END IF

call cpu_time(T1)
DO I=1,N
   IDX = VSRCH("A", list%A, SIZ)
END DO
call cpu_time(T2)
PRINT *, "Found ", I, SIZ, IDX, T2-T1

END

As a result, when tested to run consecutively 100k times with an array of 100k entries in an Intel(R) Core™ i5-8365U CPU @ 1.60GHz, 1896 Mhz, 4 Core(s) laptop, the original codes takes 24.8s - 25.3s. The modified code takes 0.0156s - 0s, so this does speeds up the execution:

>ifort solution.f90
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.2.0 Build 20210228_000000
Copyright (C) 1985-2021 Intel Corporation.  All rights reserved.
Microsoft (R) Incremental Linker Version 14.28.29913.0
Copyright (C) Microsoft Corporation.  All rights reserved.
-out:solution.exe
-subsystem:console
solution.obj
>problem.exe 100000
 Found       100001      100000           5   25.31250
>solution.exe 100000
 Found       100001      100000           5  0.0000000E+00

One plausible explanation of why this works is that the use of MODULE, INTENT and DIMENSION give the compiler the hints it needs to be smart about how to access the non stride-1 list%A data structure and avoid the unnecessary creation of temporary arrays and as a result the code is now very efficient.

Another plaussible explanation, however, is that these constructs simply confuses the compiler and as a result it does not produce any warnings anymore, but at runtime the performance is as bad as it used to be.

Any comments in deciding if this solution is really adequate or if I should look for a more proper solution it will be most appreciated.

2 Likes

My hypothesis is that in the original code the array of structures is passed by value (copying takes a lot of time), and in the new code it is passed by reference. But of course the compiler needs to verify that it is not modified into the subroutine because of the intent(in).

Note that in C, arrays are passed by reference. But in Fortran things seems more complicated:

Note that the L variable could also be declared as intent(in).

1 Like

To precise my thoughts:

  • in the first case, a subarray would be extracted from list, which takes time, then copied to the subroutine, which takes time,
  • in the second case, the compiler would use the address of the whole array list and simply increment a pointer with the length of the TableA type to access all the A strings.

INTENT and MODULE help the compiler.
Compilers experts will tell you if my idea is more or less correct or not…

But the time difference seems quite excessive…

2 Likes

Thank you for your analysis and input.

I have corrected the code to define the variable L as INTENT(IN) as you suggest.

I also updated the code with the version that I am using so that others can run the exact same cases with large arrays and number of iterations. As you say, the difference is significant. When running 100k times, the CPU is quite busy in the first scenario but takes no time in the second:

1 Like

yes 25/0.015=1666 x is significant :grinning:

Have you tried with another compiler or other options (-O3 for example) ?

Note also that with the last version of your code, your array seems rather big: 4.2 MB. By “big” I mean with a size similar to the size of the cache of your CPU (6 MB). So, with the temporary subarray you could exceed the size of the cache, which could imply many exchanges between the RAM and the CPU cache. That could be another reason for the time difference. It could be interesting to test the same programs on a machine with a bigger cache.

1 Like

Good suggestions…

With /O3 the results are the same:

>ifort /O3 solution.f90
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.2.0 Build 20210228_000000
Copyright (C) 1985-2021 Intel Corporation.  All rights reserved.
Microsoft (R) Incremental Linker Version 14.28.29913.0
Copyright (C) Microsoft Corporation.  All rights reserved.
-out:solution.exe
-subsystem:console
solution.obj
>problem.exe 100000
 Found       100001      100000           5   24.62500
>solution.exe 100000
 Found       100001      100000           5  0.0000000E+00

Yes, these are large data structures, and yes, the cache in my laptop could be the culprit for the big difference.

To test this, I reduced the SIZ of the array to 10k and then to 1k and the original code takes 2 secs and 0.156s respectively , so it seems to scale linearly with the array size:

with SIZ set to 10k
    >problem.exe 100000
     Found       100001       10000           5   2.078125
    >solution.exe 100000
     Found       100001       10000           5  0.0000000E+00

with SIZ set to 1K
>problem.exe 100000
 Found       100001        1000           5  0.1562500
>solution.exe 100000
 Found       100001        1000           5  0.0000000E+00

So no cache problem.
But it could occur if you use a 300k array.

Using a profiler like gprof maybe also interesting to see where time is wasted.

Brilliant ! Look at the results when running with SIZ set to 300k. I guess we found one more reason to use the proposed solution approach:

>problem.exe 100000
forrtl: severe (170): Program Exception - stack overflow
Image              PC                Routine            Line        Source
problem.exe        00007FF7C9F11198  Unknown               Unknown  Unknown
problem.exe        00007FF7C9EB11AA  Unknown               Unknown  Unknown
problem.exe        00007FF7C9F10FBE  Unknown               Unknown  Unknown
problem.exe        00007FF7C9F113A4  Unknown               Unknown  Unknown
KERNEL32.DLL       00007FFD85917974  Unknown               Unknown  Unknown
ntdll.dll          00007FFD87E0A2D1  Unknown               Unknown  Unknown

>solution.exe 100000
 Found       100001      300000           5  0.0000000E+00
1 Like

Good! So the first program is copying the subarray on the stack, the second is not putting it on the stack. And hence is probably using the address of the array (either by passing it or directly, I don’t know) .

With gfortran-10, the use of assumed-shape dummy array for the OP’s code seems to use a temporary array (if I attach the -fcheck=all option)… I’m wondering if this is compiler-specific? (I may be doing something wrong with compilation, though)

Edit: Here is the output on my computer (old mac) for the second code snippet by the OP (using an explicit interface via module + assumed-shape dummy array + pass list%A directly as an actual argument).

$ gfortran-10 -fcheck=all solution.f90 
$ ./a.out
At line 53 of file solution.f90
Fortran runtime warning: An array temporary was created
 Found    2   100000   5   1.76799949E-03

$ gfortran-10 -fcheck=all -O3 solution.f90 
$ ./a.out
At line 53 of file solution.f90
Fortran runtime warning: An array temporary was created
 Found    2   100000   5   9.74999741E-04

Line 53 corresponds to this line:

IDX = VSRCH("A", list%A, SIZ)

@lanrebr ,

Welcome to this forum, I’m glad you took my advice over at the Intel Fortran forum and posted here as well.

So to reiterate what I recommended at the Intel forum, please review Fortran 2018 and you will find a Fortran intrinsic named FINDLOC (actually it was introduced starting Fortran 2008) that provides a compiler-implemented (hopefully better integrated with the processor) and which can aid your VSRCH function:

Also, look at other array intrinsics in Fortran such as COUNT, MAXLOC, MINLOC, etc.

Beware of a bug in Intel Fortran though:

Also, note you should now have no doubt whatsoever about the use and value of MODULEs, the INTENT attributes of dummy arguments, explicit interfaces, and a ton of other Fortran standard facilities.

@lanrebr ,

Note that in addition to compiler options such as -check, you can attempt some additional instrumentation that can crosscheck what the compiler might be indicating, in this instance using an extension method LOC:

So you may introduce it your utility function like so:

INTEGER FUNCTION VSRCH(STR, LST, L)
   IMPLICIT NONE
   INTEGER L
   CHARACTER(*) STR
   CHARACTER(*) LST(L)
   INTEGER I
   logical, save :: lfirst = .true.
   if ( lfirst ) then
      print *, "In VSRCH: loc(lst(1)): ", loc(lst(1))
      lfirst = .false.
   end if 
   VSRCH = -1
   DO I=1,L
       IF (LST(I) == STR) THEN
            VSRCH = I
            EXIT
       END IF
   END DO
END FUNCTION VSRCH

And also in your test main:

PROGRAM TEST

IMPLICIT NONE
INTEGER SIZ, LNA, LNB
PARAMETER (LNA = 10)
PARAMETER (LNB = 20)
PARAMETER (SIZ = 100000)

TYPE :: TableA
  CHARACTER(LEN=LNA) :: A
  CHARACTER(LEN=LNB) :: B
  INTEGER D
  REAL C
END TYPE

TYPE(TableA), DIMENSION(SIZ) :: list
INTEGER VSRCH
INTEGER IDX,I,N
real T1,T2 
CHARACTER(100) :: num

list%A = 'AA'
list(5)%A = 'A'

N=1
IF(COMMAND_ARGUMENT_COUNT().GE.1)THEN
   CALL GET_COMMAND_ARGUMENT(1,num)
   READ(num,*)N 
END IF

print *, "In main: loc(list(1)%A): ", loc(list(1)%A)
call cpu_time(T1)
DO I=1,N
   IDX = VSRCH("A", list%A, SIZ)
END DO
call cpu_time(T2)
PRINT *, "Found ", I, SIZ, IDX, T2-T1

END

With this you might find:

C:\temp>ifort /c vsrch.f90
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.2.0 Build 20210228_000000
Copyright (C) 1985-2021 Intel Corporation. All rights reserved.

C:\temp>ifort /c o.f90
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.2.0 Build 20210228_000000
Copyright (C) 1985-2021 Intel Corporation. All rights reserved.

C:\temp>link o.obj vsrch.obj /subsystem:console /out:o.exe
Microsoft (R) Incremental Linker Version 14.27.29112.0
Copyright (C) Microsoft Corporation. All rights reserved.

C:\temp>o.exe
In main: loc(list(1)%A): 140702822234272
In VSRCH: loc(lst(1)): 105602136384
Found 2 100000 5 0.0000000E+00

C:\temp>

You can then try the same with your other approach and see what you get!

As a rule of thumb, always use MODULE and INTENT. Just remember that it helps the compiler (to check and optimize your program), and you won’t forget.

It’s more verbose, but explicit is better than implicit. Tell the compiler all you know about your program and data, you don’t lose your time: the compiler will reward you with a faster program and less time lost to debug your code.

And it’s true for all attributes: PURE, PARAMETER, etc.

2 Likes

I agree with this advice in order to catch errors at compile time. But are there examples where merely declaring a procedure PURE and its arguments’ INTENTs actually speed up the code? Libraries such as LAPACK, written for speed, are written in FORTRAN 77 style and do not specify argument INTENTs.

1 Like

If I run the first program with gfortran on an Intel(R) Core™ i7-5500U CPU @ 2.40GHz (the time in seconds is on the right):

  • Original program (without module, without intent(in)):
$ gfortran vsrch.f90
$ time ./a.out 10000
 Found        10001      100000           5   11.4076843
  • With just a module:
$ time ./a.out 10000
 Found        10001      100000           5   11.6750689
  • With the module and the three intent(in):
$ time ./a.out 10000
 Found        10001      100000           5   4.64428234
  • With the module and the three intent(in) and -O3:
$ time ./a.out 10000    
 Found        10001      100000           5   2.95242286 

It seems even a little better (~2.85-2.90) with -Os (size optimization).

2 Likes

Concerning PURE, I have no example but Milan says in his book page 77:

It allows the compiler to execute the procedure in the most efficient way. A good compiler on a multicore system can even execute a pure procedure in parallel, if that would be more efficient.

Note that the ifort CPU time with the solution.exe program by @lanrebr is strange, comparing to gfortran:
it could be an “over-optimization” effect: we call N times the function with exactly the same arguments values. Why not call it only one time? => ~0.0s
It could be a clever optimization! :smiley:

Modifying the list table inside the loop would yield a more robust benchmark.

@septc has already reported above that when he uses gfortran with -check all in the solution.f90 he is still getting a temporary array warnings, so even in that case the gfortran will still copy the data no matter what constructs of the language we use.

In this perspective the Intel Fortran compiler is apparently able to fully optimize the code and not make any copies whatsoever.

Perhaps this is a desirable improvement in the gfortran compiler?

2 Likes

Yes, you’re right. Gfortran seems to dramatically lag behind… :frowning_face:

But are there examples where merely declaring a procedure PURE and its arguments’ INTENTs actually speed up the code?

In the past I’ve added the pure specifier and intent attributes to a multiphase fluid simulation code (AMR-LBM-OpenMP-2D), and remeber seeing a 20-25 % performance increase (I did perform some minor restructuring, but in no way did I modify the algorithm).

Libraries such as LAPACK, written for speed, are written in FORTRAN 77 style and do not specify argument INTENTs.

I always thought the reference LAPACK is written primarily with correctness in mind. The speed comes from the calls to BLAS routines, for which hardware optimized libraries are available (MKL, OpenBLAS, ATLAS, …). But it would be interesting to test this.

3 Likes