On the creation of temporary arrays when passing type element arguments

In these test you are running the code 10k times in gfortran. When comparing your results with mine, the original code in Intel Fortran is even faster than the updated code in gfortran, and on top of that, in the Intel Fortran the modified codes takes no time at all:

If we assume, as you propose, that when using Intel Fortran the updated code does not create any temporary arrays it does not spend any time or memory doing any array creations/data transfers, but the exact same array is used, the only real work it is doing is searching 10k times a string in a 100k array. However, since it always find the match after only five iterations, it doesn’t need to go deep into the array.

Therefore, calling the method 10k times is more or less equivalent to comparing a string to all the entries of an array of 10k x 5=50k. This was done on purpose so that we can only measure the time and effort of passing the array.

Based on this, the time of Intel Fortran time with the proposed solution makes sense, correct?

Yes. You are right, there is only 50k loops and gfortran seems to lose a lot of time transferring the subarray. I will report to @JerryD to see what he thinks about that problem.

FortranFan Thank you for a great suggestion.

I have made the changes that you propose, but in order not to make too many changes, instead of printing the memory location inside the function I return the position of memory LOC(LST(I)) instead of the index I in the array:

In the VSRCH function:

DO I=1,L
       IF (LST(I) == STR) THEN
            VSRCH = LOC(LST(I))
            EXIT
       END IF
END DO

Then in the maing program I do as you suggest to get the initial memory possition that would be found in the function and here I print them both to compare them:

O = LOC(list(5)%A)
call cpu_time(T1)
DO I=1,N
   IDX = VSRCH("A", list%A, SIZ)
END DO
call cpu_time(T2)
PRINT *, "Found ", I, SIZ, O, IDX, T2-T1

The result shows that indeed, in the original solution, inside the function the address of the memory location of the string found in the array is different than the corresponding memory location in the main program.
Also, it proves that when using the proposed solution, the position in memory in the function is the exact same as the one in the main program, proving that there is no copy made at all.

>problem.exe 100000
 Found       100001      100000 -1758418752 -1084180200   24.82812
>solution.exe 100000
 Found       100001      100000 -1967675200 -1967675200  1.5625000E-02

I then added a print statement inside the main program loop to print the memory location found inside the function at each iteration to see if the copy is made to different locations each time:

O = LOC(list(5)%A)
call cpu_time(T1)
DO I=1,N
   IDX = VSRCH("A", list%A, SIZ)
   PRINT *,I,O,IDX
END DO
call cpu_time(T2)
PRINT *, "Found ", I, SIZ, O, IDX, T2-T1

The results shows that in the original problem the memory location in the inside the method is different than the memory location in the main program, but that the memory location inside the method is the exact same every time in every run. This means that there is not a lot of memory trashing in this case as the same memory locations are used over and over again.

In the proposed solution, the memory location is the exact same in every single iteration as the one in the main program:

problem.exe 5
1 1282320576 1977662216
2 1282320576 1977662216
3 1282320576 1977662216
4 1282320576 1977662216
5 1282320576 1977662216
Found 6 100000 1282320576 1977662216 0.0000000E+00

solution.exe 5
1 1846782144 1846782144
2 1846782144 1846782144
3 1846782144 1846782144
4 1846782144 1846782144
5 1846782144 1846782144
Found 6 100000 1846782144 1846782144 0.0000000E+00

In order to verify this later observation, I then changed the main program to count how many times the memory location changes:

OL = LOC(list(5)%A)
NL = OL
CL = 0
call cpu_time(T1)
DO I=1,N
   IDX = VSRCH("A", list%A, SIZ)
   IF (IDX .NE. NL) THEN
      CL = CL + 1
      NL = IDX
   END IF
END DO
call cpu_time(T2)
PRINT *, "Found ", I, SIZ, OL, IDX, CL, T2-T1

The result confirms this that in the original problem the Intel Fortran copies the data into the exact same memory location each one of the 10k times, whereas in the solution there is absolutely no copy made in any run and is the exact same as in the main program.

problem.exe 10000
Found 10001 100000 41068736 1847638984 1 2.468750
solution.exe 10000
Found 10001 100000 -1050892096 -1050892096 0 0.0000000E+00

So this result is quite conclusive of the current behaviour and definitely confirms that this is inded the correct solution for this scenario if you use Intel Fortran.

This is great, but it still it would be nice confirm how this whole mechanic actually works on the Intel Fortran, as non stride-1 arrays are used in this code set for many purposes to handle large complex data set, not used just for searching strings.

1 Like

Alan Miller wrote a program to_f90.f90 to convert FORTRAN 77 codes to ELF90, a subset of Fortran 90 that requires argument INTENTs, and he used it convert part of Lapack. The converted code is part of BLUPF90, and I uploaded it to GitHub here.

2 Likes

I joined this group last week, and I realize that I am posting a reply almost six months after this thread was posted, but I wish to point out another factor that has not been discussed but may have drastically affected the running time.

The subroutine SRCH is called from MAIN with the same arguments N times, with N as large as 10,000. If SRCH is an external routine with an implicit interface, when MAIN is being compiled the compiler may not know if SRCH is pure, unless it does an extra pass after parsing and analyzing all the subprograms in the file.

When SRCH is made a contained function or a function in a module, however, the compiler may find out from its analysis that SRCH is pure, that the arguments passed to it are the same in every iteration of the DO loop in the main program, and the compiler may output code to run through the body of the function just once, or inline the function code. The result would be that the time spent in SRCH becomes too small to measure using CPU_TIME.

It appears to me that the test code is oversimplified and unrealistic. If, in your real application code, considerable time is being spent in its version of SRCH, and that function is being invoked many times with the same LST but different STR values and large values of N, you should consider sorting LST beforehand, and writing a version of SRCH that uses a more efficient algorithm than the current linear search, such as binary search, hash search, etc., and avoid redundant calls to SRCH. To do N linear searches with N strings takes O(N^2) amount of work, whereas N binary searches would take O(N lg N).

2 Likes