MOVE_ALLOC to grow an array

Here is a program that compares the speed of growing an array using MOVE_ALLOC versus allocation on assignment. Often move_alloc is faster, but that depends on the compiler and operating system. I plan to tweet about move_alloc, so comments are welcome.

module m
implicit none
contains
pure subroutine grow_vec(x,xadd) ! append xadd(:) to x(:)
real, allocatable, intent(in out) :: x(:)
real             , intent(in)     :: xadd(:)
real, allocatable                 :: temp(:)
integer                           :: nx
if (.not. allocated(x)) allocate (x(0))
nx   = size(x)
allocate (temp(nx + size(xadd)))
temp(:nx) = x
temp(nx+1:) = xadd
deallocate (x)
call move_alloc(temp,x)
end subroutine grow_vec
end module m
!
program grow_array
use m, only: grow_vec
implicit none
real   , allocatable :: a(:),xran(:)
integer, parameter   :: nmax = 3*10**6, ngrow = 1000, nt = 3
integer              :: na
real                 :: t(nt)
allocate (xran(nmax))
call random_number(xran)
call cpu_time(t(1))
allocate (a(0))
na = 0
do ! grow array in a subroutine with copying and move_alloc
   call grow_vec(a,xran(na+1:na+ngrow))
   na = size(a)
   if (na >= nmax) exit
end do
print "(2a15)","size","max|a-xran|"
print "(i15,f15.10)",size(a),maxval(abs(a-xran))
call cpu_time(t(2))
a = [real ::]
na = 0
do ! grow array using allocation on assignment (AoA)
   a = [a,xran(na+1:na+ngrow)]
   na = size(a)
   if (size(a) >= nmax) exit
end do
print "(i15,f15.10)",size(a),maxval(abs(a-xran))
call cpu_time(t(3))
print "(/,3a15)","method","move_alloc","AoA"
print "(a15,2f15.3)","time(s)",t(2:nt) - t(1:nt-1)
end program grow_array
Edited to add output:
gfortran -O3 on Windows
         method     move_alloc            AoA
        time(s)          4.969         20.516

gfortran -O3 on WSL2
         method     move_alloc            AoA
        time(s)          3.258          4.459

flang -O3 on WSL2
         method     move_alloc            AoA
        time(s)          3.518          4.413

ifort -O3 /F512000000 on Windows
         method     move_alloc            AoA
        time(s)          4.156          5.703

ifort -O3 -heap-arrays on Windows
         method     move_alloc            AoA
        time(s)          4.641          8.750

ifort -O3 -heap-arrays on WSL2
         method     move_alloc            AoA
        time(s)          4.077          3.025

ifort binaries segfault without -heap-arrays on both macOS Big Sur and Ubuntu 20.04; this might be worth mentioning in the tweet.

gfortran-10 -O3 on Ubuntu 20.04, i5-10505 CPU @ 3.20GHz, 16GB RAM
         method     move_alloc            AoA
        time(s)          4.286          5.562

ifort -O3 -heap-arrays on Ubuntu 20.04
         method     move_alloc            AoA
        time(s)          5.054          4.016

gfortran-11 -O3 on MacOS BigSur, i5-4288U CPU @ 2.60GHz, 8 GB RAM
         method     move_alloc            AoA
        time(s)         13.554         23.184

ifort -O3 -heap-arrays on MacOS BigSur
         method     move_alloc            AoA
        time(s)         14.547         28.613

Interesting: there is a big difference between move_alloc and AoA on macOS (as on Windows in your results).

That is due to exceeding the default stack size limit. The binaries work fine if one removes the limit before running, e.g. ulimit -s unlimited in bash.

I’m surprised the AoA worked as well as it did. (I was also unaware that arrays could be used inside an array constructor expression. Neat!)

A couple of suggestions:

  • I think you’d be better off using system_clock (with int64 arguments) in your timings, and not including the print statements (with the calculation of maxval(abs(a-xran))) in the timed portion.
  • For move_alloc, a lot of what you are timing is just the procedure-call overhead of grow_vec (it accounts for 75% of the time with ifort on Linux for me). Consider this third, in-line alternative:
a = [real ::]
na = 0
do
   block
    real, allocatable :: temp(:)
    allocate(temp(na+ngrow))
    temp(:na) = a
    temp(na+1:) = xran(na+1:na+ngrow)
    deallocate(a)
    call move_alloc(temp, a)
   end block
   na = size(a)
   if (size(a) >= nmax) exit
end do

Given the semantics in the standard for MOVE_ALLOC, the statement deallocate(a) is not needed.

Depending on the processor, such explicit allocation may even adversely affect performance.
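
As a minimal, illustrative sketch of those semantics (the program and values here are made up for demonstration): MOVE_ALLOC deallocates the to argument if it is already allocated and leaves the from argument deallocated afterwards, so no explicit deallocate statement is required.

```fortran
program demo_move_alloc
   implicit none
   real, allocatable :: a(:), tmp(:)
   a = [1.0, 2.0]            ! a is allocated here (allocation on assignment)
   allocate (tmp(4))
   tmp(1:2) = a
   tmp(3:4) = [3.0, 4.0]
   ! move_alloc deallocates a (since it is allocated) and transfers
   ! tmp's allocation to a; tmp is left deallocated afterwards.
   call move_alloc(from=tmp, to=a)
   print *, size(a), allocated(tmp)   ! prints 4 and F
end program demo_move_alloc
```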


@Beliavsky, it’s good you’re seeking comments. My suggestion is for you to work harder, much, much harder, and give your drafts a lot more thought so that you can make them as brief as possible before tweeting.

In the case of MOVE_ALLOC for this use case, for example, you can start with something like the following and make it more compact:

! canonical method to grow an array:
! say a and tmp are rank-1 objects of the same type (and kind) and
! tmp is allocated to desired new shape for a
   tmp( 1:size(a) ) = a
   tmp( size(a)+1: ) = .. ! data to be added
   call move_alloc( from=tmp, to=a ) 

A couple of other options include:

! array constructor
   a = [ a, [ .. ] ]
! RESHAPE intrinsic function
   a = reshape( a, shape=[ new_shape ], pad=.. )

In my opinion, your tips thus far have been verbose.

Separately, in the context of those coding in anger with Fortran and coming to this Discourse for information and guidance, it will be useful to realize that

  • a need to dynamically grow/shrink an array will be helped by containerizing it suitably, perhaps in a derived type;
  • the growth or shrinkage should follow a scheme suitable for the need at hand; a default scheme is usually to double/halve the size of the object in the container but there may be other better schemes depending on the program and the processor.
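
As a hedged sketch of such a container (the names vec_t and push, the initial capacity of 8, and the doubling scheme are illustrative assumptions, not taken from any particular library), growing the buffer with MOVE_ALLOC when it fills:

```fortran
module vec_mod
   implicit none
   type :: vec_t
      real, allocatable :: buf(:)   ! storage; capacity = size(buf)
      integer :: n = 0              ! number of elements in use
   contains
      procedure :: push
   end type vec_t
contains
   subroutine push(self, val)
      class(vec_t), intent(inout) :: self
      real, intent(in) :: val
      real, allocatable :: tmp(:)
      if (.not. allocated(self%buf)) allocate (self%buf(8))
      if (self%n == size(self%buf)) then
         allocate (tmp(2*size(self%buf)))   ! double the capacity
         tmp(1:self%n) = self%buf
         call move_alloc(from=tmp, to=self%buf)
      end if
      self%n = self%n + 1
      self%buf(self%n) = val
   end subroutine push
end module vec_mod
```

Appends are then amortized O(1), and the caller reads only the slice v%buf(1:v%n). A shrink scheme (say, halving when n falls below a quarter of the capacity) can be added symmetrically.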

So the approach should be to study the options vis-a-vis the needs, while paying careful attention to the instrumentation around what one is measuring and how. The code in the original post has issues in this respect, as pointed out upthread.

A quick go-by can be as follows, where the scheme of doubling the size each time is considered. To make it somewhat more interesting, the option with the RESHAPE intrinsic is also thrown in. One can then see how performance varies with the option and the shape and relate that to one’s needs in actual code.

   integer, parameter :: WP = selected_real_kind( p=12 )
   character(len=*), parameter :: fmth = "(g0,t10,g0,t30,g0,t50,g0,t70,g0)"
   character(len=*), parameter :: fmtg = "(g0,t10,g0,t30,g0.3,t50,g0.3,t70,g0)"
   integer, parameter :: nstart = 100000
   integer, parameter :: nruns = 12
   integer :: i, n(nruns)
   real(WP) :: t1, t2
   real(WP), allocatable :: a(:),xran(:)
   allocate (xran(nstart*(2**nruns)))
   call random_number( xran )
   n = [( nstart*2**(i-1), i=1,nruns )]
   blk1: block
      a = [real(WP) :: ]
      print *, "Method: Array constructor"
      print fmth, "i", "Array Size", "Time (sec)", "Max|a-xran|", "a(n/2)"
      do i = 1, nruns
         call my_cpu_time(t1)
         a = [ a, xran(size(a)+1:n(i)) ]
         call my_cpu_time(t2)
         print fmtg, i, size(a), (t2-t1), maxval(abs(a-xran)), a(size(a)/2)
      end do
   end block blk1
   blk2: block
      a = [real(WP) :: ]
      print *, "Method: RESHAPE intrinsic"
      print fmth, "i", "Array Size", "Time (sec)", "Max|a-xran|", "a(n/2)"
      do i = 1, nruns
         call my_cpu_time(t1)
         a = reshape( a, shape=[n(i)], pad=xran(size(a)+1:n(i)) )
         call my_cpu_time(t2)
         print fmtg, i, size(a), (t2-t1), maxval(abs(a-xran)), a(size(a)/2)
      end do
   end block blk2
   blk3: block
      a = [real(WP) :: ]
      print *, "Method: MOVE_ALLOC"
      print fmth, "i", "Array Size", "Time (sec)", "Max|a-xran|", "a(n/2)"
      do i = 1, nruns
         block
            real(WP), allocatable :: tmp(:)
            call my_cpu_time(t1)
            allocate( tmp(n(i)) )
            tmp(1:size(a)) = a
            tmp(size(a)+1:) = xran(size(a)+1:size(tmp))
            call move_alloc( from=tmp, to=a )
            call my_cpu_time(t2)
         end block
         print fmtg, i, size(a), (t2-t1), maxval(abs(a-xran)), a(size(a)/2)
      end do
   end block blk3
contains
   subroutine my_cpu_time( time )
      use, intrinsic :: iso_fortran_env, only : I8 => int64
      ! Argument list
      real(WP), intent(out) :: time
      ! Local variables
      integer(I8) :: tick
      integer(I8) :: rate
      call system_clock (tick, rate)
      time = real(tick, kind=kind(time) ) / real(rate, kind=kind(time) )
      return
   end subroutine my_cpu_time
end
Expected program behavior:
C:\Temp>gfortran -O3 p.f90 -o p.exe

C:\Temp>p.exe
 Method: Array constructor
i        Array Size          Time (sec)          Max|a-xran|         a(n/2)
1        100000              0.157E-2            0.00                0.14632783894497925
2        200000              0.216E-2            0.00                0.93604658738074942
3        400000              0.383E-2            0.00                0.38816074699824510E-2
4        800000              0.780E-2            0.00                0.48353192216229768
5        1600000             0.154E-1            0.00                0.17707306405838441
6        3200000             0.316E-1            0.00                0.34672021796550934
7        6400000             0.619E-1            0.00                0.49127175384201527E-2
8        12800000            0.123               0.00                0.41832146544600479
9        25600000            0.274               0.00                0.34490797395681683
10       51200000            0.499               0.00                0.43429002560932473
11       102400000           1.01                0.00                0.92205452988951198
12       204800000           6.73                0.00                0.86448646582845623
 Method: RESHAPE intrinsic
i        Array Size          Time (sec)          Max|a-xran|         a(n/2)
1        100000              0.230E-1            0.00                0.14632783894497925
2        200000              0.170E-2            0.00                0.93604658738074942
3        400000              0.343E-2            0.00                0.38816074699824510E-2
4        800000              0.784E-2            0.00                0.48353192216229768
5        1600000             0.143E-1            0.00                0.17707306405838441
6        3200000             0.273E-1            0.00                0.34672021796550934
7        6400000             0.547E-1            0.00                0.49127175384201527E-2
8        12800000            0.107               0.00                0.41832146544600479
9        25600000            0.201               0.00                0.34490797395681683
10       51200000            0.351               0.00                0.43429002560932473
11       102400000           0.646               0.00                0.92205452988951198
12       204800000           1.41                0.00                0.86448646582845623
 Method: MOVE_ALLOC
i        Array Size          Time (sec)          Max|a-xran|         a(n/2)
1        100000              0.263E-1            0.00                0.14632783894497925
2        200000              0.636E-3            0.00                0.93604658738074942
3        400000              0.187E-2            0.00                0.38816074699824510E-2
4        800000              0.316E-2            0.00                0.48353192216229768
5        1600000             0.552E-2            0.00                0.17707306405838441
6        3200000             0.123E-1            0.00                0.34672021796550934
7        6400000             0.243E-1            0.00                0.49127175384201527E-2
8        12800000            0.472E-1            0.00                0.41832146544600479
9        25600000            0.843E-1            0.00                0.34490797395681683
10       51200000            0.163               0.00                0.43429002560932473
11       102400000           0.225               0.00                0.92205452988951198
12       204800000           0.463               0.00                0.86448646582845623

Thanks for the reshape idea and code – I hadn’t thought of using reshape for this.

Regarding your previous post, I’m not sure that spending much more time on each tweet would improve the quality much, and it would slow the pace. The tips are listed by topic and given one-sentence subjects here. I would like to cover OOP, parameterized derived types, C interoperability, and coarrays. When the main features of Fortran are covered, I may slow the pace to weekly and cover Fortran tools (fpm, fypp, etc.) and libraries (stdlib and others).

A long time ago I made a ‘grow’ derived type that behaved similarly to C++ std::vector. I used move_alloc() a lot in that! I was severely frustrated by the lack of generics when doing that. I didn’t know about all the preprocessors at the time.