DO CONCURRENT: compiler flags to enable parallelization

Some time ago I prepared a comparison of loops vs array syntax in the context of a dynamic programming problem; see FortranVec/src/main.f90 at main · aledinola/FortranVec · GitHub

The loop-based code is in bellman_op, while the array-syntax code is in bellman_op_vec and bellman_op_vec2. There was a discussion about it on this forum: see Performance of vectorized code in ifort and ifx

It turns out that this code

v_max  = large_negative
ap_ind = 0
! Choose a' optimally by stepping through all possible values
do ap_c = 1, n_a
    aprime_val = ap_grid(ap_c)
    cons = R*a_val + z_val - aprime_val
    if (cons > 0.0d0) then
        v_temp = f_util(cons) + beta*EV(ap_c,z_c)
        !v_temp = f_util(cons) + beta*sum(v(ap_c,:)*z_tran(z_c,:))
        if (v_temp > v_max) then
            v_max  = v_temp
            ap_ind = ap_c
        end if
    end if
end do !end a'

is faster than this

cons = R*a_val + z_val - ap_grid ! (n_ap,1)
! NOTE: where and merge are slower than forall
! NOTE: forall and do concurrent are equivalent with ifort,
! but do concurrent is very slow with ifx!
!where (cons > 0.0d0)
!    util = f_util(cons)  ! (n_ap,1)
!elsewhere
!    util = large_negative
!end where
!util = merge(f_util(cons), large_negative, cons > 0.0d0)
! v_temp = large_negative
! do concurrent (ap_c = 1:n_a, cons(ap_c) > 0.0d0)
!    v_temp(ap_c) = f_util(cons(ap_c)) + beta*EV(ap_c,z_c)
! end do
v_temp = large_negative
forall (ap_c = 1:n_a, cons(ap_c) > 0.0d0)
    v_temp(ap_c) = f_util(cons(ap_c)) + beta*EV(ap_c,z_c)
end forall
ap_ind = maxloc(v_temp, dim=1)

Please see my repo for more information. (In the second block of code you can replace the forall with do concurrent, merge, or where if you don't like that forall is obsolescent; a sketch of the do concurrent variant is given below.)
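For reference, the do concurrent replacement (equivalent to the commented-out lines in the block above) would read:

v_temp = large_negative
do concurrent (ap_c = 1:n_a, cons(ap_c) > 0.0d0)
    v_temp(ap_c) = f_util(cons(ap_c)) + beta*EV(ap_c,z_c)
end do
ap_ind = maxloc(v_temp, dim=1)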

Back then (in 2024) we also found interesting performance differences between ifort and ifx (thanks to @ivanpribec for measuring the running times properly), with ifort being significantly faster. It would be interesting to rerun this test to see whether ifx has caught up.
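For anyone who wants to rerun the comparison, here is a minimal self-contained timing sketch; the grid size, parameter values, and log utility below are placeholder assumptions of mine, not the settings used in the repo. It isolates the inner a' search and times the explicit loop against the masked do concurrent plus maxloc version. Whether the do concurrent loop actually runs in parallel depends on the compiler and the flags it is given, so check the compiler documentation when interpreting the timings.

program time_maxsearch
    use, intrinsic :: iso_fortran_env, only: int64
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: n_a = 1000, n_rep = 100000
    real(dp), parameter :: beta = 0.96_dp, R = 1.04_dp   ! placeholder parameters
    real(dp), parameter :: large_negative = -1.0e10_dp
    real(dp) :: ap_grid(n_a), EV(n_a), cons(n_a), v_temp(n_a)
    real(dp) :: a_val, z_val, cons_s, v_max, v_try
    integer  :: ap_c, rep, ap_ind1, ap_ind2
    integer(int64) :: c0, c1, rate

    ! Placeholder data: evenly spaced asset grid and a random continuation value
    call random_number(EV)
    ap_grid = [(0.01_dp*real(ap_c, dp), ap_c = 1, n_a)]
    a_val = 5.0_dp
    z_val = 1.0_dp

    ! Version 1: explicit loop over a' with a running maximum
    call system_clock(c0, rate)
    do rep = 1, n_rep
        v_max   = large_negative
        ap_ind1 = 0
        do ap_c = 1, n_a
            cons_s = R*a_val + z_val - ap_grid(ap_c)
            if (cons_s > 0.0_dp) then
                v_try = f_util(cons_s) + beta*EV(ap_c)
                if (v_try > v_max) then
                    v_max   = v_try
                    ap_ind1 = ap_c
                end if
            end if
        end do
    end do
    call system_clock(c1)
    print '(a,f8.3,a)', 'explicit loop:        ', real(c1 - c0, dp)/real(rate, dp), ' s'

    ! Version 2: array syntax + masked do concurrent + maxloc
    call system_clock(c0)
    do rep = 1, n_rep
        cons   = R*a_val + z_val - ap_grid
        v_temp = large_negative
        do concurrent (ap_c = 1:n_a, cons(ap_c) > 0.0_dp)
            v_temp(ap_c) = f_util(cons(ap_c)) + beta*EV(ap_c)
        end do
        ap_ind2 = maxloc(v_temp, dim=1)
    end do
    call system_clock(c1)
    print '(a,f8.3,a)', 'do concurrent+maxloc: ', real(c1 - c0, dp)/real(rate, dp), ' s'

    if (ap_ind1 /= ap_ind2) print *, 'WARNING: the two versions disagree on the argmax'

contains

    pure real(dp) function f_util(c) result(u)
        ! Placeholder log utility (the repo may use a different functional form)
        real(dp), intent(in) :: c
        u = log(c)
    end function f_util

end program time_maxsearch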