Parallelize this subroutine with Open MPI

Hello guys, i’m physicist form Brazil (sorry for bad english) and i work with N-body simulations.
I’m trying to parallelize the loops of that subroutine for my program but it’s not working, could someone help me, please?
Thanks

      SUBROUTINE RA15(X,V,TF,XL,LL,NV,NCLASS,NOR,OS,FIXED)
C  Integrator RADAU by E. Everhart, Physics Department, University of Denver
C  This 15th-order version, called RA15, is written out for faster execution.
C  y'=F(y,t) is  NCLASS=1,  y"=F(y,t) is NCLASS= -2,  y"=F(y',y,t) is NCLASS=2
C  TF is t(final) - t(initial). It may be negative for backward integration.
C  NV = the number of simultaneous differential equations.
C  The dimensioning below assumes NV will not be larger than 18.
C  LL controls sequence size. Thus SS=10**(-LL) controls the size of a term.
C  A typical LL-value is in the range 6 to 12 for this order 11 program.
C  However, if LL.LT.0 then XL is the constant sequence size used.
C  X and V enter as the starting position-velocity vector, and are output as
C  the final position-velocity vector.
C  Integration is in double precision. A 64-bit double-word is assumed.
      IMPLICIT REAL*8 (A-H,O-Z)
      REAL*8 NEXTOUTPUT, TVAL,PW
      DIMENSION X(12),V(12),F1(18),FJ(18),C(21),D(21),R(21),Y(18),Z(18),
     A     B(7,18),G(7,18),E(7,18),BD(7,18),H(8),W(7),U(7),NW(8) 
      common/ivolta/ivolta
      LOGICAL NPQ,NSF,NPER,NCL,NES,fixed
      
c:      logical ivolta ! 05/12/2001  ! ver subrotina OUTPUT    
   
      ivolta=0  ! 05/12/2001 !No Output olho emax1: se maior que ***? (0.9) paro a integracao 
c:                            fazendo IVOLTA=333. No inicio comeco com IVOLTA=0 (sempre)      
      DATA NW/0,0,1,3,6,10,15,21/
      DATA ZERO, HALF, ONE,SR/0.0D0, 0.5D0, 1.0D0,1.4D0/
C  These H values are the Gauss-Radau spacings, scaled to the range 0 to 1,
C  for integrating to order 15.
      DATA H/         0.D0, .05626256053692215D0, .18024069173689236D0,
     A.35262471711316964D0, .54715362633055538D0, .73421017721541053D0,
     B.88532094683909577D0, .97752061356128750D0/
C  The sum of the H-values should be 3.73333333333333333
      NPER=.FALSE.
      NSF=.FALSE.
      NCL=NCLASS.EQ.1
      NPQ=NCLASS.LT.2
      Out=OS
C  y'=F(y,t)  NCL=.TRUE.    y"=F(y,t)  NCL=.FALSE.   y"=F(y',y,t) NCL=.FALSE.
C  NCLASS=1   NPQ=.TRUE.    NCLASS= -2 NPQ=.TRUE.    NCLASS= 2    NPQ=.FALSE.
C  NSF is .FALSE. on starting sequence, otherwise .TRUE.
C  NPER is .TRUE. only on last sequence of the integration.
C  NES is .TRUE. only if LL is negative. Then the sequence size is XL.
      DIR=ONE
      IF(TF.LT.ZERO) DIR=-ONE
      NES=LL.LT.0
      XL=DIR*DABS(XL)
      PW=1./9.
C  Evaluate the constants in the W-, U-, C-, D-, and R-vectors
      DO 14 N=2,8
      WW=N+N*N
      IF(NCL) WW=N
      W(N-1)=ONE/WW
      WW=N
  14  U(N-1)=ONE/WW
      DO 22 K=1,NV
      IF(NCL) V(K)=ZERO
      DO 22 L=1,7
      BD(L,K)=ZERO
  22  B(L,K)=ZERO
      W1=HALF
      IF(NCL) W1=ONE
      C(1)=-H(2)
      D(1)=H(2)
      R(1)=ONE/(H(3)-H(2))
      LA=1
      LC=1
      DO 73 K=3,7
      LB=LA
      LA=LC+1
      LC=NW(K+1)
      C(LA)=-H(K)*C(LB)
      C(LC)=C(LA-1)-H(K)
      D(LA)=H(2)*D(LB)
      D(LC)=-C(LC)
      R(LA)=ONE/(H(K+1)-H(2))
      R(LC)=ONE/(H(K+1)-H(K))
      IF(K.EQ.3) GO TO 73
      DO 72 L=4,K
      LD=LA+L-3
      LE=LB+L-4
      C(LD)=C(LE)-H(K)*C(LE+1)
      D(LD)=D(LE)+H(L-1)*D(LE+1)
  72  R(LD)=ONE/(H(K+1)-H(L-1))
  73  CONTINUE
      SS=10.**(-LL)
C  The statements above are used only once in an integration to set up the
C  constants. They use less than a second of execution time.  Next set in
C  a reasonable estimate to TP based on experience. Same sign as DIR.
C  An initial first sequence size can be set with XL even with LL positive.
      TP=0.1D0*DIR
      IF(XL.NE.ZERO) TP=XL
      IF(NES) TP=XL
      IF(TP/TF.GT.HALF) TP=HALF*TF
      NCOUNT=0

C  An * is the symbol for writing on the monitor. The printer is unit 4.
C  Line 4000 is the starting place of the first sequence.
4000  NS=0
      NF=0
      NI=6
      TM=ZERO
      CALL FORCE (X, V, ZERO, F1)
      NF=NF+1
C Line 722 is begins every sequence after the first. First find new beta-
C  values from the predicted B-values, following Eq. (2.7) in text.
 722  DO 58 K=1,NV
      G(1,K)=B(1,K)+D(1)*B(2,K)+
     X  D(2)*B(3,K)+D(4)*B(4,K)+D( 7)*B(5,K)+D(11)*B(6,K)+D(16)*B(7,K)
      G(2,K)=            B(2,K)+
     X  D(3)*B(3,K)+D(5)*B(4,K)+D( 8)*B(5,K)+D(12)*B(6,K)+D(17)*B(7,K)
      G(3,K)=B(3,K)+D(6)*B(4,K)+D( 9)*B(5,K)+D(13)*B(6,K)+D(18)*B(7,K)
      G(4,K)=            B(4,K)+D(10)*B(5,K)+D(14)*B(6,K)+D(19)*B(7,K)
      G(5,K)=                         B(5,K)+D(15)*B(6,K)+D(20)*B(7,K)
      G(6,K)=                                      B(6,K)+D(21)*B(7,K)
  58  G(7,K)=                                                   B(7,K)
      T=TP
      T2=T*T
      IF(NCL) T2=T
      TVAL=DABS(T)
  
 

      IF(fixed.and.DABS(DIR*TM-OUT+OS).le.1.d-8) then ! antes o call abaixo, vinha nesta linha
      
      call OUTPUT(NF,NS,X,V,F1,T,TM,TF,.not.nsf,.FALSE.) ! 05/12/01: acrescentei IVOLTA
     
      if(ivolta.eq.333) return ! 05/12/2001: neste caso emax1>=0.9 entao retornar (finalizar)
                                                  end if
                                                  
      IF (fixed.or.(out-os-dir*tm).gt.1.d-8) go to 246
      call OUTPUT(NF,NS,X,V,F1,T,TM,TF,.not.nsf,.FALSE.) ! 05/12/01: acrescentei IVOLTA
      
      if(ivolta.eq.333) return  ! 05/12/2001: neste caso emax1>=0.9 entao retornar (finalizar)
      
      Out=out+os
  246 continue
  
      
C  Loop 175 is 6 iterations on first sequence and two iterations therafter.
      DO 175 M=1,NI
C  Loop 174 is for each substep within a sequence.
      DO 174 J=2,8
      JD=J-1
      JDM=J-2
      S=H(J)
      Q=S
      IF(NCL) Q=ONE
C  Use Eqs. (2.9) and (2.10) of text to predict positions at each aubstep.
C  These collapsed series are broken into two parts because an otherwise
C  excellent  compiler could not handle the complicated expression.
      DO 130 K=1,NV
      A=W(3)*B(3,K)+S*(W(4)*B(4,K)+S*(W(5)*B(5,K)+S*(W(6)*B(6,K)+
     V   S*W(7)*B(7,K))))
      Y(K)=X(K)+Q*(T*V(K)+T2*S*(F1(K)*W1+S*(W(1)*B(1,K)+S*(W(2)*B(2,K)
     X  +S*A))))
      IF(NPQ) GO TO 130
C  Next are calculated the velocity predictors need for general class II.
      A=U(3)*B(3,K)+S*(U(4)*B(4,K)+S*(U(5)*B(5,K)+S*(U(6)*B(6,K)+
     T    S*U(7)*B(7,K))))
      Z(K)=V(K)+S*T*(F1(K)+S*(U(1)*B(1,K)+S*(U(2)*B(2,K)+S*A)))
 130  CONTINUE
C  Find forces at each substep.
      CALL FORCE(Y,Z,TM+S*T,FJ)
      NF=NF+1
      DO 171 K=1,NV
C  Find G-value for the force FJ found at the current substep. This
C  section, including the many-branched GOTO, uses Eq. (2.4) of text.
      TEMP=G(JD,K)
      GK=(FJ(K)-F1(K))/S
      GO TO (102,102,103,104,105,106,107,108),J
 102  G(1,K)=GK
      GO TO 160
 103  G(2,K)=(GK-G(1,K))*R(1)
      GO TO 160
 104  G(3,K)=((GK-G(1,K))*R(2)-G(2,K))*R(3)
      GO TO 160
 105  G(4,K)=(((GK-G(1,K))*R(4)-G(2,K))*R(5)-G(3,K))*R(6)
      GO TO 160
 106  G(5,K)=((((GK-G(1,K))*R(7)-G(2,K))*R(8)-G(3,K))*R(9)-
     X     G(4,K))*R(10)
      GO TO 160
 107  G(6,K)=(((((GK-G(1,K))*R(11)-G(2,K))*R(12)-G(3,K))*R(13)-
     X     G(4,K))*R(14)-G(5,K))*R(15)
      GO TO 160
 108  G(7,K)=((((((GK-G(1,K))*R(16)-G(2,K))*R(17)-G(3,K))*R(18)-
     X     G(4,K))*R(19)-G(5,K))*R(20)-G(6,K))*R(21)
C  Upgrade all B-values.
 160  TEMP=G(JD,K)-TEMP
      B(JD,K)=B(JD,K)+TEMP
C  TEMP is now the improvement on G(JD,K) over its former value.
C  Now we upgrade the B-value using this dfference in the one term.
C  This section is based on Eq. (2.5).
      GO TO (171,171,203,204,205,206,207,208),J
 203  B(1,K)=B(1,K)+C(1)*TEMP
      GO TO 171
 204  B(1,K)=B(1,K)+C(2)*TEMP
      B(2,K)=B(2,K)+C(3)*TEMP
      GO TO 171
 205  B(1,K)=B(1,K)+C(4)*TEMP
      B(2,K)=B(2,K)+C(5)*TEMP
      B(3,K)=B(3,K)+C(6)*TEMP
      GO TO 171
 206  B(1,K)=B(1,K)+C(7)*TEMP
      B(2,K)=B(2,K)+C(8)*TEMP
      B(3,K)=B(3,K)+C(9)*TEMP
      B(4,K)=B(4,K)+C(10)*TEMP
      GO TO 171
 207  B(1,K)=B(1,K)+C(11)*TEMP
      B(2,K)=B(2,K)+C(12)*TEMP
      B(3,K)=B(3,K)+C(13)*TEMP
      B(4,K)=B(4,K)+C(14)*TEMP
      B(5,K)=B(5,K)+C(15)*TEMP
      GO TO 171
 208  B(1,K)=B(1,K)+C(16)*TEMP
      B(2,K)=B(2,K)+C(17)*TEMP
      B(3,K)=B(3,K)+C(18)*TEMP
      B(4,K)=B(4,K)+C(19)*TEMP
      B(5,K)=B(5,K)+C(20)*TEMP
      B(6,K)=B(6,K)+C(21)*TEMP
 171  CONTINUE
 174  CONTINUE
      IF(NES.OR.M.LT.NI) GO TO 175
C  Integration of sequence is over. Next is sequence size control.
      HV=ZERO
      DO 635 K=1,NV
 635  HV=DMAX1(HV,DABS(B(7,K)))
      HV=HV*W(7)/TVAL**7
 175  CONTINUE
      IF (NSF) GO TO 180
      IF(.NOT.NES) TP=(SS/HV)**PW*DIR
      IF(NES) TP=XL
      IF(NES) GO TO 170
      IF(TP/T.GT.ONE) GO TO 170
   8  FORMAT (A,2X,2I2,2D18.10)
      TP=.8D0*TP
      NCOUNT=NCOUNT+1
      IF(NCOUNT.GT.10) RETURN
     
      
C  Restart with 0.8x sequence size if new size called for is smaller than
C  originally chosen starting sequence size on first sequence.
      GO TO 4000
 170  NSF=.TRUE.
C Loop 35 finds new X and V values at end of sequence using Eqs. (2.11),(2.12)
 180  DO 35 K=1,NV
      X(K)=X(K)+V(K)*T+T2*(F1(K)*W1+B(1,K)*W(1)+B(2,K)*W(2)+B(3,K)*W(3)
     X    +B(4,K)*W(4)+B(5,K)*W(5)+B(6,K)*W(6)+B(7,K)*W(7))
      IF(NCL) GO TO 35
      V(K)=V(K)+T*(F1(K)+B(1,K)*U(1)+B(2,K)*U(2)+B(3,K)*U(3)
     V    +B(4,K)*U(4)+B(5,K)*U(5)+B(6,K)*U(6)+B(7,K)*U(7))
  35  CONTINUE
      TM=TM+T
      NS=NS+1
C  Return if done.
      IF(.NOT.NPER) GO TO 78
      CALL OUTPUT(NF,NS,X,V,F1,T,TM,TF,.not.nsf,NPER)
      RETURN
C  Control on size of next sequence and adjust last sequence to exactly
C  cover the integration span. NPER=.TRUE. set on last sequence.
  78  CALL FORCE (X,V,TM,F1)
      NF=NF+1
      IF(NES) GO TO 341
      TP=DIR*(SS/HV)**PW
      IF(TP/T.GT.SR) TP=T*SR
 341  IF(NES) TP=XL
      IF(DIR*(TM+TP).LT.DIR*TF-1.D-8) GO TO 66
      TP=TF-TM
      NPER=.TRUE.

 66   IF (.not.fixed.or.DIR*(TM+TP).LT.OUT-1.D-8) GO TO 77
      TP=DIR*OUT-TM
      OUT=OUT+OS 
      
C  Now predict B-values for next step. The predicted values from the preceding
C  sequence were saved in the E-matrix. Te correction BD between the actual
C  B-values found and these predicted values is applied in advance to the
C  next sequence. The gain in accuracy is significant. Using Eqs. (2.13):
  77  Q=TP/T
      DO 39 K=1,NV
      IF(NS.EQ.1) GO TO 31
      DO 20 J=1,7
  20  BD(J,K)=B(J,K)-E(J,K)
  31  E(1,K)=      Q*(B(1,K)+ 2.D0*B(2,K)+ 3.D0*B(3,K)+
     X           4.D0*B(4,K)+ 5.D0*B(5,K)+ 6.D0*B(6,K)+ 7.D0*B(7,K))
      E(2,K)=                Q**2*(B(2,K)+ 3.D0*B(3,K)+
     Y           6.D0*B(4,K)+10.D0*B(5,K)+15.D0*B(6,K)+21.D0*B(7,K))
      E(3,K)=                             Q**3*(B(3,K)+
     Z           4.D0*B(4,K)+10.D0*B(5,K)+20.D0*B(6,K)+35.D0*B(7,K))
      E(4,K)=   Q**4*(B(4,K)+ 5.D0*B(5,K)+15.D0*B(6,K)+35.D0*B(7,K))
      E(5,K)=                Q**5*(B(5,K)+ 6.D0*B(6,K)+21.D0*B(7,K))
      E(6,K)=                             Q**6*(B(6,K)+ 7.D0*B(7,K))
      E(7,K)=                                           Q**7*B(7,K)
      DO 39 L=1,7
  39  B(L,K)=E(L,K)+BD(L,K)
C  Two iterations for every sequence after the first.
      NI=2
      GO TO 722
      END

3 Likes

Welcome to the forum. Could you please post your parallelized code and say what the error is?

1 Like

To parallelize, we need an understanding of what calculations can be done in parallel; calculations that are the same form but are independent and for different data cases, and with sufficient computation to be effective.

DO 175 M (lines 141-229) might be an option, but this looks to require many private copies of arrays.
B(7,18) appears to be an accumulation throughout loops 175 and 174, but used in loop 130, so the complexity of this interaction through DO 175 M would require a better understanding than my quick overview.

Can you identify repeated independent calculation that could be done in parallel ?
I don’t think I can.

MPI in my opinion, works best if you can clearly or easily distribute almost equal amount of jobs on each of the cores involved. The communications between each cores should be the fewer the better.

If your job is to solve, like a system of 10^4 ODEs, then perhaps you might want to use parallelized solver like Sundials. If this is what you want, I guess the best option is perhaps to just use Sundials instead as suggested by @nicholaswogan, @ivanpribec etal,

On the other hand, if you want to solve like 3-ODE system for 10^4 times, then say you have 100 cores, you can distribute 100 of such 3-ODE systems in each of the cores.
In general, I think you may want to isolate the part you want to parallelize

do j = 1, n
  XXXXXXXXX # you want to parallelize this part
enddo

into an independent subroutine first, call it subroutine AAA.

do j = 1, n
   call  AAA(j,...)  # AAA needs to be completely and absolutely an independent subroutine. 
enddo

Then in MPI it would be more or less like below I guess,

do i = 0, nproc()-1    # say 100 cores, so from rank0 to rank99
  if (myrank()==i) then
    jstart_i = ...
    jend_i = ...
    do j =jstart_i, jend_i # distribute corresponding jobs from jstart_i to jend_i on rank i 
      call AAA(j,...)  
    enddo
   # when finished exit the do loop
  endif
enddo

Or perhaps you do not need the loop at all. Just directly let each rank do the subroutine AAA with the jobs they are assigned.

1 Like

I learned the most important parts of OpenMP from this tutorial: Using OpenMP with Fortran — Research Computing University of Colorado Boulder documentation . Huge fan of OpenMP.

2 Likes