Hello. I have a Monte Carlo simulation that I am preparing to parallelize. At this point the code is fairly large, but not unwieldy, so if any drastic structural changes need to happen I can still make them without too much hassle.
My simulation has a ton of variables everywhere, so when writing it I put everything that stores state for it into one big derived type S. I remember seeing that this is what people usually do for procedures with tons of inputs, to keep argument lists short: I can call procedures like call compute(S, x, y, z) instead of passing 10-20 separate arguments.
In preparing my code to become parallel (I am still somewhat new to parallel programming, having only done basic exercises and not a big project), I noticed that this makes S intent(inout) to nearly every procedure. Sometimes the effect on S is small (changing a scalar or two) and sometimes it is more drastic: I run a matrix computation with lots of changing matrices and workspace matrices. The main thing I want to parallelize with OpenMP is doing two or so matrix computations (not just a single dgemm, but many BLAS/LAPACK routines) at once. The matrices involved would be stored in S, so any routine I call in an OpenMP parallel section would take S with intent(inout). Is this sort of thing fine for OpenMP codes in Fortran? Would it be considerably better, performance-wise, if I separated components out to have more clearly defined intent(out) arguments (e.g., put all constant variables in one derived type, make dedicated workspace derived types for each routine, and pass individual matrices instead of accessing them through S)? I do not intend to use pointers at all, if that helps.
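To make the alternative concrete, here is a minimal sketch of the two layouts I am weighing; the type and routine names (sim_t, sim_const_t, sim_work_t, update_matrix) are just illustrative:

module sim_types
   implicit none
   integer, parameter :: dp = kind(1.0d0)

   ! Current style: everything lives in one big type, so it is intent(inout) everywhere
   type :: sim_t
      real(dp), allocatable :: A(:,:), B(:,:)   ! evolving matrices
      real(dp), allocatable :: work(:,:)        ! scratch space
      real(dp) :: beta, mu                      ! parameters fixed for the whole run
   end type sim_t

   ! Possible alternative: constants and per-routine workspace split out
   type :: sim_const_t
      real(dp) :: beta, mu                      ! would be intent(in) everywhere
   end type sim_const_t

   type :: sim_work_t
      real(dp), allocatable :: work(:,:)        ! owned by one routine (or one thread)
   end type sim_work_t
end module sim_types

! A routine would then take only what it actually touches, e.g.
!   call update_matrix(const, A, work)   ! const intent(in), A and work intent(inout)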
Also, on the topic of OpenMP: a question I have is about splitting off threads to do work and how many threads BLAS/LAPACK use. For example, I can set how many threads OpenBLAS uses through an environment variable. If I set this to, say, 4, and then split off 2 threads that each execute BLAS routines, does each BLAS routine run single-threaded or with 4 threads? If one thread finishes its work, could it then become available for the other thread's BLAS calls?
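To make the question concrete, here is the kind of setup I have in mind (the openblas_set_num_threads interface is my guess at the C binding, and the two dgemm calls are just stand-ins for the longer BLAS/LAPACK sequences):

program blas_thread_question
   implicit none
   integer, parameter :: dp = kind(1.0d0), n = 400
   real(dp), allocatable :: A(:,:), B(:,:), C1(:,:), C2(:,:)

   ! My assumption of the OpenBLAS entry point for controlling its thread count
   interface
      subroutine openblas_set_num_threads(nt) bind(C, name="openblas_set_num_threads")
         use iso_c_binding, only: c_int
         integer(c_int), value :: nt
      end subroutine
   end interface

   allocate(A(n,n), B(n,n), C1(n,n), C2(n,n))
   call random_number(A); call random_number(B)

   ! Suppose OPENBLAS_NUM_THREADS=4 in the environment.
   ! If I now split off 2 OpenMP threads, each calling a BLAS routine:
   !$omp parallel sections num_threads(2)
   !$omp section
   call dgemm('N', 'N', n, n, n, 1.0_dp, A, n, B, n, 0.0_dp, C1, n)
   !$omp section
   call dgemm('N', 'N', n, n, n, 1.0_dp, B, n, A, n, 0.0_dp, C2, n)
   !$omp end parallel sections

   ! Does each dgemm run with 1 thread, with 4, or something else?
   ! And is forcing one thread per call, e.g.
   !   call openblas_set_num_threads(1)
   ! the right way to control this?
end program blas_thread_question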
Unfortunately, I’ve found that parallelization is not a “universal” topic: different applications and data structures can call for quite different strategies.
Could you please describe the data structures you’re planning to use in a little more detail? A minimal example, with typical dimensions, would be great.
When using multi-threading, I prefer a separate local data structure for each thread, especially for private variables.
The problem I have with storing all information in one big derived type is that it can be difficult to assess for inefficiencies related to cache coherency.
Using a big derived type for referencing shared data may be ok, but not for private working variables.
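As a rough sketch of what I mean, with a placeholder workspace type (work_t) and trivial work standing in for the real computation: each thread gets its own workspace simply because the workspace is a local variable of the routine called inside the parallel region.

module thread_work_demo
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   type :: work_t
      real(dp), allocatable :: scratch(:,:)
   end type work_t
contains
   subroutine sweep(n, val, result)
      integer, intent(in) :: n
      real(dp), intent(in) :: val
      real(dp), intent(out) :: result
      type(work_t) :: w              ! local, so each calling thread has its own copy
      allocate(w%scratch(n, n))
      w%scratch = val                ! stand-in for real work on the private scratch
      result = sum(w%scratch)
   end subroutine sweep

   subroutine run(n)
      integer, intent(in) :: n
      real(dp) :: r(8)
      integer :: i
      !$omp parallel do private(i)
      do i = 1, 8
         call sweep(n, real(i, dp), r(i))   ! each thread works with its own work_t
      end do
      !$omp end parallel do
      print *, 'sum of all sweeps:', sum(r)
   end subroutine run
end module thread_work_demo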
do
   do
      do
         ! Many rank-1 updates of the matrices S%A and S%B
         call dger(S, S%A)
         call dger(S, S%B)
      end do
      ! Do some Monte Carlo measurements
      call measure(S, S%A, S%B)
   end do
   ! Occasionally S%A and S%B have to be recomputed
   call recompute(S, S%A)
   call recompute(S, S%B)
end do
Obviously the code is more complex than this, but this is the general structure of the part I want to parallelize right now. In the innermost do loop I have a small workspace; in the measurement step I have a lot of vectors/matrices that I fill with data; and in the recompute step I have a handful of workspace arrays (I think each recompute uses around 8 of them). All routines rely on some common variables that never change during the simulation. The innermost do loop only changes a few variables stored in S (mostly scalars and its workspace), along with another simple matrix that records how S%A and S%B should be recomputed. The measurement routine changes a lot, since it needs to store data. The recompute routine changes just S%A and S%B along with its workspace arrays.
Recomputing S%A and S%B takes a while, but I don’t need the inner loops to complete before starting on it. The primary thing I want to parallelize is running the innermost loops while working on this recomputation. The innermost loop does change an array that holds information needed for the recomputation, but it doesn’t change much between recomputations (I have enough information to do around 70% of the next recompute job before it is actually due). I want to be able to work on the recomputation while the rank-1 updates go on, then meet and finish it. The dimensions of the matrices involved are only a few hundred (e.g., roughly 400x400).
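Structurally, what I am hoping for is something like the toy sketch below: one thread keeps sweeping while the other works on the early part of the recompute from a snapshot, and the leftover part is folded in afterwards. The array names and the placeholder math are just stand-ins for my actual routines, and I know the real data dependencies would need care.

program overlap_sketch
   implicit none
   integer, parameter :: dp = kind(1.0d0), n = 400
   real(dp), allocatable :: A(:,:), B(:,:), Asnap(:,:), Anew(:,:)
   integer :: sweep

   allocate(A(n,n), B(n,n), Asnap(n,n), Anew(n,n))
   call random_number(A); call random_number(B)
   Asnap = A                  ! snapshot of the data the recompute can already use

   !$omp parallel sections num_threads(2) private(sweep)
   !$omp section
   ! One thread keeps doing the rank-1 update sweeps (stand-in for the inner loops)
   do sweep = 1, 100
      call random_number(B)
      A = A + 1.0e-3_dp * B
   end do
   !$omp section
   ! The other thread starts the expensive recompute from the snapshot
   ! (stand-in for the ~70% that can be done before the sweeps finish)
   Anew = matmul(Asnap, transpose(Asnap))
   !$omp end parallel sections

   ! Meet up and fold in the contributions from the last sweeps (serial here)
   Anew = Anew + (A - Asnap)
end program overlap_sketch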