Parallelization of specialized code

I have been developing free thermodynamic software (OpenCalphad) for the last 10 years and it is used by a few groups independently of me. The main task of the software is to calculate the equilibrium of a system with known external conditions (T, P, N_i or \mu_i). My own interest is to develop models and calculate phase diagrams and other diagrams, to test the software and to develop databases. Typically such a calculation takes a few seconds to a few minutes.
The heavy use of this kind of software is the simulation of phase transformations as part of a kinetic software package which takes care of transport and diffusion of the different elements during a process. Such a simulation involves millions of gridpoints, each of which represents a local equilibrium, and it can take days or weeks to finish. In order to speed up such calculations I have made it possible to run several equilibrium calculations in parallel, by separating the thermodynamic data from the actual phase amounts and compositions of each equilibrium.
There are some possibilities to speed up the calculation of each equilibrium, for example by calculating each phase involved in an equilibrium in parallel, but I assume that during a simulation the equilibrium in a gridpoint will be assigned a single CPU and there is no point in trying to use several. Or is this assumption no longer valid with the development of GPUs and other hardware?
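The separation described above, read-only thermodynamic data shared by all workers plus independent per-gridpoint state, makes the gridpoint loop embarrassingly parallel. A minimal sketch of that structure in Python (all names and the placeholder "model" are hypothetical; the real solver is the Fortran equilibrium code):

```python
from concurrent.futures import ThreadPoolExecutor

# Read-only thermodynamic "database", shared by every worker
# (a stand-in for the real model parameters).
THERMO_DB = {"R": 8.314}

def solve_equilibrium(db, point):
    """Hypothetical 0D local-equilibrium solve: a pure function of the
    shared read-only database and the point's own (T, N), so concurrent
    calls never interfere with each other."""
    T, N = point["T"], point["N"]
    # placeholder "model" standing in for the real equilibrium calculation
    return [db["R"] * T * n for n in N]

def solve_all(points, workers=4):
    # Each gridpoint is independent: an embarrassingly parallel map.
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(lambda p: solve_equilibrium(THERMO_DB, p), points))

grid = [{"T": 300.0 + g, "N": [1.0, 2.0]} for g in range(1000)]
mus = solve_all(grid)
```

In the Fortran code the same pattern would be an OpenMP parallel loop over equilibrium records, with the database arrays shared and each record private to one thread.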


When you talk about solving these equilibria on a computational mesh - is each of them solved independently of the others (at least for the equilibrium source terms)? If so, that’s a massively parallel problem like chemistry source terms are in combustion.

GPUs may help as long as the problem can be written in a way that it can run on small blocks - so you can practically run as many gridpoints on a single GPU as you could run on hundreds/thousands of CPUs.

In combustion that’s not possible (stiff chemistry, implicit solvers, etc.), so we resort to domain simplification: instead of solving one problem per gridpoint, we identify regions of the 3D domain that have an essentially similar thermodynamic state and solve by “clusters” of gridpoints that share an approximately similar composition. That usually enables speedups of 2–3 orders of magnitude. I suggest looking up ISAT, clustering, and “adaptive multi-zone” methods in combustion.
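One way to picture the clustering idea: quantize each gridpoint's state into a coarse bin and do the expensive solve once per bin. A toy sketch, where the bin tolerances `dT` and `dx` are made-up numbers and the lambda stands in for the real equilibrium solve:

```python
def cluster_key(T, comp, dT=25.0, dx=0.05):
    """Quantize (T, composition) into a coarse bin; gridpoints sharing a
    key are treated as thermodynamically 'essentially similar'."""
    return (round(T / dT),) + tuple(round(x / dx) for x in comp)

def solve_clustered(points, solve):
    """Do the expensive 0D solve once per cluster, reuse it for members."""
    cache, results = {}, []
    for T, comp in points:
        key = cluster_key(T, comp)
        if key not in cache:
            cache[key] = solve(T, comp)   # stand-in for the equilibrium solve
        results.append(cache[key])
    return results, len(cache)

# 10000 gridpoints whose temperatures drift over a narrow range:
points = [(300.0 + 10.0 * (g % 8), [0.3, 0.7]) for g in range(10000)]
results, n_solves = solve_clustered(points, lambda T, comp: T)
print(n_solves, "solves for", len(points), "gridpoints")
```

Methods like ISAT add error control on top of this caching idea, but the basic trade is the same: accept a tolerance on the state in exchange for far fewer solves.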


I am not involved in the simulation software; the only interaction I have had with those who calculate diffusion and convection is that the thermodynamic equilibrium calculation is too slow. They want me to solve the equilibrium calculation, with some 20 elements and 200 phases, in a millisecond at each gridpoint, as they need the phase compositions and chemical potentials to move the elements around in the 3D (or 2D) simulation space.

I guess I could speed up a single equilibrium calculation by using multiple CPUs, but I think almost all CPUs are already busy solving the equilibrium calculations at all the other 10000 gridpoints.

Arranging the gridpoints in domains with similar T, P and composition is not really my problem. I find it sufficiently complicated to handle all kinds of strange models that exist in thermodynamics.

Is your computation basically a 0D problem? And your customer project wants to compute a huge set of these 0D computations? And are those 0D computations independent of all the others?

My software deals with thermodynamics; I do not deal with space. I have a given amount of elements, T and P, and should calculate the equilibrium using complex models for the interactions between the elements in many different solid and liquid phases (most of them unstable, but I do not know which until I have found the equilibrium). Dealing with space, 1D, 2D or 3D, is not my problem; that is dealt with by those solving the kinetic equations. But they need to know the chemical potentials of the elements, which phases are stable, and their compositions at each point (or line segment or volume unit) they have.

My question was whether I should try to parallelize my thermodynamic software to speed it up, or whether the kinetic software already uses up all the CPUs/GPUs available. The kinetic specialists I deal with always complain that the thermodynamic calculations take too much time.

@eelis is right in using the term 0D: these kinds of point-wise thermodynamic computations are 0D problems with respect to the “caller” application, which indeed is usually grid-based, be it FEM, CFD, FDT, etc. This is at least the usual jargon from a numerical perspective.

The use of parallel resources is typically managed by the caller application because, as @FedericoPerini exemplified, the best approach is to cluster thermodynamically similar regions and then do the “0D” complex computations in parallel according to some batching approach determined by the caller application.

As the provider of the logic for the lower-level (spatially speaking) problem, you can only do so much, such as ensuring that each call to your kernels is as efficient as possible: proper internal vectorization, and avoiding at all costs global mutable state which could hinder multi-threading. But this won’t happen automatically. You should at least profile your code to check that concurrent independent calls to your kernels are possible.
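A minimal illustration of the global-state pitfall (names invented; in Fortran the analogue would be module variables or SAVE'd locals versus arguments and derived types):

```python
# BAD: a module-level scratch buffer shared by every call.
# Two threads calling this concurrently overwrite each other's data.
_SCRATCH = [0.0] * 8

def phase_energy_unsafe(fractions):
    for i, x in enumerate(fractions):
        _SCRATCH[i] = x * x          # race: shared mutable state
    return sum(_SCRATCH[:len(fractions)])

# GOOD: all state lives in arguments and locals, so concurrent
# independent calls cannot interfere with each other.
def phase_energy_safe(fractions, work=None):
    work = work if work is not None else [0.0] * len(fractions)
    for i, x in enumerate(fractions):
        work[i] = x * x              # each call owns its workspace
    return sum(work[:len(fractions)])

print(phase_energy_safe([1.0, 2.0]))  # 5.0
```

The optional `work` argument also lets a caller reuse a per-thread workspace across calls, avoiding repeated allocation without reintroducing shared state.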

From there on, it is the responsibility of the consumer of your logic to set up proper batching depending on their own spatial partitioning scheme.

A more basic question then is: how do your users use your library? Is it by system calls to executable programs (file exchange), or can they statically link and call your procedures within their applications and pass data directly through memory?

The main complaint I hear is that my thermodynamic code is too slow to calculate the equilibrium at each local point. As I need to invert quite large matrices for quite a number of different phases to calculate the set of stable phases and the chemical potentials driving the diffusion, I wonder if I could perform some of these in parallel; the properties of each phase depend only on the local T, P and composition. However, I doubt there are any CPUs/GPUs available when the kinetic model handles many gridpoints. The library I developed allows an application to create as many equilibrium records as memory allows, each handling T, P and independent constitutions of the phases, so each equilibrium can be calculated in parallel. But I am not really involved in the simulation software; I was just thinking about how I could speed up the equilibrium calculation.

You could in principle, but practically speaking you might run into issues if the caller application is already multithreading or multiprocessing and you want to spawn sub-threads within it.

The GPU is an interesting idea, but that depends on your users having HPC GPUs available and on you providing linking mechanisms to vendor linear-algebra libraries. Also, as of today, CPUs and GPUs have separate memory regions which need to be managed properly.

You could first try to see whether the inversion problem could be handled with other numerical strategies which could increase performance.
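One such strategy, if the code currently forms explicit inverses: a Newton-type step only needs the linear solve J·dx = r, and a factor-and-solve (LAPACK `dgesv`, already covered by an existing LAPACK/BLAS dependency) costs roughly a third of the flops of forming J⁻¹ and is generally more accurate. A NumPy sketch with a stand-in matrix, not the real Jacobians:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# Well-conditioned stand-in "Jacobian" and residual:
J = rng.standard_normal((n, n)) + n * np.eye(n)
r = rng.standard_normal(n)

# Explicit inverse: ~3x the flops of a factor+solve, and less accurate.
dx_via_inverse = np.linalg.inv(J) @ r

# LU factorization + triangular solves (LAPACK *gesv under the hood).
dx_via_solve = np.linalg.solve(J, r)

print(np.allclose(dx_via_inverse, dx_via_solve))  # same answer, cheaper route
```

When the same matrix is reused for several right-hand sides (e.g. several phases sharing a factorized matrix), factoring once with `dgetrf` and reusing the factors amortizes the cost further.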

Usually, before trying to parallelize, one should try to get the most performance out of the sequential implementation.

Is the software (OpenCalphad) the following one…?

I wonder if such calculations already utilize some fast linear-algebra libraries (like MKL)? Also, I wonder if the calculation of the equilibrium at each gridpoint is done “freshly” every time the library is called? I thought it might be useful to re-use the result of the previous time step at the same gridpoint (if available on the caller side) as the initial guess for the iterations in the library, and to “refine” or “update” the solution with only a few more iterations (if the application is some time evolution, for example).
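The warm-start idea can be sketched with a toy Newton iteration (solving x² = a as a stand-in for the real equilibrium solve): when conditions drift slowly between time steps, reusing the previous solution as the initial guess cuts the iteration count sharply.

```python
def newton_sqrt(a, x0, tol=1e-12, max_iter=100):
    """Newton iteration for x*x = a; returns (root, iteration count)."""
    x = x0
    for it in range(1, max_iter + 1):
        step = (x * x - a) / (2.0 * x)
        x -= step
        if abs(step) < tol:
            return x, it
    return x, max_iter

# "Time evolution": the condition at one gridpoint drifts slowly.
targets = [100.0 + 0.1 * k for k in range(10)]

# Cold start: a fresh (poor) initial guess at every time step.
cold = sum(newton_sqrt(a, 1.0)[1] for a in targets)

# Warm start: reuse the previous time step's root as the guess.
x, warm = 1.0, 0
for a in targets:
    x, it = newton_sqrt(a, x)
    warm += it
print("cold-start iterations:", cold, " warm-start iterations:", warm)
```

The same logic applies to a full equilibrium solver: a converged set of stable phases and constitutions from the previous time step is usually a near-feasible starting point, so only a few refinement iterations are needed.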


Yes, I have been working with OC for the last 10 years. It is all Fortran and I use LAPACK/BLAS for the matrix handling. I provide a fairly old version of these routines, and one user has preferred to link to an in-house optimized version. On my MacPro I have 14 threads, and my test with 400 equilibria calculated in parallel takes 73 clockcycles, whereas running them all on a single thread takes 424, which I guess is OK.
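For context, a quick arithmetic check on those numbers (treating the 73 and 424 figures as wall-clock times, and assuming Amdahl-style scaling, which is only one possible model for the gap):

```python
def amdahl_parallel_fraction(speedup, threads):
    """Invert Amdahl's law S = 1/((1-p) + p/N) for the parallel fraction p."""
    return (1 - 1 / speedup) / (1 - 1 / threads)

speedup = 424 / 73            # ~5.8x on 14 threads
efficiency = speedup / 14     # ~41% parallel efficiency
p = amdahl_parallel_fraction(speedup, 14)
print(f"speedup {speedup:.1f}x, efficiency {efficiency:.0%}, "
      f"implied parallel fraction {p:.0%}")
```

Under that assumption roughly a tenth of the work behaves serially, which would be the first thing to look for in a profile before adding more parallelism.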

I am quite satisfied if there is not much to gain by trying to parallelize the equilibrium calculation itself. I have a lot of do loops and I have followed some discussions on “do concurrent”, but I do not think that is useful for OC.