Parallelization of specialized code

I have been developing free thermodynamic software (OpenCalphad) for the last 10 years and it is used by a few groups independently of me. The main task of the software is to calculate the equilibrium of a system with known external conditions (T, P, N_i or \mu_i). My own interest is to develop models and calculate phase diagrams and other diagrams, to test the software and to develop databases. Typically such a calculation takes a few seconds to a few minutes.
The heavy use of this kind of software is the simulation of phase transformations as part of a kinetic software package which takes care of transport and diffusion of the different elements during a process. Such a simulation involves millions of gridpoints, each of which represents a local equilibrium, and it can take days or weeks to finish. In order to speed up such calculations I have made it possible to run several equilibrium calculations in parallel, by separating the thermodynamic data from the actual phase amounts and compositions of each equilibrium.
There are some possibilities to speed up the calculation of each equilibrium, for example by calculating each phase involved in an equilibrium in parallel, but I assume that during a simulation the equilibrium in a gridpoint will be assigned a single CPU and there is no point in trying to use several. Or is this assumption no longer valid with the development of GPUs and other hardware?
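The separation described above, read-only thermodynamic data shared by all workers plus independent per-gridpoint state, makes the gridpoint loop embarrassingly parallel. A minimal sketch of that structure in Python (all names and the placeholder "model" are hypothetical; the real solver is the Fortran equilibrium code):

```python
from concurrent.futures import ThreadPoolExecutor

# Read-only thermodynamic "database", shared by every worker
# (a stand-in for the real model parameters).
THERMO_DB = {"R": 8.314}

def solve_equilibrium(db, point):
    """Hypothetical 0D local-equilibrium solve: a pure function of the
    shared read-only database and the point's own (T, N), so concurrent
    calls never interfere with each other."""
    T, N = point["T"], point["N"]
    # placeholder "model" standing in for the real equilibrium calculation
    return [db["R"] * T * n for n in N]

def solve_all(points, workers=4):
    # Each gridpoint is independent: an embarrassingly parallel map.
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(lambda p: solve_equilibrium(THERMO_DB, p), points))

grid = [{"T": 300.0 + g, "N": [1.0, 2.0]} for g in range(1000)]
mus = solve_all(grid)
```

In the Fortran code the same pattern would be an OpenMP parallel loop over equilibrium records, with the database arrays shared and each record private to one thread.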


When you talk about solving these equilibria on a computational mesh - is each of them solved independently of the others (at least for the equilibrium source terms)? If so, that’s a massively parallel problem like chemistry source terms are in combustion.

GPUs may help as long as the problem can be written in a way that it can run on small blocks - so you can practically run as many gridpoints on a single GPU as you could run on hundreds/thousands of CPUs.

In combustion that’s not possible (stiff chemistry, implicit solvers, etc.), so we resort to domain simplification: instead of solving one problem per gridpoint, we identify regions of the 3D domain that have an essentially similar thermodynamic state and solve by “clusters” of gridpoints that share an approximately similar composition. That usually enables speedups of 2–3 orders of magnitude. I suggest looking up ISAT, clustering, and “adaptive multi-zone” methods in combustion.
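One way to picture the clustering idea: quantize each gridpoint's state into a coarse bin and do the expensive solve once per bin. A toy sketch, where the bin tolerances `dT` and `dx` are made-up numbers and the lambda stands in for the real equilibrium solve:

```python
def cluster_key(T, comp, dT=25.0, dx=0.05):
    """Quantize (T, composition) into a coarse bin; gridpoints sharing a
    key are treated as thermodynamically 'essentially similar'."""
    return (round(T / dT),) + tuple(round(x / dx) for x in comp)

def solve_clustered(points, solve):
    """Do the expensive 0D solve once per cluster, reuse it for members."""
    cache, results = {}, []
    for T, comp in points:
        key = cluster_key(T, comp)
        if key not in cache:
            cache[key] = solve(T, comp)   # stand-in for the equilibrium solve
        results.append(cache[key])
    return results, len(cache)

# 10000 gridpoints whose temperatures drift over a narrow range:
points = [(300.0 + 10.0 * (g % 8), [0.3, 0.7]) for g in range(10000)]
results, n_solves = solve_clustered(points, lambda T, comp: T)
print(n_solves, "solves for", len(points), "gridpoints")
```

Methods like ISAT add error control on top of this caching idea, but the basic trade is the same: accept a tolerance on the state in exchange for far fewer solves.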


I am not involved in the simulation software; the only interaction I have had with those who calculate diffusion and convection is that the thermodynamic equilibrium calculation is too slow. They want me to solve the equilibrium calculation, with some 20 elements and 200 phases, in a millisecond at each gridpoint, as they need the phase compositions and chemical potentials to move the elements around in the 3D (or 2D) simulation space.

I guess I could speed up a single equilibrium calculation by using multiple CPUs, but I think almost all CPUs are already busy solving the equilibrium calculations at all the other 10000 gridpoints.

Arranging the gridpoints in domains with similar T, P and composition is not really my problem. I find it sufficiently complicated to handle all kinds of strange models that exist in thermodynamics.

Is your computation basically a 0D problem? And your customer project wants to compute a huge set of these 0D computations? And are those 0D computations independent of all the others?

My software deals with thermodynamics; I do not deal with space. I have a given amount of elements, T and P, and should calculate the equilibrium using complex models for the interactions between the elements in many different solid and liquid phases (most of them unstable, but I do not know which until I have found the equilibrium). Dealing with space, 1D, 2D or 3D, is not my problem; that is dealt with by those solving the kinetic equations. But they need to know the chemical potentials of the elements, which phases are stable, and their compositions at each point (or line segment or volume unit) they have.

My question was whether I should try to parallelize my thermodynamic software to speed it up, or whether the kinetic software already uses up all the CPUs/GPUs available. The kinetic specialists I deal with always complain that the thermodynamic calculations take too much time.

@eelis is right in using the term 0D: these kinds of point-wise thermodynamic computations are 0D problems with respect to the “caller” application, which indeed is usually grid-based, be it FEM, CFD, FDT, etc. This is at least the usual jargon from a numerical perspective.

The use of parallel resources is typically managed by the caller application because, as @FedericoPerini exemplified, the best approach is to cluster thermodynamically similar regions and then do the “0D” complex computations in parallel according to some batching approach determined by the caller application.

As the provider of the logic for the lower-level (spatially speaking) problem, you can only do so much, such as ensuring that each call to your kernels is as efficient as possible: proper internal vectorization, and avoiding at all costs global mutable state which could hinder multi-threading. But this won’t happen automatically. You should at least profile your code to check that concurrent independent calls to your kernels are possible.
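A minimal illustration of the global-state pitfall (names invented; in Fortran the analogue would be module variables or SAVE'd locals versus arguments and derived types):

```python
# BAD: a module-level scratch buffer shared by every call.
# Two threads calling this concurrently overwrite each other's data.
_SCRATCH = [0.0] * 8

def phase_energy_unsafe(fractions):
    for i, x in enumerate(fractions):
        _SCRATCH[i] = x * x          # race: shared mutable state
    return sum(_SCRATCH[:len(fractions)])

# GOOD: all state lives in arguments and locals, so concurrent
# independent calls cannot interfere with each other.
def phase_energy_safe(fractions, work=None):
    work = work if work is not None else [0.0] * len(fractions)
    for i, x in enumerate(fractions):
        work[i] = x * x              # each call owns its workspace
    return sum(work[:len(fractions)])

print(phase_energy_safe([1.0, 2.0]))  # 5.0
```

The optional `work` argument also lets a caller reuse a per-thread workspace across calls, avoiding repeated allocation without reintroducing shared state.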

From there on, it is the responsibility of the consumer of your logic to set up proper batching depending on their own spatial partitioning scheme.

A more basic question then is: how do your users use your library? Is it by system calls to executable programs (file exchange), or can they statically link and call your procedures within their applications and pass data directly through memory?

The main complaint I hear is that my thermodynamic code is too slow to calculate the equilibrium at each local point. As I need to invert quite large matrices for quite a number of different phases to calculate the set of stable phases and the chemical potentials driving the diffusion, I wonder if I could perform some of these in parallel; the properties of each phase depend only on the local T, P and composition. However, I doubt there are any CPUs/GPUs available when the kinetic model handles many gridpoints. The library I developed allows an application to create as many equilibrium records as memory allows, each handling T, P and independent constitutions of the phases, so each equilibrium can be calculated in parallel. But I am not really involved in the simulation software; I was just thinking about how I could speed up the equilibrium calculation.

You could in principle, but practically speaking you might run into issues if the caller application is already multithreading or multiprocessing and you want to spawn sub-threads within it.

The GPU is an interesting idea, but that depends on your users having HPC GPUs available and on you providing linking mechanisms to vendor linear-algebra libraries. Also, as of today, CPUs and GPUs have separate memory regions which need to be managed properly.

You could first try to see whether the inversion problem could be handled with other numerical strategies which could increase performance.
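One such strategy, if the code currently forms explicit inverses: a Newton-type step only needs the linear solve J·dx = r, and a factor-and-solve (LAPACK `dgesv`, already covered by an existing LAPACK/BLAS dependency) costs roughly a third of the flops of forming J⁻¹ and is generally more accurate. A NumPy sketch with a stand-in matrix, not the real Jacobians:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# Well-conditioned stand-in "Jacobian" and residual:
J = rng.standard_normal((n, n)) + n * np.eye(n)
r = rng.standard_normal(n)

# Explicit inverse: ~3x the flops of a factor+solve, and less accurate.
dx_via_inverse = np.linalg.inv(J) @ r

# LU factorization + triangular solves (LAPACK *gesv under the hood).
dx_via_solve = np.linalg.solve(J, r)

print(np.allclose(dx_via_inverse, dx_via_solve))  # same answer, cheaper route
```

When the same matrix is reused for several right-hand sides (e.g. several phases sharing a factorized matrix), factoring once with `dgetrf` and reusing the factors amortizes the cost further.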

Usually, before trying to parallelize, one should try to get the most performance out of the sequential implementation.

Is the software (OpenCalphad) the following one…?

I wonder if such calculations already utilize some fast linear-algebra libraries (like MKL)? Also, I wonder if the calculation of the equilibrium at each gridpoint is done “freshly” every time the library is called? I thought it might be useful to re-use the result of the previous time step at the same gridpoint (if available on the caller side) as the initial guess for the iterations in the library, and to “refine” or “update” the solution with only a few more iterations (if the application is some time evolution, for example).
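The warm-start idea can be sketched with a toy Newton iteration (solving x² = a as a stand-in for the real equilibrium solve): when conditions drift slowly between time steps, reusing the previous solution as the initial guess cuts the iteration count sharply.

```python
def newton_sqrt(a, x0, tol=1e-12, max_iter=100):
    """Newton iteration for x*x = a; returns (root, iteration count)."""
    x = x0
    for it in range(1, max_iter + 1):
        step = (x * x - a) / (2.0 * x)
        x -= step
        if abs(step) < tol:
            return x, it
    return x, max_iter

# "Time evolution": the condition at one gridpoint drifts slowly.
targets = [100.0 + 0.1 * k for k in range(10)]

# Cold start: a fresh (poor) initial guess at every time step.
cold = sum(newton_sqrt(a, 1.0)[1] for a in targets)

# Warm start: reuse the previous time step's root as the guess.
x, warm = 1.0, 0
for a in targets:
    x, it = newton_sqrt(a, x)
    warm += it
print("cold-start iterations:", cold, " warm-start iterations:", warm)
```

The same logic applies to a full equilibrium solver: a converged set of stable phases and constitutions from the previous time step is usually a near-feasible starting point, so only a few refinement iterations are needed.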


Yes, I have been working with OC for the last 10 years. It is all Fortran and I use LAPACK/BLAS for the matrix handling. I provide a fairly old version of these routines, and one user has preferred to link to an in-house optimized version. On my MacPro I have 14 threads, and my test with 400 equilibria calculated in parallel takes 73 clockcycles, whereas running them all on a single thread takes 424, which I guess is OK.
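For context, a quick arithmetic check on those numbers (treating the 73 and 424 figures as wall-clock times, and assuming Amdahl-style scaling, which is only one possible model for the gap):

```python
def amdahl_parallel_fraction(speedup, threads):
    """Invert Amdahl's law S = 1/((1-p) + p/N) for the parallel fraction p."""
    return (1 - 1 / speedup) / (1 - 1 / threads)

speedup = 424 / 73            # ~5.8x on 14 threads
efficiency = speedup / 14     # ~41% parallel efficiency
p = amdahl_parallel_fraction(speedup, 14)
print(f"speedup {speedup:.1f}x, efficiency {efficiency:.0%}, "
      f"implied parallel fraction {p:.0%}")
```

Under that assumption roughly a tenth of the work behaves serially, which would be the first thing to look for in a profile before adding more parallelism.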

I am quite satisfied if there is not much to gain by trying to parallelize the equilibrium calculation itself. I have a lot of do loops and I have followed some discussions on “do concurrent”, but I do not think that is useful for OC.