Advice for approaching data locality improvements in legacy code

I’m in the middle of my first major performance analysis and optimization project. I have legacy Fortran code (think Fortran 77 with some Fortran 90, lots of global state).

The main part I’m interested in is responsible for > 90 % of runtime. It comprises nested loops, a lot of code, very entangled, A LOT of variables.

I have profiled the application (with Linaro MAP), I know there is a significant amount of time going into memory access, and I have hints at the places where this is worst.

After looking at several guides (like the Practical guide (with code) on performance optimization techniques in Fortran (2024)), I checked for the obvious things such as index order in multidimensional arrays, tried removing unnecessary assignments and redundant variables, moved statements using the same variables closer together, and so on. With no success, unfortunately.

The code is compiled with -O3, so probably the compiler also does smart re-arrangements I’m not aware of.

My problem is that the code is very far from the nice, short toy examples that tutorials tend to show. So I know there’s no easy way to fix it. But does anyone have tips or advice on how to approach this? How to explore less blindly? How to get a feeling for what might work?

I’m willing to put in serious effort, I’m just not sure how to go about it or what tools to use.

Cheers,
Christina

P.S. I’m also looking into whether the algorithm as such can be improved, but it’s quite involved and I would prefer to not change it if I can.

4 Likes

Hello! Welcome to the forum, we’re glad you’re here.

Approaching a large code base is always a challenge, especially if it contains lots of legacy code. Having a lot of global state is also challenging, since you’re carrying it around all the time and it can be difficult to handle.

My suggestion would be to isolate the code you’re working on as much as possible. If it is not already encapsulated in a module (which is often the case with legacy apps), I’d start with that. This way you can set up better access to the routines you need through an interface you write - you’ll control the data in and out, and that will help you realize what you really need to do the work.
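For example, a minimal sketch (names and arguments are made up) of what such an interface could look like, with the data the kernel needs passed in explicitly instead of living in global state:

module kernel_wrapper
   implicit none
   private
   public :: run_kernel
contains
   subroutine run_kernel(n, coords, field, result)
      integer, intent(in)  :: n
      real,    intent(in)  :: coords(n, 3)
      real,    intent(in)  :: field(n)
      real,    intent(out) :: result(n)
      ! Placeholder computation; in reality this would call or replace the
      ! legacy routine, with everything it touches now listed explicitly.
      result = field * coords(:, 1)
   end subroutine run_kernel
end module kernel_wrapper

Once the data flow is explicit like this, it also becomes much easier to see what is actually being streamed through memory in the hot loops.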

If you’re not compiling with warnings enabled I’d enable all warnings and see if you can start fixing a couple of those that pop up in the area of the code you’re working with. This is a good way to improve the codebase and familiarize yourself more with that region.

I also like to write mini apps that mimic the loop I’m working on: similar data access patterns, similarly structured loops, but with you in control of everything.

If the codebase does not have unit tests it’s a good thing to invest in.

In general, with such large applications full of legacy constructs, it is very much worth spending some time on housekeeping before you jump straight into what you’re actually trying to do. Also, open PRs often, so that you don’t accumulate changes and end up trying to merge thousands of lines of code at once.

After you feel comfortable navigating the codebase and can create mini apps from diverse parts of it I’d start thinking about improving the algorithm.

Your mileage may vary, but this is what has worked for me with a large, legacy-ridden codebase.

Could you please be more specific regarding the aim of your project? Are you mainly trying to improve the code’s performance, or is it the code’s structure (i.e. its maintainability) that is the main problem? Or is it both?

These two things may require different refactoring techniques, so without knowing what exactly you are trying to achieve, or having seen the code in question, it is difficult to provide specific advice.

You might want to try a profiling tool such as VTune (https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html) from Intel. I’ve found it to be quite easy to use and extremely powerful for detecting performance and memory issues.

1 Like

My approach is often to replace a smallish block of code with a routine, document the interface, and try to isolate the variables that are local to this smaller routine. Break the code into smaller functional parts.
Use of array sections can also hide a lot of memory-use inefficiency, where the compiler resorts to temporary arrays deep inside the loops, with the associated hidden delays.
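A made-up example of the kind of thing I mean (not from your code): passing a non-contiguous array section to a routine with an explicit-shape dummy argument usually forces a hidden copy on every call.

program section_temporaries
   implicit none
   integer, parameter :: n = 2000, m = 2000
   real :: a(n, m), row_sums(n)
   integer :: i

   call random_number(a)

   do i = 1, n
      ! a(i, :) is non-contiguous (stride n) and row_sum has an
      ! explicit-shape dummy, so the compiler typically builds a
      ! contiguous temporary (copy-in) on every call.
      row_sums(i) = row_sum(a(i, :))
   end do

   print *, sum(row_sums)

contains

   real function row_sum(row)
      real, intent(in) :: row(m)
      row_sum = sum(row)
   end function row_sum

end program section_temporaries

Declaring the dummy as assumed-shape (row(:)) avoids the copy but leaves the access strided; storing the data transposed, so that the hot dimension is contiguous, is usually the better fix.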

Where you have data structures that are used in a variety of code locations, it can be difficult to restructure the data to suit each use. Perhaps, with better documentation of the interface, you can identify a local data structure that avoids a lot of the memory transfer delays caused by a data layout that is inappropriate for the computation.
You can’t attempt this while you still have “a lot of code, very entangled, A LOT of variables” - hence the need to break things into smaller parts first.

1 Like

It’s about performance, and the specific goal I have is to reduce the number of cache misses. The profiler indicates that around 40 % of runtime goes into memory access, which sounds like it should be possible to reduce…

(Readability and maintainability are also a problem, but that’s a general one of the whole code base, so that doesn’t need to be solved now.)

Number of cache misses can be a misleading metric. Quoting Hager and Wellein,

Number of bus transactions, i.e., cache line transfers. Events like “cache misses” are commonly used instead of bus transactions, however one should be aware that prefetching mechanisms (in hardware or software) can interfere with the number of cache misses counted. In that respect, bus transactions are often the safer way to account for the actual data volume transferred over the memory bus. If the memory bandwidth used by the processor is close to the theoretical maximum (or better, close to what standard low-level bandwidth benchmarks like STREAM can achieve; …), there is no point in trying to optimize further for better bus utilization. …

Also in this video, Hager remarks,

Cache miss ratios, sometimes very popular, tell how many accesses to the memory hierarchy were misses in the cache hierarchy. We are not so enthusiastic about cache miss ratios; we think it is much more interesting to look at data volumes and data transfer bandwidths, because cache miss ratios can easily be misinterpreted. We could have a code with a very low cache miss ratio which is also very slow.

Perhaps more straightforward is to just focus on derived metrics like:

  • number of loop iterations per second (or other problem specific metric, cell evaluations per second, systems solved per second)
  • data transfer bandwidth (byte/cy, byte/s)
  • floating point performance (flop/cy, flop/s)

The benefit of the latter two is that you can relate them to the effective limit values of a given processor (“the code reaches 50 % of the effective bandwidth”…).
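For example (made-up numbers): a loop that reads two real(8) arrays and writes a third, each with 10^8 elements, moves at least 3 × 8 × 10^8 bytes = 2.4 GB per sweep. If one sweep takes 0.05 s, the effective bandwidth is about 48 GB/s, which you can compare directly with the STREAM triad figure for your machine.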

A question here would be if the nested loops are amenable to (auto-)vectorization and/or parallelization?

One way to test this is by compiling the code with machine-specific optimizations and with and without vectorization (the parts in [ ]):

  • gfortran: -O3 -march=native -f[no-]tree-vectorize
  • ifx: -O3 -xHost -[no-]vec

If it makes a difference, then you know the code is exploiting vector instructions (at least to some extent). This experiment still doesn’t tell you anything about the absolute performance.

I also like to start performance optimization with a look at the APS (Application Performance Snapshot) output. It is a tool from Intel, related to the VTune and Advisor tools mentioned by @hkvzjal.

2 Likes

Thanks a lot for the advice! All very solid and helpful. :slight_smile: Starting with compiler warnings and isolating smaller parts in particular seem like good ways to find an entry point.

I’m torn between wanting to change as little as possible and wanting to have something that is also possible to parallelize without major pain. A good middle ground might be to play with the code a lot to grasp it, and in the end throw away all which you don’t need.

Well, it’s fascinating work!

Thanks for the tip; I’ll try the corresponding AMD tools.

So far I’ve been very happy with Linaro MAP, as it helped to detect some small code blocks in very inconspicuous helper methods which contained legacy workarounds that look harmless but scaled incredibly badly (and that I’m sure I would not have found otherwise). When it comes to metrics like available and used memory bandwidth, for example, beyond what `likwid-topology` gives, I still have to learn and skill up…

Thanks a lot for the note on cache misses. I’ll try to get the hang of the other metrics. As mentioned in my reply above, I have pretty good insight into where the runtime goes, but I’m still too unknowing when it comes to a lot of other metrics. I know the code is memory bound, but I need to figure out why. And of course the best tools are useless if you don’t know how to interpret what they tell you - something to work on!

Also, there’s very little vectorization happening. It would be nice to be able to exploit it more, I guess, but with the code as it is I see little possibilities. It’s a lot of entangled “get something from this array, calculate stuff, then get something from that array, do more stuff, and then put the result into yet another array”.

The naive answer would be that the code must be loading a lot of data from main memory. Lots of science and engineering codes are loops that move data in and out, meaning performance is limited by the relatively slow data path from main memory, through the caches, all the way up to the functional units (the ALUs).

There are several questions to explore:

  • is there unnecessary data movement (“redundant traffic”) that could be avoided?
  • is the data access pattern contiguous, strided, or irregular? (this goes hand in hand with loop index ordering; see the sketch after this list)
  • could some of the data be recalculated cheaply on the fly?
  • does the code use multi-threading? on recent architectures you may need multiple cores to saturate the memory interface
  • can the computation be offloaded to an accelerator/GPU?
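
To make the second point concrete, here is a minimal made-up example where only the loop order differs; the first variant walks memory with stride n, the second with stride 1:

program loop_order
   implicit none
   integer, parameter :: n = 4000
   real, allocatable :: a(:, :), b(:, :)
   integer :: i, j

   allocate(a(n, n), b(n, n))
   call random_number(b)

   ! Strided: the inner loop runs over the second (slow) index, so
   ! consecutive iterations touch elements n elements apart in memory.
   do i = 1, n
      do j = 1, n
         a(i, j) = 2.0 * b(i, j)
      end do
   end do

   ! Contiguous: the inner loop runs over the first (fast) index, so
   ! consecutive iterations touch adjacent memory (column-major layout).
   do j = 1, n
      do i = 1, n
         a(i, j) = 2.0 * b(i, j)
      end do
   end do

   print *, a(1, 1), a(n, n)
end program loop_order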
1 Like

I’ve a little experience with Linaro MAP. It’s complementary to the intel vTune, etc, but when we were looking at introducing some openMP into an existing code earlier this year, we often found that Linaro MAP was clearer to understand than intel vTune. We didn’t try the AMD tools as Next Gen. AMD fortran compiler doesn’t support co-arrays which are used in our code.

1 Like

If you can profile the code, as some other people suggested, you can quickly find the bottlenecks and see where to start.

Sometimes also you’ll find places that can just be made better by doing some small quality of life upgrades, like a better interface to a library.

I find it difficult to see how this could be the case, other than a poor algorithm.
If this is the case, then I would expect that you are looking at the wrong part of the code.
Most often, 99% of the calculation is in 1% of the code, so inefficiency in the other 99% of code is not significant.

This is interesting, as most old legacy codes have efficiency workarounds that helped the compiler technology of the time but are no longer suited to modern compilers. It can help to identify these patches and simply clean up the Fortran syntax.
(An example of this is reducing multi-dimensional arrays to single vectors with complex subscript calculations; removing this mess can usually help. Another is to use ALLOCATE rather than the old mess of carving many data arrays out of one working array.)

This is a worry, but it can be improved by revising the order of operations in the algorithm. AVX instructions are very beneficial in the inner loops.

Other things you might want to take into account, @christina:

  • LTO (link-time optimization): say you wrote an elemental procedure or a small kernel function in a library-like module and use it in several other places. You might expect the compiler to take that procedure and give you optimal code at the end, but inlining (or not) such procedures can completely change the overall performance… Unfortunately, Fortran has no standard “force inline” attribute. Flags like -flto (GNU and Intel LLVM), -ipo (Intel classic), or -Minline (NVIDIA) can help with cross-file inlining. The only “poor man’s” alternatives are to copy-paste the implementation into the same file as the calling procedure, or to use an include file (which amounts to the same thing), so the compiler has a chance to inline your small kernel function within the larger scope.

  • While compilers do a great job at optimizing your code, they have a limited analysis scope, which means that sometimes they optimize with respect to that limited scope in a way that actually hinders performance in the overall picture. Sometimes you have to help the compiler, sometimes you have to trick it, or simply take a step back and refactor a big chunk of code. For a memory-bound problem you want to do as much computation as possible with the data already loaded into cache. Certain implementations can profit from you explicitly declaring small “chunks” of data (in powers of 2) that can be processed in parallel by the vector units, instead of expecting the compiler to figure out from a sequential loop that it is allowed to “chunk it out”.

  • Branching (if/exit or if/cycle) is something that can be revisited to help the compiler. In some situations I have found that the merge intrinsic gives better performance (small sketch below). Not a magic bullet, but worth exploring.
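
A made-up sketch of what I mean with merge (both loops compute the same result; whether the branch-free form is faster depends on the compiler and the data, so measure it):

program branch_vs_merge
   implicit none
   integer, parameter :: n = 1000000
   real, allocatable :: x(:), y(:)
   integer :: i

   allocate(x(n), y(n))
   call random_number(x)

   ! branching version
   do i = 1, n
      if (x(i) > 0.5) then
         y(i) = x(i)
      else
         y(i) = 0.0
      end if
   end do

   ! branch-free version: merge(tsource, fsource, mask)
   do i = 1, n
      y(i) = merge(x(i), 0.0, x(i) > 0.5)
   end do

   print *, sum(y)
end program branch_vs_merge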

2 Likes

Thanks again everyone who chimed in; this is more helpful than you think. :slight_smile:

After a bit of tinkering with the code, I managed to make it significantly worse. Which shows me that my intuition about data access and movement is still :poop:, but also makes me confident that I can make it better once I manage to move in the other direction…

2 Likes

Interesting suggestions!

I often have the following code pattern in my codes: nested do loops that call a function or subroutine stored in a separate module. Let me illustrate with a MWE:

module return_module
   implicit none
   private
   public :: ReturnFn

contains

   real function ReturnFn(iap, ia, iz)
      integer, intent(in) :: iap, ia, iz

      ReturnFn = real(iap + ia + iz)

   end function ReturnFn

end module return_module

program main_program

   use return_module, only: ReturnFn

   implicit none

   integer, parameter :: Nap = 1000, Na = 500, Nz = 50
   integer :: iap, ia, iz
   real :: value
   real :: array(Nap, Na, Nz)

   ! Loop order optimized for Fortran column-major storage
   do iz = 1, Nz
      do ia = 1, Na
         do iap = 1, Nap
            value = ReturnFn(iap, ia, iz)
            array(iap, ia, iz) = value
         end do
      end do
   end do

end program main_program

What would be a good way of improving performance of this code?

  • Sometimes I use OpenMP to parallelize the nested loops, since they are independent, but I’m not sure if I should collapse all three using the collapse clause or only parallelize the outermost one.
  • I have tried do concurrent instead of OMP directives, but the results are usually worse than serial code, while OMP does well.
  • Will moving the function ReturnFn from the module into the main program (as an internal function after a contains statement) improve performance? It makes the code less clear, since I want to keep the numerical routine with the loops separate from the function, so I am reluctant to do so…

Apologies for the poor formatting of my MWE, but this new formatting editor is not very user friendly.

You should vectorize the function (make it take arrays as arguments, and return an array result).

This way it becomes possible to push, e.g., the innermost loop (over iap) into the function itself, which reduces the function call overhead by a factor of Nap = 1000 in your example and moreover ensures unit stride of memory access within the function itself.

You could even push also the outer loops into the function (if you’d like to achieve the maximum possible vector length).

Or you could collapse these loops into one loop, and parallelize this one using OpenMP directives (so that you can exploit both vectorization and parallelization). It all depends on the kind of hardware you want this to run fast on.
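
A minimal sketch of what that could look like for your MWE (made-up names; only the iap loop is pushed into the function here):

module return_module_vec
   implicit none
   private
   public :: ReturnFnVec
contains
   ! Array version: takes the whole iap range at once, so the innermost
   ! loop lives inside the function and the per-element call overhead is gone.
   function ReturnFnVec(nap, ia, iz) result(res)
      integer, intent(in) :: nap, ia, iz
      real :: res(nap)
      integer :: iap
      do iap = 1, nap
         res(iap) = real(iap + ia + iz)
      end do
   end function ReturnFnVec
end module return_module_vec

program main_program
   use return_module_vec, only: ReturnFnVec
   implicit none
   integer, parameter :: Nap = 1000, Na = 500, Nz = 50
   integer :: ia, iz
   real :: array(Nap, Na, Nz)

   do iz = 1, Nz
      do ia = 1, Na
         ! unit-stride fill of one column of the array
         array(:, ia, iz) = ReturnFnVec(Nap, ia, iz)
      end do
   end do

   print *, array(1, 1, 1), array(Nap, Na, Nz)
end program main_program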

1 Like

Thanks for the suggestion, but I think using array syntax is very powerful in Matlab but not in Fortran. Do you have some specific example?

How is this different from

!$omp do collapse(3)

?