Profiling Fortran Code

Hello,

I think it would be good to have a discussion on how folks profile their code. Fortran is a language for those who want to go fast, yet searching for ‘profiling’ on the forum seems to yield nothing.

Profiling comes in many forms. One may want to profile:

  1. Timing of a program and its function calls (for the do-it-yourself version, see the sketch after this list).
  2. Timing of I/O operations involved in a program, and how many times they are called.
  3. Memory usage of the objects in the code: memory statistics over the lifetime of the run, and where in the source code memory usage is highest.
  4. Number of times routines are called, or the number of times a line in the code [or its generated assembly] is executed.
  5. Generalizations of the above to parallel execution environments (MPI, coarrays, etc.).
  6. More specific items: memory traversal statistics, cache behavior of variables, register usage; things that would probably require assembly-level analysis.
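
For item 1, the simplest do-it-yourself version is wall-clock timing with the standard system_clock intrinsic; a minimal sketch (the timed section here is just placeholder work):

```fortran
program time_section
  implicit none
  integer(kind=8) :: t0, t1, rate
  real :: x(100000)

  call system_clock(count_rate=rate)
  call system_clock(t0)

  ! Section being timed (placeholder work).
  call random_number(x)
  x = sqrt(x) + 1.0

  call system_clock(t1)
  print '(a, f8.4, a)', 'Elapsed wall time: ', real(t1 - t0) / real(rate), ' s'
  print *, sum(x)   ! use the result so the work is not optimized away
end program time_section
```

Clearly this does not scale to a whole code base, hence the interest in proper tools.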

Are there other things you profile? What tools do you use? Are there tools that do most of the above, or are there separate tools for different things?

9 Likes

If you want to profile parallel code, you’ll want to talk to ParaTools about TAU. I’ve not used it myself, but I’ve seen it used and it is like black magic.

For other purposes I’ve generally been successful with tools like gprof and Valgrind. I’ll bet others will suggest that Intel’s tools are pretty good, but I haven’t really used them.
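
In case it's useful to anyone new to gprof, the basic workflow with gfortran looks like this (a sketch; myprog.f90 is a placeholder name):

```
$ gfortran -pg -O2 myprog.f90 -o myprog
$ ./myprog                  # writes gmon.out to the current directory
$ gprof ./myprog gmon.out   # prints the flat profile and call graph
```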

5 Likes

Currently, for regular/coarray Fortran I just use gprof flat profiles to cover items 1 and 4-5, and Valgrind to find memory errors. When I’ve been doing CUDA Fortran, I’ve greatly enjoyed NVIDIA’s Nsight Compute, as it lets you track everything above except I/O operations. Its biggest strength is that it can do much of what item 6 lists. I’d honestly be quite happy to find a similar tool for regular Fortran, but I suspect it would require the hardware vendors to create profilers for code run on their hardware. Perhaps the Intel tools you mention have some of that functionality.
I’ve heard of HPCToolkit, and some people seem to swear by it, but I haven’t tried it out myself yet.
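
For reference, the Valgrind invocation I mean is plain memcheck; compiling with debug symbols makes the reports point at source lines (a sketch):

```
$ gfortran -g -O0 myprog.f90 -o myprog
$ valgrind --leak-check=full ./myprog
```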

2 Likes

FYI, Intel offers the VTune Profiler as part of oneAPI, but it ships with their Base Toolkit, whereas the IFORT (and IFX) Fortran compilers, along with their memory analyzer, Intel Inspector, are part of their HPC Toolkit.
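
If you prefer the command line over the GUI, VTune also has a CLI driver; something like the following collects and summarizes a hotspot profile (a sketch based on the documented vtune command; r000hs is the default-style result directory it creates):

```
$ vtune -collect hotspots -- ./myprog
$ vtune -report summary -result-dir r000hs
```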

3 Likes

I find the overview on VI-HPS :: Tools Overview quite helpful.

5 Likes

HPCToolkit (http://hpctoolkit.org/), supported by the Department of Energy’s ECP project, is quite nice for hotspot, trace, and event profiling. It shouldn’t be confused with Intel’s HPC Toolkit product, though.
It works well for serial, parallel (MPI and/or OpenMP), and GPU-accelerated applications on multiple vendors’ hardware.
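
The basic workflow is to measure with hpcrun, recover program structure with hpcstruct, and then build a database for hpcviewer with hpcprof (a sketch; directory names follow the defaults hpcrun generates):

```
$ hpcrun ./myprog                            # writes hpctoolkit-myprog-measurements/
$ hpcstruct hpctoolkit-myprog-measurements   # recovers loops and inlining info
$ hpcprof hpctoolkit-myprog-measurements     # writes hpctoolkit-myprog-database/
$ hpcviewer hpctoolkit-myprog-database
```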

7 Likes

We developed Caliper at Lawrence Livermore. It requires adding calls into your code to instrument what you want to measure. It integrates with some third-party tools to help measure GPU performance.
https://software.llnl.gov/Caliper/
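
A minimal Fortran annotation looks roughly like this (a sketch, assuming Caliper was built with its Fortran interface enabled; the module and routine names are from the Caliper documentation):

```fortran
program caliper_demo
  use caliper_mod
  implicit none
  integer :: i
  real(kind=8) :: s

  call cali_begin_region('main_loop')
  s = 0.0d0
  do i = 1, 1000000
     s = s + real(i, 8)
  end do
  call cali_end_region('main_loop')
  print *, s
end program caliper_demo
```

Running with, e.g., CALI_CONFIG=runtime-report set in the environment then prints a timing report for the annotated region.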

5 Likes

VTune and Intel Advisor both work with the Intel Fortran Classic compiler (ifort).
For Intel Fortran coarrays, surprisingly, the Intel MPI profiling tool, Intel Trace, actually works. This is because Trace works with MPICH, and Intel’s CAF implementation sits on top of Intel MPI, which is MPICH-based. At least in the past I was able to see individual data movement between images (ranks) in a CAF application.

For single-process profiling I really like Intel Advisor. Its Vectorization Advisor component profiles down at the loop level: it will tell you which loops vectorized, which didn’t and why, show trip counts and hot loops, and show which SSE/AVX level is used in each loop. So if you want to go down to loop-level profiling, it’s quite good. The downside is that it uses the Intel compiler’s opt-report output to give this detailed explanation of what each loop is doing, so it won’t work with gfortran.
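
For anyone who wants to script it, Advisor has a command-line driver too; a survey collection looks roughly like this (a sketch; the project directory name is arbitrary):

```
$ advisor --collect=survey --project-dir=./adv_results -- ./myprog
$ advisor --report=survey --project-dir=./adv_results
```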

For macro-level, function-level profiling, yes, VTune is good, but I find it complicated. They do have some new summary profiles that are a good starting point.

Any of these tools can be downloaded for free, and you don’t have to download a monster toolkit. A-la-carte downloads of each tool are available HERE

5 Likes

I’ve also used Open|SpeedShop in the past.

Often I just use cpu_time around a loop and compare the result to a theoretical performance peak.
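
Something along these lines (a minimal sketch; the loop body is placeholder work, counted as two flops per iteration):

```fortran
program flops_estimate
  implicit none
  integer, parameter :: n = 50000000
  integer :: i
  real :: t0, t1
  real(kind=8) :: s

  s = 0.0d0
  call cpu_time(t0)
  do i = 1, n
     s = s + real(i, 8) * 1.000000001d0   ! one multiply + one add
  end do
  call cpu_time(t1)

  print '(a, f6.3, a)', 'Time: ', t1 - t0, ' s'
  print '(a, f8.3, a)', 'Rate: ', 2.0d0 * n / ((t1 - t0) * 1.0d9), ' GFLOP/s'
  print *, s   ! use the result so the loop is not optimized away
end program flops_estimate
```

Dividing the measured rate by the core's peak (frequency × flops per cycle) then gives a rough efficiency figure.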

1 Like

Another tool worth looking at is LIKWID. Specifically, likwid-perfctr and the marker API.

On Ubuntu 20.04 I just installed LIKWID with apt install likwid. It also comes with a bunch of other helpful tools. For example, running likwid-topology gives you a nice overview of your system. In my case:

$ likwid-topology
--------------------------------------------------------------------------------
CPU name:	11th Gen Intel(R) Core(TM) i7-11700K @ 3.60GHz
CPU type:	Unknown Intel Processor
CPU stepping:	1
********************************************************************************
Hardware Thread Topology
********************************************************************************
Sockets:		1
Cores per socket:	8
Threads per core:	2
--------------------------------------------------------------------------------
HWThread	Thread		Core		Socket		Available
0		0		0		0		*
1		0		1		0		*
2		0		2		0		*
3		0		3		0		*
4		0		4		0		*
5		0		5		0		*
6		0		6		0		*
7		0		7		0		*
8		1		0		0		*
9		1		1		0		*
10		1		2		0		*
11		1		3		0		*
12		1		4		0		*
13		1		5		0		*
14		1		6		0		*
15		1		7		0		*
--------------------------------------------------------------------------------
Socket 0:		( 0 8 1 9 2 10 3 11 4 12 5 13 6 14 7 15 )
--------------------------------------------------------------------------------
********************************************************************************
Cache Topology
********************************************************************************
Level:			1
Size:			48 kB
Cache groups:		( 0 8 ) ( 1 9 ) ( 2 10 ) ( 3 11 ) ( 4 12 ) ( 5 13 ) ( 6 14 ) ( 7 15 )
--------------------------------------------------------------------------------
Level:			2
Size:			512 kB
Cache groups:		( 0 8 ) ( 1 9 ) ( 2 10 ) ( 3 11 ) ( 4 12 ) ( 5 13 ) ( 6 14 ) ( 7 15 )
--------------------------------------------------------------------------------
Level:			3
Size:			16 MB
Cache groups:		( 0 8 1 9 2 10 3 11 4 12 5 13 6 14 7 15 )
--------------------------------------------------------------------------------
********************************************************************************
NUMA Topology
********************************************************************************
NUMA domains:		1
--------------------------------------------------------------------------------
Domain:			0
Processors:		( 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 )
Distances:		10
Free memory:		23621 MB
Total memory:		31915.9 MB
--------------------------------------------------------------------------------
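
As for the marker API mentioned above: you wrap the regions you care about in start/stop calls and then run the binary under likwid-perfctr with -m. A minimal Fortran sketch (assuming LIKWID was built with its Fortran90 interface enabled, and the program compiled with -DLIKWID_PERFMON and linked with -llikwid):

```fortran
program likwid_demo
  use likwid
  implicit none
  integer :: i
  real(kind=8) :: s

  call likwid_markerInit()
  call likwid_markerStartRegion('accumulate')
  s = 0.0d0
  do i = 1, 10000000
     s = s + 1.5d0 * real(i, 8)
  end do
  call likwid_markerStopRegion('accumulate')
  call likwid_markerClose()
  print *, s
end program likwid_demo
```

Then something like likwid-perfctr -C 0 -g FLOPS_DP -m ./likwid_demo reports the counters for just that region.
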
5 Likes

I’ve used the TAU profiling suite on my project and highly recommend it. This was at the beginning of my work to port the code to use GPUs with OpenACC (since we didn’t want to use mixed-language programming with CUDA C++ or HIP, yet). I was able to do hotspot profiling with TAU to confirm the bottleneck of the CPU-only Fortran code that our collaborators pointed out a few years back:

[Figure: ParaProf statistics table showing the hotspot profile]

Plus, the developers at U of Oregon are friendly and very responsive (I’ve submitted a patch that was accepted into the code).

@adenchfi To answer your original question, TAU is able to do 1 (timing), 4 (callpath), and 5 (MPI and OpenMP). You might want to check out the recording from this tutorial on TAU usage at OLCF.
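
For anyone who wants to try it without recompiling, the library-preloading route I started with looks roughly like this (a sketch; instrumented builds via the TAU compiler wrappers give more detail):

```
$ tau_exec ./myprog   # writes profile.* files to the current directory
$ pprof               # text summary of the profiles
$ paraprof            # GUI browser for the same data (screenshot above)
```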

I’ve also used HPCToolkit that @fluidnumerics_joe mentioned above to look at MPI communication patterns in the code, which turned out to be mostly collective routines like MPI_Bcast and MPI_Reduce. HPCToolkit can do a lot more than that – it’s a very powerful toolkit, after all. I would recommend starting with the slides from this workshop at NERSC. During that workshop the HPCToolkit dev team helped us profile our code, which was unique in the sense that it uses MPI, OpenMP, OpenACC, the MAGMA linear algebra library, as well as NVIDIA Multi-Process Service – the last item was not yet well supported by HPCToolkit back then.

I also have an ongoing project with @Arjen, @jeremie.vandenplas, and @Lenore involving item number 3 (memory usage). So far our approach involves mining data from /proc/ and calling the getrusage() C system call on Linux.
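
The /proc part of that is simple enough to sketch inline (Linux only; here we pick the current resident set size out of /proc/self/status):

```fortran
! Print the VmRSS (resident set size) line of the calling process.
subroutine print_rss()
  implicit none
  integer :: unit, ios
  character(len=256) :: line

  open(newunit=unit, file='/proc/self/status', action='read', iostat=ios)
  if (ios /= 0) return
  do
     read(unit, '(a)', iostat=ios) line
     if (ios /= 0) exit
     if (index(line, 'VmRSS:') == 1) print *, trim(line)
  end do
  close(unit)
end subroutine print_rss
```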

For item number 2 (I/O) I’ve heard about the Darshan I/O profiler. It’s installed and enabled by default on Summit and Perlmutter.
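
My understanding is that on systems where it isn’t enabled by default, you attach it at run time and post-process the resulting log (a sketch; the library path and log file location vary by installation):

```
$ LD_PRELOAD=/path/to/libdarshan.so ./myprog
$ darshan-parser myprog_logfile.darshan | less
```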

For item number 6 (register usage) I think I remember reading that PAPI can do exactly that – tapping into hardware sensors and hardware counters.
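
If memory serves, the low-level Fortran bindings look roughly like this (a sketch counting total cycles; the f90papi.h header and PAPIF_* conventions are from the PAPI documentation and may differ between PAPI versions):

```fortran
program papi_demo
  implicit none
  include 'f90papi.h'   ! PAPI constants (PAPI_NULL, PAPI_TOT_CYC, ...)
  integer :: retval, eventset, i
  integer(kind=8) :: values(1)
  real(kind=8) :: s

  retval = PAPI_VER_CURRENT
  call PAPIF_library_init(retval)

  eventset = PAPI_NULL
  call PAPIF_create_eventset(eventset, retval)
  call PAPIF_add_event(eventset, PAPI_TOT_CYC, retval)

  call PAPIF_start(eventset, retval)
  s = 0.0d0
  do i = 1, 1000000
     s = s + real(i, 8)
  end do
  call PAPIF_stop(eventset, values, retval)

  print *, 'Total cycles:', values(1), '(checksum', s, ')'
end program papi_demo
```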

1 Like

@ivanpribec I’ve heard of LIKWID before but haven’t used it yet. Does it basically provide the same info that hwloc can give you, or much more than that?