What time are you reporting here? Is it the section between 'initialise_phi' and 'time_loop'?
If you do some counting of the memory accesses, you can quickly estimate how well the code is performing.
First loop nest

| Variable | Write | Read |
|---|---|---|
| phi | 0 | 5 |
| tempr | 0 | 5 |
| lap_phi | 1 | 0 |
| lap_tempr | 1 | 0 |
| phi_dx | 1 | 0 |
| phi_dy | 1 | 0 |
| epsil | 1 | 0 |
| epsilon_deriv | 1 | 0 |
Second loop nest

| Variable | Write | Read |
|---|---|---|
| phi | 1 | 1 |
| epsil | 0 | 5 |
| epsilon_deriv | 0 | 4 |
| phi_dx | 0 | 2 |
| phi_dy | 0 | 2 |
| tempr | 1 | 1 |
| lap_phi | 0 | 1 |
If you sum these values, it gives you the amount of memory accessed to update one grid cell:
|  | Write | Read |
|---|---|---|
| Total | 8 | 26 |
So a total of 34 elements per cell update, 8 bytes each, which gives 272 bytes per cell update. I ignored the `where(phi > ...) ...` clipping part, which is essentially a third and fourth loop nest, so take the following numbers with a grain of salt.
Now if you calculate the rate (cell updates per second), using the best number from Test 2:
rate = ((2000 * 2000) cells) * (2000 steps) / 98.173 s = 81.5 MUPS (mega cell updates per second)
If you multiply the rate by the memory balance, you get an effective bandwidth:
bw = (81.5 MUPS) * (272 bytes per cell update) = 22168 MB/s = 22.17 GB/s
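If you want the code to report these numbers on every run, here is a minimal sketch of the bookkeeping. The routine name `report_rate`, the arguments, and where you measure `elapsed` (e.g. with `system_clock` around the time loop) are all assumptions about your code, and the 272 bytes per update is just the estimate from the table above:

```fortran
! Sketch: given grid size, number of steps, and measured wall-clock time,
! print the cell-update rate and the resulting effective bandwidth.
! The 272 bytes per update is only the estimate from the access count above.
subroutine report_rate(nx, ny, nsteps, elapsed)
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   integer, intent(in) :: nx, ny, nsteps
   real(dp), intent(in) :: elapsed                        ! seconds, e.g. from system_clock
   real(dp), parameter :: bytes_per_update = 272.0_dp     ! 34 values * 8 bytes
   real(dp) :: mups, bw

   mups = real(nx, dp) * real(ny, dp) * real(nsteps, dp) / elapsed / 1.0e6_dp
   bw   = mups * 1.0e6_dp * bytes_per_update / 1.0e9_dp   ! GB/s

   print '(a, f8.1, a)', 'rate         = ', mups, ' MUPS'
   print '(a, f8.2, a)', 'effective bw = ', bw, ' GB/s'
end subroutine report_rate
```

Calling it as `call report_rate(2000, 2000, 2000, 98.173d0)` reproduces the 81.5 MUPS and ~22 GB/s above, up to rounding.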
If you look up the properties of your processor, you’ll find it has a peak memory bandwidth of 50 GB/s, so you are using roughly 45 % of it. (For the 2000^2 grid, 8 field variables, double precision, you need 256 MB of memory, which far exceeds the 12 MB L3 cache. The 200^2 case only needs 2.56 MB.)
Assuming your code is bandwidth-limited (stencil codes typically are), this is not bad at all. I’m guessing you could probably go a little faster, say up to 60-70 % of the theoretical bandwidth. Your kernel uses trigonometric functions (sin, cos, atan), so perhaps that gives it a slightly higher arithmetic intensity, which balances out the memory load time; you could verify this by looking at the hardware performance counters with a profiling tool. For your loop nests, introducing a layer of halo cells for the periodic boundary condition would make them easier to vectorize and perhaps a little faster.
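To show what I mean by a halo layer, here is a sketch only; the array bounds `(0:nx+1, 0:ny+1)` and the name `phi` are assumptions, not your actual code. You store each field with one extra ring of ghost cells and refresh it once per step:

```fortran
! Sketch: one ghost/halo layer for a periodic boundary condition.
! Array bounds (0:nx+1, 0:ny+1) and the name phi are assumptions about your code.
subroutine fill_halo(phi, nx, ny)
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   integer, intent(in) :: nx, ny
   real(dp), intent(inout) :: phi(0:nx+1, 0:ny+1)

   ! Periodic wrap: each ghost row/column copies the opposite interior row/column
   phi(0,    1:ny) = phi(nx, 1:ny)
   phi(nx+1, 1:ny) = phi(1,  1:ny)
   phi(1:nx, 0   ) = phi(1:nx, ny)
   phi(1:nx, ny+1) = phi(1:nx, 1 )

   ! Corners, in case the stencil touches diagonal neighbours
   phi(0,    0   ) = phi(nx, ny)
   phi(nx+1, 0   ) = phi(1,  ny)
   phi(0,    ny+1) = phi(nx, 1 )
   phi(nx+1, ny+1) = phi(1,  1 )
end subroutine fill_halo
```

With the halo filled for every field that needs neighbour values, the inner loops run over i = 1, nx and j = 1, ny with plain i±1, j±1 indexing, so there is no modulo or branching on the indices left in the hot loop.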
Instead of guessing, I’d recommend using a tool like the Intel Application Performance Snapshot, as I’ve described before here and also here. The tool essentially helps identify what the bottleneck is (at least from the processor’s point of view): for instance whether it is using scalar or vector instructions, whether the cache is stalling, whether there is thread imbalance or overhead, etc. I find it really useful. In case it’s not obvious: despite being an Intel tool, you can also use it to profile executables produced by other compilers (gfortran, nvfortran) and on other x86-64 machines (for instance if you have a CPU from AMD).
Another thing you can try is plotting the rate (cell updates per second) as a function of grid size. In such a plot you should be able to clearly see the effect of going from cache to main memory. Here is an example of what it looks like when you hit the “memory wall”:
I’d expect you to see something similar.
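If you want to generate such a plot yourself, a minimal, self-contained sketch is below. The 5-point Jacobi sweep inside `run_case` is only a stand-in so the example compiles and runs on its own; you would call your own time loop there instead. The rate is defined the same way as above, n^2 cells times the number of steps divided by the wall-clock time:

```fortran
! Sketch: measure cell-update rate vs. grid size to see the cache -> memory transition.
program scan_grid_sizes
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   integer :: sizes(7) = [100, 200, 400, 800, 1200, 1600, 2000]
   integer :: k, nsteps
   integer(kind=8) :: t0, t1, count_rate
   real(dp) :: elapsed, mups

   nsteps = 200
   do k = 1, size(sizes)
      call system_clock(t0, count_rate)
      call run_case(sizes(k), nsteps)
      call system_clock(t1)
      elapsed = real(t1 - t0, dp) / real(count_rate, dp)
      mups = real(sizes(k), dp)**2 * real(nsteps, dp) / elapsed / 1.0e6_dp
      print '(i6, f12.1)', sizes(k), mups
   end do

contains

   ! Stand-in kernel: a plain 5-point sweep on an n x n grid with a fixed halo.
   ! Replace this with a call to your own time loop for the real measurement.
   subroutine run_case(n, steps)
      integer, intent(in) :: n, steps
      real(dp), allocatable :: a(:,:), b(:,:)
      integer :: i, j, step
      allocate(a(0:n+1, 0:n+1), b(0:n+1, 0:n+1))
      a = 0.0_dp
      a(n/2, n/2) = 1.0_dp
      do step = 1, steps
         do j = 1, n
            do i = 1, n
               b(i, j) = 0.25_dp * (a(i-1, j) + a(i+1, j) + a(i, j-1) + a(i, j+1))
            end do
         end do
         a(1:n, 1:n) = b(1:n, 1:n)
      end do
   end subroutine run_case

end program scan_grid_sizes
```

Keep in mind the toy kernel above uses only two arrays, so it spills out of a 12 MB L3 at a larger grid (roughly 900^2) than your eight-array code does (roughly 430^2); adjust the size range accordingly when you plug in your own solver.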