I still need to write a post on my “setup” for AI-assisted coding, but I thought it would be nice to first discuss how I set things up for my AI’s success.
Yeah, on the CPU you’re better off with the kji order because of what you said. That is an issue when porting codes and wanting to maintain performance on both CPU and GPU.
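For anyone following along, here is a minimal sketch of what "kji" means here. The array name `a`, the size `n`, and the loop body are all made up for illustration; the only point is which index runs in the innermost loop.

```fortran
program kji_order
  implicit none
  integer, parameter :: n = 64          ! hypothetical problem size
  real, allocatable :: a(:,:,:)
  integer :: i, j, k

  allocate(a(n, n, n))

  ! Fortran arrays are column-major: a(i,j,k) and a(i+1,j,k) are adjacent
  ! in memory. With k outermost and i innermost ("kji"), the innermost
  ! loop walks through memory with unit stride.
  do k = 1, n
    do j = 1, n
      do i = 1, n
        a(i, j, k) = real(i + j + k)
      end do
    end do
  end do

  ! Simple self-check so the sketch does something observable.
  if (a(1, 1, 1) /= 3.0) error stop "unexpected value in a(1,1,1)"
  print *, "a(n,n,n) =", a(n, n, n)
end program kji_order
```

Swapping the loop nest so that `k` is innermost touches memory with a stride of `n*n` elements instead, which is typically what hurts on a cache-based CPU.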
I’m unfamiliar with this detail of GPU programming. Are there any GPUs that prefer noncontiguous memory access?
Regarding nested loop order in general for either OMP or do concurrent, is it correct that Fortran compilers no longer rearrange loops into the optimal order? I think this was mentioned elsewhere recently for the Intel compiler. In the 1980s, this was a routine optimization that was performed on scalar machines (due to virtual memory paging), vector machines (due to the underlying vector hardware), pipelined scalar machines (due to the underlying instruction set), and parallel machines (due to nearest-neighbor and other topology communication features). Did this knowledge somehow evaporate over the last 30 years?
So there are two issues: my loop example was a bit too general, and it is not the Fortran compiler that is responsible here but the OpenMP layer on top.
do concurrent will rearrange the loops if they can be collapsed, but the moment you put the omp target ... directive over a normal do loop, OpenMP takes over and honors your loop structure. With fully collapsible loops this is not much of a problem, but it is an issue when your loop has a serial part. Say you can parallelize over i and j but not over k. A lot of it still depends on which compiler you are using.
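To make the "parallel over i and j, serial over k" case concrete, here is a rough sketch. The array `a`, the extents `ni`, `nj`, `nk`, and the running-sum recurrence along k are invented for illustration, and the directive is just one plausible way to spell it; check what your compiler actually supports.

```fortran
program serial_k
  implicit none
  integer, parameter :: ni = 32, nj = 32, nk = 16   ! hypothetical extents
  real :: a(ni, nj, nk)
  integer :: i, j, k

  a = 1.0

  ! The i and j loops are independent and are collapsed into one parallel
  ! iteration space. The k loop carries a dependence (each level needs the
  ! previous one), so it stays serial inside each (i, j) iteration.
  ! OpenMP keeps the loops in the order written here; it does not reorder them.
  !$omp target teams distribute parallel do collapse(2) private(k) map(tofrom: a)
  do j = 1, nj
    do i = 1, ni
      do k = 2, nk
        a(i, j, k) = a(i, j, k) + a(i, j, k - 1)
      end do
    end do
  end do
  !$omp end target teams distribute parallel do

  ! Every column started as all ones, so the running sum at level k is k.
  if (a(1, 1, nk) /= real(nk)) error stop "running sum along k is wrong"
  if (a(ni, nj, nk) /= real(nk)) error stop "running sum along k is wrong"
  print *, "ok"
end program serial_k
```

Built without OpenMP enabled, the directives are treated as comments and the loops run serially, so the same source works as a CPU reference. Note that the serial k loop is now innermost, which is the opposite of the kji order that suits the CPU; that is the porting tension described above.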