The rantings of a stubborn GPU programmer

jorgeg · June 14, 2026, 10:27pm

I have kept a “running diary” where I’ve written down some of my ideas and reasoning behind writing GPU-accelerated Fortran code with the assistance of LLMs and agentic-ish workflows. You can find the posts on my website. Part 1: discussion on software architecture and design, Part 2: GPUs using OpenMP target offloading, and currently Part 3: GPUs via standard parallelism.

I still need to write a post on my “setup” for AI assisted coding but I thought it would be nice to first discuss how I set things up for my AI’s success.

Cheers

sumseq · June 16, 2026, 7:50pm

I will be sharing some new standard parallelism results on NVIDIA, AMD, and Intel GPUs soon - stay tuned!

jorgeg · June 16, 2026, 11:55pm

This is super nice! I am working on something similar. I’ll update here too

aledinola · June 19, 2026, 2:06pm

Thanks for sharing!

One question on the first example about parallelizing with OpenMP on the CPU:

!$omp parallel do collapse(3)
do j = 1, dims
  do i = 1, dims
    do k = 1, dims
      c(i,j,k) = alpha * a(i,j,k) + b(i,j,k)
    end do
  end do
end do

Shouldn’t the optimal loop order be reversed? Index i should vary fastest, then j, and last k.

jorgeg · June 19, 2026, 8:46pm

Yeah, on the CPU you’re better off doing k, j, i (k outermost, i innermost) because of what you said. That is an issue when porting codes and wanting to maintain performance on both CPU and GPU.
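
A minimal sketch of the CPU-friendly ordering discussed above, assuming the same saxpy-style kernel from the quoted example. The program name, the array size, and the fill values are made up for illustration. Fortran stores arrays in column-major order, so the first index (i) is the contiguous one and goes in the innermost loop.

```fortran
program loop_order_cpu
  implicit none
  integer, parameter :: dims = 64          ! hypothetical problem size
  real, parameter :: alpha = 2.0           ! hypothetical scale factor
  real, allocatable :: a(:,:,:), b(:,:,:), c(:,:,:)
  integer :: i, j, k

  allocate(a(dims,dims,dims), b(dims,dims,dims), c(dims,dims,dims))
  a = 1.0
  b = 3.0

  ! k outermost, i innermost: consecutive iterations of the inner loop
  ! touch consecutive memory locations (column-major storage).
  !$omp parallel do collapse(3)
  do k = 1, dims
    do j = 1, dims
      do i = 1, dims
        c(i,j,k) = alpha * a(i,j,k) + b(i,j,k)
      end do
    end do
  end do
  !$omp end parallel do

  ! Every element should be 2.0 * 1.0 + 3.0 = 5.0.
  if (any(c /= 5.0)) error stop "unexpected result"
  print *, "ok"
end program loop_order_cpu
```

Without an OpenMP flag the directives are treated as comments and the loop nest simply runs serially, so the sketch compiles either way.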

RonShepard · June 19, 2026, 11:13pm

I’m unfamiliar with this detail of GPU programming. Are there any GPUs that prefer noncontiguous memory access?

Regarding nested loop order in general for either OMP or do concurrent, is it correct that Fortran compilers no longer rearrange loops into the optimal order? I think this was mentioned elsewhere recently for the Intel compiler. In the 1980s, this was a routine optimization that was performed on scalar machines (due to virtual memory paging), vector machines (due to the underlying vector hardware), pipelined scalar machines (due to the underlying instruction set), and parallel machines (due to nearest-neighbor and other topology communication features). Did this knowledge somehow evaporate over the last 30 years?

jorgeg · June 20, 2026, 1:32am

So there are two issues: my loop example was a bit too general, and also it is not the Fortran compiler doing this but the OpenMP layer on top.

do concurrent will rearrange the loops if the nest can be collapsed, but the moment you put the omp target ... directive over a normal do loop, OpenMP takes over and honors your loop structure. With fully collapsible loops this is not much of a problem, but it is an issue when your loop has a serial bit, say you can parallelize over i and j but not over k. A lot of it still comes down to which compiler you are using.
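
To make the "serial bit" case concrete, here is a rough sketch, not taken from the blog posts. The recurrence along k, the array names, and the sizes are all hypothetical. It writes the same kernel twice: once with do concurrent over (i, j), and once with an OpenMP target directive over explicit do loops, where the loop order is kept exactly as written.

```fortran
program serial_k_sketch
  implicit none
  integer, parameter :: n = 32             ! hypothetical problem size
  real, allocatable :: a(:,:,:), c1(:,:,:), c2(:,:,:)
  integer :: i, j, k

  allocate(a(n,n,n), c1(n,n,n), c2(n,n,n))
  a = 1.0
  c1 = 0.0
  c2 = 0.0
  c1(:,:,1) = 1.0
  c2(:,:,1) = 1.0

  ! Version 1: do concurrent over (i, j). The k recurrence is serial
  ! inside each iteration; kk is declared in a block so that it is
  ! private to each iteration.
  do concurrent (j = 1:n, i = 1:n)
    block
      integer :: kk
      do kk = 2, n
        c1(i,j,kk) = c1(i,j,kk-1) + a(i,j,kk)
      end do
    end block
  end do

  ! Version 2: OpenMP target offload. Only the j and i loops are
  ! collapsed and parallelized; the k loop runs serially per thread,
  ! in the order written here.
  !$omp target teams distribute parallel do collapse(2) private(k) &
  !$omp& map(to: a) map(tofrom: c2)
  do j = 1, n
    do i = 1, n
      do k = 2, n
        c2(i,j,k) = c2(i,j,k-1) + a(i,j,k)
      end do
    end do
  end do
  !$omp end target teams distribute parallel do

  ! With a = 1 and c(:,:,1) = 1, the recurrence gives c(i,j,k) = k.
  if (any(c1 /= c2)) error stop "versions disagree"
  if (c1(1,1,n) /= real(n)) error stop "unexpected result"
  print *, "ok"
end program serial_k_sketch
```

In the second version, neighboring threads work on neighboring i values, while each thread strides through memory along k. Whether that layout is a good fit depends on the hardware and the compiler, which is the portability tension described above.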
