AMD's Fortran efforts for GPUs

On Phoronix there is an article about AMD's efforts to run Fortran code on GPUs.

Unfortunately, I cannot provide more context.

8 Likes

So I guess this is AMD's answer to Nvidia's offloading capability in nvfortran. I wonder if they will also work on implementing coarrays etc. and continue to optimize for the CPU side. Hopefully, one day AMD will have the courage to unlock FP64 on their $500-plus commodity GPUs. The promise of sustained teraflop 64-bit performance on a $500 card would be a game changer for a lot of folks in academia doing low- to mid-level scientific computing. Large-scale simulations will always be done on big iron, but there is a lot of engineering (preliminary design tasks etc.) that can be done on a 16-core system with a teraflop-class GPU.

3 Likes

For simulations it is also important to pick a card with good memory bandwidth.

In Germany you can pick up an Intel Arc A750 for 225 € (ASRock Intel Arc A750 Challenger D 8GB OC graphics card). It has GDDR6 memory and a 256-bit memory bus. The Nvidia RTX 4060 has similar specs, but its memory bus is only 128-bit.

Unfortunately the Intel card doesn't support FP64 natively, only via emulation. I did spot some numbers in the recent work (PDF, 2.4 MB) of @sumseq:

Notice how the 3060 Ti and 3090 Ti are faster than the 4070? I suspect it’s also related to the superior memory bus of the higher-end cards (despite being older generations).

| GPU | Memory Size | Memory Type | Bus Width |
|-------------|-------|--------|---------|
| RTX 4060    | 8 GB  | GDDR6  | 128-bit |
| RTX 4070    | 12 GB | GDDR6X | 192-bit |
| RTX 3060 Ti | 8 GB  | GDDR6  | 256-bit |
| RTX 3090 Ti | 24 GB | GDDR6X | 384-bit |
| Arc A580    | 8 GB  | GDDR6  | 256-bit |
| Arc A750    | 8 GB  | GDDR6  | 256-bit |
| Arc A770    | 16 GB | GDDR6  | 256-bit |
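
Out of curiosity, peak bandwidth can be roughly estimated from the bus width and the per-pin data rate (bandwidth ≈ bus width in bytes × data rate). The per-pin rates below are my assumptions based on typical GDDR6/GDDR6X speeds, not official figures:

```python
# Rough peak-bandwidth estimate: GB/s = (bus_width_bits / 8) * data_rate_Gbps.
# Per-pin data rates are assumed typical GDDR6/GDDR6X values, not official specs.
cards = {
    # name:        (bus width [bits], assumed per-pin rate [Gbps])
    "RTX 4060":    (128, 17.0),
    "RTX 4070":    (192, 21.0),
    "RTX 3060 Ti": (256, 14.0),
    "RTX 3090 Ti": (384, 21.0),
    "Arc A750":    (256, 16.0),
}

for name, (bus_bits, gbps) in cards.items():
    gb_per_s = bus_bits / 8 * gbps
    print(f"{name:12s} ~{gb_per_s:5.0f} GB/s")
```

By this estimate the 384-bit bus gives the 3090 Ti roughly double the bandwidth of the 4070, and even the budget Arc cards land in the same ballpark as the 4070 thanks to their 256-bit bus.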

On Reddit some users posted impressive real-time CFD simulations using the Arc A750 where the higher bandwidth was key. (This code works in single-precision though.)

Anyway, I agree that it would be nice to have an AMD GPU in this same price bracket that also offered FP64 and could be used with OpenMP offloading. It looks like the flang compiler is getting there.

1 Like

Agree on bandwidth, but you also need as much memory as you can afford. That's also my rule of thumb when I'm building a new system: I'll go with a slightly slower CPU with fewer cores but try to max out memory as much as I can afford. Early last year I bought an Nvidia RTX A4500 for around 900 dollars, thinking I was going to use it for some neural-network regression models, but decided to retire instead. It has 24 GB of memory and a 192-bit bus. At the time it was actually cheaper than some of the non-"workstation" NVIDIA cards with a slower bus and less memory.

Edit. Also, I thought I saw some posts online (that I think were in error) claiming you "might" be able to unlock FP64, or at least emulate it, on the Quadro/RTX Axxxx cards. I've assumed that cards like the A4500 have the hardware to support FP64 but that NVIDIA has disabled it in firmware or something similar. I've never verified whether that's true.

Edit 2. Got the wrong specs for the A4500: it has 20 GB of memory but uses a 320-bit bus, for a bandwidth of 640 GB/s. FP64 (I presume emulated) is supposedly 369.6 GFLOPS, which is still plenty fast for a lot of simulations.

1 Like

In the Turing architecture there were 2 FP64 units per 64 FP32 units (a 1:32 ratio in FLOP rate). In the Ampere architecture this ratio is 1:64 (speaking specifically of the GA102 die used in the A4500; more details in the GA102 whitepaper).
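
To put those ratios in numbers, the peak FP64 rate can be estimated from the FP32 rate. The FP32 figure below is an assumption taken from publicly listed A4500 specs, not something I can vouch for:

```python
# Estimate FP64 peak from the FP32 peak and the architecture's FP64:FP32 ratio.
fp32_tflops = 23.65   # assumed FP32 peak of the RTX A4500 (GA102), TFLOPS
ratio = 1 / 64        # Ampere GA102 FP64:FP32 FLOP-rate ratio

fp64_gflops = fp32_tflops * ratio * 1000
print(f"Estimated FP64 peak: {fp64_gflops:.1f} GFLOPS")
```

That works out to roughly 370 GFLOPS, consistent with the A4500 figure quoted above, which would mean the FP64 rate comes from the (sparse) native FP64 units rather than emulation.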

As somebody noted on the Nvidia Developer Forum:

> On a philosophical level it is a bit disconcerting that FP64 support in consumer GPUs has been reduced to such a low level where parts of it could be replaced by software emulation if it were not for the issue of register pressure. From a business perspective it makes sense, though, as NVIDIA is supply constrained on the foundry side, thus slashing hardware components that do not directly drive revenue.


When it comes to AMD GPUs, I think the ones that support Fortran and OpenMP offloading are the higher class ones in the Instinct series (e.g. the MI250X used by Frontier and Lumi, or the MI300A used in El Capitan).

When it comes to commodity GPUs that also support compute workloads (using HIP or OpenMP), you need to check if the GPU is listed in the ROCm compatibility tables: System requirements (Linux) — ROCm installation (Linux) (excluding the Instinct accelerators, there are just 11 GPUs to choose from).

Looking at the specs of the AMD Radeon RX 7900 XTX, it also has a 1:32 restriction on FP64 FLOP rate (this is a ~900 € card with 24 GB of GDDR6 memory and a 384-bit memory bus).

1 Like