Exposing Metal acceleration libraries on macOS for Fortran

I was tinkering with interfacing Metal Performance Shaders to Fortran the other night, and I was curious whether this would be a useful addition for the Fortran community.

According to this poll, about 33% of Fortran users are on macOS, so it may be of use.

To me, the lack of CUDA support and the deprecation of OpenCL are a real problem for macOS users looking for GPU acceleration; at this point the only supported way to access GPU acceleration is through Metal. Unfortunately it’s hidden behind an Objective-C interface, which requires a good deal of ObjC → C → Fortran bridging work that most people aren’t willing to take on.
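To give a flavour of the Fortran end of that chain, here is a minimal sketch. It assumes a hypothetical C shim called mps_sgemm (not any real API) that hides the Objective-C/MPS calls behind a plain C entry point; the Fortran binding itself is the easy part, the real work sits in the Objective-C layer behind it.

! Minimal sketch: Fortran binding to a hypothetical C shim named mps_sgemm
! that wraps the Objective-C / MPS calls behind a plain C entry point.
interface
   subroutine mps_sgemm(m, n, k, a, b, c) bind(c, name="mps_sgemm")
      use, intrinsic :: iso_c_binding, only: c_int, c_float
      integer(c_int), value :: m, n, k
      real(c_float), intent(in)    :: a(*), b(*)
      real(c_float), intent(inout) :: c(*)
   end subroutine mps_sgemm
end interface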

Looking at the Metal Performance Shaders documentation (Metal Performance Shaders | Apple Developer Documentation), there are a few interfaces I would be interested in; the matrix and neural-network interfaces appeal to me, but maybe it’s just me… My question is: what in the MPS library would be of interest to the macOS Fortran community? I could also expose Metal compute shaders, which MPS is built on…

Would it be worth posting it on GitHub and leaving it open to community contributions? There are only a few cases I need myself, but the community may be interested in exposing more functionality.

Thanks

Walt


I’ve been tinkering with some graphical applications in Fortran and was looking at what is available on macOS. While compiling a test OpenGL application, a number of deprecation warnings popped up, suggesting I use Metal. I found there is a Metal C++ interface (Getting started with Metal-cpp - Metal - Apple Developer), but due to its complexity, and the need for double bridging C++ → C → Fortran, I gave up immediately.

IMHO, without an automatic interface generation tool, it’s only feasible on an as-needed basis. In other words, you should only expose the minimum set of functionality needed to get the job done.

Absolutely. Even if the full interface is not covered, at least we can follow your example and potentially expose more of it over time.

It’s also a shame there seems to be no SYCL support for Mac GPU devices. This would be an easier path for those who already use oneAPI on other devices. It might be possible (in the future) by translating SPIR-V to MSL, as @JeffH suggested here. The SPIRV-Cross tool has seen some progress in disassembling/lifting SPIR-V to MSL.


As an avid Fortran and Mac user, I’d be eager to see at least an example online. I think it would help the Fortran community to show that Fortran is not a closed ecosystem. Thank you if you’re willing to do that!

From a broader perspective, I think most Fortran applications/codes target scientific computing, so they’re hardly Mac-centric. macOS may cover some fraction (10%? 30%?) of the total “OS share”, but Unix/Linux and MS Windows probably come first. In this scenario, using cross-platform libraries makes a lot more sense IMHO because it maximizes coding productivity: OpenCL is my choice for GPU programming.

It’s true Apple is trying to move people to Metal, but I doubt support for OpenCL is going to be ditched anytime soon, since so many Mac apps still use it.


This repo already provides a C interface; it looks like it could be a good starting point for a Fortran interface.

I don’t think the Apple GPUs support OpenCL; I checked a few months ago and it appeared that only a CPU device was available… I wrote an OpenGL 4.6 interface on top of Metal (MGL) last year and it works well, but the feedback from the Fortran community on OpenGL was that data is usually post-processed and displayed using other languages’ interfaces.

I worked at Apple, on OpenCL 1.0, in 2008… and I can say it hasn’t changed much; the last supported version was 1.2, and that’s where Apple dropped support for OpenCL.

But I do know that some of the same people who worked on OpenCL 1.0 also work in the Accelerate group, and they optimized MPS, so we can expect MPS to use optimized kernels.

The problem I have with Metal is that it doesn’t support doubles, but that’s what we get… Metal is targeted at iOS devices, and there’s no need for double support on iOS.

You could try to emulate FP64 with a pair of floats (not the same range as IEEE double). Here are a few threads in this direction:

Taking the code snippet in the second link above, multiplying two dblfloat values could be achieved with something like this:

#include <metal_stdlib>   // metal::fma
using namespace metal;

// Double-float ("head/tail") type: two floats whose unevaluated sum
// carries roughly twice the precision of a single float.
typedef float2 dblfloat;  // .y = head, .x = tail

dblfloat mul_dblfloat (dblfloat x, dblfloat y)
{
    dblfloat t, z;
    float sum;
    /* high-order product and its FMA-recovered rounding error */
    t.y = x.y * y.y;
    t.x = fma (x.y, y.y, -t.y);
    /* accumulate the low-order and cross partial products */
    t.x = fma (x.x, y.x, t.x);
    t.x = fma (x.y, y.x, t.x);
    t.x = fma (x.x, y.y, t.x);
    /* normalize result */
    sum = t.y + t.x;
    z.x = (t.y - sum) + t.x;
    z.y = sum;
    return z;
}

I’m guessing however you’d have to implement the kernels yourself, and wouldn’t be able to reuse the existing shaders.
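If you do go down that road, the same head/tail product can be checked on the CPU first. Here is a rough Fortran sketch of the multiplication above, useful for validating a Metal kernel against double-precision results; the module and names are just for illustration, and it assumes a compiler that provides the Fortran 2018 IEEE_FMA function.

! Hedged CPU-side sketch: the same head/tail product in Fortran,
! for checking a Metal kernel against double-precision results.
! Assumes a compiler that provides the Fortran 2018 IEEE_FMA function.
module dblfloat_ref
   use, intrinsic :: ieee_arithmetic, only: ieee_fma
   implicit none
contains
   ! x, y and the result are (head, tail) pairs of single-precision reals
   function mul_dblfloat(x, y) result(z)
      real, intent(in) :: x(2), y(2)
      real :: z(2), t_hi, t_lo, s
      t_hi = x(1) * y(1)
      t_lo = ieee_fma(x(1), y(1), -t_hi)   ! rounding error of head*head
      t_lo = ieee_fma(x(2), y(2), t_lo)    ! tail*tail
      t_lo = ieee_fma(x(1), y(2), t_lo)    ! head*tail
      t_lo = ieee_fma(x(2), y(1), t_lo)    ! tail*head
      s    = t_hi + t_lo                   ! renormalize into head/tail
      z(2) = (t_hi - s) + t_lo
      z(1) = s
   end function mul_dblfloat
end module dblfloat_ref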

I am in the process of finishing an initial pass right now; I was going to release something in a couple of days.

I abstracted the interface to something akin to OpenGL, to avoid passing pointers around and to keep the pointers inside Objective-C land, which preserves the reference counts.

There are a lot of kernel types, and I just don’t want to be supporting every call entry, so I am abstracting kernel generation to a target which indicates which type of MPS kernel it is.

This way you get a generic interface driven by enums and named objects, which simplifies all the pointer handling.

Here is a quick look at the C header for the interface. Right now I support mpsMatrixMult / mpsMatrixSum and Metal shaders… it’s not complete, as I have just switched over to kernel types.

typedef enum {
    kInvalid = 0,
    kFloat32,
    kFloat16,
    kInt64,
    kInt32,
    kInt16,
    kInt8,
    kUInt64,
    kUInt32,
    kUInt16,
    kUInt8,
    kBool,
    kComplexFloat32,
    kComplexFloat16,
    kData,
    kMaxMPSType
} DataType;

typedef enum {
    kShader = 0,
    kMatrixMult,
    kMatrixSum
} KernelTarget;

typedef enum {
    kAlpha,
    kBeta,
    kTranspose_A,
    kTranspose_B,
    kMaxOption
} MatrixOption;

// enable functions; MPS kernels can take optional values
void mpsEnable(KernelTarget target, MatrixOption option);
void mpsDisable(KernelTarget target, MatrixOption option);

// set option values for MPS kernels
void mpsSetOptionValuef(KernelTarget target, MatrixOption option, float value);
void mpsSetOptionValued(KernelTarget target, MatrixOption option, double value);

// gen type functions
unsigned mpsGenBuffer(size_t size, void *data);
unsigned mpsGenVector(unsigned len, void *data, DataType type);
unsigned mpsGenMatrix(unsigned rows, unsigned cols, void *data, DataType type);

// gen kernel functions
unsigned mpsGenMatrixMultKernel(unsigned rows, unsigned cols);
unsigned mpsGenMatrixSumKernel(unsigned rows, unsigned cols);

// data functions
void mpsSubBuffer(unsigned name, void *src, size_t size, size_t offset);
void mpsSubBufferVector(unsigned name, void *src, size_t len, size_t offset, DataType type);
void mpsSubBufferMat(unsigned name, void *src, unsigned row, unsigned col, unsigned width, unsigned height, DataType type);

// synchronize data from GPU to CPU after a GPU operation modifies the contents
void mpsSyncBuffer(unsigned name);
void mpsSyncVector(unsigned name);
void mpsSyncMatrix(unsigned name);

// copy Metal's copy of the data to a dst pointer in the same format it was submitted in
void mpsGetBuffer(unsigned name, void *dst);
void mpsGetVector(unsigned name, void *dst);
void mpsGetMatrix(unsigned name, void *dst);

// command buffer flush routines
void mpsFlush(void);
void mpsFinish(void);
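For reference, here is a hedged sketch of what the Fortran side of a few of these entry points could look like with iso_c_binding. The enum values and argument lists come from the header above, but the kind choices are assumptions: the C enums and the unsigned "names" are mapped to c_int, and module/dummy names are mine.

! Hedged sketch: Fortran bindings for a subset of the C header above.
! Kind mappings (enums and unsigned names as c_int) are assumptions.
module mps_f
   use, intrinsic :: iso_c_binding, only: c_int, c_double, c_ptr
   implicit none

   ! Mirror of a subset of the C enum values
   integer(c_int), parameter :: kInvalid = 0, kFloat32 = 1, kFloat16 = 2
   integer(c_int), parameter :: kShader = 0, kMatrixMult = 1, kMatrixSum = 2
   integer(c_int), parameter :: kAlpha = 0, kBeta = 1

   interface
      function mpsGenMatrix(rows, cols, buf, dtype) bind(c, name="mpsGenMatrix")
         import :: c_int, c_ptr
         integer(c_int), value :: rows, cols, dtype
         type(c_ptr),    value :: buf
         integer(c_int)        :: mpsGenMatrix
      end function mpsGenMatrix

      subroutine mpsEnable(target, option) bind(c, name="mpsEnable")
         import :: c_int
         integer(c_int), value :: target, option
      end subroutine mpsEnable

      subroutine mpsSetOptionValued(target, option, val) bind(c, name="mpsSetOptionValued")
         import :: c_int, c_double
         integer(c_int), value :: target, option
         real(c_double), value :: val
      end subroutine mpsSetOptionValued

      subroutine mpsSyncMatrix(name) bind(c, name="mpsSyncMatrix")
         import :: c_int
         integer(c_int), value :: name
      end subroutine mpsSyncMatrix

      subroutine mpsGetMatrix(name, dst) bind(c, name="mpsGetMatrix")
         import :: c_int, c_ptr
         integer(c_int), value :: name
         type(c_ptr),    value :: dst
      end subroutine mpsGetMatrix

      subroutine mpsFinish() bind(c, name="mpsFinish")
      end subroutine mpsFinish
   end interface
end module mps_f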

This is a simple test of MatrixMult. Again, it’s missing the kernel-target support, but it’s getting there.

matA   = mpsGenMatrix(rows, cols, dataA, kFloat32);
matB   = mpsGenMatrix(rows, cols, dataB, kFloat32);
result = mpsGenMatrix(rows, cols, dataResult, kFloat32);

mpsEnable(kMatrixMult, kAlpha);
mpsSetOptionValued(kMatrixMult, kAlpha, 0.5);

mpsMatrixMult(matA, matB, result);

mpsSyncMatrix(result);
mpsFinish();

mpsGetMatrix(result, dataResult);
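For what it’s worth, the same test could look roughly like this from Fortran, assuming bindings along the lines of the mps_f sketch above. Note that mpsMatrixMult is not in the header shown, so its binding here is inferred from the C example and is an assumption.

! Hedged sketch of the same MatrixMult test driven from Fortran.
! Assumes bindings like the mps_f sketch above; the mpsMatrixMult
! interface is inferred from the C example, not from the header.
program test_mps_matmul
   use, intrinsic :: iso_c_binding, only: c_int, c_float, c_double, c_loc
   use mps_f
   implicit none

   interface
      subroutine mpsMatrixMult(a, b, c) bind(c, name="mpsMatrixMult")
         import :: c_int
         integer(c_int), value :: a, b, c
      end subroutine mpsMatrixMult
   end interface

   integer(c_int), parameter :: rows = 64, cols = 64
   real(c_float), target :: dataA(rows*cols), dataB(rows*cols), dataResult(rows*cols)
   integer(c_int) :: matA, matB, result

   call random_number(dataA)
   call random_number(dataB)
   dataResult = 0.0

   matA   = mpsGenMatrix(rows, cols, c_loc(dataA),      kFloat32)
   matB   = mpsGenMatrix(rows, cols, c_loc(dataB),      kFloat32)
   result = mpsGenMatrix(rows, cols, c_loc(dataResult), kFloat32)

   call mpsEnable(kMatrixMult, kAlpha)
   call mpsSetOptionValued(kMatrixMult, kAlpha, 0.5_c_double)

   call mpsMatrixMult(matA, matB, result)

   call mpsSyncMatrix(result)                 ! make the GPU result visible to the CPU
   call mpsFinish()                           ! wait for the command buffer to complete
   call mpsGetMatrix(result, c_loc(dataResult))
end program test_mps_matmul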

I will check it into GitHub for review if people are interested in expanding on the work.

Walt
