fastGPT: Faster than PyTorch in 300 lines of Fortran

I would like to announce fastGPT, a fast GPT-2 inference engine written in Fortran:

I recommend reading the blog post above for background and motivation. See the README on GitHub for an example and benchmarks.

It’s pure Fortran; it’s short, readable, and most importantly: fast. On my Apple M1 it looks like it is faster than PyTorch in a fair comparison, and a lot faster if I use optimizations/backends that PyTorch doesn’t use. It also starts immediately. It is a standalone Fortran application; currently we still need Python to encode the input string into tokens, but then fastGPT takes over, generates more tokens, and converts them back to text.

It is written like any other numerical code. I think Fortran is a perfect fit, at least for GPT-2 inference, and probably for other similar ML/AI models too.
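To illustrate that point, here is a minimal sketch of causal scaled dot-product attention, the numerical core of a GPT-2 layer. It is written in plain Python with lists purely for illustration (the function name and shapes are mine, not fastGPT's); fastGPT does the equivalent work in Fortran on top of BLAS matrix multiplies:

```python
import math

def attention(Q, K, V):
    # Causal scaled dot-product attention: each position i attends
    # only to positions 0..i. Q, K, V are lists of n_seq rows of length d.
    n_seq, d = len(Q), len(Q[0])
    out = []
    for i in range(n_seq):
        # dot products with the allowed (non-masked) keys, scaled by sqrt(d)
        scores = [sum(Q[i][k] * K[j][k] for k in range(d)) / math.sqrt(d)
                  for j in range(i + 1)]
        # numerically stable softmax over the scores
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        w = [wi / z for wi in w]
        # weighted sum of the value rows
        out.append([sum(w[j] * V[j][k] for j in range(i + 1))
                    for k in range(d)])
    return out
```

Everything here is loops, dot products, and a softmax, i.e. exactly the kind of array code Fortran has always been good at.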

fastGPT is currently parallelized only via parallel OpenBLAS. It has great single-core CPU performance, which provides a solid foundation for parallelization and GPU offloading. I am hoping some of you would be interested in helping. We can try MPI, and @rouson can try coarrays. :slight_smile: I recommend approaching it like any other physics or numerical code, and let’s see how fast we can make it in parallel. This would also be a great GSoC project, covering both parallelization and making the application more user friendly (such as porting the encoder to Fortran so that we don’t need Python; see the issue tracker for more ideas).


I think between @certik, @milancurcic and @rouson , we’re going to end up with more and faster ML/AI libraries than Python. :tada:


What’s the accuracy of your fast tanh? Asking because I believe it will likely be faster to approximate erf directly. Specifically (in Julia):

function fasterf(x)
    x2 = x*x
    res = x*evalpoly(x2, (0.7975839f0, -0.13200624f0, 0.019021248f0, -0.0019748025f0, 0.00013678304f0, -5.5545797f-6, 9.853275f-8))
    return ifelse(x2 < 12.75f0, res, copysign(1f0, x))
end

The advantage is that you need fewer terms to get an accurate result.
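For anyone who wants to check this against a reference, here is a direct Python transcription (a hypothetical port; Horner evaluation stands in for Julia's `evalpoly`, and as noted later in the thread, the polynomial approximates erf(x/sqrt(2)), the form gelu needs):

```python
import math

# Coefficients copied from the Julia snippet above; the polynomial is in
# x^2, and the full expression approximates erf(x / sqrt(2)).
COEFFS = (0.7975839, -0.13200624, 0.019021248, -0.0019748025,
          0.00013678304, -5.5545797e-6, 9.853275e-8)

def fasterf(x):
    x2 = x * x
    if x2 >= 12.75:
        # the function has saturated to +/-1 at this point
        return math.copysign(1.0, x)
    # Horner evaluation of the polynomial in x2 (what evalpoly does)
    p = 0.0
    for c in reversed(COEFFS):
        p = p * x2 + c
    return x * p
```

Comparing `fasterf(x)` with `math.erf(x / math.sqrt(2))` over a few inputs confirms agreement to roughly 4 digits, consistent with the accuracy quoted later in the thread.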


Kudos @certik, very nice and interesting effort.

You’re on the right track; you’re looking at Fortran beyond the pigeonhole of number crunching into which it has been boxed by many people, including many on the WG5 and J3 committees.

Your comment, “we still need Python to encode the input string to tokens,” describes a needlessly sorry situation for Fortran. With some imagination, vision, and a bit of effort in the language, Fortran can be superior to and safer than other, far more popular approaches for string handling and encoding as well.

With your effort on LFortran, you can spark massive interest in Fortran and drive its adoption across a variety of computing domains. The sky is the limit.


@oscardssmith awesome, thanks for the tip! I haven’t checked the accuracy much; I just coded it quickly, and it seems to produce the same token results. I was wondering the same thing: how did they arrive at that odd expression involving tanh with a polynomial inside it? Then I read that it’s an approximation to erf(x). So we should approximate erf(x) directly, exactly as you posted. The only worry is that it will then differ from the tanh(x) approximation, which in turn might change the model’s answers if it was trained with the tanh(x) version; I don’t know. We have to try it and see.
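For context, the “weird expression” is GPT-2’s well-known tanh-based GELU approximation, while the exact GELU uses erf. A quick Python check (function names are mine) that the two forms agree closely:

```python
import math

def gelu_exact(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # GPT-2's tanh-based approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

# Maximum disagreement over a representative range is tiny
max_diff = max(abs(gelu_exact(0.01 * i) - gelu_tanh(0.01 * i))
               for i in range(-500, 501))
```

So switching the implementation from the tanh form to a direct erf approximation changes the activation by far less than the activation itself, which is why it is plausible (but worth verifying) that the model’s outputs stay essentially the same.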


Thanks for your encouraging comments, @FortranFan. It’s not too bad; I wrote the tokens-to-string decoder here:

It can probably still be simplified (it even does simplified UTF-8 decoding!). The encoder will be harder: essentially we need to translate this little Python file: fastGPT/ at 01eb84b015d89a567245da0445c0abb7d53a8500 · certik/fastGPT · GitHub. There is a regex in it, but I am hoping we can hand-code it. We’ll have to write lots of tests to ensure we didn’t make a mistake, but it shouldn’t be hard; I was focusing on performance first.
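For anyone curious what the encoder actually has to do: after the regex splits the text into chunks, the core of GPT-2’s byte-pair encoding is a loop that repeatedly merges the highest-priority adjacent pair of tokens. A toy Python sketch of that loop (the merge table below is made up for illustration; the real one ships with the model):

```python
def bpe_encode(word, merges):
    # merges maps an adjacent token pair to its rank (lower merges first)
    tokens = list(word)
    while len(tokens) > 1:
        # rank every adjacent pair; pairs not in the table get infinity
        ranked = [(merges.get((tokens[i], tokens[i + 1]), float("inf")), i)
                  for i in range(len(tokens) - 1)]
        rank, i = min(ranked)  # best rank, leftmost on ties
        if rank == float("inf"):
            break  # no applicable merges remain
        # replace the pair with its concatenation
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens
```

The logic itself is simple enough to hand-code in Fortran; the fiddly parts are the regex-based pre-tokenization and the byte-to-unicode mapping around it, which is where the tests will earn their keep.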


The reason people use the tanh version is that on a GPU, Nvidia has a fancy tanh builtin that you can’t match, and the exact shape doesn’t really matter, just the smoothness and rough shape (the activation functions are mostly made up anyway).


Ah, I see, now that makes sense to me. Perfect, I’ll try erf(x) and see. Focusing on the CPU, I think there are quite a few things one can do to speed this up further. For a GPU, we might need to adapt the code anyway, and possibly maintain a dedicated version; it’s just a few functions. The same goes for a parallel version.

erf is a Fortran 2008 intrinsic.

Yes, but it will be considerably slower than the approximated version, which is only accurate to about 4 digits.


Yes, and so is tanh(x), but the fast_tanh(x) in the code is a lot faster, even at full accuracy, and as @oscardssmith said, it looks like we might get away with a lower-accuracy version as well.

Nice work, @certik! I hope to talk with you soon about how we might collaborate in this area and what synergies there might be with Inference-Engine. I’m curious what file format you use. Inference-Engine uses a JSON file exported from PyTorch. I’m also investigating ONNX.


Definitely! There is also @milancurcic’s neural-fortran. We should figure out how to join forces, I think Fortran has a lot to offer in this area.

Right now the only documentation of it is the code that reads it:

and writes it:

It’s just binary array data. I think it’s actually platform-independent, except that it assumes little-endian byte order, which most platforms use these days.

My list of Fortran codes on GitHub has a Neural Networks and Machine Learning section.


Where’s the fpm.toml file? :slight_smile:


@rjzak send a PR! :slight_smile:


This is very impressive! I hope this will bring a lot of attention to Fortran.

What’s the benefit of passing n_embd and n_seq to the mha function as extra arguments, instead of using ubound or size of x? Is it just for readability, or am I missing something?

real(sp), intent(in) :: x(n_embd,n_seq)

Oops, I forgot to mention that this is erf(x/sqrt(2)), since that is what gelu needs.


@oscardssmith here I found a case where my fast_tanh() function produces different output than tanh(): An example where the current fast_tanh() gives different results · Issue #25 · certik/fastGPT · GitHub. Both outputs look ok. How do you judge which one is better?

I assume what is happening is that the model gives probabilities for all the tokens, and if I printed them, I would find similar probabilities in both cases, just slightly different numerically (due to the tanh differences); the “greedy” mode then selects a different token, but from the probability perspective the results might still be “equivalent”. Is there a way to determine at which point the results stop being “equivalent”? What accuracy in the final token probabilities is needed?

I wonder if one can think of the reduced-precision tanh as reducing the precision of the whole model. There are other ways to do that as well, such as reducing the default 32-bit float weights to 16-bit, 8-bit, or even 4-bit. It must affect the final probabilities, but I wonder what are some ways to judge the quality of the result. One way would be to compute the error function for some texts and see how much it changes under the various reduced-precision variants. Is that the way to approach it? If it gets worse by just a few percent, it’s not a problem, but if it changes a lot, it might be?
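One concrete way to quantify “equivalence” here is to compare the two probability distributions directly, e.g. via KL divergence over the token probabilities, rather than comparing the sampled tokens. A minimal Python sketch (function names are mine):

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_div(p, q):
    # KL(p || q); assumes q[i] > 0 wherever p[i] > 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

If the KL divergence between the full-precision and reduced-precision distributions stays tiny at every step, the two models are effectively sampling from the same distribution, even when greedy decoding happens to pick a different argmax near a tie.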

It’s a little bit hard to tell. On GPUs these models are likely running with bfloat16 or a mixed-precision scheme, so I would think that as long as you are within 2^-10 or so you should get reasonable results, but it’s hard to say.
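To experiment with that “within 2^-10” idea on the CPU, one can simulate bfloat16 by rounding each float32 to 8 bits of mantissa. A small Python sketch operating on the float32 bit pattern (round-to-nearest-even; NaN is not handled specially, which is fine for ordinary weights):

```python
import struct

def to_bfloat16(x):
    # Reinterpret x as float32 bits, round to the nearest bfloat16
    # (keeping only the top 16 bits), and return the result as a float.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

Mapping this over the weights (or over activations like the tanh/erf outputs) would let one measure directly how much the final token probabilities move under reduced precision.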
