fastGPT: Faster than PyTorch in 300 lines of Fortran

Thanks for your encouraging comments @FortranFan. It’s not too bad; I did the tokens-to-string decoder here:

It can probably still be simplified (it even does simplified UTF-8 decoding!). The encoder will be harder: essentially we need to translate this little Python file: https://github.com/certik/fastGPT/blob/01eb84b015d89a567245da0445c0abb7d53a8500/encode_input.py. There is a regex in it, but I am hoping we can hand-code it. We’ll have to write lots of tests to ensure we didn’t make a mistake, but it shouldn’t be hard; I was focusing on performance first.

3 Likes

The reason people use the tanh version is that on a GPU, Nvidia has a fancy tanh builtin that you can’t match, and the exact shape doesn’t really matter; only the smoothness and the rough shape do (the activation functions are mostly made up anyway).

1 Like

Ah, I see, now that makes sense to me. Perfect, I’ll try erf(x) and see. I think that, focusing on a CPU, there are quite a few things one can do to further speed this up. For a GPU, we might need to adapt the code anyway, and possibly maintain a dedicated version; it’s just a few functions. The same goes for a parallel code.

erf is a Fortran 2008 intrinsic.

Yes, but it will be a bunch slower than the approximated version, which is only accurate to 4 digits.

1 Like

Yes, so is tanh(x), but the fast_tanh(x) in the code is a lot faster, even at full accuracy, and as @oscardssmith said, it looks like we might get away with a lower accuracy version as well.
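For reference, here is a minimal sketch of the two GELU formulations under discussion, the “exact” erf form and the common tanh approximation; this is just an illustration, not necessarily the exact code used in fastGPT:

```fortran
module gelu_example
  implicit none
  integer, parameter :: sp = kind(1.0)
contains
  ! "Exact" GELU using the erf intrinsic (Fortran 2008)
  elemental real(sp) function gelu_erf(x) result(y)
    real(sp), intent(in) :: x
    y = 0.5_sp * x * (1.0_sp + erf(x / sqrt(2.0_sp)))
  end function gelu_erf

  ! The widely used tanh approximation of GELU
  elemental real(sp) function gelu_tanh(x) result(y)
    real(sp), intent(in) :: x
    real(sp), parameter :: c = 0.7978845608_sp   ! sqrt(2/pi)
    y = 0.5_sp * x * (1.0_sp + tanh(c * (x + 0.044715_sp * x**3)))
  end function gelu_tanh
end module gelu_example
```

As noted above, the tanh form agrees with the erf form to only about 4 digits, so the choice between them is mostly a speed/accuracy trade-off.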

Nice work, @certik! I hope to talk with you soon about how we might collaborate in this area and what synergies there might be with Inference-Engine. I’m curious what file format you use. Inference-Engine uses a JSON file exported from PyTorch. I’m also investigating ONNX.

1 Like

Definitely! There is also @milancurcic’s neural-fortran. We should figure out how to join forces, I think Fortran has a lot to offer in this area.

Right now the only documentation of it is the code that reads it:

and writes it:

It’s just binary array data. I think it’s actually platform-independent, except that it assumes little-endian byte order, but most platforms are little-endian these days.
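For illustration, raw array data like this can be read with stream access in just a few lines; the following is a hypothetical sketch (the file name and sizes are made up), not the actual fastGPT reader:

```fortran
! Hypothetical sketch of reading a raw little-endian array of 32-bit floats,
! similar in spirit to the fastGPT model file but not its actual layout.
program read_raw_array
  implicit none
  integer, parameter :: sp = kind(1.0)
  integer, parameter :: n_embd = 768, n_vocab = 50257   ! assumed sizes
  real(sp), allocatable :: wte(:,:)
  integer :: u
  allocate(wte(n_embd, n_vocab))
  open(newunit=u, file="weights.dat", form="unformatted", access="stream", status="old")
  read(u) wte        ! reads n_embd*n_vocab reals in column-major (array element) order
  close(u)
  print *, "first element:", wte(1,1)
end program read_raw_array
```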

My list of Fortran codes on GitHub has a Neural Networks and Machine Learning section.

1 Like

Where’s the fpm.toml file? 🙂

2 Likes

@rjzak send a PR! 🙂

2 Likes

This is very impressive! I hope this will bring a lot of attention to Fortran.

What’s the benefit of using n_embd and n_seq in the mha function as extra arguments, instead of using ubound or the extents of x? Is it just for readability, or am I missing something?

real(sp), intent(in) :: x(n_embd,n_seq)
1 Like

Oops, forgot to mention that this is erf(x/sqrt(2)) since that is what gelu needs.

1 Like

@oscardssmith here I found a case where my fast_tanh() function produces different output than tanh(): An example where the current fast_tanh() gives different results · Issue #25 · certik/fastGPT · GitHub. Both outputs look ok. How do you judge which one is better?

I assume what is happening is that the model gives probabilities for all the tokens, and if I printed them, I would find similar probabilities in both cases, just slightly different numerically (due to the tanh differences), and the “greedy” mode then selects a different token; from the probability perspective the results might still be “equivalent”. Is there a way to determine at which point the results stop being “equivalent”? What accuracy in the final token probabilities is needed?

I wonder if one can think of the reduced-precision tanh as reducing the precision of the whole model; there are other ways to do that as well, such as reducing the default 32-bit float weights to 16, 8 or even 4 bits. It must affect the final probabilities, but I wonder what good ways there are to judge the quality of the result. One way would be to compute the error function for some texts and see how much it changes under the various reduced-precision changes. Is that the way to approach it? And if it gets worse by just a few percent, it’s not a problem, but if it changes a lot, it might be?

It’s a little bit hard to tell. On GPUs these models are likely running with bfloat16 or a mixed-precision scheme, so I would think that as long as you are within 2^-10 or so you should get reasonable results, but it’s hard to say.

1 Like
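As a rough way to act on that tolerance, one could compare the final token probabilities of the full-precision and reduced-precision runs and check the largest deviation. A tiny illustrative helper (not part of fastGPT, the names are made up):

```fortran
module equiv_check
  implicit none
contains
  ! Do two probability (or logit) vectors agree within a tolerance?
  ! tol = 2.0**(-10) corresponds roughly to bfloat16 resolution.
  logical function probs_equivalent(p_ref, p_test, tol) result(ok)
    real, intent(in) :: p_ref(:), p_test(:), tol
    ok = maxval(abs(p_ref - p_test)) <= tol
  end function probs_equivalent
end module equiv_check
```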

This is a very good point. I initially just used declarations like real(sp), intent(in) :: x(:,:), but there are lots of arrays and it quickly became really hard to ensure I didn’t make a mistake. So I switched to the style real(sp), intent(in) :: x(n_embd,n_seq), where the compiler can check the compatibility of arrays (typically at runtime), and that helped a lot to catch bugs: that I multiply matrices correctly, loop over the correct bounds, etc. In Python the indices are reversed (column vs row major), so it’s really easy to get it wrong. I also find it nicer to document what each index means directly like this, rather than having it in comments. I could infer the dimensions and still declare everything, but that becomes very hard to read if you have size(x, 1) everywhere. It’s much more natural to use the problem parameters like n_embd, n_seq, or n_layers.

See my proposal here for how this can be improved: How to concisely declare a function result as a multi-dimensional automatic array? - #4 by certik
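To illustrate the difference between the two styles, here is a simplified sketch (not the actual mha routine):

```fortran
! Assumed shape: the dummy argument carries no named extents, so a caller
! passing a transposed or wrongly sized array is easy to miss.
subroutine layer_assumed(x, y)
  real, intent(in)  :: x(:,:)
  real, intent(out) :: y(:,:)
  y = 2*x
end subroutine layer_assumed

! Explicit shape with named problem parameters: the declaration itself
! documents that x is (n_embd, n_seq), and the compiler can check
! (typically at runtime, as described above) that the arguments conform.
subroutine layer_explicit(n_embd, n_seq, x, y)
  integer, intent(in) :: n_embd, n_seq
  real, intent(in)  :: x(n_embd, n_seq)
  real, intent(out) :: y(n_embd, n_seq)
  y = 2*x
end subroutine layer_explicit
```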

1 Like

Obviously … if I had read the post more carefully. Sorry for the noise.

1 Like

I created a GSoC idea for fastGPT here:

https://github.com/fortran-lang/webpage/wiki/GSoC-2023-Project-ideas#improving-fastgpt-making-it-faster-easier-to-use-and-more-general

If you are a student and interested, please let us know!

4 Likes

I implemented the input tokenizer (encoder) in Fortran, so now fastGPT is completely standalone. This made it possible to create a chat interface, which I also did. Here is an example session using the largest 1558M model:

$ ./chat
Your name is fastGPT and you are an AI bot. The user will ask you questions and you answer in a nice, truthful, short way.
User: What is the capital of Czechia?
fastGPT: Prague.
User: How many legs does a dog have?
fastGPT: Four.
User: What color does the sky have?
fastGPT: Blue.
User: What can you type a document on?
fastGPT: A typewriter.
User: What can you drive in?
fastGPT: A car.
User: What can you fly in?
fastGPT: A plane.
User: What continent is Germany in?
fastGPT: Europe.
User: When did Second World War start?
fastGPT: 1939.
User: When did it end?
fastGPT: 1945.
User: When did the U.S. enter the Second World War?
fastGPT: 1941.
User: When did the First World War start?
fastGPT: 1914.
User: When did it end?
fastGPT: 1918.
User: When did the Mexican-American war start?
fastGPT: 1846.
User: When did it end?
fastGPT: 1848.
User: What color is snow?
fastGPT: White.
User: What color do plants usually have?
fastGPT: Green.
User: What is your name?
fastGPT: fastGPT.


I think it’s actually very impressive that GPT-2, without any fine-tuning, can not only act as a chat bot but even answer all these questions correctly! All running locally, with the inference calculation in about 300 lines of Fortran.

8 Likes

Brilliant, thank you!

Readers should now know for sure that this is still barely scratching the surface of what’s possible with Fortran.

With a bit of added language support and an ever-improving ecosystem, Fortran can be among the first-choice languages for any form of computing, not merely number-crunching, with easy-to-read, good-looking syntax as well as elegant semantics.

2 Likes