fastGPT: Faster than PyTorch in 300 lines of Fortran

I would like to announce fastGPT, a fast GPT-2 inference engine written in Fortran:

I recommend reading the blog post above for background and motivation. See the README on GitHub for an example and benchmarks.

It’s pure Fortran; it’s short, readable, and most importantly: fast. On my Apple M1 it looks like it is faster than PyTorch in a fair comparison, and a lot faster if I use optimizations/backends that PyTorch doesn’t use. It also starts immediately. It is a standalone Fortran application; currently we still need Python to encode the input string into tokens, but then fastGPT takes over, generates more tokens, and converts them back to text.

It is written like any other numerical code. I think Fortran is a perfect fit, at least for GPT-2 inference, and probably for other similar ML/AI models too.
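To illustrate that point, here is a minimal sketch of causal scaled dot-product attention, the numerical core of a GPT-2 layer. It is written in plain Python with lists purely for illustration (the function name and shapes are mine, not fastGPT's); fastGPT does the equivalent work in Fortran on top of BLAS matrix multiplies:

```python
import math

def attention(Q, K, V):
    # Causal scaled dot-product attention: each position i attends
    # only to positions 0..i. Q, K, V are lists of n_seq rows of length d.
    n_seq, d = len(Q), len(Q[0])
    out = []
    for i in range(n_seq):
        # dot products with the allowed (non-masked) keys, scaled by sqrt(d)
        scores = [sum(Q[i][k] * K[j][k] for k in range(d)) / math.sqrt(d)
                  for j in range(i + 1)]
        # numerically stable softmax over the scores
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        w = [wi / z for wi in w]
        # weighted sum of the value rows
        out.append([sum(w[j] * V[j][k] for j in range(i + 1))
                    for k in range(d)])
    return out
```

Everything here is loops, dot products, and a softmax, i.e. exactly the kind of array code Fortran has always been good at.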

fastGPT is currently parallelized only via parallel OpenBLAS. It has great single-core CPU performance, which provides a solid foundation for parallelization and GPU offloading. I am hoping some of you would be interested in helping. We can try MPI, and @rouson can try coarrays. :slight_smile: I recommend approaching it like any other physics or numerical code, and let’s see how fast we can make it in parallel. This would also be a great GSoC project, covering both parallelization and making the application more user friendly (such as porting the encoder to Fortran so that we don’t need Python; see the issue tracker for more ideas).


I think between @certik, @milancurcic and @rouson , we’re going to end up with more and faster ML/AI libraries than Python. :tada:


What’s the accuracy of your fast tanh? Asking because I believe it will likely be faster to approximate erf directly. Specifically (in Julia):

function fasterf(x)
    x2 = x*x
    res = x*evalpoly(x2, (0.7975839f0, -0.13200624f0, 0.019021248f0, -0.0019748025f0, 0.00013678304f0, -5.5545797f-6, 9.853275f-8))
    return ifelse(x2 < 12.75f0, res, copysign(1f0, x))
end

The advantage is that you need fewer terms to get an accurate result.
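For anyone who wants to check this against a reference, here is a direct Python transcription (a hypothetical port; Horner evaluation stands in for Julia's `evalpoly`, and as noted later in the thread, the polynomial approximates erf(x/sqrt(2)), the form gelu needs):

```python
import math

# Coefficients copied from the Julia snippet above; the polynomial is in
# x^2, and the full expression approximates erf(x / sqrt(2)).
COEFFS = (0.7975839, -0.13200624, 0.019021248, -0.0019748025,
          0.00013678304, -5.5545797e-6, 9.853275e-8)

def fasterf(x):
    x2 = x * x
    if x2 >= 12.75:
        # the function has saturated to +/-1 at this point
        return math.copysign(1.0, x)
    # Horner evaluation of the polynomial in x2 (what evalpoly does)
    p = 0.0
    for c in reversed(COEFFS):
        p = p * x2 + c
    return x * p
```

Comparing `fasterf(x)` with `math.erf(x / math.sqrt(2))` over a few inputs confirms agreement to roughly 4 digits, consistent with the accuracy quoted later in the thread.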


Kudos @certik, very nice and interesting effort.

You’re on the right track; you’re looking at Fortran beyond the pigeonhole of number crunching into which it has been boxed by many people, including many on the WG5 and J3 committees.

Your comment, “we still need Python to encode the input string to tokens,” describes a needlessly sorry situation for Fortran. With some imagination, vision, and a bit of effort in the language, Fortran can be superior to and safer than other, far more popular approaches for string handling and encoding as well.

With your effort on LFortran, you can spark massive interest in Fortran and drive its adoption across a variety of computing domains. The sky is the limit.


@oscardssmith awesome, thanks for the tip! I haven’t checked the accuracy much; I just coded it quickly, and it seems to produce the same token results. I was wondering the same thing: how did they arrive at that odd expression involving tanh with a polynomial inside it? Then I read that it’s an approximation to erf(x). So we should approximate erf(x) directly, exactly as you posted. The only worry is that it will then differ from the tanh(x) approximation, which in turn might change the model’s answers if it was trained with the tanh(x) version; I don’t know. We have to try it and see.
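For context, the “weird expression” is GPT-2’s well-known tanh-based GELU approximation, while the exact GELU uses erf. A quick Python check (function names are mine) that the two forms agree closely:

```python
import math

def gelu_exact(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # GPT-2's tanh-based approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

# Maximum disagreement over a representative range is tiny
max_diff = max(abs(gelu_exact(0.01 * i) - gelu_tanh(0.01 * i))
               for i in range(-500, 501))
```

So switching the implementation from the tanh form to a direct erf approximation changes the activation by far less than the activation itself, which is why it is plausible (but worth verifying) that the model’s outputs stay essentially the same.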


Thanks for your encouraging comments, @FortranFan. It’s not too bad; I wrote the tokens-to-string decoder here:

It can probably still be simplified (it even does simplified UTF-8 decoding!). The encoder will be harder: essentially we need to translate this little Python file: fastGPT/ at 01eb84b015d89a567245da0445c0abb7d53a8500 · certik/fastGPT · GitHub. There is a regex in it, but I am hoping we can hand-code it. We’ll have to write lots of tests to ensure we didn’t make a mistake, but it shouldn’t be hard; I was focusing on performance first.
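For anyone curious what the encoder actually has to do: after the regex splits the text into chunks, the core of GPT-2’s byte-pair encoding is a loop that repeatedly merges the highest-priority adjacent pair of tokens. A toy Python sketch of that loop (the merge table below is made up for illustration; the real one ships with the model):

```python
def bpe_encode(word, merges):
    # merges maps an adjacent token pair to its rank (lower merges first)
    tokens = list(word)
    while len(tokens) > 1:
        # rank every adjacent pair; pairs not in the table get infinity
        ranked = [(merges.get((tokens[i], tokens[i + 1]), float("inf")), i)
                  for i in range(len(tokens) - 1)]
        rank, i = min(ranked)  # best rank, leftmost on ties
        if rank == float("inf"):
            break  # no applicable merges remain
        # replace the pair with its concatenation
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens
```

The logic itself is simple enough to hand-code in Fortran; the fiddly parts are the regex-based pre-tokenization and the byte-to-unicode mapping around it, which is where the tests will earn their keep.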


The reason people use the tanh version is that on a GPU, Nvidia has a fancy tanh builtin that you can’t match, and the exact shape doesn’t really matter, just the smoothness and rough shape (the activation functions are mostly made up anyway).


Ah, I see, now that makes sense to me. Perfect, I’ll try erf(x) and see. Focusing on the CPU, I think there are quite a few things one can do to speed this up further. For a GPU, we might need to adapt the code anyway, and possibly maintain a dedicated version; it’s just a few functions. The same goes for a parallel version.

erf is a Fortran 2008 intrinsic.

Yes, but it will be considerably slower than the approximated version, which is only accurate to about 4 digits.


Yes, and so is tanh(x), but the fast_tanh(x) in the code is a lot faster, even at full accuracy, and as @oscardssmith said, it looks like we might get away with a lower-accuracy version as well.

Nice work, @certik! I hope to talk with you soon about how we might collaborate in this area and what synergies there might be with Inference-Engine. I’m curious what file format you use. Inference-Engine uses a JSON file exported from PyTorch. I’m also investigating ONNX.


Definitely! There is also @milancurcic’s neural-fortran. We should figure out how to join forces, I think Fortran has a lot to offer in this area.

Right now the only documentation of it is the code that reads it:

and writes it:

It’s just binary array data. I think it’s actually platform-independent, except that it assumes little-endian byte order, which most platforms use these days.

My list of Fortran codes on GitHub has a Neural Networks and Machine Learning section.


Where’s the fpm.toml file? :slight_smile:


@rjzak send a PR! :slight_smile:


This is very impressive! I hope this will bring a lot of attention to Fortran.

What’s the benefit of passing n_embd and n_seq to the mha function as extra arguments, instead of using ubound or size of x? Is it just for readability, or am I missing something?

real(sp), intent(in) :: x(n_embd,n_seq)

Oops, I forgot to mention that this is erf(x/sqrt(2)), since that is what gelu needs.


@oscardssmith here I found a case where my fast_tanh() function produces different output than tanh(): An example where the current fast_tanh() gives different results · Issue #25 · certik/fastGPT · GitHub. Both outputs look ok. How do you judge which one is better?

I assume what is happening is that the model gives probabilities for all the tokens, and if I printed them, I would find similar probabilities in both cases, just slightly different numerically (due to the tanh differences); the “greedy” mode then selects a different token, but from the probability perspective the results might still be “equivalent”. Is there a way to determine at which point the results stop being “equivalent”? What accuracy in the final token probabilities is needed?

I wonder if one can think of the reduced-precision tanh as reducing the precision of the whole model. There are other ways to do that as well, such as reducing the default 32-bit float weights to 16-bit, 8-bit, or even 4-bit. It must affect the final probabilities, but I wonder what are some ways to judge the quality of the result. One way would be to compute the error function for some texts and see how much it changes under the various reduced-precision variants. Is that the way to approach it? If it gets worse by just a few percent, it’s not a problem, but if it changes a lot, it might be?
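One concrete way to quantify “equivalence” here is to compare the two probability distributions directly, e.g. via KL divergence over the token probabilities, rather than comparing the sampled tokens. A minimal Python sketch (function names are mine):

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_div(p, q):
    # KL(p || q); assumes q[i] > 0 wherever p[i] > 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

If the KL divergence between the full-precision and reduced-precision distributions stays tiny at every step, the two models are effectively sampling from the same distribution, even when greedy decoding happens to pick a different argmax near a tie.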

It’s a little bit hard to tell. On GPUs these models are likely running with bfloat16 or a mixed-precision scheme, so I would think that as long as you are within 2^-10 or so you should get reasonable results, but it’s hard to say.
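To experiment with that “within 2^-10” idea on the CPU, one can simulate bfloat16 by rounding each float32 to 8 bits of mantissa. A small Python sketch operating on the float32 bit pattern (round-to-nearest-even; NaN is not handled specially, which is fine for ordinary weights):

```python
import struct

def to_bfloat16(x):
    # Reinterpret x as float32 bits, round to the nearest bfloat16
    # (keeping only the top 16 bits), and return the result as a float.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

Mapping this over the weights (or over activations like the tanh/erf outputs) would let one measure directly how much the final token probabilities move under reduced precision.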
