Llama2.f90 - fast CPU inference on llama-style language models

This is a Fortran implementation of the llama2 model architecture used by Meta (Facebook) for their current generation of language models. It runs at speeds competitive with other optimized compiled implementations (and much faster than e.g. Python), and it is (hopefully) very easy to understand and modify for one’s own purposes. It’s not a library; it’s just a text file (still needs cleanup).

I’ve been working on this for a little while now, and it has been linked here in comments a couple of times. It started as a toy that was really just a direct port of another toy from C, but I’ve since done a lot of optimization, added quantization, and wanted to share a version that runs “real” language models. There’s still lots to do, but if you need a generative language model for something, please consider using this as a starting point. I’m happy to help if anyone wants to use it. Thanks! (Edit to mention: I’ve only tried it on an Intel CPU.)

12 Likes

I hope my question is neither trivial nor already answered in the documentation. Can your Fortran Llama2 implementation analyze .pdf files? For instance, I’d like to place numerous PDFs (such as scientific papers or programming language documentation) into a folder and then pose questions to the language model regarding the content of these files.

This feature isn’t available for free in ChatGPT, and unfortunately, Google’s Bard is quite buggy, rendering it unusable in my experience. Additionally, I have reservations about foreign companies collecting my data and making moral judgments on acceptable topics. I must admit I have a personal aversion to Microsoft, Google, and especially Facebook.

I greatly appreciate that the Fortran community offers frontends for these models (although I’m not entirely certain that “frontends” is the correct term), but to be honest, my ideal solution would be a well-documented, user-friendly, and lightweight language model that can be trained by the user and implemented in pure Fortran (or alternatively, Rust or Ada). Users would receive pure Fortran code along with comprehensive documentation explaining how to train the model on their own data. And please, no Python!

I am currently working on machine learning in Fortran, primarily processing economic data. I have no prior experience with language models from a programming perspective. If such a program doesn’t already exist, I am not ruling out the possibility of creating one in 2024.

Disclaimer: This post was originally written by a human, but the non-native English has been edited by ChatGPT.

The short answer is no.
Longer answer: the way I’ve seen what you’re describing done involves several components:

  1. Ingesting the PDF data and formatting it appropriately
  2. Dividing the data into chunks
  3. Indexing the chunks in a database, typically as sentence embeddings for semantic search
  4. A retrieval step that takes a query and returns relevant extracts from the database
  5. A language model that takes the extracts plus prompting text and formulates a response

Model inference frameworks like llama2.f90 only run the language model for step 5. There are various PDF data-extraction utilities that can be used for steps 1 and 2, though in my view and experience this is where care needs to be taken: the extraction has to be tailored to the document types and the query types you expect in order to support good answers. Naively stripping the text out of a PDF and breaking it into chunks (sketched below just to make the step concrete) doesn’t work that well.
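Just to make steps 1 and 2 concrete, here is roughly what that naive chunking baseline looks like in Fortran. It’s only a sketch: the “extracted” text is a stand-in for the output of a real PDF extraction utility, and the chunk and overlap sizes are arbitrary:

    ! Naive fixed-size chunking with overlap (the baseline the text above
    ! warns about). The text, chunk size, and overlap are all stand-ins.
    program chunker
      implicit none
      character(len=:), allocatable :: text
      integer, parameter :: chunk = 512, overlap = 64
      integer :: start

      text = repeat("extracted pdf text ", 200)   ! stand-in for step 1 output
      start = 1
      do while (start <= len(text))
         ! emit one chunk; the overlap keeps sentences from being cut in two
         print "(a)", text(start:min(start + chunk - 1, len(text)))
         start = start + chunk - overlap
      end do
    end program chunker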

For steps 3 and 4, virtually everything I’m aware of is in Python; see e.g. https://www.sbert.net/. I thought I had seen a C implementation but couldn’t find it immediately. It would be interesting to implement this in Fortran, and I don’t foresee any major technical problems: the models are already there, so it’s just a question of porting the inference code. Let me know if this is something that interests you.
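To give a flavor of step 4, here is a minimal sketch of the retrieval idea in Fortran. Everything in it is hypothetical: the random arrays stand in for sentence embeddings that a ported model would produce, and real code would load them from the index built in step 3:

    ! Rank stored chunk embeddings against a query embedding by cosine
    ! similarity and report the best match. Dimensions and data are
    ! hypothetical stand-ins.
    program retrieve
      implicit none
      integer, parameter :: emb_dim = 384, n_chunks = 1000
      real :: db(emb_dim, n_chunks), query(emb_dim), scores(n_chunks)
      integer :: i, best

      call random_number(db)      ! stand-in for precomputed chunk embeddings
      call random_number(query)   ! stand-in for the embedded user query

      do i = 1, n_chunks
         scores(i) = dot_product(query, db(:, i)) &
                     / (norm2(query) * norm2(db(:, i)))
      end do
      best = maxloc(scores, dim=1)   ! index of the most relevant chunk
      print *, "best chunk:", best, "score:", scores(best)
    end program retrieve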

A naive Python version of what you are describing is GitHub - imartinez/privateGPT: Interact privately with your documents using the power of GPT, 100% privately, no data leaks, though when I tried it I didn’t find it worked very well.

One last thing: there is an alternate “version” of what you’re describing in which the language model is fine-tuned on the custom data (in the version I describe, the model just gets prompted with extracts). That’s a different proposition; personally it’s not my preferred solution, and in any event it would require a model training framework, which for all intents and purposes currently means PyTorch.

I hope that helps; I’m happy to discuss further if it’s something that interests you.

1 Like

Thanks for your explanations, @rbitr! I wasn’t aware of these links. Certainly, an open-source project can be easily translated from Python into Fortran. However, attention must be paid to copyright issues, and I’m not entirely certain that I’m the right person for this task.

While working on some machine learning tasks involving numerical arrays in Fortran, I recently found that Fortran is significantly faster than Rust, and faster than anything else I’ve tried. Considering that Fortran is also easy to code in, I can’t find any technical reason to keep developing AI in Python. I believe that implementing all of this in Fortran would yield significant benefits.

I may have exaggerated a bit in my statements about large companies, data collection, and censorship. However, I truly wish to run my own models on my own hardware. I understand that these companies prevent their models from generating content that could be considered inappropriate, such as sexual or terrorist content. Nevertheless, I recently had a disagreement with ChatGPT when I tried to get it to generate assembly code for me: it repeatedly insisted that assembly wasn’t the right tool for the job. Therefore, either training or fine-tuning an open-source model appears to be the solution.

1 Like

@rbitr awesome, great job!

You and I discussed the performance a bit in Performance · Issue #3 · rbitr/llama2.f90 · GitHub. We can also almost compile your code with LFortran; just a few things are left.

I am happy that you found Fortran to be Simple & Hackable & Fast. I have the same experience.

2 Likes

I think AI algorithms are usually implemented in C++ with a Python API, since Python is a popular and accessible language, so the question is whether there are technical reasons to prefer Fortran to C++ for AI.

I believe there are several reasons to prefer Fortran over C++ for AI: firstly, the speed of program execution, and secondly, the ease of coding. Unfortunately, I don’t use C++ myself, as I prefer Rust, Ada, and Lua/LuaJIT. It might be interesting to start a separate thread with some benchmarks, perhaps next week. While I’m more proficient in Rust than in Fortran, I couldn’t optimize my Rust code to achieve the same execution speeds as Fortran.

If the same code takes 3 minutes to execute in Fortran and 10 minutes in another compiled language, the difference may not seem significant. However, considering that I need to perform thousands of tests during the development phase, those 7 minutes per run add up: switching to Fortran saves me an entire year of work!

Edit:

I don’t want to criticize C++, as I’m not familiar with that language. As for Python, it is neither as fast as Fortran, nor as well designed as Ada and Rust, nor as easy to use as Lua. For me, the popularity of Python is one of the greatest mysteries in the field of informatics. It certainly has some valid use cases, but I only use it for plotting (with matplotlib). If I really needed an interpreted language, I would definitely pick Lua (recently, I’ve been experimenting with AI in Lua). One of the issues with Lua, though, is that my program requires 128-bit precision, which Lua’s standard number type (a 64-bit double) can’t provide. Seriously, Fortran is the only language that works for AI, at least in my particular use case.
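For anyone curious: Fortran exposes 128-bit reals portably through iso_fortran_env (gfortran implements real128 via libquadmath on x86). A minimal check:

    ! Verify that 128-bit reals are available and see their precision.
    program quad
      use iso_fortran_env, only: real128
      implicit none
      real(real128) :: x
      x = 1.0_real128 / 3.0_real128
      print *, precision(x), "decimal digits"   ! 33 for IEEE binary128
      print *, x
    end program quad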

I would say there is not much mystery; the one thing that made Python so popular is:
pip install <you_name_it> (or conda install) … The package manager made it so easy to test any library out there that it was a no-brainer for people to jump on it. I see it even when I talk with university professors who were formerly Fortraners or C folks (umm, is there a jargon term for that? :laughing:): faced with the pressure of teaching their topics without losing days to compilers and build systems, switching was a no-brainer for them too… Oh, and let’s not forget that in C++ with templates, or in Python, anything can be anything (with the dangers that entails), so it is quite easy to test formulations without having to declare the same function N times.

I have hopes that fpm will bring Fortran back into the game, as it makes going from nothing to something very fast! And generics will be the cherry on top.

I do agree that Fortran is just as well suited for AI as any other language one could select, or maybe even better. This community is what’s needed to get the tooling and libraries in place so that anyone can get on board faster :slight_smile:

This work by @rbitr is an example of that!!

2 Likes

@piotr I ended up translating one of the sbert models I referenced above into Fortran: an embedding model for semantic search. See GitHub - rbitr/ferrite: Simple, lightweight transformers in Fortran if you are interested.

I only tried it with one model, but there are many variations that use the same architecture. All that to say, between it and the llama2.f90 LLM, it’s becoming more feasible to build a pure-Fortran conversational information retrieval system. Personally, I’ve used Python and the “transformers” package a lot for embedding generation and semantic search in the past, but I find that package has too much unnecessary abstraction. Now that the capability exists in Fortran, I’m going to try to make use of it going forward.

1 Like

I wanted to follow up here as I’ve released a new version that should be easier to run if anyone wants to try it (just make and download the model file), and that is beginning to be optimized for speed based on an ongoing discussion with @certik. It also comes with an updated name, though I may change that again.

I’m trying to make it as easy to try and as performant as possible, to get some people using it. Thanks!

4 Likes

Beautiful. Great job!

1 Like

Another update on this: the project now also supports the mamba selective state space model architecture. This is an alternative to the transformer architecture that looks promising because it scales better for long sequences. The notable part is that this Fortran implementation is (AFAIK) the only one outside of the original Python code, and it’s a minimal, easy-to-run, and aspirationally very fast version. I tried to make it simple enough that anyone can run it; hopefully it will draw some interest: llm.f90/ssm at master · rbitr/llm.f90 · GitHub
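For a flavor of the architecture, here is a heavily simplified sketch of the discretized state-space recurrence such models are built around: one channel, a diagonal A, and fixed dt, B, C (in mamba these are computed from the input, which is what makes it “selective”). This is illustrative only, not the repo’s actual code:

    ! Minimal discretized SSM scan: h_t = exp(dt*A)*h_{t-1} + dt*B*x_t,
    ! y_t = C.h_t. Simplified: in mamba, dt, B, and C vary per step.
    program ssm_scan
      implicit none
      integer, parameter :: n = 16, seq_len = 8
      real :: a(n), b(n), c(n), h(n), x(seq_len), y(seq_len), dt
      integer :: t

      call random_number(x)
      call random_number(b)
      call random_number(c)
      a = -1.0     ! stable (negative) diagonal state matrix
      dt = 0.1     ! step size; input-dependent in mamba
      h = 0.0

      do t = 1, seq_len
         h = exp(dt * a) * h + dt * b * x(t)   ! state update
         y(t) = dot_product(c, h)              ! readout
      end do

      print *, y
    end program ssm_scan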

5 Likes