Transforming LLMs for HPC Code

Scope is all you need: Transforming LLMs for HPC Code
by Tal Kadosh et al.
arXiv 18 Aug 2023
GitHub project: Tokompiler

Abstract:
With easier access to powerful compute resources, there is a growing trend in the field of AI for software development to develop larger and larger language models (LLMs) to address a variety of programming tasks. Even LLMs applied to tasks from the high-performance computing (HPC) domain are huge in size (e.g., billions of parameters) and demand expensive compute resources for training. We found this design choice confusing - why do we need large LLMs trained on natural languages and programming languages unrelated to HPC for HPC-specific tasks? In this line of work, we aim to question design choices made by existing LLMs by developing smaller LLMs for specific domains - we call them domain-specific LLMs. Specifically, we start off with HPC as a domain and propose a novel tokenizer named Tokompiler, designed specifically for preprocessing code in HPC and compilation-centric tasks. Tokompiler leverages knowledge of language primitives to generate language-oriented tokens, providing a context-aware understanding of code structure while avoiding human semantics attributed to code structures completely. We applied Tokompiler to pre-train two state-of-the-art models, SPT-Code and Polycoder, for a Fortran code corpus mined from GitHub. We evaluate the performance of these models against the conventional LLMs. Results demonstrate that Tokompiler significantly enhances code completion accuracy and semantic understanding compared to traditional tokenizers in normalized-perplexity tests, down to ~1 perplexity score. This research opens avenues for further advancements in domain-specific LLMs, catering to the unique demands of HPC and compilation tasks.
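For context on the result quoted at the end of the abstract: perplexity is the exponential of a model's average per-token negative log-likelihood, so 1 is the best possible score. The paper reports a normalized variant, presumably so that models built on different tokenizers (and hence different token counts) can be compared; the snippet below is only a sketch of the plain definition.

import math

def perplexity(token_logprobs):
    # Standard token-level perplexity: the exponential of the average
    # negative log-likelihood the model assigns to each token.  A model
    # that predicts every token with near certainty approaches the floor of 1.
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

print(perplexity([-0.01, -0.02, -0.005]))   # ~1.01, i.e. near-perfect prediction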

Several coauthors work at Intel. This paragraph from the paper explains what they are doing:

In contrast to common tokenizers, Tokompiler is a tokenization approach designed to preprocess code for language model pretraining, specifically targeting high-performance computing and compilation tasks. The Tokompiler tokenization process involves generating an anonymized version of the original code by replacing variable names, numbers, and strings; parsing this anonymized code to create an Abstract Syntax Tree (AST); updating the AST to reflect anonymization changes and maintaining a one-to-one change dictionary; converting the modified AST back into code while discarding extraneous details; splitting multi-part tokens like variable names for improved understanding; and attaching random numbers from a predefined range to recurring tokens to reduce reliance on specific replacements.
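To make that pipeline a bit more concrete, here is a toy Python sketch of the anonymization and random-suffix steps applied to one line of Fortran. It is not the authors' implementation: Tokompiler round-trips through a full AST and also splits multi-part names, while this regex version only illustrates replacing identifiers and numeric literals with anonymous, randomly numbered tokens while keeping a one-to-one change dictionary.

import random
import re

def tokompiler_like(line, seed=0):
    # Toy illustration of the Tokompiler ideas on one line of Fortran:
    # replace variable names and numeric literals with anonymous tokens,
    # keep a one-to-one change dictionary, and draw the suffix numbers at
    # random from a predefined range so the model cannot latch onto
    # specific replacement names.  The real tool works on an AST and also
    # splits multi-part names; both steps are skipped here.
    rng = random.Random(seed)
    keywords = {"real", "integer", "do", "end", "if", "then"}
    seen = {}      # original identifier/number -> its anonymous replacement
    changes = {}   # anonymous token -> original (the change dictionary)
    out = []
    for tok in re.findall(r"[A-Za-z_]\w*|\d+|::|\S", line):
        if tok.lower() in keywords:
            out.append(tok)                          # language primitive: keep
        elif tok[0].isalpha() or tok[0] == "_":      # identifier -> var_N
            if tok not in seen:
                seen[tok] = f"var_{rng.randint(1, 1000)}"
                changes[seen[tok]] = tok
            out.append(seen[tok])
        elif tok.isdigit():                          # numeric literal -> num_N
            if tok not in seen:
                seen[tok] = f"num_{rng.randint(1, 1000)}"
                changes[seen[tok]] = tok
            out.append(seen[tok])
        else:
            out.append(tok)                          # punctuation: keep
    return " ".join(out), changes

code = "REAL (dp) :: rms(in+1:ncol), sumxx, sumxy, sumyy, work(in+1:ncol)"
anon, changes = tokompiler_like(code)
print(anon)
print(changes)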

The GPT-3 tokenizer splits Fortran code differently than a Fortran-aware tokenizer would. For example, the line of code

REAL (dp) :: rms(in+1:ncol), sumxx, sumxy, sumyy, work(in+1:ncol)

from Alan Miller’s lsq.f90 is tokenized as

RE AL ( dp ) :: r ms ( i n + 1 : n col ), sum xx, sum xy, sum yy, work ( in + 1 : n col )
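For anyone who wants to reproduce this kind of splitting, the tiktoken package exposes the GPT-3-era BPE vocabulary (r50k_base); the exact split depends on the encoding chosen, so it may not match the output above character for character.

import tiktoken  # pip install tiktoken

line = "REAL (dp) :: rms(in+1:ncol), sumxx, sumxy, sumyy, work(in+1:ncol)"

# r50k_base is the BPE used by the original GPT-3 models; newer encodings
# (cl100k_base, o200k_base) split the same line differently.
enc = tiktoken.get_encoding("r50k_base")
tokens = [enc.decode([t]) for t in enc.encode(line)]
print(tokens)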

I guess the idea of the paper is that an LLM using a Fortran-specific tokenizer can do better.

The authors have another project, HPCorpus (GitHub: Scientific-Computing-Lab-NRCN/HPCorpus), which collects C, C++, and Fortran code from GitHub for LLM training. Another preprint of theirs that uses the HPCorpus data set is

Quantifying OpenMP: Statistical Insights into Usage and Adoption

Another group is using LLMs to translate Fortran to C++: Creating a Dataset for High-Performance Computing Code Translation: A Bridge Between HPC Fortran and C++
