Libraries / tools needed for creating parsers

I made a tokenizer in pure Fortran. It’s probably not very efficient, but it’s simple enough to be used to tokenize anything as it uses a single boolean function, so this is maybe the easiest part to address.
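To illustrate the idea of a tokenizer driven by a single boolean function, here is a minimal sketch in Python rather than Fortran. It is not the tokenizer described above: the names `tokenize` and `is_delimiter`, and the choice to keep non-blank delimiters as their own tokens, are assumptions made for the example.

```python
from typing import Callable, List


def tokenize(text: str, is_delimiter: Callable[[str], bool]) -> List[str]:
    """Split text into tokens; is_delimiter(ch) alone decides where tokens end."""
    tokens: List[str] = []
    current = ""
    for ch in text:
        if is_delimiter(ch):
            if current:              # close the token being accumulated
                tokens.append(current)
                current = ""
            if not ch.isspace():     # assumption: keep non-blank delimiters as tokens
                tokens.append(ch)
        else:
            current += ch
    if current:                      # flush the last token
        tokens.append(current)
    return tokens


def is_delimiter(ch: str) -> bool:
    # Hypothetical predicate: blanks and a few operator characters.
    return ch.isspace() or ch in "+-*/()=,"


print(tokenize("x = (a+b)*2", is_delimiter))
# ['x', '=', '(', 'a', '+', 'b', ')', '*', '2']
```

Swapping in a different predicate changes what counts as a token boundary without touching the loop, which is presumably what makes this kind of design easy to reuse.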

I’m also interested in having something like lark, where you write a lark grammar (which is in general way easier to write) and get a parser.

I also found Brad Richardson’s take on parser generators a good starting point for writing one.

I guess that this would require building a tree structure and then letting the user create a visiting structure, as lark does (the LFortran docs are also helpful here).
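A rough sketch of what "a tree plus a user-written visiting structure" could look like, in Python. It is only loosely in the spirit of lark's bottom-up tree processing and does not use lark's API; `Tree`, `Visitor`, `Evaluator` and the rule names `add` and `mul` are all hypothetical names for the example.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Tree:
    rule: str                                   # name of the grammar rule that built this node
    children: List[Union["Tree", str]] = field(default_factory=list)


class Visitor:
    """Walks the tree bottom-up and calls a method named after each rule."""

    def visit(self, node):
        if isinstance(node, str):               # leaves are plain token strings
            return self.token(node)
        results = [self.visit(child) for child in node.children]
        handler = getattr(self, node.rule, self.default)
        return handler(node, results)

    def token(self, text):
        return text

    def default(self, node, results):
        return results


class Evaluator(Visitor):
    """User-supplied part: one method per rule the user cares about."""

    def token(self, text):
        return float(text)

    def add(self, node, results):
        return results[0] + results[1]

    def mul(self, node, results):
        return results[0] * results[1]


tree = Tree("add", [Tree("mul", ["2", "3"]), "4"])   # represents 2*3 + 4
print(Evaluator().visit(tree))  # 10.0
```

The parser only has to produce `Tree` nodes; everything specific to the application lives in the visitor subclass.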

I have my own takes[1] on parsers, but they are not nearly stable or generic enough - I’m currently trying to improve the expression parser (writing a Precedence Climbing parser based on this work), but I keep hitting all sorts of weird runtime errors [while working with class(*), which is a must if you want to deal with derived types], so it won’t be available any time soon.
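For readers who have not seen precedence climbing, here is a minimal sketch in Python. It is not the Fortran implementation mentioned above; the operator table `BINARY`, the helper names, and the decision to evaluate directly instead of building a tree are assumptions for the example. Unary operators are left out, and the toy lexer silently skips characters it does not recognise.

```python
import operator
import re
from typing import List, Optional

# Hypothetical operator table: symbol -> (precedence, associativity, function).
BINARY = {
    "+": (1, "L", operator.add),
    "-": (1, "L", operator.sub),
    "*": (2, "L", operator.mul),
    "/": (2, "L", operator.truediv),
    "^": (3, "R", operator.pow),
}


def lex(src: str) -> List[str]:
    # Numbers, operators and parentheses only.
    return re.findall(r"\d+\.?\d*|[-+*/^()]", src)


class Parser:
    def __init__(self, tokens: List[str]) -> None:
        self.tokens = tokens
        self.pos = 0

    def peek(self) -> Optional[str]:
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self) -> Optional[str]:
        tok = self.peek()
        self.pos += 1
        return tok

    def atom(self) -> float:
        tok = self.advance()
        if tok == "(":
            value = self.expr(1)
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is None or not tok[0].isdigit():
            raise SyntaxError(f"unexpected token: {tok!r}")
        return float(tok)

    def expr(self, min_prec: int = 1) -> float:
        left = self.atom()
        while True:
            op = self.peek()
            # Stop at end of input, ')' or an operator that binds too loosely.
            if op not in BINARY or BINARY[op][0] < min_prec:
                return left
            prec, assoc, func = BINARY[op]
            self.advance()
            # Left-associative operators require a strictly tighter right side.
            right = self.expr(prec + 1 if assoc == "L" else prec)
            left = func(left, right)


def evaluate(src: str) -> float:
    parser = Parser(lex(src))
    value = parser.expr(1)
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected trailing token: {parser.peek()!r}")
    return value


print(evaluate("2 + 3 * 4"))   # 14.0
print(evaluate("2 ^ 3 ^ 2"))   # 512.0
print(evaluate("8 - 3 - 2"))   # 3.0
```

The whole algorithm is the loop in `expr`: the only state is the minimum precedence passed down the recursion, so no dynamically typed values are needed for the parsing itself.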

There are other algorithms around, like Pratt and Pika parsing, that may be easier to implement in Fortran.

4 Likes

Our tool, fpt, is written almost entirely in Fortran. The lexical analyser contains a complete tokeniser in Fortran, but with a small number of constructs deferred to the first part of static semantics - e.g. some uses of ‘.’ ‘%’ ‘:’ (a curse on alphabetic colons). Note that we handle the VMS, MPX and HP3000 extensions.

I do not know whether we can make this open source. But note that all of this is written in Fortran and the basic structure predates Fortran 90 (though this is the current style).

It can be (and has been) done.

4 Likes

There are at least three Fortran wrappers around C regular expression (RE) libraries, several of which are in @Beliavsky’s list. The first reference to regular expressions I ever saw was in RATFOR (a Fortran preprocessor/variant), so I started a homage module for that code, urbanjost/M_match on GitHub (a subset of regular expressions implemented in Fortran). I also discussed at length writing a new library using more modern methods, and the potential gains from using coarrays in the implementation, but that did not seem to garner much interest. I don’t really have time currently, but I could at least add full BRE (basic regular expression) capabilities to M_match. I personally find it useful to have a small pure Fortran implementation of BRE, for portability purposes in particular.
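To give a feel for how small a useful regular expression subset can be, here is a sketch of a compact backtracking matcher in Python. It is not the M_match code and not full BRE: it only handles literal characters, `.`, `*`, `^` and `$`, and the function names are chosen for the example.

```python
def match(pattern: str, text: str) -> bool:
    """Return True if pattern matches anywhere in text."""
    if pattern.startswith("^"):
        return match_here(pattern[1:], text)     # anchored: try position 0 only
    for start in range(len(text) + 1):           # otherwise try every start position
        if match_here(pattern, text[start:]):
            return True
    return False


def match_here(pattern: str, text: str) -> bool:
    """Return True if pattern matches at the beginning of text."""
    if not pattern:
        return True
    if len(pattern) >= 2 and pattern[1] == "*":
        return match_star(pattern[0], pattern[2:], text)
    if pattern == "$":
        return not text                          # '$' matches only at end of text
    if text and (pattern[0] == "." or pattern[0] == text[0]):
        return match_here(pattern[1:], text[1:])
    return False


def match_star(ch: str, pattern: str, text: str) -> bool:
    """Match zero or more occurrences of ch, followed by the rest of the pattern."""
    i = 0
    while True:
        if match_here(pattern, text[i:]):
            return True
        if i < len(text) and (ch == "." or ch == text[i]):
            i += 1                               # consume one more ch and retry
        else:
            return False


print(match("^ab*c$", "abbbc"))    # True
print(match("colou*r", "color"))   # True
print(match("^b", "abc"))          # False
```

Bracket expressions, grouping and intervals are what separate this from real BRE, and they are where most of the extra code would go.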

I find Fortran very capable of doing anything C does with ASCII characters. It is easy to implement C-style string functions with CHARACTER arrays. Now that Fortran has allocatable arrays and stream I/O, there are even fewer issues. I always find it a strange (but very common) statement that C is better at string manipulation than Fortran; I find the opposite. What I find useful for string-based applications in C/C++ and Python is the system interfaces, the support for stream I/O on stdin and stdout, and the larger standard libraries, not the core language capabilities. C was easier to use because streams and raw I/O were easier, and because libraries for regular expressions, POSIX routines, globbing and the like were available.

Now, unicode/wide character support is a big issue, but most RE usage is at least currently still largely ASCII or extended ASCII, I think (?)

A complaint I had for years was that using integers and/or HOLLERITH was far faster than using CHARACTER variables in a lot of FORTRAN compilers, but I find that rarely if at all with modern Fortran compilers.

It was hard to find good open source papers on modern RE techniques last time I looked. There was one in particular that I really liked and was going to use, but I lost track of it. There was a discussion about this in the Fortran stdlib issues that might be a good place to start if anyone is pursuing this.

Just to finish up the homage version, I think I will find a copy of Software Tools, 1st Edition, by Brian W. Kernighan and P. J. Plauger, which I remember as having a very early BRE Ratfor version in it. (A coworker had it. I have wanted to read that and redo the examples in modern Fortran for a long time; maybe someone else already did that?)

2 Likes

Most modern regex engines support full Unicode (and these features are commonly used). Once you move out of a white, US-centric context, the real world is Unicode: names have accents and are written in non-Latin alphabets, currency amounts aren’t always dollars, etc. In some cultural contexts you can pretend ASCII still exists, but those days are in the past.
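A small illustration of the difference, using Python's standard `re` module, where `\w` is Unicode-aware by default for `str` patterns and `re.ASCII` restricts it to ASCII. The sample text is made up for the example and is written with escapes so the code points are unambiguous.

```python
import re

# "José Müller 東京": accented Latin names plus a non-Latin script.
text = "Jos\u00e9 M\u00fcller \u6771\u4eac"

unicode_words = re.findall(r"\w+", text)                # Unicode-aware (default for str)
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)  # ASCII-only word characters

print(unicode_words)  # ['José', 'Müller', '東京']
print(ascii_words)    # ['Jos', 'M', 'ller']
```

With the ASCII-only rule the names are split at every accented letter and the Japanese word disappears entirely, which is the kind of failure an ASCII-only engine produces on real-world text.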

2 Likes

I have an LR parser generator, all in Fortran. It uses Pager’s algorithm to produce LALR where possible and extra LR states where necessary. I modified and modernized it from a code by Al Shannon and Charles Wetherell. I added generation of the extra states for Tom Pennello’s Forward Move Algorithm for error recovery. The parser generates an AST. It would benefit from more work in the error recovery area. It’s on SourceForge, or I can send a tarball.

3 Likes

As a longer-term project, I think there would be educational value in a pure Fortran parser toolkit. But for practical programming today, I’d just use a C or C++ library. Since Unicode is important, this is an interesting option:

1 Like