Libraries / tools needed for creating parsers

Seems to me what is needed is an agreed format for the list of tokens that comprise a bit of Fortran source. Diffing and some polishing (possibly live-in-editor) would then be done on that format. So, who is going to write a Fortran tokenizer in Fortran?
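To make the idea concrete, here is a minimal sketch in Python of what such a token list and a token-level diff could look like. Everything in it is an assumption for illustration: the `Token` fields, the tiny token grammar (no strings, no continuation lines, no fixed form) and the `token_diff` helper are hypothetical, not an agreed format.

```python
import difflib
import re
from typing import NamedTuple


class Token(NamedTuple):
    kind: str   # "NAME", "NUMBER", "OP" or "COMMENT"
    text: str   # spelling; names are lower-cased
    line: int   # 1-based source line


# Hypothetical, deliberately tiny token grammar for illustration only.
TOKEN_RE = re.compile(
    r"(?P<COMMENT>![^\n]*)"
    r"|(?P<NUMBER>\d+(?:\.\d*)?)"
    r"|(?P<NAME>[A-Za-z_]\w*)"
    r"|(?P<OP>::|==|/=|<=|>=|\*\*|[-+*/=(),<>%:])"
    r"|(?P<NEWLINE>\n)"
    r"|(?P<SKIP>[ \t]+)"
    r"|(?P<ERROR>.)"
)


def tokenize(source: str) -> list[Token]:
    """Turn source text into a flat list of tokens, dropping whitespace."""
    tokens = []
    line = 1
    for m in TOKEN_RE.finditer(source):
        kind, text = m.lastgroup, m.group()
        if kind == "NEWLINE":
            line += 1
        elif kind == "ERROR":
            raise ValueError(f"unexpected character {text!r} on line {line}")
        elif kind != "SKIP":
            # Fortran names are case-insensitive, so normalize them.
            tokens.append(Token(kind, text.lower() if kind == "NAME" else text, line))
    return tokens


def token_diff(old: str, new: str) -> list[tuple[str, list[str], list[str]]]:
    """Diff two sources on (kind, text) pairs, ignoring layout and line numbers."""
    a = [(t.kind, t.text) for t in tokenize(old)]
    b = [(t.kind, t.text) for t in tokenize(new)]
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, [t[1] for t in a[i1:i2]], [t[1] for t in b[j1:j2]])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

With this sketch, `token_diff("INTEGER :: n\n", "integer::n\n")` reports no changes, because only spacing and letter case differ, which is the kind of noise a formatter-aware diff would want to ignore.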

2 Likes

While I wouldn’t take the lead of such a project, I’m interested in parsers etc, however inexperienced, and would be happy to assist.

1 Like

I don’t see a reason for it to be written in Fortran, but besides that we already have a few very functional Fortran tokenizers lying around which we can use, and we can make more based on them:

  • LFortran’s parser (AST, possibly ASR) (C++)
  • fortls’s parser, which is error resistant due to its use for Language Servers (Python)
  • FORD also has a parser but I cannot speak about its ability to produce full ASTs (Python)
  • Rewriting fortls’s parser using pyparsing, which would automatically fix a bunch of existing bugs and is not that big of a task. (Python)

From my perspective the parser is the easy part, the set of rules that standardise the style would be a lot harder to create IMO.

1 Like

A reason could be that contributors don’t have to learn a second language. It’s the same with fpm being written in Fortran. I guess there are many more uses for a Fortran parser/tokenizer written in Fortran.

The problem is that Fortran simply lacks the tools necessary for easily creating parsers that are also performant. We can of course spend the time making them, and it would definitely make for a fun series of projects, but the reality is we lack the time and the funding to redesign a working solution.

Also, I am not entirely convinced that using Fortran would increase the number of contributors to the project. See for example toml-f: it is entirely written in Fortran and has a massive user base given that fpm relies on it, but the contributions still come mostly from Sebastian (see Contributors to toml-f/toml-f · GitHub).

IMO the language doesn’t matter as long as it gets the job done quickly.

5 Likes

One more option to your list maybe: the llvm-flang parser? It’s manually written in C++ and under active development.

1 Like

True, I keep forgetting about llvm-flang

TOML Fortran has a constant influx of 30 to 100 git cloners on a daily basis, if you trust the GitHub traffic statistics. Two to three of those are actually finding the repo and show up as visitors. I guess this is due to some CI relying either directly or indirectly on TOML Fortran. It is quite a luxury problem for a project to have a large user base but still be searching for actual users.

Writing a parser in any language ends up using some language subset or domain-specific language to express the actual tokenization, lexing and parsing process. I have written a couple of parsers in Fortran now, and they don’t even remotely look like Fortran IO (having a read(...) statement is usually a bug in this context), since handling the character stream is wrapped, as is transferring slices from strings to actual values. Instead you see some abstract pseudocode about obtaining tokens and extracting their values.

This will not be much different if you work in C++ (using bison, re2c, …), Python (using PLY, …) or whatever language you choose. The actual IO in a parser will be the first thing pushed to a library or abstracted in some way; the resulting parser, however, is pretty generic regardless of the language.
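As a rough illustration of that shape, here is a small Python sketch (not TOML Fortran’s actual code; the `Lexer`, `Token` and `parse_assignment` names are hypothetical). The character stream is wrapped once, slices are converted to values inside the lexer, and the parser only asks for tokens.

```python
from dataclasses import dataclass


@dataclass
class Token:
    kind: str       # "name", "integer", "real", "punct" or "eof"
    value: object   # already converted: str, int, float or None


def is_digit(c: str) -> bool:
    return "0" <= c <= "9"


class Lexer:
    """Wraps the character stream; callers never touch the raw text."""

    def __init__(self, text: str) -> None:
        self._text = text
        self._pos = 0

    def _peek(self) -> str:
        return self._text[self._pos] if self._pos < len(self._text) else ""

    def _take_while(self, pred) -> str:
        start = self._pos
        while self._pos < len(self._text) and pred(self._text[self._pos]):
            self._pos += 1
        return self._text[start:self._pos]

    def next_token(self) -> Token:
        self._take_while(str.isspace)
        ch = self._peek()
        if ch == "":
            return Token("eof", None)
        if ch.isalpha():
            name = self._take_while(lambda c: c.isalnum() or c == "_")
            return Token("name", name.lower())
        if is_digit(ch):
            digits = self._take_while(is_digit)
            if self._peek() == ".":
                self._pos += 1
                frac = self._take_while(is_digit)
                # Slice-to-value conversion happens here, not in the parser.
                return Token("real", float(f"{digits}.{frac or '0'}"))
            return Token("integer", int(digits))
        self._pos += 1
        return Token("punct", ch)


def parse_assignment(lexer: Lexer) -> tuple[str, object]:
    """Parse `name = literal` using only the token interface."""
    target = lexer.next_token()
    if target.kind != "name":
        raise SyntaxError("expected a variable name")
    equals = lexer.next_token()
    if (equals.kind, equals.value) != ("punct", "="):
        raise SyntaxError("expected '='")
    literal = lexer.next_token()
    if literal.kind not in ("integer", "real"):
        raise SyntaxError("expected a numeric literal")
    return target.value, literal.value
```

Note that `parse_assignment` contains no string indexing and no IO at all; the same structure could be written against a Fortran derived type with a `next_token` procedure.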

2 Likes

It’s more that Fortran cannot easily use the lex/yacc approach that you normally use for parsers. It doesn’t easily allow for string manipulations and it has no support for REGEX, which is a fundamental aspect of creating Parsing Expression Grammars (PEGs).

I am sure we can write libraries for all of these, maybe some already exist, and you are right in that the parser will end up being abstracted away from the Fortran IO and will look more like a standard OOP piece of code. But I don’t see a reason why we would have to go through all that when we can use existing solutions that are well tested and have big supporting communities behind them.

That is my opinion of course, but I wouldn’t sacrifice the interactivity of Python, especially during debugging, for the sake of using Fortran.

1 Like

Why not? Nothing is stopping you from using lex/yacc for generating the parser in Fortran. I’m not using it in TOML Fortran only due to the special constraint of being an fpm dependency, which requires pure Fortran without C.

I don’t think it’s a matter of feasibility; it’s definitely possible, but then again you wouldn’t be using Fortran for your parser, you would be using C or C++ and linking back to Fortran. I don’t know of a lexical analyzer or a grammar compiler that produces Fortran code. As for the regex support in Fortran, it’s mostly nonexistent.

Again, I am not saying it’s impossible, I am simply pointing out that Fortran is not necessarily the right language for string manipulations, regex, lexers and parsers. We can create awesome libraries that boost Fortran’s capabilities, but the language itself would not be capable of such tasks out of the box.

1 Like

FWIW, a REGEX module in Fortran, and not just an interface to a C library, would be very useful.

2 Likes

Yes, please!

I also think a REGEX module would be useful for the community, but I don’t currently have a use for it and all the regex I use is in either Python or TypeScript/JS, so it might be a while until I get to it. Projects that are in the pipeline for me before this are currently: Modern Fortran VS Code extension, fortls, lfortran/lpython language server, fprettify and Fortran formatting (i.e. this post), GDB Fortran interactive data visualisation, and then REGEX.

1 Like

@gnikit, can you please elaborate on the quoted comments above? Keep in mind few languages are capable “out of the box” of the tasks you list.

The dominant languages then have introduced “standard” libraries to achieve the effect which may allude to “out of the box” functionality but that’s not the case strictly speaking.

Fortran instead has intrinsic functions, and with the couple of additional ones introduced in Fortran 202Y there is a strong mindset out there, clearly among major influential members of the Fortran standard committee, that all the intrinsic capabilities any author would need will be available in the language standard itself, which will then enable efficient “string” manipulation of all kinds.

Hence it will be useful if you can explain what you meant.

Does this really matter? Whether it’s part of a standard library or incorporated by some other way in the language, if the user can easily do regex that’s enough IMO.

Maybe that will be true for plain strings but I somehow doubt that will be the case for regex. It takes a lot more than a few new intrinsic functions and some added functionality to make efficient regex. We can discuss what the main parts of a regex library in Fortran should be, but this is slightly outside the scope of this post.

At any rate, I would be more than glad for people to prove me wrong about regex in Fortran. If someone wishes to, they can try to create a PEG in Fortran, which is the type of library needed for most regex-related tasks.

@gnikit, it only matters given your own earlier statement, “We can create awesome libraries that boost Fortran’s capabilities, but the language itself would not be capable of such tasks out of the box.”

I’m still curious as to what you mean by “the language itself would not be capable of such tasks”.

Regular expressions are programmed in the C language, and the C language doesn’t even really have a character string type. All it really has is a one-byte integer type called char. In Fortran this would be an integer with a particular kind value. An array of these integers then stores a string, and there are library routines written to copy, concatenate, compare, etc. these arrays. That is how regular expressions are written in C, in terms of these low-level integer arrays. So in principle, the same thing could be done in Fortran with one-byte integer arrays, all in terms of library functions. That is not the way it should be implemented in Fortran, because Fortran does have a higher-level character data type, so that should be used instead. But as far as having sufficient capabilities within the language, it is all there.
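To illustrate how little is needed, here is a sketch in Python of a tiny matcher that works purely on lists of integer character codes, using nothing but indexing and integer comparison. It is modeled on the well-known compact matcher by Rob Pike and supports only literal characters, `.`, `*`, `^` and `$`; the function names are my own, and this is an illustration of the principle rather than a proposal for how a Fortran regex library should be built.

```python
def to_codes(s: str) -> list[int]:
    """Store a string as a list of one-byte integer codes."""
    return list(s.encode("ascii"))


DOT, STAR, CARET, DOLLAR = (ord(c) for c in ".*^$")


def match(regex: list[int], text: list[int]) -> bool:
    """Return True if regex matches anywhere in text."""
    if regex and regex[0] == CARET:
        return _match_here(regex, 1, text, 0)
    # Try every start position, including the empty tail.
    return any(_match_here(regex, 0, text, t) for t in range(len(text) + 1))


def _match_here(regex: list[int], r: int, text: list[int], t: int) -> bool:
    """Match regex[r:] at exactly position t of text."""
    if r == len(regex):
        return True
    if r + 1 < len(regex) and regex[r + 1] == STAR:
        return _match_star(regex[r], regex, r + 2, text, t)
    if regex[r] == DOLLAR and r + 1 == len(regex):
        return t == len(text)
    if t < len(text) and regex[r] in (DOT, text[t]):
        return _match_here(regex, r + 1, text, t + 1)
    return False


def _match_star(c: int, regex: list[int], r: int, text: list[int], t: int) -> bool:
    """Match zero or more occurrences of code c, then regex[r:]."""
    while True:
        if _match_here(regex, r, text, t):
            return True
        if t < len(text) and c in (DOT, text[t]):
            t += 1
        else:
            return False
```

For example, `match(to_codes("end *do"), to_codes("  end do"))` and `match(to_codes("end *do"), to_codes("enddo"))` both succeed. The same logic carries over to integer arrays or character variables in Fortran; what a full regex library adds on top (character classes, alternation, grouping, efficient compilation) is a matter of library work.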

4 Likes

There are a few in the Strings section of my list.

1 Like

Maybe we should split this thread. I propose to keep this thread for discussing the style (guide). In separate threads, we could continue the discussion on implementing PEG, regex etc. in Fortran. Meanwhile, we could try to utilize the existing tools (lexers etc.) in a way that lets us easily replace them with Fortran implementations later.

4 Likes