Interest in refactoring/replacing FORD parser

cmacmackin · September 28, 2021, 2:01pm

Hi all. I’m the original author of FORD (though no longer the maintainer). I am currently in the process of applying for a grant which would fund some work on FORD. I want to improve the current parser and semantic analysis. Currently these tasks are combined into a single (massive) Python module. The parser, in particular, is not very robust, consisting of an ad-hoc set of regular expressions. Instead I’m looking to offload the parsing and much of the semantic analysis to Flang. This will have numerous advantages:

Most of the maintenance burden for supporting new language features will be offloaded to an external project
Flang can handle fixed-form Fortran natively, while the current parser must first pass it through a third party converter
Flang is able to get precise locations within the source code, which will make extracting code listings faster and more reliable
The code to use Flang to analyse Fortran files could have other applications such as static analysis or linting

So my questions here are two-fold:

Is there anyone that would be interested in collaborating on this project? This could mean as little as helping to test the refactored version your code or discussing any design decisions that affect FORD end-users.
Can people provide examples of where FORD is or could be used on scientific code, preferably across a wide range of fields? When applying for this grant I need to demonstrate that the work will be beneficial to science. Examples from researchers based in the UK would be especially helpful, as it will be a British government grant.

awvwgk · September 28, 2021, 2:22pm

Welcome to the forum, Chris. FORD has been most helpful for me in the past to generate documention for my projects.

I’ve been using FORD for a couple of my (scientific) projects, including:

DFT-D4 (computational chemistry)
MCTC-library (computational chemistry)
https://awvwgk.github.io/simple-dftd3/ (computational chemistry)
TOML-Fortran (non-scientific)

But those are not UK based, sorry. I have more projects from the computational chemistry domain, where I would like to use FORD in the future, like xtb and a couple of other projects in the parent organization.

I know DFTB+ in principle supports generating docs with FORD and at least one of the developers is UK based (not sure if this counts).

Also, both stdlib and fpm are users of FORD.

cmacmackin · September 28, 2021, 3:29pm

Thanks. Actually, it seems I don’t need these projects to be UK based. Glad to hear it’s being used in computational chemistry; that is one of the research areas of the funding body I’m apply to. Do you know if Fortran and/or FORD are widely used in the field?

Edit: Also, any thoughts on what might make FORD easier to adopt in your other projects? Anything which the proposed changes might be able to help with?

MarDie · September 28, 2021, 3:42pm

Good news!
I actually planned to use FORD for DAMASK (damask3.mpie.de) but I did not work. If I remember correctly, it was because of lacking support for nested submodules. I hope that this information is also helpful for your proposal, let me know if you need more information.

awvwgk · September 28, 2021, 3:43pm

The package index for open source scientific Fortran projects might be a good starting point for searching projects.

everythingfunctional · September 28, 2021, 5:11pm

Hi @cmacmackin , welcome to the forum. This is good news, as I have been adding FORD documentation to several of my projects, (notably vegetables, jsonff and quaff). @rouson and I have also been using it for a couple of client projects with success.

urbanjost · September 28, 2021, 5:26pm

Virtually all of the modules listed in this index already do or are in the process of using ford(1); as well as txt2man(1). A good number use both ford(1) and doxygen(1), which might be of interest to some for purposes of a side-by-side comparison. Perhaps the AST parser in the LFortran project might be of interest as well, especially because of it’s relation to fortran-lang and the recent growth in projects using ford(1) that seems to correlate with the fortran-lang and fpm(1) projects. Although the majority of the modules are intentionally not directly scientific packages they are primarily intended for assisting in the more rapid development of analytical codes.

Since they are also all fpm(1) packages on github they are very easy to pull down and each contains (almost) a ford.md input file so they might be useful for testing with. The help text was originally designed for txt2man so it is not always quite compatible with ford(1) and doxygen(1) without some tweeking.

For quick checks I wish ford could produce a single document presentation so I would not have to click around quite so much to see how things look (might have missed it though so let me know if that is there).

Great tool. Terrific to hear it may be undergoing further development; been somewhat hesitant to use it – not being sure about the state of maintenance in the past.

A way to separate out user documentation from developer documentation would be appealing.

There have been several things that have stopped ford(1) from running that I have wished it ignored, like non-existent directories being listed in the exclude list.

I wish now I had kept a list, as there were several tweaks I needed to make to get some modules to work with ford(1) but they were mostly caused by the fact that long before ford(1)/doxygen(1) I used comments starting with !! and !*! extensively for other reasons. Great news. Looking forward to the results.

certik · September 28, 2021, 9:53pm

Hi @cmacmackin first of all, welcome to the forum and thanks for the post! Great news.

As I have already offered in private communication and a video call, I am happy to collaborate on this if you decide to use LFortran’s AST or ASR representations. LFortran’s parser can parse any free form Fortran 2018 (modulo bugs which we’ll fix quickly) and it has precise location information for each AST node (the beginning and end). Our parser is very fast, which might be helpful. We also have a fixed-form parser that can parse most of SciPy for example, but it still has some bugs to fix.

We parse all comments into AST, so you will be able to access them and extract the documentation from them.

The tokenizer and parser can be extracted from LFortran if you want to simply include the few C++ files into FORD. They compile in just a few seconds (all of LFortran compiles under 30s on my laptop).

If you decide to use LFortran, then I will do my best to ensure that things work for you.

cmacmackin · September 29, 2021, 8:48am

Thanks @certik. I did look at LFortran as well as flang and definitely appreciate the strong community around the former. Of the two it looks far easier to contribute to, should that prove necessary. I do also like how quick it is to compile (whereas flang is quite a behemoth, especially if you want to compile all the rest of LLVM at the same time). There were a few things that swung me towards flang, but perhaps you can convince me otherwise:

Last time I checked, lfortran didn’t seem to provide as good information about source locations; the prescanner normalised the file into a single stream, and then locations were given based on lines in that stream. This meant that include files and line continuation were not properly accounted for. However, perhaps that’s changed since the version I was looking at.
Flang can generate “symbol tables” even when there are semantic errors, whereas LFortran can not produce ASR under those circumstances. This is a problem, given that LFortran is still some way from supporting the entire standard (e.g., even the use of an unsupported intrinsic function will cause it to fail to produce ASR).
Flang’s “symbol tables” actually do most of the work of collecting information and performing semantic analysis that I need for FORD. It thus wouldn’t take much effort to wrap and use them. Perhaps I’m wrong, but it looks like the LFortran ASR might take a bit more work and if I can’t rely on it and thus need to analyse the AST myself then that would take a lot more work.
Comments that fall after line continuation characters are currently lost. (Although the fact comments are included in the AST is a big thing LFortran has over flang.)
Flang has an integrated preprocessor and can even work with that when it comes to locations in source. This isn’t vital for FORD (currently it just runs the code through a user-specified preprocessor before parsing it) but is definitely nice.
Frankly, for a tool like FORD, being able to analyse “most of” Fortran (even if fixed-form) isn’t good enough. It really has to be able to parse all of it. Otherwise I get bug reports.
To be honest, I found it much easier to understand the workings of flang than LFortran. The former had much more extensive developer documentation, everything in the code was clearly named, it made extensive use of modern C++ features, etc. A lot of LFortran felt more like programming in C (e.g., I couldn’t seem to wrap my head around the datastructures you used for nodes in the ASR and AST) and there weren’t many comments in the code. However, perhaps the assistance which would be readily available from the community could make up for this.

All that being said, the grant for this project wouldn’t begin until April, so perhaps some of these issues will be fixed by then or by the time I’d be creating a release of my own.

cmacmackin · September 29, 2021, 8:53am

There is a new maintainer for FORD now, so a lot of pull requests have been accepted and bugs have been fixed in the past couple of months. Not sure how much time he’ll be able to devote to feature development himself, but certainly bug fixes are being done now and he’s quick to approve PRs others have written for new features.

Please do create bug reports and feature requests on the repo.

cmacmackin · September 29, 2021, 12:28pm

It’s possible that the issue has now been fixed, as FORD received a lot of bugfixes over the past few months. However, if not, the reasons would be interesting to know. Being able to point to problems that my work will fix and how that will allow increased adoption would look good on the application.

urbanjost · September 29, 2021, 3:33pm

As I am using ford(1) extensively now that sounds like a great opportunity. I will pull down a new version first and run through a rather large collection of codes of varied styles and vintages, hopefully within this week.
One of the appeals of ford(1) is that it is quite easy to set up, especially after your first one is working so I will try to produce some useful results.

certik · September 29, 2021, 4:48pm

Thanks @cmacmackin for the feedback. Here are my comments:

Yes, the prescanner can map the continuous stream to the correct line/column, but we have not yet hooked it into error reporting or a nice API that tools like FORD could use to get the correct info. We’ll get it done soon.

I have a couple thoughts on this.

When you point FORD to a single module that depends on other modules, in theory you would have to do semantic analysis on all other modules first, for example so that you know if dp (that was used) means double precision or something else. Now in practice, one can use all kinds of heuristics to avoid having to do semantic analysis on other modules, and just deal with a single module, and just do “educated guesses”, that dp probably means double precision (who would use it for single precision?), if you use a pure function that is used in a declaration to determine the size of the return array, you just guess the declaration from the way it is used and create the proper ASR nodes for it, on a best effort basis. Yes, this can fail in theory, but probably would work really well in practice.

How exactly do Flang’s symbol tables handle this?

In LFortran, the AST->ASR conversion goes in two passes: first a symbol table structure is created, then it is filled in with functions implementations (bodies). Most of our current work is on the bodies part, as most Fortran features concern those. For FORD, it seems you don’t care about the bodies, only about the symbol tables.

If this is the main blocker, what I could do is add a mode to LFortran that only deals with the symbol tables, but ignores function bodies. The ASR would have empty “bodies”, but FORD would not care about it. And then ensure all of Fortran works in this mode. I think we are actually quite close.

Regarding errors, such as invalid Fortran code — what would you like to happen: return ASR anyway, on a best effort basis, trying to recover from the errors?

I think the “symbol table only mode” might greatly help with errors: as long as the overall structure is ok, it would never do semantic analysis of procedure bodies, where (I assume) most of the semantic errors would be.

This would help with our Python wrapper backend also, I think we also only care about the “symbol table only mode”.

CC @hsnyder.

Yes. These are a bit tricky how to best parse and represent in AST, as those currently get thrown away by the prescanner. We have to figure out a good solution for this.

How do you plan to do this with Flang? My understanding is that Flang throws away all comments. That seems like a pretty big blocker.

Yes, we are planning to implement one also.

Right. Have you discovered any issues with parsing free form to AST? I am not aware of any bugs.

The fixed form has been lower priority, since we concentrated on modern Fortran first. I just talked with @ThirumalaiShaktivel today and we’ll write a proper tokenizer for fixed-form so that we can parse all of it. (We have been focusing on modern Fortran first.)

We will be happy to add more comments. I have written documentation how it works here a little bit:

But it’s not as well detailed for developers yet, more of a general overview. The generated files are indeed C like because that was the best performance (of LFortran itself) that I was able to get (I tried C++ style inheritance as well as std::variant, etc.). However, as a developer, you do not touch the C like files, you write C++ style visitor pattern to operate on the AST or ASR, such as here:

src/lfortran/semantics/ast_body_visitor.cpp · 94bf06129614f418cbccbca71f1a07eafbb1d77b · lfortran / lfortran · GitLab

You just add a method for each AST or ASR node that you want to visit. It seems as simple as it can get. You can consult the AST.asdl files to see what member variables each node contains.

We try to limit a lot of modern features as well as heavy use of templates (we do use some) in order to ensure the whole project compiles quickly with any C++ compiler. That is very important for a good developer experience.

If you see something that is just not well designed, please definitely let us know. I know that a lot of these decisions are more of a “taste”, and a choice of particular C++ development style and I get that. I personally like the LFortran simple direct style. But if you prefer the Flang’s C++ style, then you should use Flang.

However, if the above technical reasons are more important, and if they are fixed you would use LFortran, I will prioritize them to get them fixed soon.

Let me know.

cmacmackin · September 29, 2021, 11:28pm

I’ll think more about this tomorrow (it’s very late here) but these are my initial thoughts.

Currently how this is done in FORD is it does an initial scan where it identifies names of all symbols in a file. It does this for all files in the project and then works out the order it should perform further processing based on the DAG of module imports. Flang, on the other hand, behaves like a typical compiler, writing .mod files for each module it compiles and then consulting those when performing semantic analysis on dependent modules.

I was thinking I’d structure a Flang wrapper in the following way (I assume something similar would work if using LFortran):

When I process a given file, I start by doing the parsing pass
I consult the AST to see if there are any use statements and, if so, what modules were used
I consult a data structure holding references to all currently analysed modules.
If the module is present but the file holding it has been modified since the last parse, re-parse and perform the semantic analysis again.
If the module I need is present grab a reference to its data and pass that along to the semantic analysis stage (will need to further analyse flang’s API to confirm the details).
If the module I need is not present, I continue parsing other files in the project until I find it (exact details of how I jump back to the original module TBD).
If the module can not be found anywhere, print a warning and proceed on a “best guess” basis.

This will allow the results of semantic analysis to be cached (likely to the disk as well as to memory) for reuse in later runs. This isn’t such a big deal for FORD, as parsing actually isn’t the main bottleneck, but could be very useful if this library were used for something like a language server that needs real-time analysis.

Something like that could be useful to me. Ideally I’d be able to look at bodies too, as these allow the construction of call graphs, dependency graphs, etc. However, it’s the symbol table that is most important.

Yes, that would be what I’d like. Currently in FORD my rule is that it should run successfully on anything which is standard compliant or the latest version of gfortran will compile (barring weird legacy extensions like Cray pointers). I’m not necessarily concerned with supporting non-standard features, I’m just concerned about when I could expect LFortran to be able to produce an ASR for all Fortran 2018 features.

I knew from experience that the ccls and clangd language servers can pick up documentation in comments, so I started digging into the source code to learn how they did it. Turns out in clang (which, I grant, is structured quite differently from flang) when the scanner skips over a comment during tokenization, it registers it with a data structure based on its position in the source code. Nodes in the AST then have a method which can consult this datastructure to find all adjacent comments. (There are quite a lot of implementation details I’m leaving out, but that’s the high-level view.) My plan was to submit a patch to flang to provide this functionality. One cause of concern I have, however, is the question of how such a patch would be received. Flang looks more difficult to submit contributions to than LFortran. Also, this process would undeniably be more difficult to implement than just looking at the comments in the AST in LFortran.

I don’t think I’ve seen it fail on modern Fortran. I’ve definitely seen it fail on fixed-form however and one of my goals of adopting this new parser is to make fixed-form a first-class citizen in FORD. While fixed-form is an abomination that should be consigned to the dust-bin of history, there is a lot of legacy code out there.

I do appreciate that (especially when I’m waiting for LLVM to compile).

I’ll take another look tomorrow.

Right now my biggest priority is getting the grant application written, as it is due in 2 weeks. It may be that I decide not to specify which compiler I’ll use in the backend in the proposal and wait until the project has started to make my decision—that way I can see what new features have become available in the intervening months.

certik · September 30, 2021, 4:14am

Thanks @cmacmackin for the feedback. If that would be ok with you to write the proposal in a way that allows you to evaluate Flang and LFortran later on to decide, that would be awesome.

I was thinking how to best handle the comments. That’s a good idea to register them in a separate structure, and then allow to look them up based on AST or ASR location information.

I have implemented lfortran --symtab-only in the latest git which only does the symbol table and it does not do any semantic processing on function bodies. I’ll see what it would take to support all of Fortran in this mode. This is actually a great idea, as a lot of tools would suddenly be possible, not just FORD. It seems this is the main obstacle, the other issues you probably believe can be done.

Let me create a roadmap for the --symtab-only mode.

MarDie · October 1, 2021, 1:49pm

A little bit off-topic, but maybe worth to discuss: Make FORD compatible to sphinx. This would allow to reuse a lot of nice functionality for theming etc. It could imagine that sphinx is already flexible enough to simply make FORD a plugin.
From a user’s perspective, the biggest difference would be reStructuredText vs Markdown.

awvwgk · October 1, 2021, 1:53pm

Compatibility with sphinx would be a nice feature. There is already sphinx-fortran, which makes use f2py for this purpose. Note that sphinx is not limited to rst and also supports markdown quite well.

cmacmackin · October 1, 2021, 2:07pm

That is something I’ve thought about and agree would be worth doing. Frankly, I should probably have limited FORD to being a Sphinx plugin from the start, but too late now.

That task is outside the scope of the work I’m currently proposing, but if the current project goes ahead I plan to have it store records of the analysis it has done on each file. At that point it might be worth thinking about whether FORD’s current HTML generation components should just become one possible target. It might be possible to refactor FORD to be more extensible, with front- and back-ends that can be mixed and matched.

certik · October 1, 2021, 3:36pm

The Myst Markdown supports everything that RestructuredText does: MyST - Markedly Structured Text, you can even test it yourself here: https://mystyc.herokuapp.com/, just type any RST and it will convert it to Markdown. Looks like they integrated it with Sphinx: MyST with Sphinx, so that would be my recommendation to use. I’ve written a lot using RST, but plan to switch to Myst.

zerothi · October 5, 2021, 9:26am

Actually, quite the contrary I think.

If ford can more easily get pushed forward by using a different backend perhaps transferring learned lessons to a sphinx plugin might be the way forward?
It will also offload lots of documentation parsing complexity to the sphinx backend. This was one of my problems (that encouraged my PR’s to ford) in that ford didn’t really scale to large documentations.

So perhaps now is a good/only(?) time to switch to what you suggest? Doing the sphinx interoperability as another backend might require a substantial effort!

Topic		Replies	Views
Using & Including flang in Modern Fortran VS Code Help	0	352	May 9, 2022
Libraries / tools needed for creating parsers	25	1360	September 3, 2022
Looking for parsers for fortran77 Help	8	956	October 2, 2024
Fully integrating the Flang Fortran compiler with standard MLIR (preprint)	1	333	October 4, 2024
Support for Flang effort Help	21	1385	March 23, 2023

Interest in refactoring/replacing FORD parser

Related topics