Fortran `regex` library

FedericoPerini · December 17, 2022, 6:47pm

I’m happy to announce that I’ve just released a regex library written in Modern Fortran. Contributions (especially practical test cases) are welcome!

fortran-regex

Fortran-regex is a Modern Fortran port of the tiny-regex-c library for regular expressions. It is based on the original C implementation, but the API is modelled in Fortran style, which is similar to the intrinsic index function.

API

The main API is modelled around Fortran’s index intrinsic function (which performs a simple search for a substring within a string):

   ! Simple regex
   result = REGEX(string, pattern)
   
   ! Regex with output matched pattern length
   result = REGEX(string, pattern, length)

Object-oriented interface

One can also parse a regex pattern into a type(regex_op) structure, and use that instead of a string pattern. I have no idea why this should be useful, but at least it’s given with a consistent interface

Overview

The original tiny-regex-c code has been significantly refactored, to:

Remove all references to NULL character string termination, and replace them with Fortran’s string intrinsics (len, len_trim, etc.)
Remove all C escaped characters (\n, \t, etc), replace with Fortran syntax.
Even in presence of strings, use pure elemental functions wherever possible
It is a standalone module that has no external dependencies besides compiler modules.

Example programs

! Demonstrate use of regex
program test_regex
   use regex_module
   implicit none
   
   integer :: idx,ln
   character(*), parameter :: text = 'table football'
   
   
   
   idx = REGEX(string=text,pattern='foo*',length=ln);

   ! Prints "foo"
   print *, text(idx:idx+ln-1)
   
end program


! Demonstrate use of object-oriented interface
program test_regex
   use regex_module
   implicit none
   
   integer :: idx,ln
   character(*), parameter :: text = 'table football'
   type(regex_op) :: re
   
   ! Parse pattern into a regex structure
   re = parse_pattern('foo*')
   
   idx = REGEX(string=text,pattern=re,length=ln);

   ! Prints "foo"
   print *, text(idx:idx+ln-1)
   
end program

To do list

Add a BACK optional keyword to return the last instance instead of the first.
Option to return ALL instances as an array, instead of the first/last one only.
Replace fixed-size static storage with allocatable character strings (slower?)

Reporting problems

Please report any problems! It is appreciated. The original C library had hacks to account for the fact that several special characters are read in with escaped sequences, which partially collides with the escaped sequence options in regex. So, expect the current API to be still a bit rough around the edges.

License

fortran-regex is released under the MIT license. The code it’s based upon is in the public domain.

vmagnin · December 17, 2022, 8:02pm

I have added in the fpm.toml:

[[test]]
name="runTests"
source-dir="test"
main="tests.f90"

to launch the (successful) tests with fpm test.

vmagnin · December 17, 2022, 8:15pm

It looks great!

Is it possible to avoid those warnings (I’ve got a lot of them with GFortran under Ubuntu)?

At line 646 of file ./src/regex.f90
Fortran runtime warning: An array temporary was created

FedericoPerini · December 17, 2022, 8:28pm

Thank you!

I see fpm turns on more debugging flags than I have on Debug builds. I wonder why that happens?

EDIT it looks like passing an array of the derived type causes the temporary being created. It must have to do with the structure contents,

Federico

vmagnin · December 17, 2022, 8:40pm

With the 'table football' test, I have tried successfully those patterns:

f[o]+
e f
e\s+f

but e[ ]f does not work (it found nothing).
And e[ ]+f crashes the program:

At line 75 of file test/tests.f90
Fortran runtime error: Substring out of bounds: lower bound (0) is less than one

Error termination. Backtrace:
#0  0x7fdc4de23ad0 in ???

(In line 75 I have added print *, idx, text(idx:idx+ln-1) to see the result)

FedericoPerini · December 17, 2022, 8:43pm

The reason was that input was defined polymorphic:

class(regex_pattern), intent(in) :: pattern(:)

Please submit failing cases to the github repo, so that I can work on that!

Thank you,
Federico

FedericoPerini · December 17, 2022, 9:07pm

Both fixed now!

I was relying on a fixed-size string for the character class, so clearly using len_trim would crash that case. One could have used the whitespace wildcard \s but that’s not the same, cause it also includes tabs, etc.

vmagnin · December 18, 2022, 12:13pm

Great, the latest version seems rather solid!
I tried many things, and even the following one was successful:

       character(*), parameter :: text = 'football dollars $'

       success = check_pattern(text,'\$$',expected="$")

FedericoPerini · December 18, 2022, 12:46pm

Glad that it’s starting to work, and thank you for the contributions!

I believe many more edge case issues will still have to be found, but we can keep tackling them one at a time….

vmagnin · December 18, 2022, 1:55pm

People wanting to test it can just create a fpm project: fpm new my_regex

then put that code into the app/main.f90:

program main
   use regex_module

   implicit none
   integer :: idx,ln
   character(*), parameter :: text = 'Avida Dollar$'

   idx = REGEX(string=text, pattern='[A-Z][a-z]+\$', length=ln)

   print '(I0, "|", A ,"|")', idx, text(idx:idx+ln-1)
end program main

and add those lines in the fpm.toml manifest:

[dependencies]
fortran-regex = { git = "https://github.com/perazz/fortran-regex", branch = "main" }

then just type: fpm run

and you are on the way for unknown pleasures with 100% Fortran regex!

You can test your regex with online tools like https://pythex.org/
(Frederico put the supported patterns at the beginning of https://github.com/perazz/fortran-regex/blob/main/src/regex.f90)

FortranFan · December 18, 2022, 3:14pm

@FedericoPerini ,

This is awesome, thank you! Any thoughts on making this part of Fortran stdlib?

davidk · December 18, 2022, 11:07pm

Just what I wanted. Thanks very much.

everythingfunctional · December 18, 2022, 11:24pm

I have been looking forward to a regex library to use in this line for a while now. I look forward to giving it a try.

snano · December 19, 2022, 1:45am

I tried to test this, but got an error:

fpm run
Project is up to date
At line 248 of file build/dependencies/fortran-regex/src/regex.f90
Fortran runtime error: Substring out of bounds: upper bound (3) of 'str' exceeds string length (2)

urbanjost · December 19, 2022, 2:03am

Nice.

Tried it. The results were mostly good. A few issues/questions …

Modifying a QA program from a similar project
I get out of bounds errors with

fpm test '*suite*'

but if I do not use the debug flags it runs

fpm test '*suite*' --profile release

It looks pretty good, but a few things were unexpected:

o When - is the first or last character it should be used literally,
not as part of a range

    F for regex X[-+]Y and string X-Y expected T got F getpat= 0

o dusty corner

   F for regex ^[0-9]*$ and string  expected T got F getpat= 0

o Does not appear to recognize “escape” characters and backslash. Would be nice, as
that allows easily looking for tab characters in particular.

   F for regex X\tY and string X    Y expected T got F getpat= 0
   F for regex X[\tAb]Y and string X        Y expected T got F getpat= 0
   F for regex X[\tAb]Y and string XtY expected F got T getpat= 1

FAILED: PASSED= 86 FAILED= 5

The (quicky) converted test case traceback looks like this:

At line 248 of file ./src/regex.f90
Fortran runtime error: Substring out of bounds: upper bound (3) of 'str' exceeds string length (2)

Error termination. Backtrace:
#0  0x152b09a49700 in ???
#1  0x152b09a4a259 in ???
#2  0x152b09a4a8d6 in ???
#3  0x562a367c37af in matchrange
        at ./src/regex.f90:248
#4  0x562a367c33d5 in matchcharclass
        at ./src/regex.f90:267
#5  0x562a367bf995 in __regex_module_MOD_pat_match
        at ./src/regex.f90:589
#6  0x562a367bf5d7 in matchpattern
        at ./src/regex.f90:696
#7  0x562a367bf863 in __regex_module_MOD_re_matchp
        at ./src/regex.f90:629
#8  0x562a367c2397 in __regex_module_MOD_re_match_nolength
        at ./src/regex.f90:415
#9  0x562a367bb31d in mymatch
        at test/test_suite_fortran-regex.f90:128
#10  0x562a367bc48b in test_suite_m_match
        at test/test_suite_fortran-regex.f90:32
#11  0x562a367bd8aa in main
        at test/test_suite_fortran-regex.f90:3
T for regex P*o and string FooBar expected T got T getpat= 2
T for regex Fo*o and string FooBar expected T got T getpat= 1
T for regex Po*o and string FooBar expected F got F getpat= 0
<ERROR> Execution failed for object " test_suite_fortran-regex "
<ERROR>*cmd_run*:stopping due to failed executions

For fpm, some files in example/ might be nice, eg.:

program demo_regex
use regex_module
implicit none
character(len=1024) :: line=''
character(len=:),allocatable :: argument
integer             :: ios, ln, indx, i
   call get_command_argument(1,length=ln)
   allocate(character(len=ln) :: argument)
   call get_command_argument(1,argument)
   if(argument.eq.'')stop 'missing regular expression'

   INFINITE: do i=1,huge(0)-1
      read(*,'(a)',iostat=ios)line
      if(ios.ne.0)exit INFINITE
      indx=regex(string=line,pattern=argument,length=ln)
      if (ln>0) then
         write(*,'((i6.6,":",i0,"-",i0,": ",g0))')i,indx,indx+ln-1,trim(line)
      endif
   enddo INFINITE
end program demo_regex

which would let you try it interactively easily.

Maybe the returned value should be -1 for bad regular expressions?

QA that produced traceback

program test_suite_M_match
use, intrinsic :: iso_fortran_env, only : ERROR_UNIT
use regex_module
!use M_match,   only : getpat, match, regex_pattern
!use M_match,   only : YES, ERR !, NO
implicit none
integer,parameter :: HT=9   ! horizontal tab
logical,allocatable         :: TALLY(:)

allocate(tally(0))
!    mymatch("regexp",     "String",        expected result )
call mymatch("Foo",        "FooBar",        .true.   )
call mymatch("Poo",        "FooBar",        .false.  )
call mymatch("Bar",        "FooBar",        .true.   )
call mymatch("Par",        "FooBar",        .false.  )
call mymatch("Foo",        "Foo",           .true.   )
call mymatch("Fo",         "Foo",           .true.   )
call mymatch("Foo",        "Fo",            .false.  )
call mymatch("ooB",        "FooBar",        .true.   )
call mymatch("ooP",        "FooBar",        .false.  )
call mymatch(".",          "FooBar",        .true.   )
call mymatch("P.",         "FooBar",        .false.  )
call mymatch("^Foo",       "FooBar",        .true.   )
call mymatch("^Bar",       "FooBar",        .false.  )
call mymatch("Foo$",       "FooBar",        .false.  )
call mymatch("Bar$",       "FooBar",        .true.   )
call mymatch(".*o",        "FooBar",        .true.   )
call mymatch("o*o",        "FooBar",        .true.   )
call mymatch("P*o",        "FooBar",        .true.   )
call mymatch("Fo*o",       "FooBar",        .true.   )
call mymatch("Po*o",       "FooBar",        .false.  )
call mymatch("F[po]o",     "FooBar",        .true.   )
call mymatch("F[op]o",     "FooBar",        .true.   )
call mymatch("F[qp]o",     "FooBar",        .false.  )
call mymatch("F[^po]o",    "FooBar",        .false.  )
call mymatch("F[^op]o",    "FooBar",        .false.  )
call mymatch("F[^qp]o",    "FooBar",        .true.   )
call mymatch("F[po]*o",    "FooBar",        .true.   )
call mymatch("F[56]*o",    "F5oBar",        .true.   )
call mymatch("F[46]*o",    "F5oBar",        .false.  )
call mymatch("F[46]*5",    "F5oBar",        .true.   )
call mymatch("F[46]*5o",   "F5oBar",        .true.   )
call mymatch("F[op]*o",    "FooBar",        .true.   )
call mymatch("F[qp]*o",    "FooBar",        .true.   )
call mymatch("P[qp]*o",    "FooBar",        .false.  )
call mymatch("F[^po]*o",   "FooBar",        .true.   )
call mymatch("F[^op]*o",   "FooBar",        .true.   )
call mymatch("F[^qp]*o",   "FooBar",        .true.   )
call mymatch("P[^qp]*o",   "FooBar",        .false.  )

call mymatch("[0-9][0-9]*$",  "0123456789",  .true.  )
call mymatch("[0-9][0-9]*$",  "A0123456789", .true.  )
call mymatch("^[0-9][0-9]*$", "A0123456789", .false. )
call mymatch("^[0-9][0-9]*$", "",            .false. )
call mymatch("^[0-9]$", "",                  .false. )
call mymatch("^[0-9]*$", "",                 .true.  )
call mymatch("^$", "",                        .true. )
call mymatch("^$", " ",                       .false.)
call mymatch("^[A-Z ][A-Z ]*$", "",          .false. )
call mymatch("^[ ]*[A-Z][A-Z ]*$", " THIS IS ALL UPPERCASE",    .true.   )
call mymatch("^[ ]*[a-z][a-z ]*$", " this is all lowercase",    .true.   )
call mymatch("^[ ]*[A-Z][A-Z ]*$", " THIS IS not ALL UPPERCASE",    .false.  )
call mymatch("^[ ]*[a-z][a-z ]*$", " this is NOT all lowercase",    .false.  )

! check dash in character class at beginning and end instead of in range
call mymatch("X[-+]Y", "X-Y",                        .true. )
call mymatch("X[-+]Y", "X+Y",                        .true. )
call mymatch("X[+-]Y", "X-Y",                        .true. )
call mymatch("X[+-]Y", "X+Y",                        .true. )
call mymatch("X[-+]Y", "Y-X",                        .false. )
call mymatch("X[-+]Y", "Y+X",                        .false. )
call mymatch("X[+-]Y", "Y-X",                        .false. )
call mymatch("X[+-]Y", "Y+X",                        .false. )
! tabs
call mymatch("X\tY", "X"//char(HT)//"Y",             .true. )
call mymatch("X[\tab]Y", "X"//char(HT)//"Y",         .true. )
call mymatch("X[\tab]Y", "XtY",                      .false. )
call mymatch("X[\tab]Y", "XaY",                      .true. )

call mymatch("[0-9][0-9]*\.[0-9]*",   "1.9",           .true.   )
call mymatch("[0-9][0-9]*\.[0-9]*",   "1.99",          .true.   )
call mymatch("[0-9][0-9]*\.[0-9]*",   "1.999",         .true.   )
call mymatch("[0-9][0-9]*\.[0-9]*",   "1.9999",        .true.   )
call mymatch("[0-9][0-9]*\.[0-9]*",   "1.99999",       .true.   )
call mymatch("[0-9][0-9]*\.[0-9]*",   "11.99999",      .true.   )
call mymatch("[0-9][0-9]*\.[0-9]*",   "111.99999",     .true.   )
call mymatch("[0-9][0-9]*\.[0-9]*",   "1111.99999",    .true.   )
call mymatch("[0-9][0-9]*\.[0-9]*",   "11111.99999",   .true.   )
call mymatch("[0-9][0-9]*\.[0-9]*",   "123456.99999",  .true.   )
call mymatch("^[0-9][0-9]*\.[0-9]*",  "1.9",           .true.   )
call mymatch("^[0-9][0-9]*\.[0-9]*",  "1.99",          .true.   )
call mymatch("^[0-9][0-9]*\.[0-9]*",  "1.999",         .true.   )
call mymatch("^[0-9][0-9]*\.[0-9]*",  "1.9999",        .true.   )
call mymatch("^[0-9][0-9]*\.[0-9]*",  "1.99999",       .true.   )
call mymatch("^[0-9][0-9]*\.[0-9]*",  "11.99999",      .true.   )
call mymatch("^[0-9][0-9]*\.[0-9]*",  "111.99999",     .true.   )
call mymatch("^[0-9][0-9]*\.[0-9]*",  "1111.99999",    .true.   )
call mymatch("^[0-9][0-9]*\.[0-9]*",  "11111.99999",   .true.   )
call mymatch("^[0-9][0-9]*\.[0-9]*",  "111111.99999",  .true.   )
call mymatch("a[0-9][0-9]*\.[0-9]*",  "a1.9",          .true.   )
call mymatch("a[0-9][0-9]*\.",        "a1.9",          .true.   )
call mymatch("a[0-9][0-9]*",          "a1.9",          .true.   )
call mymatch("a",                "a1.9",          .true.   )
call mymatch("\\",               "\",             .true.   )
call mymatch("\.",               "\",             .false.  )
call mymatch(".",                "\",             .true.   )
! this version closes the set -- it really should complain
!!call mymatch("F[qpo", "FooBar", .false.) ! intentional bad regex 
call mymatch("F[qpo", "FooBar", .false.) ! intentional bad regex 

! write up total results and if anything failed exit with a non-zero status
write(*,'(*(g0))')'TALLY;',tally
if(all(tally))then
   write(*,'(*(g0))')'PASSED: all ',count(tally),' tests passed '
else
   write(*,*)'FAILED: PASSED=',count(tally),' FAILED=',count(.not.tally)
   stop 4
endif

contains

subroutine mymatch(expression,string,expected)
character(len=*),intent(in) :: expression
character(len=*),intent(in) :: string
logical,intent(in)          :: expected
integer                     :: indx, ln
logical                     :: answer
   indx=regex(string,expression)
   if(indx .lt. 0) then
      write(*,'(*(g0,1x))').not.expected, 'illegal pattern for regex',expression, &
      & 'and string',string,'expected',expected,'getpat=',indx
      tally=[tally,.false.]
   else
      answer=merge(.true.,.false.,indx.gt.0)
      write(*,'(*(g0,1x))') answer .eqv. expected, 'for regex',expression,'and string',string, &
      & 'expected',expected,'got',answer,'getpat=',indx
      tally=[tally,answer.eqv.expected]
   endif

end subroutine mymatch

end program test_suite_M_match

FedericoPerini · December 19, 2022, 6:09am

@urbanjost @snano @everythingfunctional thank you for testing it.

This is all expected: the original library also has bugs, plus, I may have broken something when replacing the C pointer carousel. the only way is to keep finding them out.

Please, submit your breaking testcases to the GitHub repo as string + pattern + expected result, so I can work them out when I have some time,

Thank you,
Federico

FedericoPerini · December 19, 2022, 6:25am

My policy so far was to not use C-style syntax anymore besides the base escaped classes. C will read \t as a tab or \n for newline , but Fortran will not. So to do regex with tabs, you need to use the Fortran versions with //new_line(“”)// etc.

I’m willing to change/extend this but I would stick to being Fortran-centric, maybe those could be allowed with an extension (like cregex)?

FedericoPerini · December 19, 2022, 6:03pm

Hi @urbanjost can I ask you a question about your regex test cases?
I have used all of them for testing, now I have 161/161 passed!.

But I don’t know which answer I should expect from the last one:

call mymatch("F[qpo", "FooBar", .true.) ! intentional bad regex

I’ve tested it on PCRE2-based regex101.com and my library behaves like it (i.e., no match: invalid command). Why should be it .true. instead? I think the bracket should be escaped if one wants it to be used as a character (even so, the answer would be .false.).

urbanjost · December 19, 2022, 6:40pm

I thought I deleted that one. It varies between regex packages whether that one uses the unmatched [ as a literal character, or whether it closes the set and uses it like [pqo] or throws an error. My own silently closes the set; which I have marked as something to change. Testing with gnu grep it throws an error. I personally favor throwing an error; but all three responses are reasonable.

Topic		Replies	Views
Regular expressions (regex) and Fortran	2	2536	March 13, 2021
Project Strings: improving strings support \|\| Blog Post #3 \|\| GSoC GSoC-2021	1	411	June 14, 2021
Project Strings: improving strings support \|\| Blog Post #2 \|\| GSoC GSoC-2021	1	489	June 14, 2021
Project Strings: improving strings support \|\| Blog Post #4 \|\| GSoC GSoC-2021	1	512	June 18, 2021
Project Strings: improving strings support \|\| Blog Post #5 \|\| GSoC GSoC-2021	0	444	June 25, 2021