Get_close_matches in Fortran

nicholaswogan · February 16, 2022, 4:46pm

Looking for an implementation of Python’s get_close_matches, but in Fortran.

Thanks!

Beliavsky · February 16, 2022, 6:40pm

I don’t know of one. Here is a code from the Fortran Wiki (slightly adapted) that does wildcard matches.

module match_mod
implicit none
private
public :: match_wild
contains
elemental LOGICAL FUNCTION match_wild(pattern, string)
! compare given string for match to pattern which may
! contain wildcard characters:
! "?" matching any one character, and
! "*" matching any zero or more characters.
! Both strings may have trailing spaces which are ignored.
! Authors: Clive Page, userid: cgp  domain: le.ac.uk, 2003 (original code)
!          Rolf Sander, 2005 (bug fixes and pattern preprocessing)
! Minor bug fixed by Clive Page, 2005 Nov 29, bad comment fixed 2005 Dec 2.
! Serious bug fixed by  Robert H McClanahan, 2011 April 11th
!    This program is free software; you can redistribute it and/or modify
!    it under the terms of the GNU General Public License as published by
!    the Free Software Foundation; either version 2 of the License, or
!    (at your option) any later version.
!
!    This program is distributed in the hope that it will be useful,
!    but WITHOUT ANY WARRANTY; without even the implied warranty of
!    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
!    GNU General Public License for more details.
!
!    You should have received a copy of the GNU General Public License
!    along with this program; if not, write to the Free Software
!    Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
!    02110-1301  USA
!
   IMPLICIT NONE

   CHARACTER(LEN=*), INTENT(IN) :: pattern ! pattern may contain * and ?
   CHARACTER(LEN=*), INTENT(IN) :: string  ! string to be compared
   INTEGER :: lenp, lenp2, lens, n, p2, p, s
   INTEGER :: n_question, n_asterisk
   LOGICAL :: found

   CHARACTER(LEN=LEN(pattern)) :: pattern2
   lens = LEN_TRIM(string)
   lenp = LEN_TRIM(pattern)

! If the pattern is empty, always return true
   IF (lenp == 0) THEN
     match_wild = .TRUE.
     RETURN
   ENDIF

! The pattern must be preprocessed. All consecutive occurrences of
! one or more question marks ('?') and asterisks ('*') are sorted and
! compressed. The result is stored in pattern2.

   pattern2(:)=''
   p  = 1 ! current position in pattern
   p2 = 1 ! current position in pattern2
   DO
     IF ((pattern(p:p) == '?').OR.(pattern(p:p) == '*')) THEN
! a special character was found in the pattern
       n_question = 0
       n_asterisk = 0
       DO WHILE (p <= lenp)
         ! count the consecutive question marks and asterisks
         IF ((pattern(p:p) /= '?').AND.(pattern(p:p) /= '*')) EXIT
         IF (pattern(p:p) == '?') n_question = n_question + 1
         IF (pattern(p:p) == '*') n_asterisk = n_asterisk + 1
         p = p + 1
       ENDDO
       IF (n_question>0) THEN ! first, all the question marks
         pattern2(p2:p2+n_question-1) = REPEAT('?',n_question)
         p2 = p2 + n_question
       ENDIF
       IF (n_asterisk>0) THEN ! next, the asterisk (only one!)
         pattern2(p2:p2) = '*'
         p2 = p2 + 1
       ENDIF
     ELSE
! just a normal character
       pattern2(p2:p2) = pattern(p:p)
       p2 = p2 + 1
       p = p + 1
     ENDIF
     IF (p > lenp) EXIT
   ENDDO
!!   lenp2 = p2 - 1
   lenp2 = len_trim(pattern2)

! The modified wildcard in pattern2 is compared to the string:

   p2 = 1
   s = 1
   match_wild = .FALSE.
   DO
     IF (pattern2(p2:p2) == '?') THEN
! accept any char in string
       p2 = p2 + 1
       s = s + 1

     ELSEIF (pattern2(p2:p2) == "*") THEN
       p2 = p2 + 1
       IF (p2 > lenp2) THEN
! anything goes in rest of string
         match_wild = .TRUE.
         EXIT ! .TRUE.
       ELSE
! search string for char at p2
         n = INDEX(string(s:), pattern2(p2:p2))
         IF (n == 0) EXIT  ! .FALSE.
         s = n + s - 1
       ENDIF

     ELSEIF (pattern2(p2:p2) == string(s:s)) THEN
! single char match
       p2 = p2 + 1
       s = s + 1
     ELSE
! non-match
!       EXIT ! .FALSE.
! Previous line buggy because failure to match one character in the pattern
! does not mean that a match won't be found later. Back up through pattern string
! until first wildcard character is found and start over with the exact character
! match. If the end of the string is reached, then return .FALSE.
!      04/11/2011 Robert McClanahan    Robert.McClanahan   <<at>>   AECC.COM
!
       found = .FALSE.
       DO WHILE (p2 > 0 .AND. .NOT. found)
         p2 = p2 - 1
         IF (p2 == 0) EXIT  !  .FALSE.
         IF (pattern(p2:p2) == '*' .OR. pattern(p2:p2) == '?') found = .TRUE.
       END DO
       s = s + 1
     ENDIF

     IF (p2 > lenp2 .AND. s > lens) THEN
! end of both pattern2 and string
       match_wild = .TRUE.
       EXIT ! .TRUE.
     ENDIF

     IF (s > lens .AND. p2 == lenp) THEN
       IF(pattern2(p2:p2) == "*") THEN
! "*" at end of pattern2 represents an empty string
         match_wild = .TRUE.
         EXIT
       ENDIF
     ENDIF

     IF (p2 > lenp2 .OR. s > lens) THEN
! end of either pattern2 or string
       EXIT ! .FALSE.
     ENDIF
   ENDDO

END FUNCTION match_wild
end module match_mod

program xmatch_wild
use match_mod, only: match_wild
implicit none
print*,match_wild("boy*",["boys","girl"]) ! T F
print*,match_wild("*man",["woman","women"]) ! T F
end program xmatch_wild

everythingfunctional · February 16, 2022, 6:59pm

I could tell you what the interface would likely look like.

function get_close_matches(word, possibilities, n, cuttoff) result(matches)
  character(len=*), intent(in) :: word
  character(len=*), intent(in) :: possibilities(:)
  integer, intent(in), optional :: n
  real, intent(in), optional :: cutoff
  character(len=:), allocatable :: matches(:)
end function

The algorithm for calculating the similarity score is probably the hardest part, but once you’ve got that (and you can probably translate it from the python pretty simply), the rest shouldn’t be too hard.

urbanjost · February 16, 2022, 10:39pm

Fuzzy string searches fall into several categories. Popular algorithms
that determine the minimum number of characters required to change one
string into another are:

Levenshtein distance - words; this can become very inefficient for
long strings.
Needleman-Wunsch or Smith-Waterman - biometric sequences, but can be
applied to phrases.

There are others that look for the most words in common, and/or combined
weights of the Levenshtein distance of the words, the number of matched
words, the similiarity in length, … A good place to start might be
“Fuzzy name match algorithms”.

Are you trying to find a description based on a few keywords in a manual,
trying to find names (ie. is “Bob” similar to “Robert?” do you need to
handle vernaculars and slang? (“I have a beef”, “have a cow”, … is
that about livestock or a grievance?), looking for similar chemical
composition or protein pairs? Exactly what kind of problem are
you solving?

Just matching a word or two, like finding the correct spelling of a word in a dictionary can be
handled by something like the Levenshtein algorithm; or looking up a famous quote related to a few words; things can get very complicated very quickly if the data you are trying to match is not categorized in some way. There is a big difference between a spelling checker that gives you a suggestion and a web search engine or trying to determine if someone talking about it “being the bomb” is a teenager thinking something is a great idea or a terrorist; or trying to determine the function of a gene sequence but fuzzy matches are used in all of that; so quite a range of possibilities to guess about.

PS:

If the python functionality is all you need, for just a word or two see:

nicholaswogan · February 16, 2022, 11:49pm

Thanks for the all the feedback so far.

My specific problem isn’t that hard, I don’t think. I have a long array of chemical species:

species = ["H2O", "CO2,aq", "CH4,AQ", ...]

I have a function which, takes in a single species, and looks for a match in the list with findloc:

ind = findloc(species, "CO2", 1)

If the search fails (ind == 0), then I throw an error saying that the input species isn’t in the list. But I want to be more helpful. I want the error message to contain the “closest matches”. So in the above example, I would want the error message to say ““CO2” is not in the list of species. Closest matches are “CO2,aq”, etc…”

urbanjost · February 17, 2022, 1:01am

My guess would be to normalize the strings (ignore case, remove initial spaces, then change other alphanumerics to a space so “CO2(aq)”, “CO 2 (aq)”, “co2-aq”, “CO2,aq” would all be “co2,aq” ) and then calculate matched words in order and give those a heavy weight, and then do the edit distance using the Fortran links in the first two links (caveat emptor – did not try them); things like lower, substitute, split, the links above and weighting and then maybe requiring the first word to match or no word having a high edit distance might be where to end up, but just normalizing the string and doing the edit distance on the input string and the species trimming the dictionary to no more than one word longer than the input either starting at the beginning or at end of the first non-matching character (simpler than it sounds) would be good; if the species are all as simple as in the example or you can tell people to follow a particular syntax you could go exact match, glob match (like match routine above or using glob, then edit distance would be nice and simple. (glob, lower, split, substitute examples are all in M_strings if that is helpful).

If you use fpm(1) the dependency string is

M_strings      = { git = "https://github.com/urbanjost/M_strings.git" }

So if you already are doing the findloc and users have to follow a syntax like the compound is uppercase, comma delimited and lowercase after then you could use the wildcard/glob routine, and if no match give the three lowest edit distances; good start and you are already half there; then maybe the normalizing I mentioned, then play with weights between the measures of word match and edit distance. Probably getting too general, if your species are standardized …

Harper · February 17, 2022, 1:33am

Ignoring case is no use for chemists: CO is carbon monoxide, Co is cobalt.

urbanjost · February 17, 2022, 2:44am

Ha! I had a lapse there! Was just thinking strings.

Started a skeleton, mostly to show the pros and cons of edit distance
and to try the linked code, and how globbing might be what is wanted,
and that it would only take a few lines and then of course, things went
south. For what it is worth …

It shows nicely why using edit distance by itself can
get odd results if you do not deal with string length or weight by
matching characters (H2O is as good a match as CO2,aq!).

glob matches are : CO2,aq
best matches are  H2O                  CO2,aq              
fpm: Leaving directory '/home/urbanjs/venus/V600/github/chem'

program main
use chem, only: distance
use M_strings, only : glob
implicit none
character(len=20),parameter :: species(*) = [character(len=20) :: "H2O", "CO2,aq", "CH4,AQ"  ]
character(len=20),allocatable :: matches(:)
character(len=:),allocatable :: mystring
integer :: ind,i,ismall
integer,allocatable :: distances(:)
mystring='CO2'
FOUNDIT : block
  ind = findloc(species, "CO2", 1)
  if(ind.ne.0)then
     write(*,'(*(g0))')'found match at ',species(ind)
     !exit FOUNDIT
  endif
  matches=pack(species,[(glob(species(i),mystring//'*'),i=1,size(species))])
  if(size(matches).ne.0)then
     write(*,'(*(g0:,": "))')'glob matches are ',(trim(matches(i)),i=1,size(matches))
     !exit FOUNDIT
  endif
  distances=[(distance(mystring,species(i)),i=1,size(species))]
  ismall=minval(distances)
  ! you would want to heavily weight matching characters
  write(*,'(*(g0,1x))')'best matches are ',pack(species,distances.eq.ismall)
endblock FOUNDIT
end program main

module chem
  implicit none
  private
  public :: distance
contains
!> The Levenshtein distance function returns how many edits (deletions,
!! insertions, transposition) are required to turn one string into another.
integer function distance (a,b)
integer :: len_a, len_b, i, j, cost
character(len=*), intent(in) :: a, b
! matrix for calculating levenshtein distance
integer, dimension(0:len_trim(a), 0:len_trim(b)) :: leven_mat
integer, dimension(3) :: three_vals
   len_a = len_trim(a)
   len_b = len_trim(b)
   do i = 0,len_a
      leven_mat(i,0) = i
   enddo
   do j = 0, len_b
      leven_mat(0,j) = j
   enddo
   do i = 1, len_a
      do j = 1, len_b
         if (a(i:i) == b(j:j)) then
            cost = 0
         else
            cost = 1
         endif
         three_vals(1) = leven_mat(i-1,j) + 1
         three_vals(2) = leven_mat(i, j-1) + 1
         three_vals(3) = leven_mat(i-1,j-1) + cost
         leven_mat(i,j) = minval(three_vals)
      enddo
   enddo
   distance = int(leven_mat(len_a,len_b))
end function distance
end module chem

septc · February 17, 2022, 6:37am

If the purpose is to show a list of candidate items that contain the word “CO2” exactly, then a “minimalist” approach may be like this…?

program main
    implicit none
    character(:), allocatable :: species(:), mol
    integer, allocatable :: locs(:)
    integer :: i

    mol = "CO2"
    ! mol = "H2O"

    species = [ character(10) :: "H2O", "CO2,aq", "CH4,AQ", "[CO2,gas]", "H2O,g" ]
    locs = pack( [(i, i=1,size(species))], index(species, mol) >= 1 )

    print *, "mol             = ", mol
    print *, "species(:)      = ", species(:)
    print *, "exact-match loc = ", findloc( species, mol )
    print *, "candidate list  = ", species( locs )
    print *, "candidate locs  = ", locs
end

The output for mol = “CO2” is:

 mol             = CO2
 species(:)      = H2O       CO2,aq    CH4,AQ    [CO2,gas] H2O,g     
 exact-match loc =            0
 candidate list  = CO2,aq    [CO2,gas] 
 candidate locs  =            2           4

and for mol = “H2O”:

 mol             = H2O
 species(:)      = H2O       CO2,aq    CH4,AQ    [CO2,gas] H2O,g     
 exact-match loc =            1
 candidate list  = H2O       H2O,g     
 candidate locs  =            1           5

urbanjost · February 17, 2022, 12:30pm

That may well be the case the OP wants to solve; the fancier things here allow the user to actually put in the globbing pattern themselves, so they could input ‘CO*aq*’ for anything
starting with “CO” followed by “aq”, and the edit distances are probably what the original question about a Fortran version of the python procedure uses, which allows for misspellings and other differences which are still interesting and useful variants, but that is a good combination just using intrinsics I use a lot. The other approach often used is allowing the user to use Regular Expressions; for others with similar problems.

Topic		Replies	Views
Fortran `regex` library Announcements	18	928	December 19, 2022
Project Strings: improving strings support \|\| Blog Post #3 \|\| GSoC GSoC-2021	1	411	June 14, 2021
Regular expressions (regex) and Fortran	2	2548	March 13, 2021
Project Strings: improving strings support \|\| Blog Post #2 \|\| GSoC GSoC-2021	1	489	June 14, 2021
Project Strings: improving strings support \|\| Blog Post #4 \|\| GSoC GSoC-2021	1	512	June 18, 2021

Get_close_matches in Fortran

PS:

Related topics