Looking for an implementation of Python’s get_close_matches, but in Fortran.
Thanks!
Looking for an implementation of Python’s get_close_matches, but in Fortran.
Thanks!
I don’t know of one. Here is a code from the Fortran Wiki (slightly adapted) that does wildcard matches.
module match_mod
implicit none
private
public :: match_wild
contains
elemental LOGICAL FUNCTION match_wild(pattern, string)
! compare given string for match to pattern which may
! contain wildcard characters:
! "?" matching any one character, and
! "*" matching any zero or more characters.
! Both strings may have trailing spaces which are ignored.
! Authors: Clive Page, userid: cgp domain: le.ac.uk, 2003 (original code)
! Rolf Sander, 2005 (bug fixes and pattern preprocessing)
! Minor bug fixed by Clive Page, 2005 Nov 29, bad comment fixed 2005 Dec 2.
! Serious bug fixed by Robert H McClanahan, 2011 April 11th
! This program is free software; you can redistribute it and/or modify
! it under the terms of the GNU General Public License as published by
! the Free Software Foundation; either version 2 of the License, or
! (at your option) any later version.
!
! This program is distributed in the hope that it will be useful,
! but WITHOUT ANY WARRANTY; without even the implied warranty of
! MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
! GNU General Public License for more details.
!
! You should have received a copy of the GNU General Public License
! along with this program; if not, write to the Free Software
! Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
! 02110-1301 USA
!
IMPLICIT NONE
CHARACTER(LEN=*), INTENT(IN) :: pattern ! pattern may contain * and ?
CHARACTER(LEN=*), INTENT(IN) :: string ! string to be compared
INTEGER :: lenp, lenp2, lens, n, p2, p, s
INTEGER :: n_question, n_asterisk
LOGICAL :: found
CHARACTER(LEN=LEN(pattern)) :: pattern2
lens = LEN_TRIM(string)
lenp = LEN_TRIM(pattern)
! If the pattern is empty, always return true
IF (lenp == 0) THEN
match_wild = .TRUE.
RETURN
ENDIF
! The pattern must be preprocessed. All consecutive occurrences of
! one or more question marks ('?') and asterisks ('*') are sorted and
! compressed. The result is stored in pattern2.
pattern2(:)=''
p = 1 ! current position in pattern
p2 = 1 ! current position in pattern2
DO
IF ((pattern(p:p) == '?').OR.(pattern(p:p) == '*')) THEN
! a special character was found in the pattern
n_question = 0
n_asterisk = 0
DO WHILE (p <= lenp)
! count the consecutive question marks and asterisks
IF ((pattern(p:p) /= '?').AND.(pattern(p:p) /= '*')) EXIT
IF (pattern(p:p) == '?') n_question = n_question + 1
IF (pattern(p:p) == '*') n_asterisk = n_asterisk + 1
p = p + 1
ENDDO
IF (n_question>0) THEN ! first, all the question marks
pattern2(p2:p2+n_question-1) = REPEAT('?',n_question)
p2 = p2 + n_question
ENDIF
IF (n_asterisk>0) THEN ! next, the asterisk (only one!)
pattern2(p2:p2) = '*'
p2 = p2 + 1
ENDIF
ELSE
! just a normal character
pattern2(p2:p2) = pattern(p:p)
p2 = p2 + 1
p = p + 1
ENDIF
IF (p > lenp) EXIT
ENDDO
!! lenp2 = p2 - 1
lenp2 = len_trim(pattern2)
! The modified wildcard in pattern2 is compared to the string:
p2 = 1
s = 1
match_wild = .FALSE.
DO
IF (pattern2(p2:p2) == '?') THEN
! accept any char in string
p2 = p2 + 1
s = s + 1
ELSEIF (pattern2(p2:p2) == "*") THEN
p2 = p2 + 1
IF (p2 > lenp2) THEN
! anything goes in rest of string
match_wild = .TRUE.
EXIT ! .TRUE.
ELSE
! search string for char at p2
n = INDEX(string(s:), pattern2(p2:p2))
IF (n == 0) EXIT ! .FALSE.
s = n + s - 1
ENDIF
ELSEIF (pattern2(p2:p2) == string(s:s)) THEN
! single char match
p2 = p2 + 1
s = s + 1
ELSE
! non-match
! EXIT ! .FALSE.
! Previous line buggy because failure to match one character in the pattern
! does not mean that a match won't be found later. Back up through pattern string
! until first wildcard character is found and start over with the exact character
! match. If the end of the string is reached, then return .FALSE.
! 04/11/2011 Robert McClanahan Robert.McClanahan <<at>> AECC.COM
!
found = .FALSE.
DO WHILE (p2 > 0 .AND. .NOT. found)
p2 = p2 - 1
IF (p2 == 0) EXIT ! .FALSE.
IF (pattern(p2:p2) == '*' .OR. pattern(p2:p2) == '?') found = .TRUE.
END DO
s = s + 1
ENDIF
IF (p2 > lenp2 .AND. s > lens) THEN
! end of both pattern2 and string
match_wild = .TRUE.
EXIT ! .TRUE.
ENDIF
IF (s > lens .AND. p2 == lenp) THEN
IF(pattern2(p2:p2) == "*") THEN
! "*" at end of pattern2 represents an empty string
match_wild = .TRUE.
EXIT
ENDIF
ENDIF
IF (p2 > lenp2 .OR. s > lens) THEN
! end of either pattern2 or string
EXIT ! .FALSE.
ENDIF
ENDDO
END FUNCTION match_wild
end module match_mod
program xmatch_wild
use match_mod, only: match_wild
implicit none
print*,match_wild("boy*",["boys","girl"]) ! T F
print*,match_wild("*man",["woman","women"]) ! T F
end program xmatch_wild
I could tell you what the interface would likely look like.
function get_close_matches(word, possibilities, n, cuttoff) result(matches)
character(len=*), intent(in) :: word
character(len=*), intent(in) :: possibilities(:)
integer, intent(in), optional :: n
real, intent(in), optional :: cutoff
character(len=:), allocatable :: matches(:)
end function
The algorithm for calculating the similarity score is probably the hardest part, but once you’ve got that (and you can probably translate it from the python pretty simply), the rest shouldn’t be too hard.
Fuzzy string searches fall into several categories. Popular algorithms
that determine the minimum number of characters required to change one
string into another are:
There are others that look for the most words in common, and/or combined
weights of the Levenshtein distance of the words, the number of matched
words, the similiarity in length, … A good place to start might be
“Fuzzy name match algorithms”.
Are you trying to find a description based on a few keywords in a manual,
trying to find names (ie. is “Bob” similar to “Robert?” do you need to
handle vernaculars and slang? (“I have a beef”, “have a cow”, … is
that about livestock or a grievance?), looking for similar chemical
composition or protein pairs? Exactly what kind of problem are
you solving?
Just matching a word or two, like finding the correct spelling of a word in a dictionary can be
handled by something like the Levenshtein algorithm; or looking up a famous quote related to a few words; things can get very complicated very quickly if the data you are trying to match is not categorized in some way. There is a big difference between a spelling checker that gives you a suggestion and a web search engine or trying to determine if someone talking about it “being the bomb” is a teenager thinking something is a great idea or a terrorist; or trying to determine the function of a gene sequence but fuzzy matches are used in all of that; so quite a range of possibilities to guess about.
If the python functionality is all you need, for just a word or two see:
Thanks for the all the feedback so far.
My specific problem isn’t that hard, I don’t think. I have a long array of chemical species:
species = ["H2O", "CO2,aq", "CH4,AQ", ...]
I have a function which, takes in a single species, and looks for a match in the list with findloc
:
ind = findloc(species, "CO2", 1)
If the search fails (ind == 0
), then I throw an error saying that the input species isn’t in the list. But I want to be more helpful. I want the error message to contain the “closest matches”. So in the above example, I would want the error message to say ““CO2” is not in the list of species. Closest matches are “CO2,aq”, etc…”
My guess would be to normalize the strings (ignore case, remove initial spaces, then change other alphanumerics to a space so “CO2(aq)”, “CO 2 (aq)”, “co2-aq”, “CO2,aq” would all be “co2,aq” ) and then calculate matched words in order and give those a heavy weight, and then do the edit distance using the Fortran links in the first two links (caveat emptor – did not try them); things like lower, substitute, split, the links above and weighting and then maybe requiring the first word to match or no word having a high edit distance might be where to end up, but just normalizing the string and doing the edit distance on the input string and the species trimming the dictionary to no more than one word longer than the input either starting at the beginning or at end of the first non-matching character (simpler than it sounds) would be good; if the species are all as simple as in the example or you can tell people to follow a particular syntax you could go exact match, glob match (like match routine above or using glob, then edit distance would be nice and simple. (glob, lower, split, substitute examples are all in M_strings if that is helpful).
If you use fpm(1) the dependency string is
M_strings = { git = "https://github.com/urbanjost/M_strings.git" }
So if you already are doing the findloc and users have to follow a syntax like the compound is uppercase, comma delimited and lowercase after then you could use the wildcard/glob routine, and if no match give the three lowest edit distances; good start and you are already half there; then maybe the normalizing I mentioned, then play with weights between the measures of word match and edit distance. Probably getting too general, if your species are standardized …
Ignoring case is no use for chemists: CO is carbon monoxide, Co is cobalt.
Ha! I had a lapse there! Was just thinking strings.
Started a skeleton, mostly to show the pros and cons of edit distance
and to try the linked code, and how globbing might be what is wanted,
and that it would only take a few lines and then of course, things went
south. For what it is worth …
It shows nicely why using edit distance by itself can
get odd results if you do not deal with string length or weight by
matching characters (H2O is as good a match as CO2,aq!).
glob matches are : CO2,aq
best matches are H2O CO2,aq
fpm: Leaving directory '/home/urbanjs/venus/V600/github/chem'
program main
use chem, only: distance
use M_strings, only : glob
implicit none
character(len=20),parameter :: species(*) = [character(len=20) :: "H2O", "CO2,aq", "CH4,AQ" ]
character(len=20),allocatable :: matches(:)
character(len=:),allocatable :: mystring
integer :: ind,i,ismall
integer,allocatable :: distances(:)
mystring='CO2'
FOUNDIT : block
ind = findloc(species, "CO2", 1)
if(ind.ne.0)then
write(*,'(*(g0))')'found match at ',species(ind)
!exit FOUNDIT
endif
matches=pack(species,[(glob(species(i),mystring//'*'),i=1,size(species))])
if(size(matches).ne.0)then
write(*,'(*(g0:,": "))')'glob matches are ',(trim(matches(i)),i=1,size(matches))
!exit FOUNDIT
endif
distances=[(distance(mystring,species(i)),i=1,size(species))]
ismall=minval(distances)
! you would want to heavily weight matching characters
write(*,'(*(g0,1x))')'best matches are ',pack(species,distances.eq.ismall)
endblock FOUNDIT
end program main
module chem
implicit none
private
public :: distance
contains
!> The Levenshtein distance function returns how many edits (deletions,
!! insertions, transposition) are required to turn one string into another.
integer function distance (a,b)
integer :: len_a, len_b, i, j, cost
character(len=*), intent(in) :: a, b
! matrix for calculating levenshtein distance
integer, dimension(0:len_trim(a), 0:len_trim(b)) :: leven_mat
integer, dimension(3) :: three_vals
len_a = len_trim(a)
len_b = len_trim(b)
do i = 0,len_a
leven_mat(i,0) = i
enddo
do j = 0, len_b
leven_mat(0,j) = j
enddo
do i = 1, len_a
do j = 1, len_b
if (a(i:i) == b(j:j)) then
cost = 0
else
cost = 1
endif
three_vals(1) = leven_mat(i-1,j) + 1
three_vals(2) = leven_mat(i, j-1) + 1
three_vals(3) = leven_mat(i-1,j-1) + cost
leven_mat(i,j) = minval(three_vals)
enddo
enddo
distance = int(leven_mat(len_a,len_b))
end function distance
end module chem
If the purpose is to show a list of candidate items that contain the word “CO2” exactly, then a “minimalist” approach may be like this…?
program main
implicit none
character(:), allocatable :: species(:), mol
integer, allocatable :: locs(:)
integer :: i
mol = "CO2"
! mol = "H2O"
species = [ character(10) :: "H2O", "CO2,aq", "CH4,AQ", "[CO2,gas]", "H2O,g" ]
locs = pack( [(i, i=1,size(species))], index(species, mol) >= 1 )
print *, "mol = ", mol
print *, "species(:) = ", species(:)
print *, "exact-match loc = ", findloc( species, mol )
print *, "candidate list = ", species( locs )
print *, "candidate locs = ", locs
end
The output for mol = “CO2” is:
mol = CO2
species(:) = H2O CO2,aq CH4,AQ [CO2,gas] H2O,g
exact-match loc = 0
candidate list = CO2,aq [CO2,gas]
candidate locs = 2 4
and for mol = “H2O”:
mol = H2O
species(:) = H2O CO2,aq CH4,AQ [CO2,gas] H2O,g
exact-match loc = 1
candidate list = H2O H2O,g
candidate locs = 1 5
That may well be the case the OP wants to solve; the fancier things here allow the user to actually put in the globbing pattern themselves, so they could input ‘CO*aq*’ for anything
starting with “CO” followed by “aq”, and the edit distances are probably what the original question about a Fortran version of the python procedure uses, which allows for misspellings and other differences which are still interesting and useful variants, but that is a good combination just using intrinsics I use a lot. The other approach often used is allowing the user to use Regular Expressions; for others with similar problems.