Character validation functions

Hello everyone,

A month ago I created a set of routines for validating characters, similar to the functions available in the <ctype.h> header of the C standard library.

I tested different Fortran implementations, all of which can be found in my GitHub repository: https://github.com/ivan-pi/fortran-ascii

The different approaches I found are:

  1. A direct approach relying on comparison operators (currently available in stdlib):
pure logical function is_alphanum(c)
  character(len=1), intent(in) :: c
  is_alphanum = (c >= '0' .and. c <= '9') .or. (c >= 'a' .and. c <= 'z') &
      .or. (c >= 'A' .and. c <= 'Z')
end function
  2. Select case statements:
pure logical function is_alphanum(c)
  character(len=1), intent(in) :: c
  select case(iachar(c))
    case (48:57,65:90,97:122) ! 0 .. 9, A .. Z, a .. z
      is_alphanum = .true.
    case default
      is_alphanum = .false.
  end select
end function
  3. Lookup table (this was the most fun to program):
pure logical function is_alpha(c)
  character(len=1), intent(in) :: c
  is_alpha = btest(table(iachar(c,i8)),2)
end function
  4. Interfacing to the C standard library (this turned out to be the slowest):
pure logical function is_alphanum(c)
  character(len=1), intent(in) :: c
  is_alphanum = isalnum(iachar(c,c_int)) /= 0
end function
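
The fourth variant only compiles if an explicit interface to the C function is in scope. A minimal sketch of what that could look like is given below. The module name cctype_sketch and the wrapper name is_alphanum_c are made up for illustration, and declaring the C function as pure is an assumption made here so that it can be called from a pure Fortran function; the repository may do this differently.

```fortran
module cctype_sketch
  use, intrinsic :: iso_c_binding, only: c_int
  implicit none
  private
  public :: is_alphanum_c

  interface
    ! C prototype: int isalnum(int c); from <ctype.h>.
    ! Marking it pure is an assumption of this sketch.
    pure function isalnum(c) bind(c, name='isalnum') result(res)
      import :: c_int
      integer(c_int), intent(in), value :: c
      integer(c_int) :: res
    end function isalnum
  end interface

contains

  ! Hypothetical wrapper, named differently from the original to avoid confusion
  pure logical function is_alphanum_c(c)
    character(len=1), intent(in) :: c
    ! iachar gives the ASCII code; C returns nonzero for "true"
    is_alphanum_c = isalnum(iachar(c, c_int)) /= 0
  end function is_alphanum_c

end module cctype_sketch
```

Every call goes through the C library, so the compiler generally cannot inline the test, which may be part of the reason this variant is slower.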

There turned out to be measurable differences between the various approaches.

I only tested this with the gfortran compiler. For some of the character validation routines, Fortran was able to match C++ or at least reach 80% of its speed. Since this is essentially a micro-benchmarking problem, there is some uncertainty in the results. I’ve also posted these benchmark results in an open issue in the stdlib repository.

If you have any suggestions or ideas on how to make the Fortran timings more consistent, how to interpret the differences, or how to improve the accuracy of the measurements, please let me know.
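
For reference, one common way to reduce noise in this kind of measurement is to repeat the timed region several times and keep the fastest run. The sketch below illustrates that idea only; it is not the harness used for the results above. The module name timing_sketch and the routines count_alphanum and best_time are hypothetical.

```fortran
module timing_sketch
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  private
  public :: count_alphanum, best_time

contains

  ! Routine under test (the select case variant)
  pure logical function is_alphanum(c)
    character(len=1), intent(in) :: c
    select case (iachar(c))
    case (48:57, 65:90, 97:122) ! 0 .. 9, A .. Z, a .. z
      is_alphanum = .true.
    case default
      is_alphanum = .false.
    end select
  end function is_alphanum

  ! Number of alphanumeric characters in s
  pure function count_alphanum(s) result(n)
    character(len=*), intent(in) :: s
    integer :: n, i
    n = 0
    do i = 1, len(s)
      if (is_alphanum(s(i:i))) n = n + 1
    end do
  end function count_alphanum

  ! Fastest wall-clock time (in seconds) out of nrep passes over s
  function best_time(s, nrep, hits) result(seconds)
    character(len=*), intent(in) :: s
    integer, intent(in) :: nrep
    integer, intent(out), optional :: hits
    real(real64) :: seconds
    integer(int64) :: t0, t1, rate
    integer :: k, n
    call system_clock(count_rate=rate)
    seconds = huge(seconds)
    n = 0
    do k = 1, nrep
      call system_clock(t0)
      n = count_alphanum(s)
      call system_clock(t1)
      seconds = min(seconds, real(t1 - t0, real64) / real(rate, real64))
    end do
    ! Returning the count gives the caller a way to use the result
    if (present(hits)) hits = n
  end function best_time

end module timing_sketch
```

The input string should be long enough that one pass takes much longer than the clock resolution. An optimizing compiler may still hoist or simplify the pure call, so the generated code is worth checking before trusting the numbers.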

Thanks, Ivan, this is a nice comparison. I don’t have any further insight than what you already wrote.

However, I’m wondering whether there is additional penalty to this variant:

pure logical function is_alphanum(c)
  character(len=1), intent(in) :: c
  is_alphanum = is_alpha(c) .or. is_digit(c)
end function

I think that I suggested doing this in the original PR for readability, but it’s worth knowing if it incurs any added penalty.

Naive question: is the first solution more standard? (It only assumes that the characters have a logical ordering, whatever encoding is used.)

After all, Fortran was born before ASCII. But I know next to nothing about character encoding in Fortran. Does the Fortran standard say anything about ASCII?

I did some further testing, and the compiler flags -mtune=native or -march=native might change the timings a little bit. With the Intel PS Compilers, Fortran actually is a little bit faster than C++. But since I am not a computer scientist, I cannot say if my benchmarking methods are 100% correct. Concerning your second question, I will come back to you after doing some measurements.

@vmagnin concerning the ordering of characters, the standard does say that a conforming processor must provide a character collating sequence that satisfies the following conditions (M&R, 2018):

  • A is less than B is less than C … is less than Y is less than Z;
  • a is less than b is less than c … is less than y is less than z;
  • 0 is less than 1 is less than 2 … is less than 8 is less than 9;
  • blank is less than A and Z is less than 0, or blank is less than 0 and 9 is less than A;
  • blank is less than a and z is less than 0, or blank is less than 0 and 9 is less than a.

Thus, we see that there is no rule about whether the numerals precede or succeed the letters, nor about the position of any of the special characters or the underscore, apart from the rule that blank precedes both partial sequences.
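
To see which of the permitted orderings a given processor actually uses, the default collating sequence can be probed directly. This is a small sketch with made-up names (collating_sketch, digits_before_letters, letters_are_contiguous); it relies only on the comparison operators and ichar, both of which follow the processor's own collating sequence rather than ASCII.

```fortran
module collating_sketch
  implicit none
  private
  public :: digits_before_letters, letters_are_contiguous

contains

  ! True when the default collating sequence places 0..9 before A..Z
  pure logical function digits_before_letters()
    digits_before_letters = '9' < 'A'
  end function digits_before_letters

  ! True when A..Z and a..z each occupy 26 consecutive positions.
  ! The standard only guarantees the order, not the absence of gaps.
  pure logical function letters_are_contiguous()
    letters_are_contiguous = ichar('Z') - ichar('A') == 25 .and. &
                             ichar('z') - ichar('a') == 25
  end function letters_are_contiguous

end module collating_sketch
```

If letters_are_contiguous() returns false, a range test such as c >= 'a' .and. c <= 'z' can also accept characters that are not letters.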

Concerning ASCII, I will borrow the reply posted by William Clodius in one of my issues on GitHub:

The DEFAULT character kind is guaranteed to contain all the characters of the Fortran character set, which is all the printable characters of ASCII. It says nothing about the control codes or the order of printable characters in the character set. The order dependence for the printable characters can be consistently worked around by using ACHAR and IACHAR. In practice, the default character set is a mapping to the system’s internal character set, which is UTF-8 on Linux, UTF-16 on Windows, and Mac Roman(?) on the Macintosh. All map to ASCII for code points 0:127. The Chinese and Japanese computers tend to use national character sets that map to ASCII for 0:127. I don’t know if the code set is well defined for Berkeley Unix, but the ones I know use the Latin character sets, which also map to ASCII for code points 0:127. I don’t know what they use in India, but I would be very surprised if their character sets didn’t also map to ASCII for 0:127. The only computers I know of that don’t map to ASCII in code points 0:127 are those using EBCDIC(?), mostly IBM mainframes. EBCDIC actually comprises a variety of character sets, with the specific active one being context dependent. The XL Fortran compiler, https://www.ibm.com/support/knowledgecenter/SS2MB5_14.1.0/com.ibm.xlf141.bg.doc/language_ref/asciit.html, appears to use an EBCDIC character set with equivalents to all the ASCII control characters.

The standard acknowledges that ASCII exists, and there are intrinsics that convert between integer and character according to the ASCII collating sequence (ACHAR and IACHAR), but there is no requirement that the default character kind be ASCII.

The first solution would fail if the default character kind were EBCDIC, which has gaps in the alphabetic collating sequence.

Ouch, I did not realize EBCDIC has gaps between the letters of the alphabet. I guess this is sufficient reason to abandon the first solution completely, and rely on a portable solution with the iachar intrinsic instead.
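
A portable version of the first approach could compare ASCII codes obtained from iachar instead of comparing the characters themselves. The sketch below assumes nothing beyond the intrinsic; the module name portable_alphanum_sketch is made up. In EBCDIC, for example, the codes for 'i' and 'j' are not adjacent, so the plain range test on characters would also accept the non-letters in between, whereas the integer comparison below always refers to ASCII positions.

```fortran
module portable_alphanum_sketch
  implicit none
  private
  public :: is_alphanum

contains

  pure logical function is_alphanum(c)
    character(len=1), intent(in) :: c
    integer :: ic
    ! Position of c in the ASCII collating sequence,
    ! independent of the default character kind's own ordering
    ic = iachar(c)
    is_alphanum = (ic >= iachar('0') .and. ic <= iachar('9')) .or. &
                  (ic >= iachar('a') .and. ic <= iachar('z')) .or. &
                  (ic >= iachar('A') .and. ic <= iachar('Z'))
  end function is_alphanum

end module portable_alphanum_sketch
```

The intrinsics lge, lgt, lle and llt would be an alternative, since they also compare according to the ASCII collating sequence.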

The way the lookup table works is that I use iachar(c,i8) to get an eight-bit integer. The lookup table is declared as integer(i16) :: table(-128:127), and the bits in the elements 0:127 encode various character properties.
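
To make the idea concrete, here is a reduced sketch of such a table with only two properties. The bit used for "alpha" (2) matches the is_alpha snippet earlier in the thread; the bit chosen for "digit", the module name ascii_table_sketch, the mapping of i8 and i16 to int8 and int16, and the way the table is built in a constant expression are all assumptions made for illustration and may differ from the repository.

```fortran
module ascii_table_sketch
  use, intrinsic :: iso_fortran_env, only: i8 => int8, i16 => int16
  implicit none
  private
  public :: is_alpha, is_digit, is_alphanum

  ! Bit positions; bit_digit is a hypothetical choice
  integer, parameter :: bit_digit = 1
  integer, parameter :: bit_alpha = 2
  integer(i16), parameter :: mask_alnum = ibset(ibset(0_i16, bit_digit), bit_alpha)

  integer :: i ! implied-do index used in the constructor below

  ! One entry per eight-bit code; only 0:127 carry property bits
  integer(i16), parameter :: table(-128:127) = [( &
      merge(ibset(0_i16, bit_alpha), 0_i16, &
            (i >= 65 .and. i <= 90) .or. (i >= 97 .and. i <= 122)) &
    + merge(ibset(0_i16, bit_digit), 0_i16, i >= 48 .and. i <= 57), &
      i = -128, 127)]

contains

  pure logical function is_alpha(c)
    character(len=1), intent(in) :: c
    is_alpha = btest(table(iachar(c, i8)), bit_alpha)
  end function is_alpha

  pure logical function is_digit(c)
    character(len=1), intent(in) :: c
    is_digit = btest(table(iachar(c, i8)), bit_digit)
  end function is_digit

  ! A combined property is a single mask test on the same table entry
  pure logical function is_alphanum(c)
    character(len=1), intent(in) :: c
    is_alphanum = iand(table(iachar(c, i8)), mask_alnum) /= 0_i16
  end function is_alphanum

end module ascii_table_sketch
```

With this layout each validation routine is one array load plus one bit operation, and further properties can be added by assigning them more bits in the same table.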
