Extended ASCII characters and `len`

Inspired by https://fortran-lang.discourse.group/t/how-to-use-utf-8-in-gfortran, I looked at handling of strings that contain extended ASCII letters.

The Fortran standard mentions the 7 bit standard ASCII (ISO/IEC 646 - Wikipedia) and that is also the default in Gfortran (SELECTED_CHAR_KIND (The GNU Fortran Compiler)). For that reason, symbols of the extended ASCII code (in the range 128 to 255) are stored differently and their len is 2 or 3. This is a little bit annoying, especially since the default ASCII character type has 8 bit = 1 byte (at least on Linux), i.e. it could store the 256 symbols (https://theasciicode.com.ar) easily if extended ASCII would be the default.

Does anyone has an idea how to implement a len_fixed that gives the number of characters on screen instead of the size in bytes (or “storage units” in case that a single character uses more bytes, i.e. in the case of UCS-4)? Additionally, it would be nice to have a sensible approach for iterating over a string that contains extended ASCII characters.

Below is the code I’ve used for testing. Gfortran and ifx show the same behavior for ASCII but ifx rejects the code related to UCS-4.

program test_ASCII
  use, intrinsic :: ISO_C_binding

  implicit none
  integer, parameter :: ucs4 = selected_char_kind ('ISO_10646')
  integer, parameter :: ascii = selected_char_kind ('ASCII')
  character(kind=ascii) :: c

  integer(C_SIGNED_CHAR), dimension(storage_size('~')/8) :: bits_126
  integer(C_SIGNED_CHAR), dimension(storage_size('Ç')/8) :: bits_128
  integer(C_SIGNED_CHAR), dimension(storage_size('â– ')/8) :: bits_254


  print('(a,i0,a,i0)'), ' "~":  ASCII=126, size/bit=', storage_size('~'), ', len=', len('~')!, C_sizeof(bits_126)
  bits_126 = transfer('~',1_C_SIGNED_CHAR)
  print*, bits_126
  call print_bits(bits_126)

  print('(a,i0,a,i0)'), ' "Ç":  ASCII=128, size/bit=', storage_size('Ç'), ', len=', len('Ç')!, C_sizeof(bits_128)
  bits_128 = transfer('Ç',1_C_SIGNED_CHAR,size=2)
  print*, bits_128
  call print_bits(bits_128)

  print('(a,i0,a,i0)'), ' "â– ":  ASCII=254, size/bit=', storage_size('â– '), ', len=', len('â– ')!, C_sizeof(bits_254)
  bits_254 = transfer('â– ',1_C_SIGNED_CHAR,size=3)
  print*, bits_254
  call print_bits(bits_254)

  c = transfer(126_C_SIGNED_CHAR,'a') ! values > 127 or < -127 require "-fno-range-check" on Gfortran and lead to strange results
  print*, c
  c = transfer(-127_C_SIGNED_CHAR,'a')
  print*, c

#ifdef __GFORTRAN__
  print*, ''
  print*, 'UCS4'
  print*, '~', storage_size(ucs4_'~'), len(ucs4_'~')
  print*, 'Ç', storage_size(ucs4_'Ç'), len(ucs4_'Ç')
  print*, 'â– ', storage_size(ucs4_'â– '), len(ucs4_'â– ')
#endif

 contains


 subroutine print_bits(i)

   integer(C_SIGNED_CHAR), dimension(:), intent(in) :: i
   integer :: s,l

   do l = 1, size(i)
     do s = bit_size(i)-1, 0, -1
       write(*,'(i1)',advance='no') ibits(i(l:l),s,1)
     end do
     write(*,'(a)',advance='no') ' '
   end do
   print*
   print*

 end subroutine print_bits

end program test_ASCII

edit: added output below and fixed mistake (was “byte” instead of “bit”)

 "~":  ASCII=126, size/bit=8, len=1
  126
01111110 

 "Ç":  ASCII=128, size/bit=16, len=2
  -61 -121
11000011 10000111 

 "â– ":  ASCII=254, size/bit=24, len=3
  -30 -106  -96
11100010 10010110 10100000 

 ~
 ďż˝
 
 UCS4
 ~          32           1
 Ç          64           2
 â–           96           3

edit 2: fixed bit-wise representation

2 Likes

Fully supports UTF-8 files and strings without ISO-10646 support with at least flang_new, gfortran, and ifx. Includes overloads of all Fortran intrinsics, additional functions for case, Unicode codepoint-order sorting (not necessarily dictionary sorting), ragged string arrays and a procedural and OOP interface. Still a few issues with the DT and overloading // with ifx so for concatenation I would use “.cat.” and not “//” and avoid using the DT format. Includes a cursory unit test and demo programs of all the functions. The demo for LEN is

   program demo_len
   use m_unicode, only : assignment(=), ut=>unicode_type, len
   use m_unicode, only : write(formatted)
   implicit none
   type(ut)             :: string
   type(ut),allocatable :: many_strings(:)
   integer                        :: ii
    ! BASIC USAGE
      string='Noho me ka hau’oli' ! (Be happy.)
      ii=len(string)
      write(*,'(DT,*(g0))')string, ' LEN=', ii

      string=' How long is this allocatable string? '
      write(*,'(DT,*(g0))')string, ' LEN=', len(string)

    ! STRINGS IN AN ARRAY MAY BE OF DIFFERENT LENGTHS
      many_strings = [ ut('Tom'), ut('Dick'), ut('Harry') ]
      write(*,'(*(g0,1x))')'length of elements of array=',len(many_strings)

      write(*,'(*(g0))')'length from type parameter inquiry=',string%len()

    ! LOOK AT HOW A PASSED STRING CAN BE USED ...
      call passed(ut(' how long? '))

   contains

      subroutine passed(str)
      type(ut),intent(in) :: str
         ! you can query the length of the passed variable
         ! when an interface is present
         write(*,'(*(g0))')'length of passed value is ', len(str)
      end subroutine passed

   end program demo_len
Noho me ka hau’oli LEN=18
 How long is this allocatable string?  LEN=38
length of elements of array= 3 4 5
length from type parameter inquiry=38
length of passed value is 11
4 Likes

I think it is rather the issue of what is the character coding of your source file. If you use UTF-8, as it is common nowadays, you get what you described, because your source code indeed contains 2 or 3 bytes inside the quotes. If, however, you convert your source to extended ASCII coding (your example seems to presume CP437), you will get 1-byte sizes.

$ iconv -f utf8 -t cp437 ext_ascii.F90 > ext_ascii_lat1.F90
$ gfortran ext_ascii_lat1.F90 && ./a.out
"~":  ASCII=126, size/bit=8, len=1
  126
01111110 

 "ďż˝":  ASCII=128, size/bit=8, len=1
 -128
10000000 

 "ďż˝":  ASCII=254, size/bit=8, len=1
   -2
11111110 

 ~
 ďż˝
 
 UCS4
 ~          32           1
 ďż˝          32           1
 ďż˝          32           1

remarks:

  • the code requires edit, to delete hardcoded size=2/3 optional param to transfer function
  • the output, on a UTF-8 coded terminal/HTML page does not show the extended characters well

This is not my experience with Gfortran. I find that the standard character set is an 8-bit set, which is equivalent to 1 byte. Between different compilers there can be some variation with the 8th bit, but this can be easily overcome with character parameters.
This 1-byte character type can be clearly identified when using stream access files.
The concept of 7-bit character set has not appeared with my use of default kind characters in Gfortran.

It would be very surprising for a modern Linux platform to not default to UTF-8
encoding. To see your current locale settings enter

locale

and you will probably see everything set to UTF-8

LANG=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_ALL=

and if your source file is UTF8 you will see multiple bytes used to store non-ASCII7 characters. I would strongly suggest using UTF-8 as the default. To
list what else is available use

locale -a

The most portable way if your compiler supports ISO-10646 is to write the code
all in ASCII-7 using codepoints

program use_ucs4
use, intrinsic :: iso_fortran_env, only : stdout=>OUTPUT_UNIT
implicit none
integer, parameter :: ucs4 = selected_char_kind ('ISO_10646')
character(len=*,kind=ucs4),parameter :: &
 & u1=char(126,kind=ucs4), &
 & u2=char(199,kind=ucs4),&
 & u3=char(9632,kind=ucs4)

   open(stdout,encoding='utf-8')

   call udescribe_me(u1)
   call udescribe_me(u2)
   call udescribe_me(u3)
contains
subroutine udescribe_me(var)
character(len=*,kind=ucs4),intent(in) :: var
integer :: i
   write(*,'("GLYPH:",*(a))') var 
   write(*,'("CODEPOINT:",*(i0,1x))')(ichar(var(i:i)),i=1,len(var))
   write(*,'("STORAGE SIZE:",i0)')storage_size(var)
   write(*,'("LEN:",i0)')len(var)
   write(*,'("ICHAR:",i0)')ichar(var)
end subroutine udescribe_me

end program use_ucs4

All the intrinsics will work with ucs4 characters and you should
see something like the following where LEN=1 for all the characters
and they all take 4 bytes of storage and ICHAR returns a Unicode codepoint.

GLYPH:~
CODEPOINT:126
STORAGE SIZE:32
LEN:1
ICHAR:126
GLYPH:Ç
CODEPOINT:199
STORAGE SIZE:32
LEN:1
ICHAR:199
GLYPH:â– 
CODEPOINT:9632
STORAGE SIZE:32
LEN:1
ICHAR:9632

Using ICHAR() and CHAR() and not using non-ASCII7 characters in the strings
is the more portable method when ISO-10646 is not supported as well. If you
enter the characters in the file directly the encoding being used in the
file and by the compiler becomes key.

If you enter the following program

program use_raw
use, intrinsic :: ISO_C_binding
implicit none
integer, parameter :: ascii = selected_char_kind ('ASCII')
character(len=*,kind=ascii),parameter :: v1="~", v2="Ç", v3="■"
   call describe_me(v1)
   call describe_me(v2)
   call describe_me(v3)
contains
subroutine describe_me(var)
character(len=*,kind=ascii),intent(in) :: var
integer :: i
   write(*,'("GLYPH:",*(a))') var 
   write(*,'("ADE:",*(i0,1x))')(ichar(var(i:i)),i=1,len(var))
   write(*,'("STORAGE SIZE:",i0)')storage_size(var)
   write(*,'("LEN:",i0)')len(var)
end subroutine describe_me

end program use_raw

And the file command says the file is UTF-8

file raw.f90
raw.f90: Unicode text, UTF-8 text

you will see the glyphs require various amounts of
bytes

GLYPH:~
ADE:126
STORAGE SIZE:8
LEN:1
GLYPH:Ç
ADE:195 135
STORAGE SIZE:16
LEN:2
GLYPH:â– 
ADE:226 150 160
STORAGE SIZE:24
LEN:3

If the file actually was in CP437 encoding as described
earlier

iconv -f utf8 -t cp437 raw.f90 > cp437.f90

then you are using (one of several) alternate extended
ASCII encodings. If your system locale matches cp437 it
will display correctly. But if your locale is UTF-8 the
LEN values will all be one and the storage will all be
one byte but the output will probably not appear
correctly.

GLYPH:~
ADE:126
STORAGE SIZE:8
LEN:1
GLYPH:€
ADE:128
STORAGE SIZE:8
LEN:1
GLYPH:Ăľ
ADE:254
STORAGE SIZE:8
LEN:1

but the iconv(1) command will let it appear correctly:

file cp437.f90
cp437.f90: Non-ISO extended-ASCII text
./cp437|iconv -t utf8 -f cp437 

All OS that I know of are using UTF-8 as the default or
are moving to do so. If not the default what to do to
set it to the default and to find an appropriate font and
terminal emulator that supports UTF-8 is system-dependent;
but it is usually easy to find documentation on how to
set up.

In an xterm(1) or uxterm(1) window you usually use ctrl-Mouse3 (often the right mouse button and the control key held simultaneously) to get a menu to appear and try using TruType fonts with utf8 enabled for typically better results. In newer versions you may have a menu bar visible and you can click on the VT-fonts menu instead of using the ctrl-Mouse* method.

If the “locale -a” command listed C.CP437 or En-US.CP437
that means you have that locale available, in which case
in the bash(1) shell you can enter

env LC_ALL= LC_CTYPE=En-US.CP437 ./cp437

More likely you might be using one of the Latin extended
ASCII variants which also represent all characters with
one byte and the same rules apply, but I suggest to use
UTF-8 encoding in the source files if using anything other
than ASCII-7 characters in the source. Extended ASCII is
deprecated in lieu of Unicode as far as I am concerned.

The problem is there are many extended ASCII sets which makes their use
very non-portable and all can be supplanted with
Unicode and UTF-8. Most programming languages use only ASCII-7 outside
of comments and constant strings or use UTF-8 because there
was never a single extended ASCII standard, for example.

The biggest advantage of UTF-8 encoding versus other Unicode encodings is
that ASCII-7 is a subset of it. The biggest disadvantage is that Fortran
traditionally supports intrinsic types with homogeneous sizes in arrays
so it is much more intuitive to use encodings like UCS4 internally,
but most OS and HTML and many utilities are evolving to using UTF-8 for
data and terminal output.

But the initial discussion does conflate UTF-8 multi-byte encoding and
extended ASCII to some extent. They are distinct encodings. Hopefully the
examples help clarify that.

thanks!
The file is indeed stored as UTF-8 and Gfortran and ifx can handle that using the default character type at the cost of reporting the length not in “characters on screen”.

I’ve checked it with

and everything makes sense to me now.

you’re correct, 8bit are used for storage. But any values exceeding 7bit are not handled correctly when interpreting them as extended ASCII. So “for that reason, symbols of the extended ASCII code (in the range 128 to 255) are stored differently and their len is 2 or 3.” means essentially that they are stored as UTF-8.

Yep. Extended sets served a valuable purpose for a long time, but I personally think it is time for them to join ASCII 6/12 and DISPLAY code and EBCDIC for that matter and be retired or phased out and replaced with UTF-8. Maybe some other Unicode encoding will supplant UTF-8 but none of the others are compatible with ASCII7 and several have specific endian-ness so I think it is very unlikely for years for the trend to be anything other than to standardize on UTF-8.

A useful extension of several compilers is to support escape sequences, typically starting with \u, \U, and \x that are then followed by hexadecimal codepoint values. Most only support values between 0 and 255 inclusive if the string is not UCS4 though. The ESCAPE() procedure in the M_unicode module (see manual.txt ) converts plain ASCII CHARACTER variables using the same syntax to UTF-8 encoding, which works to embed a few multi-byte files into otherwise ASCII-7 text just using ASCII-7 characters easily. UCS4 support is not present in all compilers and does not have routines to convert between UCS4 and UTF-8 other than to read and write data from files so even if using UCS4 the companion module M_utf8 has some useful procedures in it for reading UTF-8 from command line options or environment variables or to convert UCS-4 to filenames for use on OPEN statements and such as well. There are also routines for converting LATIN1 and LATIN4 encodings to UTF-8 in a third module.

I am toying with adding basic regular expressions and a few other procedures to M_unicode for completeness, but it is pretty much complete. I almost regret how easily it lets you use UTF-8 source files, as there are a lot of advantages to ASCII-7 being the lingua franca for Fortran; but it works everywhere I currently run with, with at least three compilers. But I seem to be in the minority as the github repo got three stars :pensive_face:. But I have a much better understanding of UTF-8 than I used to :slight_smile: !