Using Unicode Characters in Fortran

drikosev · February 8, 2024, 4:20pm

Hello,

It compiles with older versions of GNU Fortran, ie with 7.4.0:

I don’t know why it now fails with newer versions of gfortran.

Regards,
Ev. Drikos

apoorv01 · February 8, 2024, 4:31pm

Hi @drikosev

This link worked in newer version

Regards
Apurva

drikosev · February 10, 2024, 1:11am

Thanks. Here is an updated version of the example, accepted also by gfortran-13:

program test
    use iso_fortran_env
    implicit none
    
    integer,parameter :: CK = selected_char_kind('ISO_10646')

    !see also https://en.wikipedia.org/wiki/Combining_Diacritical_Marks
    !see also https://www.compart.com/en/unicode/U+006F
    character(kind=CK,len=7) :: s = CHAR(INT(Z'0061'), KIND=CK) //& !a
                                  & CHAR(INT(Z'0310'), KIND=CK) //& !◌̐ -> a̐
                                ! & CHAR(INT(Z'00E9'), KIND=CK) //& !é (precomposed)
                                  & CHAR(INT(Z'0065'), KIND=CK) //& !e (decomposed)
                                  & CHAR(INT(Z'0301'), KIND=CK) //& !◌́
                                ! & CHAR(INT(Z'00F6'), KIND=CK) //& !ö  (precomposed)
                                  & CHAR(INT(Z'006F'), KIND=CK) //& !o (decomposed)
                                  & CHAR(INT(Z'0308'), KIND=CK) //& !Combining Diaeresis
                                  & CHAR(INT(Z'0332'), KIND=CK)     !Combining Low Line
    
    open(output_unit,encoding='utf-8')
    
    write(output_unit,*) s
    write(output_unit,*) 'len(s) = ', len(s)
    write(output_unit,*) 's(1:1) = ', s(1:1)
    
end program test

drikosev · February 10, 2024, 2:07am

Hello,

Just for the record, GNU Fortran supports the command line option ‘-fallow-invalid-boz’ (default) and past versions, ie 7, as already said above, were intentionally accepting BOZ literals in places that shouldn’t. I’m not aware when support for the particular extension was dropped.

Regards,
Ev. Drikos

apoorv01 · February 10, 2024, 10:25am

@drikosev Thanks for your help. This is working in gfortran-13

rnhmjoj · May 11, 2024, 7:02pm

There is a major problem this line of thinking though: if you have to align your output you must really know how much space the characters are going to use, not how many bytes they take. This is made worse by the fact that even the a[w] edit descriptor interprets the width w as a byte size and is thus unusable with unicode text. See this program:

program test
  character(len=*), parameter :: ascii   = "abab"
  character(len=*), parameter :: unicode = "αβαβ"
  print '("|",a10,"|")', ascii
  print '("|",a10,"|")', unicode
end program

which outputs:

|      abab|
|  αβαβ|

sblionel · May 12, 2024, 12:30am

That’s not what the standard says. “If a field width w is specified with the A edit descriptor, the field consists of w characters.” If an implementation supports multi-byte characters, and the kind type parameter of the value being transmitted is one for multi-byte, then it is number of characters, not number of bytes. The LEN intrinsic also talks about characters, not bytes.

In your example, you have not specified a Unicode kind type parameter, nor opened the output unit with the ENCODING keyword. The compiler has no way of knowing you mean Unicode here. Both of your named constants are of default character kind.

rnhmjoj · May 12, 2024, 7:45am

If an implementation supports multi-byte characters, and the kind type parameter of the value being transmitted is one for multi-byte, then it is number of characters

Ok, that’s good to know. I did try to specify the character kind but it didn’t seem to have any effect:

program test
  use iso_fortran_env, only : output_unit

  integer, parameter :: ascii   = selected_char_kind('ASCII')
  integer, parameter :: unicode = selected_char_kind('ISO_10646')

  character(len=*), parameter :: x   = ascii_'abab'
  character(len=*), parameter :: y   = unicode_'αβαβ'
  character(len=*), parameter :: fmt = '("|",a10,"|")'

  open(output_unit, encoding='utf-8')
  write(output_unit, fmt) x
  write(output_unit, fmt) y
end program

still produces the same outputs. I guess the problem may be the compile (gfortran, here) doesn’t understand the unicode in the source code.

RonShepard · May 12, 2024, 5:01pm

What exactly is a fortran compiler supposed to do with the expression len(c) where c is a multibyte character string? Is it supposed to scan through the characters, adding up 1+3+2+1+… and so on until the end of the string?

sblionel · May 12, 2024, 5:08pm

Yes, I would expect LEN to return the number of characters, and yes, that means scanning. I have not played with this in an implementation that supports Unicode. There are some aspects I’m unsure about and plan to ask.

RonShepard · May 12, 2024, 5:24pm

That also means that any kind of substring addressing, c(i:j), would require scanning through to find the beginning of the ith character and the end of the jth character.

Normally, a fortran programmer would expect len(c) and c(i:j) substring addressing to be cheap operations, just references to the underlying character metadata and some trivial offset calculations. With multibyte characters, these rather all become expensive operations, with previous O(0) operations becoming O(N) operations, and previous O(N) operations becoming O(N**2) operations.

I can see why this has been a challenge to implement.

urbanjost · May 12, 2024, 5:43pm

I have so far not gotten UTF4 to work unless I use arrays of single-character. That also partly resolves the potential scanning overhead.
It also seems to be more portable to treat everything as arrays of integers accept for input and output. The 13+ version of GFortran is getting close, but only a simple ‘A’ format works as far as I know; as when I try to use something like A10 I get blank output. In the following a lot works as expected with LEN and an A descriptor, but as soon as I add a length it goes sideways. So running this with gfortran 13.1.0:

program test
   use iso_fortran_env
   integer, parameter :: ascii = selected_char_kind('ascii')
   integer, parameter :: ucs4 = selected_char_kind('ISO_10646')
   character(len=*, kind=ascii), parameter :: ascii8_string = ascii_"abab"
! ALPHA: UTF-16 Encoding: 0x03B1 UTF-32 Encoding: 0x000003B1 Uppercase Character: Α (U+0391)
   character(kind=ucs4, len=1), parameter :: alpha = char(int(z'03B1'), ucs4)
   character(kind=ucs4, len=1), parameter :: beta = char(int(z'03B2'), ucs4)
   character(len=*, kind=ucs4), parameter :: unicode_string = alpha//beta//alpha//beta
   character(kind=ucs4, len=1)           :: uspace = ucs4_' '
   character(len=*, kind=ucs4), parameter :: raw = 'αβαβ'

   open (output_unit, encoding='UTF-8')
   print '("|",a10,"|")', ascii8_string
   print '("|",a,"|")', ascii8_string
   write (*, *) len(ascii8_string)

   print '("|",a10,"|")', unicode_string
   print '("|",a,"|")', unicode_string
   print '("|",a,"|")', repeat(uspace,6)//alpha//beta//alpha//beta
   print '("|",2a,"|")', repeat(' ',6),alpha//beta//alpha//beta ! use A for default and iso_10646
   write (*, *) len(unicode_string)

   print '("|",a10,"|")', raw
   print '("|",a,"|")', raw
   write (*, *) len(raw)

   open (output_unit, encoding='default')
   print '("|",a10,"|")', raw
   print '("|",a,"|")', raw
   write (*, *) len(raw)
end program

|      abab|
|abab|
           4
|      |
|αβαβ|
|      αβαβ|
|      αβαβ|
           4
|  Î²Î±Î²|
|Î±Î²Î±Î²|
           8
|  αβαβ|
|αβαβ|
           8
```text
This was intended to see some of the bugs I have encountered; several of which appear to now be fixed.  It works quite well if you use variables of LEN=1 with gfortran; which some people like to do with ASCII in Fortran anyway, as it is more C-like which some are more familiar with.

RonShepard · May 12, 2024, 5:56pm

If you have an array of such multibyte characters, then each element of that array would have a different byte length, right? I don’t think that is consistent with fortran arrays. Fortran also requires aliases of character arrays and character strings to be interoperable, so I can see that would also be very difficult to satisfy with multibyte charcters.

PierU · May 12, 2024, 9:26pm

In USC4/UTF32 all characters are coded on 4 bytes.

urbanjost · May 12, 2024, 10:27pm

Yep. Probably a good reason gfortran picked it instead of Unicode.

Still looking but have not found a match in the GNU bugzilla reports but the output does not meet my expectations:

program test
   use iso_fortran_env
   integer, parameter :: ascii = selected_char_kind('ascii')
   integer, parameter :: ucs4 = selected_char_kind('ISO_10646')

   character(len=*, kind=ascii), parameter :: ascii8_lower = ascii_"abcdefghijklmnopqrstuvwxyz"
   character(len=*, kind=ascii), parameter :: ascii8_upper = ascii_"ABCDEFGHIJKLMNOPQRSTUVWXYZ"

   character(len=*, kind=ucs4), parameter :: ucs4_lower = ucs4_"abcdefghijklmnopqrstuvwxyz"
   character(len=*, kind=ucs4), parameter :: ucs4_upper = ucs4_"ABCDEFGHIJKLMNOPQRSTUVWXYZ"

! ALPHA: UTF-16 Encoding: 0x03B1 UTF-32 Encoding: 0x000003B1 Uppercase Character: Α (U+0391)
   character(kind=ucs4, len=1), parameter :: alpha = char(int(z'03B1'), ucs4)
   character(kind=ucs4, len=1), parameter :: beta  = char(int(z'03B2'), ucs4)
   character(kind=ucs4, len=1), parameter :: space = ucs4_' '

   open (output_unit, encoding='UTF-8')

   print '("|",a,"|",i0,1x,i0)', ascii8_lower, len(ascii8_lower), storage_size(ascii8_lower)
   print '("|",a,"|",i0,1x,i0)', ucs4_lower,   len(ucs4_lower), storage_size(ucs4_lower)

   print '("|",a,"|")', repeat(space,6)//alpha//beta//alpha//beta

   print '("|",a1,"|")', alpha//beta//alpha//beta 
   print '("|",a2,"|")', alpha//beta//alpha//beta 
   print '("|",a3,"|")', alpha//beta//alpha//beta 
   print '("|",a4,"|")', alpha//beta//alpha//beta 
   print '("|",a5,"|")', alpha//beta//alpha//beta 
   print '("|",a6,"|")', alpha//beta//alpha//beta 
   print '("|",a7,"|")', alpha//beta//alpha//beta 
   print '("|",a8,"|")', alpha//beta//alpha//beta 
   print '("|",a9,"|")', alpha//beta//alpha//beta 
   print '("|",a10,"|")', alpha//beta//alpha//beta 
   print '("|",a11,"|")', alpha//beta//alpha//beta 
   print '("|",a1,"|")', 'abcd'
   print '("|",a2,"|")', 'abcd'
   print '("|",a3,"|")', 'abcd'
   print '("|",a4,"|")', 'abcd'
   print '("|",a5,"|")', 'abcd'
   print '("|",a6,"|")', 'abcd'
   print '("|",a7,"|")', 'abcd'
   print '("|",a8,"|")', 'abcd'
   print '("|",a9,"|")', 'abcd'
   print '("|",a10,"|")', 'abcd'
   print '("|",a11,"|")', 'abcd'

end program

|abcdefghijklmnopqrstuvwxyz|26 208
|abcdefghijklmnopqrstuvwxyz|26 832
|      αβαβ|
|αβαβ|
|αβαβ|
|αβαβ|
|αβαβ|
| βαβ|
|  αβ|
|   β|
|    |
|     |
|      |
|       |
|a|
|ab|
|abc|
|abcd|
| abcd|
|  abcd|
|   abcd|
|    abcd|
|     abcd|
|      abcd|
|       abcd|

implicitall · January 20, 2025, 2:54pm

I think we need to reconsider many of the posts in this discussion with a more realistic view that selected_char_kind(‘ISO_10646’) is completely unimplemented and the behavior is simply that of selected_char_kind('ASCII’).

Topic		Replies	Views
Using Box drawing characters	7	1625	February 2, 2021
Handling files in directories whose names are not entirely ASCII Help	8	1073	November 23, 2022
Could someone please correct this code? Help	12	466	February 6, 2024
ASCII only identifiers?	3	324	April 22, 2024
How do I file-read French special characters like é etc? Help	46	2349	January 22, 2024

Using Unicode Characters in Fortran

Related topics