Hello,
It compiles with older versions of GNU Fortran, ie with 7.4.0:
I don’t know why it now fails with newer versions of gfortran.
Regards,
Ev. Drikos
Hello,
It compiles with older versions of GNU Fortran, ie with 7.4.0:
I don’t know why it now fails with newer versions of gfortran.
Regards,
Ev. Drikos
Thanks. Here is an updated version of the example, accepted also by gfortran-13:
program test
use iso_fortran_env
implicit none
integer,parameter :: CK = selected_char_kind('ISO_10646')
!see also https://en.wikipedia.org/wiki/Combining_Diacritical_Marks
!see also https://www.compart.com/en/unicode/U+006F
character(kind=CK,len=7) :: s = CHAR(INT(Z'0061'), KIND=CK) //& !a
& CHAR(INT(Z'0310'), KIND=CK) //& !◌̐ -> a̐
! & CHAR(INT(Z'00E9'), KIND=CK) //& !é (precomposed)
& CHAR(INT(Z'0065'), KIND=CK) //& !e (decomposed)
& CHAR(INT(Z'0301'), KIND=CK) //& !◌́
! & CHAR(INT(Z'00F6'), KIND=CK) //& !ö (precomposed)
& CHAR(INT(Z'006F'), KIND=CK) //& !o (decomposed)
& CHAR(INT(Z'0308'), KIND=CK) //& !Combining Diaeresis
& CHAR(INT(Z'0332'), KIND=CK) !Combining Low Line
open(output_unit,encoding='utf-8')
write(output_unit,*) s
write(output_unit,*) 'len(s) = ', len(s)
write(output_unit,*) 's(1:1) = ', s(1:1)
end program test
Hello,
Just for the record, GNU Fortran supports the command line option ‘-fallow-invalid-boz’ (default) and past versions, ie 7, as already said above, were intentionally accepting BOZ literals in places that shouldn’t. I’m not aware when support for the particular extension was dropped.
Regards,
Ev. Drikos
@drikosev Thanks for your help. This is working in gfortran-13
There is a major problem this line of thinking though: if you have to align your output you must really know how much space the characters are going to use, not how many bytes they take. This is made worse by the fact that even the a[w]
edit descriptor interprets the width w as a byte size and is thus unusable with unicode text. See this program:
program test
character(len=*), parameter :: ascii = "abab"
character(len=*), parameter :: unicode = "αβαβ"
print '("|",a10,"|")', ascii
print '("|",a10,"|")', unicode
end program
which outputs:
| abab|
| αβαβ|
That’s not what the standard says. “If a field width w is specified with the A edit descriptor, the field consists of w characters.” If an implementation supports multi-byte characters, and the kind type parameter of the value being transmitted is one for multi-byte, then it is number of characters, not number of bytes. The LEN
intrinsic also talks about characters, not bytes.
In your example, you have not specified a Unicode kind type parameter, nor opened the output unit with the ENCODING keyword. The compiler has no way of knowing you mean Unicode here. Both of your named constants are of default character kind.
If an implementation supports multi-byte characters, and the kind type parameter of the value being transmitted is one for multi-byte, then it is number of characters
Ok, that’s good to know. I did try to specify the character kind but it didn’t seem to have any effect:
program test
use iso_fortran_env, only : output_unit
integer, parameter :: ascii = selected_char_kind('ASCII')
integer, parameter :: unicode = selected_char_kind('ISO_10646')
character(len=*), parameter :: x = ascii_'abab'
character(len=*), parameter :: y = unicode_'αβαβ'
character(len=*), parameter :: fmt = '("|",a10,"|")'
open(output_unit, encoding='utf-8')
write(output_unit, fmt) x
write(output_unit, fmt) y
end program
still produces the same outputs. I guess the problem may be the compile (gfortran, here) doesn’t understand the unicode in the source code.
What exactly is a fortran compiler supposed to do with the expression len(c)
where c
is a multibyte character string? Is it supposed to scan through the characters, adding up 1+3+2+1+… and so on until the end of the string?
Yes, I would expect LEN to return the number of characters, and yes, that means scanning. I have not played with this in an implementation that supports Unicode. There are some aspects I’m unsure about and plan to ask.
That also means that any kind of substring addressing, c(i:j)
, would require scanning through to find the beginning of the ith
character and the end of the jth
character.
Normally, a fortran programmer would expect len(c)
and c(i:j)
substring addressing to be cheap operations, just references to the underlying character metadata and some trivial offset calculations. With multibyte characters, these rather all become expensive operations, with previous O(0) operations becoming O(N) operations, and previous O(N) operations becoming O(N**2) operations.
I can see why this has been a challenge to implement.
I have so far not gotten UTF4 to work unless I use arrays of single-character. That also partly resolves the potential scanning overhead.
It also seems to be more portable to treat everything as arrays of integers accept for input and output. The 13+ version of GFortran is getting close, but only a simple ‘A’ format works as far as I know; as when I try to use something like A10 I get blank output. In the following a lot works as expected with LEN and an A descriptor, but as soon as I add a length it goes sideways. So running this with gfortran 13.1.0:
program test
use iso_fortran_env
integer, parameter :: ascii = selected_char_kind('ascii')
integer, parameter :: ucs4 = selected_char_kind('ISO_10646')
character(len=*, kind=ascii), parameter :: ascii8_string = ascii_"abab"
! ALPHA: UTF-16 Encoding: 0x03B1 UTF-32 Encoding: 0x000003B1 Uppercase Character: Α (U+0391)
character(kind=ucs4, len=1), parameter :: alpha = char(int(z'03B1'), ucs4)
character(kind=ucs4, len=1), parameter :: beta = char(int(z'03B2'), ucs4)
character(len=*, kind=ucs4), parameter :: unicode_string = alpha//beta//alpha//beta
character(kind=ucs4, len=1) :: uspace = ucs4_' '
character(len=*, kind=ucs4), parameter :: raw = 'αβαβ'
open (output_unit, encoding='UTF-8')
print '("|",a10,"|")', ascii8_string
print '("|",a,"|")', ascii8_string
write (*, *) len(ascii8_string)
print '("|",a10,"|")', unicode_string
print '("|",a,"|")', unicode_string
print '("|",a,"|")', repeat(uspace,6)//alpha//beta//alpha//beta
print '("|",2a,"|")', repeat(' ',6),alpha//beta//alpha//beta ! use A for default and iso_10646
write (*, *) len(unicode_string)
print '("|",a10,"|")', raw
print '("|",a,"|")', raw
write (*, *) len(raw)
open (output_unit, encoding='default')
print '("|",a10,"|")', raw
print '("|",a,"|")', raw
write (*, *) len(raw)
end program
| abab|
|abab|
4
| |
|αβαβ|
| αβαβ|
| αβαβ|
4
| βαβ|
|αβαβ|
8
| αβαβ|
|αβαβ|
8
```text
This was intended to see some of the bugs I have encountered; several of which appear to now be fixed. It works quite well if you use variables of LEN=1 with gfortran; which some people like to do with ASCII in Fortran anyway, as it is more C-like which some are more familiar with.
If you have an array of such multibyte characters, then each element of that array would have a different byte length, right? I don’t think that is consistent with fortran arrays. Fortran also requires aliases of character arrays and character strings to be interoperable, so I can see that would also be very difficult to satisfy with multibyte charcters.
In USC4/UTF32 all characters are coded on 4 bytes.
Yep. Probably a good reason gfortran picked it instead of Unicode.
Still looking but have not found a match in the GNU bugzilla reports but the output does not meet my expectations:
program test
use iso_fortran_env
integer, parameter :: ascii = selected_char_kind('ascii')
integer, parameter :: ucs4 = selected_char_kind('ISO_10646')
character(len=*, kind=ascii), parameter :: ascii8_lower = ascii_"abcdefghijklmnopqrstuvwxyz"
character(len=*, kind=ascii), parameter :: ascii8_upper = ascii_"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
character(len=*, kind=ucs4), parameter :: ucs4_lower = ucs4_"abcdefghijklmnopqrstuvwxyz"
character(len=*, kind=ucs4), parameter :: ucs4_upper = ucs4_"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
! ALPHA: UTF-16 Encoding: 0x03B1 UTF-32 Encoding: 0x000003B1 Uppercase Character: Α (U+0391)
character(kind=ucs4, len=1), parameter :: alpha = char(int(z'03B1'), ucs4)
character(kind=ucs4, len=1), parameter :: beta = char(int(z'03B2'), ucs4)
character(kind=ucs4, len=1), parameter :: space = ucs4_' '
open (output_unit, encoding='UTF-8')
print '("|",a,"|",i0,1x,i0)', ascii8_lower, len(ascii8_lower), storage_size(ascii8_lower)
print '("|",a,"|",i0,1x,i0)', ucs4_lower, len(ucs4_lower), storage_size(ucs4_lower)
print '("|",a,"|")', repeat(space,6)//alpha//beta//alpha//beta
print '("|",a1,"|")', alpha//beta//alpha//beta
print '("|",a2,"|")', alpha//beta//alpha//beta
print '("|",a3,"|")', alpha//beta//alpha//beta
print '("|",a4,"|")', alpha//beta//alpha//beta
print '("|",a5,"|")', alpha//beta//alpha//beta
print '("|",a6,"|")', alpha//beta//alpha//beta
print '("|",a7,"|")', alpha//beta//alpha//beta
print '("|",a8,"|")', alpha//beta//alpha//beta
print '("|",a9,"|")', alpha//beta//alpha//beta
print '("|",a10,"|")', alpha//beta//alpha//beta
print '("|",a11,"|")', alpha//beta//alpha//beta
print '("|",a1,"|")', 'abcd'
print '("|",a2,"|")', 'abcd'
print '("|",a3,"|")', 'abcd'
print '("|",a4,"|")', 'abcd'
print '("|",a5,"|")', 'abcd'
print '("|",a6,"|")', 'abcd'
print '("|",a7,"|")', 'abcd'
print '("|",a8,"|")', 'abcd'
print '("|",a9,"|")', 'abcd'
print '("|",a10,"|")', 'abcd'
print '("|",a11,"|")', 'abcd'
end program
|abcdefghijklmnopqrstuvwxyz|26 208
|abcdefghijklmnopqrstuvwxyz|26 832
| αβαβ|
|αβαβ|
|αβαβ|
|αβαβ|
|αβαβ|
| βαβ|
| αβ|
| β|
| |
| |
| |
| |
|a|
|ab|
|abc|
|abcd|
| abcd|
| abcd|
| abcd|
| abcd|
| abcd|
| abcd|
| abcd|
I think we need to reconsider many of the posts in this discussion with a more realistic view that selected_char_kind(‘ISO_10646’) is completely unimplemented and the behavior is simply that of selected_char_kind('ASCII’).