Warning: command-line option ‘-finput-charset=utf-8’ is valid for C/C++/ObjC/ObjC++ but not for Fortran.
How to fix it?
For Fortran, if the compiler supports it, you specify UTF-8 using standard code; no special compiler options are required in general. For example:
```fortran
program demo_selected_char_kind
use iso_fortran_env
implicit none
intrinsic date_and_time,selected_char_kind
! set some aliases for common character kinds
! as the numbers can vary from platform to platform
integer, parameter :: default = selected_char_kind ("default")
integer, parameter :: ascii = selected_char_kind ("ascii")
integer, parameter :: ucs4 = selected_char_kind ('ISO_10646')
integer, parameter :: utf8 = selected_char_kind ('utf-8')
! assuming ASCII and UCS4 are supported (ie. not equal to -1)
! define some string variables
character(len=26, kind=ascii ) :: alphabet
character(len=30, kind=ucs4 ) :: hello_world
character(len=30, kind=ucs4 ) :: string
! note: MERGE requires TSOURCE and FSOURCE to have the same length
write(*,*)'ASCII     ', merge('Supported    ','Not Supported',ascii /= -1)
write(*,*)'ISO_10646 ', merge('Supported    ','Not Supported',ucs4 /= -1)
write(*,*)'UTF-8     ', merge('Supported    ','Not Supported',utf8 /= -1)
if(default.eq.ascii)then
write(*,*)'ASCII is the default on this processor'
endif
! for character constants the kind precedes the value
! (unlike numeric constants, where the kind suffix follows)
alphabet = ascii_"abcdefghijklmnopqrstuvwxyz"
write (*,*) alphabet
hello_world = ucs4_'Hello World and Ni Hao -- ' &
// char (int (z'4F60'), ucs4) &
// char (int (z'597D'), ucs4)
! an encoding option is required on OPEN for non-default I/O
if(ucs4 /= -1 )then
open (output_unit, encoding='UTF-8')
write (*,*) trim (hello_world)
else
write (*,*) 'cannot use utf-8'
endif
call create_date_string(string)
write (*,*) trim (string)
contains
! The following produces a Japanese date stamp.
subroutine create_date_string(string)
intrinsic date_and_time,selected_char_kind
integer,parameter :: ucs4 = selected_char_kind("ISO_10646")
character(len=1,kind=ucs4),parameter :: &
nen = char(int( z'5e74' ),ucs4), & ! year
gatsu = char(int( z'6708' ),ucs4), & ! month
nichi = char(int( z'65e5' ),ucs4) ! day
character(len= *, kind= ucs4) string
integer values(8)
call date_and_time(values=values)
write(string,101) values(1),nen,values(2),gatsu,values(3),nichi
101 format(*(i0,a))
end subroutine create_date_string
end program demo_selected_char_kind
```
Output:
```text
ASCII Supported
ISO_10646 Supported
UTF-8 Not Supported
ASCII is the default on this processor
abcdefghijklmnopqrstuvwxyz
Hello World and Ni Hao -- 你好
2025年7月25日
```
So for gfortran it looks like you want to use ISO_10646, or you will have to handle things at a lower level yourself with UTF-8, most likely using stream I/O. There have been previous discussions on this topic on this forum.
I found that a bit confusing. ‘UTF-8’ is not mentioned as a name of a character kind in the Standard, as ‘ISO_10646’ is. On the other hand, ‘UTF-8’ is a standard option to the encoding= specifier of the open statement. What follows in the Standard, namely
12.5.6.9 ENCODING= specifier in the OPEN statement
The scalar-default-char-expr shall evaluate to UTF-8 or DEFAULT. The ENCODING= specifier is permitted only for a connection for formatted input/output. The value UTF-8 specifies that the encoding form of the file is UTF-8 as specified in ISO/IEC 10646. Such a file is called a Unicode file, and all characters therein are of ISO 10646 character kind. The value UTF-8 shall not be specified if the processor does not support the ISO 10646 character kind. […]
is even more confusing, as it clearly states (unless I totally miss it) that a file written with encoding='utf-8' should contain ISO 10646 characters, that is, all 32-bit entities. Which is obviously not what utf-8 stands for. Sure enough, the gfortran implementation writes true utf-8, as expected by common sense but apparently not by the Standard.
For the skeptics, the definition of the above mentioned ‘ISO 10646 character’, from the same F2023 Draft:
3.93 ISO 10646 character
character whose representation method corresponds to UCS-4 in ISO/IEC 10646
All I know about utf-8 is that it is a variable-width encoding. Can someone fill in some of the details about Fortran support for all of these character sets so we can follow the discussion at least a little?
Some of the confusion has to do with the standards themselves, but first: the internal representation defined by the kind and the encoding of the output are two separate attributes.
The encoding of the file is specified as utf-8, but that does not in any way mean the data is stored internally as utf-8. The open statement says to translate the data to utf-8 on output and to translate it from utf-8 on input. The KIND says how the resulting unicode data is to be stored internally by the running program.
UTF-8 is by far the most common unicode encoding supported for files, terminal emulators, and applications. It is more compact than UTF-32, so if you are going to output unicode characters it is almost a given the file encoding will be UTF-8.
But storing the unicode character data internally might be done better with UCS-4 (aka UTF-32), because every character is the same size (4 bytes). In Fortran all the character intrinsics assume fixed-size characters (so far), and a lot of optimizations are easier when data sizes are fixed; so most Fortran implementations of Unicode store the characters internally as UCS-4 values.
Of course ASCII (all characters are 1 byte) and UTF-8 (each character/glyph is from 1 to 4 bytes, and in much text the smaller sizes are very common) could save a lot of memory, so a compiler might want to support UTF-8 as an internal encoding as well. But gfortran does not currently do that.
So you have a selection of how the unicode characters are stored internally, and a selection of how the characters are encoded when read and written.
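As a minimal sketch of those two independent choices (assuming a compiler that supports the ISO_10646 kind, such as gfortran; the program name is mine):

```fortran
program two_choices
use iso_fortran_env, only : output_unit
implicit none
intrinsic selected_char_kind
integer, parameter :: ucs4 = selected_char_kind('ISO_10646')
! choice 1: the KIND fixes the internal representation -- UCS-4, 4 bytes/char
character(len=2, kind=ucs4) :: nihao
nihao = char(int(z'4F60'), ucs4) // char(int(z'597D'), ucs4)
write(*,*) 'bytes stored internally:', storage_size(nihao)/8   ! 8, not 2
! choice 2: ENCODING= on OPEN fixes the external form -- translated to UTF-8
open(output_unit, encoding='UTF-8')
write(*,*) nihao
end program two_choices
```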
One point of confusion is that ISO 10646 defines multiple encoding forms (UCS-2, UCS-4, UTF-8, UTF-16, …) as different ways to store data representing Unicode glyphs, but the name sometimes gets used when really only one specific encoding is meant, which can lead to confusion.
UCS-4 is a good match for Fortran for internally representing unicode characters: it is efficient (except for memory use) and generally easier to implement because it is fixed-size. UTF-8, on the other hand, does not care about the endianness of the platform and usually takes far fewer bytes to store and transfer, so that is almost always how unicode is stored in files.
Think of how numbers are internally represented as binary data, but you often read and write them as ASCII text. In the same way, unicode might be represented internally in some arbitrary manner and just input and output as UTF-8. But the ISO 10646 character kind that the standard does define corresponds to UCS-4 (UTF-32). Probably at some point UTF-16 will show up.
This is not what -finput-charset=utf-8 is about. This gcc compiler option sets the encoding of the source files (the text in e.g. .f90 files) and has nothing (well, not directly) to do with string handling in Fortran code.
```text
-finput-charset=charset
    Set the input character set, used for translation from the character set
    of the input file to the source character set used by GCC. If the locale
    does not specify, or GCC cannot get this information from the locale, the
    default is UTF-8. This can be overridden by either the locale or this
    command-line option. Currently the command-line option takes precedence
    if there's a conflict. charset can be any encoding supported by the
    system's "iconv" library routine.
```
The OP never describes the underlying need for UTF-8. It would seem it is either to read and write unicode characters outside of the ASCII range, or to allow such glyphs to be used as symbol names. The Fortran character set, which must be used to compose all code, is a subset of ASCII, so if the latter is the desire it is clearly non-standard; even if a compiler supported non-ASCII symbol names it would be highly non-portable. The other, more likely, use is to read and write arbitrary unicode characters. The technique that is on more solid ground is to use the UCS-4 internal representation if the target compiler is gfortran. You can generally get away with handling unicode as byte streams and processing it yourself if no unicode encoding is supported by the compiler, but it gets confusing if you put anything but ASCII in the source file, so I would generally not recommend using unicode glyphs directly in string constants in the code either; although if you understand that one “character” may actually be one to four bytes as far as Fortran is concerned, you can make that work. Stream I/O usually makes it a little more portable. So without more info, one possible scenario is simply not valid: no, do not try to use utf-8 input files containing non-ASCII symbol names. The other possibility, full unicode I/O, is discussed above; it often confuses new users because it is easy to get something to “work” in one environment that will not work in others if you do not follow the somewhat non-intuitive standard-supported usage.
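For the byte-stream route mentioned above, here is a minimal sketch of slurping a UTF-8 file as raw bytes (the file name is hypothetical, and decoding is left entirely to the program):

```fortran
program read_bytes
implicit none
character(len=:), allocatable :: raw
integer :: lun, nbytes
! stream access reads the file as-is, with no encoding translation
open(newunit=lun, file='message.utf8', access='stream', &
     form='unformatted', status='old', action='read')
inquire(unit=lun, size=nbytes)   ! file size in bytes (F2008)
allocate(character(len=nbytes) :: raw)
read(lun) raw
close(lun)
! raw now holds the UTF-8 bytes, one byte per "character";
! decode them with something like the utf8_to_codepoints() shown later
end program read_bytes
```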
Normally (in languages other than Fortran) that is mainly to be able to (correctly) read string literals. Something like:

```c
char *not_ascii = "ľščťžýáߥʮ™√🙂";
```
So the goal is to be able to type unicode characters into constants in the Fortran source file, which is by definition written in a subset of ASCII (the Fortran character set). Since ASCII is a subset of unicode, basically you are saying you want the source file to be a UTF-8 file. But that would store the unicode constants as a stream of bytes. Most Fortran compilers' default character kind allows all byte values to be used, not just ASCII, so you can very likely get away with that.
But the only built-in conversion from a bunch of bytes representing utf-8 to ucs4 (which is what gfortran currently supports internally) is to actually read and write the data. That works, as demonstrated below, but is a pretty costly way to go about it. It might be possible with an internal read and write, but I do not think so (I did not try). It is interesting that gfortran has a built-in conversion between ucs4 and utf-8 in order to do READ and WRITE, but as far as I know it does not expose that publicly. Maybe using kind with char/ichar might work; I have never tried that. If it ends up that char/ichar/internal I/O cannot solve it, perhaps a new intrinsic is in order. I might play with this. If anyone else figures out a better way to do ucs4-utf8 conversion, please post it.
```fortran
program demo_selected_char_kind_ucs4
use iso_fortran_env
implicit none
intrinsic date_and_time,selected_char_kind
integer, parameter :: ucs4 = selected_char_kind ('ISO_10646')
character(len=*),parameter:: not_ascii = "ľščťžýáߥʮ™√🙂"
character(len=:,kind=ucs4),allocatable :: corrected
integer :: i
write(*,*)'treated as ASCII'
write(*,*)'NOT_ASCII:',not_ascii
write(*,*)'LEN NOT_ASCII:',len(not_ascii)
write(*,*)'STORAGE_SIZE:',storage_size(not_ascii)
write(*,*)'properly converted to ucs4'
corrected=utf8_to_utf32(not_ascii)
write(*,*)'CORRECTED:',corrected
write(*,*)'LEN CORRECTED:',len(corrected)
write(*,*)'STORAGE_SIZE:',storage_size(corrected)
do i=1,len(corrected)
write(*,'("z''",z0,"''")')corrected(i:i)
enddo
contains
function utf8_to_utf32(string) result(corrected)
! convert utf-8 bytes held in a default-kind string to a ucs4 string
! by round-tripping through a scratch file
character(len=*),intent(in) :: string
character(len=:,kind=ucs4),allocatable :: corrected
character(len=(4*len(string)),kind=ucs4) :: line
integer :: lun
open(newunit=lun,encoding='UTF-8',status='scratch')
write(lun,'(A)')string
rewind(lun)
read(lun,'(A)')line
close(lun)
corrected=trim(line)
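! note: reopening output_unit here is a side effect so the caller can print ucs4 on stdout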
open (output_unit, encoding='UTF-8')
end function utf8_to_utf32
end program demo_selected_char_kind_ucs4
```
Output:
```text
treated as ASCII
NOT_ASCII:ľščťžýáߥʮ™√🙂
LEN NOT_ASCII: 32
STORAGE_SIZE: 256
properly converted to ucs4
CORRECTED:ľščťžýáߥʮ™√🙂
LEN CORRECTED: 14
STORAGE_SIZE: 448
z'13E'
z'161'
z'10D'
z'165'
z'17E'
z'FD'
z'E1'
z'DF'
z'104'
z'118'
z'AE'
z'2122'
z'221A'
z'1F642'
```
The real problem (when you need to set the compiler option) is when the file is encoded in e.g. ISO8859-2 (as the substring "ľščťžýá" here is) but the current locale is e.g. set to UTF-8. Unicode shares its first 256 code points with ISO8859-1, so if the compiler decodes the file and the string using the wrong assumption, it gets the wrong unicode scalar values: the ISO8859-1 ones instead of the ISO8859-2 ones (see ISO/IEC 8859-2 - Wikipedia). The result would be "µ¹è»¾ýá".
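The effect is easy to reproduce from Fortran, since taking an ISO8859-2 byte at face value as a code point yields the ISO8859-1 glyph. A small illustration, assuming a compiler that supports the ISO_10646 kind:

```fortran
program mojibake
use iso_fortran_env, only : output_unit
implicit none
intrinsic selected_char_kind
integer, parameter :: ucs4 = selected_char_kind('ISO_10646')
open(output_unit, encoding='UTF-8')
! 0xB5 is 'ľ' in ISO8859-2, but code point U+00B5 is µ
write(*,*) char(int(z'B5'), ucs4)
end program mojibake
```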
VS Studio (the one without “Code”) has been a source of “fun” by using UTF-16 encoded files (Developer Community).
Well, avoiding mojibake was part of the reason for unicode, and even though UTF-16 looks like it is going to help carry on the tradition of making this harder than it could be, I cannot think of a bullet-proof way to always guess the encoding correctly. We are probably not going to get anyone to require a magic string at the beginning of each source file, like the one the file(1) command uses, so I think I am just going to try to sort out some of the UCS-4/UTF-8 coding and not worry about extended ASCII, box characters, mostly-Eastern versus mostly-Central-Western character sets, and all the other encodings. So far I do not think any Fortran compiler supports UTF-8 as an intrinsic character kind; for some probably good reasons the internal representation seems to be UCS-4. So figuring out what an assignment of a UTF-8 variable to a UCS-4 variable might do, or whether reading from a UTF-8-kind variable with an internal write would do conversion as if it were a file with ENCODING set to a non-default value, is probably not something I can actually try anyway. But making a function to convert between UCS-4 and UTF-8 might actually have some usefulness and seems doable, so my take-away is to make myself one of those unless someone has one already. Since the story of the Tower of Babel is pretty old, I do not think I am going to solve all these issues this afternoon :>. Interesting though; some of these issues I do not commonly come across myself, at least for the time being.
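For the UCS-4-to-UTF-8 direction, here is a minimal encoder sketch to drop into a module; ucs4_to_utf8 is my own name, not an intrinsic, and it does no validation of surrogates or other ill-formed code points:

```fortran
! hedged sketch: encode a UCS-4 string into UTF-8 bytes held in a default-kind
! string; assumes char() of default kind maps 0..255 to the raw byte values
function ucs4_to_utf8(string) result(bytes)
   integer, parameter :: ucs4 = selected_char_kind('ISO_10646')
   character(len=*, kind=ucs4), intent(in) :: string
   character(len=:), allocatable :: bytes
   integer :: i, cp
   bytes = ''
   do i = 1, len(string)
      cp = ichar(string(i:i))          ! the Unicode code point
      select case (cp)
      case (0:127)                     ! 0xxxxxxx
         bytes = bytes // char(cp)
      case (128:2047)                  ! 110xxxxx 10xxxxxx
         bytes = bytes // char(192 + ishft(cp, -6)) &
                        // char(128 + iand(cp, 63))
      case (2048:65535)                ! 1110xxxx 10xxxxxx 10xxxxxx
         bytes = bytes // char(224 + ishft(cp, -12)) &
                        // char(128 + iand(ishft(cp, -6), 63)) &
                        // char(128 + iand(cp, 63))
      case (65536:1114111)             ! 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
         bytes = bytes // char(240 + ishft(cp, -18)) &
                        // char(128 + iand(ishft(cp, -12), 63)) &
                        // char(128 + iand(ishft(cp, -6), 63)) &
                        // char(128 + iand(cp, 63))
      end select
   enddo
end function ucs4_to_utf8
```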
So it all seems to work: slicing, INDEX, LEN and other intrinsics, and A format descriptors all behave like you would intuitively expect, at least with gfortran.
Still questions, but once the data is UCS-4 it is as easy to work with as default CHARACTER variables, which is quite nice. I am not really sure how many other Fortran compilers support “actual” unicode instead of working with byte streams of UTF-8 characters, but I really like being able to use the intrinsics dealing with character variables basically as easily as with ASCII.
I have learned a few things about working with unicode in Fortran while making this example; other than that, “caveat emptor”, but for anyone interested:
```fortran
module utf8_decoder
implicit none
private
public :: utf8_to_codepoints
public :: utf8_to_ucs4
integer, parameter :: ucs4 = selected_char_kind ('ISO_10646')
contains
subroutine utf8_to_codepoints(utf8, out, err)
character(len=*), intent(in) :: utf8
integer, allocatable, intent(out) :: out(:) ! Unicode code points (left unallocated if err /= 0)
integer, intent(out) :: err
integer :: n_out
integer :: i, len8, b1, b2, b3, b4
integer :: cp, nbytes
integer :: temp(4*len(utf8)) ! oversized scratch: N bytes of utf-8 decode to at most N code points
err = 0
len8 = len_trim(utf8)
i = 1
n_out = 0
do while (i <= len8)
b1 = iachar(utf8(i:i))
if (b1 < 0) b1 = b1 + 256
select case (b1)
case (0:127)
cp = b1
nbytes = 1
case (192:223)
if (i+1 > len8) then; err = 1; return; endif
b2 = iachar(utf8(i+1:i+1)); if (b2 < 0) b2 = b2 + 256
if (iand(b2, 192) /= 128) then; err = 2; return; endif
cp = iand(b1, 31)
cp = ishft(cp,6) + iand(b2,63)
nbytes = 2
case (224:239)
if (i+2 > len8) then; err = 1; return; endif
b2 = iachar(utf8(i+1:i+1)); if (b2 < 0) b2 = b2 + 256
b3 = iachar(utf8(i+2:i+2)); if (b3 < 0) b3 = b3 + 256
if (iand(b2, 192) /= 128 .or. iand(b3, 192) /= 128) then; err = 2; return; endif
cp = iand(b1, 15)
cp = ishft(cp,6) + iand(b2,63)
cp = ishft(cp,6) + iand(b3,63)
nbytes = 3
case (240:247)
if (i+3 > len8) then; err = 1; return; endif
b2 = iachar(utf8(i+1:i+1)); if (b2 < 0) b2 = b2 + 256
b3 = iachar(utf8(i+2:i+2)); if (b3 < 0) b3 = b3 + 256
b4 = iachar(utf8(i+3:i+3)); if (b4 < 0) b4 = b4 + 256
if (iand(b2,192)/=128 .or. iand(b3,192)/=128 .or. iand(b4,192)/=128) then
err = 2; return
endif
cp = iand(b1, 7)
cp = ishft(cp,6) + iand(b2,63)
cp = ishft(cp,6) + iand(b3,63)
cp = ishft(cp,6) + iand(b4,63)
nbytes = 4
case default
err = 3
return
end select
if (n_out >= size(temp)) then
err = 4
return
endif
n_out = n_out + 1
temp(n_out) = cp
i = i + nbytes
enddo
allocate(out(n_out))
out = temp(1:n_out)
end subroutine utf8_to_codepoints
function utf8_to_ucs4_via_io(string) result(corrected)
character(len=*),intent(in) :: string
character(len=:,kind=ucs4),allocatable :: corrected
character(len=(4*len(string)),kind=ucs4) :: line
integer :: lun
open(newunit=lun,encoding='UTF-8',status='scratch')
write(lun,'(A)')string
rewind(lun)
read(lun,'(A)')line
close(lun)
corrected=trim(line)
end function utf8_to_ucs4_via_io
function utf8_to_ucs4(string) result(corrected)
! return a string of kind ucs4
character(len=*),intent(in) :: string
character(len=:,kind=ucs4),allocatable :: corrected
integer, allocatable :: codepoints(:)
integer :: i, n
integer :: err
call utf8_to_codepoints(string,codepoints,err)
! codepoints is unallocated if a decode error occurred, so bail out early
if (err /= 0) then
corrected = ucs4_''
return
endif
n=size(codepoints)
allocate(character(len=n,kind=ucs4) :: corrected)
do i=1,n
corrected(i:i)=char(codepoints(i),kind=ucs4)
enddo
end function utf8_to_ucs4
end module
program test_utf8
use iso_fortran_env, only : output_unit
use utf8_decoder, only : utf8_to_ucs4, utf8_to_codepoints
implicit none
character(len=100) :: s
integer, allocatable :: codepoints(:)
integer :: i, err
intrinsic selected_char_kind
integer, parameter :: ucs4 = selected_char_kind ('ISO_10646')
character(len=*),parameter :: not_ascii = "ľščťžýáߥʮ™√🙂"
character(len=:,kind=ucs4),allocatable :: corrected
character(len=1,kind=ucs4) :: letter
s = "Héllo 🌍" ! UTF-8 input string
call utf8_to_codepoints(s, codepoints, err)
if (err /= 0) then
print *, "Decode error:", err
else
print *, "Code points:"
do i = 1, size(codepoints)
write(*,'("U+",Z8.8)') codepoints(i)
enddo
endif
open (output_unit, encoding='UTF-8')
call utf8_to_codepoints(not_ascii,codepoints,err)
write(*,*)'SIZE:',size(codepoints), merge('PASSED','FAILED',size(codepoints)==14)
write(*,'(*(a))')char(codepoints,kind=ucs4)
write(*,'("UNICODE NOTATION:",T20,*("U+",z6.6:," "))')codepoints
write(*,'("HTML:",T20,*("&#x",z0,";":))')codepoints
write(*,'("C/PYTHON:",T20,*("\U",z8.8:))')codepoints
write(*,*)'properly converted to ucs4'
corrected=utf8_to_ucs4(not_ascii)
write(*,*)'CORRECTED:',corrected
write(*,*)'LEN CORRECTED:',len(corrected)
write(*,*)'STORAGE_SIZE:',storage_size(corrected)
write(*,*)'so now slicing, intrinsics, ... work!!'
letter=corrected(14:14)
write(*,*)'14th character',letter
write(*,*)'index position',index(corrected,letter)
write(*,*)'variable= &'
write(*,'((" char(int(z''",z0,"''))",:,", &"))')(corrected(i:i),i=1,len(corrected))
end program
```
Once you have the data encoded properly as 4-byte characters (i.e. as UCS-4), it is easy to write the strings out properly as HTML, slice out particular glyphs, or get the index where a character is found, without having to fiddle with "raw" utf-8 glyphs that can be anywhere from 1 to 4 bytes each. There are still lots of questions about how to sort unicode glyphs, which direction to print the strings, and so on, but that is just part of the nature of the many writing systems used around the world.
### Output
```text
Code points:
U+00000048
U+000000E9
U+0000006C
U+0000006C
U+0000006F
U+00000020
U+0001F30D
SIZE: 14 PASSED
ľščťžýáߥʮ™√🙂
UNICODE NOTATION: U+00013E U+000161 U+00010D U+000165 U+00017E U+0000FD U+0000E1 U+0000DF U+000104 U+000118 U+0000AE U+002122 U+00221A U+01F642
HTML: ľščťžýáߥʮ™√🙂
C/PYTHON: \U0000013E\U00000161\U0000010D\U00000165\U0000017E\U000000FD\U000000E1\U000000DF\U00000104\U00000118\U000000AE\U00002122\U0000221A\U0001F642
properly converted to ucs4
CORRECTED:ľščťžýáߥʮ™√🙂
LEN CORRECTED: 14
STORAGE_SIZE: 448
so now slicing, intrinsics, ... work!!
14th character🙂
index position 14
variable= &
char(int(z'13E')), &
char(int(z'161')), &
char(int(z'10D')), &
char(int(z'165')), &
char(int(z'17E')), &
char(int(z'FD')), &
char(int(z'E1')), &
char(int(z'DF')), &
char(int(z'104')), &
char(int(z'118')), &
char(int(z'AE')), &
char(int(z'2122')), &
char(int(z'221A')), &
char(int(z'1F642'))
```
That had been very common back in the day; e.g. Emacs recognizes file-local variable comments like

```c
/* -*-mode: C; coding: latin-2; -*- */
```
You haven’t tested the utf8_to_ucs4_via_io routine in the test program, so I did. It works fine, and shows an additional nice feature of encoding='utf-8': it works both with default-character and ucs-4 strings, apparently copying the former as-is and converting the latter. The modified function:
```fortran
function utf8_to_ucs4_via_io(string) result(corrected)
character(len=*),intent(in) :: string
character(len=:,kind=ucs4),allocatable :: corrected
character(len=len(string),kind=ucs4) :: line
integer :: lun
open(newunit=lun,file='utf8.out',encoding='UTF-8',status='replace')
write(lun,'(A)')string
rewind(lun)
read(lun,'(A)')line
close(lun)
corrected=trim(line)
open(newunit=lun,file='utf8r.out',encoding='UTF-8',status='replace')
write(lun,'(A)') corrected
close(lun)
end function utf8_to_ucs4_via_io
```
generates two identical UTF-8 files. I have also removed the quadrupling of the length of string in the declaration of line, as I presume kind=ucs4 already makes a single character of line 4 bytes long. I will later check how deep the conformance of this I/O conversion is with the Unicode Standard; in particular, how it processes ill-formed UTF-8/UTF-32 sequences/codepoints.
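As a quick probe of the pure-Fortran decoder's error codes, one can feed utf8_to_codepoints from the module above some deliberately ill-formed input; by my reading of its select case, these two cases should report err=1 and err=3:

```fortran
program probe_errors
use utf8_decoder, only : utf8_to_codepoints
implicit none
integer, allocatable :: cp(:)
integer :: err
! a two-byte lead byte (0xC4) with no continuation byte after it
call utf8_to_codepoints(char(int(z'C4')), cp, err)
print *, 'err =', err   ! expect 1: sequence runs past end of input
! a continuation byte (0x80) where a lead byte is required
call utf8_to_codepoints(char(int(z'80')), cp, err)
print *, 'err =', err   ! expect 3: invalid lead byte
end program probe_errors
```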
Windows users may find some additional pieces of info here. The discussion was mostly about dealing with special characters (French éèêë, among others) on Windows without relying on compiler extensions like ucs4 in gfortran.
We used to require any file format our production codes used, particularly binary files, to have a magic string at the beginning, along with a description of the file's format statements or structure, including endianness, what platform it was generated on, word sizes, mantissa, …
At a time when basically each platform had its own binary file format this was very useful. As the number of commonly encountered native binary formats consolidated, with utf-8 gradually being supported in text and HTML and the use of file suffixes becoming more standardized (creeping in from the MSWindows platforms), that practice faded into the background, but it is still used in a few places.
It recently came in really handy when someone needed to resurrect some old information and was confronted with old binary files they could not make any sense of. It turned out the files had the magic tag in them, and the old records indicated they were Cray COS binary files and defined their structure. With a little bit of bit fiddling they were totally usable after all.
The other lesson we learned long ago was to highly encourage that archived data be converted to text whenever reasonable.
But Fortran utf-8 source files are not standard-conforming, just close to it. The code itself, of course, has to be written in the Fortran character set. Comments work out to be usable as utf-8 without much risk, but utf-8 constant strings are totally undefined by the standard, yet potentially more and more desirable as utf-8 becomes the de-facto standard text file type in more and more cases.
Thanks for the feedback. That routine is the closest thing to a built-in Fortran method I could think of for converting the pseudo-utf-8 ASCII strings into Fortran unicode. Keeping the unicode constants out of the source file completely and keeping the strings in a data file is of course the truly standard-conforming method, but there are just too many times I want to have the strings in the source file not to look at how viable this might be. Ultimately it would be best to have utf-8 string constants defined in the standard, so playing around with it might also lead to some ideas on what the solutions might be.
I captured the code in a GitHub project with CI/CD tests and fpm and cmake builds, in case anyone is interested in the future.
It passes tests on gfortran and flang-new, but the output for ASCII bytes representing utf-8 written to stdout differs between them. The gfortran results are more consistent and I think preferable, but if you use the routines in M_utf8 and ucs-4 characters you get the expected characters. ifx, lfortran, and nvfortran do not appear to support ucs-4 yet. I would be interested in results from other compilers.
AMD flang does not support ucs-4.
The documentation indicates NAG does, with extensions to fully support utf-8 source files. Not sure about IBM XL, Cray, … but flang-new worked well enough to pass the simple tests (I only accessed it through the GitHub CI/CD interface). I still have questions about what is standard and what is not concerning ucs4 constants. For example, at least ASCII seems to be automatically converted to 4-byte characters with something like

```fortran
character(kind=ucs4,len=*),parameter :: str = ucs4_'abcde'
```

but utf-8 characters taking more than one byte do not work; and I am not sure what might happen with one-byte characters above 127, which are often used for latin1 and latin2 characters, and so on.
Is there any kind of automatic “promotion”, and are assignments between kinds allowed? It looks like some of this was allowed in Fortran 2003 but then retracted (?).
With the output file opened with encoding='utf-8', one compiler still prints ascii and utf-8 characters stored in ascii bytes as well as ucs4 characters, while another only gets the output right if the variables written are ucs4. I have not even tried stream I/O with utf-8 files, and I think it is an extension that some I/O statements can have character variables of different kinds on them, and so on.
So if you make all your variables ucs4 and read and write just to utf8-encoded files, it is very easy and all the intrinsics work (I had a few problems with some field descriptors like T, though), and that is useful for a lot of problems. It gets muddier when the input source contains utf-8 characters, when figuring out exactly how internal reads and writes should work, what an assignment should or should not do, and whether you can use the // operator when some values are ascii or default and others are ucs4, and so on; so it would be nice to try as many compilers as possible that support unicode, for example with the probe after this paragraph.
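A small probe one can feed to a compiler to see what kind mixing it accepts; my reading of F2018 10.2.1.2 is that intrinsic assignment may mix the default, ASCII, and ISO 10646 kinds, while // across kinds is not standard:

```fortran
program probe_kinds
implicit none
intrinsic selected_char_kind
integer, parameter :: ucs4 = selected_char_kind('ISO_10646')
character(len=5, kind=ucs4) :: u
character(len=5)            :: d = 'abcde'
u = d             ! default -> ISO 10646: allowed by my reading of the standard
d = u             ! ISO 10646 -> default: also allowed, but may lose data
u = ucs4_'abcde'  ! a ucs4 constant built from Fortran-character-set text
! u = ucs4_'ľ'    ! non-ASCII in a literal relies on source encoding: not portable
print *, d
end program probe_kinds
```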
Interestingly, if you convert the strings to code-point integers, as one of the example procedures does, you can get by pretty well even on compilers like ifx that do not support unicode, essentially working with the integers the way C historically did with ASCII. So at least for now I am sorting out whether there is a better way than “char(int(z'FFFF'),kind=ucs4)” on every character to reliably place unicode into source files; that works, but it is painfully verbose and hard to read in the code.
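One way to cut that verbosity is a tiny helper that builds a ucs4 string from an array of code points in one call; from_codepoints is my own name, not an intrinsic, and it belongs in a module so callers see its interface:

```fortran
! hedged sketch: one code point in, one ucs4 character out, per array element
pure function from_codepoints(cp) result(s)
   integer, parameter :: ucs4 = selected_char_kind('ISO_10646')
   integer, intent(in) :: cp(:)               ! Unicode code points
   character(len=size(cp), kind=ucs4) :: s
   integer :: i
   do i = 1, size(cp)
      s(i:i) = char(cp(i), kind=ucs4)
   enddo
end function from_codepoints
```

so that, say, “ni hao” becomes from_codepoints([int(z'4F60'), int(z'597D')]) instead of one char(int(z'...'),kind=ucs4) per glyph.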
Unicode was easy to avoid in most Fortran-centric environments for quite a while, but it hardly seems feasible now not to need to interface to utf-8 files and to produce utf-8 messages from codes. Does anyone know what timetables the compilers that do not yet support ucs4 are on to provide it? I mean, everything has emojis in it now.