Warning: command-line option ‘-finput-charset=utf-8’ is valid for C/C++/ObjC/ObjC++ but not for Fortran.
How to fix it?
For Fortran, if the compiler supports it, you specify UTF-8 using standard code; no special compiler options are required in general. For example:
```fortran
program demo_selected_char_kind
use iso_fortran_env
implicit none
intrinsic date_and_time,selected_char_kind
! set some aliases for common character kinds
! as the numbers can vary from platform to platform
integer, parameter :: default = selected_char_kind ("default")
integer, parameter :: ascii = selected_char_kind ("ascii")
integer, parameter :: ucs4 = selected_char_kind ('ISO_10646')
integer, parameter :: utf8 = selected_char_kind ('utf-8')
! assuming ASCII and UCS4 are supported (ie. not equal to -1)
! define some string variables
character(len=26, kind=ascii ) :: alphabet
character(len=30, kind=ucs4 ) :: hello_world
character(len=30, kind=ucs4 ) :: string
! note: MERGE requires TSOURCE and FSOURCE to have the same length
write(*,*)'ASCII     ', merge('Supported    ','Not Supported',ascii /= -1)
write(*,*)'ISO_10646 ', merge('Supported    ','Not Supported',ucs4 /= -1)
write(*,*)'UTF-8     ', merge('Supported    ','Not Supported',utf8 /= -1)
if(default.eq.ascii)then
write(*,*)'ASCII is the default on this processor'
endif
! for character constants the kind precedes the value
! (unlike numeric constants, where the kind suffix follows)
alphabet = ascii_"abcdefghijklmnopqrstuvwxyz"
write (*,*) alphabet
hello_world = ucs4_'Hello World and Ni Hao -- ' &
// char (int (z'4F60'), ucs4) &
// char (int (z'597D'), ucs4)
! an encoding option is required on OPEN for non-default I/O
if(ucs4 /= -1 )then
open (output_unit, encoding='UTF-8')
write (*,*) trim (hello_world)
else
write (*,*) 'cannot use utf-8'
endif
call create_date_string(string)
write (*,*) trim (string)
contains
! The following produces a Japanese date stamp.
subroutine create_date_string(string)
intrinsic date_and_time,selected_char_kind
integer,parameter :: ucs4 = selected_char_kind("ISO_10646")
character(len=1,kind=ucs4),parameter :: &
nen = char(int( z'5e74' ),ucs4), & ! year
gatsu = char(int( z'6708' ),ucs4), & ! month
nichi = char(int( z'65e5' ),ucs4) ! day
character(len= *, kind= ucs4) string
integer values(8)
call date_and_time(values=values)
write(string,101) values(1),nen,values(2),gatsu,values(3),nichi
101 format(*(i0,a))
end subroutine create_date_string
end program demo_selected_char_kind
```
Output:
```text
ASCII Supported
ISO_10646 Supported
UTF-8 Not Supported
ASCII is the default on this processor
abcdefghijklmnopqrstuvwxyz
Hello World and Ni Hao -- 你好
2025年7月25日
```
So for gfortran it looks like you want to use ISO_10646, or you will have to handle things at a lower level yourself with UTF-8, most likely using stream I/O. There have been previous discussions on this topic on this forum.
I found that a bit confusing. ‘UTF-8’ is not mentioned as a name of a character kind in the Standard, as ‘ISO_10646’ is. On the other hand, ‘UTF-8’ is a standard option to the encoding= specifier of the open statement. What follows in the Standard, namely
12.5.6.9 ENCODING= specifier in the OPEN statement
The scalar-default-char-expr shall evaluate to UTF-8 or DEFAULT. The ENCODING= specifier is permitted only for a connection for formatted input/output. The value UTF-8 specifies that the encoding form of the file is UTF-8 as specified in ISO/IEC 10646. Such a file is called a Unicode file, and all characters therein are of ISO 10646 character kind. The value UTF-8 shall not be specified if the processor does not support the ISO 10646 character kind. […]
is even more confusing, as it clearly states (unless I totally miss it) that a file written with encoding='utf-8' should contain ISO 10646 characters, that is, all 32-bit entities. Which is obviously not what utf-8 stands for. Sure enough, the gfortran implementation writes true utf-8, as expected by common sense but apparently not by the Standard.
For the skeptics, the definition of the above mentioned ‘ISO 10646 character’, from the same F2023 Draft:
3.93 ISO 10646 character
character whose representation method corresponds to UCS-4 in ISO/IEC 10646
All I know about utf-8 is that it is a variable-width encoding. Can someone fill in some of the details about Fortran support for all of these character sets so we can follow the discussion at least a little?
Some of the confusion has to do with the standards themselves, but first: the internal representation defined by the kind and the encoding of the output are two separate attributes.
The encoding of the file is specified as utf-8, but that does not in any way mean the data is stored internally as utf-8. The open statement says to translate the data to utf-8 on output and to translate it from utf-8 on input. The KIND says how the resulting unicode data is to be stored internally by the running program.
UTF-8 is by far the most common unicode encoding supported for files, terminal emulators, and applications. It is more compact than UTF-32, so if you are going to output unicode characters it is almost a given the file encoding will be UTF-8.
But storing the unicode character data internally might be done better with UCS-4 (aka UTF-32), because every character is the same size (4 bytes). In Fortran all the character intrinsics assume fixed-size characters (so far), and a lot of optimizations are easier when data sizes are fixed; so most Fortran implementations of Unicode store the characters internally as UCS-4 values.
Of course ASCII (all characters are 1 byte) and UTF-8 (each character/glyph is from 1 to 4 bytes, and in much text the smaller sizes are very common) could save a lot of memory, so a compiler might want to support UTF-8 as an internal encoding as well. But gfortran does not currently do that.
So you have a selection of how the unicode characters are stored internally, and a selection of how the characters are encoded when read and written.
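As a minimal sketch of those two independent choices (assuming a compiler that supports the ISO_10646 kind, such as gfortran; the program name is mine):

```fortran
program two_choices
use iso_fortran_env, only : output_unit
implicit none
intrinsic selected_char_kind
integer, parameter :: ucs4 = selected_char_kind('ISO_10646')
! choice 1: the KIND fixes the internal representation -- UCS-4, 4 bytes/char
character(len=2, kind=ucs4) :: nihao
nihao = char(int(z'4F60'), ucs4) // char(int(z'597D'), ucs4)
write(*,*) 'bytes stored internally:', storage_size(nihao)/8   ! 8, not 2
! choice 2: ENCODING= on OPEN fixes the external form -- translated to UTF-8
open(output_unit, encoding='UTF-8')
write(*,*) nihao
end program two_choices
```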
One point of confusion is that ISO 10646 defines multiple encoding forms (UCS-2, UCS-4, UTF-8, UTF-16, …) as different ways to store data representing Unicode glyphs, but the name sometimes gets used when really only one specific encoding is meant, which can lead to confusion.
UCS-4 is a good match for Fortran for internally representing unicode characters: it is efficient (except for memory use) and generally easier to implement because it is fixed-size. UTF-8, on the other hand, does not care about the endianness of the platform and usually takes far fewer bytes to store and transfer, so that is almost always how unicode is stored in files.
Think of how numbers are internally represented as binary data, but you often read and write them as ASCII text. In the same way, unicode might be represented internally in some arbitrary manner and just input and output as UTF-8. But the ISO 10646 character kind that the standard does define corresponds to UCS-4 (UTF-32). Probably at some point UTF-16 will show up.
This is not what -finput-charset=utf-8 is about. This gcc compiler option sets the encoding of the source files (the text in e.g. .f90 files) and has nothing (well, not directly) to do with string handling in Fortran code.
```text
-finput-charset=charset
    Set the input character set, used for translation from the character set
    of the input file to the source character set used by GCC. If the locale
    does not specify, or GCC cannot get this information from the locale, the
    default is UTF-8. This can be overridden by either the locale or this
    command-line option. Currently the command-line option takes precedence
    if there's a conflict. charset can be any encoding supported by the
    system's "iconv" library routine.
```
The OP never describes the underlying need for UTF-8. It would seem it is either to read and write unicode characters outside of the ASCII range, or to allow such glyphs to be used as symbol names. The Fortran character set, which must be used to compose all code, is a subset of ASCII, so if the latter is the desire it is clearly non-standard; even if a compiler supported non-ASCII symbol names it would be highly non-portable. The other, more likely, use is to read and write arbitrary unicode characters. The technique that is on more solid ground is to use the UCS-4 internal representation if the target compiler is gfortran. You can generally get away with handling unicode as byte streams and processing it yourself if no unicode encoding is supported by the compiler, but it gets confusing if you put anything but ASCII in the source file, so I would generally not recommend using unicode glyphs directly in string constants in the code either; although if you understand that one “character” may actually be one to four bytes as far as Fortran is concerned, you can make that work. Stream I/O usually makes it a little more portable. So without more info, one possible scenario is simply not valid: no, do not try to use utf-8 input files containing non-ASCII symbol names. The other possibility, full unicode I/O, is discussed above; it often confuses new users because it is easy to get something to “work” in one environment that will not work in others if you do not follow the somewhat non-intuitive standard-supported usage.
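For the byte-stream route mentioned above, here is a minimal sketch of slurping a UTF-8 file as raw bytes (the file name is hypothetical, and decoding is left entirely to the program):

```fortran
program read_bytes
implicit none
character(len=:), allocatable :: raw
integer :: lun, nbytes
! stream access reads the file as-is, with no encoding translation
open(newunit=lun, file='message.utf8', access='stream', &
     form='unformatted', status='old', action='read')
inquire(unit=lun, size=nbytes)   ! file size in bytes (F2008)
allocate(character(len=nbytes) :: raw)
read(lun) raw
close(lun)
! raw now holds the UTF-8 bytes, one byte per "character";
! decode them with something like the utf8_to_codepoints() shown later
end program read_bytes
```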
Normally (in languages other than Fortran) that is mainly to be able to (correctly) read string literals. Something like:

```c
char *not_ascii = "ľščťžýáߥʮ™√🙂";
```
So the goal is to be able to type unicode characters into constants in the Fortran source file, which is by definition written in a subset of ASCII (the Fortran character set). Since ASCII is a subset of unicode, basically you are saying you want the source file to be a UTF-8 file. But that would store the unicode constants as a stream of bytes. Most Fortran compilers' default character kind allows all byte values to be used, not just ASCII, so you can very likely get away with that.
But the only built-in conversion from a bunch of bytes representing utf-8 to ucs4 (which is what gfortran currently supports internally) is to actually read and write the data. That works, as demonstrated below, but is a pretty costly way to go about it. It might be possible with an internal read and write, but I do not think so (I did not try). It is interesting that gfortran has a built-in conversion between ucs4 and utf-8 in order to do READ and WRITE, but as far as I know it does not expose that publicly. Maybe using kind with char/ichar might work; I have never tried that. If it ends up that char/ichar/internal I/O cannot solve it, perhaps a new intrinsic is in order. I might play with this. If anyone else figures out a better way to do ucs4-utf8 conversion, please post it.
```fortran
program demo_selected_char_kind_ucs4
use iso_fortran_env
implicit none
intrinsic date_and_time,selected_char_kind
integer, parameter :: ucs4 = selected_char_kind ('ISO_10646')
character(len=*),parameter:: not_ascii = "ľščťžýáߥʮ™√🙂"
character(len=:,kind=ucs4),allocatable :: corrected
integer :: i
write(*,*)'treated as ASCII'
write(*,*)'NOT_ASCII:',not_ascii
write(*,*)'LEN NOT_ASCII:',len(not_ascii)
write(*,*)'STORAGE_SIZE:',storage_size(not_ascii)
write(*,*)'properly converted to ucs4'
corrected=utf8_to_utf32(not_ascii)
write(*,*)'CORRECTED:',corrected
write(*,*)'LEN CORRECTED:',len(corrected)
write(*,*)'STORAGE_SIZE:',storage_size(corrected)
do i=1,len(corrected)
write(*,'("z''",z0,"''")')corrected(i:i)
enddo
contains
function utf8_to_utf32(string) result(corrected)
! convert utf-8 bytes held in a default-kind string to a ucs4 string
! by round-tripping through a scratch file
character(len=*),intent(in) :: string
character(len=:,kind=ucs4),allocatable :: corrected
character(len=(4*len(string)),kind=ucs4) :: line
integer :: lun
open(newunit=lun,encoding='UTF-8',status='scratch')
write(lun,'(A)')string
rewind(lun)
read(lun,'(A)')line
close(lun)
corrected=trim(line)
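! note: reopening output_unit here is a side effect so the caller can print ucs4 on stdout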
open (output_unit, encoding='UTF-8')
end function utf8_to_utf32
end program demo_selected_char_kind_ucs4
```
Output:
```text
treated as ASCII
NOT_ASCII:ľščťžýáߥʮ™√🙂
LEN NOT_ASCII: 32
STORAGE_SIZE: 256
properly converted to ucs4
CORRECTED:ľščťžýáߥʮ™√🙂
LEN CORRECTED: 14
STORAGE_SIZE: 448
z'13E'
z'161'
z'10D'
z'165'
z'17E'
z'FD'
z'E1'
z'DF'
z'104'
z'118'
z'AE'
z'2122'
z'221A'
z'1F642'
```
The real problem (when you need to set the compiler option) is when the file is encoded in e.g. ISO8859-2 (as the substring "ľščťžýá" here is) but the current locale is e.g. set to UTF-8. Unicode shares its first 256 code points with ISO8859-1, so if the compiler decodes the file and the string using the wrong assumption, it gets the wrong unicode scalar values: the ISO8859-1 ones instead of the ISO8859-2 ones (see ISO/IEC 8859-2 - Wikipedia). The result would be "µ¹è»¾ýá".
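The effect is easy to reproduce from Fortran, since taking an ISO8859-2 byte at face value as a code point yields the ISO8859-1 glyph. A small illustration, assuming a compiler that supports the ISO_10646 kind:

```fortran
program mojibake
use iso_fortran_env, only : output_unit
implicit none
intrinsic selected_char_kind
integer, parameter :: ucs4 = selected_char_kind('ISO_10646')
open(output_unit, encoding='UTF-8')
! 0xB5 is 'ľ' in ISO8859-2, but code point U+00B5 is µ
write(*,*) char(int(z'B5'), ucs4)
end program mojibake
```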
VS Studio (the one without “Code”) has been a source of “fun” by using UTF-16 encoded files (Developer Community).
Well, avoiding mojibake was part of the reason for unicode, and even though UTF-16 looks like it is going to help carry on the tradition of making this harder than it could be, I cannot think of a bullet-proof way to always guess the encoding correctly. We are probably not going to get anyone to require a magic string at the beginning of each source file, like the one the file(1) command uses, so I think I am just going to try to sort out some of the UCS-4/UTF-8 coding and not worry about extended ASCII, box characters, mostly-Eastern versus mostly-Central-Western character sets, and all the other encodings. So far I do not think any Fortran compiler supports UTF-8 as an intrinsic character kind; for some probably good reasons the internal representation seems to be UCS-4. So figuring out what an assignment of a UTF-8 variable to a UCS-4 variable might do, or whether reading from a UTF-8-kind variable with an internal write would do conversion as if it were a file with ENCODING set to a non-default value, is probably not something I can actually try anyway. But making a function to convert between UCS-4 and UTF-8 might actually have some usefulness and seems doable, so my take-away is to make myself one of those unless someone has one already. Since the story of the Tower of Babel is pretty old, I do not think I am going to solve all these issues this afternoon :>. Interesting though; some of these issues I do not commonly come across myself, at least for the time being.
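For the UCS-4-to-UTF-8 direction, here is a minimal encoder sketch to drop into a module; ucs4_to_utf8 is my own name, not an intrinsic, and it does no validation of surrogates or other ill-formed code points:

```fortran
! hedged sketch: encode a UCS-4 string into UTF-8 bytes held in a default-kind
! string; assumes char() of default kind maps 0..255 to the raw byte values
function ucs4_to_utf8(string) result(bytes)
   integer, parameter :: ucs4 = selected_char_kind('ISO_10646')
   character(len=*, kind=ucs4), intent(in) :: string
   character(len=:), allocatable :: bytes
   integer :: i, cp
   bytes = ''
   do i = 1, len(string)
      cp = ichar(string(i:i))          ! the Unicode code point
      select case (cp)
      case (0:127)                     ! 0xxxxxxx
         bytes = bytes // char(cp)
      case (128:2047)                  ! 110xxxxx 10xxxxxx
         bytes = bytes // char(192 + ishft(cp, -6)) &
                        // char(128 + iand(cp, 63))
      case (2048:65535)                ! 1110xxxx 10xxxxxx 10xxxxxx
         bytes = bytes // char(224 + ishft(cp, -12)) &
                        // char(128 + iand(ishft(cp, -6), 63)) &
                        // char(128 + iand(cp, 63))
      case (65536:1114111)             ! 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
         bytes = bytes // char(240 + ishft(cp, -18)) &
                        // char(128 + iand(ishft(cp, -12), 63)) &
                        // char(128 + iand(ishft(cp, -6), 63)) &
                        // char(128 + iand(cp, 63))
      end select
   enddo
end function ucs4_to_utf8
```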
So it all seems to work: slicing, INDEX, LEN and other intrinsics, and A format descriptors all behave like you would intuitively expect, at least with gfortran.
Still questions, but once the data is UCS-4 it is as easy to work with as default CHARACTER variables, which is quite nice. I am not really sure how many other Fortran compilers support “actual” unicode instead of working with byte streams of UTF-8 characters, but I really like being able to use the intrinsics dealing with character variables basically as easily as with ASCII.
I have learned a few things about working with unicode in Fortran while making this example; other than that, “caveat emptor”, but for anyone interested:
```fortran
module utf8_decoder
implicit none
private
public :: utf8_to_codepoints
public :: utf8_to_ucs4
integer, parameter :: ucs4 = selected_char_kind ('ISO_10646')
contains
subroutine utf8_to_codepoints(utf8, out, err)
character(len=*), intent(in) :: utf8
integer, allocatable, intent(out) :: out(:) ! Unicode code points (left unallocated if err /= 0)
integer, intent(out) :: err
integer :: n_out
integer :: i, len8, b1, b2, b3, b4
integer :: cp, nbytes
integer :: temp(4*len(utf8)) ! oversized scratch: N bytes of utf-8 decode to at most N code points
err = 0
len8 = len_trim(utf8)
i = 1
n_out = 0
do while (i <= len8)
b1 = iachar(utf8(i:i))
if (b1 < 0) b1 = b1 + 256
select case (b1)
case (0:127)
cp = b1
nbytes = 1
case (192:223)
if (i+1 > len8) then; err = 1; return; endif
b2 = iachar(utf8(i+1:i+1)); if (b2 < 0) b2 = b2 + 256
if (iand(b2, 192) /= 128) then; err = 2; return; endif
cp = iand(b1, 31)
cp = ishft(cp,6) + iand(b2,63)
nbytes = 2
case (224:239)
if (i+2 > len8) then; err = 1; return; endif
b2 = iachar(utf8(i+1:i+1)); if (b2 < 0) b2 = b2 + 256
b3 = iachar(utf8(i+2:i+2)); if (b3 < 0) b3 = b3 + 256
if (iand(b2, 192) /= 128 .or. iand(b3, 192) /= 128) then; err = 2; return; endif
cp = iand(b1, 15)
cp = ishft(cp,6) + iand(b2,63)
cp = ishft(cp,6) + iand(b3,63)
nbytes = 3
case (240:247)
if (i+3 > len8) then; err = 1; return; endif
b2 = iachar(utf8(i+1:i+1)); if (b2 < 0) b2 = b2 + 256
b3 = iachar(utf8(i+2:i+2)); if (b3 < 0) b3 = b3 + 256
b4 = iachar(utf8(i+3:i+3)); if (b4 < 0) b4 = b4 + 256
if (iand(b2,192)/=128 .or. iand(b3,192)/=128 .or. iand(b4,192)/=128) then
err = 2; return
endif
cp = iand(b1, 7)
cp = ishft(cp,6) + iand(b2,63)
cp = ishft(cp,6) + iand(b3,63)
cp = ishft(cp,6) + iand(b4,63)
nbytes = 4
case default
err = 3
return
end select
if (n_out >= size(temp)) then
err = 4
return
endif
n_out = n_out + 1
temp(n_out) = cp
i = i + nbytes
enddo
allocate(out(n_out))
out = temp(1:n_out)
end subroutine utf8_to_codepoints
function utf8_to_ucs4_via_io(string) result(corrected)
character(len=*),intent(in) :: string
character(len=:,kind=ucs4),allocatable :: corrected
character(len=(4*len(string)),kind=ucs4) :: line
integer :: lun
open(newunit=lun,encoding='UTF-8',status='scratch')
write(lun,'(A)')string
rewind(lun)
read(lun,'(A)')line
close(lun)
corrected=trim(line)
end function utf8_to_ucs4_via_io
function utf8_to_ucs4(string) result(corrected)
! return a string of kind ucs4
character(len=*),intent(in) :: string
character(len=:,kind=ucs4),allocatable :: corrected
integer, allocatable :: codepoints(:)
integer :: i, n
integer :: err
call utf8_to_codepoints(string,codepoints,err)
! codepoints is unallocated if a decode error occurred, so bail out early
if (err /= 0) then
corrected = ucs4_''
return
endif
n=size(codepoints)
allocate(character(len=n,kind=ucs4) :: corrected)
do i=1,n
corrected(i:i)=char(codepoints(i),kind=ucs4)
enddo
end function utf8_to_ucs4
end module
program test_utf8
use iso_fortran_env, only : output_unit
use utf8_decoder, only : utf8_to_ucs4, utf8_to_codepoints
implicit none
character(len=100) :: s
integer, allocatable :: codepoints(:)
integer :: i, err
intrinsic selected_char_kind
integer, parameter :: ucs4 = selected_char_kind ('ISO_10646')
character(len=*),parameter :: not_ascii = "ľščťžýáߥʮ™√🙂"
character(len=:,kind=ucs4),allocatable :: corrected
character(len=1,kind=ucs4) :: letter
s = "Héllo 🌍" ! UTF-8 input string
call utf8_to_codepoints(s, codepoints, err)
if (err /= 0) then
print *, "Decode error:", err
else
print *, "Code points:"
do i = 1, size(codepoints)
write(*,'("U+",Z8.8)') codepoints(i)
enddo
endif
open (output_unit, encoding='UTF-8')
call utf8_to_codepoints(not_ascii,codepoints,err)
write(*,*)'SIZE:',size(codepoints), merge('PASSED','FAILED',size(codepoints)==14)
write(*,'(*(a))')char(codepoints,kind=ucs4)
write(*,'("UNICODE NOTATION:",T20,*("U+",z6.6:," "))')codepoints
write(*,'("HTML:",T20,*("&#x",z0,";":))')codepoints
write(*,'("C/PYTHON:",T20,*("\U",z8.8:))')codepoints
write(*,*)'properly converted to ucs4'
corrected=utf8_to_ucs4(not_ascii)
write(*,*)'CORRECTED:',corrected
write(*,*)'LEN CORRECTED:',len(corrected)
write(*,*)'STORAGE_SIZE:',storage_size(corrected)
write(*,*)'so now slicing, intrinsics, ... work!!'
letter=corrected(14:14)
write(*,*)'14th character',letter
write(*,*)'index position',index(corrected,letter)
write(*,*)'variable= &'
write(*,'((" char(int(z''",z0,"''))",:,", &"))')(corrected(i:i),i=1,len(corrected))
end program
```
Once you have the data encoded properly as 4-byte characters (i.e. as UCS-4), it is easy to write the strings out properly as HTML, slice out particular glyphs, or get the index where a character is found, without having to fiddle with "raw" utf-8 glyphs that can be anywhere from 1 to 4 bytes each. There are still lots of questions about how to sort unicode glyphs, which direction to print the strings, and so on, but that is just part of the nature of the many writing systems used around the world.
### Output
```text
Code points:
U+00000048
U+000000E9
U+0000006C
U+0000006C
U+0000006F
U+00000020
U+0001F30D
SIZE: 14 PASSED
ľščťžýáߥʮ™√🙂
UNICODE NOTATION: U+00013E U+000161 U+00010D U+000165 U+00017E U+0000FD U+0000E1 U+0000DF U+000104 U+000118 U+0000AE U+002122 U+00221A U+01F642
HTML: ľščťžýáߥʮ™√🙂
C/PYTHON: \U0000013E\U00000161\U0000010D\U00000165\U0000017E\U000000FD\U000000E1\U000000DF\U00000104\U00000118\U000000AE\U00002122\U0000221A\U0001F642
properly converted to ucs4
CORRECTED:ľščťžýáߥʮ™√🙂
LEN CORRECTED: 14
STORAGE_SIZE: 448
so now slicing, intrinsics, ... work!!
14th character🙂
index position 14
variable= &
char(int(z'13E')), &
char(int(z'161')), &
char(int(z'10D')), &
char(int(z'165')), &
char(int(z'17E')), &
char(int(z'FD')), &
char(int(z'E1')), &
char(int(z'DF')), &
char(int(z'104')), &
char(int(z'118')), &
char(int(z'AE')), &
char(int(z'2122')), &
char(int(z'221A')), &
char(int(z'1F642'))
```
That had been very common back in the day; e.g. Emacs recognizes file-local variable comments like

```c
/* -*-mode: C; coding: latin-2; -*- */
```
You haven’t tested the utf8_to_ucs4_via_io routine in the test program, so I did. It works fine, and shows an additional nice feature of encoding='utf-8': it works both with default-character and ucs-4 strings, apparently copying the former as-is and converting the latter. The modified function:
```fortran
function utf8_to_ucs4_via_io(string) result(corrected)
character(len=*),intent(in) :: string
character(len=:,kind=ucs4),allocatable :: corrected
character(len=len(string),kind=ucs4) :: line
integer :: lun
open(newunit=lun,file='utf8.out',encoding='UTF-8',status='replace')
write(lun,'(A)')string
rewind(lun)
read(lun,'(A)')line
close(lun)
corrected=trim(line)
open(newunit=lun,file='utf8r.out',encoding='UTF-8',status='replace')
write(lun,'(A)') corrected
close(lun)
end function utf8_to_ucs4_via_io
```
generates two identical UTF-8 files. I have also removed the quadrupling of the length of string in the declaration of line, as I presume kind=ucs4 already makes a single character of line 4 bytes long. I will later check how deep the conformance of this I/O conversion is with the Unicode Standard; in particular, how it processes ill-formed UTF-8/UTF-32 sequences/codepoints.
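As a quick probe of the pure-Fortran decoder's error codes, one can feed utf8_to_codepoints from the module above some deliberately ill-formed input; by my reading of its select case, these two cases should report err=1 and err=3:

```fortran
program probe_errors
use utf8_decoder, only : utf8_to_codepoints
implicit none
integer, allocatable :: cp(:)
integer :: err
! a two-byte lead byte (0xC4) with no continuation byte after it
call utf8_to_codepoints(char(int(z'C4')), cp, err)
print *, 'err =', err   ! expect 1: sequence runs past end of input
! a continuation byte (0x80) where a lead byte is required
call utf8_to_codepoints(char(int(z'80')), cp, err)
print *, 'err =', err   ! expect 3: invalid lead byte
end program probe_errors
```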
Windows users may find some additional pieces of info here. The discussion was mostly about dealing with special characters (French éèêë, among others) on Windows without relying on compiler extensions like ucs4 in gfortran.
We used to require any file format our production codes used, particularly binary files, to have a magic string at the beginning, along with a description of the file's format statements or structure, including endianness, what platform it was generated on, word sizes, mantissa, …
At a time when basically each platform had its own binary file format this was very useful. As the number of commonly encountered native binary formats consolidated, with utf-8 gradually being supported in text and HTML and the use of file suffixes becoming more standardized (creeping in from the MSWindows platforms), that practice faded into the background, but it is still used in a few places.
It recently came in really handy when someone needed to resurrect some old information and was confronted with old binary files they could not make any sense of. It turned out the files had the magic tag in them, and the old records indicated they were Cray COS binary files and defined their structure. With a little bit of bit fiddling they were totally usable after all.
The other lesson we learned long ago was to highly encourage that archived data be converted to text whenever reasonable.
But Fortran utf-8 source files are not standard-conforming, just close to it. The code itself, of course, has to be written in the Fortran character set. Comments work out to be usable as utf-8 without much risk, but utf-8 constant strings are totally undefined by the standard, yet potentially more and more desirable as utf-8 becomes the de-facto standard text file type in more and more cases.
Thanks for the feedback. That routine is the closest thing to a built-in Fortran method I could think of for converting the pseudo-utf-8 ASCII strings into Fortran unicode. Keeping the unicode constants out of the source file completely and keeping the strings in a data file is of course the truly standard-conforming method, but there are just too many times I want to have the strings in the source file not to look at how viable this might be. Ultimately it would be best to have utf-8 string constants defined in the standard, so playing around with it might also lead to some ideas on what the solutions might be.
I captured the code in a GitHub project with CI/CD tests and fpm and cmake builds, in case anyone is interested in the future.
It passes tests on gfortran and flang-new, but the output for ASCII bytes representing utf-8 written to stdout differs between them. The gfortran results are more consistent and I think preferable, but if you use the routines in M_utf8 and ucs-4 characters you get the expected characters. ifx, lfortran, and nvfortran do not appear to support ucs-4 yet. I would be interested in results from other compilers.
AMD flang does not support ucs-4.
The documentation indicates NAG does, with extensions to fully support utf-8 source files. Not sure about IBM XL, Cray, … but flang-new worked well enough to pass the simple tests (I only accessed it through the GitHub CI/CD interface). I still have questions about what is standard and what is not concerning ucs4 constants. For example, at least ASCII seems to be automatically converted to 4-byte characters with something like

```fortran
character(kind=ucs4,len=*),parameter :: str = ucs4_'abcde'
```

but utf-8 characters taking more than one byte do not work; and I am not sure what might happen with one-byte characters above 127, which are often used for latin1 and latin2 characters, and so on.
Is there any kind of automatic “promotion”, and are assignments between kinds allowed? It looks like some of this was allowed in Fortran 2003 but then retracted (?).
With the output file opened with encoding='utf-8', one compiler still prints ascii and utf-8 characters stored in ascii bytes as well as ucs4 characters, while another only gets the output right if the variables written are ucs4. I have not even tried stream I/O with utf-8 files, and I think it is an extension that some I/O statements can have character variables of different kinds on them, and so on.
So if you make all your variables ucs4 and read and write just to utf8-encoded files, it is very easy and all the intrinsics work (I had a few problems with some field descriptors like T, though), and that is useful for a lot of problems. It gets muddier when the input source contains utf-8 characters, when figuring out exactly how internal reads and writes should work, what an assignment should or should not do, and whether you can use the // operator when some values are ascii or default and others are ucs4, and so on; so it would be nice to try as many compilers as possible that support unicode, for example with the probe after this paragraph.
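A small probe one can feed to a compiler to see what kind mixing it accepts; my reading of F2018 10.2.1.2 is that intrinsic assignment may mix the default, ASCII, and ISO 10646 kinds, while // across kinds is not standard:

```fortran
program probe_kinds
implicit none
intrinsic selected_char_kind
integer, parameter :: ucs4 = selected_char_kind('ISO_10646')
character(len=5, kind=ucs4) :: u
character(len=5)            :: d = 'abcde'
u = d             ! default -> ISO 10646: allowed by my reading of the standard
d = u             ! ISO 10646 -> default: also allowed, but may lose data
u = ucs4_'abcde'  ! a ucs4 constant built from Fortran-character-set text
! u = ucs4_'ľ'    ! non-ASCII in a literal relies on source encoding: not portable
print *, d
end program probe_kinds
```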
Interestingly, if you convert the strings to code-point integers, as one of the example procedures does, you can get by pretty well even on compilers like ifx that do not support unicode, essentially working with the integers the way C historically did with ASCII. So at least for now I am sorting out whether there is a better way than “char(int(z'FFFF'),kind=ucs4)” on every character to reliably place unicode into source files; that works, but it is painfully verbose and hard to read in the code.
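One way to cut that verbosity is a tiny helper that builds a ucs4 string from an array of code points in one call; from_codepoints is my own name, not an intrinsic, and it belongs in a module so callers see its interface:

```fortran
! hedged sketch: one code point in, one ucs4 character out, per array element
pure function from_codepoints(cp) result(s)
   integer, parameter :: ucs4 = selected_char_kind('ISO_10646')
   integer, intent(in) :: cp(:)               ! Unicode code points
   character(len=size(cp), kind=ucs4) :: s
   integer :: i
   do i = 1, size(cp)
      s(i:i) = char(cp(i), kind=ucs4)
   enddo
end function from_codepoints
```

so that, say, “ni hao” becomes from_codepoints([int(z'4F60'), int(z'597D')]) instead of one char(int(z'...'),kind=ucs4) per glyph.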
Unicode was easy to avoid in most Fortran-centric environments for quite a while, but it hardly seems feasible now not to need to interface to utf-8 files and to produce utf-8 messages from codes. Does anyone know what timetables the compilers that do not yet support ucs4 are on to provide it? I mean, everything has emojis in it now.