Using Unicode Characters in Fortran

I found that there’s very little information around on how to use Unicode characters in Fortran, so I did a small writeup covering my experiences with the topic. The examples are available here if anyone wants to try them out: GitHub - plevold/unicode-in-fortran

Introduction

  • WARNING
    The following examples are based on my own experiences and testing. I’m neither a Unicode expert nor a compiler maintainer. If you find anything wrong with the examples please open an issue.

Using Unicode characters in your programs is not necessarily hard. There is, however, very little information about Fortran and Unicode available. This repository is a collection of examples and some explanations on how to use Unicode in Fortran.

Most of what is written here is based on recommendations from the UTF-8 Everywhere Manifesto. I would highly recommend that you read that as well to get a better understanding of what Unicode is and is not.

Compilers

The examples used here have been verified to work on the following compiler/OS combinations:

Compiler    Version     Operating System    Status
gfortran    9.3.0       Linux               ✅
gfortran    10.3.0      Windows 10          ✅
ifort       2021.5.0    Linux               ✅

Creating and Printing Unicode Strings

First, make sure that

  • Your terminal emulator is set to UTF-8.
  • Your source file encoding is set to UTF-8.

With the notable exception of Windows CMD and PowerShell, UTF-8 is probably the default encoding in your terminal. If you’re using Windows CMD or PowerShell you need to use a modern terminal emulator like Windows Terminal and follow the instructions here. If that’s too much hassle, consider switching to Git for Windows instead, which gives you a nice Bash terminal on Windows.

With that in place, insert Unicode characters directly into a string literal in your source code. If you’re using Visual Studio Code there’s an extension that can help you with inserting Unicode characters in your source files. Using escape sequences like \U0001F525 requires setting special compiler flags, and different compilers seem to handle this somewhat differently. Unless you know for sure that you want to stick with one compiler forever I would not recommend doing this.

If you’re storing it in a variable, use the default character kind or c_char from iso_c_binding. Do not try to use e.g. selected_char_kind('ISO_10646') to create “wide” (longer than one byte) character elements. For one thing, Intel Fortran does not support this as of this writing. Also, if you’re going to pass character arguments to procedures, you’ll either have to convert between the default and the ISO_10646 character kinds or have two versions of each procedure that might need to accept both wide and default character kinds. As we will see later, this is never really needed, so you would only create extra work for yourself.

Example:

program write_to_console
    implicit none
    character(len=:), allocatable :: chars

    chars = 'Fortran is 💪, 😎, 🔥!'
    write(*,*) chars
end program

This should output

❯ fpm run --example write_to_console
 Fortran is 💪, 😎, 🔥!

As we can see from the output of the example above, the emojis are printed just as we inserted them in the source file.

Determining the Length of a Unicode String

Some might be confused that

program unicode_len
    implicit none
    character(len=:), allocatable :: chars

    chars = 'Fortran is 💪, 😎, 🔥!'
    write(*,*) len(chars)
    if (len(chars) /= 28) error stop
end program

outputs

❯ fpm run --example unicode_len
          28

while if we manually count the characters we see in the string literal, we end up with 19. This is because in Unicode what we perceive as one character might be stored as multiple bytes. Such a user-perceived unit is referred to as a grapheme cluster and is crucial when rendering text. Determining the number of grapheme clusters and their width when rendered on the screen is a complex task which we will not go into here. For more information see the UTF-8 Everywhere Manifesto and It’s Not Wrong that “🤦🏼‍♂️”.length == 7.

We’re mainly concerned about storing the characters in memory though, as our terminal emulator or text editor takes care of displaying the results on our screen. For this it is useful to think of the character variable as a sequence of bytes rather than a sequence of what we perceive as one character. When len(chars) == 28 that means that we need 28 elements in our variable to store the string.
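
To make the “sequence of bytes” view concrete, here is a minimal sketch that prints each element of a character variable as a hexadecimal byte. The names hex_dump_mod, hex_bytes and hex_digit are made up for this illustration. It assumes that the source file is saved as UTF-8 and that one default character element is one byte, which is the case with gfortran and ifort.

```fortran
module hex_dump_mod
    implicit none
    private
    public :: hex_bytes   ! hypothetical helper, not part of any library
contains
    !> Return the elements of `chars` as space-separated two-digit hex values.
    function hex_bytes(chars) result(hex)
        character(len=*), intent(in) :: chars
        character(len=:), allocatable :: hex
        integer :: i, b

        hex = ''
        do i = 1, len(chars)
            ! iand(..., 255) keeps the value in 0..255 even if ichar were to
            ! return a negative number for bytes above 127.
            b = iand(ichar(chars(i:i)), 255)
            hex = hex // hex_digit(b / 16) // hex_digit(mod(b, 16))
            if (i < len(chars)) hex = hex // ' '
        end do
    end function

    !> Map 0..15 to the matching hexadecimal digit.
    pure function hex_digit(n) result(d)
        integer, intent(in) :: n
        character(len=1) :: d
        character(len=16), parameter :: digits = '0123456789ABCDEF'

        d = digits(n + 1:n + 1)
    end function
end module

program show_bytes
    use hex_dump_mod, only: hex_bytes
    implicit none
    character(len=:), allocatable :: chars

    chars = 'a😎'
    write(*, '(i0, a)') len(chars), ' elements: ' // hex_bytes(chars)
end program
```

With a UTF-8 source file this should print "5 elements: 61 F0 9F 98 8E": one byte for the letter and four bytes for the emoji.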

Searching for Substrings

Substrings can be searched for using the regular index intrinsic just like strings with just ASCII characters:

program unicode_index
    implicit none
    character(len=:), allocatable :: chars
    integer :: i

    chars = '📐: 4.0·tan⁻¹(1.0) = π'
    i = index(chars, 'n')
    write(*,*) i, chars(i:i)
    if (i /= 14) error stop
    i = index(chars, '¹')
    if (i /= 18) error stop
    write(*,*) i, chars(i:i + len('¹') - 1)
end program

outputs

❯ fpm run --example unicode_index
          14 n
          18 ¹

There is no need for any special handling thanks to the design of UTF-8:

Also, you can search for a non-ASCII, UTF-8 encoded substring in a UTF-8 string as if it was a plain byte array — there is no need to mind code point boundaries. This is thanks to another design feature of UTF-8 — a leading byte of an encoded code point can never hold value corresponding to one of trailing bytes of any other code point.
UTF-8 Everywhere Manifesto

Keep in mind though that what looks like a single character (a grapheme cluster) might be more than one byte long so chars(i:i) will not necessarily output the complete match.

Reading and Writing to File

Reading and writing Unicode characters from and to a file is as easy as writing ASCII text:

program file_io
    implicit none

    ! Write to file
    block
        character(len=:), allocatable :: chars
        integer :: unit

        chars = 'Fortran is 💪, 😎, 🔥!'
        open(newunit=unit, file='file.txt')
        write(unit, '(a)') chars
        write(*, '(a)') ' Wrote line to file: "' // chars // '"'
        close(unit)
    end block

    ! Read back from the file
    block
        character(len=100) :: chars
        integer :: unit

        open(newunit=unit, file='file.txt', action='read')
        read(unit, '(a)') chars
        write(*,'(a)') 'Read line from file: "' // trim(chars) // '"'
        close(unit)
        if (trim(chars) /= 'Fortran is 💪, 😎, 🔥!') error stop
    end block

end program

The open statement in Fortran allows one to specify encoding='UTF-8'. In testing with ifort and gfortran, however, this does not seem to have any impact on the file written. For example, specifying the encoding does not seem to add a Byte Order Mark (BOM) with either gfortran or ifort.
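
If you want to check this on your own compiler, one way is to reopen the file as a raw byte stream and look at the first three bytes. The following is only a sketch: the module name bom_mod, the function has_utf8_bom and the file name bom_test.txt are invented for this illustration, and it assumes one default character element is one byte.

```fortran
module bom_mod
    implicit none
    private
    public :: has_utf8_bom   ! hypothetical helper, not part of any library
contains
    !> Return .true. if the file starts with the UTF-8 BOM bytes EF BB BF.
    function has_utf8_bom(filename) result(found)
        character(len=*), intent(in) :: filename
        logical :: found
        character(len=3) :: head
        integer :: unit, ios

        found = .false.
        ! Stream access reads the raw bytes without any record handling.
        open(newunit=unit, file=filename, status='old', action='read', &
             access='stream', form='unformatted', iostat=ios)
        if (ios /= 0) return
        read(unit, iostat=ios) head
        close(unit)
        ! A file shorter than three bytes cannot start with a BOM.
        if (ios /= 0) return
        found = head == char(int(z'EF')) // char(int(z'BB')) // char(int(z'BF'))
    end function
end module

program check_bom
    use bom_mod, only: has_utf8_bom
    implicit none
    integer :: unit

    ! Write one line, explicitly asking for UTF-8 encoding.
    open(newunit=unit, file='bom_test.txt', status='replace', &
         action='write', encoding='UTF-8')
    write(unit, '(a)') 'Fortran is 🔥!'
    close(unit)

    if (has_utf8_bom('bom_test.txt')) then
        write(*, '(a)') 'File starts with a UTF-8 BOM'
    else
        write(*, '(a)') 'No BOM found'
    end if
end program
```

Based on the testing described above, I would expect this to report that no BOM was found with both gfortran and ifort, but other compilers might behave differently.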

Conclusion

We’ve seen that using Unicode characters in Fortran is actually not that hard! One needs to remember that what we perceive as a character is not necessarily a single element in our character variables. Apart from that, using Unicode characters in Fortran should really be quite straightforward.

Thanks for this write-up. The principles of UNICODE are indeed not that hard, but there are some very nasty areas, like surrogate pairs and characters to change the direction of reading, where it gets ugly. (I have read a paper about the latter where the authors demonstrated that you could use such direction changes to hide the actual source code). As for BOMs, that seems to be a typical Windows thing.

I’m not sure that some of this is good advice. It seems what you are doing here is akin to stuffing double precision reals into an array of single precision reals… It sort of works under some circumstances, but I wouldn’t recommend it. I think using the selected_char_kind('ISO_10646') is the correct way. See my JSON-Fortran library, which does support unicode. And yes, it isn’t currently supported by ifort (what gives, Intel?), and yes, you have to write multiple versions of routines (but that’s the same way you have to do for different real kinds, so it is not unexpected).

Consider this file (‘unicode.txt’):

😀😎😩

And the following code:

program test

use iso_fortran_env

implicit none

integer,parameter :: CK = selected_char_kind('ISO_10646')

character(kind=CK,len=3) :: s
integer :: iunit

open(output_unit,encoding='utf-8')

open(newunit=iunit,file='unicode.txt',status='OLD',encoding='UTF-8')

read(iunit,'(A)') s

write(output_unit,*) s
write(output_unit,*) 'len(s) = ', len(s)
write(output_unit,*) 's(1:1) = ', s(1:1)

end program test

This prints:

😀😎😩
 len(s) =            3
 s(1:1) = 😀

So, notice how the length is 3 and the slicing works correctly.

But, I don’t think Fortran actually supports unicode in source files. For example, when I try to do this:

s = CK_'😀😎😩'

I get the warning “CHARACTER expression will be truncated in assignment (3/12) at (1) [-Wcharacter-truncation]” and s(1:1) will print as gibberish.

Attention @greenrongreen - please see above. Any particular hurdle that prevents IFORT from supporting the wider character set? The standard acknowledged ISO 10646 back with the Fortran 2003 revision and enabled a mechanism to support it: IFORT has long “claimed” Fortran 2003 compliance and this is a common enough need among the users that its absence is felt with IFORT.

Thanks for the feedback @jacobwilliams. If you’re interested in this topic I’d highly recommend reading the UTF-8 Everywhere Manifesto, which covers this in much more detail than I did. Here’s a relevant quote from their conclusion:

In particular, we believe that adding wchar_t to the C++ standard was a mistake, and so are the Unicode additions to C++11. What must be demanded from the implementations though, is that the basic execution character set would be capable of storing any Unicode data. Then, every std::string or char* parameter would be Unicode-compatible. ‘If this accepts text, it should be Unicode compatible’—and with UTF-8, it is easy to achieve.

selected_char_kind('ISO_10646') in Fortran would be similar to wchar_t in C++.

Only speculation from my side, but perhaps Intel has acknowledged the issue with wide character types and because of this hasn’t bothered implementing support for it?

Hello,

This issue was raised also some time ago in c.l.f (with some current responders).
As wisely mentioned e.g. by jacobwilliams, the length of a UTF-8 string is not
inherently supported. See e.g. the user-defined ‘ulen’ function in my gist:
https://gist.github.com/drikosev/d35956f266ff7af49074e7e669cd34df

Another issue for Fortraners is that the 2003 standard acknowledged ISO 10646.
Also one of the responders had informed me that my example wasn’t very portable.

Not sure how difficult it would be for Fortran implementers to transparently support
a UTF-8 aware LEN intrinsic, which would simplify the current situation a lot.

Ev. Drikos

I’m not quite sure I understood what your ulen function is trying to achieve. Do you want to count the number of bytes in the character sequence, the number of grapheme clusters or the width of the text displayed on screen?

The number of bytes can be computed easily with len(chars) (multiplied by a constant if using non-default character kinds).

Counting the number of grapheme clusters is not that straightforward, but can be done if you find the right algorithm and port it to Fortran or make a C interface. I think for example the Rust crate unicode-segmentation will do that for you.

Determining the width of a string boils down to determining the width of each grapheme cluster. Even for monospaced fonts this is a non-trivial task. Take for example the following string:

|😎|⋮|

The characters are (here is a nice tool to determine that):

U+007C : VERTICAL LINE {vertical bar, pipe}
U+1F60E : SMILING FACE WITH SUNGLASSES
U+007C : VERTICAL LINE {vertical bar, pipe}
U+22EE : VERTICAL ELLIPSIS
U+007C : VERTICAL LINE {vertical bar, pipe}

If we try to align this with punctuation marks

|😎|
|...|
|⋮|
|.|

we see that even for a monospaced font

  • The SMILING FACE WITH SUNGLASSES emoji is slightly shorter than three punctuation marks
  • The VERTICAL ELLIPSIS is slightly shorter than one punctuation mark

This is further complicated by the fact that if the monospace font in use does not have a character, the application or (most likely) the OS will fall back to another font. Because of this it might even be that you’re seeing a different width for the characters above than I am!

It counts the number of characters represented, which is what the LEN intrinsic does.

Ev. Drikos

Ok, so I assume you’re trying to count the number of grapheme clusters then.

I tried to extract the ulen-function and apply it to a Fortran character string, but I think I’m doing something wrong. At least I’m not able to produce any meaningful output with it. Any thoughts?

module ulen_mod
    implicit none

    public ulen

contains

    integer function ulen(chars)
        character(len=*), intent(in) :: chars

        integer :: i

        ulen = 0
        do i = 1, len(chars)
            ulen = ulen + ulen_single(chars(i:i))
        end do
    end function


    function ulen_single(ch) result(ulen)
        use iso_c_binding, only: c_char
        implicit none

        character(kind=c_char), intent(in) ::ch
        integer :: ulen, ich

        ulen=0
        ich = ichar(ch)

        if ( ich < int(Z'80') ) THEN
            ulen=1
        else if ( (ich > ( int(Z'C0') + 1)) .and. ( ich < int(Z'E0') )) THEN
            ulen=2
        else if ( ich < int(Z'F0') ) THEN
            ulen=3
        else if ( ich <= int(Z'F4') ) THEN
            ulen=4
        else
            ulen=1   !assume we process larger sequences, 1 byte at a time
        end if


    end function
end module

program main
    use ulen_mod, only: ulen

    write(*,*) ulen('abc'), len('abc') ! 3 grapheme clusters
    write(*,*) ulen('😎'), len('😎') ! 1 grapheme cluster
    write(*,*) ulen('a̐éö̲'), len('a̐éö̲') ! 3 grapheme clusters
end program

Neither by ifx 2022.0.0 🙁

Admittedly, my reply didn’t match the documentation of the ‘ulen’ function, which counts the number of bytes. So, in your program the ‘ulen’ function could be something like this:

integer function ulen(chars)
    character(len=*), intent(in) :: chars
    integer :: i, j, bytes

    ulen = 0
    bytes= len(chars)
    if ( bytes == 0 ) then
        ulen = 0
        return
    end if
    i = 1
    do while ( i <= bytes )
        j = ulen_single(chars(i:i))
        i = i + j
        ulen = ulen + 1
    end do
end function 

The results are displayed below

       3           3
       1           4
       5           9

The following Bash script, for example, prints the same result for the last string in your program:

#!/bin/bash
string_variable_name='a̐éö̲'
charlen=${#string_variable_name}
echo $charlen

Thanks! That makes more sense than my naive attempt 🙂

With the updated example:

module ulen_mod
    implicit none

    public ulen

contains

    integer function ulen(chars)
        character(len=*), intent(in) :: chars
        integer :: i, j, bytes

        ulen = 0
        bytes= len(chars)
        if ( bytes == 0 ) then
            ulen = 0
            return
        end if
        i = 1
        do, while ( i <= bytes )
            j = ulen_single(chars(i:i))
            i = i + j
            ulen = ulen + 1
        end do
    end function


    function ulen_single(ch) result(ulen)
        use iso_c_binding, only: c_char
        implicit none

        character(kind=c_char), intent(in) ::ch
        integer :: ulen, ich

        ulen=0
        ich = ichar(ch)

        if ( ich < int(Z'80') ) THEN
            ulen=1
        else if ( (ich > ( int(Z'C0') + 1)) .and. ( ich < int(Z'E0') )) THEN
            ulen=2
        else if ( ich < int(Z'F0') ) THEN
            ulen=3
        else if ( ich <= int(Z'F4') ) THEN
            ulen=4
        else
            ulen=1   !assume we process larger sequences, 1 byte at a time
        end if


    end function
end module

program main
    use ulen_mod, only: ulen

    write(*,*) ulen('abc'), len('abc') ! 3 grapheme clusters
    write(*,*) ulen('😎'), len('😎') ! 1 grapheme cluster
    write(*,*) ulen('a̐éö̲'), len('a̐éö̲') ! 3 grapheme clusters
end program

I get:

           3           3
           1           4
           7          11

So for the last string we get different results, which is odd. The answer is wrong in both cases though, as there are 3 and not 5 or 7 grapheme clusters. The example was taken from the unicode-segmentation docs, which correctly split it into 3 parts (admittedly I haven’t verified this myself).

I wonder if there’s ever any need for doing these calculations in Fortran though?

What I think would be interesting in some cases is to compute the width of a string so that output can be aligned when using a monospaced font e.g. in a terminal. As I previously mentioned this is a very challenging task so I don’t know how feasible it is.

Thanks, I switched from Firefox to Safari and now I see ‘7 11’ in the last line, also ‘7’ by the Bash script.
Note that I never spoke about ‘grapheme clusters’, only you did, which of course may be what a user has in mind. The sample code in my gists doesn’t go that far. It counts only the valid UTF-8 sequences that Unicode characters consist of, and in fact I’ve restricted it to at most 4 bytes (just skipped longer ones).

Ev. Drikos
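
For reference, counting code points (as opposed to bytes or grapheme clusters) can also be sketched without looking at the lead byte ranges: in valid UTF-8 every continuation byte has the bit pattern 10xxxxxx, so counting all other bytes gives the number of code points. This is a different technique from the ulen function in the gist, and the names utf8_count_mod and count_code_points are made up for this illustration. It assumes valid UTF-8 input and one byte per default character element.

```fortran
module utf8_count_mod
    implicit none
    private
    public :: count_code_points   ! hypothetical helper, not part of any library
contains
    !> Count UTF-8 code points by counting every byte that is not a
    !> continuation byte. Does not validate the input.
    pure function count_code_points(chars) result(n)
        character(len=*), intent(in) :: chars
        integer :: n, i

        n = 0
        do i = 1, len(chars)
            ! 192 = 0xC0 masks the two top bits, 128 = 0x80 is the pattern 10xxxxxx.
            if (iand(ichar(chars(i:i)), 192) /= 128) n = n + 1
        end do
    end function
end module

program count_demo
    use utf8_count_mod, only: count_code_points
    implicit none
    ! U+00E9 (precomposed e with acute) is encoded in UTF-8 as C3 A9.
    character(len=*), parameter :: precomposed = &
        char(int(z'C3')) // char(int(z'A9'))
    ! 'e' followed by U+0301 (combining acute accent), encoded as CC 81.
    character(len=*), parameter :: decomposed = &
        'e' // char(int(z'CC')) // char(int(z'81'))

    write(*,*) count_code_points('abc'), len('abc')
    write(*,*) count_code_points(precomposed), len(precomposed)
    write(*,*) count_code_points(decomposed), len(decomposed)
end program
```

The two spellings of the same accented letter give 1 code point in 2 bytes and 2 code points in 3 bytes respectively, so the code point count of a string depends on how it was normalized before it reached the program.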

The problem is that, at least as far as I understand, a “character” is a very vague term and not precisely defined in Unicode. I highly recommend the Characters section of the UTF-8 Everywhere Manifesto. Some relevant quotes:

(…)

  • User-perceived character — Whatever the end user thinks of as a character. This notion is language dependent. For instance, ‘ch’ is two letters in English and Latin, but considered to be one letter in Czech and Slovak.
  • Grapheme cluster — A sequence of coded characters that ‘should be kept together’.[§2.11] Grapheme clusters approximate the notion of user-perceived characters in a language independent way. They are used for, e.g., cursor movement and selection.
    (…)

‘Character’ may refer to any of the above. The Unicode Standard uses it as a synonym for coded character.[§3.4] When a programming language or a library documentation says ‘character’, it typically means a code unit. When an end user is asked about the number of characters in a string, he will count the user-perceived characters. A programmer might count characters as code units, code points, or grapheme clusters, according to the level of the programmer’s Unicode expertise. For example, this is how Twitter counts characters. In our opinion, a string length function should not necessarily return one for the string ‘🐨’ to be considered Unicode-compliant.

(Emphasis on the last sentence added by me)

Just for the record, in Java with the JDK 1.8 I see 7 characters. Also, with this Fortran code (gfortran), I see 7 characters as demonstrated below. But I guess it could also consist of 5 characters only if I’d used the precomposed ‘é’ and ‘ö’.

program test
    use iso_fortran_env
    implicit none
    
    integer,parameter :: CK = selected_char_kind('ISO_10646')

    !see also https://en.wikipedia.org/wiki/Combining_Diacritical_Marks
    !see also https://www.compart.com/en/unicode/U+006F
    character(kind=CK,len=7) :: s = CHAR(Z'0061', KIND=CK) //& !a
                                  & CHAR(Z'0310', KIND=CK) //& !◌̐ -> a̐
                                ! & CHAR(Z'00E9', KIND=CK) //& !é (precomposed)
                                  & CHAR(Z'0065', KIND=CK) //& !e (decomposed)
                                  & CHAR(Z'0301', KIND=CK) //& !◌́
                                ! & CHAR(Z'00F6', KIND=CK) //& !ö  (precomposed)
                                  & CHAR(Z'006F', KIND=CK) //& !o (decomposed)
                                  & CHAR(Z'0308', KIND=CK) //& !Combining Diaeresis
                                  & CHAR(Z'0332', KIND=CK)     !Combining Low Line
    
    open(output_unit,encoding='utf-8')
    
    write(output_unit,*) s
    write(output_unit,*) 'len(s) = ', len(s)
    write(output_unit,*) 's(1:1) = ', s(1:1)
    
end program test

I get the results

 a̐éö̲
 len(s) =            7
 s(1:1) = a

I personally think the Fortran standard has taken the right step with ISO 10646, i.e. UCS, which is rather close to Unicode but, if I recall correctly, not strictly the same. UCS is a better place in terms of character sets and it can use any of the popular encodings, UTF-8 if a Fortran processor so chooses.

With respect to Fortran or any standards-based language for that matter with multiple possible processor implementations, the details will always be in the processor-dependent category. It’s up to the processors to converge to a good place, UTF-8 might just be it. It will be highly beneficial if IFORT steps up with character sets beyond its default.

UCS is just the standardization of the code points of Unicode and their names. In addition to specifying the code points and their names, Unicode also maintains a large database of the properties of the code points, i.e., are they letters, symbols, numbers,…, if letters are they upper, lower, title case, or uncased, if they are numbers what are their values, if they represent composites of other code points what are the other code points, …

No - just perceived lack of demand, and resource constraints. gfortran developers (the few that there are) tend to chase the “shiny” things even when there are large, known gaps in support for the standard (I don’t think gfortran really does all of F2003 yet.) I expect Intel will get to alternate character sets eventually. It does have a nice library for dealing with multinational character sets.

I don’t think I agree with you on this. As long as the processor interprets the contents of a string literal as a sequence of bytes (which it indeed is) and doesn’t try to do any conversion, then I think one can use whatever encoding one wishes.

Where one will be processor dependent is if one wishes to use escape sequences to insert code points (e.g. \u0041 instead of A), but this does not seem to be covered by the Fortran standard?

Encoding matters at the application boundaries, e.g. when outputting text to a terminal or a file. Here, Unicode, and particularly UTF-8, seems to be very well supported.

A code point encoded in UTF-8 takes between 1 and 4 bytes. As such, UTF-8 fits very well into the default character kind where one element is one byte. The notion that one element corresponds to one user-perceived character or grapheme cluster is false when using Unicode, regardless of the size of the element. This is why I prefer the default character kind over selected_char_kind('ISO_10646'), whose elements, at least when using gfortran, are 4 bytes each:

    integer, parameter :: ucs2 = selected_char_kind('ISO_10646')
    write(*,*) sizeof('a'), sizeof(ucs2_'a')
                    1                    4
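
As far as I know sizeof is a compiler extension rather than part of the Fortran standard. The same comparison can be sketched with the standard storage_size intrinsic (Fortran 2008), which returns the size in bits. The program name and the constant name ucs4 are only for illustration, and the sketch assumes a compiler that supports the ISO_10646 kind, such as gfortran. It will not compile where selected_char_kind returns -1.

```fortran
program char_kind_sizes
    implicit none
    ! Assumption: the compiler supports the ISO_10646 character kind.
    integer, parameter :: ucs4 = selected_char_kind('ISO_10646')

    ! storage_size returns bits, so divide by 8 to get bytes per element.
    write(*,*) storage_size('a') / 8, storage_size(ucs4_'a') / 8
end program
```

With gfortran this should report 1 byte per element for the default kind and 4 bytes per element for the ISO_10646 kind.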