I found that there’s very little information around on how to use Unicode characters in Fortran, so I did a small writeup covering my experiences with the topic. The examples are available here if anyone wants to try them out: GitHub - plevold/unicode-in-fortran
The following examples are based on my own experiences and testing. I’m neither a Unicode expert nor a compiler maintainer. If you find anything wrong with the examples, please open an issue.
Using Unicode characters in your programs is not necessarily hard. There is, however, very little information about Fortran and Unicode available. This repository is a collection of examples and some explanations on how to use Unicode in Fortran.
Most of what is written here is based on recommendations from the UTF-8 Everywhere Manifesto. I would highly recommend that you read that as well to get a better understanding of what Unicode is and is not.
The examples used here have been verified to work on the following compiler/OS combinations:
First, make sure that
- Your terminal emulator is set to UTF-8.
- Your source file encoding is set to UTF-8.
With the notable exception of Windows CMD and PowerShell, UTF-8 is probably the default encoding in your terminal. If you’re using Windows CMD or PowerShell, you need to use a modern terminal emulator like Windows Terminal and follow the instructions here. If that’s too much hassle, you can consider switching to Git for Windows instead, which will give you a nice Bash terminal on Windows.
With that in place, insert Unicode characters directly into a string literal in your source code. If you’re using Visual Studio Code, there’s an extension that can help you with inserting Unicode characters in your source files. Using escape sequences like `\U0001F525` requires setting special compiler flags (for example `-fbackslash` in gfortran), and different compilers seem to handle this somewhat differently. Unless you know for sure that you want to stick with one compiler forever, I would not recommend doing this.
If you’re storing it in a variable, use the default character kind or the `c_char` kind from `iso_c_binding`. Do not try to use e.g. `selected_char_kind('ISO_10646')` to create “wide” (longer than one byte) character elements. For one thing, as of this writing, Intel Fortran does not support this. Also, if you’re going to pass character arguments to procedures, you’ll either have to convert between the default and the `ISO_10646` character kinds, or you’ll need two versions of each procedure that might have to accept both wide and default character kinds. As we will later see, this is never really needed, so you will only create extra work for yourself.
```fortran
program write_to_console
    implicit none
    character(len=:), allocatable :: chars

    chars = 'Fortran is 💪, 😎, 🔥!'
    write(*,*) chars
end program
```
This should output
```
❯ fpm run --example write_to_console
Fortran is 💪, 😎, 🔥!
```
As we can see from the output of the example above, the emojis are printed just as we inserted them in the source file.
Some might be confused that
```fortran
program unicode_len
    implicit none
    character(len=:), allocatable :: chars

    chars = 'Fortran is 💪, 😎, 🔥!'
    write(*,*) len(chars)
    if (len(chars) /= 28) error stop
end program
```
outputs
```
❯ fpm run --example unicode_len
28
```
while if we manually count the characters we see in the string literal, we end up with 19. This is because in UTF-8 what we perceive as one character might be stored as multiple bytes: each of the three emojis above takes four bytes, so the 16 ASCII characters plus 3 × 4 bytes give 28. What we perceive as one character is referred to as a grapheme cluster and is crucial when rendering text. Determining the number of grapheme clusters and their width when rendered on the screen is a complex task which we will not go into here. For more information see the UTF-8 Everywhere Manifesto and It’s Not Wrong that “🤦🏼‍♂️”.length == 7.
We’re mainly concerned about storing the characters in memory though, as our terminal emulator or text editor takes care of displaying the results on our screen. For this it is useful to think of the character variable as a sequence of bytes rather than a sequence of what we perceive as one character. When `len(chars) == 28`, that means that we need 28 elements in our variable to store the string.
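To make the difference between bytes and visible characters concrete, here is a minimal sketch that counts UTF-8 code points by skipping continuation bytes (bit pattern `10xxxxxx`). The helper name `utf8_code_points` is made up for this illustration and is not part of the repository. It assumes the string holds valid UTF-8, and it counts code points, not grapheme clusters, so it only matches what we see on screen when every visible character is a single code point, as in this string.

```fortran
program count_code_points
    implicit none
    character(len=:), allocatable :: chars

    chars = 'Fortran is 💪, 😎, 🔥!'
    write(*,*) 'bytes:      ', len(chars)
    write(*,*) 'code points:', utf8_code_points(chars)

    ! 16 ASCII characters plus three emojis of one code point each
    if (utf8_code_points(chars) /= 19) error stop

contains

    ! Hypothetical helper: counts every byte that is NOT a UTF-8
    ! continuation byte. Continuation bytes have the two top bits
    ! set to 10, i.e. iand(byte, 192) == 128.
    pure integer function utf8_code_points(str) result(n)
        character(len=*), intent(in) :: str
        integer :: i

        n = 0
        do i = 1, len(str)
            if (iand(ichar(str(i:i)), 192) /= 128) n = n + 1
        end do
    end function

end program
```

For anything that needs real grapheme cluster counts or display widths, a dedicated Unicode library is still the safer choice.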
Substrings can be searched for using the regular `index` intrinsic, just like strings with just ASCII characters:
```fortran
program unicode_index
    implicit none
    character(len=:), allocatable :: chars
    integer :: i

    chars = '📐: 4.0·tan⁻¹(1.0) = π'
    i = index(chars, 'n')
    write(*,*) i, chars(i:i)
    if (i /= 14) error stop
    i = index(chars, '¹')
    if (i /= 18) error stop
    write(*,*) i, chars(i:i + len('¹') - 1)
end program
```
```
❯ fpm run --example unicode_index
14 n
18 ¹
```
There is no need for any special handling thanks to the design of Unicode:
Also, you can search for a non-ASCII, UTF-8 encoded substring in a UTF-8 string as if it was a plain byte array — there is no need to mind code point boundaries. This is thanks to another design feature of UTF-8 — a leading byte of an encoded code point can never hold value corresponding to one of trailing bytes of any other code point.
— UTF-8 Everywhere Manifesto
Keep in mind though that what looks like a single character (a grapheme cluster) might be more than one byte long, so `chars(i:i)` will not necessarily output the complete match.
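When the length of the match is not known up front, one option is to read it from the leading byte of the UTF-8 sequence. The sketch below does that with a made-up helper, `utf8_seq_len`, which is not part of the repository. It assumes valid UTF-8 and that the index points at the start of a sequence. It also returns only one code point, which may still be less than a full grapheme cluster.

```fortran
program unicode_match
    implicit none
    character(len=:), allocatable :: chars
    integer :: i, n

    chars = '📐: 4.0·tan⁻¹(1.0) = π'
    i = index(chars, '⁻')
    n = utf8_seq_len(chars(i:i))
    write(*,*) i, n, chars(i:i + n - 1)

    ! '⁻' starts right after the 'n' at byte 14 and is three bytes long
    if (i /= 15 .or. n /= 3) error stop
    if (chars(i:i + n - 1) /= '⁻') error stop

contains

    ! Hypothetical helper: number of bytes in the UTF-8 sequence that
    ! starts with the given leading byte.
    pure integer function utf8_seq_len(lead) result(n)
        character(len=1), intent(in) :: lead
        integer :: byte

        byte = iand(ichar(lead), 255)
        if (byte < 128) then
            n = 1   ! 0xxxxxxx: plain ASCII
        else if (byte >= 240) then
            n = 4   ! 11110xxx
        else if (byte >= 224) then
            n = 3   ! 1110xxxx
        else if (byte >= 192) then
            n = 2   ! 110xxxxx
        else
            n = 1   ! 10xxxxxx: a continuation byte, not a valid start
        end if
    end function

end program
```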
Reading and writing Unicode characters from and to a file is as easy as writing ASCII text:
```fortran
program file_io
    implicit none

    ! Write to file
    block
        character(len=:), allocatable :: chars
        integer :: unit

        chars = 'Fortran is 💪, 😎, 🔥!'
        open(newunit=unit, file='file.txt')
        write(unit, '(a)') chars
        write(*, '(a)') ' Wrote line to file: "' // chars // '"'
        close(unit)
    end block

    ! Read back from the file
    block
        character(len=100) :: chars
        integer :: unit

        open(newunit=unit, file='file.txt', action='read')
        read(unit, '(a)') chars
        write(*,'(a)') 'Read line from file: "' // trim(chars) // '"'
        close(unit)
        if (trim(chars) /= 'Fortran is 💪, 😎, 🔥!') error stop
    end block
end program
```
The `open` statement in Fortran allows one to specify `encoding='UTF-8'`. In testing with `gfortran`, however, this does not seem to have any impact on the file written. Specifying `encoding` does not, for example, seem to add a Byte Order Mark (BOM).
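If you want to check the BOM behaviour with your own compiler, a rough sketch like the following could be used. It writes a line with `encoding='UTF-8'`, then reopens the file as a byte stream and looks at the first three bytes. The file name `bom_check.txt` is arbitrary, and the result is deliberately only printed, since it may differ between compilers.

```fortran
program bom_check
    implicit none
    character(len=3) :: head
    integer :: unit
    logical :: has_bom

    ! Write one line with the encoding specifier set
    open(newunit=unit, file='bom_check.txt', encoding='UTF-8', &
         action='write', status='replace')
    write(unit, '(a)') 'Fortran is 🔥!'
    close(unit)

    ! Read the first three raw bytes back as an unformatted stream
    open(newunit=unit, file='bom_check.txt', access='stream', &
         form='unformatted', action='read', status='old')
    read(unit) head
    close(unit)

    ! A UTF-8 BOM is the byte sequence EF BB BF (239, 187, 191)
    has_bom = all(iand(ichar([head(1:1), head(2:2), head(3:3)]), 255) &
                  == [239, 187, 191])

    if (has_bom) then
        write(*, '(a)') 'File starts with a UTF-8 BOM'
    else
        write(*, '(a)') 'No BOM, file starts with: "' // head // '"'
        ! Without a BOM the file must start with the text itself
        if (head /= 'For') error stop
    end if
end program
```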
We’ve seen that using Unicode characters in Fortran is actually not that hard! One needs to remember that what we perceive as a character is not necessarily a single element in our character variables. Apart from that, using Unicode characters in Fortran should really be quite straightforward.