I found that there’s very little information available on how to use Unicode characters in Fortran, so I did a small write-up covering my experiences with the topic. The examples are available here if anyone wants to try them out: GitHub - plevold/unicode-in-fortran
Introduction
- WARNING
The following examples are based on my own experiences and testing. I’m neither a Unicode expert nor a compiler maintainer. If you find anything wrong with the examples please open an issue.
Using Unicode characters in your programs is not necessarily hard. There is however very little information about Fortran and Unicode available. This repository is a collection of examples and some explanations on how to use Unicode in Fortran.
Most of what is written here is based on recommendations from the UTF-8 Everywhere Manifesto. I would highly recommend that you read that as well to get a better understanding of what Unicode is and is not.
Compilers
The examples used here have been verified to work on the following compiler/OS combinations:
Compiler | Version | Operating System | Status |
gfortran | 9.3.0 | Linux | |
gfortran | 10.3.0 | Windows 10 | |
ifort | 2021.5.0 | Linux | |
Creating and Printing Unicode Strings
First, make sure that
- Your terminal emulator is set to UTF-8.
- Your source file encoding is set to UTF-8.
With the notable exception of Windows CMD and PowerShell, UTF-8 is probably the default encoding in your terminal. If you’re using Windows CMD or PowerShell you need to use a modern terminal emulator like Windows Terminal and follow the instructions here. If that’s too much hassle you can consider switching to Git for Windows instead, which will give you a nice Bash terminal on Windows.
With that in place, insert Unicode characters directly into a string literal in your source code. If you’re using Visual Studio Code there’s an extension that can help you with inserting Unicode characters in your source files. Using escape sequences like \u1F525 requires setting special compiler flags, and different compilers seem to handle this somewhat differently. Unless you know for sure that you want to stick with one compiler forever I would not recommend doing this.
If you’re storing it in a variable, use the default character kind or c_char from iso_c_binding. Do not try to use e.g. selected_char_kind('ISO_10646') to create “wide” (longer than one byte) character elements. For one thing, Intel Fortran does not support this as of this writing. Also, if you’re going to pass character arguments to procedures you’ll either have to convert between the default and the ISO_10646 character kinds, or you’ll need two versions of each procedure that might have to accept both wide and default character kinds. As we will see later, this is never really needed, so you will only create extra work for yourself.
Example:
program write_to_console
    implicit none
    character(len=:), allocatable :: chars
    chars = 'Fortran is 💪, 😎, 🔥!'
    write(*,*) chars
end program
This should output
❯ fpm run --example write_to_console
Fortran is 💪, 😎, 🔥!
As we can see from the output of the example above, the emojis are printed just as we inserted them in the source file.
Determining the Length of a Unicode String
Some might be surprised that
program unicode_len
    implicit none
    character(len=:), allocatable :: chars
    chars = 'Fortran is 💪, 😎, 🔥!'
    write(*,*) len(chars)
    if (len(chars) /= 28) error stop
end program
outputs
❯ fpm run --example unicode_len
28
while if we manually count the characters we see in the string literal we end up with 19. This is because in Unicode what we perceive as one character might consist of multiple bytes. Such a perceived character is referred to as a grapheme cluster and is crucial when rendering text. Determining the number of grapheme clusters and their width when rendered on the screen is a complex task which we will not go into here. For more information see the UTF-8 Everywhere Manifesto and It’s Not Wrong that "🤦🏼‍♂️".length == 7.
We’re mainly concerned with storing the characters in memory though, as our terminal emulator or text editor takes care of displaying the results on our screen. For this it is useful to think of the character variable as a sequence of bytes rather than a sequence of what we perceive as characters. When len(chars) == 28, that means that we need 28 elements in our variable to store the string.
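To make the byte view a bit more concrete, here is a minimal sketch (not part of the original examples) that walks over the elements of the string above and counts how many of them start a new code point. It assumes the string is valid UTF-8 and that one default character element is one byte. The program name is made up for illustration. Note that this counts code points, not grapheme clusters; for this particular string the two happen to coincide.

```fortran
program count_code_points
    use iso_fortran_env, only: int8
    implicit none
    character(len=:), allocatable :: chars
    integer :: i, n

    chars = 'Fortran is 💪, 😎, 🔥!'
    n = 0
    do i = 1, len(chars)
        ! UTF-8 continuation bytes have the bit pattern 10xxxxxx. Viewed as a
        ! signed 8-bit integer that is the range -128..-65, so every byte
        ! with a value >= -64 starts a new code point.
        if (transfer(chars(i:i), 0_int8) >= -64_int8) n = n + 1
    end do
    ! Number of bytes followed by number of code points
    write(*,*) len(chars), n
end program
```

Since each of the three emojis here is a single four-byte code point, the 28 bytes should correspond to 19 code points, which matches the manual count.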
Searching for Substrings
Substrings can be searched for using the regular index intrinsic, just like in strings with only ASCII characters:
program unicode_index
    implicit none
    character(len=:), allocatable :: chars
    integer :: i
    chars = '📐: 4.0·tan⁻¹(1.0) = π'
    i = index(chars, 'n')
    write(*,*) i, chars(i:i)
    if (i /= 14) error stop
    i = index(chars, '¹')
    if (i /= 18) error stop
    write(*,*) i, chars(i:i + len('¹') - 1)
end program
outputs
❯ fpm run --example unicode_index
14 n
18 ¹
There is no need for any special handling thanks to the design of UTF-8:
Also, you can search for a non-ASCII, UTF-8 encoded substring in a UTF-8 string as if it was a plain byte array — there is no need to mind code point boundaries. This is thanks to another design feature of UTF-8 — a leading byte of an encoded code point can never hold value corresponding to one of trailing bytes of any other code point.
— UTF-8 Everywhere Manifesto
Keep in mind though that what looks like a single character (a grapheme cluster) might be more than one byte long, so chars(i:i) will not necessarily output the complete match.
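If you only know the position of a match and not the substring you searched for, the lead byte of a UTF-8 sequence tells you how many bytes the code point occupies. The following is a rough sketch of that idea, not something taken from the original examples: utf8_width is a hypothetical helper, and it assumes valid UTF-8 and that the index points at a lead byte rather than a continuation byte. It also only covers a single code point, so a grapheme cluster made up of several code points would still be cut short.

```fortran
program code_point_at_index
    use iso_fortran_env, only: int8
    implicit none
    character(len=:), allocatable :: chars
    integer :: i, n

    chars = '📐: 4.0·tan⁻¹(1.0) = π'
    i = index(chars, 'π')
    n = utf8_width(chars(i:i))
    ! Print the position, the width in bytes and the complete code point
    write(*,*) i, n, chars(i:i + n - 1)
contains
    ! Hypothetical helper: number of bytes in the UTF-8 sequence that starts
    ! with the given lead byte. The byte is inspected as a signed 8-bit integer.
    pure integer function utf8_width(lead) result(width)
        character(len=1), intent(in) :: lead
        integer(int8) :: b
        b = transfer(lead, 0_int8)
        if (b >= 0_int8) then
            width = 1   ! 0xxxxxxx: ASCII
        else if (b < -32_int8) then
            width = 2   ! 110xxxxx
        else if (b < -16_int8) then
            width = 3   ! 1110xxxx
        else
            width = 4   ! 11110xxx
        end if
    end function
end program
```

Here π is encoded as two bytes, so the slice chars(i:i + n - 1) covers both of them where chars(i:i) would only give the first.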
Reading and Writing to File
Reading and writing Unicode characters from and to a file is as easy as writing ASCII text:
program file_io
    implicit none
    ! Write to file
    block
        character(len=:), allocatable :: chars
        integer :: unit
        chars = 'Fortran is 💪, 😎, 🔥!'
        open(newunit=unit, file='file.txt')
        write(unit, '(a)') chars
        write(*, '(a)') ' Wrote line to file: "' // chars // '"'
        close(unit)
    end block
    ! Read back from the file
    block
        character(len=100) :: chars
        integer :: unit
        open(newunit=unit, file='file.txt', action='read')
        read(unit, '(a)') chars
        write(*,'(a)') 'Read line from file: "' // trim(chars) // '"'
        close(unit)
        if (trim(chars) /= 'Fortran is 💪, 😎, 🔥!') error stop
    end block
end program
The open statement in Fortran allows one to specify encoding='UTF-8'. In testing with ifort and gfortran, however, this does not seem to have any impact on the file written. For example, specifying encoding does not seem to add a Byte Order Mark (BOM) with either gfortran or ifort.
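If you want to check this on your own compiler, one possible approach is to write a file with encoding='UTF-8' and then read its first three bytes back through stream access. This is only a sketch under a few assumptions: the program name and the file name bom_test.txt are made up for illustration, and the check only looks for the UTF-8 BOM (the bytes EF BB BF).

```fortran
program check_bom
    use iso_fortran_env, only: int8
    implicit none
    integer :: unit
    integer(int8) :: bytes(3)

    ! Write a short line with the encoding specifier set
    open(newunit=unit, file='bom_test.txt', encoding='UTF-8', status='replace')
    write(unit, '(a)') 'Fortran π'
    close(unit)

    ! Read the first three raw bytes of the file back
    open(newunit=unit, file='bom_test.txt', access='stream', &
         form='unformatted', action='read')
    read(unit) bytes
    close(unit)

    ! The UTF-8 BOM is EF BB BF, i.e. -17, -69, -65 as signed 8-bit integers
    if (all(bytes == [-17_int8, -69_int8, -65_int8])) then
        write(*,'(a)') 'File starts with a BOM'
    else
        write(*,'(a)') 'No BOM found'
    end if
end program
```

Going by the testing described above, this would be expected to report that no BOM was found with both gfortran and ifort, but the result may differ with other compilers or versions.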
Conclusion
We’ve seen that using Unicode characters in Fortran is actually not that hard! One needs to remember that what we perceive as a character is not necessarily a single element in our character variables. Apart from that, using Unicode characters in Fortran should really be quite straightforward.