(I use the gfortran compiler.) I get squiggles in my text when I use language-specific characters. Is there a general way to stop that from happening? Patrick
What software do you use exactly? Is that in your IDE? The gfortran compiler is unlikely to be the culprit on its own.
I am talking about my own software. In C there is “using System.Globalization”. I was wondering if something similar exists in Fortran.
In C you see this sort of thing:
    using System.Globalization;

    public class SamplesCultureInfo
    {
        public static void Main()
        {
            // Creates and initializes the CultureInfo which uses the international sort.
            CultureInfo myCIintl = new CultureInfo("es-ES", false);

            // Creates and initializes the CultureInfo which uses the traditional sort.
            CultureInfo myCItrad = new CultureInfo(0x040A, false);
        }
    }
Ah, that makes it clear. That would actually be C++, by the way. Of old, there has been the “locale”. I do not know if there is a Fortran library that does similar things. (The code you showed seems to me to be specific to the Microsoft compilers, but that is not my area of expertise.)
You may want to open standard output in UTF-8 mode to print unicode characters:
    ! unicode.f90
    program main
        use, intrinsic :: iso_fortran_env, only: output_unit
        implicit none
        integer, parameter :: u = selected_char_kind('ISO_10646')

        open (output_unit, encoding='utf-8')
        print '(a)', '大海航行靠舵手'
    end program main

    $ gfortran -o unicode unicode.f90
    $ ./unicode
    大海航行靠舵手
Hi Mr Universe, this code produced text without squiggles. Is this about okay? Should anything be different apart from the blsjt? Patrick
    use, intrinsic :: iso_fortran_env, only: output_unit
    integer, parameter :: u = selected_char_kind('ISO_10646')
    write(banana,"(a)")"Hôtel Chez Frédérque"
    end program main
I am not sure your test is doing what you think. The following works for me with gfortran 11.3.0 under cygwin and ifort 2021.3.0 under Windows 10Pro. It shows that sometimes things “just work”.
    program main
        implicit none
        integer unit
        character(len=*), parameter :: string = 'Hôtel Chez Frédérque'

        open (newunit=unit,file='funny.txt',action='write')
        write(unit,"(a)") string
    end program main
gfortran generates the following. ifort is identical, except generates \r\n carriage control by default.
    $ cat funny.txt
    Hôtel Chez Frédérque
    $ od -c funny.txt
    0000000   H 303 264   t   e   l       C   h   e   z       F   r 303 251
    0000020   d 303 251   r   q   u   e  \n
    0000031
However, it depends on the encoding of the string in the Fortran source file. When I pasted your code above into Visual Studio, the encoding was munged. When I used a different editor - either emacs or Notepad++ - the UTF-8 encoding was preserved. I could then open the source file in VS and it all worked.
Using “od -c” or a utility that shows the encoding of the Fortran source code and the output file may be enlightening.
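If od is not at hand, the same check can be done from inside Fortran. The following is only a minimal sketch (the program name show_bytes and the short test string are my own choices, not from the posts above). It prints each storage unit of a default-kind string in hexadecimal, so you can see which bytes the compiler actually picked up from the source file. If the source is saved as UTF-8, the ô should show up as the two bytes C3 B4 (the 303 264 in the od dump above); if it is saved as ISO-8859-1, a single byte F4 is expected instead.

```fortran
! show_bytes.f90 (hypothetical name): dump the bytes stored in a string literal
program show_bytes
    implicit none
    character(len=*), parameter :: string = 'Hôtel'
    integer :: i

    ! Each element of a default-kind string is one storage unit (one byte
    ! with gfortran and ifort); print its numeric value as two hex digits.
    do i = 1, len(string)
        write (*, '(1x, z2.2)', advance='no') ichar(string(i:i))
    end do
    write (*, '(a)') ''   ! finish the line
end program show_bytes
```

Compile it once from a UTF-8 copy of the source and once from a copy converted by your editor to see the difference.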
Note that Intel Fortran only supports a single character KIND. You can use it for UTF-8 data, but multi-byte characters are stored in (not surprisingly) multiple bytes. I work on software used in around 100 countries and users have (almost) no problems with labels in their native languages. We programmers just have to accept that the displayed length is not always the length of the string. Our GUI is written in C#, but some text is written from Fortran.
Edit: I reproduced the issue of pasting the code into VS. The Fortran source file was encoded as “ISO-8859 text”. The working version is “Unicode text, UTF-8 text”.
Yes, in MS Windows it is quite possible that text editor will save the source file in various non-Unicode encodings. In my country that could be CP1250, in Spain it will be a different one.
You can see the problem by changing the encoding on the source code. The only changes are to the accented characters in the string - the rest of the program is ASCII.
    $ iconv -f utf-8 -t iso-8859-1 charkind01.f90 > charkind02.f90
    $ file charkind*.f90
    charkind01.f90: Unicode text, UTF-8 text
    charkind02.f90: ISO-8859 text
    $ gfortran -Wall -g -std=f2008 -Wextra -Wall -fcheck=all charkind02.f90 -o charkind02.exe
    $ ./charkind02.exe
    $ cat funny.txt
    H▒tel Chez Fr▒d▒rque
    $ od -c funny.txt
    0000000   H 364   t   e   l       C   h   e   z       F   r 351   d 351
    0000020   r   q   u   e  \n
    0000025
Per the standard, the code shall be along the following lines:
    integer, parameter :: CK = selected_char_kind('ISO_10646')
    integer :: lun
    character(kind=CK, len=*), parameter :: string = CK_'𨉟呐㗂越'

    open(newunit=lun, encoding="utf-8", file='funny.txt', action='write')
    write(lun, fmt=*) string
    end
Processor implementors can explain whether and how their implementations can process the conforming code and what the expected program behavior is. Intel, for example, has decided not to support this for the foreseeable future, which is a real shame given how widely UTF-8 is used globally. What does Intel have against, say, Vietnamese here? Does Intel’s software team not want to help Intel do business in Vietnam, or what?
I do not understand this string - what is CK doing in it?
If and when you stick around with Fortran long enough and care to follow the standard, which takes a bit of effort and attention, you will find that the standard and its details are not as readily obvious as you appear to demand in replies to your posts from other readers who are donating their time. You can read through this, especially NOTE 2:
Intel Fortran handles non-European strings - at least those that are written left-to-right - adequately for many applications. We find the biggest weakness is console I/O and editors. Everyone needs to be aware of locale and file encoding issues.
    integer unit
    character(len=*), parameter :: &
        string = 'Mình nói tiếng Việt (𨉟呐㗂越, "I speak Vietnamese")'

    open (newunit=unit,file='funny3.txt',encoding="utf-8",action='write')
    write(unit,"(a)") string
    end

    $ cat funny3.txt
    Mình nói tiếng Việt (𨉟呐㗂越, "I speak Vietnamese")
My question has been answered by David.
If you look two lines before the parameter definition, you will see that CK is a character kind value. The parameter definition then uses a character string literal of that kind. Other data types in Fortran, like integer and real, have the kind value after the literal value (e.g. 1_int32, or 1.0_real32), but characters are the other way around: the kind value comes before the literal.
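To make the suffix/prefix difference concrete, here is a small sketch. The program and variable names are mine, and it assumes a compiler that supports the ISO_10646 kind (gfortran does; on a compiler where selected_char_kind returns -1 the character declaration will not compile).

```fortran
program kind_literals
    use, intrinsic :: iso_fortran_env, only: int32, real32
    implicit none
    ! Assumption: the compiler provides the ISO_10646 character kind.
    integer, parameter :: ck = selected_char_kind('ISO_10646')
    integer(int32) :: i
    real(real32) :: x
    character(kind=ck, len=3) :: s

    i = 1_int32       ! integer literal: kind goes AFTER the value
    x = 1.0_real32    ! real literal: kind goes AFTER the value
    s = ck_'abc'      ! character literal: kind goes BEFORE the value

    print *, i, x, len(s)
end program kind_literals
```

So CK_'...' in the earlier post is simply a string literal of kind CK, written in the only order the language allows for characters.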
This line does not conform. It is a processor extension from Intel, which is strange, since they intend to support only a single kind for the character type, yet they treat that kind as ASCII when it is not.
I am not an expert in the Fortran standard, but I have a good working knowledge. I don’t see why the code is non-conformant. I am happy to learn more, but I have no alternative but to continue working with code like the above.
Neither gfortran (with -std=f2018) nor ifort (with /stand:f18) warns about non-standard code.
The compiler treats the character variable as a sequence of bytes. The editor and console display the code as utf-8 but the compiler doesn’t know (or care) about that. The programmer needs to be careful not to destroy the utf-8 encoding when manipulating substrings and shouldn’t expect sensible results from collating sequences and the like.
It is a different matter if your editor inserts a byte order mark.
True, but the individual bytes that form a UTF-8 sequence do not necessarily map to existing characters in the default character kind (which is only required to support the Fortran character set, which is not even the full ASCII set).
So yes, it works in practice (as long as you are doing only basic stuff like this), and it may even work with all existing compilers. Still, it is formally not standard-compliant, and as such there is no 100% guarantee that it always works.
Yes. I agree. I am not sure if it is standard compliance or quality of implementation, but it doesn’t matter.
It does matter when it comes to writing code that behaves the way the language says that it should behave. You previously also made this statement:
which also shows that the details do matter. On my computer, which runs macOS, not Windows, your string prints “correctly” (I think), but when I do something as simple as len(string) to ask how long the string is, I get nonsense. Some of the displayed characters appear to be encoded in 8 bits and others in 16 bits. I guess that is the way that character encoding is supposed to work, but how could you write any kind of portable code by just ignoring those details?
I should add that I cannot tell if my compiler is really doing the right thing. I’m guessing in part by how the string appears in my browser when reading this thread and comparing that to what is printed by the compiler.
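For what it is worth, the len(string) result can be made sense of if the default-kind string is holding raw UTF-8 bytes. The sketch below rests on that assumption (UTF-8 source file, one byte per storage unit, as with gfortran and ifort), and the program name is made up. It counts code points by skipping UTF-8 continuation bytes, which all have the bit pattern 10xxxxxx. For the hotel string from earlier in the thread I would expect len to report 23 and the loop to report 20, since each of the three accented letters takes two bytes.

```fortran
program utf8_length
    implicit none
    ! Assumption: this source file is saved as UTF-8.
    character(len=*), parameter :: string = 'Hôtel Chez Frédérque'
    integer :: i, n

    n = 0
    do i = 1, len(string)
        ! Continuation bytes look like 10xxxxxx; masking with 192
        ! (binary 11000000) gives 128 for exactly those bytes.
        if (iand(ichar(string(i:i)), 192) /= 128) n = n + 1
    end do

    print *, 'storage units (len):', len(string)
    print *, 'code points:', n
end program utf8_length
```

This counts code points, not displayed glyphs, so combining accents would still be counted separately. It also says nothing about whether a given compiler is obliged to accept such bytes in a default-kind literal, which is the conformance question discussed above.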