How do I file-read French special characters like é etc?

Patrick · October 13, 2023, 1:33pm

J:>type set.bat
set path=c:\users\pader\gcc\bin;%path%
J:>gfortran test.f90

J:>a
Úlise is a girls name like frÚdÚrique

RonShepard · October 13, 2023, 3:46pm

Fortran has implicit conversion rules for comparing integer, real, and logical variables of the same type and different kinds, and also in some cases for converting between types. But it does not have rules for implicit conversion between different kinds of character type. The programmer must do those conversions manually, and then compare characters of the same kind.

The details describing logical expressions are in section 10.1 of the standard. I find this section among the most tedious and the most difficult to read.
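
A minimal sketch of the manual conversion described above, assuming a compiler that supports the ISO 10646 kind (GFortran does); the variable names are illustrative only. Intrinsic assignment between the default and ISO 10646 character kinds is permitted, so the default-kind string is first copied into a wide variable, and the comparison is then made between two values of the same kind. The copy is only meaningful here because the text is plain ASCII.

```fortran
program compare_kinds
    implicit none
    integer, parameter :: ucs4 = selected_char_kind('iso_10646')
    character(len=5)            :: narrow          ! default kind
    character(len=5, kind=ucs4) :: wide, narrow_as_wide

    narrow = 'hello'
    wide   = ucs4_'hello'

    ! "narrow == wide" is not allowed: the operands have different kinds.
    ! Convert explicitly first (the assignment performs the conversion),
    ! then compare two values of the same kind.
    narrow_as_wide = narrow
    print *, narrow_as_wide == wide    ! prints T
end program compare_kinds
```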

Patrick · October 13, 2023, 5:26pm

My conclusion is that Fortran, as it stands, is not the best choice for text processing. I have abandoned my Fortran application in favor of a version written in C. IMO the quality of recent “improvements” to the language does not do justice to the elegance of the language as it was. Rather than telling every string what its encoding should be, one statement at the top of the program could tell the compiler what to do. The “kind” solution is really messy. The point is that if many instances are affected, you do not modify each instance; you do so in a generalizing way, like “culture” in another language. Modifying at the level of instances is not so smart.

art-rasa · October 13, 2023, 9:24pm

Thank you, interesting. Opening the terminal in Unicode mode made chars_a print correctly. I’m still wondering why chars_b is printed as different characters. GFortran also produces warnings for chars_b. Maybe wide character initialization is better done with the hex code way.

CHARACTER expression at (1) is being truncated (3/1) [-Wcharacter-truncation]
iso10646_array.f90:12:14:
   12 |         '年', '月', '日' &
      |             1

everythingfunctional · October 16, 2023, 2:30pm

Based on the warning message, you are not giving the correct kind to your characters. It should be like

character(len=1, kind=utf), parameter :: my_chars(*) = [utf_"年", utf_"月", utf_"日"]

Character literals are always the default kind unless a kind is specified with kind_"my string". In your case what the compiler sees there is a sequence of 3 bytes (the UTF-8 encoding of one character). The default character is 1 byte, so it thinks that is a string of 3 characters, but you’re assigning it to a variable with character len=1, so it truncates it, hence the "(3/1)" in the warning.
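
The "hex code way" mentioned by art-rasa could look like the following sketch, assuming GFortran and a UTF-8 terminal; the names ucs4 and ymd are illustrative. Each element is built from its Unicode code point with char(i, kind), so the source file stays pure ASCII and no literal needs a kind prefix.

```fortran
program wide_literals
    use iso_fortran_env, only: output_unit
    implicit none
    integer, parameter :: ucs4 = selected_char_kind('iso_10646')
    ! U+5E74, U+6708, U+65E5 (year, month, day), one character per element.
    character(len=1, kind=ucs4), parameter :: ymd(3) = [ &
        char(int(z'5E74'), kind=ucs4), &
        char(int(z'6708'), kind=ucs4), &
        char(int(z'65E5'), kind=ucs4) ]

    ! Ask the runtime to encode wide characters as UTF-8 on output.
    open(output_unit, encoding='utf-8')
    write(output_unit, '(3a)') ymd
end program wide_literals
```

This will not compile with a compiler where selected_char_kind('iso_10646') returns -1.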

There is a proposal for F202Y that should enable something like the single program-wide statement Patrick suggests. It is a bit controversial, but basically there will be a statement like (syntax and keywords TBD)

DEFAULT(REAL=SELECTED_REAL_KIND(30))

Background in J3 paper 23-199r1.

Patrick · October 22, 2023, 1:16pm

Removed, sorry.

gronki · October 22, 2023, 1:48pm

Honestly, I had no idea that Fortran had utf-8 support. Kudos

Patrick · January 14, 2024, 4:56pm

No, my question has not been answered.

davidpfister · January 14, 2024, 7:55pm

Hi @Patrick, I went through the same problem some time ago. I will try to summarize what I found out.
First of all, Fortran has no problem reading or writing French special characters. Try writing some accented text to a file, reading it back, and writing it again to another file. The accents should be properly displayed. For this, just make sure that you open the file with utf-8 encoding.
The problems appear when you want to:

write to the console
use inquire with non ascii characters in the path
use open with non ascii characters in the path.

The problem is that these three situations expect a string encoded with your local code page.
To know which cp is the default one on your system simply run chcp in cmd. Mine returns 850 (DOS Western Europe). Since the console is expecting a string with local cp and your string is utf8, you get gibberish characters.

If you are on Windows you can write a bit of C to call the Windows API and transform your utf8 string to utf16 and then utf16 to local cp. That should do the trick.

PS: do not expect any help from the debugger, as it usually displays strings in the local cp.

davidpfister · January 14, 2024, 8:20pm

The C code should be something like the following:

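#include <stdlib.h>
#include <windows.h>

/* Convert UTF-8 to a newly allocated, NUL-terminated UTF-16 string.
   nchars is the input length in bytes, or -1 if the input is NUL-terminated.
   The caller must free() the result; NULL is returned if allocation fails. */
extern wchar_t* _utf8_to_utf16(const char* utf8string, int nchars) {
    int requiredSize;
    int writtenSize;
    wchar_t* result = NULL;

    SetLastError(0);

    /* With a NULL buffer of size 0, the call returns the required
       buffer size (in wchar_t); add 1 for the terminator. */
    requiredSize = 1 +
        MultiByteToWideChar(CP_UTF8, 0, utf8string, nchars, NULL, 0);

    result = (wchar_t*)calloc(requiredSize, sizeof(wchar_t));
    if (result == NULL) return NULL;

    writtenSize =
        MultiByteToWideChar(CP_UTF8, 0, utf8string, nchars, result, requiredSize);

    result[writtenSize] = 0;

    return result;
}

/* Convert UTF-16 to a newly allocated string in the local code page.
   CP_ACP is the system ANSI code page; the console may use a different
   (OEM) code page, which GetConsoleOutputCP() reports.
   The caller must free() the result; NULL is returned if allocation fails. */
extern char* _utf16_to_local(const wchar_t* utf16string, int nchars) {
    int requiredSize;
    int writtenSize;
    char* result = NULL;

    /* Same two-step pattern: query the size (in bytes), then convert. */
    requiredSize = 1 +
        WideCharToMultiByte(CP_ACP, 0, utf16string, nchars, NULL, 0, NULL, NULL);

    result = (char*)calloc(requiredSize, 1);
    if (result == NULL) return NULL;

    writtenSize =
        WideCharToMultiByte(CP_ACP, 0, utf16string, nchars,
                            result, requiredSize, NULL, NULL);

    result[writtenSize] = 0;

    return result;
}

/* UTF-8 -> UTF-16 -> local code page. The caller must free() the result. */
extern char* to_local_codepage(const char* utf8string, int nchars) {
    char* local_string = NULL;
    wchar_t* utf16tmp = _utf8_to_utf16(utf8string, nchars);

    if (utf16tmp == NULL) return NULL;

    local_string = _utf16_to_local(utf16tmp, -1);
    free(utf16tmp);  /* the intermediate UTF-16 buffer is no longer needed */

    return local_string;
}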

Patrick · January 15, 2024, 9:58am

I would like to remain in a Fortran environment, without an excursion to C, in order to read from a file and to print its contents, either on screen or to a printer, without any problems with characters like é, è, â. There does not seem to be a way to do this. I find this very surprising, as there must surely be programmers in France manipulating texts with Fortran code.

So to answer your question: NO my query has not been effectively responded to.

davidpfister · January 15, 2024, 10:53am

Then you may need to set the code page to 65001 (i.e. utf8) explicitly (run > chcp 65001). There is also the possibility to do it permanently. See this post on SE.
Alternatively, you can do the same as the C code above from Fortran directly by calling the Win32 API (see About WideCharToMultiByte (Win32 API) - Intel Community)
I did not try it myself though.

Patrick · January 15, 2024, 4:55pm

I tried your code (program utf) and it seems to work okay. I will need to incorporate your code into an application that I abandoned due to problems with special text characters, to see how that goes.

vmagnin · January 15, 2024, 5:53pm

I never really cared about writing French characters, because it is useless for my research activities.

But mixing all your codes and remarks, I have successfully read a UTF-8 file utf_in.txt containing these lines:

Lire des caractères accentués n'est pas difficile. Inutile d'apprendre ça par cœur.
éèçàâäùëêïîüûöô

and then successfully written them to utf_out.txt and to the terminal.

program utf
    use iso_fortran_env, only: output_unit
    implicit none

    integer, parameter :: ucs4 = selected_char_kind('iso_10646')
    character(len=100, kind=ucs4) :: string

    open(10, file='utf_in.txt',  encoding='utf-8')
    open(11, file='utf_out.txt', encoding='utf-8')
    open(output_unit, encoding='utf-8')

    ! First line:
    read( 10, '(a)') string
    write(11, '(a)') string
    write(output_unit, '(a)') string
    ! Second line:
    read( 10, '(a)') string
    write(11, '(a)') string
    write(output_unit, '(a)') string

    close(10)
    close(11)
end program utf

My Linux terminal is configured with these environment variables:

$ env | grep -i lang
LANGUAGE=fr
LANG=fr_FR.UTF-8

Everything is OK with GFortran:

$ gfortran fr.f90 && ./a.out
Lire des caractères accentués n'est pas difficile. Inutile d'apprendre ça par cœur.                 
éèçàâäùëêïîüûöô

Note that the Fortran 2018 standard says in section 16.9.168 SELECTED_CHAR_KIND (NAME):

If NAME has the value ISO_10646, then the result has a value equal to that of the kind type parameter of the ISO 10646 character kind (corresponding to UCS-4 as specified in ISO/IEC 10646) if the processor supports such a kind; otherwise the result has the value −1.

With ifx 2024.0.2, I have got an error:

$ ifx fr.f90 && ./a.out
fr.f90(8): error #6684: This is an incorrect value for a kind type parameter in this context.   [UCS4]
    character(len=100, kind=ucs4) :: string
----------------------------^
compilation aborted for fr.f90 (code 1)

Does it not support ISO 10646, or is there an option? A print *, ucs4 gives -1.
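
A small sketch of how a program might test for this before declaring any wide variables, based on the standard text quoted above (a result of -1 means the kind is unsupported). The program name and messages are illustrative. It declares no character(kind=ucs4) entity, so it should compile whether or not the kind exists.

```fortran
program check_ucs4
    implicit none
    integer, parameter :: ucs4 = selected_char_kind('iso_10646')

    if (ucs4 < 0) then
        print '(a)', 'ISO 10646 character kind is not supported by this compiler.'
    else
        print '(a, i0)', 'ISO 10646 character kind = ', ucs4
    end if
end program check_ucs4
```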

vmagnin · January 15, 2024, 6:12pm

In the dump below, we can verify that ASCII characters are coded on one byte and French accented characters on two bytes:

$ file utf_out.txt
utf_out.txt: Unicode text, UTF-8 text
$ hexdump -C utf_out.txt
00000000  4c 69 72 65 20 64 65 73  20 63 61 72 61 63 74 c3  |Lire des caract.|
00000010  a8 72 65 73 20 61 63 63  65 6e 74 75 c3 a9 73 20  |.res accentu..s |
00000020  6e 27 65 73 74 20 70 61  73 20 64 69 66 66 69 63  |n'est pas diffic|
00000030  69 6c 65 2e 20 49 6e 75  74 69 6c 65 20 64 27 61  |ile. Inutile d'a|
00000040  70 70 72 65 6e 64 72 65  20 c3 a7 61 20 70 61 72  |pprendre ..a par|
00000050  20 63 c5 93 75 72 2e 20  20 20 20 20 20 20 20 20  | c..ur.         |
00000060  20 20 20 20 20 20 20 20  0a c3 a9 c3 a8 c3 a7 c3  |        ........|
00000070  a0 c3 a2 c3 a4 c3 b9 c3  ab c3 aa c3 af c3 ae c3  |................|
00000080  bc c3 bb c3 b6 c3 b4 20  20 20 20 20 20 20 20 20  |.......         |
00000090  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
*
000000d0  20 20 20 20 20 20 20 20  20 20 20 20 0a           |            .|
000000dd

The first one is “è” at the end of the first line of the dump, and it is coded by the bytes C3 A8, according to:
https://www.compart.com/en/unicode/U+00E8

HTML Entity: &egrave; (also &#232; or &#xE8;)
UTF-8 Encoding: 0xC3 0xA8
UTF-16 Encoding: 0x00E8
UTF-32 Encoding: 0x000000E8

And its code in iso-8859-15 is E8.

davidpfister · January 15, 2024, 6:33pm

I will quote @sblionel from a thread on the Intel forum

Intel Fortran doesn’t yet support other character kinds. The standard does not require support for ISO_10646.
UTF-8 is the default (and only supported specific) value for ENCODING.

The link is here.
I guess it’s still the case for ifx. So your only chance is to convert your utf8 string to the local code page.

vmagnin · January 15, 2024, 7:32pm

But in the Fortran 2018 standard I read:

12.5.6.9 ENCODING= specifier in the OPEN statement
The scalar-default-char-expr shall evaluate to UTF-8 or DEFAULT. The ENCODING= specifier is permitted only for a connection for formatted input/output. The value UTF-8 specifies that the encoding form of the file is UTF-8 as specified in ISO/IEC 10646. Such a file is called a Unicode file, and all characters therein are of ISO 10646 character kind. The value UTF-8 shall not be specified if the processor does not support the ISO 10646 character kind. The value DEFAULT specifies that the encoding form of the file is processor dependent. If this specifier is omitted in an OPEN statement that initiates a connection, the default value is DEFAULT.

So, from a practical point of view, I understand that UTF-8 can be considered as a synonym of ISO 10646.

Or by “UTF-8 is the default” should I understand “ASCII is the default”? Indeed, ASCII characters are coded on 8 bits, or rather 7: the leading bit is 0 in a one-byte UTF-8 character (0xxxxxxx) and 1 in every byte of a multi-byte character. In true UTF-8, characters can thus be coded with one to four bytes (for example two bytes for French accented characters), as needed for each character of the string.
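
The bit layout can be made concrete with a short sketch that turns a Unicode code point into its UTF-8 bytes using integer bit operations. The subroutine name encode_utf8 and its interface are hypothetical, written only for this illustration, and no validation of the code point is done. For U+00E8 (è) it reproduces the C3 A8 pair seen in the hexdump.

```fortran
program utf8_bytes
    implicit none
    integer :: b(4), n

    call encode_utf8(int(z'E8'), b, n)     ! U+00E8, e with grave accent
    print '(4(z2.2, 1x))', b(1:n)          ! prints C3 A8

contains

    ! Encode one Unicode code point cp into n UTF-8 bytes stored in b(1:n).
    subroutine encode_utf8(cp, b, n)
        integer, intent(in)  :: cp
        integer, intent(out) :: b(4), n

        b = 0
        if (cp < 128) then                 ! 0xxxxxxx
            n = 1
            b(1) = cp
        else if (cp < 2048) then           ! 110xxxxx 10xxxxxx
            n = 2
            b(1) = ior(192, ishft(cp, -6))
            b(2) = ior(128, iand(cp, 63))
        else if (cp < 65536) then          ! 1110xxxx 10xxxxxx 10xxxxxx
            n = 3
            b(1) = ior(224, ishft(cp, -12))
            b(2) = ior(128, iand(ishft(cp, -6), 63))
            b(3) = ior(128, iand(cp, 63))
        else                               ! 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            n = 4
            b(1) = ior(240, ishft(cp, -18))
            b(2) = ior(128, iand(ishft(cp, -12), 63))
            b(3) = ior(128, iand(ishft(cp, -6), 63))
            b(4) = ior(128, iand(cp, 63))
        end if
    end subroutine encode_utf8
end program utf8_bytes
```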

davidpfister · January 15, 2024, 8:11pm

From what I understand, Fortran does read utf8 strings by default. That’s why you can read special characters from a file and write them to another without problems.
I just think that Intel compilers do not support the ucs4 character kinds like gfortran does.

vmagnin · January 15, 2024, 8:30pm

Indeed, if I replace
character(len=100, kind=ucs4) :: string
simply by
character(len=100) :: string
in my program, ifx does read my UTF-8 file, does print the French characters in terminal and does write correctly the UTF-8 utf_out.txt file…

$ ifx fr.f90 && ./a.out
Lire des caractères accentués n'est pas difficile. Inutile d'apprendre ça par cœur.    
éèçàâäùëêïîüûöô

But the program then does not work anymore correctly with GFortran…

Things are not yet clear in my mind…

vmagnin · January 15, 2024, 8:35pm

The first line of my utf_in.txt file has 83 characters. If I add a print *, len(trim(string)) for that line, I obtain:

83 with GFortran and kind=ucs4
83 with GFortran without kind=ucs4 (but the French characters are not printed correctly)
87 with ifx without kind=ucs4 (else it does not compile)

Well, for the moment I just feel more confused…
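
The 83 versus 87 difference is at least consistent with a default-kind string holding raw UTF-8 bytes: the first line contains four non-ASCII characters (è, é, ç, œ), each stored as two bytes in the hexdump, and 83 + 4 = 87. A sketch of counting characters rather than bytes in such a string, assuming a one-byte default character kind and ichar values in 0..255; the sample word is spelled out byte by byte so the source stays ASCII.

```fortran
program count_utf8
    implicit none
    character(len=:), allocatable :: s
    integer :: i, nchars

    ! "caracteres" with an e-grave: U+00E8 is the two bytes C3 A8 = 195 168.
    s = 'caract' // char(195) // char(168) // 'res'

    nchars = 0
    do i = 1, len(s)
        ! Continuation bytes look like 10xxxxxx; any other byte starts a character.
        if (iand(ichar(s(i:i)), 192) /= 128) nchars = nchars + 1
    end do

    print '(a, i0)', 'bytes:      ', len(s)    ! 11
    print '(a, i0)', 'characters: ', nchars    ! 10
end program count_utf8
```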
