Handling files in directories whose names are not entirely ASCII

Arjen · November 18, 2022, 1:45pm

I have run into a nasty problem that concerns directories and files with names that contain non-ASCII (UNICODE) characters on Windows. My program cannot open such files: it claims they do not exist. A client reported this problem.

First sigh:
I have tried to reproduce the problem, but failed utterly, because every method I tried gives a different result. The UNICODE character in question is encoded in three bytes; it looks like an ordinary hyphen, but it is not. When I copy it into a command to create a directory with that name, it gets converted into separate characters, so the resulting directories have very different names.

Second sigh:
As I cannot reproduce the problem, I do not know if it is compiler-specific or if it has to do with the lack of support for such file names in general.

So, on the off chance that someone knows how to deal with this, can you advise me what to do? (The compilers I use are Intel Fortran oneAPI and gfortran.)

plevold · November 18, 2022, 3:01pm

I think your fate might be in the hands of the compiler developers and more specifically how they transfer the file name from the Fortran open statement to the Windows API calls.

The Windows API has two sets of functions for file system operations: one that takes paths as narrow (“ANSI”) strings in the active code page (for example CreateFileA) and one that takes paths in “wide” characters (for example CreateFileW). The “wide” versions expect UTF-16 encoded strings, where one code unit is 2 bytes long and most common characters fit in a single code unit.
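As a rough illustration of what a “wide” string looks like, here is a small Python sketch (Python is used purely for illustration and is not part of the original discussion). It assumes that U+2010 HYPHEN stands in for the client's character, and the helper name `utf16_units` is made up for this example. It counts the 16-bit code units UTF-16 needs for a character inside and outside the Basic Multilingual Plane.

```python
# Sketch: how many 16-bit code units UTF-16 needs per character.
# Assumption: U+2010 HYPHEN stands in for the client's character.

def utf16_units(ch: str) -> int:
    """Return the number of 16-bit code units used to encode ch in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2

hyphen = "\u2010"      # inside the Basic Multilingual Plane
emoji = "\U0001F600"   # outside the BMP, needs a surrogate pair

print(utf16_units(hyphen))               # 1
print(utf16_units(emoji))                # 2
print(hyphen.encode("utf-16-le").hex())  # 1020 (little-endian bytes of 0x2010)
```

So the hyphen-like character itself fits in a single 2-byte unit; only characters beyond U+FFFF need two.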

GFortran supports wide characters by using selected_char_kind ('ISO_10646'): SELECTED_CHAR_KIND (The GNU Fortran Compiler)

I haven’t tried this, but it might be that the GFortran runtime makes the correct Windows API calls if you use that kind for your filename variable. Intel currently does not support wide characters AFAIK.

Side note: Be aware that the GFortran documentation for selected_char_kind is very confusing. It currently states that

Currently, supported character sets include “ASCII” and “DEFAULT”, which are equivalent, and “ISO_10646” (Universal Character Set, UCS-4) which is commonly known as Unicode.

which is plain wrong. UCS-4 is not the same as Unicode! I should probably have submitted a pull request fixing this, but I’ve forgotten about it…

Side note 2: If you want to learn more about Unicode without becoming even more confused I’d recommend this article: http://utf8everywhere.org/

Arjen · November 18, 2022, 3:09pm

Well, using a UNICODE character string did come to mind, but the annoying problem is that I do not seem to be able to reproduce the exact name. I know the byte sequence of the character I need, but it never shows up properly in the resulting directory name.

But even if a UNICODE string works, as far as I know Intel Fortran does not support that, or does it?

plevold · November 18, 2022, 3:23pm

Where do you get the filename from in the first place? If you read it from a UTF-8 encoded text file (which is a quite common encoding for text files) then you need to convert that into UCS-4 when you assign it to your UCS-4 variable. Even though the Unicode code point for the character is the same, the binary representation is different in UTF-8 and UCS-4.
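A minimal Python sketch of that difference, for illustration only. It assumes the name arrives as raw UTF-8 bytes (for instance read from a text file) and that the character is U+2010; the helper `utf8_to_ucs4_be` is a hypothetical name. Big-endian output is used only to make the bytes easy to read; the byte order a compiler uses in memory depends on the platform.

```python
# Sketch: the same code point has different bytes in UTF-8 and UCS-4 (UTF-32).
# Assumption: the file name arrives as raw UTF-8 bytes.

def utf8_to_ucs4_be(raw: bytes) -> bytes:
    """Decode UTF-8 bytes and re-encode them as big-endian UCS-4 / UTF-32."""
    return raw.decode("utf-8").encode("utf-32-be")

raw_name = b"\xe2\x80\x90U2010a"   # UTF-8 bytes; the first three are U+2010
text = raw_name.decode("utf-8")

print(len(raw_name), len(text))             # 9 bytes, 7 characters
print(hex(ord(text[0])))                    # 0x2010
print(utf8_to_ucs4_be(raw_name[:3]).hex())  # 00002010
```

The code point is 0x2010 in both cases, but UTF-8 stores it as the three bytes E2 80 90 while UCS-4 stores it as one 4-byte value.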

Note that it seems like I was wrong in my previous post. UCS-4 seems to be 4 bytes long, like UTF-32, and not 2 bytes. That means that if GFortran is going to use CreateFileW from the Windows API, it needs to convert your UCS-4 encoded filename into UTF-16 for this to work properly. I'm really not sure if that’s the case.

No, I’m pretty sure Intel Fortran does not support UCS-4.

Also note that both Intel and GFortran technically support Unicode, because you can use the default character kind to store UTF-8 encoded strings. I wrote a post about this a while back: Using Unicode Characters in Fortran. You’ll still run into problems with the Windows filesystem though, because it doesn’t use UTF-8.

Arjen · November 18, 2022, 3:26pm

I got an error report from a client. The output file from the program showed the complete path name of the file and from that I could immediately recognise the problem. However, copying it into a batch file to create the directory and several other methods led to directories with a different sequence of characters, mostly three characters instead of a single one. Annoying and confusing.

plevold · November 18, 2022, 7:51pm

That sounds a lot like encoding issues. If your text editor saved the file in UTF-8, but batch expects a different encoding, then the resulting filename would likely be different. I don’t know what encoding batch expects, but most text editors show which encoding is being used in the lower right corner. There’s actually no exact way to tell what encoding is being used for a text file, but most editors are quite good at guessing.
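The “three characters instead of one” symptom can be reproduced with a short Python sketch. This is only an illustration: it assumes the character is U+2010 and uses code page 437 as a stand-in for whatever legacy code page the batch interpreter actually applies, which is not known from this thread. The helper name `misread` is made up.

```python
# Sketch: UTF-8 bytes misread in a legacy code page turn one character into three.
# Assumption: cp437 stands in for the code page the batch interpreter uses.

def misread(text: str, wrong_codec: str = "cp437") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_codec)

garbled = misread("\u2010")
print(len(garbled))   # 3: each UTF-8 byte became its own character
print(garbled)
```

Plain ASCII text survives this round trip unchanged, which is why the problem only shows up for names with non-ASCII characters.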

If it’s actually the case that the compiler makes a call to the ASCII version of the Windows API - which I believe is the case - then I don’t think you’ll get any further. If I understand the Windows developer docs correctly then there’s no way to reach a file with Unicode characters in the path this way.

EDIT: I just found this thread on the Intel forums which seems to have a solution by doing some string conversions with some Windows API functions. They claim it works, haven’t tested it though. https://community.intel.com/t5/Intel-Fortran-Compiler/Unicode-characters-in-file-name-in-OPEN-statement/td-p/832640

Arjen · November 19, 2022, 12:05pm

Oh, that might be an alternative. I will have to check that. But a first glance at the solution shows that it relies on the ability to represent the name in the local codepage encoding. That is only possible for a small subset of characters, whereas UNICODE defines well over a hundred thousand.
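Whether a given name survives a code page conversion can be checked with a few lines of Python. This is a sketch under assumptions: the character is taken to be U+2010, the code pages listed are examples rather than the client's actual setting, and `representable` is a hypothetical helper. Python's codecs are strict, whereas Windows' own conversion functions may instead substitute a similar-looking character, so this only shows whether an exact mapping exists.

```python
# Sketch: can a character be represented exactly in a given code page?
# Assumption: U+2010 is the character; the code pages are examples only.

def representable(text: str, codec: str) -> bool:
    """Return True if every character of text has an exact mapping in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

for codec in ("cp1252", "cp437", "utf-8"):
    print(codec, representable("\u2010", codec))
```

For comparison, the en dash U+2013 does have a slot in cp1252, so a name using it would pass this check while one using U+2010 would not.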

drikosev · November 22, 2022, 10:41am

It’s confusing but interesting, as it also fails in the PowerShell console.

If you have the option to run the Bash script below, i.e. in Cygwin or Windows Subsystem for Linux (WSL), the file name should be displayed properly in both Explorer and the PowerShell console (Windows 8.1 or later).

#!/bin/bash
# file name: hyphen-U2010cyg.sh
# in Cygwin the sh extension is important
# 

#The following line creates dir: ‐U2010a
mkdir   "‐U2010a"

#The following line creates dir: ‐U2010b
printf  '\xE2\x80\x90U2010b'  | xargs mkdir

drikosev · November 23, 2022, 7:18am

IMHO also, GNU Fortran likely calls CreateFile (libgfortran/io/unix.c), which is the ASCII version.

Not sure if one can interpose it and call CreateFileW instead, as one can on *nix systems:
[Bug fortran/61261] [OOP] Segfault on source-allocating polymorphic variables
