Odd, I do not get the length difference. Crazy that in 2024 we are still struggling with character encoding.
Here is another test you could try. If you create a file with special characters in the name and put the path in your utf_in.txt. You could test if inquire
and open
are working.
I did the test on windows with ifort and it failed. These functions require a path encoded in the local code page which in my case is 850.
In my case, it works with a file named utfé_in.txt
under Linux, both with GFortran and Intel ifx.
Anyway, concerning ifx, if we follow the standard, we should not use encoding='utf-8'
:
The value UTF-8 shall not be specified if the processor does not support the ISO 10646 character kind.
And therefore, it should be avoided if you want your code to work with any compiler…
In the sentence “Intel Fortran doesn’t yet support other character kinds”, I think the important word is yet. Unicode is like IPv6, you must be patient… It was released in 1991, but it took at least 20 years before the web pages were generally correctly encoded and decoded…
Happily, it is not so important for most of us as the lingua franca of science is English.
@Patrick, I have been playing with the Win32 interface directly from Fortran and managed to port the C code I reported before.
module win32
use, intrinsic :: iso_c_binding
implicit none
private
public :: to_local_codepage
integer, parameter :: UINT = c_int, DWORD = c_long, WCHAR = c_int16_t
integer, parameter :: BOOL = c_int, HANDLE = c_intptr_t
integer(UINT), parameter :: CP_ACP = 0
integer(UINT), parameter :: CP_MACCP = 2
integer(UINT), parameter :: CP_OEMCP = 1
integer(UINT), parameter :: CP_SYMBOL = 42
integer(UINT), parameter :: CP_THREAD_ACP = 3
integer(UINT), parameter :: CP_UTF7 = 65000
integer(UINT), parameter :: CP_UTF8 = 65001
integer(DWORD), parameter :: MB_COMPOSITE = 2
integer(DWORD), parameter :: MB_ERR_INVALID_CHARS = 8
integer(DWORD), parameter :: MB_PRECOMPOSED = 1
integer(DWORD), parameter :: MB_USEGLYPHCHARS = 4
integer(DWORD), parameter :: WC_COMPOSITECHECK = 512
integer(DWORD), parameter :: WC_ERR_INVALID_CHARS = 128
integer(DWORD), parameter :: WC_NO_BEST_FIT_CHARS = 1024
integer(DWORD), parameter :: WC_DEFAULTCHAR = 64
integer(DWORD), parameter :: WC_DISCARDNS = 16
integer(DWORD), parameter :: WC_SEPCHARS = 32
integer(c_int), parameter :: MB_ABORTRETRYIGNORE = 2
integer(c_int), parameter :: MB_CANCELTRYCONTINUE = 6
integer(c_int), parameter :: MB_HELP = 16384
integer(c_int), parameter :: MB_OK = 0
integer(c_int), parameter :: MB_OKCANCEL = 1
integer(c_int), parameter :: MB_RETRYCANCEL = 5
integer(c_int), parameter :: MB_YESNO = 4
integer(c_int), parameter :: MB_YESNOCANCEL = 3
integer(c_int), parameter :: MB_ICONEXCLAMATION = 48
integer(c_int), parameter :: MB_ICONWARNING = 48
integer(c_int), parameter :: MB_ICONINFORMATION = 64
integer(c_int), parameter :: MB_ICONASTERISK = 64
integer(c_int), parameter :: MB_ICONQUESTION = 32
integer(c_int), parameter :: MB_ICONSTOP = 16
integer(c_int), parameter :: MB_ICONERROR = 16
integer(c_int), parameter :: MB_ICONHAND = 16
integer(c_int), parameter :: MB_DEFBUTTON1 = 0
integer(c_int), parameter :: MB_DEFBUTTON2 = 256
integer(c_int), parameter :: MB_DEFBUTTON3 = 512
integer(c_int), parameter :: MB_DEFBUTTON4 = 768
integer(c_int), parameter :: MB_APPLMODAL = 0
integer(c_int), parameter :: MB_SYSTEMMODAL = 4096
integer(c_int), parameter :: MB_TASKMODAL = 8192
integer(c_int), parameter :: MB_DEFAULT_DESKTOP_ONLY = 131072
integer(c_int), parameter :: MB_RIGHT = 524288
integer(c_int), parameter :: MB_RTLREADING = 1048576
integer(c_int), parameter :: MB_SETFOREGROUND = 65536
integer(c_int), parameter :: MB_TOPMOST = 262144
integer(c_int), parameter :: MB_SERVICE_NOTIFICATION = 2097152
interface
function MultiByteToWideChar(CodePage, dwFlags, lpMultiByteStr, &
cbMultiByte,lpWideCharStr,cchWideChar) bind(C,name='MultiByteToWideChar')
import
implicit none
!GCC$ ATTRIBUTES STDCALL :: MultiByteToWideChar
!GCC$ ATTRIBUTES ALLOW_NULL :: lpMultiByteStr, lpWideCharStr
!DEC$ ATTRIBUTES STDCALL :: MultiByteToWideChar
!DEC$ ATTRIBUTES ALLOW_NULL :: lpMultiByteStr, lpWideCharStr
integer(c_int) MultiByteToWideChar
integer(UINT), value :: CodePage
integer(DWORD), value :: dwFlags
character(kind=C_CHAR), optional :: lpMultiByteStr(*)
integer(c_int), value :: cbMultiByte
integer(WCHAR), optional :: lpWideCharStr(*)
integer(c_int), value :: cchWideChar
end function
end interface
interface
function WideCharToMultiByte(CodePage, dwFlags, lpWideCharStr, &
cchWideChar, lpMultiByteStr, cbMultiByte, lpDefaultChar, &
lpUsedDefaultChar) bind(C,name='WideCharToMultiByte')
import
implicit none
!GCC$ ATTRIBUTES STDCALL :: WideCharToMultiByte
!GCC$ ATTRIBUTES ALLOW_NULL :: lpMultiByteStr, lpDefaultChar, lpUsedDefaultChar
!DEC$ ATTRIBUTES STDCALL :: WideCharToMultiByte
!DEC$ ATTRIBUTES ALLOW_NULL :: lpMultiByteStr, lpDefaultChar, lpUsedDefaultChar
integer(c_int) WideCharToMultiByte
integer(UINT), value :: CodePage
integer(DWORD), value :: dwFlags
integer(WCHAR) lpWideCharStr(*)
integer(c_int), value :: cchWideChar
character(kind=C_CHAR), optional :: lpMultiByteStr(*)
integer(c_int), value :: cbMultiByte
character(kind=C_CHAR), optional :: lpDefaultChar
integer(BOOL), optional :: lpUsedDefaultChar
end function
end interface
contains
function utf8_to_utf16(utf8string, nchars) result(res)
character(len=*, kind=c_char), intent(in) :: utf8string
integer(c_int), intent(in) :: nchars
integer(WCHAR), allocatable :: res(:)
!private
integer(c_int) :: requiredSize
integer(c_int) :: writtenSize
requiredSize = 1 + MultiByteToWideChar(CP_UTF8, &
0, &
utf8string, &
nchars, &
cchWideChar = 0)
allocate(res(requiredSize))
writtenSize = MultiByteToWideChar(CP_UTF8, &
0, &
utf8string, &
nchars, &
res, &
requiredSize)
res(writtenSize) = 0
end function
function utf16_to_local(utf16string, nchars) result(res)
integer(WCHAR), intent(in) :: utf16string(*)
integer(c_int), intent(in) :: nchars
character(len=1, kind=c_char), allocatable :: res(:)
!private
integer(c_int) :: requiredSize
integer(c_int) :: writtenSize
requiredSize = 1 + WideCharToMultiByte(CP_ACP, &
0, &
utf16string, &
nchars, &
cbMultiByte = 0)
allocate(res(requiredSize))
writtenSize = WideCharToMultiByte(CP_ACP, &
0, &
utf16string, &
nchars, &
res, &
requiredSize)
res(writtenSize) = c_null_char
end function
function to_local_codepage(str) result(res)
character(*), intent(in) :: str
!private
character(:), allocatable :: res
integer(c_int) :: str_length
integer(WCHAR), allocatable :: utf16tmp(:)
character(len=1, kind=c_char), allocatable :: local_string(:)
integer :: i
if (str == '') then
res = str
else
utf16tmp = utf8_to_utf16(str//c_null_char, len(str) + 1)
local_string = utf16_to_local(utf16tmp, -1)
allocate(character(size(local_string)) :: res)
do i = 1, size(local_string)
if (local_string(i) == c_null_char) exit
res(i:i) = local_string(i)
end do
if (i <= len(res)) res(i:) = ' '
res = trim(res)
end if
end function
end module
And so, if you create or use an existing ‘string’ implementation and override the uddtio you could get something in Fortran that would adapt the encoding based on the context (console, file, internal file,…)
From my perspective, the intrinsic Unicode manipulation in Fortran standard (which is based on USC4, i.e. UTF-32) is far from useful. So I wrote a simple library which provides some basic UTF-8 operations St-Maxwell/utf8-f
In the UTF-8 encoding, the character strings are stored as the byte string, and the number of bytes representing a character can vary from 1 to 4. So the indexing of character is quite complex, and you can see that the iteration is not done through a do loop but through an iterator.
I just read that ucs-4 is basically utf-32 (fixed-width 4 octets encoded characters, should have been obvious from the name), and as such it is not strictly equivalent to utf-8.
Since you open the file with utf-8 encoding (by default), I do not know how gfortran handles the conversion utf-8/utf-32 under the hood. That might explain the count difference.
Nice lib, I did not know it existed. Thanks.
That won’t solve the OP’s problem though.
From what I understand, the intrinsic open
/inquire
functions do a system call to whatever functions in the os to open/create a file.
On Windows, that would certainly be CreateFileA
. It seems that using gfortran and ucs-4 string, you end up calling CreateFileW
which expects wide character sets. Since Intel does not support ucs-4, you have to change encoding manually, or create your own wrapper around CreateFileW
. Not very convenient… but at least, with all codes posted here, there are solutions.
Hi, thank you for your code! As soon as I have time I will experiment with it. I must however admit that I am puzzled by Fortran running into problems needing plasters. Patrick.