How do I file-read French special characters like é etc?

:thinking: Odd, I do not get the length difference. Crazy that in 2024 we are still struggling with character encoding.
Here is another test you could try. If you create a file with special characters in the name and put the path in your utf_in.txt. You could test if inquire and open are working.
I did the test on windows with ifort and it failed. These functions require a path encoded in the local code page which in my case is 850.

1 Like

In my case, it works with a file named utfé_in.txt under Linux, both with GFortran and Intel ifx.

Anyway, concerning ifx, if we follow the standard, we should not use encoding='utf-8':

The value UTF-8 shall not be specified if the processor does not support the ISO 10646 character kind.

And therefore, it should be avoided if you want your code to work with any compiler…

In the sentence “Intel Fortran doesn’t yet support other character kinds”, I think the important word is yet. Unicode is like IPv6, you must be patient… It was released in 1991, but it took at least 20 years before the web pages were generally correctly encoded and decoded…

Happily, it is not so important for most of us as the lingua franca of science is English.

@Patrick, I have been playing with the Win32 interface directly from Fortran and managed to port the C code I reported before.

module win32
   use, intrinsic :: iso_c_binding
   
   implicit none
   
   private 
   
   public :: to_local_codepage
   
   integer, parameter :: UINT = c_int, DWORD = c_long, WCHAR = c_int16_t
   integer, parameter :: BOOL = c_int, HANDLE = c_intptr_t
   integer(UINT), parameter :: CP_ACP = 0
   integer(UINT), parameter :: CP_MACCP = 2
   integer(UINT), parameter :: CP_OEMCP = 1
   integer(UINT), parameter :: CP_SYMBOL = 42
   integer(UINT), parameter :: CP_THREAD_ACP = 3
   integer(UINT), parameter :: CP_UTF7 = 65000
   integer(UINT), parameter :: CP_UTF8 = 65001
   integer(DWORD), parameter :: MB_COMPOSITE = 2
   integer(DWORD), parameter :: MB_ERR_INVALID_CHARS = 8
   integer(DWORD), parameter :: MB_PRECOMPOSED = 1
   integer(DWORD), parameter :: MB_USEGLYPHCHARS = 4
   integer(DWORD), parameter :: WC_COMPOSITECHECK = 512
   integer(DWORD), parameter :: WC_ERR_INVALID_CHARS = 128
   integer(DWORD), parameter :: WC_NO_BEST_FIT_CHARS = 1024
   integer(DWORD), parameter :: WC_DEFAULTCHAR = 64
   integer(DWORD), parameter :: WC_DISCARDNS = 16
   integer(DWORD), parameter :: WC_SEPCHARS = 32
   integer(c_int), parameter :: MB_ABORTRETRYIGNORE = 2
   integer(c_int), parameter :: MB_CANCELTRYCONTINUE = 6
   integer(c_int), parameter :: MB_HELP = 16384
   integer(c_int), parameter :: MB_OK = 0
   integer(c_int), parameter :: MB_OKCANCEL = 1
   integer(c_int), parameter :: MB_RETRYCANCEL = 5
   integer(c_int), parameter :: MB_YESNO = 4
   integer(c_int), parameter :: MB_YESNOCANCEL = 3
   integer(c_int), parameter :: MB_ICONEXCLAMATION = 48
   integer(c_int), parameter :: MB_ICONWARNING = 48
   integer(c_int), parameter :: MB_ICONINFORMATION = 64
   integer(c_int), parameter :: MB_ICONASTERISK = 64
   integer(c_int), parameter :: MB_ICONQUESTION = 32
   integer(c_int), parameter :: MB_ICONSTOP = 16
   integer(c_int), parameter :: MB_ICONERROR = 16
   integer(c_int), parameter :: MB_ICONHAND = 16
   integer(c_int), parameter :: MB_DEFBUTTON1 = 0
   integer(c_int), parameter :: MB_DEFBUTTON2 = 256
   integer(c_int), parameter :: MB_DEFBUTTON3 = 512
   integer(c_int), parameter :: MB_DEFBUTTON4 = 768
   integer(c_int), parameter :: MB_APPLMODAL = 0
   integer(c_int), parameter :: MB_SYSTEMMODAL = 4096
   integer(c_int), parameter :: MB_TASKMODAL = 8192
   integer(c_int), parameter :: MB_DEFAULT_DESKTOP_ONLY = 131072
   integer(c_int), parameter :: MB_RIGHT = 524288
   integer(c_int), parameter :: MB_RTLREADING = 1048576
   integer(c_int), parameter :: MB_SETFOREGROUND = 65536
   integer(c_int), parameter :: MB_TOPMOST = 262144
   integer(c_int), parameter :: MB_SERVICE_NOTIFICATION = 2097152
   
   interface
      function MultiByteToWideChar(CodePage, dwFlags, lpMultiByteStr, &
         cbMultiByte,lpWideCharStr,cchWideChar) bind(C,name='MultiByteToWideChar')
         import
         implicit none
         !GCC$ ATTRIBUTES STDCALL :: MultiByteToWideChar
		 !GCC$ ATTRIBUTES ALLOW_NULL :: lpMultiByteStr, lpWideCharStr
         !DEC$ ATTRIBUTES STDCALL :: MultiByteToWideChar
		 !DEC$ ATTRIBUTES ALLOW_NULL :: lpMultiByteStr, lpWideCharStr
         integer(c_int) MultiByteToWideChar
         integer(UINT), value :: CodePage
         integer(DWORD), value :: dwFlags
         character(kind=C_CHAR), optional :: lpMultiByteStr(*)
         integer(c_int), value :: cbMultiByte
         integer(WCHAR), optional :: lpWideCharStr(*)
         integer(c_int), value :: cchWideChar
      end function
   end interface
   
   interface
      function WideCharToMultiByte(CodePage, dwFlags, lpWideCharStr, &
									cchWideChar, lpMultiByteStr, cbMultiByte, lpDefaultChar, &
									lpUsedDefaultChar) bind(C,name='WideCharToMultiByte')
         import
         implicit none
         !GCC$ ATTRIBUTES STDCALL :: WideCharToMultiByte
		 !GCC$ ATTRIBUTES ALLOW_NULL :: lpMultiByteStr, lpDefaultChar, lpUsedDefaultChar
         !DEC$ ATTRIBUTES STDCALL :: WideCharToMultiByte
		 !DEC$ ATTRIBUTES ALLOW_NULL :: lpMultiByteStr, lpDefaultChar, lpUsedDefaultChar
         integer(c_int) WideCharToMultiByte
         integer(UINT), value :: CodePage
         integer(DWORD), value :: dwFlags
         integer(WCHAR) lpWideCharStr(*)
         integer(c_int), value :: cchWideChar
         character(kind=C_CHAR), optional :: lpMultiByteStr(*)
         integer(c_int), value :: cbMultiByte
         character(kind=C_CHAR), optional :: lpDefaultChar
         integer(BOOL), optional :: lpUsedDefaultChar
      end function
   end interface

contains
	
	function utf8_to_utf16(utf8string, nchars) result(res)
		character(len=*, kind=c_char), intent(in) :: utf8string
		integer(c_int), intent(in) :: nchars
		integer(WCHAR), allocatable :: res(:)
		!private
		integer(c_int) :: requiredSize
		integer(c_int) :: writtenSize
		
		requiredSize = 1 + MultiByteToWideChar(CP_UTF8, &
								0, & 
								utf8string, &
								nchars, &
								cchWideChar = 0)
		
		allocate(res(requiredSize))
		
		writtenSize = MultiByteToWideChar(CP_UTF8, &
											0, &
											utf8string, &
											nchars, &
											res, &
											requiredSize)
		
		res(writtenSize) = 0
	end function
	
	function utf16_to_local(utf16string, nchars) result(res)
		integer(WCHAR), intent(in) :: utf16string(*)
		integer(c_int), intent(in) :: nchars
		character(len=1, kind=c_char), allocatable :: res(:)
		!private
		integer(c_int) :: requiredSize
		integer(c_int) :: writtenSize
		
		requiredSize = 1 + WideCharToMultiByte(CP_ACP, &
								0, & 
								utf16string, &
								nchars, &
								cbMultiByte = 0)
		
		allocate(res(requiredSize))
		
		writtenSize = WideCharToMultiByte(CP_ACP, &
											0, &
											utf16string, &
											nchars, &
											res, &
											requiredSize)
		
		res(writtenSize) = c_null_char
	end function

    function to_local_codepage(str) result(res)
        character(*), intent(in) :: str
        !private
        character(:), allocatable :: res
        integer(c_int) :: str_length
		integer(WCHAR), allocatable :: utf16tmp(:)
		character(len=1, kind=c_char), allocatable :: local_string(:)
		integer :: i
        if (str == '') then
            res = str
		else
			utf16tmp = utf8_to_utf16(str//c_null_char, len(str) + 1)
            local_string = utf16_to_local(utf16tmp, -1)

            allocate(character(size(local_string)) :: res)
			do i = 1, size(local_string)
				if (local_string(i) == c_null_char) exit
				res(i:i) = local_string(i)
			end do
			if (i <= len(res)) res(i:) = ' '
			res = trim(res)
		end if
		
    end function
	
end module

And so, if you create or use an existing ‘string’ implementation and override the uddtio you could get something in Fortran that would adapt the encoding based on the context (console, file, internal file,…)

From my perspective, the intrinsic Unicode manipulation in Fortran standard (which is based on USC4, i.e. UTF-32) is far from useful. So I wrote a simple library which provides some basic UTF-8 operations St-Maxwell/utf8-f


In the UTF-8 encoding, the character strings are stored as the byte string, and the number of bytes representing a character can vary from 1 to 4. So the indexing of character is quite complex, and you can see that the iteration is not done through a do loop but through an iterator.

5 Likes

I just read that ucs-4 is basically utf-32 (fixed-width 4 octets encoded characters, should have been obvious from the name), and as such it is not strictly equivalent to utf-8.
Since you open the file with utf-8 encoding (by default), I do not know how gfortran handles the conversion utf-8/utf-32 under the hood. That might explain the count difference.

1 Like

Nice lib, I did not know it existed. Thanks.
That won’t solve the OP’s problem though.
From what I understand, the intrinsic open/inquire functions do a system call to whatever functions in the os to open/create a file.
On Windows, that would certainly be CreateFileA. It seems that using gfortran and ucs-4 string, you end up calling CreateFileW which expects wide character sets. Since Intel does not support ucs-4, you have to change encoding manually, or create your own wrapper around CreateFileW. Not very convenient… but at least, with all codes posted here, there are solutions.

Hi, thank you for your code! As soon as I have time I will experiment with it. I must however admit that I am puzzled by Fortran running into problems needing plasters. Patrick.

1 Like