How do I file-read French special characters like é etc?

How do I file-read French special characters like é etc? I would like something as simple as encoding='UTF-8'. Surely there must be French Fortran programmers coding file reads?

I am having a little problem positioning my thank you’s in the right place - i.e. under your contributions. So again. Brad and Grofz, thank you very much.

But, in all honesty, you can see what characters are available with a program like this:

program chars
    use iso_fortran_env, only: int64
    implicit none
    integer, parameter :: ASCII_KIND = selected_char_kind('ASCII')
    integer, parameter :: ISO_KIND = selected_char_kind('ISO_10646')
    integer(int64) :: i

    print '(A)', "ASCII Character Values"
    do i = 1_int64, 2_int64**storage_size(ASCII_KIND_' ') - 1  ! largest code that fits in one character
        print '(I3,", ",A1)', i, char(i, kind=ASCII_KIND)
    end do

    print *
    print '(A)', "ISO 10646 Character Values"
    do i = 1_int64, 2_int64**storage_size(ISO_KIND_' ') - 1  ! largest code that fits in one character
        print '(I10,", ",A1)', i, char(i, kind=ISO_KIND)
    end do
end program

I.e.

$ gfortran chars.f90 -o chars
$ ./chars | less
ASCII Character Values
  1, ^A
  2, ^B
  3, ^C
  4, ^D
  5, ^E
  6, ^F
  7, ^G
  8,
  9,    
 10, 

 11, ^K
 12, ^L
 13, 
 14, ^N
 15, ^O
 16, ^P
...
ISO 10646 Character Values
         1, ^A
         2, ^B
         3, ^C
         4, ^D
         5, ^E
         6, ^F
         7, ^G
         8,
         9,     
        10, 

        11, ^K
        12, ^L
        13, 
        14, ^N
        15, ^O
        16, ^P
...

The following code reads (and prints to screen) line by line from a UTF-8 encoded text file. Does this help?

program test_read_utf8
  use iso_fortran_env, only : output_unit
  implicit none
  integer, parameter :: UTF=selected_char_kind('ISO_10646')
  character(kind=UTF,len=200) :: line
  integer :: fid, ios
   
  open(newunit=fid, encoding='utf-8', file='a.tex')
  open(output_unit, encoding='utf-8')
   
  do
    read(fid,'(a)',iostat=ios) line
    if (ios /= 0) exit
    print *, trim(line)
  end do
  close(fid)
end program

Grofz, thank you also very much for the code you posted in response to my query regarding reading special characters! I will study it tomorrow. Patrick.

Brad, thank you too very much for the code you posted! I will study it tomorrow. Patrick.

In case it’s helpful I came across this the other day: The Absolute Minimum Every Software Developer Must Know About Unicode in 2023 (Still No Excuses!) @ tonsky.me

In particular if you need to do manual decoding of utf-8 from bytes, I found the “How many bytes are in UTF-8?” section helpful.
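To make the byte-counting idea concrete, here is a minimal sketch in Fortran of how the number of bytes in a UTF-8 sequence can be read off its first byte. The module name utf8_bytes and the function utf8_seq_len are made up for this illustration, and the sketch assumes a compiler such as gfortran where a default-kind character holds one byte. It does not validate the continuation bytes.

```fortran
module utf8_bytes
    implicit none
    private
    public :: utf8_seq_len
contains
    !> Number of bytes in the UTF-8 sequence that starts with `lead`.
    !> Returns 0 for a continuation byte (10xxxxxx) or an invalid lead byte.
    !> (Hypothetical helper, not part of any library.)
    pure integer function utf8_seq_len(lead) result(n)
        character(len=1), intent(in) :: lead
        integer :: b

        ! Mask to 0..255 in case the processor reports bytes as signed values.
        b = iand(ichar(lead), 255)

        if (b < 128) then
            n = 1    ! 0xxxxxxx: plain ASCII
        else if (b < 192) then
            n = 0    ! 10xxxxxx: continuation byte, not a valid start
        else if (b < 224) then
            n = 2    ! 110xxxxx
        else if (b < 240) then
            n = 3    ! 1110xxxx
        else if (b < 248) then
            n = 4    ! 11110xxx
        else
            n = 0    ! not used by UTF-8
        end if
    end function utf8_seq_len
end module utf8_bytes

program demo_utf8_len
    use utf8_bytes, only: utf8_seq_len
    implicit none
    character(len=:), allocatable :: s
    integer :: i, n, npoints

    ! "café" spelled as raw UTF-8 bytes (é is C3 A9), so the example
    ! does not depend on the encoding of this source file.
    s = 'caf' // char(195) // char(169)

    i = 1
    npoints = 0
    do while (i <= len(s))
        n = utf8_seq_len(s(i:i))
        if (n == 0) error stop 'invalid UTF-8 lead byte'
        npoints = npoints + 1
        i = i + n    ! jump to the start of the next sequence
    end do

    print '(a,i0,a,i0)', 'bytes = ', len(s), ', code points = ', npoints
end program demo_utf8_len
```

For the five-byte string above this prints `bytes = 5, code points = 4`, which is the difference between len() on a default-kind string and the number of characters a reader would count.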


If you are on… that popular “operating system”, I had some (solvable) trouble with UTF, mainly because it still uses UTF-16, at least to an extent. I actually had to implement a module, effective only there, to deal with the equivalent of wchar_t some libraries use.

The link provided by @rbitr is a must read indeed.

Thanks for that interesting link. I miss those ill-decoded web pages we had for so long in our browsers :grinning:. It’s interesting to consider how long it can take for a standard or protocol like Unicode or IPv6 to be fully adopted (well, it reminds me of modern Fortran vs. FORTRAN 77)…

The talk by Dylan Beattie posted earlier in the thread was very interesting and entertaining. “Plain text” might not be as plain as we think.

The Japanese even have their own term for those garbled web pages, Mojibake.


Then I propose “ill-baked” in English… :slightly_smiling_face:

Be happy it’s only about a few special accented characters in French, and a few special letters in Nordic languages or Spanish. In my language 63% of the alphabet is the same as in Latin (at least in capital form), and even those are not treated as ASCII characters. Webpages (and even a simple terminal) were a mess for at least two decades. We had to use special plugin programs, and not all of them were compatible with each other, so we were at the mercy of the web designer and whichever plugin they used. Only with the broad adoption of UTF-8 was the problem really solved.

For more complicated languages, where the concept of an “alphabet” does not really exist, you can have thousands of characters. It is a miracle it even works in the first place. I’m guessing “Mojibake” should very well be a thing even today.


This seems like a nice thread to ask a question about wide characters in Fortran. Moderator(s) may move my question to an appropriate thread, if necessary.

What’s the correct way to save wide (ISO_10646) characters in an array? I tried an approach similar to the one described on Fortran-lang “types and kinds”, but I get just question marks when I try to print the individual array elements. After some experimentation I found that when the array is declared as character(len=4, kind=ucs4) I get the expected output. Why len=4? Shouldn’t kind=ucs4 already define each array element to be of the correct format/length? Or did I make some mistake in my code?

program iso10646_array
    implicit none
    
    integer, parameter :: ucs4 = selected_char_kind('iso_10646')
    character(len=4, kind=ucs4), parameter :: chars_a(*) = [ &
        char(int( z'5e74' ),ucs4), & ! year
        char(int( z'6708' ),ucs4), & ! month
        char(int( z'65e5' ),ucs4)  & ! day 
    ]
    character(len=4, kind=ucs4), parameter :: chars_b(*) = [ &
        '年', '月', '日' &
    ]
    
    call show(chars_a) ! printed as question marks
    call show(chars_b) ! printed correctly
    
contains
    subroutine show(a)
        character(len=*, kind=ucs4), intent(in) :: a(:)
        integer :: i
        print *, 'Printing array of size', size(a)
        do i = 1, size(a)
            print *, '-->', i, a(i)
        end do
    end subroutine
end program

Output is:

 Printing array of size           3
 -->           1 ?   
 -->           2 ?   
 -->           3 ?   
 Printing array of size           3
 -->           1 年 
 -->           2 月 
 -->           3 日 


------------------
(program exited with code: 0)
Press return to continue
$ gfortran --version
GNU Fortran (GCC) 13.2.1 20230728 (Red Hat 13.2.1-1)

As illustrated above, it appears that standard output is not using UTF-8 encoding by default. Try the following, which opens output_unit with encoding="UTF-8" and declares the array elements with len=1:

program iso10646_array
    use iso_fortran_env, only: output_unit
    implicit none
    
    integer, parameter :: ucs4 = selected_char_kind('iso_10646')
    character(len=1, kind=ucs4), parameter :: chars_a(*) = [ &
        char(int( z'5e74' ),ucs4), & ! year
        char(int( z'6708' ),ucs4), & ! month
        char(int( z'65e5' ),ucs4)  & ! day 
    ]
    character(len=1, kind=ucs4), parameter :: chars_b(*) = [ &
        '年', '月', '日' &
    ]
    open(output_unit, encoding="UTF-8")
    call show(chars_a) ! now printed correctly
    call show(chars_b) ! now printed as å, æ, æ (see the output below)
    
contains
    subroutine show(a)
        character(len=*, kind=ucs4), intent(in) :: a(:)
        integer :: i
        print *, 'Printing array of size', size(a)
        do i = 1, size(a)
            print *, '-->', i, a(i)
        end do
    end subroutine
end program

Output is:

 Printing array of size           3
 -->           1 年
 -->           2 月
 -->           3 日
 Printing array of size           3
 -->           1 å
 -->           2 æ
 -->           3 æ
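
On the len=4 question, here is a possible explanation, assuming gfortran and a source file saved as UTF-8: a default-kind literal such as '年' is stored as its three UTF-8 bytes, so converting it to kind=ucs4 gives three separate characters rather than one. With len=1 only the first byte survives, which would account for the å, æ, æ in the second half of the output above. The sketch below prints the length of such a literal next to a genuine one-character ucs4 value built from its code point. The program name literal_kinds and the variable year are made up for illustration.

```fortran
program literal_kinds
    use iso_fortran_env, only: output_unit
    implicit none

    integer, parameter :: ucs4 = selected_char_kind('iso_10646')
    character(len=1, kind=ucs4) :: year

    ! A default-kind literal is stored byte by byte. Assumption: this
    ! source file is saved as UTF-8, so the literal below holds 3 bytes.
    print '(a,i0)', 'len of default-kind literal: ', len('年')

    ! Building the character from its code point gives exactly one
    ! ISO 10646 character, so len=1 is enough.
    year = char(int(z'5E74'), ucs4)
    print '(a,i0)', 'len of ucs4 variable:        ', len(year)

    ! The unit must be switched to UTF-8 before wide characters are written,
    ! otherwise non-ASCII code points may come out as question marks.
    open(output_unit, encoding='UTF-8')
    print '(a,a)', 'written as UTF-8:            ', year
end program literal_kinds
```

Under those assumptions the first line is expected to report a length of 3 and the second a length of 1. This suggests building wide characters from code points (as in chars_a) and keeping len=1, rather than relying on default-kind literals in the source.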

Hoping that I am inserting my message at the right place.

Anyway:
I am running into a bunch of errors applying the gratefully received advice, q.v.

having coded: character(kind=UTF,len=800) :: string

the first error reported is:

. . . . . 47 | 10 do while (string(1:1).NE.'')
| 1
. . . . . Error: Operands of comparison operator ‘.ne.’ at (1) are CHARACTER(
,4)/CHARACTER(1)

I presume the kinds of the two strings being compared do not match.

Perhaps understanding what is wrong here might help me understand all the other errors thrown up on compilation.

Please help,
Patrick.

Hi. I tried what you suggested. This is the outcome of your test code:

Printing array of size 3
→ 1 Õ╣┤
→ 2 µ£ê
→ 3 µùÑ
Printing array of size 3
→ 1 ├Ñ
→ 2 ├ª
→ 3 ├ª

Might I please have your thoughts?

The error message is pointing out a kind mismatch. Try

10 do while (string(1:1) /= utf_' ')
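
As a small self-contained illustration of the same point (the program name kind_prefix and the sample text are invented for the example): a literal without a kind prefix has default kind, so it cannot be compared directly with a kind=utf string, while a literal written as utf_'...' has the same kind as the variable. Comparing a one-character substring with '' or with ' ' is otherwise equivalent, because the shorter operand is padded with blanks.

```fortran
program kind_prefix
    implicit none

    integer, parameter :: utf = selected_char_kind('ISO_10646')
    character(kind=utf, len=800) :: string
    integer :: n

    ! The kind prefix makes the literal an ISO 10646 string.
    string = utf_'hello world'

    ! Count the characters before the first blank. Both operands of the
    ! comparison are kind=utf, so the compiler accepts it.
    n = 0
    do
        if (n >= len(string)) exit
        if (string(n+1:n+1) == utf_' ') exit
        n = n + 1
    end do

    print '(a,i0)', 'length of first word: ', n
end program kind_prefix
```

This prints `length of first word: 5`. Dropping the utf_ prefix from the blank literal in the comparison should reproduce the kind-mismatch error quoted above.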

The actual symbol that appears on your screen depends on the font. It may also depend on the encoding supported by your terminal emulator.


I will not comment on your last sentences, but I tried reading a file containing á and è, which was saved as UTF explicitly, and got perfectly fine results. This was with gfortran and Intel Fortran oneAPI on both Linux and Windows. The only problem was that sometimes the terminal was not cooperating, but when redirecting the output of my silly program to a file, the contents of the file showed up as expected.

Here is the source code:


! utf.f90 --
!     Experiment with UTF
!
program utf
    implicit none

    character(len=20) :: string

    open( 10, file = 'utf.txt', encoding = 'utf-8' )
    read( 10, '(a)' ) string
    write(*,*) string
end program utf

The main limitation of this as a proof of concept is that it does not do much with the content of the string.
