How do I file-read French special characters like é etc?

Patrick · October 10, 2023, 6:13am

How do I file-read French special characters like é etc? I would like something as simple as encoding='UTF-8'. Surely there must be French Fortran programmers coding file reads?

I am having a little problem positioning my thank you’s in the right place - i.e. under your contributions. So again. Brad and Grofz, thank you very much.

everythingfunctional · October 10, 2023, 1:31pm

But, in all honesty, you can see what characters are available with a program like this:

program chars
    use iso_fortran_env, only: int64
    implicit none
    integer, parameter :: ASCII_KIND = selected_char_kind('ASCII')
    integer, parameter :: ISO_KIND = selected_char_kind('ISO_10646')
    integer(int64) :: i

    print '(A)', "ASCII Character Values"
    do i = 1_int64, 2_int64**storage_size(ASCII_KIND_' ') - 1  ! last value that fits in one character
        print '(I3,", ",A1)', i, char(i, kind=ASCII_KIND)
    end do

    print *
    print '(A)', "ISO 10646 Character Values"
    do i = 1_int64, 2_int64**storage_size(ISO_KIND_' ') - 1  ! last value that fits in one character
        print '(I10,", ",A1)', i, char(i, kind=ISO_KIND)
    end do
end program

For example:

$ gfortran chars.f90 -o chars
$ ./chars | less
ASCII Character Values
  1, ^A
  2, ^B
  3, ^C
  4, ^D
  5, ^E
  6, ^F
  7, ^G
  8,
  9,    
 10, 

 11, ^K
 12, ^L
 13, 
 14, ^N
 15, ^O
 16, ^P
...
ISO 10646 Character Values
         1, ^A
         2, ^B
         3, ^C
         4, ^D
         5, ^E
         6, ^F
         7, ^G
         8,
         9,     
        10, 

        11, ^K
        12, ^L
        13, 
        14, ^N
        15, ^O
        16, ^P
...

grofz · October 10, 2023, 1:50pm

The following code reads (and prints to screen) line by line from a UTF-8 encoded text file. Does this help?

program test_read_utf8
  use iso_fortran_env, only : output_unit
  implicit none
  integer, parameter :: UTF=selected_char_kind('ISO_10646')
  character(kind=UTF,len=200) :: line
  integer :: fid, ios
   
  open(newunit=fid, encoding='utf-8', file='a.tex')
  open(output_unit, encoding='utf-8')
   
  do
    read(fid,'(a)',iostat=ios) line
    if (ios /= 0) exit
    print *, trim(line)
  end do
  close(fid)
end program

Patrick · October 10, 2023, 8:39pm

Grofz, thank you also very much for the code you posted in response to my query regarding reading special characters! I will study it tomorrow. Patrick.

Patrick · October 10, 2023, 8:44pm

Brad, thank you too very much for the code you posted! I will study it tomorrow. Patrick.

rbitr · October 10, 2023, 11:05pm

In case it’s helpful I came across this the other day: The Absolute Minimum Every Software Developer Must Know About Unicode in 2023 (Still No Excuses!) @ tonsky.me

In particular if you need to do manual decoding of utf-8 from bytes, I found the “How many bytes are in UTF-8?” section helpful.
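For manual decoding, the length of a UTF-8 sequence can be read off its first byte. The following is only a minimal sketch of that rule; the module and function names (`utf8_bytes`, `utf8_seq_len`) are made up for illustration, and it does not validate continuation bytes or reject overlong encodings.

```fortran
module utf8_bytes
    implicit none
    private
    public :: utf8_seq_len

contains

    ! Number of bytes in the UTF-8 sequence whose first byte is `b`
    ! (given as an integer 0..255). Returns 0 for a continuation byte
    ! or a value that cannot start a sequence.
    ! Hypothetical helper: it only looks at the leading bit pattern.
    pure function utf8_seq_len(b) result(n)
        integer, intent(in) :: b
        integer :: n

        if (b < 0 .or. b > 255) then
            n = 0
        else if (b < 128) then     ! 0xxxxxxx: ASCII, 1 byte
            n = 1
        else if (b < 192) then     ! 10xxxxxx: continuation byte
            n = 0
        else if (b < 224) then     ! 110xxxxx: 2-byte sequence
            n = 2
        else if (b < 240) then     ! 1110xxxx: 3-byte sequence
            n = 3
        else if (b < 248) then     ! 11110xxx: 4-byte sequence
            n = 4
        else                       ! 11111xxx: never valid in UTF-8
            n = 0
        end if
    end function utf8_seq_len

end module utf8_bytes

program demo_utf8_seq_len
    use utf8_bytes, only: utf8_seq_len
    implicit none

    ! é is U+00E9, encoded in UTF-8 as the two bytes C3 A9.
    print '(A,I0)', 'lead byte C3 -> ', utf8_seq_len(int(z'C3'))
    ! 年 is U+5E74, encoded in UTF-8 as the three bytes E5 B9 B4.
    print '(A,I0)', 'lead byte E5 -> ', utf8_seq_len(int(z'E5'))
end program demo_utf8_seq_len
```

A byte can be obtained from a default-kind character with `ichar`, assuming the file was read without decoding so that each character holds one byte.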

Pap · October 11, 2023, 7:22am

If you are on… that popular “operating system”, I had some (solvable) trouble with UTF, mainly because it still uses UTF-16, at least to an extent. I actually had to implement a module, effective only there, to deal with the equivalent of wchar_t some libraries use.

The link provided by @rbitr is a must read indeed.

vmagnin · October 11, 2023, 7:52am

Thanks for that interesting link. I miss those ill-decoded web pages we had for so long in our browsers. It’s interesting to consider how long it can take for a standard or protocol like Unicode or IPv6 to be fully adopted (well, it reminds me of modern Fortran vs. FORTRAN 77)…

art-rasa · October 11, 2023, 8:00am

The talk by Dylan Beattie posted earlier in the thread was very interesting and entertaining. “Plain text” might not be as plain as we think.

art-rasa · October 11, 2023, 8:03am

The Japanese even have their own term for those garbled web pages, Mojibake.

vmagnin · October 11, 2023, 8:05am

Then I propose “ill-baked” in English…

Pap · October 11, 2023, 8:26am

Be happy it’s only about a few special accented characters in French, and a few special letters in Nordic languages or Spanish. In my language, 63% of the alphabet is the same as in Latin (at least in capital form), and even those letters are not treated as ASCII characters. Web pages (and even a simple terminal) were a mess for at least two decades. We had to use special plugin programs, and not all of them were compatible with each other, so we were at the mercy of the web designer and whichever plugin they used. Only with the broad adoption of UTF-8 was the problem really solved.

For more complicated languages, where the concept of an “alphabet” does not really exist, you can have thousands of characters. It is a miracle it even works in the first place. I’m guessing “Mojibake” should very well be a thing even today.

art-rasa · October 11, 2023, 10:37am

This seems like a nice thread to ask a question about wide characters in Fortran. Moderator(s) may move my question to an appropriate thread, if necessary.

What’s the correct way to save wide (ISO_10646) characters in an array? I tried an approach similar to the one mentioned on the Fortran-lang “types and kinds” page, but I get just question marks when I try to print the individual array elements. After some experimentation I found that when the array is built as character(len=4, kind=ucs4) I get the expected output. Why len=4? Shouldn’t kind=ucs4 already define each array element to be of the correct format/length? Or did I make some mistake in my code?

program iso10646_array
    implicit none
    
    integer, parameter :: ucs4 = selected_char_kind('iso_10646')
    character(len=4, kind=ucs4), parameter :: chars_a(*) = [ &
        char(int( z'5e74' ),ucs4), & ! year
        char(int( z'6708' ),ucs4), & ! month
        char(int( z'65e5' ),ucs4)  & ! day 
    ]
    character(len=4, kind=ucs4), parameter :: chars_b(*) = [ &
        '年', '月', '日' &
    ]
    
    call show(chars_a) ! printed as question marks
    call show(chars_b) ! printed correctly
    
contains
    subroutine show(a)
        character(len=*, kind=ucs4), intent(in) :: a(:)
        integer :: i
        print *, 'Printing array of size', size(a)
        do i = 1, size(a)
            print *, '-->', i, a(i)
        end do
    end subroutine
end program

Output is:

 Printing array of size           3
 -->           1 ?   
 -->           2 ?   
 -->           3 ?   
 Printing array of size           3
 -->           1 年 
 -->           2 月 
 -->           3 日 



$ gfortran --version
GNU Fortran (GCC) 13.2.1 20230728 (Red Hat 13.2.1-1)

everythingfunctional · October 11, 2023, 1:23pm

As illustrated above, it appears your terminal is not expecting UTF-8 output encoding by default. Try the following:

program iso10646_array
    use iso_fortran_env, only: output_unit
    implicit none
    
    integer, parameter :: ucs4 = selected_char_kind('iso_10646')
    character(len=1, kind=ucs4), parameter :: chars_a(*) = [ &
        char(int( z'5e74' ),ucs4), & ! year
        char(int( z'6708' ),ucs4), & ! month
        char(int( z'65e5' ),ucs4)  & ! day 
    ]
    character(len=1, kind=ucs4), parameter :: chars_b(*) = [ &
        '年', '月', '日' &
    ]
    open(output_unit, encoding="UTF-8")
    call show(chars_a) ! now printed correctly (see output below)
    call show(chars_b) ! now garbled: these literals are default kind, not kind=ucs4
    
contains
    subroutine show(a)
        character(len=*, kind=ucs4), intent(in) :: a(:)
        integer :: i
        print *, 'Printing array of size', size(a)
        do i = 1, size(a)
            print *, '-->', i, a(i)
        end do
    end subroutine
end program

 Printing array of size           3
 -->           1 年
 -->           2 月
 -->           3 日
 Printing array of size           3
 -->           1 å
 -->           2 æ
 -->           3 æ
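One plausible reading of the second block: å is U+00E5 and æ is U+00E6, and E5, E6, E6 are exactly the first bytes of the UTF-8 encodings of 年, 月 and 日. That would fit a len=1 element keeping only the first byte of each default-kind literal. The sketch below just makes that arithmetic visible; it assumes a UTF-8 source file and a byte-by-byte conversion to the ISO 10646 kind, and the names `first_byte_demo_mod` and `first_byte_as_char` are invented for illustration.

```fortran
module first_byte_demo_mod
    implicit none
    integer, parameter :: ucs4 = selected_char_kind('ISO_10646')

contains

    ! Keep only the first byte of a UTF-8 sequence and reinterpret that
    ! byte value as a code point (hypothetical helper, mimicking what a
    ! len=1 element would retain after a byte-wise conversion).
    pure function first_byte_as_char(bytes) result(c)
        integer, intent(in) :: bytes(:)
        character(len=1, kind=ucs4) :: c

        c = char(bytes(1), kind=ucs4)
    end function first_byte_as_char

end module first_byte_demo_mod

program first_byte_demo
    use first_byte_demo_mod, only: first_byte_as_char
    implicit none

    ! UTF-8 encodings of the three characters used in the thread.
    integer, parameter :: year_bytes(3)  = [int(z'E5'), int(z'B9'), int(z'B4')]  ! U+5E74
    integer, parameter :: month_bytes(3) = [int(z'E6'), int(z'9C'), int(z'88')]  ! U+6708
    integer, parameter :: day_bytes(3)   = [int(z'E6'), int(z'97'), int(z'A5')]  ! U+65E5

    ! Print the code point that survives the truncation, in hexadecimal.
    print '(A,Z4.4)', 'U+5E74 truncated -> U+', ichar(first_byte_as_char(year_bytes))
    print '(A,Z4.4)', 'U+6708 truncated -> U+', ichar(first_byte_as_char(month_bytes))
    print '(A,Z4.4)', 'U+65E5 truncated -> U+', ichar(first_byte_as_char(day_bytes))
end program first_byte_demo
```

If this reading is right, it would also explain why len=4 seemed to work before: three bytes plus a blank fit in each element and were passed through to the terminal unchanged.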

Patrick · October 11, 2023, 4:27pm

Hoping that I am inserting my message at the right place.

Anyway:
I am running into a bunch of errors applying the gratefully received advice, q.v.

having coded: character(kind=UTF,len=800) :: string

the first error reported is:

. . . . . 47 | 10 do while (string(1:1).NE.'')
| 1
. . . . . Error: Operands of comparison operator ‘.ne.’ at (1) are CHARACTER(,4)/CHARACTER(1)

I presume the two strings being compared are of mismatched kinds.

Perhaps what is wrong here, might lead me to understand all the other errors thrown up upon compile.

Please help,
Patrick.

Patrick · October 11, 2023, 5:22pm

Hi. I tried what you suggested. This is the outcome of your test code:

Printing array of size 3
→ 1 Õ╣┤
→ 2 µ£ê
→ 3 µùÑ
Printing array of size 3
→ 1 ├Ñ
→ 2 ├ª
→ 3 ├ª

Might I please have your thoughts?

everythingfunctional · October 11, 2023, 6:39pm

The error message is pointing out a kind mismatch. Try

10 do while (string(1:1) /= utf_' ')
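A fuller sketch of the same idea, assuming the compiler supports the ISO 10646 kind and kind-prefixed literals such as `utf_' '`. The module name `blank_scan` and the function `leading_nonblank` are made up for this example; the point is only that both operands of the comparison have the same kind.

```fortran
module blank_scan
    implicit none
    integer, parameter :: utf = selected_char_kind('ISO_10646')

contains

    ! Count the characters before the first blank in `s`
    ! (hypothetical helper for illustration).
    pure function leading_nonblank(s) result(n)
        character(kind=utf, len=*), intent(in) :: s
        integer :: n, pos

        n = 0
        do pos = 1, len(s)
            ! Both sides are kind=utf, so the comparison compiles.
            if (s(pos:pos) /= utf_' ') then
                n = n + 1
            else
                exit
            end if
        end do
    end function leading_nonblank

end module blank_scan

program kind_literal_demo
    use blank_scan, only: utf, leading_nonblank
    implicit none
    character(kind=utf, len=800) :: string

    ! Build "été" from code points; the rest is padded with blanks.
    string = char(233, kind=utf) // utf_'t' // char(233, kind=utf)

    print '(A,I0)', 'characters before first blank: ', leading_nonblank(string)
end program kind_literal_demo
```

Every literal compared with or concatenated onto a `kind=utf` string needs the same prefix, which is probably the source of the other compile errors as well.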

everythingfunctional · October 11, 2023, 6:41pm

The actual symbol that appears on your screen depends on the font. It may also depend on the encoding supported by your terminal emulator.

Patrick · October 12, 2023, 3:03pm

Removed

Arjen · October 13, 2023, 11:58am

I will not comment on your last sentences, but I tried reading a file containing á and è, which was saved as UTF explicitly, and got perfectly fine results. This was with gfortran and Intel Fortran oneAPI on both Linux and Windows. The only problem was that sometimes the terminal was not cooperating, but when redirecting the output of my silly program to a file, the contents of the file showed up as expected.

Here is the source code:


! utf.f90 --
!     Experiment with UTF
!
program utf
    implicit none

    character(len=20) :: string

    open( 10, file = 'utf.txt', encoding = 'utf-8' )
    read( 10, '(a)' ) string
    write(*,*) string
end program utf

The main limitation of this as a proof of concept is that it does not do much with the content of the string.
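To do a little more with the content, here is a rough sketch that writes a short UTF-8 file and then compares the number of characters with the number of bytes on the line. It assumes the compiler supports the ISO 10646 kind (gfortran does); the file name `utf_demo.txt` and all procedure names are invented for the example.

```fortran
module utf8_file_demo
    implicit none
    integer, parameter :: ucs4 = selected_char_kind('ISO_10646')

contains

    ! Write the word "été" (U+00E9, t, U+00E9) to `path` as one UTF-8 line.
    subroutine write_sample(path)
        character(len=*), intent(in) :: path
        integer :: u

        open(newunit=u, file=path, encoding='utf-8', status='replace', action='write')
        write(u, '(a)') char(233, kind=ucs4) // ucs4_'t' // char(233, kind=ucs4)
        close(u)
    end subroutine write_sample

    ! Length of the first line when decoded into an ISO 10646 string:
    ! one element per character.
    function count_chars(path) result(n)
        character(len=*), intent(in) :: path
        integer :: n, u
        character(kind=ucs4, len=80) :: wide

        open(newunit=u, file=path, encoding='utf-8', status='old', action='read')
        read(u, '(a)') wide
        close(u)
        n = len_trim(wide)
    end function count_chars

    ! Length of the first line when read undecoded into a default-kind
    ! string: one element per byte of the UTF-8 encoding.
    function count_bytes(path) result(n)
        character(len=*), intent(in) :: path
        integer :: n, u
        character(len=80) :: narrow

        open(newunit=u, file=path, status='old', action='read')
        read(u, '(a)') narrow
        close(u)
        n = len_trim(narrow)
    end function count_bytes

end module utf8_file_demo

program utf8_roundtrip
    use utf8_file_demo, only: write_sample, count_chars, count_bytes
    implicit none
    character(len=*), parameter :: path = 'utf_demo.txt'  ! hypothetical file name
    integer :: nchars, nbytes

    call write_sample(path)
    nchars = count_chars(path)
    nbytes = count_bytes(path)

    print '(A,I0)', 'characters: ', nchars
    print '(A,I0)', 'bytes:      ', nbytes
end program utf8_roundtrip
```

The word été is three characters but five bytes in UTF-8, since each é takes two bytes. That difference is what matters as soon as a program indexes into the string or takes substrings.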
