Encoding and decoding unicode from bytes

rbitr · March 10, 2024, 7:46pm

Edit, hopefully for clarity:

Is there a better way than the following to convert from bytes to a character string, for Unicode characters?

integer(1), allocatable :: bytes(:)
...
open(UNIT=5, FILE="tempfile.xxx", FORM="UNFORMATTED",&
                &ACCESS="STREAM", STATUS="REPLACE", POSITION="REWIND", ACTION="WRITE")
        write(5) bytes
        close(5)

        allocate(character(len=j-1) :: tmp_str)

        open(UNIT=5, FILE="tempfile.xxx", FORM="UNFORMATTED",&
                &ACCESS="STREAM", STATUS="OLD", POSITION="REWIND", ACTION="READ")
        read(5) tmp_str
        close(5)

Original Post:
I’m working on a Fortran implementation of byte-level character encoding along the lines of this Python example: gpt-2/src/encoder.py at master · openai/gpt-2 · GitHub

Unicode strings are broken into bytes, and these bytes are then encoded into (Unicode) characters according to a dictionary.
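
To make the first step concrete, here is a minimal sketch of breaking a string into bytes with the transfer intrinsic. The module and function names are hypothetical and not taken from the gist. It assumes that one default-kind character occupies one byte and that the string already holds UTF-8, which is what gfortran gives for a literal in a UTF-8 source file. It uses int8 from iso_fortran_env in place of integer(1).

```fortran
module string_bytes_sketch
  use iso_fortran_env, only: int8
  implicit none
contains
  ! Split a default-kind string into its raw bytes.
  ! Assumption: one default character == one byte, contents are UTF-8.
  function string_to_bytes(str) result(bytes)
    character(len=*), intent(in) :: str
    integer(int8), allocatable :: bytes(:)
    ! With an array mold and no size argument, transfer returns just
    ! enough elements to hold the source, i.e. len(str) bytes here.
    bytes = transfer(str, [0_int8])
  end function string_to_bytes

  ! Map a signed int8 onto 0..255 so it can index a 256-entry dictionary.
  elemental function byte_value(b) result(v)
    integer(int8), intent(in) :: b
    integer :: v
    v = iand(int(b), 255)
  end function byte_value
end module string_bytes_sketch
```

Bytes above 127 come back negative in a signed integer(1), so byte_value is there for the dictionary lookup; the offset into the dictionary itself (0-based or 1-based) is left to the caller.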

I’ve implemented encoding and decoding, but it has been really cumbersome. I’m interested in any hints that could streamline this, and I’m wondering in particular about the decoding: the only way I found to turn an array of bytes back into a Unicode string was to write the bytes to a file and read them back again. Is there a better way to do this? See below, and thanks! The code compiles and runs with gfortran-13.
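
On the decoding question, one candidate that might avoid the temporary file is again the transfer intrinsic, which reinterprets the storage of its first argument as the type of its second. The following is only a sketch under assumptions: the module and function names are hypothetical, the target is a default-kind string that simply carries the UTF-8 bytes (not an ISO_10646 kind=4 string), and one default character is taken to be one byte.

```fortran
module byte_string_sketch
  use iso_fortran_env, only: int8
  implicit none
contains
  ! Reinterpret an array of bytes as a default-kind character string.
  ! Assumption: one default character == one byte; no UTF-8 validation is done.
  function bytes_to_string(bytes) result(str)
    integer(int8), intent(in) :: bytes(:)
    character(len=:), allocatable :: str
    ! Allocate first so the mold carries the right length.
    allocate(character(len=size(bytes)) :: str)
    ! A scalar mold with no size argument yields a scalar of the mold's
    ! type and length, filled with the bit pattern of the source.
    str = transfer(bytes, str)
  end function bytes_to_string
end module byte_string_sketch
```

With this, the write/read round trip could in principle be replaced by something like tmp_str = bytes_to_string(bytes), passing whichever slice of the byte array holds the decoded data. Whether the result prints correctly still depends on the terminal expecting UTF-8.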

https://gist.github.com/rbitr/3caffae3fcd4e7b116a04629621adb57

pt.f90

program pretokenize

        character(:), allocatable :: result, orig
        character(:), dimension(:), allocatable :: c_encoding
        integer :: j
        
        c_encoding = make_encoding()
        

        result = pre_tokenize('Andy99아마')

(This file has been truncated; the full source is in the gist linked above.)
