Edit, hopefully for clarity:
Is there a better way to convert from bytes to a character string, for Unicode characters, than the file round trip below:
integer(1), allocatable :: bytes(:)
...
! Write the raw bytes to a scratch file as an unformatted stream.
open(UNIT=5, FILE="tempfile.xxx", FORM="UNFORMATTED",&
&ACCESS="STREAM", STATUS="REPLACE", POSITION="REWIND", ACTION="WRITE")
write(5) bytes
close(5)
! Read the same bytes back into a character string of matching length.
allocate(character(len=j-1) :: tmp_str)
open(UNIT=5, FILE="tempfile.xxx", FORM="UNFORMATTED",&
&ACCESS="STREAM", STATUS="OLD", POSITION="REWIND", ACTION="READ")
read(5) tmp_str
close(5)
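A possible in-memory alternative to the file round trip above, as a sketch under some assumptions: the transfer intrinsic can reinterpret the byte array as a default-kind character string without touching the file system. The program name, the sample bytes, and the use of int8 from iso_fortran_env (in place of integer(1)) are made up for illustration. The sketch also assumes one byte per default character, which holds for gfortran but is not guaranteed by the standard.

```fortran
program bytes_to_string_sketch
   ! Hypothetical example: names and sample data are illustrative only.
   use, intrinsic :: iso_fortran_env, only: int8
   implicit none
   integer(int8), allocatable :: bytes(:), back(:)
   character(len=:), allocatable :: tmp_str

   ! UTF-8 encoding of "h" followed by "e with acute accent": 0x68, 0xC3, 0xA9.
   ! Values above 127 are written as signed int8 (195-256 = -61, 169-256 = -87).
   bytes = [104_int8, -61_int8, -87_int8]

   ! Allocate a string with one character per byte, then copy the bit
   ! pattern across. Assumes a default character occupies exactly one byte.
   allocate(character(len=size(bytes)) :: tmp_str)
   tmp_str = transfer(bytes, tmp_str)

   ! Shows the two characters only if the terminal interprets UTF-8.
   print '(a)', tmp_str

   ! Reverse direction: reinterpret the string as an array of bytes.
   back = transfer(tmp_str, [0_int8])
   print '(l1)', all(back == bytes)
end program bytes_to_string_sketch
```

Note that the result is still a default-kind string holding raw UTF-8 bytes, the same thing the file read produces, not a character(kind=ucs4) string, so len(tmp_str) counts bytes rather than Unicode code points.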
Original Post:
I’m working on a Fortran implementation of byte-level character encoding along the lines of this Python example: gpt-2/src/encoder.py at master · openai/gpt-2 · GitHub
Unicode strings are broken into bytes, and these bytes are then encoded into (Unicode) characters according to a dictionary.
I’ve implemented encoding and decoding, but it’s been really cumbersome. I’m interested in any hints that could streamline this, and I’m wondering about the decoding in particular: the only way I found to turn an array of bytes back into a Unicode string was to write it to a file and read it back again. Is there a better way to do this? See below and thanks! Compiles and runs with gfortran-13.