Curious increase in file data after upgrade

garynewport · December 4, 2022, 11:32am

I have recently upgraded my PC; well, built a new one. I then installed Windows 11 and MinGW-w64, using GCC 12.2.0.

I ran the same code that I have previously run under my old Windows installation, on macOS and on Linux (Fedora), and the resultant file is double the size.

The curious thing is that when I open the file up in a text editor, it is exactly the same as before. Obviously, there is a new set of characters appearing but I cannot seem to find them at all.

I have now run two versions of my code, in case changes I made have generated the error, but no. I’m unsure if I had MinGW-w64 as my GCC installation before, so I am assuming it is something about this version of GCC?

Any ideas, anyone?

msz59 · December 4, 2022, 2:33pm

Windows used to have UTF-16 encoding for text files, so the file size was roughly twice the number of characters. From what I checked on my wife’s laptop with W11, it is rather UTF-8 now, so it should make the files smaller, opposite of what you see. I don’t know/use Windows so no expertise, but maybe it is somehow configurable.

JohnCampbell · December 5, 2022, 3:49am

I am not aware of this being the case.

My experience is that Windows (up to Win 10) uses UTF-16 for file names (2-byte coding), but UTF-8 for text files (i.e. 1-byte coding with 7-bit ASCII text plus sign bit extension).
I have never selected a 2-byte character kind in Fortran so am not aware of either UTF-8 or UTF-16 extension to the ASCII character set I use in Windows.
I have never observed text files containing a 2-byte encoding, while I have received and reviewed a variety of files from third parties. I regularly receive files with UTF-16 file names, mainly .pdf files.

Changing the default character KIND in GCC 12.2.0 would be a significant issue and I would expect a flurry of complaints.

msz59 · December 5, 2022, 4:51pm

You’re probably right, my fault. I’ve probably confused the file names with the text content. I wonder when they introduced UTF-8, as the oldest versions had single-byte encodings, all those Windows-LatinN ones, leading to all sorts of confusion.

garynewport · December 6, 2022, 10:10am

msz59 was not too far from the truth; it’s to do with using PowerShell (which I wasn’t initially aware I was using).

PowerShell does not write ASCII (or UTF-8) but Unicode (UTF-16) when creating files, so each ASCII character takes two bytes instead of one; hence the doubling of the file size!
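
To illustrate the size difference, here is a minimal Python sketch. The sample text is made up for illustration and is not the actual program output; it simply encodes the same ASCII-only string both ways and compares the byte counts.

```python
# Hypothetical sample: 1000 lines of plain ASCII numbers with Windows line endings.
text = "1.0 2.0 3.0\r\n" * 1000

utf8_bytes = text.encode("utf-8")    # one byte per ASCII character
utf16_bytes = text.encode("utf-16")  # two bytes per character, plus a 2-byte BOM

print(len(utf8_bytes), len(utf16_bytes))
```

The UTF-16 version is twice the size plus two bytes for the byte order mark, and a text editor that understands the mark will display both identically.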

At least I am not suffering from additional characters, etc. Unfortunately, Python appears to be reading the file in as 8-bit, not recognising the change in character set. I have a fix.
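
For anyone hitting the same thing, one possible way to handle it on the Python side is sketched below. This is not necessarily the fix mentioned above; the helper names `detect_encoding` and `read_text` are hypothetical, and the sketch assumes the file is either UTF-16 with a byte order mark or plain ASCII/UTF-8.

```python
import codecs


def detect_encoding(path):
    """Guess the text encoding from the first two bytes (hypothetical helper)."""
    with open(path, "rb") as f:
        head = f.read(2)
    # A UTF-16 file written with a byte order mark starts with FF FE or FE FF.
    if head in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        return "utf-16"
    # Assumption: anything else is ASCII/UTF-8; "utf-8-sig" also skips a UTF-8 BOM.
    return "utf-8-sig"


def read_text(path):
    """Read the whole file as text using the detected encoding."""
    with open(path, encoding=detect_encoding(path)) as f:
        return f.read()
```

If the encoding is already known, passing `encoding="utf-16"` directly to `open` is enough; the detection step is only useful when the same script has to read files produced on different systems.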
