A while back, David Heffernan started a few G+ threads on decoding big text files into strings split on line endings:
- [WayBack] Having been a little underwhelmed by the performance of TStreamReader when reading huge text files line by line, I attempted to roll my own. I managed … – David Heffernan – Google+, where he compares the speed with Python (which runs circles around the equivalent Delphi code)
- [WayBack] I just read this in TEncoding.GetBufferEncoding: function ContainsPreamble(const Buffer, Signature: array of Byte): Boolean; var I: Integer; … – David Heffernan – Google+ (worse: Delphi 2009 and up have 3 different implementations of the ContainsPreamble function; see the sketch below)
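Such a preamble check only has to verify whether a buffer starts with an encoding's byte order mark. A minimal sketch of that idea (my own illustration, not the RTL source):

```pascal
// Minimal sketch, not the RTL source: does Buffer start with Signature
// (the byte order mark of some encoding)?
function ContainsPreamble(const Buffer, Signature: array of Byte): Boolean;
var
  I: Integer;
begin
  Result := Length(Signature) > 0;
  if Length(Buffer) < Length(Signature) then
    Exit(False);
  for I := 0 to High(Signature) do
    if Buffer[I] <> Signature[I] then
      Exit(False);
end;
```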
Code comparison:

Python:

```python
with open(filename, 'r', encoding='utf-16-le') as f:
    for line in f:
        pass
```

Delphi:

```pascal
for Line in TLineReader.FromFile(filename, TEncoding.Unicode) do
  ;
```
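TLineReader is not part of the RTL (more on that below). For comparison, the plain RTL route with TStreamReader, the code path that turned out to be underwhelming, looks roughly like this; a sketch, assuming a UTF-16 little-endian file:

```pascal
uses
  System.SysUtils, System.Classes;

procedure ReadAllLines(const FileName: string);
var
  Reader: TStreamReader;
  Line: string;
begin
  // TStreamReader decodes the UTF-16 (TEncoding.Unicode) file line by line;
  // the decoding, not the disk I/O, is where most of the time goes.
  Reader := TStreamReader.Create(FileName, TEncoding.Unicode);
  try
    while not Reader.EndOfStream do
      Line := Reader.ReadLine; // mirrors the Python/TLineReader "do nothing" loop
  finally
    Reader.Free;
  end;
end;
```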
This spurred some nice observations and unfounded statements on which encodings should be used, so I posted a bit of history that is included below.
Some tips and observations from the links:
- Good old text files do not play well with Unicode, and neither do TextFile Device Drivers: nobody has written a driver that supports a wide range of encodings yet.
- Good old text files are slow as well, even with a larger buffer set via SetTextBuf (see the TextFile sketch after this list)
- When using TStreamReader, the decoding takes much more time than the actual reading, which means that [WayBack] Faster FileStream with TBufferedFileStream • DelphiABall does not help much
- TStringList.LoadFromFile, though fast, is a memory allocation dork and has limits on string size
- Delphi RTL code is not what it used to be: the RTL code from before the Unicode era (pre-Delphi 2009) is of far better quality than the RTL code in Delphi 2009 and up
- Supporting various encodings is important
- EBCDIC days: three kinds of spaces, two kinds of hyphens, multiple codepages
- Strings are just that: strings. It is the encoding from and to the file that needs to be optimal.
- When processing large files, caching only makes sense when the file fits in memory. Otherwise caching just adds overhead.
- On Windows, when you read a big text file into memory, open the file in “sequential read” mode so the cache manager optimizes for sequential access (reading ahead and quickly releasing pages you have already read). Use the FILE_FLAG_SEQUENTIAL_SCAN flag, as explained in [WayBack] How do FILE_FLAG_SEQUENTIAL_SCAN and FILE_FLAG_RANDOM_ACCESS affect how the operating system treats my file? – The Old New Thing (see the sequential-read sketch after this list)
- Python string reading depends on the way you read files (ASCII or Unicode); see [WayBack] unicode – Python codecs line ending – Stack Overflow
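To illustrate the TextFile remarks above: a minimal sketch with an enlarged buffer via SetTextBuf (the default TextFile buffer is a mere 128 bytes). This stays byte-oriented, so it does nothing for encodings:

```pascal
uses
  System.SysUtils;

procedure ReadWithTextFile(const FileName: string);
var
  F: TextFile;
  Buf: array[0..64 * 1024 - 1] of Byte; // 64 KiB instead of the default 128 bytes
  Line: string;
begin
  AssignFile(F, FileName);
  SetTextBuf(F, Buf); // install the larger buffer before Reset opens the file
  Reset(F);
  try
    while not Eof(F) do
      ReadLn(F, Line);
  finally
    CloseFile(F);
  end;
end;
```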
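And the sequential-read advice, sketched by combining a FILE_FLAG_SEQUENTIAL_SCAN handle with THandleStream and TStreamReader; this shows the structure, it is not a benchmarked implementation:

```pascal
uses
  Winapi.Windows, System.SysUtils, System.Classes;

procedure ReadSequentially(const FileName: string);
var
  Handle: THandle;
  Stream: THandleStream;
  Reader: TStreamReader;
  Line: string;
begin
  // Hint the Windows cache manager that we read the file front to back.
  Handle := CreateFile(PChar(FileName), GENERIC_READ, FILE_SHARE_READ, nil,
    OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, 0);
  if Handle = INVALID_HANDLE_VALUE then
    RaiseLastOSError;
  Stream := THandleStream.Create(Handle); // does not take ownership of the handle
  try
    Reader := TStreamReader.Create(Stream, TEncoding.Unicode);
    try
      while not Reader.EndOfStream do
        Line := Reader.ReadLine;
    finally
      Reader.Free;
    end;
  finally
    Stream.Free;
    CloseHandle(Handle);
  end;
end;
```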
Though TLineReader is not part of the RTL, I think it is from [WayBack] For-in Enumeration – ADUG.
Encodings in use
It doesn’t help that on the Windows Console, various encodings are used:
- Most tools still use Code Page 437 (from the good old DOS days; also known as CP437, OEM-US, OEM 437, PC-8, or DOS Latin US).
- Unicode aware applications often use UTF-8
- A minority of tools (growing, though, because of PowerShell: [WayBack] utf 8 – Changing PowerShell’s default output encoding to UTF-8 – Stack Overflow) uses UTF-16, because Windows Unicode support started with UCS-2 and its little-endian in-memory representation; examples are .REG files, MSXML, and SQL Server Management Studio
Good reading here is [WayBack] c++ – What unicode encoding (UTF-8, UTF-16, other) does Windows use for its Unicode data types? – Stack Overflow
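To see that mess from a console application: the active output code page can be queried and switched through the Win32 API. A small sketch; 437 and 65001 are the code pages mentioned above:

```pascal
uses
  Winapi.Windows, System.SysUtils;

procedure SwitchConsoleToUtf8;
var
  OldCP: UINT;
begin
  OldCP := GetConsoleOutputCP;            // typically 437 (OEM-US) on US-English systems
  Writeln(Format('Current console output code page: %d', [OldCP]));
  if not SetConsoleOutputCP(CP_UTF8) then // CP_UTF8 = 65001
    RaiseLastOSError;
  Writeln('Console output switched to UTF-8 (65001).');
end;
```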
Encoding history
+A. Bouchez I’m with +David Heffernan here:
At its release in 1993, Windows NT was very early in supporting Unicode. Development of Windows NT started in 1990, when they opted for UCS-2, with 2 bytes per character; at that point UTF-1 only existed as a non-required annex of the draft standard, and UTF-8, which later superseded UTF-1, did not exist at all. Even UCS-2 was still young: it was designed in 1989. UTF-8 was outlined late 1992 and became a standard in 1993.
–jeroen