A while back, David Heffernan started a few G+ threads on decoding big text files into strings split on line endings:
- [WayBack] Having been a little underwhelmed by the performance of TStreamReader when reading huge text files line by line, I attempted to roll my own. I managed … – David Heffernan – Google+, where he compares the speed with Python (which runs circles around the equivalent Delphi code)
- [WayBack] I just read this in TEncoding.GetBufferEncoding: function ContainsPreamble(const Buffer, Signature: array of Byte): Boolean; var I: Integer; … – David Heffernan – Google+ (worse: Delphi 2009 and up have 3 different implementations of the ContainsPreamble function; see the sketch below)
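Such a preamble check only has to verify whether a buffer starts with an encoding's byte order mark. A minimal sketch of that idea (my own illustration, not the RTL source):

```pascal
// Minimal sketch, not the RTL source: does Buffer start with Signature
// (the byte order mark of some encoding)?
function ContainsPreamble(const Buffer, Signature: array of Byte): Boolean;
var
  I: Integer;
begin
  Result := Length(Signature) > 0;
  if Length(Buffer) < Length(Signature) then
    Exit(False);
  for I := 0 to High(Signature) do
    if Buffer[I] <> Signature[I] then
      Exit(False);
end;
```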
Code comparison:

Python:

```python
with open(filename, 'r', encoding='utf-16-le') as f:
    for line in f:
        pass
```

Delphi:

```pascal
for Line in TLineReader.FromFile(filename, TEncoding.Unicode) do
  ;
```
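TLineReader is not part of the RTL (more on that below). For comparison, the plain RTL route with TStreamReader, the code path that turned out to be underwhelming, looks roughly like this; a sketch, assuming a UTF-16 little-endian file:

```pascal
uses
  System.SysUtils, System.Classes;

procedure ReadAllLines(const FileName: string);
var
  Reader: TStreamReader;
  Line: string;
begin
  // TStreamReader decodes the UTF-16 (TEncoding.Unicode) file line by line;
  // the decoding, not the disk I/O, is where most of the time goes.
  Reader := TStreamReader.Create(FileName, TEncoding.Unicode);
  try
    while not Reader.EndOfStream do
      Line := Reader.ReadLine; // mirrors the Python/TLineReader "do nothing" loop
  finally
    Reader.Free;
  end;
end;
```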
This spurred some nice observations and unfounded statements on which encodings should be used, so I posted a bit of history that is included below.
Some tips and observations from the links:
- Good old text files do not play well with Unicode, and neither do TextFile Device Drivers: nobody has written a driver that supports a wide range of encodings yet.
- Good old text files are slow as well, even with a larger buffer set via SetTextBuf (see the TextFile sketch after this list)
- When using TStreamReader, the decoding takes much more time than the actual reading, which means that [WayBack] Faster FileStream with TBufferedFileStream • DelphiABall does not help much
- TStringList.LoadFromFile, though fast, is a memory allocation dork and has limits on string size
- Delphi RTL code is not what it used to be: the RTL code from before the Unicode era (pre-Delphi 2009) is of far better quality than the RTL code in Delphi 2009 and up
- Supporting various encodings is important
- EBCDIC days: three kinds of spaces, two kinds of hyphens, multiple codepages
- Strings are just that: strings. It is the encoding from and to the file that needs to be optimal.
- When processing large files, caching only makes sense when the file fits in memory. Otherwise caching just adds overhead.
- On Windows, when you read a big text file into memory, open the file in “sequential read” mode so the cache manager optimizes for sequential access (reading ahead and quickly releasing pages you have already read). Use the FILE_FLAG_SEQUENTIAL_SCAN flag, as explained in [WayBack] How do FILE_FLAG_SEQUENTIAL_SCAN and FILE_FLAG_RANDOM_ACCESS affect how the operating system treats my file? – The Old New Thing (see the sequential-read sketch after this list)
- Python string reading depends on the way you read files (ASCII or Unicode); see [WayBack] unicode – Python codecs line ending – Stack Overflow
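To illustrate the TextFile remarks above: a minimal sketch with an enlarged buffer via SetTextBuf (the default TextFile buffer is a mere 128 bytes). This stays byte-oriented, so it does nothing for encodings:

```pascal
uses
  System.SysUtils;

procedure ReadWithTextFile(const FileName: string);
var
  F: TextFile;
  Buf: array[0..64 * 1024 - 1] of Byte; // 64 KiB instead of the default 128 bytes
  Line: string;
begin
  AssignFile(F, FileName);
  SetTextBuf(F, Buf); // install the larger buffer before Reset opens the file
  Reset(F);
  try
    while not Eof(F) do
      ReadLn(F, Line);
  finally
    CloseFile(F);
  end;
end;
```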
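And the sequential-read advice, sketched by combining a FILE_FLAG_SEQUENTIAL_SCAN handle with THandleStream and TStreamReader; this shows the structure, it is not a benchmarked implementation:

```pascal
uses
  Winapi.Windows, System.SysUtils, System.Classes;

procedure ReadSequentially(const FileName: string);
var
  Handle: THandle;
  Stream: THandleStream;
  Reader: TStreamReader;
  Line: string;
begin
  // Hint the Windows cache manager that we read the file front to back.
  Handle := CreateFile(PChar(FileName), GENERIC_READ, FILE_SHARE_READ, nil,
    OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, 0);
  if Handle = INVALID_HANDLE_VALUE then
    RaiseLastOSError;
  Stream := THandleStream.Create(Handle); // does not take ownership of the handle
  try
    Reader := TStreamReader.Create(Stream, TEncoding.Unicode);
    try
      while not Reader.EndOfStream do
        Line := Reader.ReadLine;
    finally
      Reader.Free;
    end;
  finally
    Stream.Free;
    CloseHandle(Handle);
  end;
end;
```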
Though TLineReader is not part of the RTL, I think it is from [WayBack] For-in Enumeration – ADUG.
Encodings in use
It doesn’t help that on the Windows Console, various encodings are used:
- Most tools still use Code Page 437 (from the good old DOS days; also known as CP437, OEM-US, OEM 437, PC-8, or DOS Latin US).
- Unicode aware applications often use UTF-8
- A minority of tools (growing, though, because of PowerShell: [WayBack] utf 8 – Changing PowerShell’s default output encoding to UTF-8 – Stack Overflow) uses UTF-16, because Windows Unicode support started with UCS-2 and its little-endian in-memory representation; examples are .REG files, MSXML, and SQL Server Management Studio
Good reading here is [WayBack] c++ – What unicode encoding (UTF-8, UTF-16, other) does Windows use for its Unicode data types? – Stack Overflow
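To see that mess from a console application: the active output code page can be queried and switched through the Win32 API. A small sketch; 437 and 65001 are the code pages mentioned above:

```pascal
uses
  Winapi.Windows, System.SysUtils;

procedure SwitchConsoleToUtf8;
var
  OldCP: UINT;
begin
  OldCP := GetConsoleOutputCP;            // typically 437 (OEM-US) on US-English systems
  Writeln(Format('Current console output code page: %d', [OldCP]));
  if not SetConsoleOutputCP(CP_UTF8) then // CP_UTF8 = 65001
    RaiseLastOSError;
  Writeln('Console output switched to UTF-8 (65001).');
end;
```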
Encoding history
+A. Bouchez I’m with +David Heffernan here:
At its release in 1993, Windows NT was very early in supporting Unicode. Development of Windows NT started in 1990, when they opted for UCS-2, with 2 bytes per character; at that point UTF-1 only existed as a non-required annex of the draft standard, and UTF-8, which later superseded UTF-1, did not exist at all. Even UCS-2 was still young: it was designed in 1989. UTF-8 was outlined late 1992 and became a standard in 1993.
–jeroen