I wrote a tool that merges different text files (the files are small). The files can be ANSI (Latin-1) or UTF-8, with or without a BOM. For files with a BOM, Delphi detects the charset correctly, but for files without a BOM I have to do some hackery to detect the charset (see GetFileCharset).
In the following Delphi code, I get 2 warnings (see the comments at the end of the affected lines):
    uses
      WideStrUtils;

    function GetFileCharset(const Filename: String): TEncoding;
    var
      StreamReader: TStreamReader;
      FallbackEncoding: TEncoding;
      CurrLine: String;
    begin
      FallbackEncoding := TEncoding.ANSI;
      try
        StreamReader := TStreamReader.Create(Filename, FallbackEncoding, True);
        try
          Result := StreamReader.CurrentEncoding;
          if StreamReader.CurrentEncoding = FallbackEncoding then
          begin
            while not StreamReader.EndOfStream do
            begin
              CurrLine := StreamReader.ReadLine;
              if IsUTF8String(CurrLine) then //[dcc32 Warning]: W1058 Implicit string cast with potential data loss from 'string' to 'RawByteString'
              begin
                Result := TEncoding.UTF8;
                break;
              end;
            end;
          end;
        finally
          StreamReader.Close;
          StreamReader.Free;
        end;
      except on E : Exception do
        Result := FallbackEncoding;
      end;
    end;

    StreamWriter := TStreamWriter.Create(OutputFile, False, TEncoding.UTF8);
    try
      StreamReader := TStreamReader.Create(InputFile, GetFileCharset(CurrFile), True);
      try
        while not StreamReader.EndOfStream do
          StreamWriter.WriteLine(UTF8Encode(StreamReader.ReadLine)); //[dcc32 Warning]: W1057 Implicit string cast from 'RawByteString' to 'string'
      finally
        StreamReader.Close;
        StreamReader.Free;
      end;
    finally
      StreamWriter.Close;
      StreamWriter.Free;
    end;
#1 For the Implicit string cast warning I can easily do:
    StreamWriter.WriteLine(String(UTF8Encode(StreamReader.ReadLine)));
but I'm wondering if there is a better way or if there is potential danger here?
#2 For the Implicit string cast with potential data loss, I'm not sure how to safely fix this.
#3 Is there a better way to detect the file charset than what I did?
For the 1st warning:
StreamReader.ReadLine() returns a UTF-16 UnicodeString that has already been charset-decoded using the reader's assigned Encoding (which in your case will always be TEncoding.ANSI). So any encoding details about a line have already been lost before you ever see that line. IsUTF8String() takes in a RawByteString and returns whether its 8-bit characters are encoded in UTF-8 or not. It is useless for a 16-bit UnicodeString.
You are getting an implicit conversion when calling IsUTF8String() with a 16-bit string instead of an 8-bit string. IsUTF8String() will not return True in your situation, as a UnicodeString-to-RawByteString conversion will not produce a UTF-8 string (unless you manually set System.DefaultSystemCodePage to 65001, aka CP_UTF8, beforehand). So, you can simply get rid of this test altogether from your code.
For what you are attempting, you need to analyze the raw bytes of the file, not the decoded characters from StreamReader.ReadLine(). Unfortunately, the RTL does not provide a class that can read a file line-by-line as 8-bit strings, so you are going to have to read and parse the file bytes yourself.
Also, ASCII is a subset of both ANSI and UTF-8, and ASCII characters are encoded exactly the same in both, so you would need to analyze the whole file (or at least until you encounter a non-ASCII character) in order to determine the actual charset. Even then, TEncoding.ANSI will only match the OS user's default locale, so you may end up not properly detecting the file's real encoding if it is using a different non-UTF locale.
You are best off using a pre-existing 3rd party library that handles this kind of detection for you.
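If a third-party library is not an option, the byte-level check described above could look roughly like the sketch below. It is only an illustration under some assumptions: GuessEncodingFromBytes is a hypothetical helper name, the file is small enough to load fully with TFile.ReadAllBytes, the RTL version offers the three-argument TEncoding.GetBufferEncoding overload, and anything that is not valid UTF-8 is treated as the user's ANSI code page (which may still be the wrong guess for files written under another locale).

```delphi
program DetectCharsetSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils, IOUtils, WideStrUtils;

// Hypothetical helper (not from the original post): guesses the encoding
// from the raw, undecoded bytes of a file.
function GuessEncodingFromBytes(const Bytes: TBytes): TEncoding;
var
  Detected: TEncoding;
  Raw: RawByteString;
begin
  // 1. A BOM, if present, settles the question. Detected must be nil on
  //    entry so that GetBufferEncoding looks for any known preamble.
  Detected := nil;
  if TEncoding.GetBufferEncoding(Bytes, Detected, TEncoding.ANSI) > 0 then
    Exit(Detected);

  // 2. No BOM: copy the bytes into an 8-bit string without any charset
  //    conversion and let DetectUTF8Encoding scan them.
  SetString(Raw, PAnsiChar(Bytes), Length(Bytes));
  if DetectUTF8Encoding(Raw) = etUTF8 then
    Result := TEncoding.UTF8
  else
    // etUSASCII or etANSI. Pure ASCII decodes the same either way, so
    // falling back to ANSI is an assumption, not a detection.
    Result := TEncoding.ANSI;
end;

var
  Enc: TEncoding;
begin
  if ParamCount < 1 then
  begin
    Writeln('Usage: DetectCharsetSketch <file>');
    Exit;
  end;

  Enc := GuessEncodingFromBytes(TFile.ReadAllBytes(ParamStr(1)));
  if Enc = TEncoding.UTF8 then
    Writeln('UTF-8')
  else if Enc = TEncoding.ANSI then
    Writeln('ANSI (or pure ASCII)')
  else
    Writeln('Other BOM-marked encoding');
end.
```

The returned TEncoding can then be passed to the TStreamReader constructor in place of the GetFileCharset result. Because the check runs on bytes rather than on a decoded string, no implicit string cast is involved.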
For the 2nd warning:
StreamWriter.WriteLine() takes in a UTF-16 UnicodeString as input, but you are passing it a UTF-8 RawByteString instead. Your workaround just makes that conversion explicit, but doesn't change the outcome. While this is a lossless conversion, you don't actually need the UTF8Encode() at all. Just give StreamWriter the UnicodeString as-is and let it handle the conversion to UTF-8 for you. After all, that is what you asked it to do, by giving it TEncoding.UTF8 in its constructor.
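As a minimal sketch of that advice, the copy loop can pass each line straight through. CopyLinesAsUTF8 is a hypothetical wrapper name, and the TEncoding.ANSI argument in the main block is only a placeholder for whatever encoding was detected for the input file.

```delphi
program MergeToUtf8Sketch;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical wrapper: copies InputFile to OutputFile line by line.
// The reader decodes bytes to UnicodeString using InputEncoding, and the
// writer encodes UnicodeString to UTF-8, so no manual conversion is needed.
procedure CopyLinesAsUTF8(const InputFile, OutputFile: string;
  InputEncoding: TEncoding);
var
  Reader: TStreamReader;
  Writer: TStreamWriter;
begin
  Writer := TStreamWriter.Create(OutputFile, False, TEncoding.UTF8);
  try
    Reader := TStreamReader.Create(InputFile, InputEncoding, True);
    try
      while not Reader.EndOfStream do
        Writer.WriteLine(Reader.ReadLine); // no UTF8Encode, no warning
    finally
      Reader.Free;
    end;
  finally
    Writer.Free;
  end;
end;

begin
  if ParamCount = 2 then
    CopyLinesAsUTF8(ParamStr(1), ParamStr(2), TEncoding.ANSI)
  else
    Writeln('Usage: MergeToUtf8Sketch <input> <output>');
end.
```

With the UnicodeString handed over unchanged, both W1057 and the explicit String() cast go away, since the only encoding step left is the one the writer performs itself.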