There are many questions on getting the file size of an std::fstream's file, but they all return the file size in bytes and are error-prone if the file is open in another stream.
I want to know the file size in codepoints, not bytes.
Now std::fstream::seekg(0,std::ios::end) followed by std::fstream::tellg() only returns the length in bytes. This doesn't tell me how many UTF-16/32 characters are in the file. "Divide the result by sizeof(wchar_t)", I hear you say. That doesn't work for UTF-8 files and IS NOT portable.
Now, for the more technically minded, I have imbued the stream with my own std::codecvt class. std::codecvt has a member length() which, given two pointers into the stream, calculates the length and returns either max or the number of output characters. I would have thought that seeking on the file would seek by codecvt::intern_type rather than by the base char type.
I've looked into the fstream header and found that seek in fact doesn't use the codecvt. And, in my version from VS2010, the codecvt::length() member is not even mentioned. In fact, on each call to codecvt::in(), a new string object is created and increased in size by 1 char each time in() returns partial. It doesn't call the codecvt::max_length() member and supply in() with an adequate buffer instead.
Is this just my implementation or can I expect others to do the same? Has std::fstream been rewritten for VS2012 to make full use of locales?
Basically, I'm fed up of having to write my own file handlers every time I use text files. I'm hoping to create an fstream-derived class that will first read a file's BOM, if present, and imbue the correct codecvt, then convert those characters to char, wchar_t or whatever the code calls for. I'm also hoping to code it in such a way that if the encoding is known in advance, a locale can be specified on construction.
Would I be better off working directly on the internal buffer, in effect rewriting the fstream class, or are there some tricks I'm unaware of?
If I understand you right, you expect that `std::fstream::seekg` (which by inheritance is `basic_istream<CharT,Traits>::seekg`) ought to perform the stream-positioning operation in units that are the `intern_type` of whatever `codecvt` with which the stream is imbued.

Template `basic_istream` is declared as `template<class CharT, class Traits = std::char_traits<CharT>> class basic_istream;`. In the declaration of the member function `basic_istream& seekg(pos_type pos);`, `pos_type` is `Traits::pos_type`, that is, `std::char_traits<CharT>::pos_type` with the default `Traits`. It is therefore a type determined in any implementation solely by the `CharT` template argument of the `basic_istream` class and without reference to any `codecvt`.

A `basic_fstream<char>`, for instance, remains a `basic_fstream<char>`, and its `pos_type` remains `basic_fstream<char>::pos_type`, regardless of the encoding that is chosen to read or write it.

The declarations above are respectively as per C++11 Standard § 27.7.1 and § 27.7.2.1. The fact that `pos_type` is invariant with respect to any imbued `codecvt`, and hence also the behaviour of `seekg(pos_type)`, are therefore consequences of the Standard.

Equivalent remarks apply for `basic_istream& seekg(off_type off, std::ios_base::seekdir dir)`.

The `std::codecvt::intern_type` is the type of the elements of the internal sequence to which or from which the specified encoding will translate an external sequence of elements of type `extern_type`. The `intern_type` is the element type of the "in-program" sequence and the `extern_type` is the type of the "in-file" sequence. The `intern_type` has got nothing to do with positioning operations on the file.

If you must find out the size of a file in codepoints, and presuming that the possible encodings of interest are UTF-8, UTF-16 and UTF-32, then for the first two of these you have no choice but to read the entire file, because they are variable-length encodings, with a UTF-8 codepoint consuming 1-4 bytes and a UTF-16 codepoint consuming 2 or 4 bytes. UTF-32 is a fixed-length 4-byte encoding, so in that case you might compute the number of complete codepoints as the byte-length of the file, minus BOM-length if any, divided by 4, if you discount the possibility of encoding errors except at end-of-file.
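As a rough illustration of the UTF-32 arithmetic, here is a minimal sketch. The function names (`utf32_bom_length`, `utf32_code_points`, `utf32_file_code_points`) are hypothetical, invented for this example, and the sketch assumes that `tellg()` after seeking to the end of a binary-mode stream yields the byte length, which is common practice but not something the Standard strictly promises.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: length in bytes of a UTF-32 BOM at the start of
// `head` (the first bytes of the file): 4 if present, otherwise 0.
inline std::size_t utf32_bom_length(const std::string& head)
{
    if (head.size() < 4) return 0;
    const auto b = [&](std::size_t i) { return static_cast<unsigned char>(head[i]); };
    const bool le = b(0) == 0xFF && b(1) == 0xFE && b(2) == 0x00 && b(3) == 0x00;
    const bool be = b(0) == 0x00 && b(1) == 0x00 && b(2) == 0xFE && b(3) == 0xFF;
    return (le || be) ? 4 : 0;
}

// Number of complete 4-byte units after the BOM. A trailing partial unit
// is ignored, and no other encoding errors are detected.
inline std::size_t utf32_code_points(std::size_t byte_length, std::size_t bom_length)
{
    return byte_length < bom_length ? 0 : (byte_length - bom_length) / 4;
}

// Hypothetical convenience wrapper: returns 0 if the file cannot be opened.
inline std::size_t utf32_file_code_points(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in) return 0;
    char head[4] = {};
    in.read(head, 4);
    const std::string first(head, static_cast<std::size_t>(in.gcount()));
    in.clear();                        // a short read sets failbit/eofbit
    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg();
    if (size < 0) return 0;
    return utf32_code_points(static_cast<std::size_t>(size), utf32_bom_length(first));
}
```

A 16-byte file that starts with a BOM would give `utf32_code_points(16, 4)`, i.e. 3 codepoints.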
For the variable-length encodings, the simplest way of counting the codepoints will be with a template function parameterized by an indicator of the presumed encoding. It will read the file, first consuming the BOM, if any, in units of `char` or `char16_t` as appropriate, identifying each unit that is the lead element of a codepoint in the presumed encoding; verifying the presence of the number of subsequent elements required by the lead element, and incrementing the codepoint count if they are found.
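A minimal sketch of such a function follows, under several assumptions. The indicator types `Utf8` and `Utf16` and the function `count_code_points` are hypothetical names made up for this example. To keep it short it works on a buffer that has already been read from the file in binary mode rather than on the stream itself. UTF-16 units are assumed to be in native byte order already. Malformed or truncated sequences are skipped one unit at a time and not counted, which is only one possible policy. Overlong UTF-8 forms and UTF-8-encoded surrogates are not rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical encoding indicator for UTF-8, read in units of char.
struct Utf8 {
    using unit = char;
    // Units in the sequence introduced by lead unit `u`; 0 if `u` is not a lead.
    static std::size_t sequence_length(unit u) {
        const unsigned char c = static_cast<unsigned char>(u);
        if (c < 0x80) return 1;
        if (c >= 0xC2 && c <= 0xDF) return 2;
        if (c >= 0xE0 && c <= 0xEF) return 3;
        if (c >= 0xF0 && c <= 0xF4) return 4;
        return 0;
    }
    static bool is_trail(unit u) {
        return (static_cast<unsigned char>(u) & 0xC0) == 0x80;
    }
    // BOM length in units (EF BB BF), or 0 if absent.
    static std::size_t bom_length(const std::string& s) {
        return s.size() >= 3 &&
               static_cast<unsigned char>(s[0]) == 0xEF &&
               static_cast<unsigned char>(s[1]) == 0xBB &&
               static_cast<unsigned char>(s[2]) == 0xBF ? 3 : 0;
    }
};

// Hypothetical encoding indicator for UTF-16, read in units of char16_t.
struct Utf16 {
    using unit = char16_t;
    static std::size_t sequence_length(unit u) {
        if (u >= 0xD800 && u <= 0xDBFF) return 2;  // high surrogate leads a pair
        if (u >= 0xDC00 && u <= 0xDFFF) return 0;  // low surrogate cannot lead
        return 1;
    }
    static bool is_trail(unit u) { return u >= 0xDC00 && u <= 0xDFFF; }
    static std::size_t bom_length(const std::u16string& s) {
        return !s.empty() && s[0] == 0xFEFF ? 1 : 0;
    }
};

// Counts codepoints in `text`: skip the BOM, find each lead unit, verify that
// the trailing units it requires are present, and count it only if they are.
template <class Enc>
std::size_t count_code_points(const std::basic_string<typename Enc::unit>& text)
{
    std::size_t count = 0;
    std::size_t i = Enc::bom_length(text);
    while (i < text.size()) {
        const std::size_t len = Enc::sequence_length(text[i]);
        bool ok = len != 0 && len <= text.size() - i;
        for (std::size_t k = 1; ok && k < len; ++k)
            ok = Enc::is_trail(text[i + k]);
        if (ok) { ++count; i += len; }
        else    { ++i; }                 // malformed or truncated: skip one unit
    }
    return count;
}
```

It would be called as `count_code_points<Utf8>(bytes)` or `count_code_points<Utf16>(units)`. For a large file the same loop could be fed from a fixed-size read buffer instead of holding the whole file in memory, at the cost of handling sequences that straddle buffer boundaries.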