write every possible char into a file


I want to write every character that exists into a file. I assume Unicode is the most complete character set, but I'm not sure. Can you help me out with this? I'm working in C++. This code seems to write "only" the ASCII set of characters (or am I wrong?). Thanks for the help.

#include <iostream>
#include <fstream>

using namespace std;

int main(void) {

    wofstream wOutStream;
    wOutStream.open("myFile.txt");

    wchar_t myChar = 0;
    do {
        wOutStream << myChar << " ";
        myChar++;
    } while (myChar != 0);

    wOutStream.close();

    cin.get();
    return 0;
}

There is 1 solution below

This is quite an open-ended question, and the exact answer depends on how ambitious you are. So I'm not going to post a program, but just list the basic steps:

  • Unicode assigns each character a number (called a code point). For instance, "A" is assigned the number 65, commonly written in hexadecimal as U+0041. Unicode also defines the character's name and a lot of other properties. For instance, "A" is called "LATIN CAPITAL LETTER A", its lowercase version is "a", it belongs to a left-to-right script, etc.

  • But on its own, Unicode does not specify how a character is written to a file. For that you have to pick an encoding. A common encoding is UTF-8, and it should be easy to find code that encodes a code point to bytes. When you open your text file, your editor also needs to understand that encoding (this shouldn't be a problem for UTF-8).

  • Specifically for C++: when writing UTF-8, I would open a narrow output stream (std::ofstream) and write the bytes myself. C++ itself offers essentially no built-in support for writing Unicode text files, so your program would look roughly like this:

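    // encode_utf8 is a helper you write yourself; outStream is an open std::ofstream
    for (unsigned int codePoint = 0; codePoint < 0x110000; ++codePoint)
    {
        // Surrogates (U+D800 to U+DFFF) are not characters and have no UTF-8 form
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF) continue;
        std::string utf8 = encode_utf8(codePoint);
        outStream << utf8 << " ";
    }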
    

    Maybe add a newline every 256 characters or so.

  • There are 17 planes of 2^16 code points each. Many commonly used characters are in the first plane. You can either print the first plane only (U+0000 to U+FFFF), or print all code points (U+0000 to U+10FFFF). Some of the planes don't have any assigned characters yet.

  • Do you want to print only the assigned code points? In that case you have to download the list from the Unicode Consortium (the Unicode Character Database, e.g. UnicodeData.txt) and parse it. There is no formula that yields the assigned code points. Or, as others pointed out, you can use a language with those tables built in, like Python or Java.

  • And finally, some of the ranges are reserved for private use. You may choose to skip those as well.
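
To make the steps above a bit more concrete, here is a minimal sketch of what the encoding and filtering could look like. It is an illustration under some assumptions, not a definitive implementation: encode_utf8 is the helper name borrowed from the loop above, while is_private_use and dump_code_points are hypothetical names I made up for this example. It skips surrogates (which have no UTF-8 form) and, optionally, the three private-use ranges, but it does not know which code points are actually assigned.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Encode one code point as UTF-8. Returns an empty string for values that
// have no UTF-8 form: surrogates (U+D800..U+DFFF) and anything above U+10FFFF.
std::string encode_utf8(std::uint32_t cp)
{
    std::string out;
    if (cp >= 0xD800 && cp <= 0xDFFF) return out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {           // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// True for the three private-use ranges defined by Unicode.
bool is_private_use(std::uint32_t cp)
{
    return (cp >= 0xE000 && cp <= 0xF8FF) ||
           (cp >= 0xF0000 && cp <= 0xFFFFD) ||
           (cp >= 0x100000 && cp <= 0x10FFFD);
}

// Build a space-separated dump of every encodable code point in [first, last],
// with a newline instead of a space after every 256th character written.
std::string dump_code_points(std::uint32_t first, std::uint32_t last,
                             bool skip_private_use)
{
    std::string out;
    std::size_t written = 0;
    for (std::uint32_t cp = first; cp <= last && cp <= 0x10FFFF; ++cp) {
        if (skip_private_use && is_private_use(cp)) continue;
        const std::string utf8 = encode_utf8(cp);
        if (utf8.empty()) continue;        // surrogate: nothing to write
        out += utf8;
        ++written;
        out += (written % 256 == 0) ? '\n' : ' ';
    }
    return out;
}
```

To produce the file, you could open a std::ofstream on "myFile.txt" with std::ios::binary and stream dump_code_points(0, 0x10FFFF, true) into it. Binary mode keeps the stream from translating any bytes. Note that this still writes control characters such as U+0000 and many unassigned code points; filtering those out requires the Unicode Consortium's data, as mentioned above.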