So, I'm building a scripting language and one of my goals is convenient string operations. I tried some ideas in C++.
- String as sequences of bytes and free functions that return vectors containing the code-points indices.
- A wrapper class that combines a string and a vector containing the indices.
Both ideas had a problem, and that problem was, what should I return. It couldn't be a char, and if it was a string it would be wasted space.
I ended up creating a wrapper class around a char array of exactly 4 bytes: a string that has exactly 4 bytes in memory, no more nor less.
After creating this class, I felt tempted to just wrap it in a std::vector
of it in another class and build from there, thus making a string type of code-points. I don't know if this is a good approach, it would end up being much more convenient but it would end up wasting more space.
So, before posting some code, here's a more organized list of ideas.
- My character type would be not a byte, nor a grapheme but rather a code-point. I named it a rune like the one in the Go language.
- A string as a series of decomposed runes, thus making indexing and slicing O1.
- Because a rune is now a class and not a primitive, it could be expanded with methods for detecting unicode whitespace:
mysring[0].is_whitespace()
- I still don't know how to handle graphemes.
Curious fact! An odd thing about the way I build the prototype of the rune class was that it always print in UTF8. Because my rune is not a int32, but a 4 byte string, this end up having some interesting properties.
My code:
class rune {
char data[4] {};
public:
rune(char c) {
data[0] = c;
}
// This constructor needs a string, a position and an offset!
rune(std::string const & s, size_t p, size_t n) {
for (size_t i = 0; i < n; ++i) {
data[i] = s[p + i];
}
}
void swap(rune & other) {
rune t = *this;
*this = other;
other = t;
}
// Output as UTF8!
friend std::ostream & operator <<(std::ostream & output, rune input) {
for (size_t i = 0; i < 4; ++i) {
if (input.data[i] == '\0') {
return output;
}
output << input.data[i];
}
return output;
}
};
Error handling ideas:
I don't like to use exceptions in C++. My idea is, if the constructor fails, initialize the rune as 4 '\0'
, then overload the bool operator explicitly to return false if the first byte of the run happens to be '\0'
. Simple and easy to use.
So, thoughts? Opinions? Different approaches?
Even if my rune string is to much, at least I have a rune type. Small and fast to copy. :)
It sounds like you're trying to reinvent the wheel.
There are, of course, two ways you need to think about text:
In some codebases, those two representations are the same (and all encodings are basically arrays of
char32_t
orunsigned int
). In some (I'm inclined to say "most" but don't quote me on that), the encoded array of bytes will use UTF-8, where the codepoints are converted into variable lengths of bytes before being placed into the data structure.And of course many codebases simply ignore unicode entirely and store their data in ASCII. I don't recommend that.
For your purposes, while it does make sense to write a class to "wrap around" your data (though I wouldn't call it a
rune
, I'd probably just call it acodepoint
), you'll want to think about your semantics.std::string
's as UTF-8 encoded strings, and prefer this as your default interface for dealing with text. It's safe for most external interfaces—the only time it will fail is when interfacing with a UTF-16 input, and you can write corner cases for that—and it'll save you the most memory, while still obeying common string conventions (it's lexicographically comparable, which is the big one).codepoint
, with the following useful functions and constructorscode: