Leading/trailing whitespace insensitive traits for basic_string


I am doing a lot of parsing/processing where leading/trailing whitespace and case are insignificant. So I wrote a basic char trait for std::basic_string (see below) to save myself some work.

The trait does not work. The problem is that basic_string's compare calls the traits' compare and, if that evaluates to 0, returns the difference in sizes. The comment in basic_string.h says: "If the result of the comparison is nonzero returns it, otherwise the shorter one is ordered first." It looks like they explicitly don't want me to do this...

What is the reason for having this additional "shorter one" ordering if trait's compare returns 0? And, is there any workaround or do I have to roll my own string?

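#include <algorithm>  // std::min
#include <cctype>     // std::isspace
#include <cstddef>    // size_t
#include <iostream>
#include <string>     // std::basic_string
#include <strings.h>  // strncasecmp (POSIX)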

namespace csi{
template<typename T>
struct char_traits : std::char_traits<T>
{
    static int compare(T const*s1, T const*s2, size_t n){
        size_t n1(n);
        while(n1>0&&std::isspace(*s1))
            ++s1, --n1;
        while(n1>0&&std::isspace(s1[n1-1]))
            --n1;
        size_t n2(n);
        while(n2>0&&std::isspace(*s2))
            ++s2, --n2;
        while(n2>0&&std::isspace(s2[n2-1]))
            --n2;
        return strncasecmp(static_cast<char const*>(s1),
                           static_cast<char const*>(s2),
                           std::min(n1,n2));
    }
};
using string = std::basic_string<char,char_traits<char>>;
}

int main()
{
    using namespace csi;
    string s1 = "hello";
    string s2 = " HElLo ";
    std::cout << std::boolalpha
              << "s1==s2" << " " << (s1==s2) << std::endl;
}

There are 2 best solutions below

Accepted answer

Converting data that has more than one possible representation into a "standard" or "normal" form is called canonicalization. With text it usually means removing accents, folding case, and trimming whitespace and/or format characters.

If canonicalization is done under the hood during every comparison, it is fragile. For example, how would you test that it was applied correctly to both s1 and s2? It is also inflexible: you cannot display the canonical result or cache it for the next comparison. An explicit canonicalization step is therefore both more robust and more efficient.
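An explicit canonicalization step could look roughly like the sketch below. The name canonicalize is a hypothetical helper, not something from the question, and the sketch assumes C++17 and plain single-byte text in the default "C" locale (no accent removal).

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <string_view>

// Hypothetical helper (not from the original post): trims leading/trailing
// whitespace and lower-cases the rest. Assumes single-byte, "C"-locale text.
std::string canonicalize(std::string_view in)
{
    // Taking unsigned char avoids undefined behaviour for negative char values.
    auto is_space = [](unsigned char c) { return std::isspace(c) != 0; };

    // Drop leading and trailing whitespace.
    while (!in.empty() && is_space(in.front())) in.remove_prefix(1);
    while (!in.empty() && is_space(in.back()))  in.remove_suffix(1);

    // Fold case on a copy of the trimmed view.
    std::string out(in);
    std::transform(out.begin(), out.end(), out.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return out;
}
```

With this, canonicalize("hello") == canonicalize(" HElLo ") uses the ordinary std::string comparison, and the canonical form can be stored, printed, or tested on its own.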

What is the reason for having this additional "shorter one" ordering if trait's compare returns 0?

Traits::compare is required to compare exactly n characters, and basic_string passes the smaller of the two sizes as n. So what should it return when you compare "hellow" and "hello"? It should return 0, because the first five characters match. Ignoring n is a defect: the traits must also work with std::string_view, which is not null-terminated, so reading past n is not safe. If the size comparison were dropped, "hellow" and "hello" would compare equal, which you probably don't want.
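The two-stage rule can be restated as a small sketch. The function name compare_like_basic_string is hypothetical and only mirrors how the standard describes basic_string::compare; it is not how any particular library implements it.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Illustrative restatement of the rule; compare_like_basic_string is a
// hypothetical name, not a library function.
template <typename CharT, typename Traits>
int compare_like_basic_string(const std::basic_string<CharT, Traits>& a,
                              const std::basic_string<CharT, Traits>& b)
{
    // Traits::compare only ever sees the common prefix length.
    const std::size_t rlen = std::min(a.size(), b.size());
    const int r = Traits::compare(a.data(), b.data(), rlen);
    if (r != 0) return r;

    // Equal prefixes: the shorter string is ordered first.
    if (a.size() < b.size()) return -1;
    if (a.size() > b.size()) return 1;
    return 0;
}
```

Because the traits function receives a single n for both inputs, it has no way to tell the string that the two operands should be treated as having different effective lengths.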

Another answer

What is the reason for having this additional "shorter one" ordering if trait's compare returns 0?

That's simply how basic_string::compare() is defined.

And, is there any workaround or do I have to roll my own string?

It seems that your custom char_traits have to implement:

  • length(), returning length of the trimmed part of a string, and

  • move() and copy(), for copying that trimmed part


However, there is a potential problem that custom traits cannot solve. basic_string has constructors such as basic_string(const CharT* s, size_type count, const Allocator& alloc = Allocator()), and overloads of methods such as assign or compare, that take a character pointer and an explicit length; in those cases Traits::length() is not called. If anyone uses one of those overloads, the string might contain trailing whitespace, or the call might access characters beyond the end of the source string.
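One way to observe this is a traits class that counts calls to length(). The names counting_traits and counted_string are hypothetical and exist only for this experiment; the sketch assumes C++17 for the inline static member.

```cpp
#include <cstddef>
#include <string>

// Hypothetical traits used only to observe when length() is consulted.
struct counting_traits : std::char_traits<char>
{
    static inline int length_calls = 0;  // C++17 inline variable

    static std::size_t length(const char* s)
    {
        ++length_calls;
        return std::char_traits<char>::length(s);
    }
};

using counted_string = std::basic_string<char, counting_traits>;
```

Constructing counted_string a("hello") goes through counting_traits::length, whereas counted_string b("hello  ", 7) takes the count as given, so the two trailing spaces end up in b without the traits being asked.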

To solve this, it's possible to do something like this:

class TrimmedString
{
public:
    // expose only "safe" methods:
    void assign(const char* s) { m_str.assign(s); }

private:
    std::basic_string<char, CustomTraits> m_str;
};

Or this (might be simpler):

class TrimmedString : private std::basic_string<char, CustomTraits>
{
public:
    using BaseClass = std::basic_string<char, CustomTraits>; // for readability

    // make "safe" methods public
    using BaseClass::length;
    using BaseClass::size;
    // etc.

    // wrappers for methods with "unsafe" overloads
    void assign(const char* s) { BaseClass::assign(s); }
};