Reading character entity reference tag content in C++


P.S. Suppose the file contains only &#110 (nothing before or after it, to keep things as simple as possible). readCharacter() would return the correctly decoded 'n' character, but it would also have reached the end of the file, so getTagContent() would return an empty string, which is not the expected result.

P.S. 2 I found a solution, but it doesn't look very neat in my opinion. The if in the while loop in the getTagContent() method could look like this:

if (ch == '<' || is.eof())
{
    if (ch != EOF && ch != '<')
    {
        tagContent[i++] = ch;
    }

    break;
}

I am trying to achieve the following:

We have an HTML tag content, e.g. let the tag be <th>some value</th>.

When I invoke getTagContent(), is.get() returns 's', the first character of the content (I have already handled that part).

I would also like to handle character entity references, so that some value could be written as &#115ome value or &#115&#111&#109&#101&#32value. That's what the readCharacter() method is for.

char* getTagContent(std::istream& is, int maxTagContentLength)
{
    char* tagContent = new char[maxTagContentLength + 1];
    int i = 0;

    char ch;

    while (true)
    {
        ch = readCharacter(is);

        if (ch == '<' || is.eof())
        {
            break;
        }

        tagContent[i++] = ch;
    }

    tagContent[i] = '\0';

    return tagContent;
}

char readCharacter(std::istream& is)
{
    char ch = is.get();

    if (ch == '&' && is.peek() == '#')
    {
        is.get();

        char charEntityRef;
        int number = 0;

        while (true)
        {
            charEntityRef = is.get();

            if (is.eof())
            {
                break;
            }

            if (!isDigit(charEntityRef))
            {
                is.unget();
                break;
            }

            number = number * 10 + charEntityRef - '0';
        }

        ch = (char)(number);
    }

    return ch;
}

I came across a problem, though. Suppose the content is &#110&#105&#110&#101&#116&#101&#101&#110, which decodes to the string nineteen. My code returns ninetee, without the last n. In the last iteration of the while loop in getTagContent(), readCharacter() does return the missing 'n', but it also sets the stream's eof bit while looking for more digits, so the loop hits the break statement before the character is written to the result.

I don't see how to fix it without messing up the logic (e.g. we need to stop exactly when we meet an opening tag, as that's when the tag content ends, and probably the closing tag follows).

Answer by Remy Lebeau:

There are many problems with your code:

  1. you are not handling EOF correctly.

  2. you are not handling the terminating ; at the end of an entity correctly. It is part of the entity and should not be put back into the input stream.

  3. you are handling only entities that are decimal codes, but not entities that are hex codes or names.

  4. you have a buffer overflow in getTagContent() if the content is more than maxTagContentLength characters in length.

  5. getTagContent() will break prematurely if the content contains an entity for '<' (like &lt;). You need to check if a read character is the terminating '<' at the end of the content before you check for any entities in the content.

With that said, try something more like this:

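#include <istream>
#include <string>

// forward declarations, since getTagContent() uses both functions
std::string readCharacterOrEntity(std::istream& is);
std::string decodeCharacterEntity(const std::string &entity);

std::string getTagContent(std::istream& is)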
{
    std::string tagContent;
    std::string value;

    while (((value = readCharacterOrEntity(is)) != "") && (value != "<"))
    {
        tagContent += decodeCharacterEntity(value);
    }

    return tagContent;
}

std::string readCharacterOrEntity(std::istream& is)
{
    std::string result; // stays empty if nothing could be read

    char ch;
    if (is.get(ch))
    {
        if (ch == '&' && is.peek() == '#')
        {
            is.get();
            std::string value;
            if (std::getline(is, value, ';'))
                result = "&#" + value + ';';
        }
        else
        {
            // TODO: handle named entity that begins with just '&' and not '&#'...
            result = ch;
        }
    }

    return result;
}

std::string decodeCharacterEntity(const std::string &entity)
{
    if (entity.compare(0, 2, "&#") == 0)
    {
        int i;

        // std::stoi takes (string, position pointer, base) and throws on malformed input
        if (entity[2] == 'x')
            i = std::stoi(entity.substr(3, entity.size()-4), nullptr, 16);
        else
            i = std::stoi(entity.substr(2, entity.size()-3), nullptr, 10);

        // TODO: handle non-ASCII characters
        if (i < 0 || i > 127)
            return "?";

        return std::string(1, static_cast<char>(i));
    }
    else if (entity[0] == '&')
    {
        // strip the leading '&' and the trailing ';'
        std::string entity_name = entity.substr(1, entity.size()-2);
        if (entity_name == "lt")
            return "<";
        if (entity_name == "gt")
            return ">";
        // TODO: look up other names as needed...
        return entity; // placeholder: unknown names are passed through unchanged
    }
    else
        return entity;
}
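
For the non-ASCII TODO above, one option is to emit the code point as UTF-8 instead of returning "?". The sketch below assumes the program's output encoding is UTF-8, which the question does not state. encodeUtf8() is a hypothetical helper name, and decodeCharacterEntity() would need to parse into a wider type than a char-sized value to use it.

```cpp
#include <string>

// Hypothetical helper: encode a Unicode code point as a UTF-8 byte sequence.
// Returns "?" for surrogates and for values above U+10FFFF.
std::string encodeUtf8(unsigned long cp)
{
    std::string out;

    if (cp < 0x80)                           // 1 byte: 0xxxxxxx
    {
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)                     // 2 bytes: 110xxxxx 10xxxxxx
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)                   // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return "?";                      // surrogates are not valid characters
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp <= 0x10FFFF)                 // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        return "?";
    }

    return out;
}
```

With such a helper, a reference like &#233; would yield the two bytes 0xC3 0xA9 rather than a question mark.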

That being said, this is not a good way to parse HTML. You really should be using an actual HTML parser library. But if you can't or won't, then at least read the HTML into a larger memory buffer that you can tokenize more easily, instead of processing one character at a time.