Correctly removing html entities from a string

Question

1.9k Views Asked by Leonard At 09 January 2020 at 14:47

I have a problem removing HTML entities from strings. I tried System.Web.HttpUtility.HtmlDecode and expected &nbsp; to be replaced with a regular space. Instead, a weird hex code is returned. I have read the following two topics and learned that this is most probably an encoding issue, but I can't find a way to solve it.

Removing HTML entities in strings

How do I remove all HTML tags from a string without knowing which tags are in it? ("I realize that...", Thierry_S)

The source string that should be stripped of HTML tags and entities is stored in a database with SQL_Latin1_General_CP1_CI_AI as collation. For my unit test, however, I simply created a test string in Visual Studio, and its encoding is not necessarily the same as the encoding of the data stored in the database.

My unit test asserts 'Not Equal' because the &nbsp; is not replaced with a regular space. Initially it returned 2C, but after a lot of testing and trying to convert from one encoding to another, it now returns A0, even though I have removed all encoding-changing code from my function.

My question is two-fold:

How can I make my unit test pass?
Am I testing correctly, since the database encoding could be different from the text I have manually typed in my unit test?

My function:

public static string StripHtml(string text)
{
    // Remove html entities like &nbsp;
    text = System.Net.WebUtility.HtmlDecode(text);

    // Init Html Agility Pack
    var htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(text);

    // Return without html tags
    return htmlDoc.DocumentNode.InnerText;
}

My unit test:

public void StripHtmlTest()
{
    // arrange
    string html = "<p>This is&nbsp;a very <b>fat, <i>italic</i> and <u>underlined</u> text,<!-- foo bar --> sigh.</p> And 6 < 9 but > 3.";
    string actual;
    string expected = "This is a very fat, italic and underlined text, sigh. And 6 < 9 but > 3.";

    // act
    actual = StaticRepository.StripHtml(html);

    // assert
    Assert.AreEqual(expected, actual);
}

Test result:

Message: Assert.AreEqual failed. Expected:<This is a very fat, italic and underlined text, sigh. And 6 < 9 but > 3.>. Actual:<This is a very fat, italic and underlined text, sigh. And 6 < 9 but > 3.>.

Test result in HEX: [image of the hex output, not preserved]

**vasil oreshenski** · Accepted Answer · 2020-01-09T16:02:10.413000

Well, &nbsp; is not a 'regular' space. When you use System.Net.WebUtility.HtmlDecode, it returns the character that the named HTML entity stands for, which is the non-breaking space (U+00A0). It looks like a regular space, but it is a different character with a different meaning. The decimal value of nbsp is 160, which is A0 in hex, so your unit test and the decoding are working correctly.
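To make that concrete, here is a minimal sketch that inspects what the decoder returns for the entity. The class and method names (NbspDemo, DecodedNbspCodePoint, DecodesToRegularSpace) are made up for illustration; only WebUtility.HtmlDecode comes from the question.

```csharp
using System;
using System.Net;

// Hypothetical helper class, for illustration only.
public static class NbspDemo
{
    // Returns the numeric value of the single character that "&nbsp;" decodes to.
    public static int DecodedNbspCodePoint()
    {
        string decoded = WebUtility.HtmlDecode("&nbsp;");
        return decoded[0];
    }

    // True only if the decoded entity equals the regular space (decimal 32).
    public static bool DecodesToRegularSpace()
    {
        return WebUtility.HtmlDecode("&nbsp;") == " ";
    }
}
```

DecodedNbspCodePoint() should give 160 (hex A0) and DecodesToRegularSpace() should be false, which is why Assert.AreEqual fails against an expected string typed with an ordinary space.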
If you want to replace nbsp with a regular space, you have several options. The easiest is a simple replace before the decoding:

// where the second argument is whitespace char with decimal representation 32
text = text.Replace("&nbsp;", " ");
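Another of those options, sketched here under the assumption that every non-breaking space should become a regular space, is to decode first and then replace the U+00A0 character itself. This also catches the numeric forms of the entity (&#160; and &#xA0;), which the string replace above does not match. EntityCleanup and DecodeWithRegularSpaces are hypothetical names, not part of any library.

```csharp
using System.Net;

// Hypothetical helper class, for illustration only.
public static class EntityCleanup
{
    private const char NonBreakingSpace = '\u00A0';

    // Decodes HTML entities, then swaps every non-breaking space
    // for a regular space (decimal 32).
    public static string DecodeWithRegularSpaces(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return text;
        }

        string decoded = WebUtility.HtmlDecode(text);
        return decoded.Replace(NonBreakingSpace, ' ');
    }
}
```

The trade-off is that non-breaking spaces that were already present as raw characters in the input are normalized as well, which may or may not be what you want.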

About the initial run: the hex value 2C is 44 in decimal, which is the ',' (comma) character. Is it possible that you simply looked at the wrong character?
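To avoid reading the wrong position, a small helper can print the index next to each hex value and report where two strings first differ. This is only a sketch; HexDump, Describe and FirstDifference are hypothetical names.

```csharp
using System;
using System.Linq;

// Hypothetical helper class, for illustration only.
public static class HexDump
{
    // Formats every UTF-16 code unit as "index:HEX", separated by spaces.
    public static string Describe(string text)
    {
        return string.Join(" ", text.Select((c, i) => $"{i}:{(int)c:X2}"));
    }

    // Returns the index of the first differing character, or -1 if the
    // strings are identical. If one string is a prefix of the other,
    // the length of the shorter string is returned.
    public static int FirstDifference(string expected, string actual)
    {
        int shared = Math.Min(expected.Length, actual.Length);
        for (int i = 0; i < shared; i++)
        {
            if (expected[i] != actual[i])
            {
                return i;
            }
        }
        return expected.Length == actual.Length ? -1 : shared;
    }
}
```

For example, HexDump.Describe("is\u00A0a,") gives "0:69 1:73 2:A0 3:61 4:2C", so the A0 (nbsp) and the 2C (comma) are easy to tell apart by position.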

About the SQL collation: Latin1 General is capable of storing the nbsp character, so I don't think this is the problem.
