.NET fiddle/Visual Studio: Different results for regex replace on invalid XML character


I'm trying to filter invalid characters from an XML file, and have the following test project:

using System;
using System.Text.RegularExpressions;

class Program
{
    private static Regex _invalidXMLChars = new Regex(@"(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F\uFEFF\uFFFE\uFFFF]", RegexOptions.Compiled);

    static void Main(string[] args)
    {
        var text = "assd&#xF;abv";

        Console.WriteLine(_invalidXMLChars.IsMatch(text));
    }
}

This test project outputs the expected result (True) in .NET Fiddle.

But when I run the same code in my own project in Visual Studio, the invalid characters are not found and it outputs "False".

How come this works in .NET fiddle, but not in my project?

Altering the source XML file is not an option.


There are 2 best solutions below


Visual Studio is right. None of the characters &, #, x, F or ; is matched by your regex, so the literal string "assd&#xF;abv" produces no match. In HTML, however, &#xF; is the character reference for U+000F, the equivalent of \u000f in C#, and that character is matched by the range \x0E-\x1F in your regex.

Using \u000f in Visual Studio gives a match:

using System;
using System.Text.RegularExpressions;

public class Program
{
    private static Regex _invalidXMLChars = new Regex(@"(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F\uFEFF\uFFFE\uFFFF]", RegexOptions.Compiled);

    public static void Main()
    {
        var text = "assd\u000fabv";
        Console.WriteLine(_invalidXMLChars.IsMatch(text));
    }
}

The regular expression does not match because the string contains the escaped character reference (&#xF;) rather than the "illegal" character itself; the reference only turns into that character once it is decoded.

To filter this out, you will have to unescape the string before testing the regular expression:

static void Main(string[] args)
{
    var text = System.Web.HttpUtility.HtmlDecode("assd&#xF;abv");

    Console.WriteLine(_invalidXMLChars.IsMatch(text));
}

A second option would be to use the regular expression to match the escape sequence instead:

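var text2 = "assd&#xF;abv";
// Matches only single-digit, uppercase hex references (&#x0; to &#xF;), which is enough for this test string.
var rx = new Regex(@"&#x[0-9A-F];");
Console.WriteLine(rx.IsMatch(text2));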
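
If the goal is to remove such references rather than just detect them, the second option could be extended along the following lines. This is only a sketch under stated assumptions: the class name XmlCharRefFilter and the method StripInvalidRefs are made up for illustration, and validity is checked against the Char production of XML 1.0 instead of the stricter character class from the question (so references such as &#x7F; are kept here).

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original question or answers.
public static class XmlCharRefFilter
{
    // Matches decimal (&#15;) and hexadecimal (&#xF;) numeric character references.
    private static readonly Regex NumericRef =
        new Regex(@"&#(?:[xX](?<hex>[0-9A-Fa-f]+)|(?<dec>[0-9]+));", RegexOptions.Compiled);

    // Assumption: "valid" means allowed by the XML 1.0 Char production.
    private static bool IsValidXmlChar(int cp)
    {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Removes references whose code point is not a valid XML character;
    // valid references such as &#x41; are left untouched.
    public static string StripInvalidRefs(string input)
    {
        return NumericRef.Replace(input, m =>
        {
            int cp;
            bool parsed = m.Groups["hex"].Success
                ? int.TryParse(m.Groups["hex"].Value, NumberStyles.AllowHexSpecifier,
                               CultureInfo.InvariantCulture, out cp)
                : int.TryParse(m.Groups["dec"].Value, NumberStyles.None,
                               CultureInfo.InvariantCulture, out cp);
            // Values that fail to parse (overflow) cannot be valid code points either.
            return parsed && IsValidXmlChar(cp) ? m.Value : string.Empty;
        });
    }
}
```

With this helper, XmlCharRefFilter.StripInvalidRefs("assd&#xF;abv&#x41;") returns "assdabv&#x41;": the reference to U+000F is dropped and the valid reference is kept. Because the filtering happens on the string in memory, the source XML file itself stays unchanged.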

Hope this helps!