Regex validation for memo field (client and server side) with a few special tags


I have been going over this problem for two days without any real luck. I am using ASP.NET Web API 2 with jQuery AJAX on the client side.

I have an edit box for entering memo text. The allowable characters are ^[©a-zA-Z0-9\u0900-\u097f,\.\s\-\'\"!?\(\)\[\]]+$, plus two tags: <LineBreak/> and <Link attr="value"/> (the Link tag may get a couple more attributes). The problem is that NO other tags are allowed, which means that even a simple <br/> must be rejected. This negative check is proving to be a bit complicated.

I am requesting help in formulating a regex for JavaScript on the client side and a C#-based DataAnnotation check on the server side.


There are 2 best solutions below

Answer 1

What you're attempting to do is sanitize user input. However, using JavaScript and regex is the wrong way to go about it.

Don't concern yourself with validating user input on the front end, at least not yet. The focus should be on validating it server side first, and the best tool for the job is HtmlSanitizer. In their words:

HtmlSanitizer is a .NET library for cleaning HTML fragments and documents from constructs that can lead to XSS attacks.

HtmlSanitizer can be customized at several levels:

  • Configure allowed HTML tags through the property AllowedTags.
  • Configure allowed HTML attributes through the property AllowedAttributes.
  • Configure allowed CSS property names through the property AllowedCssProperties.
  • Configure allowed CSS at-rules through the property AllowedAtRules.
  • Configure allowed URI schemes through the property AllowedSchemes.
  • Configure HTML attributes that contain URIs (such as "src", "href" etc.)
  • Provide a base URI that will be used to resolve relative URIs against.
  • Cancelable events are raised before a tag, attribute, or style is removed.

I've mocked up a demo on dotnetfiddle.net using that library for you to play with:

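using System;
using Ganss.XSS; // namespace in older HtmlSanitizer releases; newer ones use Ganss.Xss

public class Program
{
    public static void Main()
    {
        // Allow only the two custom tags and the single attribute the memo field may contain.
        var allowedTags = new[]{"LineBreak", "Link"};
        var allowedAttributes = new[]{"attr"};
        var sanitizer = new HtmlSanitizer(allowedTags: allowedTags, allowedAttributes: allowedAttributes);
        // Hostile markup first, followed by the two permitted tags.
        var html = @"<script>alert('xss')</script><div onload=""alert('xss')""" + @"style=""background-color: test"">Test<img src=""test.gif""" + @"style=""background-image: url(javascript:alert('xss')); margin: 10px""></div>
    <LineBreak></LineBreak>

    <Link attr=""v123""/>";
        // Sanitize removes whatever is not on the allow lists above.
        var sanitized = sanitizer.Sanitize(html);
        Console.WriteLine(sanitized);
    }
}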

Edit

But would like to know why "regex is the wrong way to go about it".

Regex isn't made for this type of task. To sanitize input properly you need to parse the HTML document, meaning its tags, attributes and the values within those attributes, into a tree-like structure, because there are too many edge cases to cover with regex alone. Regex is better suited to scraping data from a source whose structure is already predictable, and user input isn't one of those.

Even though your use case is simple enough, you're still enabling users to type in HTML that will be redisplayed to other users in its raw format, so anything you miss will give you a headache down the line.

Here's the XSS Filter Evasion Cheat Sheet from OWASP. If regex could cover everything listed there, I would say fine, but achieving that in regex is so difficult that it just doesn't make sense.
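To make the edge-case argument concrete, here is a minimal JavaScript sketch of one classic evasion. The function name stripScriptTags and the sample payload are illustrative assumptions, not something from the question or from HtmlSanitizer. The only point is that a single regex pass can reassemble the very tag it was supposed to remove.

```javascript
// Naive filter (hypothetical): delete anything that looks like a <script> tag.
function stripScriptTags(input) {
  return input.replace(/<\/?script[^>]*>/gi, "");
}

// The attacker nests a complete tag inside a broken one...
const payload = "<scr<script>ipt>alert(1)</scr</script>ipt>";

// ...so removing the inner tags stitches the outer fragments back together.
const result = stripScriptTags(payload);
console.log(result); // prints "<script>alert(1)</script>"
```

A parser-based sanitizer does not have this problem in the same form, because it works on the parsed tree instead of on the raw character stream.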

HtmlSanitizer, on the other hand, does cover the issues listed on that cheat sheet. It's actively maintained and purpose-built for exactly this sort of application. It's also not bulky by any means: it can handle large sanitization tasks with processing times in the 50-100ms range.

Answer 2

I managed to achieve this with a combination of a RegularExpression data annotation which allows angle brackets (and thereby custom tags)

// '=' is included so that <Link attr="value"/> passes the character check
[RegularExpression(@"([©a-zA-Z0-9\u0900-\u097f,\.\s\-\'\""!?\(\)\[\]\<\>\/\=]*)")]

and a ValidationAttribute class which checks for unwanted tags (other than LineBreak and Link)

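using System.ComponentModel.DataAnnotations;
using System.Text.RegularExpressions;

public class CustomTagValidatorAttribute : ValidationAttribute
{
    // Matches any '<' that does not start <LineBreak>, <LineBreak/> or a <Link ...> tag with attributes.
    private static readonly Regex DisallowedTag = new Regex(@"<(?!(LineBreak\s*|Link\s+[\s\w\'\""\=]*)\/?>)");
    protected override ValidationResult IsValid(object value, ValidationContext validationContext)
    {
        // Null is treated as valid here; add [Required] if the memo must be present.
        if (value == null) return ValidationResult.Success;
        return DisallowedTag.IsMatch(value.ToString())
            ? new ValidationResult(Resources.ErrorStrings.InvalidValuesInRequest) // project resource string
            : ValidationResult.Success;
    }
}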
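To see what the negative lookahead accepts and rejects, here is a small JavaScript sketch that runs the same pattern over a few inputs. The helper name hasOnlyAllowedTags and the sample strings are my own assumptions for illustration. Note also that \w is ASCII-only in JavaScript but Unicode-aware by default in .NET, so the two engines can disagree on non-ASCII attribute text.

```javascript
// JavaScript port of the pattern used in CustomTagValidatorAttribute.
const disallowedTag = /<(?!(LineBreak\s*|Link\s+[\s\w'"=]*)\/?>)/;

// Hypothetical helper: true when the text contains no disallowed '<'.
function hasOnlyAllowedTags(text) {
  return !disallowedTag.test(text);
}

const samples = [
  'Line one<LineBreak/>line two', // allowed tag
  '<Link attr="v123"/>',          // allowed tag with attribute
  '<br/>',                        // any other tag
  '</LineBreak>',                 // closing tags are not covered by the lookahead
  '<Link onclick="x"/>',          // attribute names are not restricted
];
for (const s of samples) {
  console.log(JSON.stringify(s), hasOnlyAllowedTags(s));
}
```

The first two samples pass and <br/> fails, as intended. Two results are worth knowing about: a closing tag such as </LineBreak> is rejected, and <Link onclick="x"/> is accepted because the pattern checks the shape of the attributes, not their names.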

Both attributes are applied to the class property as below -

[CustomTagValidator]
[RegularExpression(@"([©a-zA-Z0-9\u0900-\u097f,\.\s\-\'\""!?\(\)\[\]\<\>\/\=]*)")]
public string PropertyToValidate { get; set; }

I also added an ActionFilterAttribute that returns a 400 response when model validation has failed, so the controller action is never run with invalid input -

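using System.Net;
using System.Net.Http;
using System.Web.Http.Controllers;
using System.Web.Http.Filters;

public class ValidateModelAttribute : ActionFilterAttribute
{
    public override void OnActionExecuting(HttpActionContext actionContext)
    {
        // Short-circuit with 400 Bad Request and the validation errors if the model is invalid.
        if (!actionContext.ModelState.IsValid)
        {
            actionContext.Response = actionContext.Request.CreateErrorResponse(
                HttpStatusCode.BadRequest, actionContext.ModelState);
        }
    }
}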
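On the client, the 400 response produced by this filter has to be turned into something readable. The sketch below assumes the body has the shape Web API 2 normally produces for a ModelState error (a Message string plus a ModelState object mapping property keys to arrays of messages). The helper name collectModelErrors, the key "mm.PropertyToValidate" and the error text are assumptions for illustration, so check them against a real response.

```javascript
// Hypothetical helper: flatten a Web API 400 body into a list of messages.
function collectModelErrors(body) {
  const state = (body && body.ModelState) || {};
  const messages = [];
  for (const key of Object.keys(state)) {
    for (const msg of state[key]) {
      messages.push(key + ": " + msg);
    }
  }
  return messages;
}

// Sample body in the assumed shape.
const sample = {
  Message: "The request is invalid.",
  ModelState: { "mm.PropertyToValidate": ["Invalid values in request."] },
};
console.log(collectModelErrors(sample));
```

With jQuery, the parsed body is typically available as responseJSON on the jqXHR object passed to the error callback, so the call would look like collectModelErrors(xhr.responseJSON).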

and applied this to relevant controller action as below -

    [ValidateModel]
    public HttpResponseMessage Post([FromBody] MyModel mm)

Hope this helps someone stuck with similar issues.

Almost forgot: the same solution was applied on the client side, using the same regexes in JavaScript validation.
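For anyone who wants a starting point, a client-side version could look roughly like the sketch below. It is not the exact code from my project: the function name validateMemo and the error messages are made up for illustration, and the character class is anchored with ^ and $ because, unlike the RegularExpression attribute, a bare JavaScript test() does not require the whole string to match.

```javascript
// Character check, anchored so the whole string must match. '=' is included
// on the assumption that <Link attr="value"/> has to pass this check too.
const allowedChars = /^[©a-zA-Z0-9\u0900-\u097f,.\s\-'"!?()\[\]<>\/=]*$/;

// Same negative lookahead as the server-side tag validator.
const disallowedTag = /<(?!(LineBreak\s*|Link\s+[\s\w'"=]*)\/?>)/;

// Hypothetical helper: returns an error message, or null when the memo is valid.
function validateMemo(text) {
  if (!allowedChars.test(text)) {
    return "Memo contains characters that are not allowed.";
  }
  if (disallowedTag.test(text)) {
    return "Only <LineBreak/> and <Link .../> tags are allowed.";
  }
  return null;
}

console.log(validateMemo('Hello<LineBreak/><Link attr="v123"/>')); // null
console.log(validateMemo("<br/>")); // tag error message
```

This only saves a round trip for honest users. Anyone can bypass it, so the server-side checks stay in place.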