Regex for Highlight

I have a problem when I use separate regexes to highlight strings and comments in a document (RichEditControl), SQL-style.

This is my first regex:

(/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|(--.*)

This works well for: /*blahblah*/ and --blahblah

And I have another regex:

((""(.|/[[:blank:]]/)*?"")|('(.|/[[:blank:]]/)*?'))

This works well for: 'blahblah' (like a SQL string)

But, if I do this:

'/*blahblah*/'

Before I type the last ', the program throws an exception:

An unhandled exception of type 'System.ArgumentException' occurred in DevExpress.Office.v15.2.Core.dll

Thanks in advance for the help.

This is the full code:

    private List<SyntaxHighlightToken> ParseTokens()
    {
        List<SyntaxHighlightToken> tokens = new List<SyntaxHighlightToken>();            
        DocumentRange[] ranges = null;            

        #region SearchSimpleCommas
        Regex quotations = new Regex(@"((""(.|/[[:blank:]]/)*?"")|('(.|/[[:blank:]]/)*?'))");
        ranges = document.FindAll(quotations);
        foreach (var range in ranges)
        {
            if (!IsRangeInTokens(range, tokens))
                tokens.Add(new SyntaxHighlightToken(range.Start.ToInt(), range.Length, StringSettings));   
        }
        #endregion

        #region SearchComment--/**/
        Regex comment = new Regex(@"(/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|(--.*)", RegexOptions.IgnoreCase | RegexOptions.Multiline);
        ranges = document.FindAll(comment);
        for (int i = 0; i < ranges.Length; i++)
        {
            tokens.Add(new SyntaxHighlightToken(ranges[i].Start.ToInt(), ranges[i].Length, CommentsSettings));
        }
        #endregion

        tokens.Sort(new SyntaxHighlightTokenComparer());
        // fill in gaps in document coverage
        AddPlainTextTokens(tokens);
        return tokens;
    }

    private void AddPlainTextTokens(List<SyntaxHighlightToken> tokens)
    {
        int count = tokens.Count;
        if (count == 0)
        {
            tokens.Add(new SyntaxHighlightToken(0, document.Range.End.ToInt(), defaultSettings));
            return;
        }
        tokens.Insert(0, new SyntaxHighlightToken(0, tokens[0].Start, defaultSettings));
        for (int i = 1; i < count; i++)
        {
            tokens.Insert(i * 2, new SyntaxHighlightToken(tokens[i * 2 - 1].End, tokens[i * 2].Start - tokens[i * 2 - 1].End, defaultSettings));
        }
        tokens.Add(new SyntaxHighlightToken(tokens[count * 2 - 1].End, document.Range.End.ToInt() - tokens[count * 2 - 1].End, defaultSettings));
    }

    private bool IsRangeInTokens(DocumentRange range, List<SyntaxHighlightToken> tokens)
    {
        return tokens.Any(t => IsIntersect(range, t));            
    }
    bool IsIntersect(DocumentRange range, SyntaxHighlightToken token)
    {
        int start = range.Start.ToInt();
        if (start >= token.Start && start < token.End)
            return true;
        int end = range.End.ToInt() - 1;
        if (end >= token.Start && end < token.End)
            return true;
        return false;
    }

    #region ISyntaxHighlightServiceMembers
    public void ForceExecute()
    {
        Execute();
    }
    public void Execute()
    { // The exception is thrown here
        document.ApplySyntaxHighlight(ParseTokens());
    }
    #endregion

EDIT: Thanks Harrison Mc.

Here is the code I ended up using, in case anyone needs it. Only the part I modified is shown (inside the ParseTokens method):

    #region SearchComments&Strings
    Regex definitiveRegex = new Regex(@"(?<string>'[^\\']*(?>\\.[^\\']*)*')|(?<comment>(?>/\*(?>[^*]|[\r\n]|(?>\*+(?>[^*/]|[\r\n])))*\*+/)|(?>--.*))");
    MatchCollection matches = definitiveRegex.Matches(document.Text);
    foreach (System.Text.RegularExpressions.Match match in matches)
    {
        try
        {
            System.Text.RegularExpressions.GroupCollection groups = match.Groups;
            if (groups["string"].Value.Length > 0)
            {
                ranges = null;
                for (int s = 0; s < groups.Count; s++)
                {
                    if (groups[s].Value != string.Empty)
                    {
                        ranges = document.FindAll(groups[s].Value, SearchOptions.None);
                        for (int z = 0; z < ranges.Length; z++)
                        {
                            if(!IsRangeInTokens(ranges[z], tokens))
                                tokens.Add(new SyntaxHighlightToken(ranges[z].Start.ToInt(), ranges[z].Length, StringSettings));
                        }
                    }
                }
            }
            else if (groups["comment"].Value.Length > 0)
            {
                ranges = null;
                for (int c = 0; c < groups.Count; c++)
                {
                    if (groups[c].Value != string.Empty)
                    {
                        ranges = document.FindAll(groups[c].Value.Trim(), SearchOptions.None);
                        for (int k = 0; k < ranges.Length; k++)
                        {
                            if (!IsRangeInTokens(ranges[k], tokens))
                                tokens.Add(new SyntaxHighlightToken(ranges[k].Start.ToInt(), ranges[k].Length, CommentsSettings));
                        }
                    }
                }
            }
        }
        catch(Exception ex){ }
    }
    #endregion

There is 1 best solution below

In order to avoid highlighting comments in strings and strings in comments, you need some sort of "state", which regular expressions can't easily give you. These situations would be difficult for individual string and comment regular expressions to deal with, because it would require keeping track of whether or not you're in a comment when looking for a string and vice versa.

"This string looks like it contains a /*comment*/ but it does not."
/* This comment looks like it contains a 'string' but it does not. */

However, if you use one regular expression with separate groups for strings and comments, each match consumes its characters before the search continues, so a "comment" inside a string or a "string" inside a comment is never matched on its own.

I tested this regular expression, and it seemed to work for both "comments" in strings and "strings" in comments (both with multiple lines).

(?<string>'[^\\']*(?>\\.[^\\']*)*'|""[^\\""]*(?>\\.[^\\""]*)*"")|(?<comment>(?>/\*(?>[^*]|[\r\n]|(?>\*+(?>[^*/]|[\r\n])))*\*+/)|(?>--.*))

The key here is that the regular expression is keeping track of the "state" that determines if we're in the middle of a string or in the middle of a comment.
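To make that concrete, here is a minimal sketch that runs the combined pattern over one line of SQL-like text and reports which named group matched. The class name, method name, and sample text are made up for illustration; only `Regex.Matches` and the named-group lookup come from .NET itself.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class CombinedRegexDemo
{
    // Same pattern as above, split across two C# verbatim strings
    // (inside @"..." a doubled "" stands for one literal double quote).
    const string Pattern =
        @"(?<string>'[^\\']*(?>\\.[^\\']*)*'|""[^\\""]*(?>\\.[^\\""]*)*"")" +
        @"|(?<comment>(?>/\*(?>[^*]|[\r\n]|(?>\*+(?>[^*/]|[\r\n])))*\*+/)|(?>--.*))";

    // Returns each match together with the name of the group that produced it.
    public static List<(string Kind, string Value)> Classify(string text)
    {
        var result = new List<(string Kind, string Value)>();
        foreach (Match m in Regex.Matches(text, Pattern))
        {
            string kind = m.Groups["string"].Success ? "string" : "comment";
            result.Add((kind, m.Value));
        }
        return result;
    }

    static void Main()
    {
        string text =
            "SELECT 'looks like a /*comment*/' FROM t /* looks like a 'string' */ -- done";

        foreach (var (kind, value) in Classify(text))
            Console.WriteLine($"{kind}: {value}");
    }
}
```

For this sample text the quoted part is reported once as a string and the `/* ... */` and `-- done` parts as comments; the `/*comment*/` inside the quotes and the `'string'` inside the block comment never show up as separate matches.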

To use this, you'll need to grab the individual groups out of the overall match. The (?<name>group) syntax creates a named group, which you can extract later. If the <string> group has a match then it's a string, and if the <comment> group has a match then it's a comment. Since I'm not familiar with the document.FindAll method, I adapted an example from the .NET documentation that uses the Regex.Matches method:

// Requires: using System.Text.RegularExpressions;
Regex stringAndCommentRegex = new Regex(@"(?<string>'[^\\']*...");
MatchCollection matches = stringAndCommentRegex.Matches(text);
foreach (Match match in matches)
{
    GroupCollection groups = match.Groups;
    if (groups["string"].Success)
    {
        // handle string
    }
    else if (groups["comment"].Success)
    {
        // handle comment
    }
}
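Each `Match` also carries its position (`Index`) and `Length`, so the two "handle" branches could record spans directly instead of searching for the matched text again. The sketch below is only an illustration under assumptions: the `(Start, Length, Kind)` tuple is a hypothetical stand-in for whatever token type the highlighter needs, and whether offsets into the plain text line up with RichEditControl document positions is not something I have verified.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TokenSpanSketch
{
    static readonly Regex StringOrComment = new Regex(
        @"(?<string>'[^\\']*(?>\\.[^\\']*)*'|""[^\\""]*(?>\\.[^\\""]*)*"")" +
        @"|(?<comment>(?>/\*(?>[^*]|[\r\n]|(?>\*+(?>[^*/]|[\r\n])))*\*+/)|(?>--.*))");

    // Hypothetical token shape: start offset, length, and "string" or "comment".
    public static List<(int Start, int Length, string Kind)> FindSpans(string text)
    {
        var spans = new List<(int Start, int Length, string Kind)>();
        foreach (Match m in StringOrComment.Matches(text))
        {
            string kind = m.Groups["string"].Success ? "string" : "comment";
            // In .NET, '.' matches '\r', so a "--" comment on a CRLF line
            // ends with a stray '\r'; exclude it from the span.
            int length = m.Value.TrimEnd('\r').Length;
            spans.Add((m.Index, length, kind));
        }
        return spans;
    }

    static void Main()
    {
        foreach (var s in FindSpans("a = 'x' -- note\r\nb"))
            Console.WriteLine($"{s.Kind} at {s.Start}, length {s.Length}");
    }
}
```

Because matches come back in order and never overlap, spans built this way are already sorted and disjoint, which is what a gap-filling step such as AddPlainTextTokens in the question appears to rely on.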

Hopefully this helps!

P.S. I used regex101.com to test the regex, but to do so I had to escape the forward slashes and not escape the double quotes. I tried my best to add them back in, but I may have missed one or two.
