Signature Patterns Parsing Efficiency

I convert binary files into hex strings and search through them for specific patterns the user has provided, much the way anti-virus engines handle their signature databases. If a pattern is found, the scan returns true.

The difficulties I'm facing are wildcard support and slow scanning speed: the user may have thousands of patterns, each 200 characters long or even more.

For example, this pattern is used to verify whether a file was compiled as C++; each "??" token is a wildcard that can match any single byte:

55 8B EC 53 8B 5D 08 56 8B 75 0C 85 F6 57 8B 7D 10 ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? 01

Patterns similar to that one, varying in length, are all stacked in a file, so I guess you get the general idea.

This is the code I'm using. It works fine, but it is extremely slow compared with other tools that do pattern scanning in mere seconds, like ExeInfoPE or Die:

static bool compare(string[] mask, byte[] buffer, int position)
{
    var i = 0; // index into the buffer, relative to position
    foreach (string x in mask) // loop through the mask
    {
        if (x.Contains("?")) // is the current mask entry a wildcard?
        {
        } // if so, skip the comparison
        else if (byte.Parse(x, System.Globalization.NumberStyles.HexNumber) == buffer[position + i]) // otherwise compare
        {
        } // matched, move on to the next byte
        else
            return false; // mismatch, pattern not found here
        ++i; // increment the index
    }
    return true; // pattern was found
}

Any ideas on how to tremendously increase the speed, while maintaining support for wildcards so that my tool can be used in the real world?

1 Answer

Answered by xanatos:

You could (and should) pre-convert the masks to byte?[] (or, more simply, to a byte[] plus a bool[] isWildcard that is true where there is a wildcard) if you have multiple files to match. The byte.Parse(x, System.Globalization.NumberStyles.HexNumber) is work that, if you have 1000 files, will be needlessly repeated 1000 times.
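
A minimal sketch of that pre-conversion (method names here are illustrative, not from the original code):

static byte?[] ParseMask(string[] mask)
{
    // Parse each hex token once, up front; null means "wildcard byte".
    var parsed = new byte?[mask.Length];
    for (var i = 0; i < mask.Length; i++)
    {
        parsed[i] = mask[i].Contains("?")
            ? (byte?)null
            : byte.Parse(mask[i], System.Globalization.NumberStyles.HexNumber);
    }
    return parsed;
}

static bool Compare(byte?[] mask, byte[] buffer, int position)
{
    // The hot loop is now a plain byte comparison, with no string parsing.
    for (var i = 0; i < mask.Length; i++)
    {
        if (mask[i].HasValue && mask[i].Value != buffer[position + i])
            return false; // non-wildcard byte didn't match
    }
    return true; // all non-wildcard bytes matched
}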

Another problem... how big is buffer? I do hope that you are reading only up to 4 KB of data from each file (or even less; you could pre-check all the masks to see which one is the longest) and that you aren't reading the whole file.
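
For example, a minimal sketch of reading only a fixed-size window from the start of each file (the helper name and window size are assumptions):

static byte[] ReadPrefix(string path, int windowSize)
{
    using (var fs = System.IO.File.OpenRead(path))
    {
        // Read at most windowSize bytes from the start of the file.
        var buffer = new byte[(int)System.Math.Min(windowSize, fs.Length)];
        var read = 0;
        while (read < buffer.Length)
        {
            var n = fs.Read(buffer, read, buffer.Length - read);
            if (n == 0) break; // unexpected end of file
            read += n;
        }
        return buffer;
    }
}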

Second, you could try to index the patterns, at least by the first byte of each pattern (but remember that you have to handle the case of patterns that begin with ??).

Some random comments:

  • it isn't clear what you are really trying to do; some more information is needed. How many files do you need to scan? 1 or 2, or many (10+, 100+)?

  • do your patterns start at a fixed position in the file, or are they at least always present at a certain point of the file (like the MZ signature of exe files, which is the first two bytes of the file), or can they be anywhere?
    Many programs "cheat": they don't load the whole file. They load small pieces, for example the first 4 KB and the last 4 KB, because they "know" that their pattern must be there. So if you load the whole file, you'll surely be slower.

  • how are the first bytes of your patterns distributed statistically? Say you have 1000 patterns, and a byte has 256 possible values. If the first byte is distributed evenly across all values, then on average about 4 patterns begin with each value, and indexing your patterns by first byte lets you quickly select a small subset to check. If instead some values are much more common (for example, 100 patterns all beginning with the same byte), the first byte alone isn't discriminating enough to narrow the set. A quick histogram tells you which case you're in (see the sketch after this list).
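
A minimal sketch of that histogram check (assuming the masks have already been pre-parsed to byte?[] as shown earlier):

static int[] FirstByteHistogram(byte?[][] masks)
{
    // Count how many patterns begin with each of the 256 possible byte values.
    var histogram = new int[256];
    foreach (var mask in masks)
    {
        if (mask.Length > 0 && mask[0].HasValue)
            histogram[mask[0].Value]++;
    }
    return histogram;
}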

In general I would have two structures:

Dictionary<byte, List<Pattern>> PatternsByFirstByte

and

List<Pattern> PatternsWithFirstCharacterWildcard

In this way, for each byte you only have to check a small number of patterns from PatternsByFirstByte, plus all the patterns in PatternsWithFirstCharacterWildcard. But this works only if the first byte is discriminating enough. A more advanced approach would be to create a full trie structure to hold/index the patterns, and/or to handle a wildcard at position zero (?? ?? xx yy) differently. Clearly (?? ?? xx yy) is equivalent to (xx yy) with a starting position >= 2, so you can put the pattern xx yy in the dictionary and record somewhere that the position must be >= 2, for example in the Pattern class itself, which would then be:

class Pattern
{
    public readonly byte?[] Bytes;      // null entries are wildcards
    public int MinimumStartingPosition; // leading ?? wildcards folded into this offset

    public Pattern(byte?[] bytes) { Bytes = bytes; }
}
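
A minimal sketch of building that index from the pre-parsed byte?[] masks (the helper name is illustrative, not from the original answer):

static Dictionary<byte, List<Pattern>> BuildPatternsByFirstByte(byte?[][] masks)
{
    var index = new Dictionary<byte, List<Pattern>>();
    foreach (var mask in masks)
    {
        // Strip leading wildcards and fold them into the minimum starting position.
        var skip = 0;
        while (skip < mask.Length && !mask[skip].HasValue)
            skip++;
        if (skip == mask.Length)
            continue; // all-wildcard pattern: matches anything, handle separately

        var trimmed = new byte?[mask.Length - skip];
        System.Array.Copy(mask, skip, trimmed, 0, trimmed.Length);
        var pattern = new Pattern(trimmed) { MinimumStartingPosition = skip };

        var first = trimmed[0].Value;
        if (!index.TryGetValue(first, out var list))
            index[first] = list = new List<Pattern>();
        list.Add(pattern);
    }
    return index;
}

When scanning, for each offset you would look up buffer[offset] in the dictionary and test only those patterns (skipping any whose MinimumStartingPosition is greater than the current offset), plus whatever remains in PatternsWithFirstCharacterWildcard.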