I convert binary files into hex strings and search through them for specific patterns that the user has provided, the same way anti-viruses handle their signature databases. If a pattern is found, the function returns true.
The difficulties I'm facing are wildcard support and slow scanning speed: the user has thousands of patterns, each up to 200 characters long or even more.
For example, this pattern is used to verify whether a file was compiled under C++, where the "?" character is a wildcard (it can match any one byte):
55 8B EC 53 8B 5D 08 56 8B 75 0C 85 F6 57 8B 7D 10 ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? 01
Patterns similar to that one, of varying lengths, are all stacked in a file, so I guess you get the general idea.
This is the code that I'm using (it works fine, but is extremely slow compared with other tools that do pattern scanning in mere seconds, like ExeInfoPE or Die):
static bool compare(string[] mask, byte[] buffer, int position)
{
    var i = 0; // index into the buffer, relative to position
    foreach (string x in mask) // loop through the mask
    {
        if (x.Contains("?")) // is the current mask position a wildcard?
        {
            // if so, skip the comparison
        }
        else if (byte.Parse(x, System.Globalization.NumberStyles.HexNumber) == buffer[position + i])
        {
            // byte matched, move on to the next one
        }
        else
        {
            return false; // mismatch, pattern not found
        }
        ++i; // increment the index
    }
    return true; // pattern was found
}
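To be clear, compare is called once per candidate offset; simplified, the scan over a buffer looks roughly like this:

```csharp
// Simplified driver: slide the mask over the buffer one byte at a time
// and report whether the pattern occurs anywhere in it.
static bool scan(string[] mask, byte[] buffer)
{
    for (int position = 0; position + mask.Length <= buffer.Length; ++position)
        if (compare(mask, buffer, position))
            return true; // pattern found at this offset
    return false; // no offset matched
}
```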
Any ideas on how to tremendously increase the speed, while maintaining support for wildcards so that my tool can be used in the real world?
You could (and should) pre-convert the masks to byte?[] (or, more simply, to a byte[] plus a bool[] isWildcard that is true where there is a wildcard) if you have multiple files to match. The byte.Parse(x, System.Globalization.NumberStyles.HexNumber) is work that, if you have 1000 files, will be uselessly repeated 1000 times.
Another problem... how big is buffer? I do hope that you are reading only up to 4 KB of data from each file (and even less: you could pre-check all the masks to see which one is the longest) and that you aren't reading the whole file.
Second, you could try to index the patterns, at least by the first byte of each pattern (but remember that you have to handle the case of patterns that begin with ?).
Some random comments:
it isn't clear what you are really trying to do, so some more information is needed. How many files do you need to scan? 1 or 2, or many (10+, 100+)?
do your patterns start at a fixed position in the file, or at least always appear at a certain point (like the MZ signature of exe files, which is the first two characters of the file), or can they be anywhere?
Many programs "cheat": they don't load the whole file. They load small pieces, for example the first 4 KB and the last 4 KB. They "know" that their pattern must be there, so if you load the whole file, you'll surely be slower.
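The pre-conversion of the masks suggested above could be sketched like this (a sketch, assuming masks are space-separated hex bytes as in the question; names are mine):

```csharp
// Parse a textual mask like "55 8B ?? EC" once, up front.
// A null entry represents a wildcard, so no byte.Parse happens while scanning.
static byte?[] ParseMask(string mask)
{
    string[] parts = mask.Split(' ');
    var result = new byte?[parts.Length];
    for (int i = 0; i < parts.Length; ++i)
        result[i] = parts[i].Contains("?")
            ? (byte?)null
            : byte.Parse(parts[i], System.Globalization.NumberStyles.HexNumber);
    return result;
}

// The comparison then becomes a plain byte compare, with no string work.
static bool Compare(byte?[] mask, byte[] buffer, int position)
{
    for (int i = 0; i < mask.Length; ++i)
        if (mask[i].HasValue && mask[i].Value != buffer[position + i])
            return false; // mismatch on a non-wildcard byte
    return true;
}
```

ParseMask runs once per pattern when the signature file is loaded; Compare then runs millions of times during scanning without any parsing.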
In general I would have two structures:
Dictionary<byte, List<Pattern>> PatternsByFirstByte;
and
List<Pattern> PatternsWithFirstCharacterWildcard;
In this way, for each byte you only have to check a small number of patterns from PatternsByFirstByte, plus all the patterns in PatternsWithFirstCharacterWildcard. But this works only if the first byte is discriminant enough. More advanced would be building a full trie structure to hold/index the patterns, and/or handling a wildcard at position zero (?? ?? xx yy) in a different way. It is clear that (?? ?? xx yy) is equivalent to (xx yy) with starting position >= 2. So you can put the pattern xx yy in the dictionary and record somewhere that the position must be >= 2, for example in the Pattern class itself, which would then be:
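A minimal sketch of such a Pattern class (the field names are assumptions, not the answerer's exact code):

```csharp
class Pattern
{
    // Pre-parsed signature bytes; a null entry marks a wildcard position.
    public byte?[] Bytes;

    // Minimum offset at which the pattern may start:
    // "?? ?? xx yy" is stored as "xx yy" with MinPosition = 2,
    // so a match at buffer offset p only counts when p >= MinPosition.
    public int MinPosition;
}
```

With that, a match found at offset p for a pattern whose leading wildcards were stripped is only accepted when p >= MinPosition, which keeps such patterns indexable by their first concrete byte.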