Efficient Algorithm to Replace All SubArrays with Other SubArrays

147 Views Asked by At

I have a byte array(can get very large, past 32 million bytes), and I need to replace some subarrays that with other subarrays of the same length. My current approach is search the byte array for all the subarrays I need to replace, every time I find one add the index of the subarray to a list, and continue.

My code is as follows. I have a nagging feeling that this cannot be efficient, because 32 million bytes is taking greater than 10 seconds to finish searching and replacing. I pass it 8 strings to replace, so it basically ends up searching for 16 subarrays.
Anyone see either any flaws with my algorithm or a more efficient one?


P.S. I don't actually replace them in this code, just find the indices. My code for that should be perfectly efficient.

 public class Search
{
    public List<int> positions;
    public List<int> lengths;
    private List<byte[]> stringsToSearchFor;
    public Search(List<string> strings){
        stringsToSearchFor = new List<byte[]>();
        positions = new List<int>();
        lengths = new List<int>();
        foreach (string tempString in strings){
            stringsToSearchFor.Add(Encoding.ASCII.GetBytes(tempString));
            stringsToSearchFor.Add(Encoding.Unicode.GetBytes(tempString));
        }
    }

    public void SearchBytes(byte[] haystack){
        int[] arrayOfInt = new int[stringsToSearchFor.Count];
        bool[] arrayOfBoolean = new bool[stringsToSearchFor.Count];
        for (var i = 0; i < haystack.Length; i++){
            byte currentByte = haystack[i];
            for (int stringCounter = 0; stringCounter < arrayOfBoolean.Length; stringCounter++)
            {
                byte[] stringLookFor = stringsToSearchFor.ElementAt(stringCounter);
                byte currentStringByte = stringLookFor[arrayOfInt[stringCounter]];
                //Saying the current byte is the desired one
                if (currentStringByte == currentByte)
                {
                    if (arrayOfInt[stringCounter] + 1 == stringLookFor.Length){
                        positions.Add(i - stringLookFor.Length + 1);
                        lengths.Add(stringLookFor.Length);
                        arrayOfInt[stringCounter] = 0;
                    }
                    else
                    {
                        arrayOfInt[stringCounter]++;
                    }
                }
                else
                {
                    arrayOfInt[stringCounter] = 0;
                }
            }
        }
        return;
    }



}
2

There are 2 best solutions below

0
On

You're basically doing a brute force search. Instead of doing brute force, you could do something more akin to the Knuth-Morris-Pratt or Rabin-Karp string searching algorithms (but instead of searching for character sequences in strings, you're searching for byte sequences in arrays).

0
On

I can tell from the fact that SearchBytes() only has 2 nested for loops that this brute-force search algorithm has a bug. A brute-force search of this kind needs 3 nested loops: for each starting position within the haystack, for each needle string, you need a loop to check whether the entire needle appears at that position in the haystack. (This innermost loop may abort early if it finds a character mismatch.)

Here's a concrete example: If the haystack is ABCABCABD and one of your needle strings is ABCABD, this string will not be found, despite the fact that it does occur. That's because as soon as your algorithm sees the second C in the haystack, it will conclude that it must start looking for the needle from the current position in the haystack, when in fact it needs to start looking from an earlier position.

Anyway, the time complexity of brute-force for searching for a single length-m needle string in a length-n haystack string is O(nm), which is pretty terrible if both are moderately long. John Kurlak suggested Knuth-Morris-Pratt or Rabin-Karp, and running either of these will certainly speed things up (as well as be correct :-P) if you are looking for a few large strings, but the algorithm that specifically targets efficiently finding multiple strings in a string is called the Aho-Corasick algorithm. It takes time O(n+s+k), where n is the haystack size, s is the sum of the sizes of the needle strings to search for, and k is the number of occurrences of any needle string -- which is pretty hard to beat.