What's the fastest way to remove characters from an alpha-numeric string?

1.1k Views Asked by At

Say we have the following strings that we pass as parameters to the function below:

string sString = "S104";
string sString2 = "AS105";
string sString3 = "ASRVT106";

I want to be able to extract the numbers from the string to place them in an int variable. Is there a quicker and/or more efficient way of removing the letters from the strings than the following code?: (*These strings will be populated dynamically at runtime - they are not assigned values at construction.)

Code:

public GetID(string sCustomTag = null)
{
    m_sCustomTag = sCustomTag;
    try {
        m_lID = Convert.ToInt32(m_sCustomTag); }
        catch{
            try{
                int iSubIndex = 0;     
                char[] subString = sCustomTag.ToCharArray(); 

                //ITERATE THROUGH THE CHAR ARRAY
                for (int i = 0; i < subString.Count(); i++)     
                {
                    for (int j = 0; j < 10; j++)
                    {
                        if (subString[i] == j)
                        {
                            iSubIndex = i;
                            goto createID;
                        }
                    }
                }

            createID: m_lID = Convert.ToInt32(m_sCustomTag.Substring(iSubIndex));
            }
            //IF NONE OF THAT WORKS...
            catch(Exception e)
            {
                m_lID = 00000;
                throw e;
            }
         }
     }
 }

I've done things like this before, but I'm not sure if there's a more efficient way to do it. If it was just going to be a single letter at the beginning, I could just set the subStringIndex to 1 every time, but the users can essentially put in whatever they want. Generally, they will be formatted to a LETTER-then-NUMBER format, but if they don't, or they want to put in multiple letters like sString2 or sString3, then I need to be able to compensate for that. Furthermore, if the user puts in some whacked-out, non-traditional format like string sString 4 = S51A24;, is there a way to just remove any and all letters from the string?

I've looked about, and can't find anything on MSDN or Google. Any help or links to it are greatly appreciated!

5

There are 5 best solutions below

4
On BEST ANSWER

You can use a Regex, but it's probably faster to just do:

public int ExtractInteger(string str)
{
    var sb = new StringBuilder();
    for (int i = 0; i < str.Length; i++)
        if(Char.IsDigit(str[i])) sb.Append(str[i]);
    return int.Parse(sb.ToString());
}

You can simplify further with some LINQ at the expense of a small performance penalty:

public int ExtractInteger(string str)
{
    return int.Parse(new String(str.Where(c=>Char.IsDigit(c)).ToArray()));
}

Now, if you only want to parse the first sequence of consecutive digits, do this instead:

public int ExtractInteger(string str)
{
    return int.Parse(new String(str.SkipWhile(c=>!Char.IsDigit(c)).TakeWhile(c=>Char.IsDigit(c)).ToArray()));
}
3
On

While you asked "what's the fastest way to remove characters..." what you really appear to be asking is "how do I create an integer by extracting only the digits from the string"?

Going with this assumption, your first call to Convert.ToInt32 will be slow for the case where you have other than digits because of the exception throwing.

Let's try another approach. Let's think about each of the cases.

  1. The string always starts with a series of digits (e.g. 123ABC => 123)
  2. The string always ends with a series of digits (e.g. ABC123 => 123)
  3. A string has a series of contiguous digits in the middle (e.g. AB123C ==> 123)
  4. The digits are possibly noncontiguous (e.g. A77C12 => 7712)

Case 4 is the "safest" assumption (after all, it is a superset of Case 1, 2 and 3. So, we need an algorithm for that. As a bonus I'll provide algorithms specialized to the other cases.

The Main Algorithm, All Cases

Using in-place unsafe iteration of the characters of the string, which uses fixed, we can extract digits and convert them to a single number without the data copy in ToCharArray(). We can also avoid the allocations of, say, a StringBuilder implementation and a possibly slow regex solution.

NOTE: This is valid C# code though it's using pointers. It does look like C++, but I assure you it's C#.

public static unsafe int GetNumberForwardFullScan(string s)
{
    int value = 0;
    fixed (char* pString = s)
    {
        var pChar = pString;
        for (int i = 0; i != s.Length; i++, pChar++)
        {
            // this just means if the char is not between 0-9, we exit the loop (i.e. stop calculating the integer)                   
            if (*pChar < '0' || *pChar > '9')
                continue;

            // running recalculation of the integer
            value = value * 10 + *pChar - '0';
        }
    }

    return value;
}

Running this against any of the inputs: "AS106RVT", "ASRVT106", "106ASRVT", or "1AS0RVT6" results in pulling out 1, 0, 6 and calculating on each digit as

  • 0*10 + 1 == 1
  • 1*10 + 0 == 10
  • 10*10 + 6 == 106

Case 1 Only Algorithm (Digits at Start of String)

This algorithm is identical to the one above, but instead of continue we can break as soon as we reach a non-digit. This would be much faster if we can assume all the inputs start with digits and the strings are long.

Case 2 Only Algorithm (Digits at End of String)

This is almost the same as Case 1 Only except you have to

  • iterate from the end of the string to the beginning (aka backwards) stopping on the first non-digit
  • change the calculation to sum up powers of ten.

Both of those are a bit tricky, so here's what that looks like

public static unsafe int GetNumberBackward(string s)
{
    int value = 0;
    fixed (char* pString = s)
    {
        char* pChar = pString + s.Length - 1;
        for (int i = 0; i != -1; i++, pChar--)
        {              
            if (*pChar < '0' || *pChar > '9')
                break;

            value = (*pChar - '0') * (int)Math.Pow(10, i) + value;
        }
    }

    return value;
}

So each of the iteration of the calculation looks like

  • 6*100 + 0 == 6
  • 0*101 + 6 == 6
  • 1*102 + 6 == 106

While I used Math.Pow in these examples, you can find integer only versions that might be faster.

Cases 1-3 Only (i.e. All Digits Contiguous Somewhere in the String

This algorithm says to

  1. Scan all non-digits
  2. Then scan only digits
  3. First non-digit after that, stop

It would look like

public static unsafe int GetContiguousDigits(string s)
{
    int value = 0;
    fixed (char* pString = s)
    {
        var pChar = pString;

        // skip non-digits
        int i = 0;
        for (; i != s.Length; i++, pChar++)
            if (*pChar >= '0' && *pChar <= '9')
                break;

        for (; i != s.Length; i++, pChar++)
        {
            if (*pChar < '0' || *pChar > '9')
                break;

            value = value * 10 + *pChar - '0';
        }
    }

    return value;
}
0
On

Fastest is to parse the string without removing anything:

var s = "S51A24";
int m_lID = 0;

for (int i = 0; i < s.Length; i++)
{
    int d = s[i] - '0';
    if ((uint)d < 10)
        m_lID = m_lID * 10 + d;
}

Debug.Print(m_lID + ""); // 5124
2
On

You can use a regular expression. It's not necessarily faster, but it's more concise.

string sString = "S104";
string sString2 = "AS105";
string sString3 = "ASRVT106";

var re = new Regex(@"\d+");

Console.WriteLine(re.Match(sString).Value); // 104
Console.WriteLine(re.Match(sString2).Value); // 105
Console.WriteLine(re.Match(sString3).Value); // 106
2
On
    string removeLetters(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            char c = s[i];

            if (IsEnglishLetter(c))
            {
                s = s.Remove(i, 1);
            }
        }

        return s;
    }

    bool IsEnglishLetter(char c)
    {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }