Parse words with strtok


I want to tell strtok() to treat everything except alphanumeric characters as delimiters.

My attempt is based on the example from the reference:

/* strtok example */
#include <stdio.h>
#include <string.h>

int main ()
{
  char str[] ="- This, a sample string.";
  char * pch;
  printf ("Splitting string \"%s\" into tokens:\n",str);
  pch = strtok (str," ,.-");
  while (pch != NULL)
  {
    printf ("%s\n",pch);
    pch = strtok (NULL, " ,.-");
  }
  return 0;
}

However, I am going to parse real text files (containing reviews for a site). Currently I check which other delimiters occur and extend the second argument of strtok(). For example, I saw a [, so I changed it to " ,.-[" and so on, but I might miss something, and a new text file may contain a new delimiter.

Can't I do something smarter (and actually correct, because this is not)?

For example if I get:

[Hello_sir I'm George]

I would like to get these tokens:

Hello
sir
I
m
George

The problem is that I don't know in advance which characters are the delimiters.

I would like to say: use everything except alphanumeric characters as delimiters.


EDIT

I thought of going character by character and checking whether each one is alphanumeric, but I was hoping for something built-in, such as passing strtok() a suitable argument.


There are 2 best solutions below

BEST ANSWER

The only way to do that with strtok (without overwriting the source string's non-alphanumeric characters with something else) would be to pass a delimiter string containing all the non-alphanumeric characters. You could build this once at run time, like this:

#include <ctype.h> /* isalnum */

static char delims[256]; /* this is oversized */

...

void
initdelims(void)
{
    int i;
    int j = 0;
    /* start at 1: a 0 byte would terminate the delimiter string */
    for (i = 1; i < 256; i++)
    {
        if (!isalnum(i))
            delims[j++] = (char)i;
    }
    delims[j] = 0; /* this is unnecessary as statics are initialised to zero */
}

Then use delims as your delimiter string.
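As a rough sketch of how this fits together, the delimiter table and the strtok loop can be wrapped in one function. The name split_alnum and its out/max parameters are made up for illustration and are not part of the answer above; the sketch also assumes the default "C" locale, where isalnum accepts only ASCII letters and digits.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: splits `text` in place on every non-alphanumeric
   byte and stores up to `max` token pointers in `out`.
   Returns the number of tokens stored. */
size_t split_alnum(char *text, char *out[], size_t max)
{
    static char delims[256]; /* zero-initialised, filled on the first call */
    size_t n = 0;
    char *tok;

    assert(text != NULL && out != NULL);

    if (delims[0] == '\0') {
        int i, j = 0;
        for (i = 1; i < 256; i++)   /* skip 0: it would end the string */
            if (!isalnum(i))
                delims[j++] = (char)i;
    }

    /* strtok overwrites each delimiter it finds with '\0' */
    for (tok = strtok(text, delims); tok != NULL && n < max;
         tok = strtok(NULL, delims))
        out[n++] = tok;

    return n;
}
```

Called on a writable copy of [Hello_sir I'm George], this should yield the five tokens from the question (Hello, sir, I, m, George). Because strtok modifies its input and keeps hidden state between calls, the text must not be a string literal and the function is not reentrant.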

However, this is both ugly and inefficient. You would be better off writing a hand-rolled parser, borrowing the source of strtok if necessary.
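One possible shape for such a hand-rolled parser is sketched below. It is an illustration only, not the strtok source: the name next_word and its cursor/len interface are assumptions. It skips non-alphanumeric characters, then measures the following run of alphanumeric ones, and never writes to the input.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: finds the next run of alphanumeric characters at or
   after *cursor. On success it returns a pointer to the start of the run,
   stores the run's length in *len, and advances *cursor past it.
   Returns NULL when no token is left. The input is never modified. */
const char *next_word(const char **cursor, size_t *len)
{
    const char *p, *start;

    assert(cursor != NULL && *cursor != NULL && len != NULL);
    p = *cursor;

    /* skip delimiters: anything that is not alphanumeric */
    while (*p != '\0' && !isalnum((unsigned char)*p))
        p++;
    start = p;

    /* consume the token; isalnum('\0') is false, so this stops at the end */
    while (isalnum((unsigned char)*p))
        p++;

    *cursor = p;
    *len = (size_t)(p - start);
    return *len > 0 ? start : NULL;
}
```

Since the tokens are not null-terminated, a caller would print each one with something like printf("%.*s\n", (int)len, word) inside a loop that runs until next_word returns NULL. Unlike strtok, this works on string literals and keeps no hidden state, so several strings can be tokenized at once.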

2
On

You can collect the non-alphanumeric characters once, in one pass, in one string, then use that string as the delimiter set for strtok():

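/* needs <ctype.h> for isalnum and <limits.h> for CHAR_BIT */
char delims[(1 << CHAR_BIT) + 1] = { 0 };
/* start at 1: storing '\0' first would leave delims an empty string */
for (int i = 1, j = 0; i < (1 << CHAR_BIT); i++) {
    if (!isalnum(i)) {
        delims[j++] = (char)i;
    }
}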

pch = strtok(str, delims);
while (pch != NULL)
{
    printf ("%s\n",pch);
    pch = strtok(NULL, delims);
}