getline() / strsep() combination causes segmentation fault

543 Views Asked by At

I'm getting a segmentation fault when running the code below.

It should basically read a .csv file with over 3M lines and do other stuff afterwards (not relevant to the problem), but after 207746 iterations it returns a segmentation fault. If I remove the p = strsep(&line,"|"); and just print the whole line it will print the >3M lines.

int ReadCSV (int argc, char *argv[]){

    char *line = NULL, *p;
    unsigned long count = 0;

    FILE *data;
    if (argc < 2) return 1;
    if((data = fopen(argv[1], "r")) == NULL){
        printf("the CSV file cannot be open");
        exit(0);
    }


    while (getline(&line, &len, data)>0) {

        p = strsep(&line,"|");  

        printf("Line number: %lu \t p: %s\n", count, p);
        count++;
    }

    free(line);
    fclose(data);

    return 0;
}

I guess it'd have to do with the memory allocation, but can't figure out how to fix it.

2

There are 2 best solutions below

0
On BEST ANSWER

A combination of getline and strsep often causes confusion, because both functions change the pointer that you pass them by pointer as the initial argument. If you pass the pointer that has been through strsep to getline again, you run the risk of undefined behavior on the second iteration.

Consider an example: getline allocates 101 bytes to line, and reads a 100-character string into it. Note that len is now set to 101. You call strsep, which finds '|' in the middle of the string, so it points line to what used to be line+50. Now you call getline again. It sees another 100-character line, and concludes that it is OK to copy it into the buffer, because len is still 101. However, since line points to the middle of the buffer now, writing 100 characters becomes undefined behavior.

Make a copy of line before calling strsep:

while (getline(&line, &len, data)>0) {
    char *copy = line;
    p = strsep(&copy, "|");  
    printf("Line number: %lu \t p: %s\n", count, p);
    count++;
}

Now line that you pass to getline is preserved between loop iterations.

0
On

Look at the expression getline(&line, &len, data) and read the manpage:

If *line is set to NULL and *len is set 0 before the call, then getline() will allocate a buffer for storing the line. This buffer should be freed by the user program even if getline() failed.

This should be the case on your first time round the loop (although we can't see where len is declared, let's just assume your real code does this correctly)

Alternatively, before calling getline(), *line can contain a pointer to a malloc(3)-allocated buffer *len bytes in size. If the buffer is not large enough to hold the line, getline() resizes it with realloc(3), updating *line and *len as necessary.

OK, so if line != NULL it must point to a buffer allocated by malloc of size len. The buffer allocated by your first call to getline (as above) satisfies this.

Note it's not good enough for line to point somewhere into that buffer, it must be the beginning.

Now look at the expression strsep(&line,"|") and read the manpage for that:

... This token is terminated by overwriting the delimiter with a null byte ('\0'), and *line is updated to point past the token

So, the first argument (line) is changed so that you can call strsep again with the same first argument, and get the next token. This means line is no longer a valid argument to getline, because it isn't the start of a malloc'd buffer (and the length len is also now wrong).

In practice, either

  1. getline will try to read len bytes into the buffer you gave it, but since you advanced line by the length of the first token, it writes off the end of your allocated block. This might just damage the heap rather than dying immediately
  2. getline will try to realloc the buffer you gave it, but since it isn't a valid allocated block, you get heap damage again.

While we're here, you also don't check p is non-NULL, but damaging line is the main problem.

Oh, and if you think the problem is allocation-related, try using valgrind - it generally finds the moment things first go wrong.