C: Low level character formatting: (enter+newline) using fgetc


I'm working on a project in C that reads a text file and converts it to an array of booleans. First I read the file into a string of size n (an unsigned char array), then I use a function to convert that string to a boolean array of size n * 8. The function works perfectly, no questions on that.

I get the string from the file using this code:

unsigned char *Data_in; // define pointer to string
int i;

FILE* sp = fopen("file.txt", "r"); //open file

fseek(sp, 0, SEEK_END);            // points sp to the end of file
int data_dim = ftell(sp);          // Returns the position of the pointer (amount of bytes from beginning to end)
rewind(sp);                        // points sp to the beginning of file

Data_in = (unsigned char *) malloc ( data_dim * sizeof(unsigned char) ); //allocate memory for string
unsigned char carac; //define auxiliary variable 

for(i=0; feof(sp) == 0; i++)       // while end of file is not reached (0)
{
   carac = fgetc(sp);              //read character from file to char
   Data_in[i] = carac;             // put char in its corresponding position
}
//

fclose(sp);                        //close file

The thing is that I have a text file made with Notepad on Windows XP. Inside it I have this 4-character string ":\n\nC" (colon, enter key, enter key, capital C).

This is what it looks like with HxD (hex editor): 3A 0D 0A 0D 0A 43.

This table makes it clearer:

character             hex      decimal    binary
 :                    3A       58         0011 1010
 \n (enter+newline)   0D 0A    13 10      0000 1101 0000 1010    
 \n (enter+newline)   0D 0A    13 10      0000 1101 0000 1010
 C                    43       67         0100 0011

Now, I execute the program, which prints that part in binary, so I get:

character      hex      decimal      binary
 :             3A         58         0011 1010
 (newline)     0A         10         0000 1010    
 (newline)     0A         10         0000 1010
 C             43         67         0100 0011

Well, now that this is shown, I ask the questions:

  • Is the reading correct?
  • If so, why does it take the 0Ds out?
  • How does that work?

There are 4 best solutions below

1
On

Make the fopen binary:

fopen("file.txt", "rb");
                    ^

Otherwise your standard library, reading in text mode on Windows, will translate each \r\n pair into a single \n, so the \r (0x0D) bytes never reach your program.


As a side note, opening the file in binary mode also avoids another problem: in text mode on DOS and Windows, a 0x1A byte (Ctrl-Z) in the middle of the file is treated as EOF.
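To make the effect of "rb" concrete, here is a minimal sketch. The helper name roundtrip_binary and the file path passed to it are hypothetical, not part of the question's code. It writes the six bytes from the hex dump in binary mode, then reads them back with fgetc on a stream opened with "rb".

```c
#include <stdio.h>

/* Hypothetical helper: writes the six bytes from the question to `path`
   in binary mode, reads them back in binary mode into `out` (at most
   `cap` bytes), and returns how many bytes came back, or -1 on failure. */
static long roundtrip_binary(const char *path, unsigned char *out, size_t cap)
{
    static const unsigned char sample[] = { 0x3A, 0x0D, 0x0A, 0x0D, 0x0A, 0x43 };

    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;
    size_t written = fwrite(sample, 1, sizeof sample, fp);
    if (fclose(fp) != 0 || written != sizeof sample)
        return -1;

    fp = fopen(path, "rb");           /* "rb": no \r\n translation */
    if (fp == NULL)
        return -1;
    size_t n = 0;
    int c;                            /* int, so EOF can be detected */
    while (n < cap && (c = fgetc(fp)) != EOF)
        out[n++] = (unsigned char)c;
    fclose(fp);
    return (long)n;
}
```

With both streams in binary mode, all six bytes should come back, including both 0x0D bytes. Opening the second stream with "r" on Windows would be expected to return only four.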

0
On

It is because you're treating the file as a text file. If you treat it as a binary file, you will see both characters (0D and 0A). To do so, use "rb" as the mode when opening the file. Also consider using fread to read the file contents.
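A minimal sketch of the fread approach, assuming the whole file fits in memory. The helper name read_all is hypothetical. It relies on fseek/ftell to get the size, which works on common platforms for regular files opened in binary mode, though the C standard does not strictly guarantee SEEK_END on binary streams.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads all of `path` in binary mode.
   Returns a malloc'd buffer (the caller frees it) and stores the number
   of bytes read in *len, or returns NULL on failure. */
static unsigned char *read_all(const char *path, size_t *len)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    unsigned char *buf = NULL;
    long size = -1;
    if (fseek(fp, 0, SEEK_END) == 0)
        size = ftell(fp);             /* file size in bytes, or -1 on error */
    if (size >= 0 && fseek(fp, 0, SEEK_SET) == 0)
        buf = malloc(size > 0 ? (size_t)size : 1);

    if (buf != NULL) {
        *len = fread(buf, 1, (size_t)size, fp);
        if (ferror(fp)) {             /* a read error, not just a short file */
            free(buf);
            buf = NULL;
        }
    }
    fclose(fp);
    return buf;
}
```

Because the stream is opened with "rb", the length stored in *len matches what a hex editor shows, so the buffer can be passed directly to the byte-to-boolean conversion.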

0
On

In addition to the "rb" issue, there's one more error: you'll read an extra character at the end, because feof(sp) still returns 0 after the last character has been read. It returns nonzero only after you have attempted to read past EOF. This is a common beginner's mistake. The idiomatic C code to iterate over input characters is

int c;   /* int, not char due to EOF. */

while ((c = fgetc(sp)) != EOF) {
   /* Work with c. */
}
3
On

The other answers have discussed binary vs text mode input.

Your code actually has a separate problem in it. This idiom is for Pascal, not C:

for (i = 0; feof(sp) == 0; i++)
{
   carac = fgetc(sp);
   Data_in[i] = carac;
}

The trouble is that when fgetc() returns EOF, you store it as if it were a character (probably mapping it to ÿ, y-umlaut, U+00FF, LATIN SMALL LETTER Y WITH DIAERESIS). The feof() test is misplaced; it does not detect EOF in advance of the attempt to read the next character. Additionally, the function fgetc() and its relatives getc() and getchar() all return an int, not a char. You must learn to use the standard C idiom:

int c;
for (i = 0; (c = fgetc(sp)) != EOF; i++)
   Data_in[i] = c;

The idiom is the combination of assignment and test. The counting around it is less standard; in fact, it is likely to be fairly uncommon. But it is not wrong; it is applicable to your program.
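As a sketch of the same idiom with a bound on the index, so the loop cannot write past the end of the buffer, something like the following could be used. The helper name read_chars and its parameters are hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: copies at most `cap` bytes from `fp` into `dst`
   and returns how many bytes were stored. */
static size_t read_chars(FILE *fp, unsigned char *dst, size_t cap)
{
    size_t i = 0;
    int c;                            /* int, so EOF stays distinct from 0xFF */

    while (i < cap && (c = fgetc(fp)) != EOF)
        dst[i++] = (unsigned char)c;  /* store only real characters */
    return i;
}
```

Since c is an int, a genuine 0xFF byte in the file is stored like any other byte, and only the EOF return value ends the loop.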

There's no need to use feof() in most C code; virtually any time you use it, it is a mistake. Not always; it exists for a purpose. But that purpose is to distinguish between EOF and an error after a function such as fgetc() has returned EOF, not to test whether you've reached the EOF yet before a reading function says it has reached EOF. (In all my hundreds of programs, I don't think there are more than a very few references to feof(): 2884 source files, 18 references to feof(), and most of those in code originally written by other people.)
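To illustrate that legitimate use, here is a minimal sketch in which the stream state is queried only after fgetc() has already returned EOF. The helper name count_bytes is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: counts the bytes remaining in `fp`.
   Returns the count, or -1 if the loop stopped because of a read error. */
static long count_bytes(FILE *fp)
{
    long n = 0;
    int c;

    while ((c = fgetc(fp)) != EOF)
        n++;

    /* fgetc() has returned EOF; only now is it meaningful to ask why. */
    if (ferror(fp))
        return -1;                    /* read error */
    return n;                         /* feof(fp) is nonzero: real end of file */
}
```

The loop itself never calls feof(); ferror() (or feof()) is consulted once, after the read function has reported that no more data is coming.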