Reading lines with different data types from txt file in c


How can I read multiple lines from a txt file if I want to store the data in different variables? Each line contains the same sequence of data types: int, string, string, char, string, separated by tabs.

For example, a line in the txt file looks like:

11 \t I would like an apple \t What is your favourite car brand? \t b \t elephant 

Thanks for help in advance.

I tried to read it with fscanf("%d\t%s\t%s\t%c\t%s\n",..); but I can't read the strings, because %s cuts my sentence at the first space. It also reads only the first line, and I can't move on to the next lines.

arfneto On

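This is a CSV file, and scanf was written to consume this type of data: large files of tabular data. scanf is really good at that. But scanf, and also sscanf and fscanf, skips white space: most conversions skip leading white space (%c and %[ do not), and a white-space character in the format matches any amount of white space in the input. White space includes tabs, spaces and newlines.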

So your file is problematic because it uses TAB as the separator. Another issue is that many editors translate tabs to spaces up to the next tab stop, so tabs may not be recorded at all. Tabs as visible symbols are more common in document editors like Microsoft Word, which can show a symbol for each tab, line break, paragraph mark and so on. In vi on Unix/Linux/Mac you can use :set list to see each TAB as ^I.

Example

I will show an example using the 2 alternatives: [1] use sscanf and [2] parsing the line in code.

As usual, it is easier to use encapsulation and pointers and to treat each record as what it is, an object of some sort, so that is the approach I will use here.

This is the definition used for each record in the example:

typedef struct
{
    int  f_int;
    char f_string_1[80];
    char f_string_2[80];
    char f_char;
    char f_string_3[80];
} Record;

In practice this is better:

typedef struct
{
    int   f_int;
    char* f_string_1;
    char* f_string_2;
    char  f_char;
    char* f_string_3;
} P_Record;

It is trivial to convert one into the other, and the code includes functions that do it. The main reason to use pointers inside the record is to use just the needed amount of RAM, instead of 240 bytes for each and every set of strings.

the file used in the examples

11\tI would like an apple\tWhat is your favourite car brand?\tb\telephant
-11\tI would like an apple\tWhat is your favourite car brand?\tb\telephant
0\t \t \t \tStack Overflow

This is almost like the original example, but I removed the extra spaces. The blank fields in the last line have at least one space each; this is for testing how scanf consumes them. For the local parser it makes no difference.

\t is of course replaced by the delimiter in use.

functions in use in the code

Record* so_free(Record*);
char    so_get_delim(const char*, const char);
Record* so_parse(const char*, const size_t, const char);
Record* so_parse_sc(const char*, const size_t, const char);
int     so_show(Record*, const char*);
int     so_show_parms(const char* f_name, const char delim);

// conversion helpers
P_Record* so_free_pack(P_Record*);
P_Record* so_pack(Record*);
int       so_show_pack(P_Record*, const char*);
Record*   so_unpack(P_Record*);

These are fairly obvious, but:

  • so_parse gets a line and returns a Record with the extracted fields, by parsing the line.
  • so_parse_sc does the same, but uses sscanf

main for a test

int main(int argc, char** argv)
{
    const char* df_file    = "input.txt";
    const char  df_delim   = ',';
    char        line[1024] = {0};
    // file name: argv[1] or the default, truncated if too long
    snprintf(
        line, sizeof(line), "%s", argc > 1 ? argv[1] : df_file);

    char delim = df_delim;
    if (argc > 2) delim = so_get_delim(argv[2], df_delim);
    so_show_parms(line, delim);

    FILE* in = fopen(line, "r");
    if (in == NULL) return -1;
    size_t n_line = 0;
    char   r_msg[64];
    while (fgets(line, sizeof(line), in) != NULL)
    {
        // fgets keeps the '\n' when it fits: remove it
        line[strcspn(line, "\n")] = 0;
        n_line += 1;
        // local parser
        snprintf(
            r_msg, sizeof(r_msg), "\nRecord %zu\n", n_line);
        Record* one = so_parse(line, 1023, delim);
        if (one == NULL)
        {
            fprintf(stderr, "Ignored: %s", r_msg);
            continue;
        }
        so_show(one, r_msg);

        one = so_free(one);
        // using sscanf
        snprintf(
            r_msg, sizeof(r_msg),
            "\n[using sscanf]\nRecord %zu\n", n_line);
        one = so_parse_sc(line, 1023, delim);
        if (one == NULL)
        {
            fprintf(stderr, "Ignored: %s", r_msg);
            continue;
        }
        so_show(one, r_msg);
        one = so_free(one);
    };  // while
    fclose(in);
    return 0;
}

Two arguments are expected: the file name and the delimiter. The defaults are "input.txt" for the file and a comma for the delimiter. The delimiter can be entered as ";" or ; for a semicolon, "\t" for a TAB, or \nnn for a decimal value, like \064 for @.

As expected, so_parse works fine, but so_parse_sc cannot parse some lines when TAB is the separator.

output using comma as a delimiter


C: SO> p input.txt ","

 file is "input.txt", delimiter is ',' = 0x2C

Record 1
             int: 11
        string 1: "I would like an apple"
        string 2: "What is your favourite car brand?"
            char: 'b'
        string 3: "elephant"

[using sscanf]
Record 1
             int: 11
        string 1: "I would like an apple"
        string 2: "What is your favourite car brand?"
            char: 'b'
        string 3: "elephant"

Record 2
             int: -11
        string 1: "I would like an apple"
        string 2: "What is your favourite car brand?"
            char: 'b'
        string 3: "elephant"

[using sscanf]
Record 2
             int: -11
        string 1: "I would like an apple"
        string 2: "What is your favourite car brand?"
            char: 'b'
        string 3: "elephant"

Record 3
             int: 0
        string 1: " "
        string 2: " "
            char: ' '
        string 3: "Stack Overflow"

[using sscanf]
Record 3
             int: 0
        string 1: " "
        string 2: " "
            char: ' '
        string 3: "Stack Overflow"

C: SO>

output using TAB as a delimiter


C: SO> ..\x64\debug\soc23-1104-fread.exe input-tab.txt "\t"

 file is "input-tab.txt", delimiter is 0x9

Record 1
             int: 11
        string 1: "I would like an apple"
        string 2: "What is your favourite car brand?"
            char: 'b'
        string 3: "elephant"

[using sscanf]
Record 1
             int: 11
        string 1: "I would like an apple"
        string 2: "What is your favourite car brand?"
            char: 'b'
        string 3: "elephant"

Record 2
             int: -11
        string 1: "I would like an apple"
        string 2: "What is your favourite car brand?"
            char: 'b'
        string 3: "elephant"

[using sscanf]
Record 2
             int: -11
        string 1: "I would like an apple"
        string 2: "What is your favourite car brand?"
            char: 'b'
        string 3: "elephant"

Record 3
             int: 0
        string 1: "."
        string 2: "."
            char: '.'
        string 3: "Stack Overflow"

[using sscanf]
Record 3
             int: 0
        string 1: "."
        string 2: "."
            char: '.'
        string 3: "Stack Overflow"

C: SO>

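With TAB as the delimiter, sscanf cannot read a record with blank fields like the last one in the file: the delimiter is itself white space, so a TAB in the mask consumes the delimiters and the blank fields together. Note that the third record in the output above has "." in those fields instead of blanks.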

So why use scanf

These functions can parse the delimiters and convert types like string, char, float, double and int, in a single call. In general scanf is capable of processing and converting any valid CSV file. To check whether a file is valid we can try any online validator or read RFC 4180. There is no real formal definition of CSV, because the format predates the internet and the W3C by some time.

Here the mask used is

    char mask[] =
        "%dx%79[^x]x%79[^x]x%cx%79[^x\n]";

where x is the delimiter in use. It can parse the strings, the char and the int value. In production code:

  • the mask could be built more precisely, from the size of the fields, instead of using a fixed 79 for a field of 80 bytes. :)
  • we need to know if the first line has the field names and if they are needed (see the RFC). Here the first line has normal data.
  • we need to know if the fields are escaped and, if so, with which character. It is common for fields to be enclosed in " for example (see the RFC). Here the fields are not escaped, so they cannot contain a delimiter.
  • we have 5 specifiers, for 5 fields, so sscanf can return anything from EOF (typically -1) to 5.
  • CSV files are giant MxN tables of M records with N fields, so here all M lines must have N = 5 fields and 4 delimiters.
  • the x%79[^x]x means at most 79 chars that are not x, between two delimiters x.

complete C Code


#define _CRT_SECURE_NO_WARNINGS

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct
{
    int  f_int;
    char f_string_1[80];
    char f_string_2[80];
    char f_char;
    char f_string_3[80];
} Record;

typedef struct
{
    int   f_int;
    char* f_string_1;
    char* f_string_2;
    char  f_char;
    char* f_string_3;
} P_Record;

Record* so_free(Record*);
char    so_get_delim(const char*, const char);
Record* so_parse(const char*, const size_t, const char);
Record* so_parse_sc(const char*, const size_t, const char);
int     so_show(Record*, const char*);
int     so_show_parms(const char* f_name, const char delim);

// conversion helpers
P_Record* so_free_pack(P_Record*);
P_Record* so_pack(Record*);
int       so_show_pack(P_Record*, const char*);
Record*   so_unpack(P_Record*);

/// <summary>
/// defaults are "input.txt" for file name and
/// ',' comma for the delimiter
/// </summary>
/// <param name="argc"></param>
/// <param name="argv">
/// argv[1] is the file name
/// argv[2] is the delimiter. can be \nnn decimal or \t or
/// the delimiter itself
/// </param>
/// <returns></returns>
int main(int argc, char** argv)
{
    const char* df_file    = "input-tab.txt";
    const char  df_delim   = '\t';
    char        line[1024] = {0};
    // file name: argv[1] or the default, truncated if too long
    snprintf(
        line, sizeof(line), "%s", argc > 1 ? argv[1] : df_file);

    char delim = df_delim;
    if (argc > 2) delim = so_get_delim(argv[2], df_delim);
    so_show_parms(line, delim);

    FILE* in = fopen(line, "r");
    if (in == NULL) return -1;
    size_t n_line = 0;
    char   r_msg[64];
    while (fgets(line, sizeof(line), in) != NULL)
    {
        // fgets keeps the '\n' when it fits: remove it
        line[strcspn(line, "\n")] = 0;
        n_line += 1;
        // local parser
        snprintf(
            r_msg, sizeof(r_msg), "\nRecord %zu\n", n_line);
        Record* one = so_parse(line, 1023, delim);
        if (one == NULL)
        {
            fprintf(stderr, "Ignored: %s", r_msg);
            continue;
        }
        so_show(one, r_msg);

        one = so_free(one);
        // using sscanf
        snprintf(
            r_msg, sizeof(r_msg),
            "\n[using sscanf]\nRecord %zu\n", n_line);
        one = so_parse_sc(line, 1023, delim);
        if (one == NULL)
        {
            fprintf(stderr, "Ignored: %s", r_msg);
            continue;
        }
        so_show(one, r_msg);
        one = so_free(one);
    };  // while
    fclose(in);
    return 0;
}

/// <summary>
/// free...
/// </summary>
/// <param name="one"></param>
/// <returns>returns NULL</returns>
Record* so_free(Record* one)
{
    if (one == NULL) return NULL;
    free(one);
    return NULL;
}

/// <summary>
/// free a packed record
/// </summary>
/// <param name="one"></param>
/// <returns>returns NULL</returns>
P_Record* so_free_pack(P_Record* one)
{
    if (one == NULL) return NULL;
    free(one->f_string_1);
    free(one->f_string_2);
    free(one->f_string_3);
    free(one);
    return NULL;
}

/// <summary>
/// get argument from arg.
/// </summary>
/// <param name="arg">can be a char or \t for a tab
/// or \nnn for a decimal value</param>
/// <param name="df_delim">default delimiter</param>
/// <returns>delimiter</returns>
char so_get_delim(const char* arg, const char df_delim)
{  // argument should be \t or \nnn decimal
    char delim = df_delim;
    if (arg[0] == '\\')
    {
        if (arg[1] == 't')
            delim = '\t';
        else
        {
            if (strlen(arg) > 3)
                delim = (arg[1] - '0') * 100 +
                        (arg[2] - '0') * 10 +
                        (arg[3] - '0');
            else
                delim = df_delim;
        }
    }
    else
        delim = arg[0];
    return delim;
}

/// <summary>
/// returns a new packed record from a record
/// </summary>
/// <param name="src"></param>
/// <returns></returns>
P_Record* so_pack(Record* src)
{
    size_t len = 0;
    if (src == NULL) return NULL;
    // allocate the struct itself, not just a pointer to it
    P_Record* one = malloc(sizeof(P_Record));
    if (one == NULL) return NULL;
    one->f_int      = src->f_int;  // field 1
    len             = strlen(src->f_string_1);
    one->f_string_1 = malloc(1 + len);
    if (one->f_string_1 == NULL)
    {
        free(one);
        return NULL;
    }
    strcpy(one->f_string_1, src->f_string_1);  // field 2
    // now for the 2nd string
    len             = strlen(src->f_string_2);
    one->f_string_2 = malloc(1 + len);
    if (one->f_string_2 == NULL)
    {
        free(one->f_string_1);
        free(one);
        return NULL;
    }
    strcpy(one->f_string_2, src->f_string_2);  // field 3
    // now for the single char
    one->f_char = src->f_char;  // field 4;
    // now for the last string
    len             = strlen(src->f_string_3);
    one->f_string_3 = malloc(1 + len);
    if (one->f_string_3 == NULL)
    {
        free(one->f_string_1);
        free(one->f_string_2);
        free(one);
        return NULL;
    }
    strcpy(one->f_string_3, src->f_string_3);  // field 5
    return one;
}

/// <summary>
/// parse a line to get a Record
/// </summary>
/// <param name="line"></param>
/// <param name="limit"></param>
/// <param name="delim"></param>
/// <returns>pointer to a new Record</returns>
Record* so_parse(
    const char* line, size_t limit, const char delim)
{
    if (line == NULL) return NULL;
    size_t len = strlen(line);
    if (len > limit) return NULL;
    const size_t n_tabs  = 4;    // 5 fields
    size_t       tabs[5] = {0};  // tabs[0] counts delimiters
    // check line format
    for (size_t i = 0; i < len; i += 1)
    {
        if (line[i] != delim) continue;
        // one more delimiter would overflow tabs[]
        if (tabs[0] == n_tabs) return NULL;
        tabs[0] += 1;
        tabs[tabs[0]] = i;
    }
    if (tabs[0] != n_tabs) return NULL;
    // line has 5 fields:
    //   create record
    //   extract fields
    Record* nr = malloc(sizeof(Record));
    if (nr == NULL) return NULL;
    // first field is int
    nr->f_int         = atoi(line);
    const char* begin = NULL;
    size_t      fl    = 0;  // field length

    // now for the 1st string: between tabs[1] and tabs[2]
    begin = line + tabs[1] + 1;
    fl    = tabs[2] - tabs[1] - 1;
    // reject a field that does not fit, with its terminator
    if (fl >= sizeof(nr->f_string_1)) return so_free(nr);
    memcpy(nr->f_string_1, begin, fl);
    nr->f_string_1[fl] = 0;  // terminate string
    // now for the 2nd string: between tabs[2] and tabs[3]
    begin = line + tabs[2] + 1;
    fl    = tabs[3] - tabs[2] - 1;
    if (fl >= sizeof(nr->f_string_2)) return so_free(nr);
    memcpy(nr->f_string_2, begin, fl);
    nr->f_string_2[fl] = 0;  // terminate string
    // now for the single char
    // format: <tab3><field><tab4>
    nr->f_char = line[tabs[3] + 1];  // 1st char after tab3
    // now for the last string: from tabs[4] to end of line
    begin = line + tabs[4] + 1;
    fl    = len - tabs[4] - 1;
    if (fl >= sizeof(nr->f_string_3)) return so_free(nr);
    memcpy(nr->f_string_3, begin, fl);
    nr->f_string_3[fl] = 0;  // terminate string
    return nr;
}

/// <summary>
/// build a record from a line, using sscanf
/// </summary>
/// <param name="line"></param>
/// <param name="limit"></param>
/// <returns>pointer to Record</returns>
Record* so_parse_sc(
    const char* line, size_t limit, const char delim)
{
    if (line == NULL) return NULL;
    // should use the size of the strings and not a fixed 79
    // the question used ("%d\t%s\t%s\t%c\t%s\n",..)
    char mask[] =
        "%dx%79[^x]x%79[^x]x%cx%79[^x\n]";
    // change mask for delimiter in use:
    // every 'x' in the mask is a placeholder
    for (int i = 0; mask[i] != 0; i += 1)
        if (mask[i] == 'x') mask[i] = delim;
    size_t len = strlen(line);
    if (len > limit) return NULL;
    Record lcl;
    int    res = sscanf(
        line, mask, &lcl.f_int, lcl.f_string_1, lcl.f_string_2,
        &lcl.f_char, lcl.f_string_3);
    if (res != 5) return NULL;
    Record* nr = malloc(sizeof(Record));
    if (nr == NULL) return NULL;
    *nr = lcl;
    return nr;
}

/// <summary>
/// display Record contents
/// </summary>
/// <param name="one"></param>
/// <param name="msg"></param>
/// <returns>0 for success or -1</returns>
int so_show(Record* one, const char* msg)
{
    if (one == NULL) return -1;
    if (msg != NULL) printf("%s", msg);
    printf("\t     int: %d\n", one->f_int);
    printf("\tstring 1: \"%s\"\n", one->f_string_1);
    printf("\tstring 2: \"%s\"\n", one->f_string_2);
    printf("\t    char: '%c' \n", one->f_char);
    printf("\tstring 3: \"%s\"\n", one->f_string_3);
    return 0;
}

/// <summary>
/// display P_Record contents
/// </summary>
/// <param name="one"></param>
/// <param name="msg"></param>
/// <returns></returns>
int so_show_pack(P_Record* one, const char* msg)
{
    if (one == NULL) return -1;
    if (msg != NULL) printf("%s", msg);
    printf("\t     int: %d\n", one->f_int);
    printf("\tstring 1: \"%s\"\n", one->f_string_1);
    printf("\tstring 2: \"%s\"\n", one->f_string_2);
    printf("\t    char: '%c' \n", one->f_char);
    printf("\tstring 3: \"%s\"\n", one->f_string_3);
    return 0;
}

/// <summary>
/// show file name and delimiter in use
/// </summary>
/// <param name="f_name"></param>
/// <param name="delim"></param>
/// <returns>0</returns>
int so_show_parms(const char* f_name, const char delim)
{
    if (f_name == NULL) return -1;
    if (isprint((unsigned char)delim))
        printf(
            "\f file is \"%s\", delimiter is '%c' = "
            "0x%X\n",
            f_name, delim, delim);
    else
        printf(
            "\f file is \"%s\", delimiter is 0x%x\n",
            f_name, delim);
    return 0;
}

/// <summary>
/// convert from P_Record to Record
/// </summary>
/// <param name="src"></param>
/// <returns>pointer</returns>
Record* so_unpack(P_Record* src)
{
    size_t len = 0;
    if (src == NULL) return NULL;
    Record* one = malloc(sizeof(Record));
    if (one == NULL) return NULL;
    one->f_int = src->f_int;
    // each string must fit in its array, with the terminator
    if (strlen(src->f_string_1) >= sizeof(one->f_string_1))
    {
        free(one);
        return NULL;
    }
    strcpy(one->f_string_1, src->f_string_1);  // field 2
    // now for the 2nd string
    if (strlen(src->f_string_2) >= sizeof(one->f_string_2))
    {
        free(one);
        return NULL;
    }
    strcpy(one->f_string_2, src->f_string_2);  // field 3
    // now for the single char
    one->f_char = src->f_char;  // field 4;
    // now for the last string
    if (strlen(src->f_string_3) >= sizeof(one->f_string_3))
    {
        free(one);
        return NULL;
    }
    strcpy(one->f_string_3, src->f_string_3);  // field 5
    return one;
}

// https://stackoverflow.com/questions/77423959/
// reading-lines-with-different-data-types-from-
// txt-file-in-c