Let's say I want to store a C array (of fixed length N) of structures:

typedef struct {
  type0 field0;
  type1 field1;
} foo_struct;

foo_struct array[N];

in a file, so that the program can read the array from the file, manipulate it, and write it back.

The program could use the write system call to write each field of type typen individually. When the program is later run to read the data back, and assuming sizeof(typen) is the same as before, it can allocate the array in memory and use read to fill in the fields one by one. I assume that, due to portability issues, there is no reliable way to fill a whole structure at once; please correct me if I am wrong.

But that is too slow for my purposes. Even if I read everything at once into a big buffer, I still have to copy the data into the fields. My data is "yuuge", but the manipulation is sporadic, so reading and copying would take far more time than the actual data access.

So I would prefer to use mmap, and I am assuming mmap works on an on-demand basis; again, please correct me if I am wrong.

Now, this may be faster, but I will have some trouble accessing the data in memory.

Since choosing the mapping address yourself is a bad idea, mmap picks the buffer address for you. As far as I know, that address is not guaranteed to be suitably aligned, and even if it were aligned to a multiple of sizeof(foo_struct), that would still not portably guarantee that I can access the fields through a pointer to the struct and the -> operator.

So I think I have to forget about the structure altogether and just treat my array as a series of chunks of sizes S0 = sizeof(field0), S1 = sizeof(field1), and S = S0 + S1, and calculate by hand where the fields of element M are, using pointer arithmetic:

buffer + M * S
buffer + M * S + S0

Even then, that pointer is not aligned, so if I want to read or write fieldn, I have to split the data into bytes and do it byte by byte, which is slow. There are not that many of these accesses, but there are many iterations of the whole process, so I still want it to be as fast as possible.
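
To make the offset arithmetic concrete, here is a minimal sketch. It assumes int and double as stand-ins for type0 and type1, and get_field1/set_field1 are hypothetical helper names, not part of any library. Instead of an explicit byte loop it uses memcpy, which is valid for any alignment and which optimizing compilers typically turn into a single load or store for small fixed sizes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed field types, standing in for type0 and type1. */
typedef int    type0;
typedef double type1;

/* Packed record layout: field0 immediately followed by field1. */
enum { S0 = sizeof(type0), S1 = sizeof(type1), S = S0 + S1 };

/* Read field1 of record m from a packed buffer holding n records. */
static type1 get_field1(const unsigned char *buffer, size_t n, size_t m)
{
    type1 v;
    assert(m < n);                              /* stay inside the buffer */
    memcpy(&v, buffer + m * S + S0, sizeof v);  /* safe at any alignment */
    return v;
}

/* Write field1 of record m in a packed buffer holding n records. */
static void set_field1(unsigned char *buffer, size_t n, size_t m, type1 v)
{
    assert(m < n);
    memcpy(buffer + m * S + S0, &v, sizeof v);
}
```

With a 3-record buffer, calling set_field1(buf, 3, 1, 2.5) and then get_field1(buf, 3, 1) gives back 2.5, without touching the neighbouring records.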

Is there a way to use mmap (or some other approach that avoids reading the whole yuuge file) without having to access the data field by field and byte by byte?

Please also point out if anything I wrote goes against Linux or common decency, as I am not completely sure that it does not.

There is 1 solution below.

Just for demonstration purposes, here is the standard way to read a struct array from a binary file and write it back:

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

struct omg{
        int num;
        char buff[122];
        double values[23];
        };

#define NNN (3*1024*1024)
#define FILENAME "omg.dat"
#define COUNTOF(a) (sizeof a/sizeof a[0])

struct omg array[NNN];

int main(void)
{
int fd;
int ret, ii, jj;

fprintf(stderr, "Sizeof array[0] is %zu\n", sizeof array[0] );
        /* initialise the array to nonsense */
for (ii=0; ii < COUNTOF(array); ii++) {
        array[ii].num=ii;
        sprintf(array[ii].buff, "Hello world %d", ii);
        for (jj=0; jj < COUNTOF(array[0].values); jj++) {
                array[ii].values[jj] = ii / (jj+1) ;
                }
        }

fd = open(FILENAME, O_RDWR|O_CREAT, 0660);
if (fd < 0) return 1;

ret = read(fd, array, sizeof array);
fprintf(stderr, "Read %d/ %zu\n", ret, sizeof array);

        /* modify the nonsense */
for (ii=0; ii < COUNTOF(array); ii++) {
        array[ii].num += 1;
        sprintf(array[ii].buff, "Hello world %d", array[ii].num);
        for (jj=0; jj < COUNTOF(array[0].values); jj++) {
                array[ii].values[jj] = array[ii].num / (jj+1) ;
                }
        }

ret = lseek(fd, 0, SEEK_SET);
fprintf(stderr, "Seek = %d\n", ret);


ret = write(fd, array, sizeof array);
fprintf(stderr, "Wrote %d/ %zu\n", ret, sizeof array);

close(fd);
return 0;
}

Result:

plasser@pisbak$ vi readstruct.c
plasser@pisbak$ cc -Wall -O2 readstruct.c
plasser@pisbak$ time ./a.out
Sizeof array[0] is 312
Read 981467136/ 981467136
Wrote 981467136/ 981467136

real    0m3.972s
user    0m1.689s
sys     0m0.782s

Now, I wouldn't call reading plus writing 900MB in 4 seconds slow. Most of the user CPU is probably consumed by the sprintf() calls.
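
For comparison, the same array can be mapped instead of read. The following is only a sketch under assumptions: map_records and unmap_records are hypothetical helper names, the file is assumed to be written and read by the same program on the same machine (so the struct layout matches, exactly as with the read() version above), and error handling is minimal. On Linux, pages of a file-backed mapping are normally faulted in on first access, and mmap returns a page-aligned address, so a[i].num can be used directly.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct omg {
    int num;
    char buff[122];
    double values[23];
};

/* Map `count` records of `path` read/write; returns NULL on failure. */
static struct omg *map_records(const char *path, size_t count)
{
    size_t len = count * sizeof(struct omg);
    struct stat st;
    struct omg *p;
    int fd = open(path, O_RDWR | O_CREAT, 0660);

    if (fd < 0) return NULL;
    /* Grow the file if it is too small to back the whole mapping. */
    if (fstat(fd, &st) < 0 ||
        (st.st_size < (off_t)len && ftruncate(fd, (off_t)len) < 0)) {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping outlives the descriptor */
    if (p == MAP_FAILED) return NULL;
    /* Page alignment implies alignment for struct omg. */
    assert((uintptr_t)p % _Alignof(struct omg) == 0);
    return p;
}

/* Flush changes to the file and drop the mapping; 0 on success. */
static int unmap_records(struct omg *p, size_t count)
{
    size_t len = count * sizeof(struct omg);
    if (msync(p, len, MS_SYNC) < 0) return -1;
    return munmap(p, len);
}
```

A caller would do something like a = map_records(FILENAME, NNN), modify a few a[i] in place, and then call unmap_records(a, NNN). Roughly speaking, only the pages that were actually touched have to be read from, or written back to, the file.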