Compress or size-optimize a large sparse array in C

/* The following code is compiled into a library (test.a) */
typedef struct {
    short x[5];
} weight_t;

typedef struct {
    weight_t wgt;
    char *ptr;
} tbl_t;

/* huge static array and 70% of entries are empty/zero */
static tbl_t tbl[] = {
{ { 0x0102, 0x0102, 0x0102, 0x0102, 0x0102, }, 0 }, /* 0000 */
{ { 0x0103, 0x0103, 0x0103, 0x0103, 0x0103, }, 0 }, /* 0001 */
{ { 0x0104, 0x0104, 0x0104, 0x0104, 0x0104, }, 0 }, /* 0002 */
{ { 0x0105, 0x0105, 0x0105, 0x0105, 0x0105, }, 0 }, /* 0003 */

{ { 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, }, 0 }, /* 3134c */
{ { 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, }, 0 }, /* 3134d */
/* ... */

}; /* more than one million entries */

Since tbl[] contains more than one million entries (even though 70% of them are empty/zero), the library file (test.a) is quite large.

I know that using a hash table to store only the non-empty entries could reduce the size of tbl[], but is there any way to compress or otherwise size-optimize the library file (test.a) without changing tbl[]?


There are tons of places in other libraries that access the value of x in struct weight_t, so changing the current static-array implementation of tbl would involve a huge effort.

    /* tbl_t obj ==> tbl_t_obj */
    val = tbl_t_obj.wgt.x[1];

3 Answers

Answer 1:

any way to compress or do size optimization of binary file without changing current implementation

To reduce the size of the executable, follow @user3386109's idea: load the table from an external file. Although the executable file becomes smaller, the total size of the executable plus the external file is not reduced.

Code could compress that external file and then decompress it into tbl[] early in the run of the executable. Depending on the data and compression effort, that may make the external file 10x smaller.
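
As a minimal sketch of that approach, assuming the table image is dumped gzip-compressed at build time and zlib is available at run time (the file name, the entry count TBL_COUNT, and layout compatibility between writer and reader are all assumptions):

/* Sketch: load tbl[] from a gzip-compressed external file at startup,
 * using zlib's gzopen/gzread (link with -lz). The on-disk image must
 * have been written by a build with the same struct layout. */
#include <zlib.h>

typedef struct { short x[5]; } weight_t;
typedef struct { weight_t wgt; char *ptr; } tbl_t;

#define TBL_COUNT (1u << 20)            /* assumed entry count */
static tbl_t tbl[TBL_COUNT];            /* zero-initialized, lives in .bss */

int tbl_init(const char *path)          /* e.g. "tbl.bin.gz" (assumed name) */
{
    gzFile f = gzopen(path, "rb");
    if (!f)
        return -1;
    int n = gzread(f, tbl, (unsigned)sizeof tbl); /* decompresses on the fly */
    gzclose(f);
    return n == (int)sizeof tbl ? 0 : -1;
}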

Answer 2:

Since array tbl[] contains 70% empty/zero entries

First, introduce a getter and a setter. Store the index and the value together in a dynamically allocated array. Allocate only when you really need to. For the rest, just return zeros.

#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// stores the index and the table entry,
// kept sorted on index so we can use bsearch
struct maybetbl_s {
   uint32_t index;
   tbl_t tbl;
};

// dynamic memory holding only the non-zero entries
static struct maybetbl_s *tblidx = NULL;
static uint32_t tblidxcnt = 0;

// all zeros, returned for indexes we don't store
static const tbl_t zeros = {0};

// compare records on index, for qsort and bsearch
static int _tbl_cmp(const void *a, const void *b) {
   const struct maybetbl_s *l = a, *r = b;
   return (l->index > r->index) - (l->index < r->index);
}

// check if we have index idx;
// if we do, return a pointer to its record
static struct maybetbl_s *_tbl_has(size_t idx) {
   if (tblidxcnt == 0) return NULL;
   struct maybetbl_s key = { .index = (uint32_t)idx };
   return bsearch(&key, tblidx, tblidxcnt, sizeof(*tblidx), _tbl_cmp);
}

// memmove + realloc to remove an element
static void _tbl_remove(struct maybetbl_s *ret) {
    struct maybetbl_s *end = tblidx + tblidxcnt;
    if (ret + 1 != end) {
        memmove(ret, ret + 1, (end - (ret + 1)) * sizeof(*ret));
    }
    tblidxcnt -= 1;
    if (tblidxcnt == 0) {
        free(tblidx);
        tblidx = NULL;
    } else {
        // shrinking realloc; on failure just keep the larger block
        void *pnt = realloc(tblidx, tblidxcnt * sizeof(*tblidx));
        if (pnt) tblidx = pnt;
    }
}

// grow the array by one record for index idx
static int _tbl_allocate(size_t idx) {
   void *pnt = realloc(tblidx, (tblidxcnt + 1) * sizeof(*tblidx));
   if (!pnt) return -ENOMEM;
   tblidx = pnt;
   tblidx[tblidxcnt].index = (uint32_t)idx;
   tblidx[tblidxcnt].tbl = zeros;
   tblidxcnt += 1;
   qsort(tblidx, tblidxcnt, sizeof(*tblidx), _tbl_cmp); // keep sorted for bsearch
   return 0;
}

// public api --------------------------------

// if idx exists, return the value;
// if it doesn't, return zeros
tbl_t tbl_get(size_t idx) {
   struct maybetbl_s *ret = _tbl_has(idx);
   return ret ? ret->tbl : zeros;
}

// set specific idx to the value;
// note - can allocate memory, returns 0 on success
int tbl_set(size_t idx, tbl_t value) {
   struct maybetbl_s *ret = _tbl_has(idx);
   // if we are storing zeros, just remove the element
   // (memcmp also compares padding bytes, so tbl_t values should be
   //  zero-initialized before their members are filled in)
   if (memcmp(&value, &zeros, sizeof(value)) == 0) {
       if (ret) {
          _tbl_remove(ret);
       }
       return 0;
   }
   // allocate a record for the index if it doesn't exist yet
   if (!ret) {
       int err = _tbl_allocate(idx);
       if (err) return err;
       ret = _tbl_has(idx);
       assert(ret);
   }
   // store the value and return success
   ret->tbl = value;
   return 0;
}
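
Call sites would then go through the getter instead of indexing the array directly; the access from the question becomes (idx here is a hypothetical index variable):

/* was:  val = tbl_t_obj.wgt.x[1]; */
val = tbl_get(idx).wgt.x[1];

The change is mechanical, even if it touches many places.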
Answer 3:

TL;DR: store the non-zero entries plus basically a run-length encoding of the zeros, in a zstd / lz4 / gzip format, and initialize a zeroed array from that.

As you mentioned here and in How to compress or do size optimization for large static sparse array in C (basically a duplicate of this, where you forgot to link this existing question), you seem to be aiming primarily at reducing the file size of the executable, not RAM usage after your code is loaded. (The latter is impossible without an ABI change, which means recompiling all the code that touches this array. That is certainly something you should consider, e.g. to use Lundin's answer on your other question: reduce padding in your arrays if char* is wider than short.)

Your only option to reduce that is to not statically initialize this array with non-zero values. The compiler can't optimize this for you. (Or at least, none that I'm aware of; I guess in theory an embedded implementation that has to copy non-zero .data initializers from flash to RAM anyway could literally compress that data with lz4 or zstd. IDK if any toolchains do that; it would be a neat feature.)


To do this yourself, zero-init the actual array (static tbl_t tbl[TBL_COUNT]; with an explicit size, since a zero-initialized array needs no initializer) so it goes in the BSS and doesn't take space in your .a, and call an init function at startup that loops through a more compact representation of your data, either from a file or from a static const array.

You'd omit the char* member that's initially always(?) zero, and instead add a uint8_t or unsigned short distance_to_next_nonzero member. This is similar in spirit to run-length encoding. If you have a run of zeros too long for that member, just emit a record with all-zero weights and a new run length.

Or it could even be a separate binary file, so it's easier to avoid having that initializer memory allocated for the lifetime of your application (footnote 1).

Perhaps generate the file by compiling source like this into a program that uses fopen / fwrite to dump it. That would make it easy to maintain, even if you have that initializer data compressed with zstd (footnote 2) as part of your build process. Although you can always just hexdump any binary file back into a C const char initdata_zstd[] = {...}; array initializer.
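
The startup side of that could be a one-shot libzstd call, roughly like this (a sketch; initdata_zstd and its length are assumed names for the hexdumped blob, and tbl is the zero-initialized array described above):

/* Sketch: decompress an embedded zstd blob straight into tbl[]
 * (link with -lzstd). All names here are assumptions. */
#include <stddef.h>
#include <zstd.h>

typedef struct { short x[5]; } weight_t;
typedef struct { weight_t wgt; char *ptr; } tbl_t;

extern tbl_t tbl[];                        /* the zeroed BSS array */
extern const unsigned char initdata_zstd[];
extern const size_t initdata_zstd_len;

int tbl_init_from_zstd(size_t tbl_bytes)   /* pass sizeof the real array */
{
    size_t n = ZSTD_decompress(tbl, tbl_bytes,
                               initdata_zstd, initdata_zstd_len);
    return (ZSTD_isError(n) || n != tbl_bytes) ? -1 : 0;
}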

I'm assuming the char* member is always NULL; otherwise, getting it initialized to point at constant strings is more of a challenge.

typedef struct {
    short x[5];
    unsigned short num_zeros;
    // or  uint_least8_t num_zeros;  if you use __attribute__((packed)) for the init stuff
} weight_initializer;
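
A matching init loop could look roughly like this (a sketch; initdata, initdata_count, and TBL_COUNT are assumed names, and a record with all-zero weights simply encodes a longer gap):

/* Sketch: expand the compact records into the zero-initialized tbl[]. */
#include <stddef.h>
#include <string.h>

extern const weight_initializer initdata[];
extern const size_t initdata_count;

void tbl_init_rle(void)
{
    size_t dst = 0;
    for (size_t i = 0; i < initdata_count && dst < TBL_COUNT; i++) {
        /* copy the non-zero weights, then skip the run of zero entries */
        memcpy(tbl[dst].wgt.x, initdata[i].x, sizeof tbl[dst].wgt.x);
        dst += 1 + (size_t)initdata[i].num_zeros;
    }
}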

Footnote 1:
Freeing init data: kernels like Linux put their init data in a custom linker section so it's contiguous, and near the end of boot they put that region of memory on the free list to be reused. In user-space under Linux you could do something similar by calling munmap on the pages that consist solely of the initializer array. (Use alignas(4096) static const unsigned char initdata[] = {...}; so it starts at the beginning of a page.)
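
A rough user-space sketch of that idea (page size hardcoded to 4096 for brevity; a real program would query sysconf(_SC_PAGESIZE)):

#include <stdalign.h>
#include <stddef.h>
#include <sys/mman.h>

alignas(4096) static const unsigned char initdata[] = { /* ... */ 0 };

void free_initdata(void)
{
    /* munmap works on whole pages; a partial tail page stays mapped */
    size_t whole_pages = sizeof initdata / 4096;
    if (whole_pages)
        munmap((void *)initdata, whole_pages * 4096);
}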

Footnote 2:
The non-zero initializer data in your example looks very compressible. zstd or lz4 are (AFAIK) the go-to choices for quick decompression, with zstd allowing good compression ratios if you spend enough time compressing, which I think doesn't hurt decompression speed. (You can limit max memory required for decompression, at the expense of compression efficiency. Although I think that only gets relevant for inputs larger than memory size.)

LZ4 is faster I think, but maybe not by much. Apparently there's also a Lizard (next-gen LZ5). Again, spending lots of time compressing can make a smaller compressed size, potentially leading to even faster decompression.

zstd is about an order of magnitude faster than gzip to compress or decompress, IIRC, and gets equal or better compression ratios at that speed. (With a wide range of speed vs. compression presets.) It makes zlib basically obsolete if you don't need format compatibility.

If your data is really that regular (incrementing patterns), zstd is vastly better than lz4. I assume not, but I was curious just how well compression would work on a pattern like shown in the example in the question. So I wrote a loop to generate repetitive data without a zero-gap member.

typedef struct {
    short x[5];
} weight_t;

typedef struct {
    weight_t wgt;
   // char *ptr;
//    unsigned ptr;   // leave 8 or 4-byte zeros in each record or not
} tbl_t;

#define SIZE (1<<20)
static tbl_t tbl[SIZE];         /* static, so zero-initialized */

#include <stdio.h>

int main(void){
        unsigned short x = 0x0102;
        for(int i = 0; i < SIZE; i++){
                tbl[i].wgt = (weight_t){{x, x, x, x, x}};
                x++;
        }
        fwrite(tbl, sizeof(tbl[0]), SIZE, stdout);
}

So x wraps around after 2^16 iterations; zstd may be seeing that and exploiting the longer-distance redundancy.

$ gcc -O3 foo.c
$ ./a.out > initdata.bin
$ lz4 -k --best --favor-decSpeed initdata.bin initdata.bin.fastdec.lz4
Compressed 10485760 bytes into 9789946 bytes ==> 93.36%
$ lz4 -k --best  initdata.bin 
Compressed 10485760 bytes into 4092881 bytes ==> 39.03%                        
$ gzip -v -k --best  initdata.bin
initdata.bin:    85.4% -- created initdata.bin.gz
$ zstd -v -k -19  initdata.bin 
initdata.bin         :  0.65%   (  10.0 MiB =>   66.7 KiB, initdata.bin.zst)
$ xz -v -k --best initdata.bin       # rather slow to decompress, you prob. don't want it
initdata.bin (1/1)
  100 %         27.2 KiB / 10.0 MiB = 0.003                                    
$ ll -S initdata.bin*
-rw-r--r-- 1 peter peter  10M Apr 20 20:28 initdata.bin
-rw-r--r-- 1 peter peter 9.4M Apr 20 20:28 initdata.bin.fastdec.lz4
-rw-r--r-- 1 peter peter 4.0M Apr 20 20:28 initdata.bin.lz4
-rw-r--r-- 1 peter peter 1.5M Apr 20 20:28 initdata.bin.gz
-rw-r--r-- 1 peter peter  67K Apr 20 20:28 initdata.bin.zst
-rw-r--r-- 1 peter peter  28K Apr 20 20:28 initdata.bin.xz

zstd with the default -3 compression level still got it down to 1.67% of the original size, 171 KiB.

xz uses LZMA, like 7-zip: slow to compress, but high compression ratios. Decompression speed is OK (nowhere near as slow as compression), but not as fast as zstd.

So it seems LZ4 doesn't do a very good job here; if your real data behaves even remotely like this simplistic test, zstd's compression advantage will pay for itself even if the decompression code is somewhat larger. Your real data might be much less compressible by zstd, but you should still see some gain if most of your weights only use about 9 of their 16 bits, even if they're otherwise pretty random.