Erasure code for a chunked file


Is there an erasure code that can be applied to multiple chunks (maybe 100 or 200, each a few hundred kB) by (somehow) adding redundancy chunks?

I have heard about Reed-Solomon, but it doesn't look like it can be used for huge data sets and multiple chunks. Am I wrong?

Thanks!

There are 2 answers below.


Erasure codes encode $N$ original data chunks into $M$ parity chunks for redundancy; together, these $N$ data chunks and $M$ parity chunks form just one stripe of the whole storage. Theoretically, $N$ can be arbitrarily large for Reed-Solomon (RS) codes, provided the Galois field $GF(2^w)$ over which the RS code is constructed is large enough (see the encoding sketch at the end of this answer). Given that, your question is really:

Why is the number of (original data) chunks in a stripe rarely very large, e.g., $N = 100$ or $200$?

The reasons are the update problem and the repair problem. If you encode a large number of data chunks into parity chunks, many data and parity chunks become co-dependent. Whenever you update one data chunk, all parity chunks must be updated as well, which causes heavy I/O on the parity side. The repair problem is the situation where one data or parity chunk fails and many other chunks must be read and transferred to repair it, causing enormous disk I/O or network traffic.

Take RAID5 with $3$ data chunks (A, B, C) and parity chunk $P = A + B + C$ as an example: repairing the failure of any one chunk requires all three remaining chunks to participate.
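Here is the same example as a minimal sketch, using one-byte "chunks" and XOR as the parity operation (pure Python, no external dependencies):

```python
A, B, C = 0x11, 0x22, 0x33
P = A ^ B ^ C                      # RAID5-style single parity chunk

# Repair problem: rebuilding any one lost chunk requires reading
# all three surviving chunks.
B_rebuilt = A ^ C ^ P
assert B_rebuilt == B

# Update problem: changing one data chunk forces a parity rewrite too
# (P_new = P_old XOR A_old XOR A_new), so parity sees heavy write I/O.
A_new = 0x44
P_new = P ^ A ^ A_new
assert P_new == A_new ^ B ^ C
```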

The more chunks that are encoded together, the more serious the update and repair problems a storage system may face, which in turn greatly degrades its performance.

By the way, decoding speed (the process of recovering the original data) also falls off sharply as $N$ grows.
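For concreteness, here is a minimal sketch of the chunk-level striping described at the top of this answer, assuming the third-party Python package `reedsolo`; the column-wise layout and all sizes are illustrative choices, not a prescribed design:

```python
from reedsolo import RSCodec   # pip install reedsolo (assumed dependency)

N, M = 4, 2                    # N data chunks + M parity chunks per stripe
CHUNK = 8                      # chunk size in bytes (tiny, for demonstration)
rsc = RSCodec(M)               # byte-based RS code over GF(2^8), M parity symbols

data_chunks = [bytes([17 * i] * CHUNK) for i in range(N)]

# Encode column-wise: byte j of every data chunk forms one N-byte RS
# message, and its M parity bytes go to position j of the parity chunks.
parity_chunks = [bytearray(CHUNK) for _ in range(M)]
for j in range(CHUNK):
    column = bytes(chunk[j] for chunk in data_chunks)
    codeword = rsc.encode(column)          # N data bytes + M parity bytes
    for m in range(M):
        parity_chunks[m][j] = codeword[N + m]
```

Any $M$ of the $N + M$ chunks can then be lost and rebuilt column by column (e.g., via `rsc.decode(..., erase_pos=...)`); the only hard constraint is $N + M \le 2^w - 1$, i.e., at most 255 chunks per stripe over $GF(2^8)$.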


Of course Reed-Solomon can be used for any data size.

Just think of your data as a set of multiple RS-sized blocks (e.g., 255 bytes for a byte-based RS code) and do the calculation for each block independently. All the per-block checksums together form the checksum of the whole large data set.
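A minimal sketch of that idea, assuming the third-party Python package `reedsolo` (the block and parity sizes here are arbitrary choices):

```python
from reedsolo import RSCodec   # pip install reedsolo (assumed dependency)

NSYM = 16                      # parity bytes per block (illustrative)
BLOCK = 255 - NSYM             # data bytes per 255-byte RS block
rsc = RSCodec(NSYM)

def encode_stream(data: bytes) -> bytes:
    """Encode each RS-sized block independently and concatenate the results."""
    out = bytearray()
    for i in range(0, len(data), BLOCK):
        out += rsc.encode(data[i:i + BLOCK])
    return bytes(out)
```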

If your data length is not a multiple of the RS block size, i.e., the last block is too short, just add some zero bytes to fill it up before encoding, and remove the zeros again after decoding. You'll have to store the original data length somewhere, but that should be no problem.
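A minimal sketch of the padding step (the helper names and the tuple return are just illustrative):

```python
def pad_to_block(data: bytes, block: int) -> tuple[bytes, int]:
    # Zero-fill the final block; the original length must be stored
    # somewhere (e.g., a header) so the padding can be stripped later.
    rem = len(data) % block
    padded = data + bytes(block - rem) if rem else data
    return padded, len(data)

def unpad(decoded: bytes, original_len: int) -> bytes:
    # Drop the zero padding after decoding.
    return decoded[:original_len]
```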