I had to replace all the null bytes in a file with another character (I arbitrarily chose @), and was pretty surprised that tr '\00' '@' was about 1/4 the speed of gzip:
$ pv < lawl | gzip > /dev/null
^C13MiB 0:00:04 [28.5MiB/s] [====> ] 17% ETA 0:00:18
$ pv < lawl | tr '\00' '@' > /dev/null
^C58MiB 0:00:08 [7.28MiB/s] [==> ] 9% ETA 0:01:20
My real data file is 3GB gzipped and took 50 minutes to tr, and I'll actually need to do this on many such files, so it's not a completely academic problem. Note that reading from disk (a reasonably fast SSD here), or pv, isn't the bottleneck in either case; both gzip and tr are using 100% CPU, and cat is much faster:
$ pv < lawl | cat > /dev/null
642MiB 0:00:00 [1.01GiB/s] [================================>] 100%
This code:
#include <stdio.h>
int main() {
int ch;
while ((ch = getchar()) != EOF) {
if (ch == '\00') {
putchar('@');
} else {
putchar(ch);
}
}
}
compiled with clang -O3 is somewhat faster:
$ pv < lawl | ./stupidtr > /dev/null
^C52MiB 0:00:06 [ 8.5MiB/s] [=> ] 8% ETA 0:01:0
Compiling with gcc -O4 -mtune=native -march=native (4.8.4) is comparable, maybe very slightly faster. Adding -march=native to clang (Apple LLVM version 6.1.0 (clang-602.0.53) (based on LLVM 3.6.0svn)) produces an identical binary.
This is presumably just because the generic processing code for replacements in tr is replaced with constants and the checks can be compiled down. The LLVM IR (clang -S -O3 stupidtr.c) looks pretty good.
I guess gzip must be faster because it's doing something SIMD instructions or something. Is it possible to get this up to gzip speeds?
Some specifications, if they're relevant:
The file is a CSV; the null byte can only occur in a certain field, but some of the other fields are variable-length, so you can't just seek around arbitrarily. Most lines have a null byte in that field. I suppose this means you could do a Boyer-Moore search for
,\00,, if that'd help. Once you've found a null byte, it's also guaranteed that there can't be another one for a hundred bytes or so.A typical file is about 20 GiB uncompressed, but are bz2 compressed on disk, if that's relevant.
You can parallelize if you want, though
gzipdoes this with one so it shouldn't be necessary. I'll be running this either on a quad-core i7 running OSX or a two-vCPU cloud server running Linux.Both machines I might run on have 16GB of RAM.
Combining ideas from the various answers with some extra bithacks, here is an optimized version: