How fast can we make a specific tr?


I had to replace all the null bytes in a file with another character (I arbitrarily chose @), and was pretty surprised that tr '\00' '@' was about 1/4 the speed of gzip:

$ pv < lawl | gzip > /dev/null
^C13MiB 0:00:04 [28.5MiB/s] [====>                             ] 17% ETA 0:00:18
$ pv < lawl | tr '\00' '@' > /dev/null
^C58MiB 0:00:08 [7.28MiB/s] [==>                               ]  9% ETA 0:01:20

My real data file is 3GB gzipped and took 50 minutes to tr, and I'll actually need to do this on many such files, so it's not a completely academic problem. Note that reading from disk (a reasonably fast SSD here), or pv, isn't the bottleneck in either case; both gzip and tr are using 100% CPU, and cat is much faster:

$ pv < lawl | cat > /dev/null
 642MiB 0:00:00 [1.01GiB/s] [================================>] 100%

This code:

#include <stdio.h>

int main() {
    int ch;
    while ((ch = getchar()) != EOF) {
        if (ch == '\00') {
            putchar('@');
        } else {
            putchar(ch);
        }
    }
}

compiled with clang -O3 is somewhat faster:

$ pv < lawl | ./stupidtr > /dev/null
^C52MiB 0:00:06 [ 8.5MiB/s] [=>                                ]  8% ETA 0:01:0

Compiling with gcc -O4 -mtune=native -march=native (4.8.4) is comparable, maybe very slightly faster. Adding -march=native to clang (Apple LLVM version 6.1.0 (clang-602.0.53) (based on LLVM 3.6.0svn)) produces an identical binary.

This is presumably just because tr's generic replacement-table handling is replaced here with constants, so the per-character checks can be compiled down to something much simpler. The LLVM IR (clang -S -O3 stupidtr.c) looks pretty good.

I guess gzip must be faster because it's using SIMD instructions or something. Is it possible to get this up to gzip speeds?

Some specifications, if they're relevant:

  • The file is a CSV; the null byte can only occur in a certain field, but some of the other fields are variable-length, so you can't just seek around arbitrarily. Most lines have a null byte in that field. I suppose this means you could do a Boyer-Moore search for ,\00,, if that'd help. Once you've found a null byte, it's also guaranteed that there can't be another one for a hundred bytes or so.

  • A typical file is about 20 GiB uncompressed, but the files are bz2-compressed on disk, if that's relevant.

  • You can parallelize if you want, though gzip manages this on a single core, so it shouldn't be necessary. I'll be running this either on a quad-core i7 running OS X or on a two-vCPU cloud server running Linux.

  • Both machines I might run on have 16GB of RAM.

There are 4 answers below.

Accepted answer (8 votes)

Combining ideas from the various answers with some extra bithacks, here is an optimized version:

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BUFFER_SIZE  16384
#define REPLACE_CHAR  '@'

int main(void) {
    /* define buffer as uint64_t to force alignment */
    /* make it one slot longer to allow for loop guard */
    uint64_t buffer[BUFFER_SIZE/8 + 1];
    ssize_t size, chunk;
    uint64_t *p, *p_end;
    uint64_t rep8 = (uint8_t)REPLACE_CHAR * 0x0101010101010101ULL; /* replacement byte copied into all 8 byte lanes */

    while ((size = read(0, buffer, BUFFER_SIZE)) != 0) {
        if (size < 0) {
            if (errno == EINTR) continue;
            fprintf(stderr, "read error: %s\n", strerror(errno));
            return 1;
        }
        p = buffer;
        p_end = p + ((size + 7) >> 3);
        *p_end = 0ULL; /* force a 0 at the end */
        for (;; p++) {
#define SEVENS    0x7f7f7f7f7f7f7f7fULL
            /* exact zero-byte locator: m gets 0x80 in every byte of *p that is
               0x00 and nothing elsewhere (no stray markers from inter-byte borrows) */
            uint64_t m = ~(((*p & SEVENS) + SEVENS) | *p | SEVENS);
            if (m != 0) {
                if (p >= p_end) break; /* reached the guard word forced to zero after the data */
                m |= m >> 1;
                m |= m >> 2;
                m |= m >> 4;           /* widen each 0x80 marker to 0xff over its byte */
                *p |= m & rep8;        /* OR the replacement character into the zero bytes */
            }
        }
        for (unsigned char *pc = (unsigned char *)buffer;
             (chunk = write(1, pc, (size_t)size)) != size;
             pc += chunk, size -= chunk) {
            if (chunk < 0) {
                if (errno == EINTR) { chunk = 0; continue; } /* retry; keep the loop increment from applying chunk = -1 */
                fprintf(stderr, "write error: %s\n", strerror(errno));
                return 2;
            }
        }
    }
    return 0;
}

Answer (8 votes)

You need to use block reads and writes for speed. (Even with a buffered I/O library like stdio.h, the cost of managing the buffer can be significant.) Something like:

#include <unistd.h>
int main( void )
{
    char buffer[16384];
    int size, i;
    while ((size = read(0, buffer, sizeof buffer)) > 0) {
        for( i = 0; i < size; ++i ) {
            if (buffer[i] == '\0') {
                buffer[i] = '@';
                // optionally, i += 64; since
                // "Once you've found a null byte, it's also guaranteed that there can't
                // be another one for a hundred bytes or so"
            }
        }
        write(1, buffer, size);
    }
}

Naturally, compile with optimizations so that the compiler can transform indexing into pointer arithmetic if helpful.

This version also lends itself well to SIMD optimizations if you still aren't meeting your speed targets (or a smart enough compiler may vectorize the for loop automatically).
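
If you do go down the SIMD route by hand, here is a rough sketch of what the vectorized inner step could look like with SSE2 intrinsics; the helper name and the fixed 16-byte chunking are my own, and a real version still needs a scalar loop for the tail of each buffer:

#include <emmintrin.h> /* SSE2 intrinsics */

/* Replace every zero byte in one 16-byte chunk with '@'. */
static inline void replace_zeros_16(unsigned char *p) {
    __m128i v    = _mm_loadu_si128((const __m128i *)p);
    __m128i mask = _mm_cmpeq_epi8(v, _mm_setzero_si128()); /* 0xff where a byte is zero */
    __m128i at   = _mm_set1_epi8('@');
    /* keep the original bytes where mask is 0, substitute '@' where mask is 0xff */
    v = _mm_or_si128(_mm_andnot_si128(mask, v), _mm_and_si128(mask, at));
    _mm_storeu_si128((__m128i *)p, v);
}

Called over the buffer 16 bytes at a time inside the read/write loop above, this does the same work as the scalar for loop, just 16 bytes per iteration.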

Furthermore, this code lacks robust error handling. As @chqrlie mentions in a comment, you should retry when read() or write() fails with EINTR, and you should handle partial writes.
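
For the partial-write half of that, a helper along these lines is the usual shape (the name write_all is mine):

#include <errno.h>
#include <unistd.h>

/* Keep calling write() until the whole buffer is flushed, retrying on EINTR. */
static int write_all(int fd, const char *buf, size_t len) {
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: just retry */
            return -1;          /* real error: let the caller report errno */
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}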

Answer (11 votes)

First, as others have noted, don't use getchar()/putchar(), or even any of the FILE-based methods such as fopen()/fread()/fwrite(). Use open()/read()/write() instead.

If the file is already uncompressed on disk, do not use pipes. If it is compressed, then you DO want to use a pipe in order to remove an entire read/write cycle. If you uncompress from disk back to disk, then replace the NUL characters, the data path is disk->memory/cpu->disk->memory/cpu->disk. If you use a pipe, the path is disk->memory/cpu->disk. If you're disk-limited, that extra read/write cycle will just about double the time it takes to process your gigabytes (or more) of data.

One more thing: given your IO pattern and the amount of data you're moving - read an entire multi-GB file, write the entire file - the page cache is only getting in your way. Use direct IO, like this in C on Linux (robust error checking left off for clarity):

#define _GNU_SOURCE   /* for O_DIRECT */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>   /* valloc() */
#include <unistd.h>

#define CHUNK_SIZE ( 1024UL * 1024UL * 4UL )
#define NEW_CHAR '@'
int main( int argc, char **argv )
{
    /* page-aligned buffer */
    char *buf = valloc( CHUNK_SIZE );
    /* just set "in = 0" to read a stdin pipe */
    int in = open( argv[ 1 ], O_RDONLY | O_DIRECT );
    int out = open( argv[ 2 ], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644 );

    for ( ;; )
    {
        ssize_t bytes = read( in, buf, CHUNK_SIZE );
        if ( bytes < 0 )
        {
            if ( errno == EINTR )
            {
                continue;
            }
            break;
        }
        else if ( bytes == 0 )
        {
            break;
        }

        for ( int ii = 0; ii < bytes; ii++ )
        {
            if ( !buf[ ii ] )
            {
                buf[ ii ] = NEW_CHAR;
            }
        }
        write( out, buf, bytes );
    }

    close( in );
    close( out );

    return( 0 );
}

Crank up the compiler optimization as high as it goes. To use this code on real data, you do need to check the results of the write() call - direct IO on Linux is a real finicky beast. I've had to close a file opened with O_DIRECT and reopen it without direct IO set in order to write the last bytes of a file on Linux when the final chunk wasn't a multiple of a full page.

If you want to go even faster, you can multithread the process - one thread reads, one thread translates chars, and another thread writes. Use as many buffers, passing them from thread to thread, as necessary to keep the slowest part of the process busy at all times.
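
As a sketch of that structure (buffer count, sizes and helper names are all my own choices, and read errors and partial writes are glossed over), a reader thread can hand fixed-size buffers to the main thread through a small blocking queue; compile with -pthread:

#include <pthread.h>
#include <unistd.h>

#define NBUF  4
#define BUFSZ (1 << 20)

struct slot { char data[BUFSZ]; ssize_t len; };

static struct slot slots[NBUF];
static int fill_q[NBUF + 1], free_q[NBUF + 1];   /* rings of slot indices */
static int fill_h, fill_t, free_h, free_t;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

/* blocking push/pop on a ring of NBUF+1 ints (never full, since only NBUF
   slot indices exist in total) */
static void push(int *q, int *tail, int v) {
    pthread_mutex_lock(&lock);
    q[*tail] = v;
    *tail = (*tail + 1) % (NBUF + 1);
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
}
static int pop(int *q, int *head, const int *tail) {
    pthread_mutex_lock(&lock);
    while (*head == *tail)
        pthread_cond_wait(&cond, &lock);
    int v = q[*head];
    *head = (*head + 1) % (NBUF + 1);
    pthread_mutex_unlock(&lock);
    return v;
}

static void *reader(void *arg) {
    (void)arg;
    for (;;) {
        int i = pop(free_q, &free_h, &free_t);
        slots[i].len = read(0, slots[i].data, BUFSZ);
        push(fill_q, &fill_t, i);
        if (slots[i].len <= 0)
            return NULL;                 /* EOF (or error): the slot doubles as a sentinel */
    }
}

int main(void) {
    pthread_t tid;
    for (int i = 0; i < NBUF; i++)
        push(free_q, &free_t, i);        /* every buffer starts out free */
    pthread_create(&tid, NULL, reader, NULL);
    for (;;) {
        int i = pop(fill_q, &fill_h, &fill_t);
        if (slots[i].len <= 0)
            break;                       /* sentinel from the reader */
        for (ssize_t k = 0; k < slots[i].len; k++)
            if (slots[i].data[k] == '\0')
                slots[i].data[k] = '@';
        write(1, slots[i].data, (size_t)slots[i].len);  /* partial writes ignored in this sketch */
        push(free_q, &free_t, i);
    }
    pthread_join(tid, NULL);
    return 0;
}

Adding a third thread that only does the translation is the same pattern with one more queue between the translator and the writer.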

If you're really interested in seeing how fast you can move data, multithread the reads/writes, too. And if your filesystem supports it, use asynchronous read/write.

Answer (5 votes)

Your code is incorrect because you do not test for end of file at the right spot. It is a very common mistake with do {} while loops. I recommend avoiding this construct completely (except in macros that convert sequences of statements into a single statement).

Also try telling glibc to skip the per-character locking overhead on the streams:

#include <stdio.h>

int main() {
    int c;
    while ((c = getchar_unlocked()) != EOF) {
        if (c == '\0')
            c = '@';
        putchar_unlocked(c);
    }
}

You can also play with different buffer sizes, for example try these before the while() loop:

setvbuf(stdin, NULL, _IOFBF, 1024 * 1024);
setvbuf(stdout, NULL, _IOFBF, 1024 * 1024);

It should not have much impact if you use the utility as a filter with pipes, but it may be more efficient if you use files.

If you use files, you can also mmap the file and use memchr to find the '\0' bytes, or even strchr, which might be faster as long as you ensure there is a '\0' at the end of the file (putting one there yourself is a simple way to guarantee it).
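
For completeness, here is a minimal sketch of the mmap + memchr variant; it takes the path in argv[1] (my choice) and patches the file in place through a shared mapping, rather than writing a separate output file:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDWR);
    if (fd < 0) return 1;
    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) return 1;
    char *p = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;
    char *cur = p, *end = p + st.st_size;
    /* memchr skips the long runs of non-zero bytes; each hit is patched in place */
    while ((cur = memchr(cur, '\0', (size_t)(end - cur))) != NULL)
        *cur++ = '@';
    munmap(p, (size_t)st.st_size);
    close(fd);
    return 0;
}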