How to allow bash to performantly read data which includes NUL bytes?


More exactly, the question is:

What recipes are there to enable bash scripts to properly and safely process N bytes of data which might contain NUL bytes?

The question arose from the following observation:

bash -c 'LC_ALL=C read -rN 1 </dev/zero'
  • This command never returns, although /dev/zero delivers data immediately.
  • Tested with Debian 10's bash version 5.0.17(1)-release

(I tried to find out myself, but found no pointer as to why this happens.) All I found out so far is that "my" bash apparently skips all NUL bytes on read -N.
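
To illustrate the effect on a bash that behaves this way (a minimal demonstration of the premise above, not a guaranteed behaviour of every build):

printf 'a\000b\000c' | { LC_ALL=C read -rN 3; printf '%s\n' "$REPLY"; }
  • This prints abc: 3 characters were read, but 5 bytes were consumed, because the two NUL bytes were silently skipped.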

A possible workaround in the special case with -N 1 is to use

LC_ALL=C IFS= read -rd '' -n 1

so that NUL acts as the delimiter and read returns. But this trick fails if you want to skip over more than 1 byte at once, because read then terminates at the first NUL it sees.

For special cases there are workarounds, like forking off dd, but if you want to process the data in bash or often need to skip just a few bytes, forking hurts more than it helps.

Also, looping over read -d '' -n 1 is cumbersome if you want to skip over bigger NUL areas, because it costs one syscall per byte.
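
Such a loop could look roughly like this (a sketch only; skipbytes is a made-up name, not an existing helper). It skips exactly N input bytes, NUL or not, at the cost of one read per byte:

: skipbytes N
skipbytes()
{
local i
for (( i=0; i<$1; i++ ))
do
        LC_ALL=C IFS= read -rd '' -n 1 || return        # EOF: short skip
done
}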

Notes:

  • This is not a question about opinions which solution is best.
  • This is a question to list ways to handle the most common cases.
  • And the answers should be applicable to use cases like:
    • Pipes, where you cannot seek
    • Sockets (like <>"/dev/tcp/$HOST/$PORT")

Please always keep in mind that "performance" includes more than just raw speed. It often includes the time you need to change something: rewriting things from scratch takes too long, and plugging in something like dd can get extremely difficult. Quite often all you have is just pure bash, plus some helpers.

For example, there might be some bigger script which is applied to something like git fast-export. This script works perfectly, until the first binary with a NUL byte is added to the repo. Suddenly read -N goes out of sync, and git fast-import complains. If the code is used mainly to edit commit messages (which travel in the same kind of data blocks as the binary data), you would have to duplicate your code: one NUL-aware path for the binary data, and one for the commit messages you actually want to change in bash.

There probably is no such thing as one size fits all here, so we likely need more solutions than just calling dd.


Answer: The following solves it for me in the situation where bash is talking to a pipe.

Instead of using producer | bashscript | consumer I put transformation scripts into the pipe:

producer | encoder | bashscript | decoder | consumer
  • The encoder escapes 00 into 01 02 and 01 into 01 03.
  • The decoder reverses this, turning 01 02 back into 00 and 01 03 back into 01 (a quick check of the escaping follows below).
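
For a quick sanity check of the escaping, assuming the Python scripts shown further below are saved as ./encode.py and ./decode.py (the file names are just placeholders):

printf 'a\000b\001c' | ./encode.py | xxd -p
  • prints 61010262010363 (00 became 01 02, 01 became 01 03)
printf 'a\000b\001c' | ./encode.py | ./decode.py | xxd -p
  • prints 6100620163, identical to the original input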

Then, in bash, I can use the following routine to read N bytes:

: readbytes N variable
readbytes()
{
local -n ___var="${2:-REPLY}"               # target variable, default REPLY
local ___esc ___tmp
LC_ALL=C read -rN "$1" ___var || return     # short read
___esc="$___var"
# Every 01 byte is an escape prefix, so for each 01 read so far
# we are one decoded byte short and must fetch one more encoded byte.
while   ___esc="${___esc//[^$'\x01']/}"     # keep only the 01 bytes
        ___tmp="${#___esc}"                 # count them
        [ 0 -lt "$___tmp" ]
do
        ___esc=
        LC_ALL=C read -rN "$___tmp" ___esc
        ___tmp=$?
        ___var="$___var$___esc"
        [ 0 = "$___tmp" ] || return "$___tmp"   # short read
done
return 0
}
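
A usage sketch (record is an arbitrary variable name; the bytes in it are still encoded, so they are simply forwarded and the decoder behind the script restores the NULs):

if readbytes 16 record
then
        printf %s "$record"     # forward the still-encoded record
else
        echo 'short read' >&2
fi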

What does this routine do?

  • A call to readbytes N variable first reads N bytes into variable
  • Then it counts the 01-bytes (\1)
  • Each of the 01-bytes is followed by a second escape byte, hence we are short by exactly that count.
  • So it reads that many additional bytes and appends them to the variable.
  • Now, again, additional 01-bytes might have shown up among the newly read bytes, so the loop repeats (see the trace after this list).
  • The loop terminates after at most ld N (log₂ N) iterations, because every 01 is followed by a non-01 byte, so each re-read is at most half the size of the previous one.
  • So this routine needs at most O(log₂ N) read calls, compared to O(N) when looping over read -n. When 00- and 01-bytes are absent, it needs only a single read call.
  • And the overall worst-case runtime complexity is somewhere around O(N log₂ N), which is not perfect but much better than the O(N²) you get when assembling the result byte by byte with read -n.
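
A hypothetical trace: readbytes 8 x first reads 8 encoded bytes; if 3 of them are 01, it reads 3 more; if 1 of those is again 01, it reads 1 more; if that last byte is not 01, the loop stops. That is 3 read calls in total, and x then holds 12 encoded bytes (8 data bytes plus 4 escape prefixes) representing exactly 8 decoded bytes.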

Notes:

  • This routine does not decode the data. So if you read 10 bytes and there is one NUL among them, you will get back a string of 11 bytes (with the NUL replaced by the 2-byte sequence 01 02 produced by the encoder).

  • The decoder is not always needed, as bash is perfectly capable of writing NUL bytes with something like printf '\0' or printf %b '\0'. However, if you mostly copy STDIN to STDOUT while changing a few things, it is usually more convenient not to convert the data within bash and to leave that to the decoder (see the sketch right after these notes for emitting decoded output directly).

  • There probably is no good way to decode the data into a bash variable, as bash variables (like all environment variables) cannot contain NUL.
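
If you do need to emit the decoded form of a chunk directly from bash, something along the following lines works, because the NUL bytes only ever go to stdout and never into a variable (emit_decoded is a hypothetical helper; it assumes the chunk came from readbytes and therefore never ends in a dangling 01):

: emit_decoded encodedchunk
emit_decoded()
{
local rest="$1" piece
while   piece="${rest%%$'\x01\x02'*}"           # everything before the next encoded NUL
        [ "$piece" != "$rest" ]
do
        printf '%s\0' "${piece//$'\x01\x03'/$'\x01'}"   # unescape 01, then emit the NUL
        rest="${rest#*$'\x01\x02'}"
done
printf %s "${rest//$'\x01\x03'/$'\x01'}"        # tail after the last encoded NUL
}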

Here is an encoder in Python3:

#!/usr/bin/env python3
# Encoder: escape 00 as 01 02 and 01 as 01 03, so the output stream is NUL-free.
import sys

while True:
    a = sys.stdin.buffer.read(102400)
    if not a:
        break
    # Escape 01 first, so the 01 bytes introduced for 00 are not escaped again.
    sys.stdout.buffer.write(a.replace(b'\1', b'\1\3').replace(b'\0', b'\1\2'))

And the decoder in Python3, which is only a bit more complex:

#!/usr/bin/env python3
# Decoder: turn 01 02 back into 00 and 01 03 back into 01.
import sys

dangling = False                # True if the previous chunk ended in a lone 01
while True:
    a = sys.stdin.buffer.read(102400)
    if not a:
        break
    if dangling:
        a = b'\1' + a           # re-attach the 01 split off from the previous chunk
        dangling = False
    if a[-1] == 1:
        dangling = True         # never split an escape sequence across chunks
        a = a[:-1]
    sys.stdout.buffer.write(a.replace(b'\1\2', b'\0').replace(b'\1\3', b'\1'))

The complete git repo on GitHub also contains a C wrapper, bashnul, which runs much faster than the Python code (the C program also detects encoding errors, etc.).

(Beware: it is not thoroughly tested.)