More exactly, the question is:
Which recipes are there to enable bash scripts to properly and safely process N bytes which might contain NUL?
This question led to the following observation:
bash -c 'LC_ALL=C read -rN 1 </dev/zero'
- Tested with Debian 10's bash version 5.0.17(1)-release
(I tried to find out myself but found no pointer to why this happens.) All I have found out so far is that "my" bash apparently skips all NUL bytes on read -N, which would explain why the command above never returns.
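A finite input makes the effect easier to see than /dev/zero. The following is only a small sketch; the input string and the variable names are made up for illustration. If read -N skips NUL bytes as described, asking for 3 bytes from a 5-byte input that contains two NULs should consume all five bytes and leave only the three non-NUL ones in the variable.

```shell
# Five input bytes, two of them NUL; ask read -N for exactly three.
# Assumption: a bash that skips NUL on read -N, like the one described above.
got=$(printf 'a\0b\0c' | bash -c 'LC_ALL=C IFS= read -rN 3 x; printf %s "$x"')
printf '%s\n' "$got"
```

On such a bash I would expect this to print abc, meaning the NULs were neither stored nor counted.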
A possible workaround in the special case with -N 1 is to use
LC_ALL=C IFS= read -rd '' -n 1
such that NUL acts as delimiter, so read returns. But this trick fails in case you want to skip over more than 1 byte, as then the read terminates after the first NUL seen.
For special cases there are workarounds, like forking off dd, but if you want to process the data in bash or need to often skip just a few bytes, forking hurts more than it helps.
Also looping over read -d '' -n 1 is cumbersome if you want to skip over bigger NUL areas, because this is one syscall per byte.
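To make the byte-wise loop concrete, here is a minimal sketch of it, assuming bash. The function name hex_bytes is hypothetical. It relies on read returning success with an empty variable when it hits the NUL delimiter, and returning failure only at EOF.

```shell
# Print every input byte as two hex digits, one read call per byte.
hex_bytes() {
  local LC_ALL=C byte
  while IFS= read -r -d '' -n 1 byte; do
    if [ -z "$byte" ]; then
      printf '00 '             # read stopped at the delimiter: the byte was NUL
    else
      printf '%02x ' "'$byte"  # "'c" makes printf use the character's code
    fi
  done
  printf '\n'
}

printf 'a\0\0b\n' | hex_bytes
```

This works for any number of NULs, but it shows the cost mentioned above: every single byte goes through its own read.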
Notes:
- This is not a question about opinions which solution is best.
- This is a question to list ways to handle the most common cases.
- And the answers should be applicable to use cases like:
  - Pipes, where you cannot seek
  - Sockets (like <>"/dev/tcp/$HOST/$PORT")
Please always keep in mind that "performance" includes more than just raw speed. It often includes the time you need to change something, where rewriting things from scratch takes too long, or plugging in something like dd gets extremely difficult. Quite often all you have is just pure bash. Plus some helpers.
For example, there might be some bigger script which is applied to something like git fast-export. This script works perfectly until the first binary with a NUL byte is added to the repo. Suddenly read -N goes out of sync, such that git fast-import complains. If the code is used mainly to edit commit messages (which are treated like the binary data), you have to duplicate your code: one NUL-aware variant for binary data, and one for the commit messages you want to change in bash.
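The kind of script meant here might look roughly like the following sketch. It is purely hypothetical (the function name passthrough is made up) and assumes the stream only uses the counted form, where a line "data <count>" is followed by exactly <count> raw bytes.

```shell
# Hypothetical pass-through filter for a fast-export-like stream on stdin.
passthrough() {
  local LC_ALL=C line payload
  while IFS= read -r line; do
    printf '%s\n' "$line"
    case $line in
      'data '*)
        # Read the announced number of bytes. If the payload contains NUL,
        # read -N drops those bytes and keeps reading past the end of the
        # payload, so everything after this point is out of sync.
        IFS= read -r -N "${line#data }" payload
        printf '%s' "$payload"
        ;;
    esac
  done
}
```

As long as the payloads are NUL-free text such as commit messages, the output is identical to the input; a place to edit the message would be between the read and the printf of the payload (adjusting the count accordingly).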
Probably there is no such thing as one size fits all, so we likely need more solutions than just calling dd.
The following solves it for me in the situation where bash is talking to a pipe.
Instead of using
producer | bashscript | consumer
I put some transformation script into the pipe:
- encoder escapes 00 into 01 02 and 01 into 01 03.
- decoder unescapes 00 from 01 02 and 01 from 01 03.
Then, in bash, I can use the following routine to read N bytes:
What does this routine do?
- readbytes N variable first reads N bytes into variable.
- It then counts the 01-bytes (\1) in what was read.
- All 01-bytes have a second byte, hence we are the given count short.
- So it reads that many additional bytes and appends them to variable.
- In those additional bytes, more 01-bytes might have shown up, so we need to re-read them, too.
- This repeats for at most about ld N steps.
- So it needs O(ld N) syscalls compared to O(N) with read -n. When 00- and 01-bytes are absent, this routine only does 1 syscall.
- The overall complexity is O(N ld N), which is not perfect but much better than O(N*N) when using read -n.
Notes:
- This routine does not decode the data. So if you read 10 bytes and there is one NUL in it, you will get back a string of 11 bytes (with NUL replaced by the byte sequence 01 02 from the encoder).
- The decoder is not always needed, as bash is perfectly suited to write NUL bytes with something like printf '\0' or printf %b '\0'. However, if you mostly copy STDIN to STDOUT while changing a few things, most of the time it is more convenient not to convert the data within bash and to leave that to the decoder.
- There probably is no good way to decode the data in bash, as bash variables (like all environment variables) cannot contain NUL.
Here is an encoder in Python3:
And the decoder in Python3, which is only a bit more complex:
The complete git repo on GitHub also contains a C code wrapper bashnul, which runs much faster than the Python code (also the C program detects encoding errors etc.).
(Beware, it's not thoroughly tested.)
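To illustrate the bash side of the scheme described above, here is a minimal sketch. It is not the code from the repo: the body of readbytes is my own reading of the steps listed earlier, and write_decoded is a hypothetical helper that decodes while writing, so the NUL never has to live in a variable. Both assume bash and an input stream that has already been through the encoder (00 as 01 02, 01 as 01 03).

```shell
# readbytes N VAR: read N decoded bytes' worth of encoded input into VAR.
readbytes() {
  local LC_ALL=C IFS=
  local _rb_want=$1 _rb_chunk _rb_rest _rb_data=
  while (( _rb_want > 0 )); do
    if ! read -r -N "$_rb_want" _rb_chunk; then
      # EOF: hand back whatever arrived and report failure.
      printf -v "$2" '%s' "$_rb_data$_rb_chunk"
      return 1
    fi
    _rb_data+=$_rb_chunk
    # Every 01 byte announces one extra byte, so fetch that many more.
    _rb_rest=${_rb_chunk//$'\1'/}
    _rb_want=$(( ${#_rb_chunk} - ${#_rb_rest} ))
  done
  printf -v "$2" '%s' "$_rb_data"
}

# write_decoded STRING: write an encoded string to stdout in decoded form.
write_decoded() {
  local LC_ALL=C s=$1
  s=${s//\\/\\\\}          # protect literal backslashes from %b
  s=${s//$'\1\2'/\\0000}   # %b turns \0000 into a NUL byte
  s=${s//$'\1\3'/$'\1'}    # 01 03 stands for a plain 01 byte
  printf '%b' "$s"
}

# Usage: the encoded form of  a b 00 c d 01 e  arrives on stdin.
printf 'ab\1\2cd\1\3e' | {
  readbytes 4 first    # a b 00 c  (5 encoded bytes)
  readbytes 3 second   # d 01 e    (4 encoded bytes)
  write_decoded "$first$second" | od -An -tx1
}
```

The internal variables carry an _rb_ prefix only so that printf -v does not accidentally hit a local of the same name as the caller's variable.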