More exactly, the question is: which recipes are there to enable `bash` scripts to properly and safely process N bytes which might contain NUL?
This question led to the following observation: `bash -c 'LC_ALL=C read -rN 1 </dev/zero'` never returns.

- Tested with Debian 10's `bash`, version 5.0.17(1)-release.

I tried to find out myself why this happens, but found no pointer. All I found out so far is that "my" `bash` apparently skips all NUL bytes on `read -N`.
A possible workaround in the special case of `-N 1` is to use `LC_ALL=C IFS= read -rd '' -n 1`, such that NUL acts as the delimiter, so `read` returns. But this trick fails in case you want to skip over more than one byte, as then `read` terminates after the first NUL seen.
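To illustrate that trick, here is a hypothetical helper (not part of the question): with `-d ''` a successful `read` that leaves the variable empty means the byte just consumed was a NUL.

```shell
# Count the bytes on stdin, NUL included.  With -d '' the NUL byte is
# read's delimiter, so read succeeds with an empty $c exactly when the
# byte was NUL; any other byte ends up in $c via -n 1.  IFS= is needed
# so that whitespace bytes are not stripped (which would look like NUL).
count_bytes() {
  local c count=0 nuls=0
  while LC_ALL=C IFS= read -rd '' -n 1 c; do
    count=$((count + 1))
    [ -z "$c" ] && nuls=$((nuls + 1))
  done
  echo "$count bytes, $nuls NUL"
}

printf 'a\0b' | count_bytes   # -> 3 bytes, 1 NUL
```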
For special cases there are workarounds, like forking off `dd`, but if you want to process the data in `bash` or need to skip just a few bytes frequently, forking hurts more than it helps. Also, looping over `read -d '' -n 1` is cumbersome if you want to skip over bigger NUL areas, because it costs one syscall per byte.
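For contrast, the forking approach might look like this (a hypothetical sketch): with `bs=1`, `dd` reads byte-exact counts even from a pipe, but every call costs a fork and exec.

```shell
# Skip 4 bytes from the pipe, then copy the next 8 bytes to stdout.
# bs=1 makes dd perform one read(2) per byte, so the counts are exact
# even on pipes, where larger block sizes could read short.
skip4_read8() {
  dd bs=1 count=4 of=/dev/null 2>/dev/null  # discard 4 bytes
  dd bs=1 count=8 2>/dev/null               # emit the next 8 bytes
}

printf '0123456789ABCDEF' | skip4_read8   # -> 456789AB
```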
Notes:

- This is not a question about opinions on which solution is best.
- This is a question to list ways to handle the most common cases.
- The answers should be applicable to use cases like:
  - pipes, where you cannot seek
  - sockets (like `<>"/dev/tcp/$HOST/$PORT"`)
Please always keep in mind that "performance" includes more than just raw speed. It often includes the time you need to change something, where rewriting things from scratch takes too long or plugging in something like `dd` gets extremely difficult. Quite often all you have is pure `bash`, plus some helpers.

For example, there might be some bigger script which is applied to something like `git fast-export`. This script works perfectly, until the first binary with a NUL byte is added to the repo. Suddenly `read -N` goes out of sync, such that `git fast-import` complains. If the code is mainly used to edit commit messages (which are treated like the binary data), you have to duplicate your code: one path for binary, NUL-aware data, and one for the commit messages to change in `bash`.

Probably there is no such thing as one size fits all, so we likely need more solutions than to just call `dd`.
The following solves it for me in the situation where `bash` is talking to a pipe. Instead of using

`producer | bashscript | consumer`

I put some transformation scripts into the pipe:

- `encoder` escapes `00` into `01 02` and `01` into `01 03`.
- `decoder` unescapes `00` from `01 02` and `01` from `01 03`.

Then, in `bash`, I can use the following routine to read N bytes.

What does this routine do?
- `readbytes N variable` first reads N bytes into `variable`.
- It then counts the `01` bytes (`\1`) just read. Each escaped `01` byte carries a second byte, hence we are that count short of the requested length.
- So it reads the missing bytes and appends them to `variable`. In the newly read data, more `01` bytes might have shown up, so we need to re-read for them, too.
- This finishes in roughly ld N (log₂ N) steps, i.e. O(ld N) syscalls compared to O(N) with `read -n`. When `00` and `01` bytes are absent, this routine only does 1 syscall.
- The total work is O(N ld N), which is not perfect but much better than O(N*N) when using `read -n`.
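The routine itself did not survive into this text; a sketch matching the description above (my reconstruction, not the original code — the `printf -v` assignment in particular is an assumption) could look like:

```shell
# readbytes N variable - read N logical bytes from the encoded stream.
# The stream contains no NUL, so read -N works; every 0x01 starts a
# two-byte escape pair, so after each read we fetch as many extra
# bytes as there were 0x01 bytes in the part just read.
readbytes() {
  local -i want="$1"
  local buf= part ones
  while (( want > 0 )); do
    LC_ALL=C IFS= read -rN "$want" part || { buf+=$part; break; }
    buf+=$part
    ones=${part//[^$'\1']}   # keep only the \1 bytes of this round
    want=${#ones}            # each escape pair left us one byte short
  done
  printf -v "$2" %s "$buf"   # store the still-encoded result
}
```

On the encoded stream `A \1 \2 B`, `readbytes 3 v` stores all four raw bytes in `v` (three logical bytes, still encoded).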
Notes:

- This routine does not decode the data. So if you read 10 bytes and there is one NUL among them, you will get back a string of 11 bytes (with the NUL replaced by the byte sequence `01 02` from the encoder).
- The decoder is not always needed, as `bash` is perfectly suited to write NUL bytes with something like `printf '\0'` or `printf %b '\0'`. However, if you mostly copy stdin to stdout while changing a few things, most of the time it is more convenient not to convert the data within `bash` and to leave that to the decoder.
- There probably is no good way to decode the data in `bash`, as `bash` variables (like all environment variables) cannot contain NUL.

Here is an encoder in Python3:
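The script itself is missing here; a minimal sketch following the escaping rules above (`01` must be escaped before `00`, so that the `01` bytes introduced for NUL are not escaped again):

```python
import sys

def encode(data: bytes) -> bytes:
    """Escape the stream so it contains no NUL byte.

    0x01 is escaped first; otherwise the 0x01 bytes introduced
    while escaping 0x00 would get escaped a second time.
    """
    data = data.replace(b'\x01', b'\x01\x03')  # 01 -> 01 03
    return data.replace(b'\x00', b'\x01\x02')  # 00 -> 01 02

if __name__ == '__main__':
    # Whole-stream filter for simplicity; a real tool would work
    # chunk by chunk (safe here, as encoding is a per-byte mapping).
    sys.stdout.buffer.write(encode(sys.stdin.buffer.read()))
```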
And here is the decoder in Python3, which is only a bit more complex:
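Again, the original is not reproduced; a sketch of the inverse mapping, assuming well-formed input (every `01` starts an escape pair):

```python
import sys

def decode(data: bytes) -> bytes:
    """Undo the encoder's escaping: 01 02 -> 00, 01 03 -> 01."""
    out = bytearray()
    it = iter(data)
    for b in it:
        if b != 0x01:
            out.append(b)
            continue
        nxt = next(it, None)        # every 0x01 starts an escape pair
        if nxt == 0x02:
            out.append(0x00)
        elif nxt == 0x03:
            out.append(0x01)
        else:
            raise ValueError('invalid or truncated escape sequence')
    return bytes(out)

if __name__ == '__main__':
    sys.stdout.buffer.write(decode(sys.stdin.buffer.read()))
```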
The complete `git` repo on GitHub also contains a C code wrapper `bashnul`, which runs much faster than the Python code (the C program also detects encoding errors etc.). (Beware, it's not thoroughly tested.)