Grep with a regex character range that includes the NULL character


When I include the NULL character (\x00) in a regex character range in BSD grep, the result is unexpected: no characters match. Why is this happening?

Here is an example:

$ echo 'ABCabc<>/ă' | grep -o [$'\x00'-$'\x7f']

Here I expect every character except the last one to match; however, there is no output (no matches).

Alternatively, when I start the character range from \x01, it works as expected:

$ echo 'ABCabc<>/ă' | grep -o [$'\x01'-$'\x7f']
A
B
C
a
b
c
<
>
/

Also, here are my grep and BASH versions:

$ grep --version
grep (BSD grep) 2.5.1-FreeBSD

$ echo $BASH_VERSION
3.2.57(1)-release

There are 2 best solutions below

6
ilkkachu On BEST ANSWER

Noting that $'...' is a shell quoting construct, this,

$ echo 'ABCabc<>/ă' | grep -o [$'\x00'-$'\x7f']

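would try to pass a literal NUL character as part of a command line argument to grep. That is impossible on any Unix-like system, because command line arguments are passed to the process as NUL-terminated strings: an argument ends at its first NUL. If the NUL were passed through, grep would see just the arguments -o and [. Bash, as far as I can tell, drops the NUL while expanding $'\x00', which would leave grep with -o and [-\x7f]; that set contains only - and DEL, so nothing in the sample input matches. Either way, the range you wrote never reaches grep.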
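One way to check what actually reaches a command is to hex-dump each argument. This is only a sketch: show_args is a hypothetical helper introduced for illustration, and the exact bytes depend on your shell. In bash I would expect the NUL to be missing from the dump; a shell without $'...' quoting would pass the text through literally instead.

```shell
# Hypothetical helper: hex-dump each argument as a command would receive it.
show_args() {
  for arg in "$@"; do
    printf '%s' "$arg" | od -An -tx1
  done
}

# Same words as in the question (the unquoted brackets are kept on purpose).
arg_bytes=$(show_args -o [$'\x00'-$'\x7f'])
printf '%s\n' "$arg_bytes"
```

Whatever the shell does with the escape, no 00 byte should appear in the dump, because an argument cannot contain one.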

You would need to create some pattern that matches the NUL byte without including it literally. But I don't think grep supports the \000 or \x00 escapes itself. Perl does, though, so this prints the input line with the NUL:

$ printf 'foo\nbar\0\n' |perl -ne 'print if /\000/'
bar

As an aside, at least GNU grep doesn't seem to like that kind of range expression, so if you were to use it, you'd have to do something different. In the C locale, '[[:cntrl:][:print:]]' might work to match the characters from \x01 to \x7f, but I didn't check comprehensively. The manual for grep has some descriptions of the classes.


Note also that [$'\x00'-$'\x7f'] has an unquoted pair of [ and ], and so is a shell glob. This isn't related to the NUL byte, but if you had files that match the glob (any one-character names, if the glob works on your system; it doesn't on my Linux), or had failglob or nullglob set, it would probably give results you didn't want. Instead, quote the brackets too, for example $'[\x01-\x7f]' (with \x01 as the lower bound, since the NUL cannot be passed anyway).
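The globbing problem is easier to see with a simpler bracket expression. The sketch below uses [a-z] instead of the control-character range and a throwaway directory from mktemp; the file name a and the variable names are assumptions made for the demonstration.

```shell
# Work in a scratch directory that contains one single-character file name.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
touch a

# Unquoted: the shell expands the brackets as a glob before the command runs.
set -- [a-z]
unquoted=$1

# Quoted: the pattern is passed through unchanged.
set -- '[a-z]'
quoted=$1

printf 'unquoted: %s\nquoted:   %s\n' "$unquoted" "$quoted"

# Clean up the scratch directory.
cd / && rm -rf "$tmpdir"
```

With the file present, the unquoted word turns into the file name a, so grep would have been given a as its pattern, while the quoted form stays [a-z].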

0
anubhava On

On BSD grep, you may be able to use this:

LC_ALL=C grep -o '[[:print:][:cntrl:]]' <<< 'ABCabc<>/ă'

A
B
C
a
b
c
<
>
/

Or you can install GNU grep with Homebrew (which installs it as ggrep unless you add its gnubin directory to your PATH) and run:

grep -oP '[[:ascii:]]' <<< 'ABCabc<>/ă'