Find and replace curly quotes inside a character class

Asked by Daan on 14 May 2020 at 19:42 (67 views)

I'm getting strange results when I use a character class to find curly quotes and replace them with another character:

sed -E "s/[‘’]/'/g" in.txt > out.txt

in.txt:  ‘foo’
out.txt: '''foo'''

If you use the letter a as the replacement, you get aaafooaaa. This only happens when the curly quotes are inside a character class. Alternation works:

sed -E "s/(‘|’)/'/g" in.txt > out.txt

in.txt:  ‘foo’
out.txt: 'foo'

Can anyone explain what's going on here? Can I still use a character class for curly quotes?

Best answer, by Mark Reed on 14 May 2020 at 19:48

Your string is using a multibyte encoding, specifically UTF-8; the curly quotes are three bytes each. But your sed implementation is treating each byte as a separate character. This is probably due to your locale settings. I can reproduce your problem by setting my locale to "C" (the old default POSIX locale, which assumes ASCII):

$ LC_ALL=C sed -E "s/[‘’]/'/g" <<<'‘foo’' # C locale, single-byte chars
'''foo'''

But in my normal locale of en_US.UTF-8 ("US English encoded with UTF-8"), I get the desired result:

$ LC_ALL=en_US.UTF-8 sed -E "s/[‘’]/'/g" <<<'‘foo’' # UTF-8 locale, multibyte chars
'foo'

The way you're running it, sed sees [‘’] not as a sequence of four characters but as a sequence of eight: the two brackets plus the three bytes of each quote. Each of the six bytes between the brackets (or rather, each of the four distinct values among those bytes) is treated as a member of the character class, and each matching byte is separately replaced by the apostrophe. That is why each three-byte curly quote is replaced by three apostrophes.
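If you want to see those bytes for yourself, something like the following should do it. This is only a sketch: it assumes the script is saved as UTF-8 and that od, tr and wc are available, and the variable names are made up for illustration.

```shell
# Dump the raw bytes of the two curly quotes as hex (UTF-8 input assumed).
quote_bytes=$(printf '%s' '‘’' | od -An -tx1 | tr -d ' \n')
echo "$quote_bytes"    # e28098e28099: E2 80 98 for ‘ and E2 80 99 for ’

# Count the bytes in the whole bracket expression.
# wc -c always counts bytes, which is what a single-byte locale treats as characters.
class_bytes=$(printf '%s' '[‘’]' | wc -c | tr -d ' ')
echo "$class_bytes"    # 8: two brackets plus 3 + 3 quote bytes
```

The six quote bytes contain only four distinct values (e2, 80, 98, 99), which matches the four class members described above.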

The version that uses alternation works because each alternate can be more than one character; even though sed is still treating ‘ and ’ as three-character sequences instead of individual characters, that treatment doesn't change the result.

So make sure your locale is set properly for your text encoding and see if that resolves your issue.
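As a rough sketch of how you might check this, the snippet below looks for an installed UTF-8 locale and runs the same sed command under it. Locale names vary by system (en_US.UTF-8, C.UTF-8, en_US.utf8 and so on), so it picks whatever `locale -a` lists first instead of assuming a specific one; the variable names utf8_locale and result are just for illustration.

```shell
# Find the first installed locale whose name mentions UTF-8 (names differ per system).
utf8_locale=$(locale -a 2>/dev/null | grep -i -E 'utf-?8' | head -n 1)

if [ -n "$utf8_locale" ]; then
    # Run the character-class version under that locale for this one command only.
    result=$(printf '%s\n' '‘foo’' | LC_ALL="$utf8_locale" sed -E "s/[‘’]/'/g")
    echo "$result"
else
    echo "no UTF-8 locale found by locale -a" >&2
fi
```

Setting LC_ALL on the command line affects only that one invocation. To make it stick you would export LANG or LC_ALL in your shell startup file, after checking with `locale` what is currently set.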
