I'm getting strange results when I try to replace curly quotes with another character using a character class:
sed -E "s/[‘’]/'/g" in.txt > out.txt
in.txt: ‘foo’
out.txt: '''foo'''
If you use "a" as the replacement, you'll get aaafooaaa. But this is only an issue when the curly quotes are inside a character class. This works:
sed -E "s/(‘|’)/'/g" in.txt > out.txt
in.txt: ‘foo’
out.txt: 'foo'
Can anyone explain what's going on here? Can I still use a character class for curly quotes?
Your string is using a multibyte encoding, specifically UTF-8; the curly quotes are three bytes each. But your sed implementation is treating each byte as a separate character. This is probably due to your locale settings. I can reproduce your problem by setting my locale to "C" (the old default POSIX locale, which assumes ASCII):
But in my normal locale of en_US.UTF-8 ("US English encoded with UTF-8"), I get the desired result:
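Both runs can be sketched roughly as follows. This is a minimal sketch, assuming GNU sed and a shell script saved as UTF-8; the variable names and the `locale -a` lookup are my own illustration, since locale names differ between systems and en_US.UTF-8 may not be installed on yours.

```shell
# C locale: every byte counts as one character, so the bracket
# expression matches the individual bytes of the curly quotes.
c_out=$(printf '%s\n' '‘foo’' | LC_ALL=C sed -E "s/[‘’]/'/g")
echo "C locale: $c_out"

# Pick any installed UTF-8 locale (assumption: the name contains "utf8" or "UTF-8").
utf8_locale=$(locale -a 2>/dev/null | grep -i -E 'utf-?8' | head -n 1)

if [ -n "$utf8_locale" ]; then
  # UTF-8 locale: each curly quote is decoded as a single character.
  utf8_out=$(printf '%s\n' '‘foo’' | LC_ALL="$utf8_locale" sed -E "s/[‘’]/'/g")
  echo "$utf8_locale: $utf8_out"
else
  echo "no UTF-8 locale found; install or generate one first"
fi
```

Setting LC_ALL on the command line only affects that one sed invocation, which makes it a convenient way to compare locales without changing your shell environment.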
The way you're running it, sed doesn't see [‘’] as a sequence of four characters but of eight. So each of the six bytes between the brackets (or at least, each of the four unique values found in those bytes) is considered a member of the character class, and each matching byte is separately replaced by the apostrophe. That is why your three-byte curly quotes are getting replaced by three apostrophes each.
The version that uses alternation works because each alternate can be more than one character; even though sed is still treating ‘ and ’ as three-character sequences instead of individual characters, that treatment doesn't change the result.
So make sure your locale is set properly for your text encoding and see if that resolves your issue.
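If you want to check the byte counts described above for yourself, something like the following should do it. It is only a sketch using standard tools (wc, od, sort, xargs), it assumes the script is saved as UTF-8, and the variable names are made up for illustration.

```shell
# wc -c counts bytes regardless of locale; tr strips padding some wc versions add.
quote_bytes=$(printf '%s' '‘' | wc -c | tr -d ' ')
class_bytes=$(printf '%s' '[‘’]' | wc -c | tr -d ' ')
echo "bytes in one curly quote: $quote_bytes"
echo "bytes in the bracket expression: $class_bytes"

# Dump the two quotes as hex, one byte per line, then keep the distinct values.
unique_bytes=$(printf '%s' '‘’' | od -An -tx1 | tr -s ' ' '\n' | grep . | LC_ALL=C sort -u | xargs)
echo "distinct byte values: $unique_bytes"
```

The distinct values listed at the end are what a byte-oriented sed actually puts in the character class.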