I am working on AIX and trying to remove non-printable characters from a file. When I view the file in Notepad++ using UTF-8 encoding, the data looks like
in Arizona w/ fiancÃÂÃÂÃÂ
When I view the file in the Unix shell I get ^▒▒^▒▒^▒▒^▒▒^▒▒^▒▒
I want to replace all those special characters with spaces, so my output should look like
in Arizona w/ fianc
I tried
sed 's/[^[:print:]]/ /g' file
but it does not remove those characters. My locales are listed below, as shown by locale -a:
C
POSIX
en_US.8859-15
en_US.ISO8859-1
en_US
I even tried
sed -e 's/[^ -~]/ /g'
and it did not remove the characters either.
I see that other Stack Overflow answers used a UTF-8 locale with GNU sed and this worked, but I do not have that locale.
Also, I am using ksh.
Easiest - strings

The easiest way to do this is to run the strings command on the file.

The problems with this approach:
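A limitation that is easy to demonstrate: strings prints only runs of printable characters that reach a minimum length (4 by default per POSIX), one run per line, so it drops the unwanted bytes instead of replacing them with spaces, and short fragments between them disappear. A minimal sketch, assuming made-up sample lines built with printf and a strings that reads standard input when no file is given:

```shell
# Hypothetical sample: ASCII text followed by four non-ASCII bytes.
# The printable run is long enough, so strings keeps it.
kept=$(printf 'in Arizona w/ fianc\303\203\302\202\n' | strings)
printf '[%s]\n' "$kept"

# Here the printable runs "ab" and "cd" are only 2 characters each,
# below the default minimum of 4, so strings prints nothing for them.
short=$(printf 'ab\303\203cd\n' | strings)
printf '[%s]\n' "$short"
```

If short fragments matter, the -n option lowers the minimum run length, at the cost of letting more stray characters through.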
Ugliest - sed's l plus sed post-processing

Now, if you must use sed, then here's an alternative: pipe the file through sed's l command and clean up the result with a second sed.

Here, you're using l to 'dump' non-printable characters, transforming them into octal representations like \303, then removing anything that looks like an octal value so created, and then removing the $ that l added at the end of the line.

It's kinda ugly, and may interact badly with your file if it contains anything that starts with a backslash followed by three digits, so I'd stay with the strings option.

Better - sed ranges with high Unicode characters

This one is also a hack, but looks better than the rest. It uses sed ranges, starting with '¡'. I picked that symbol because it is the second* character in the upper (non-ASCII) half of the ISO-8859-1 encoding, which coincides with the Unicode block right after ASCII. So, I'm guessing that you're not having trouble with actual control codes, but with non-ASCII characters instead (anything represented over 127 decimal).

For the second item in the range, just pick some non-Latin character (Japanese, Chinese, Hebrew, Arabic, etc.), hoping it will be high enough in Unicode that the range includes all of your 'non-printing' characters.
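The same range idea can also be written against raw bytes, which may help when no UTF-8 locale is installed. The sketch below is a variation, not the command from this answer: it assumes that running sed under LC_ALL=C (the C locale appears in the locale -a output in the question) makes it treat every byte as one character, and it builds the range endpoints, octal 200 and 377, with printf so that no non-ASCII character has to be typed. The variable names and the sample line are made up, and this has not been tried on AIX.

```shell
# Range endpoints as raw bytes: 0x80 (first byte above ASCII) and 0xFF.
lo=$(printf '\200')
hi=$(printf '\377')

# Hypothetical sample line: ASCII text followed by four non-ASCII bytes.
sample=$(printf 'in Arizona w/ fianc\303\203\302\202')

# In the C locale each byte counts as one character, so every byte in the
# closed range 0x80-0xFF is replaced by one space.
cleaned=$(printf '%s\n' "$sample" | LC_ALL=C sed "s/[${lo}-${hi}]/ /g")
printf '[%s]\n' "$cleaned"
```

Because the substitution works byte by byte, a multi-byte character turns into as many spaces as it has bytes.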
Unfortunately, sed does not have an [[:ascii:]] class, nor does it accept open-ended ranges, so you need this hack.

(*) Note: I picked the second character in the range because the first character is a non-breaking space, which would be hard to tell apart from a normal space.
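For reference, the sed l pipeline from the "Ugliest" section could look roughly like the sketch below. It is a reconstruction from the description, not the original command: it assumes the first sed runs under LC_ALL=C so that l prints each non-ASCII byte as a backslash plus three octal digits, and the sample line is made up. Note that l also folds long lines, which this sketch does not handle.

```shell
# Hypothetical sample line: ASCII text followed by four non-ASCII bytes.
sample=$(printf 'in Arizona w/ fianc\303\203\302\202')

# Step 1: 'l' dumps the line unambiguously: non-printable bytes become
#         octal escapes such as \303, and a '$' marks the end of the line.
# Step 2: replace every backslash + three octal digits with a space,
#         then strip the trailing '$' that 'l' appended.
dumped_clean=$(printf '%s\n' "$sample" |
  LC_ALL=C sed -n 'l' |
  sed -e 's/\\[0-7][0-7][0-7]/ /g' -e 's/\$$//')
printf '[%s]\n' "$dumped_clean"
```

As the answer warns, text in the file that already looks like a backslash followed by three digits would also be hit by the second step.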