How to remove punctuation from a string with exceptions using regex in bash


The command echo "Jiro. Inagaki' & Soul, Media_Breeze." | tr -d '[:punct:]' prints the string "Jiro Inagaki  Soul MediaBreeze".

However, I want to find a regular expression that will remove all punctuation except the underscore and the ampersand, i.e. I want "Jiro Inagaki & Soul Media_Breeze".

Following advice on character class subtraction from the sources listed at the bottom, I've tried replacing [:punct:] with the following:

  • [\p{P}\-[&_]]
  • [[:punct:]-[&_]]
  • (?![\&_])\p{P}
  • (?![\&_])[:punct:]
  • [[:punct:]&&[&_]]
  • [[:punct:]&&[^&_]]

... but I haven't gotten anything to work so far. Any help would be much appreciated!

Sources:


Best answer:

You can specify the punctuation marks you want removed, e.g.

>echo "Jiro. Inagaki' & Soul, Media_Breeze." | tr -d "[.,/\\-\=\+\{\[\]\}\!\@\#\$\%\^\*\'\\\(\)]"
Jiro Inagaki & Soul Media_Breeze

Or, alternatively, delete everything except the characters you want to keep (-c complements the set):

>echo "Jiro. Inagaki' & Soul, Media_Breeze." | tr -dc '[:alnum:] &_'
Jiro Inagaki & Soul Media_Breeze
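
One detail of the complement form: every character that is not listed gets deleted, and that includes the trailing newline that echo emits. A minimal sketch that adds \n to the kept set is below. The function name keep_chars is only an illustrative assumption, not something from the answer.

```shell
# keep_chars is a hypothetical helper name used for illustration.
# -d deletes, -c complements the set: everything that is NOT an
# alphanumeric, a space, &, _ or a newline is removed.
keep_chars() {
    tr -dc '[:alnum:] &_\n'
}

printf '%s\n' "Jiro. Inagaki' & Soul, Media_Breeze." | keep_chars
```

Note that this also drops any other whitespace (tabs, for instance) unless you add it to the set.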

Posting my comment as an answer as requested by @jared_mamrot.

You can manually type out the set of punctuation that you want to delete, leaving out _ and &. I took my punctuation set from the GNU docs on [:punct:]:

‘[:punct:]’ Punctuation characters; in the ‘C’ locale and ASCII character encoding, this is ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.

You can also look at the POSIX docs, which say that the character classes depend on the locale:

punct    <exclamation-mark>;<quotation-mark>;<number-sign>;\
         <dollar-sign>;<percent-sign>;<ampersand>;<apostrophe>;\
         <left-parenthesis>;<right-parenthesis>;<asterisk>;\
         <plus-sign>;<comma>;<hyphen>;<period>;<slash>;\
         <colon>;<semicolon>;<less-than-sign>;<equals-sign>;\
         <greater-than-sign>;<question-mark>;<commercial-at>;\
         <left-square-bracket>;<backslash>;<right-square-bracket>;\
         <circumflex>;<underscore>;<grave-accent>;<left-curly-bracket>;\
         <vertical-line>;<right-curly-bracket>;<tilde>

$ echo 'abcd_!"#$%()*+,-./:;<=>?@][\\^`{}|~'"'" | tr -d '!"#$%()*+,-./:;<=>?@][\\^`{}|~'"'"
abcd_

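The set of characters passed to tr is literal except for two cases. The backslash is written as \\ because tr treats a backslash as an escape character. The single quote is appended as "'", i.e. a double-quoted string concatenated onto the single-quoted one, since you can't escape a single quote within single quotes. Also note that tr reads ,-. as a range, which in ASCII happens to cover exactly the comma, hyphen and period.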
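
Applied to the string from the question, the same idea might look like the sketch below. The names punct_to_delete and strip_punct are illustrative assumptions. I have also moved the hyphen to the end of the set so that tr cannot read it as a range.

```shell
# Explicit ASCII punctuation set (C locale), with & and _ left out on purpose.
# \\ is tr's escape for a single backslash; "'" appends a literal single
# quote; the hyphen comes last so it is not treated as a range operator.
punct_to_delete='!"#$%()*+,./:;<=>?@][\\^`{|}~'"'"'-'

# strip_punct is a hypothetical helper name used for illustration.
strip_punct() {
    tr -d "$punct_to_delete"
}

printf '%s\n' "Jiro. Inagaki' & Soul, Media_Breeze." | strip_punct
```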

That said, I prefer @jared_mamrot's complement solution where possible. It is much neater.
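
Since the question asks for a regular expression and tr only takes character sets, here is a hedged sketch of the same complement idea using sed with a negated POSIX bracket expression. The function name strip_with_sed is an assumption for illustration. I have only considered ASCII input here.

```shell
# Regex form of the complement idea: delete every character that is not
# an alphanumeric, a space, & or _. Inside a bracket expression & is a
# literal character. sed works line by line, so newlines are preserved.
# strip_with_sed is a hypothetical helper name.
strip_with_sed() {
    sed 's/[^[:alnum:] &_]//g'
}

printf '%s\n' "Jiro. Inagaki' & Soul, Media_Breeze." | strip_with_sed
```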