Using awk for conditional find/replace

Asked by Mykro At 23 October 2011 at 07:14

I want to solve a common but very specific problem: due to OCR errors, a lot of subtitle files contain the character "I" (upper case i) instead of "l" (lower case L).

My plan of attack is:

Process the file word by word
Pass each word to the hunspell spellchecker ("echo the-word | hunspell -l" produces no response at all if it is valid, and a response if it is bad)
If it is a bad word, AND it has uppercase Is in it, then replace these with lowercase l and try again. If it is now a valid word, replace the original word.

I could certainly tokenize and reconstruct the entire file in a script, but before I go down that path I was wondering if it is possible to use awk and/or sed for these kinds of conditional operations at the word-level?

Any other suggested approaches would also be very welcome!

glenn jackman On 23 October 2011 at 12:07 BEST ANSWER

You don't really need more than bash for this:

while IFS= read -r line; do
  # split the line into words without triggering glob expansion
  read -ra words <<< "$line"
  for ((i=0; i<${#words[@]}; i++)); do
    word=${words[i]}
    if [[ -n $(hunspell -l <<< "$word") ]]; then
      # hunspell had some output, so the word is misspelled
      tmp=${word//I/l}
      if [[ $tmp != "$word" && -z $(hunspell -l <<< "$tmp") ]]; then
        # no output for the new word, therefore it's a dictionary word
        words[i]=$tmp
      fi
    fi
  done
  # print the new line (runs of whitespace collapse to single spaces)
  printf '%s\n' "${words[*]}"
done < filename > filename.new

It does seem to make more sense to pass the whole file to hunspell, and parse the output of that.

Jens On 23 October 2011 at 10:23

Two suggestions:

Fix the problem closer to where it originates, i.e. in the OCR software. Can it be made to consult a dictionary so that it never comes up with non-words containing 'I' in the first place? If not, try a different OCR program that can.
Running each word through hunspell creates a process for each word, which is a massive waste of CPU cycles. Try using several passes, where the first pass finds all 'I' words, then filter out correct words, then replace each correctable word.
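
The multi-pass idea might look roughly like the sketch below, which calls the spellchecker twice in total instead of once or twice per word. The function names fix_ocr_I and list_misspelled are made up for illustration. It assumes that "hunspell -l" reads text on stdin and prints the unknown words one per line, that the default dictionary matches the language of the subtitles, and that a word is a plain run of ASCII letters, so words containing apostrophes are left untouched.

```shell
#!/usr/bin/env bash

# Hypothetical wrapper so the spellchecker is easy to swap out.
# Assumption: reads text on stdin, prints unrecognised words one per line.
list_misspelled() { hunspell -l; }

# Usage: fix_ocr_I FILE   (corrected text goes to stdout)
fix_ocr_I() {
  local file=$1 bad still_bad word cand pairs=""

  # Pass 1: one spellchecker run over the whole file;
  # keep only the unknown words that contain an uppercase I.
  bad=$(list_misspelled < "$file" | grep I | sort -u)

  # Pass 2: one more run over all candidates (every I replaced by l);
  # whatever is printed here is still not a dictionary word.
  still_bad=$(printf '%s\n' "${bad//I/l}" | list_misspelled)

  # Collect "bad=fixed" pairs for the candidates that were accepted.
  while IFS= read -r word; do
    [[ -n $word ]] || continue
    cand=${word//I/l}
    grep -qxF -- "$cand" <<< "$still_bad" || pairs+="$word=$cand "
  done <<< "$bad"

  # Pass 3: awk swaps whole words only and leaves everything else as is.
  awk -v pairs="$pairs" '
    BEGIN {
      n = split(pairs, p, " ")
      for (i = 1; i <= n; i++) { split(p[i], kv, "="); map[kv[1]] = kv[2] }
    }
    {
      out = ""; rest = $0
      # walk through the line one run of letters at a time
      while (match(rest, /[A-Za-z]+/)) {
        word = substr(rest, RSTART, RLENGTH)
        if (word in map) word = map[word]
        out = out substr(rest, 1, RSTART - 1) word
        rest = substr(rest, RSTART + RLENGTH)
      }
      print out rest
    }
  ' "$file"
}
```

With the functions loaded into the shell, something like "fix_ocr_I filename > filename.new" would write the corrected copy. Unlike a word-by-word rebuild of each line, the awk pass keeps the original spacing and punctuation, which matters for subtitle timing lines.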
