I want to solve a common but very specific problem: due to OCR errors, a lot of subtitle files contain the character "I" (upper case i) instead of "l" (lower case L).
My plan of attack is:
- Process the file word by word
- Pass each word to the hunspell spellchecker ("echo the-word | hunspell -l" prints nothing if the word is valid, and prints the word back if it is misspelled)
- If it is a bad word AND it contains uppercase I's, replace them with lowercase l and check again. If the result is a valid word, substitute it for the original word.
I could certainly tokenize and reconstruct the entire file in a script, but before I go down that path I was wondering if it is possible to use awk and/or sed for these kinds of conditional operations at the word-level?
Any other suggested approaches would also be very welcome!
You don't really need more than bash for this:
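A minimal sketch of what a bash-only version of the plan in the question could look like. The helper names `is_valid`, `fix_word` and `fix_stream` are made up for illustration, and it assumes `hunspell` is on the PATH with a default dictionary that matches the subtitle language:

```bash
#!/usr/bin/env bash
# Sketch only: is_valid, fix_word and fix_stream are hypothetical helper names.

# Succeed when `hunspell -l` prints nothing for the word, i.e. it is spelled correctly.
# Assumption: hunspell finds a suitable default dictionary.
is_valid() {
    [[ -z $(printf '%s\n' "$1" | hunspell -l) ]]
}

# Print the word, with every I turned into l only when that
# changes a misspelled word into a correctly spelled one.
fix_word() {
    local word=$1 candidate
    if [[ $word == *I* ]] && ! is_valid "$word"; then
        candidate=${word//I/l}
        if is_valid "$candidate"; then
            printf '%s' "$candidate"
            return
        fi
    fi
    printf '%s' "$word"
}

# Read text on stdin and write it to stdout, checking each run of letters
# as a word and copying everything else (digits, punctuation, spaces) as is.
fix_stream() {
    local line out
    local re='^([^[:alpha:]]*)([[:alpha:]]+)(.*)$'
    while IFS= read -r line || [[ -n $line ]]; do
        out=
        while [[ $line =~ $re ]]; do
            out+=${BASH_REMATCH[1]}$(fix_word "${BASH_REMATCH[2]}")
            line=${BASH_REMATCH[3]}
        done
        printf '%s\n' "$out$line"
    done
}
```

It could be run as `fix_stream < input.srt > output.srt` (file names are placeholders). Only words that contain an uppercase I are ever sent to hunspell, so timestamps and correct lines cost nothing. Note that all I's in a word are replaced together, so a word that needs only some of them changed is left untouched.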
It does seem to make more sense to pass the whole file to hunspell and parse its output, rather than starting a separate hunspell process for every word.
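One way this whole-file idea might be sketched in bash: run `hunspell -l` once over the file to get the misspelled words, keep those containing an uppercase I, test the I-to-l candidate of each, and let sed apply the replacements that pass. The names `list_bad_words` and `fix_file` are hypothetical, and the `\b` word-boundary escape is a GNU sed extension, so this assumes GNU sed:

```bash
#!/usr/bin/env bash
# Sketch only: list_bad_words and fix_file are hypothetical helper names.

# Print the misspelled words found on stdin, one per line.
# Assumption: hunspell finds a suitable default dictionary.
list_bad_words() {
    hunspell -l
}

# Print a corrected copy of the file named in $1 to stdout.
fix_file() {
    local file=$1 word candidate script=
    # One hunspell run over the whole file yields every misspelled word.
    while IFS= read -r word; do
        [[ $word == *I* ]] || continue      # only words with an uppercase I
        candidate=${word//I/l}
        # Keep the fix only if the candidate is itself spelled correctly.
        if [[ -z $(printf '%s\n' "$candidate" | list_bad_words) ]]; then
            # \b is GNU sed syntax; words are assumed to contain
            # no regex metacharacters.
            script+="s/\\b${word}\\b/${candidate}/g"$'\n'
        fi
    done < <(list_bad_words < "$file" | sort -u)

    if [[ -z $script ]]; then
        cat -- "$file"                      # nothing to fix
    else
        sed -e "$script" -- "$file"
    fi
}
```

Called as `fix_file input.srt > output.srt` (placeholder names), this spawns hunspell once for the file plus once per distinct suspect word, instead of once per word in the text.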