Fixing corrupt mbox files with sed/awk


I have a bunch of old, inherited mbox files that I want to convert to maildir. Problem: the mboxes are not fully RFC compliant. In several mailboxes, some (but not all) mails are missing the empty line before the "^From " line, so mb2md does not separate these mails from each other.

Example:

...
Text of mail 1
... bla....    
To unsubscribe, visit https:...                      
From fetchmail Fri Nov  8 18:35:54 CET 2002          ## ^missing empty line above
...
Text of mail 2
...

Now I'm looking for an easy way to insert an empty line before any line matching "^From ", but only when it is not already preceded by an empty line. Some kind of stream editing is a must, because the mailboxes can be really huge.

I use sed regularly, but I'm not familiar with multiline matching. I tried several things today (cut and paste with modifications) without success :(

My last attempt was sed -E ':a;N;$!ba;s/\n(..*)\nFrom /\n\1\n\nFrom /g' /tmp/testfile

which only matched the last occurrence of the pattern!?

sed/awk experts, do you have any hints for me?

There are 2 best solutions below

that only matched the last occurrence of the pattern!?

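Yes. Regex quantifiers are greedy. With the whole file in pattern space, the .* also matches newlines, so it consumes as much as it can and only the last \nFrom is left to match. To match a single line, match everything except a newline instead (-z is a GNU sed option and reads the whole file as one record):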

sed -z -E 's/(\n[^\n]+\n)(From )/\1\n\2/g'
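To make the difference visible, here is a minimal sketch that runs both commands on a small sample. The three-message sample and its addresses are made up for illustration, and GNU sed is assumed (for -z, for labels followed by semicolons, and for \n inside a bracket expression).

```shell
#!/bin/sh
# Hypothetical three-message mailbox; messages 2 and 3 lack the blank separator line.
sample='From a@example.org Fri Nov  8 18:35:54 2002
body 1
From b@example.org Fri Nov  8 18:36:00 2002
body 2
From c@example.org Fri Nov  8 18:37:00 2002
body 3'

# Greedy attempt from the question: ..* runs across newlines up to the last "\nFrom ".
greedy=$(printf '%s\n' "$sample" | sed -E ':a;N;$!ba;s/\n(..*)\nFrom /\n\1\n\nFrom /g')

# Line-bounded version: [^\n]+ cannot cross a newline, so each boundary matches separately.
bounded=$(printf '%s\n' "$sample" | sed -z -E 's/(\n[^\n]+\n)(From )/\1\n\2/g')

# Count the blank lines each variant inserted.
greedy_blank=$(printf '%s\n' "$greedy" | grep -c '^$')
bounded_blank=$(printf '%s\n' "$bounded" | grep -c '^$')
echo "greedy: $greedy_blank, bounded: $bounded_blank"
```

With GNU sed this should report one inserted blank line for the greedy version and two for the line-bounded one.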

If you do not want to read the whole file into memory, you have to keep at least two lines in memory. Below, the previous line is kept in hold space; on each line read, the current line is appended to it so that the condition can be checked on the pair. After the check, the previous line is printed.

sed -n -E '
      # Hold the first line and start the next cycle.
      1{h;b}
      # Append the current line to hold space, then swap hold and pattern space
      # so that pattern space contains previous\ncurrent.
      H;x
      # If the current line starts with "From " and the previous line is not empty,
      # turn the separating newline into two newlines.
      /.+\nFrom /s/\n/\n\n/
      # Remove the current line from pattern space.
      s/\n[^\n]*$//
      # Print the previous line, possibly followed by the extra empty line.
      p
      # On the last line, also print the held current line.
      ${x;p}
'

And as a one-liner:

sed -nE '1{h;b};H;x;/.+\nFrom /s/\n/\n\n/;s/\n[^\n]*$//p;${x;p}'
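A quick sanity check for the streaming version is to run it twice: a second pass should change nothing, which shows that boundaries that already have their empty line are left alone. This is only a sketch; the sample mailbox and the function name fix_mbox are made up for illustration, and GNU sed is assumed.

```shell
#!/bin/sh
# Hypothetical sample: message 2 already has its blank separator, message 3 does not.
sample='From a@example.org Fri Nov  8 18:35:54 2002
body 1

From b@example.org Fri Nov  8 18:36:00 2002
body 2
From c@example.org Fri Nov  8 18:37:00 2002
body 3'

# The streaming one-liner, wrapped in a function that filters stdin to stdout.
fix_mbox() {
    sed -nE '1{h;b};H;x;/.+\nFrom /s/\n/\n\n/;s/\n[^\n]*$//p;${x;p}'
}

once=$(printf '%s\n' "$sample" | fix_mbox)
twice=$(printf '%s\n' "$once" | fix_mbox)

# A second pass should find nothing left to fix.
if [ "$once" = "$twice" ]; then
    echo "second pass made no changes"
else
    echo "second pass changed the output"
fi
```

Note that the script prints the first line only once a second line has been read, so an input consisting of a single line produces no output at all.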

Any time you're using sed constructs other than s, g, and p (with -n) you're using the wrong tool. If you can't use formail for some reason then just use awk (the pattern keeps the trailing space of "^From " so that "From:" header lines are left alone):

$ awk '/^From / && p{print ""} {p=NF; print}' file
...
Text of mail 1
... bla....
To unsubscribe, visit https:...

From fetchmail Fri Nov  8 18:35:54 CET 2002          ## ^missing empty line above
...
Text of mail 2
...

That will work using any awk on any UNIX box, and since it reads just one line at a time, it will work no matter how large your input files are.
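Before handing the result to mb2md, it may help to count how many separator lines are still missing their empty line. The following is a minimal sketch of such a check; the sample mailbox and the function names count_bad and fix_mbox are made up for illustration. It matches "From " with the trailing space, as in the question, so "From:" header lines are never treated as separators.

```shell
#!/bin/sh
# Hypothetical sample: the second message lacks the blank separator line,
# and each message carries a "From:" header that must stay untouched.
sample='From a@example.org Fri Nov  8 18:35:54 2002
From: a@example.org
Subject: one

body 1
From b@example.org Fri Nov  8 18:36:00 2002
From: b@example.org
Subject: two

body 2'

# Count "From " separator lines (other than line 1) whose previous line is not empty.
count_bad() {
    awk '/^From / && NR > 1 && prev != "" { n++ } { prev = $0 } END { print n + 0 }'
}

# Insert the missing empty line before such separators, one line at a time.
fix_mbox() {
    awk '/^From / && p { print "" } { p = NF; print }'
}

before=$(printf '%s\n' "$sample" | count_bad)
after=$(printf '%s\n' "$sample" | fix_mbox | count_bad)
echo "before: $before, after: $after"
```

On a real mailbox the same two functions can be fed from the file, for example count_bad < file, and a count of 0 after the fix suggests every separator now has its empty line.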