Handling fixed-length records with embedded unix newlines


I am receiving text files with fixed-length fields and records delimited by carriage return/newline pairs (CRLF). Recently one of the text fields started arriving with a bare newline character (LF) inside the record. This is causing problems on our Unix server.

I would like to simply find each LF in the file and replace it with a single space, but doing that naively would also hit the LF in the Windows line endings (CRLF).

I have tried tr and perl but can't quite seem to get it right:

cat badinput.txt | perl -p -e 's/\x0D\x0A/\x0D/' | perl -p -e 's/\0A/ /' | perl -p -e 's/\x0D/\x0D\x0A/' > goodoutput.txt

The idea is to

  • replace CRLF with CR
  • replace LF with a space
  • replace CR with CRLF

For some reason I'm not quite getting the CR -> CRLF transformation.

Suggestions?

Philippe On BEST ANSWER

You can read the whole input with -0777 and then do the substitution:

perl -0777pe 's/\r\n/\r/g;s/\n/ /g;s/\r/\r\n/g' badinput.txt

The options are:

  • -p, which prints the value of $_ at the end of each "line"
  • -0777, which sets the input record separator to undef, so the whole file is read as a single record

See: Perl Command-line Options (perlrun)

zdim On

Why not replace \x0A (with a space) when it is not immediately preceded by \x0D?

s/(?<!\x0D)\x0A/ /;

This uses a negative lookbehind.

It is probably safest to read the file into a string ("slurp" it) as it is not clear what those LF/CRLF will do for reading it "line by line" -- what is a "line" on the OS on which this is processed? So

perl -0777 -wpE's/(?<!\x0D)\x0A/ /g' file

The -0777 command-line switch effectively unsets the input record separator, so the whole file is read as one string.

This prints out the file with changes. To change it in place, add -i. See the linked docs.

brian d foy On

To be clear, you have records separated by a carriage-return / newline pair (and I edited your question for that). (And, update: you didn't have CSV as you said you did, and fixed-length records make a big difference. Why are there line endings in fixed-length records?) You can set the input record separator to that pair, read one record at a time, and modify anything you like in it. You don't need chomp or any special line ending armor.

Here's a sample file, where there's a bare newline between the two NL literals and the line endings are CRLF (although Stack Overflow probably won't show you that):

one,two,three
uno,dos,tress
dog,cat,NL
NL
one,two,again

And here is what it looks like as bytes (notice 4e 4c 0a 4e 4c, where the 0a has no 0d in front of it):

$ hexdump -C badinput.txt
00000000  6f 6e 65 2c 74 77 6f 2c  74 68 72 65 65 0d 0a 75  |one,two,three..u|
00000010  6e 6f 2c 64 6f 73 2c 74  72 65 73 73 0d 0a 64 6f  |no,dos,tress..do|
00000020  67 2c 63 61 74 2c 4e 4c  0a 4e 4c 0d 0a 6f 6e 65  |g,cat,NL.NL..one|
00000030  2c 74 77 6f 2c 61 67 61  69 6e 0d 0a              |,two,again..|
0000003c

Now I need to read this so that the line endings are CRLF. I set the special variable $/ (input record separator) to whatever I want. Now the bare LF isn't a problem because it's just a part of line 3, and since I don't do anything to the CR, the line endings are still CRLF (which you probably won't see here):

$ perl -ne 'print qq($. $_) } BEGIN { $/ = qq(\xd\xa) ' badinput.txt
1 one,two,three
2 uno,dos,tress
3 dog,cat,NL
NL
4 one,two,again

Next I replace every LF that is not at the end of the record (so, only interior ones). This uses a negative lookahead to check for the absolute end of string: (?!\z), but other sorts of patterns will work (such as zdim's answer):

$ perl -ne 's/\xa(?!\z)/ /g; print qq($. $_) } BEGIN { $/ = qq(\xd\xa) ' badinput.txt
1 one,two,three
2 uno,dos,tress
3 dog,cat,NL NL
4 one,two,again

Since the -n is really just wrapping the argument to -e, I can use its starting and ending braces to open and close different things. I close off the implicit while myself and use the leftover implicit closing brace for the BEGIN. No big whoop. But, this trick doesn't work with -p because there is additional implicit code that perl adds.