Remove any CRLF from file that does not have a SPACE before CRLF in Linux


I have an output file on a Linux file system, and I need to strip every CRLF that is not preceded by a space. All valid lines in the file have many spaces preceding the CRLF; the CRLFs that I want to remove have no space in front of them. I have tried many methods with tr, sed, and awk, with no luck.

I should mention that the file is a fixed-length file, if that helps.

I'm open to any of these tools: tr, sed, or awk.

I have attempted many different commands this evening, with varying results, but none of them solves my issue.

echo -n $(tr -d "?<-c\x20\r\n" < file) > output.txt
echo -n $(tr -d "?<\x20\r\n" < file) > output.txt;

(tr -d "\r\n" < file.txt | fold -w 2538;echo) | sed 's/$/\r/' > output.txt

awk 'length($0) != 2536 {sub(/\r$/,""); printf "%s", $0; next} {print}' file.txt > output.txt

jsbueno On

The line below does it using Python (in place of sed or awk). Python is less used for this kind of thing because it has no shortcuts for reading from and writing to stdin/stdout: the one-liner needs a full function call to open("input.txt", "rb"), plus the corresponding call for writing, and is otherwise more verbose. As a proper Python program this would be maybe 4 lines long:

python3 -c "open('output.txt', 'wb').writelines(line.replace(b'\r\n', b'') if len(line)>2 and line[line.find(b'\r\n') - 1] != 32 else line for line in open('input.txt', 'rb') )"

Alternatively, to use pipes and rely on the automatic text decoding/encoding, use this version:

cat input.txt | python3 -c "import sys;[print(line.replace('\r\n', '') if len(line) > 2 and line[line.find('\r\n') - 1] != ' ' else line, end='') for line in sys.stdin]" > output.txt

(The code size is about the same either way, so unless you want to replace "cat input.txt" with some other input source or pipe the result somewhere else, the first version is just as short and avoids shell pitfalls.)
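The "proper Python program" mentioned above might look like the following sketch. The names join_broken_lines and main are hypothetical, and it assumes binary mode is acceptable so that no decoding is involved. Unlike the one-liners, it also drops a bare CRLF on an otherwise empty line, since no space precedes it.

```python
import sys


def join_broken_lines(lines):
    """Yield each line, dropping a trailing CRLF unless a space precedes it."""
    for line in lines:
        if line.endswith(b"\r\n") and not line.endswith(b" \r\n"):
            yield line[:-2]  # stray CRLF: glue this line to the next one
        else:
            yield line       # valid record, or a line without CRLF: keep as-is


def main(src, dst):
    # Binary mode keeps the \r bytes intact on every platform.
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        fout.writelines(join_broken_lines(fin))


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Saved under a name of your choice (say fix_crlf.py, also hypothetical), it would be run as python3 fix_crlf.py input.txt output.txt.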

Daweo On

Remove any CRLF from file that does not have a SPACE before CRLF in Linux

This can be done using a regular expression with a negative lookbehind (a kind of zero-length assertion). If your Linux machine has Perl installed, you could use it in the following way:

echo -e "I have space \r\nNoSpace\r\nJustNewline" > file1.txt
perl -p -0777 -e 's/(?<! )\r\n//g' file1.txt > file2.txt
cat --show-all file2.txt

gives output

I have space ^M$
NoSpaceJustNewline$

Explanation: -p -e engages sed-like mode, -0777 engages slurp mode (treats the whole file as a single giant line), \r\n is CRLF in Perl, (?<! ) is a negative lookbehind, and s/ and /g have the same meaning as in GNU sed, so the whole command means: replace every CRLF not preceded by a space with the empty string (delete it).

(tested in perl 5, version 34, subversion 0)
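For comparison, the same negative lookbehind can be sketched with Python's re module, which also supports (?<! ). This is only an illustration under the assumption that the whole file fits in memory as bytes; strip_stray_crlf and STRAY_CRLF are hypothetical names, and the sample data mirrors the echo -e test file above.

```python
import re

# CRLF that is NOT immediately preceded by a space (negative lookbehind).
STRAY_CRLF = re.compile(rb"(?<! )\r\n")


def strip_stray_crlf(data: bytes) -> bytes:
    """Delete every CRLF that has no space directly in front of it."""
    return STRAY_CRLF.sub(b"", data)


# Same content that `echo -e ... > file1.txt` produces above.
sample = b"I have space \r\nNoSpace\r\nJustNewline\n"
print(strip_stray_crlf(sample))
# prints b'I have space \r\nNoSpaceJustNewline\n'
```

To apply it to a file, the bytes could be read with open(path, "rb").read() and the result written back in "wb" mode.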

However, if you are strictly limited to TR | SED | AWK, then:

  • tr is unable to do this task alone: it only looks at the current character, not at anything before or after it.
  • GNU sed is tricky when you want to make newline-related changes. If your file will never contain a NUL character you can exploit the -z option, but there is no negative lookbehind support. However, if you are okay with deleting every CRLF preceded by any character other than a space, you can do it with sed -z 's/\([^ ]\)\r\n/\1/g' file1.txt > file2.txt; the output is the same as above for that particular case (tested in GNU sed 4.8).
  • GNU AWK can be coaxed into treating the whole file as a single record by setting RS to a regular expression that never matches anything in your file. It does not support negative lookbehind, but you can get the desired behavior by using string functions in the following way: awk 'BEGIN{RS="\000";FS="\r\n";ORS=""}{for(i=1;i<=(NF-1);i+=1){print $i (substr($i,length($i))==" "?"\r\n":"")};print $NF}' file1.txt > file2.txt assuming the NUL character does not appear in your file; the output is the same as above (tested in GNU Awk 5.1.0).
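The difference between a true lookbehind and the "any character other than space" capture group only shows up in edge cases. The sketch below illustrates it with Python's re module rather than sed itself; with_lookbehind and with_capture are hypothetical names, and the capture version is assumed to behave like the sed -z substitution for these inputs.

```python
import re


def with_lookbehind(data: bytes) -> bytes:
    # Zero-width check: nothing before the CRLF is consumed.
    return re.sub(rb"(?<! )\r\n", b"", data)


def with_capture(data: bytes) -> bytes:
    # Mirrors sed's \([^ ]\)\r\n -> \1: one real character must precede CRLF.
    return re.sub(rb"([^ ])\r\n", rb"\1", data)


# Two CRLFs in a row: the capture version keeps the second one, because the
# character in front of it was already consumed by the first match.
print(with_lookbehind(b"a\r\n\r\nb"))  # prints b'ab'
print(with_capture(b"a\r\n\r\nb"))     # prints b'a\r\nb'

# CRLF at the very start: there is no preceding character to capture.
print(with_lookbehind(b"\r\nx"))       # prints b'x'
print(with_capture(b"\r\nx"))          # prints b'\r\nx'
```

For input where every stray CRLF follows a non-space character and is not doubled, the two approaches give the same result.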
Ed Morton On

This may be what you're looking for, using any awk:

awk '{ sub(/\r$/,""); ORS=(/ $/ ? "\r\n" : "") } 1' file

but without sample input/output in the question it's obviously an untested guess.

The above assumes that:

  1. All input lines end with CRLF.
  2. When you say "SPACE" in your question you mean a blank character as you'd get when you hit the space bar on your keyboard.
  3. You're on a platform where the underlying C primitives don't strip CRs before awk sees them.
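Assumption 1 can be checked before running any of the commands. The following sketch counts how each line of the input ends; classify_line_endings and the category labels are hypothetical, and the file is assumed to be small enough to read into memory.

```python
from collections import Counter


def classify_line_endings(data: bytes) -> Counter:
    """Count lines by the way they end."""
    counts = Counter()
    for line in data.splitlines(keepends=True):
        if line.endswith(b" \r\n"):
            counts["space+CRLF"] += 1   # valid record terminator
        elif line.endswith(b"\r\n"):
            counts["CRLF"] += 1         # stray CRLF, candidate for removal
        elif line.endswith(b"\n"):
            counts["LF"] += 1           # bare LF, violates assumption 1
        else:
            counts["other"] += 1        # lone CR or no terminator at all
    return counts


print(classify_line_endings(b"a \r\nb\r\nc\nd"))
```

A nonzero "LF" or "other" count (beyond a final unterminated line) would suggest that the input does not match the assumptions listed above.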
Walter A On

With GNU sed and option -z:

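sed -rz 's/(^|[^ ])\r\n/\1/g' file > output.txt

I used (^|[^ ]) and not just ([^ ]) for matching anything except a space, because I want to delete an "empty" first line (only \r\n) as well. The alternation has to sit outside the brackets: [^|^ ] would be a bracket expression that excludes the literal characters |, ^, and space, and it would not match at the start of the file.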
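How the brackets are placed matters here. Inside a bracket expression, | and a non-leading ^ are literal characters, so [^|^ ] means "any character except |, ^, or space", whereas (^|[^ ]) means "start of input, or any character except space". The sketch below shows the difference using Python's re module, on the assumption that its handling of these two patterns matches sed's extended regular expressions; both function names are hypothetical.

```python
import re


def bracket_only(data: bytes) -> bytes:
    # [^|^ ] is a negated class of three literal characters: '|', '^', ' '.
    return re.sub(rb"([^|^ ])\r\n", rb"\1", data)


def start_or_nonspace(data: bytes) -> bytes:
    # (^|[^ ]) is an alternation: start of input OR one non-space character.
    return re.sub(rb"(^|[^ ])\r\n", rb"\1", data)


data = b"\r\nfoo \r\nbar\r\nbaz"
print(bracket_only(data))       # prints b'\r\nfoo \r\nbarbaz'
print(start_or_nonspace(data))  # prints b'foo \r\nbarbaz'

# A CRLF right after '|' is left alone by the bracket-only form.
print(bracket_only(b"x|\r\ny"))       # prints b'x|\r\ny'
print(start_or_nonspace(b"x|\r\ny"))  # prints b'x|y'
```

Only the alternation form removes the leading bare CRLF, and only it treats | and ^ like any other non-space character.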