How do I split this text into multiline records?

#!/usr/bin/bash

mailing_list="Jane Doe
123 Main Street
Anywhere, SE 12345-6789

John Smith
456 Tree-lined Avenue
Smallville, MW 98765-4321



Amir Faquer
C. de la Lusitania 98
08206 Sabadell
        
Amir Faquer w spaces before
C. de la Lusitania 98
08206 Sabadel
    
      
      
      
Wife w spaces before
C. de la Lusitania 98
08206 Sabadell
"
echo "$mailing_list"|awk -v RS='' -v FS='\n' '/.*/ 
END {print "The number of records is "NR"."}'

echo "$mailing_list"|awk -v RS='\n\n+' -v FS='\n' '/.*/ 
END {print "The number of records is "NR"."}'

echo "$mailing_list"|awk -v RS='\n *\n+' -v FS='\n' '/.*/ 
END {print "The number of records is "NR"."}'


How do I split this mailing list into multi-line records when the separator is not just empty lines, as with RS='\n\n+'? The last command in my script reports that the number of records is seven, which is not correct - there are just five records. I also want blank lines that contain arbitrary amounts of whitespace to act as RS. How might I accomplish that?

Ed Morton (best answer)

You can put any awk into "paragraph mode" by setting RS to null. In that mode awk will treat any sequence of 1 or more empty lines as the record separator:

$ printf '  foo\n\tbar\n\n    etc\n'
  foo
        bar

    etc
$ printf '  foo\n\tbar\n\n    etc\n' |
    awk -v RS= '{print NR, "<"$0">"}'
1 <  foo
        bar>
2 <    etc>

I'm including white space at the start of lines to ensure that the solution proposed won't treat them as part of the RS.
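
As a rough sketch of paragraph mode on address-style data (the sample input and the `out` variable are made up for illustration, not taken from the question): in paragraph mode a newline always acts as a field separator in addition to whatever FS is, so setting FS to a newline makes each line of a record one field.

```shell
# Paragraph mode: RS is the empty string, so runs of empty lines separate records.
# FS='\n' makes every line of a record its own field.
out=$(printf 'Jane Doe\n123 Main Street\n\n\nJohn Smith\n456 Tree-lined Avenue\n' |
    awk -v RS= -v FS='\n' '{ print NR ": " $1 " (" NF " lines)" }')
printf '%s\n' "$out"
```

With that sample input this should report two records of two lines each, however many empty lines sit between them.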

That doesn't do everything you want, though. You also want lines that contain only white space to be treated as part of the record separator, and paragraph mode won't do that:

$ printf '  foo\n\tbar\n    \n    etc\n' |
    awk -v RS= '{print NR, "<"$0">"}'
1 <  foo
        bar

    etc>

To include lines of all white space in the RS you need to write a regexp to do that. POSIX only defines RS as a single character (or the empty string for paragraph mode) and leaves the behavior of a multi-char RS unspecified, but GNU awk and a couple of others treat a multi-char RS as a regexp, and in GNU awk \s can be used as shorthand for [[:space:]]:

$ printf '  foo\n\tbar\n    \n    etc\n' |
    awk -v RS='\n((\\s*\n)|$)' '{print NR, "<"$0">"}'
1 <  foo
        bar>
2 <    etc>

The |$ is necessary so that the newline at the end of the file doesn't become part of the last record:

$ printf '  foo\n\tbar\n    \n    etc\n' |
    awk -v RS='\n(\\s*\n)' '{print NR, "<"$0">"}'
1 <  foo
        bar>
2 <    etc
>

Note that you don't need to double the \ in \n as it's one of the specific escape sequences defined at https://www.gnu.org/software/gawk/manual/gawk.html#Escape-Sequences. You do need to double the \ in \s and in anything else not defined in that link, because when the string is converted to a regexp to be used as the record separator, one layer of backslashes gets consumed.
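
A minimal way to see that one layer of backslashes being consumed, using only POSIX features (the `positions` variable and the "a.b" test string are illustrative assumptions): the string "\\." reaches the regexp engine as \. and matches a literal dot, while "." stays a regexp dot that matches any character.

```shell
# match() returns the 1-based position of the first match.
# "."   -> regexp .  (any character)  -> matches at position 1
# "\\." -> regexp \. (a literal dot)  -> matches at position 2
positions=$(awk 'BEGIN { print match("a.b", "."), match("a.b", "\\.") }')
printf '%s\n' "$positions"   # prints: 1 2
```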

That doesn't completely solve the problem though as you may need to consider blank lines at the start or end of your input.

$ printf '\n  foo\n\tbar\n    \n    etc\n\n' |
    awk -v RS='\n(\\s*\n)' '{print NR, "<"$0">"}'
1 <
  foo
        bar>
2 <    etc>

The blank line at the end is being ignored which is correct behavior since multiple blank lines are your RS, but the blank line at the start should be considered as either:

  1. the end of a preceding record which is empty, or
  2. something to be ignored (as is done in "paragraph mode").

If you want "1" above then:

$ printf '\n  foo\n\tbar\n    \n    etc\n\n' |
    awk -v RS='(^|\n)((\\s*\n)|$)' '{print NR, "<"$0">"}'
1 <>
2 <  foo
        bar>
3 <    etc>

but if you want "2" above (to emulate what "paragraph mode" does) then:

$ printf '\n  foo\n\tbar\n    \n    etc\n\n' |
    awk -v RS='(^|\n)((\\s*\n)|$)' '(NR==1) && /^\s*$/{NR--; next} {print NR, "<"$0">"}'
1 <  foo
        bar>
2 <    etc>

and if you want the same behavior with a POSIX awk then:

$ printf '\n  foo\n\tbar\n    \n    etc\n\n' |
    awk '
        /^[[:space:]]*$/ { $0=rec; if ($0 != "") print ++nr, "<"$0">"; rec=""; next }
        { rec = ( rec == "" ? "" : rec ORS ) $0 }
        END { $0=rec; if ($0 != "") print ++nr, "<"$0">" }
    '
1 <  foo
        bar>
2 <    etc>

There may be cases I haven't thought through above, e.g. no input or all-blank input. If the above doesn't do what you want for those cases, they are left as an exercise :-).
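
Another portable option, offered only as a sketch and not as part of the scripts above, is to blank out whitespace-only lines with sed first and then let paragraph mode do the splitting. The `out` variable is a name chosen for illustration; the input is the same test string used above.

```shell
# Step 1: sed turns lines that contain only white space into truly empty lines.
# Step 2: paragraph mode (RS set to the empty string) treats runs of empty lines
#         as the separator and ignores leading and trailing empty lines.
out=$(printf '\n  foo\n\tbar\n    \n    etc\n\n' |
    sed 's/^[[:space:]]*$//' |
    awk -v RS= '{ print NR, "<" $0 ">" } END { print "records: " NR }')
printf '%s\n' "$out"
```

The trade-off is an extra process in the pipeline, in exchange for not relying on a regexp RS.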

Patrick Janser

I'm new to awk but was interested in discovering the tool. I had a go and I think I managed to get what you want:

#!/usr/bin/bash

mailing_list="Jane Doe
123 Main Street
Anywhere, SE 12345-6789

John Smith
456 Tree-lined Avenue
Smallville, MW 98765-4321



Amir Faquer
C. de la Lusitania 98
08206 Sabadell

Amir Faquer w spaces before
C. de la Lusitania 98
08206 Sabadel




Wife w spaces before
C. de la Lusitania 98
08206 Sabadell
"

echo "$mailing_list" | awk '
BEGIN {
        # Record separator regex.
        RS = "(\\n\\s*){2,}"
        # Field separator regex.
        FS = "\\n"
}
{
        print "Record n°" NR
        print "----------"
        print "Name   : " $1
        print "Street : " $2
        print "City   : " $3
        print ""
}
END {
        print ""
        print "The number of records is " NR "."
}'

This will output the following:

Record n°1
----------
Name   : Jane Doe
Street : 123 Main Street
City   : Anywhere, SE 12345-6789

Record n°2
----------
Name   : John Smith
Street : 456 Tree-lined Avenue
City   : Smallville, MW 98765-4321

Record n°3
----------
Name   : Amir Faquer
Street : C. de la Lusitania 98
City   : 08206 Sabadell

Record n°4
----------
Name   : Amir Faquer w spaces before
Street : C. de la Lusitania 98
City   : 08206 Sabadel

Record n°5
----------
Name   : Wife w spaces before
Street : C. de la Lusitania 98
City   : 08206 Sabadell


The number of records is 5.

The problem seems to be mainly in the way the regex for RS is written. If it's a double-quoted string then you have to remember that "\n" means the newline character itself, whereas "\\n" hands the regex engine a "\" followed by an "n". For \n both forms end up matching a newline, so "\\n\\n+" and "\n\n+" behave the same, but for escapes that awk strings don't define, such as \s, the backslash has to be doubled.

Then, to handle optional lines with spaces, it is better to use the regex pattern \s, as it also matches tabs. So to match 2 or more newlines, each optionally followed by white space, you can use (\n\s*){2,}, which has to be written "(\\n\\s*){2,}" in a double-quoted string.

ufopilot

$ awk  '
     /^ *$/{print; next}  {i+=1; if(i%3==1) print "record: ", (i+2)/3; print}
' <<<"$mailing_list"
record:  1
Jane Doe
123 Main Street
Anywhere, SE 12345-6789

record:  2
John Smith
456 Tree-lined Avenue
Smallville, MW 98765-4321



record:  3
Amir Faquer
C. de la Lusitania 98
08206 Sabadell

record:  4
Amir Faquer w spaces before
C. de la Lusitania 98
08206 Sabadel




record:  5
Wife w spaces before
C. de la Lusitania 98
08206 Sabadell
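
The script above relies on every record having exactly three lines. A sketch that drops that assumption starts a new record whenever a non-blank line follows a blank one. The `in_rec` and `n` variables and the shortened sample input are hypothetical names and data chosen for illustration.

```shell
# NF == 0 is true for empty lines and for lines holding only spaces or tabs.
out=$(printf 'Jane Doe\n123 Main Street\n   \n\nJohn Smith\n456 Tree-lined Avenue\nSmallville, MW 98765-4321\n' |
    awk '
        NF == 0 { in_rec = 0; print; next }             # separator line: close the current record
        !in_rec { in_rec = 1; print "record: ", ++n }   # first line of a new record
        { print }
    ')
printf '%s\n' "$out"
```

Blank lines are still echoed as they are, so the output keeps the same layout as the listing above.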