Awk: set RS to include newline and 1st (only) field of next row // logfile "splits" based on custom RS and print matching pattern therein

Question

175 Views Asked by cg792 At 03 January 2024 at 17:14

The short version of the question: what value should RS be set to in awk so that records are split at each line whose n-th field is empty? (If the line were completely empty, i.e. had no Timestamp field as in my examples, then setting RS="\n\n ..." would do.)

The long version: This is what my log file looks like (notice the interleaved sections for **amd64** and **arm64**):

...
2023-12-29T16:05:20.3032116Z 
2023-12-29T16:05:20.3040485Z #10 [linux/arm64 builder 1/8] WORKDIR /app
2023-12-29T16:05:20.4084773Z #10 DONE 0.8s
2023-12-29T16:05:20.4085104Z 
2023-12-29T16:05:20.4085552Z #11 [linux/amd64 builder 1/8] WORKDIR /app
2023-12-29T16:05:20.5499792Z #11 DONE 0.1s
2023-12-29T16:05:20.5505699Z 
2023-12-29T16:05:20.5509862Z #12 [linux/amd64 builder 2/8] RUN apk add --no-cache libc6-compat
2023-12-29T16:05:20.5512029Z #12 0.138 fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/main/x86_64/APKINDEX.tar.gz
2023-12-29T16:05:20.6982466Z #12 ...
2023-12-29T16:05:20.6983744Z
2023-12-29T16:05:21.2474882Z #16 [linux/arm64 runner 2/7] RUN addgroup -S -g 1001 nodejs
2023-12-29T16:05:21.3971789Z #16 ...
2023-12-29T16:05:21.3972318Z 
...

... as can be seen, each section ends with a line that contains nothing except a Timestamp.

The goal is to print the sections (lines) for amd64 and for arm64 separately, e.g. (for amd64):

2023-12-29T16:05:20.4085104Z      <-- ideally be present in output
2023-12-29T16:05:20.4085552Z #11 [linux/amd64 builder 1/8] WORKDIR /app
2023-12-29T16:05:20.5499792Z #11 DONE 0.1s
2023-12-29T16:05:20.5505699Z       <-- ideally be present in output
2023-12-29T16:05:20.5509862Z #12 [linux/amd64 builder 2/8] RUN apk add --no-cache libc6-compat
2023-12-29T16:05:20.5512029Z #12 0.138 fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/main/x86_64/APKINDEX.tar.gz

The ideal solution would:

not necessarily use awk, unless solutions in sed & co. are really overkill and more 'script-like'
be relatively easy to remember and intuitive to replicate for repeated similar use cases
not be too specific, i.e. work for another first (or n-th) field (not necessarily a Timestamp-like formatted field)
not use any extra tools besides the main one (e.g. awk)

The following solution works only partially, and only if the empty lines of the log had no fields at all (e.g. no Timestamp field): awk -vRS='\n\n' -vORS='\n\n' '/amd64 builder/ 1' logfile. As an extra question: why does this solution print the searched keyword (amd64 in my case) twice in the first section of the output, and how can that be corrected? Other (subsequent) sections contain the keyword only once, as expected.

Thanks

LE: I just realized that, without preserving the line that contains only the Timestamp, the output is hard to read, so could @Ed Morton and @markp-fuso adjust your answers a little to preserve that line? Thank you!

There are 4 best solutions below

markp-fuso On 03 January 2024 at 17:54

UPDATE:

OP has stated the output is hard to read if we remove the timestamp-only line
OP has asked to keep the timestamp-only line
based on actual timestamp values (in the sample input) it looks like we need to keep the trailing timestamp-only line
answer has been updated to keep/print the trailing timestamp-only line

One awk idea:

awk -v arch="arm64" '                               # assign awk variable "arch" the name of the chip architecture

function print_block() {                            # use a function to determine if "block" should be printed to stdout
    if (block != "" && block ~ arch)                # if awk variable "block" is not empty and also contains "arch" then ...
       print block ORS $0                           # print current contents of "block" plus the current timestamp-only line
    block = ""                                      # re-init "block"
}

NF <  2 { print_block() }                           # if missing the 2nd field then see if we need to print current contents of "block"
NF >= 2 { block = block (block ? ORS : "") $0 }     # if 2nd field exists then append to "block"; if block is not empty we append with ORS else if block is empty we append with ""
END     { print_block() }                           # print last "block"?
' logfile

NOTE: replace print block ORS $0 with print block ORS "" to print a blank line in place of the timestamp-only line

With -v arch="arm64" this generates:

2023-12-29T16:05:20.3040485Z #10 [linux/arm64 builder 1/8] WORKDIR /app
2023-12-29T16:05:20.4084773Z #10 DONE 0.8s
2023-12-29T16:05:20.4085104Z
2023-12-29T16:05:21.2474882Z #16 [linux/arm64 runner 2/7] RUN addgroup -S -g 1001 nodejs
2023-12-29T16:05:21.3971789Z #16 ...
2023-12-29T16:05:21.3972318Z

With -v arch="amd64" this generates:

2023-12-29T16:05:20.4085552Z #11 [linux/amd64 builder 1/8] WORKDIR /app
2023-12-29T16:05:20.5499792Z #11 DONE 0.1s
2023-12-29T16:05:20.5505699Z
2023-12-29T16:05:20.5509862Z #12 [linux/amd64 builder 2/8] RUN apk add --no-cache libc6-compat
2023-12-29T16:05:20.5512029Z #12 0.138 fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/main/x86_64/APKINDEX.tar.gz
2023-12-29T16:05:20.6982466Z #12 ...
2023-12-29T16:05:20.6983744Z

As for OP's requirement - work for other first (or n-th) field - something like this may work:

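awk -v arch="arm64" -v nth_fld=2 '                  # "nth_fld" is the field whose absence marks the end of a block
function print_block() {                            # same function as in the script above
    if (block != "" && block ~ arch)
       print block ORS $0
    block = ""
}

NF <  nth_fld { print_block() }
NF >= nth_fld { block = block (block ? ORS : "") $0 }
END           { print_block() }
' logfile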

anubhava On 03 January 2024 at 18:38

The already posted answers would work in other awk versions as well. But if you insist on a custom RS based answer, here is a gnu-awk solution:

awk -v tgt='amd64' -vORS='\n\n' -v RS='(^|\n)[0-9]{4}-[0-9T:.-]+Z\n' '$0 ~ "/" tgt' file

2023-12-29T16:05:20.4085552Z #11 [linux/amd64 builder 1/8] WORKDIR /app
2023-12-29T16:05:20.5499792Z #11 DONE 0.1s

2023-12-29T16:05:20.5509862Z #12 [linux/amd64 builder 2/8] RUN apk add --no-cache libc6-compat
2023-12-29T16:05:20.5512029Z #12 0.138 fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/main/x86_64/APKINDEX.tar.gz
2023-12-29T16:05:20.6982466Z #12 ...

and:

awk -v tgt='arm64' -vORS='\n\n' -v RS='(^|\n)[0-9]{4}-[0-9T:.-]+Z\n' '$0 ~ "/" tgt' file

2023-12-29T16:05:20.3040485Z #10 [linux/arm64 builder 1/8] WORKDIR /app
2023-12-29T16:05:20.4084773Z #10 DONE 0.8s

2023-12-29T16:05:21.2474882Z #16 [linux/arm64 runner 2/7] RUN addgroup -S -g 1001 nodejs
2023-12-29T16:05:21.3971789Z #16 ...

Take note of -v RS='(^|\n)[0-9]{4}-[0-9T:.-]+Z\n', which sets the input record separator to: 4 digits at the start of the file or at the start of a line, followed by a hyphen, then 1 or more of the characters in the character class, ending with the letter Z and then a line break.

This breaks the input into the blocks of text that appear between timestamp-only lines; finally, $0 ~ "/" tgt prints only those records that match the tgt command-line variable.
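Before relying on such a pattern as RS, it may help to check which lines it would actually treat as separators. The following is only a sketch under assumptions: the three-line sample file is hypothetical, the names f, strict and tolerant are illustrative, and [0-9]{4} is spelled out as four bracket expressions so the check also runs in awk versions without interval expressions.

```shell
#!/bin/sh
# Hypothetical sample: line 1 is timestamp-only, line 3 is timestamp-only
# but carries a trailing space after the Z.
f=$(mktemp)
printf '2023-12-29T16:05:20.3032116Z\n' > "$f"
printf '2023-12-29T16:05:20.3040485Z #10 DONE 0.8s\n' >> "$f"
printf '2023-12-29T16:05:20.4085104Z \n' >> "$f"

# Same shape as the separator pattern, anchored to a whole line.
strict=$(awk '/^[0-9][0-9][0-9][0-9]-[0-9T:.-]+Z$/ { n++ } END { print n + 0 }' "$f")

# Variant that also accepts spaces or tabs between the Z and the line end.
tolerant=$(awk '/^[0-9][0-9][0-9][0-9]-[0-9T:.-]+Z[ \t]*$/ { n++ } END { print n + 0 }' "$f")

echo "strict=$strict tolerant=$tolerant"
rm -f "$f"
```

If the two counts differ on a real log, the timestamp-only lines probably carry trailing whitespace, and a separator that requires Z immediately before the line break would skip them.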

Based on the comment below, if RS should instead match any line that has just a first field:

awk -v tgt='arm64' -vORS='\n\n' -v RS='(^|\n)\\S+\n' '$0 ~ "/" tgt' file

Daweo On 03 January 2024 at 19:12

To which value to set RS in awk to split records based on each line whose n-th field is empty?

If you are okay with assuming a whitespace-separated file, you might use the following RS to treat 1-column-wide lines as separators: "\n[^[:space:]]+\n". Consider a simplified example; let the content of file.txt be

1900 zero
1901 one
1903
1905 five
1907 seven

then

awk 'BEGIN{RS="\n[^[:space:]]+\n"}/zero/' file.txt

gives output

1900 zero
1901 one

If you want to preserve that line, you might exploit the RT built-in variable as follows

awk 'BEGIN{RS="\n[^[:space:]]+\n"}/zero/{printf("%s%s",$0,RT)}' file.txt

gives output

1900 zero
1901 one
1903

Explanation: The regular expression is a newline (\n), followed by one or more (+) non (^) whitespace ([:space:]) characters, followed by a newline (\n). RT holds the record terminator of the current record, i.e. the text that matched RS; this is useful if you have an RS which might match different strings. printf provides

more precise control over the output format than what is provided by print

in this case I use it as I do not want the trailing newline appended by print. %s is a control letter which instructs GNU AWK to treat what it gets as a string; observe that the number of control letters must be exactly equal to the number of arguments supplied, in this case 2.

(tested in GNU Awk 5.1.0)
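To see what RS and RT do per record, a small sketch along these lines might help. It assumes GNU awk is installed as gawk (regex RS and RT are GNU extensions) and recreates the simplified file.txt content in a hypothetical temporary file; the variable names f and out are illustrative only.

```shell
#!/bin/sh
# Hypothetical copy of the simplified file.txt content from above.
f=$(mktemp)
printf '%s\n' '1900 zero' '1901 one' '1903' '1905 five' '1907 seven' > "$f"

# RT is GNU awk specific, so call gawk by name and skip when it is missing.
if command -v gawk >/dev/null 2>&1; then
    # For each record: record number, field count, length of the matched separator.
    out=$(gawk 'BEGIN { RS = "\n[^[:space:]]+\n" }
                { print NR, NF, length(RT) }' "$f")
    printf '%s\n' "$out"
else
    out=""
    echo "gawk not found; skipping" >&2
fi
rm -f "$f"
```

With GNU awk this should report two records: the first ends at the separator made of a newline, 1903 and a newline (six characters in RT), while the last record reaches end of file without a separator, so its RT is empty. Note that a one-column line at the very start of a file has no preceding newline, so this pattern would not treat it as a separator.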

Ed Morton (Accepted Answer) On 03 January 2024 at 17:28

$ awk -v tgt='amd64' 'NF<2{f=""; next} !f{f=($3 ~ ("/"tgt"$"))} f' file
2023-12-29T16:05:20.4085552Z #11 [linux/amd64 builder 1/8] WORKDIR /app
2023-12-29T16:05:20.5499792Z #11 DONE 0.1s
2023-12-29T16:05:20.5509862Z #12 [linux/amd64 builder 2/8] RUN apk add --no-cache libc6-compat
2023-12-29T16:05:20.5512029Z #12 0.138 fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/main/x86_64/APKINDEX.tar.gz
2023-12-29T16:05:20.6982466Z #12 ...

$ awk -v tgt='arm64' 'NF<2{f=""; next} !f{f=($3 ~ ("/"tgt"$"))} f' file
2023-12-29T16:05:20.3040485Z #10 [linux/arm64 builder 1/8] WORKDIR /app
2023-12-29T16:05:20.4084773Z #10 DONE 0.8s
2023-12-29T16:05:21.2474882Z #16 [linux/arm64 runner 2/7] RUN addgroup -S -g 1001 nodejs
2023-12-29T16:05:21.3971789Z #16 ...

NF<2{f=""; next} clears the flag f when there's only a timestamp on the line.
!f{f=($3 ~ ("/"tgt"$"))} sets f to 1 (if tgt is present) or 0 (otherwise) when each line that looks like #11 [linux/amd64 builder 1/8] is read.
f causes the current line to be printed when f is 1.
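The question's later edit asks to keep the timestamp-only line that closes each block. A minimal sketch of how the flag approach might be adapted for that, using only POSIX awk features; the sample file is a hypothetical excerpt of the question's log, and the names log and out are illustrative.

```shell
#!/bin/sh
# Hypothetical excerpt of the log from the question.
log=$(mktemp)
cat > "$log" <<'EOF'
2023-12-29T16:05:20.3032116Z
2023-12-29T16:05:20.3040485Z #10 [linux/arm64 builder 1/8] WORKDIR /app
2023-12-29T16:05:20.4084773Z #10 DONE 0.8s
2023-12-29T16:05:20.4085104Z
2023-12-29T16:05:20.4085552Z #11 [linux/amd64 builder 1/8] WORKDIR /app
2023-12-29T16:05:20.5499792Z #11 DONE 0.1s
2023-12-29T16:05:20.5505699Z
EOF

out=$(awk -v tgt='amd64' '
NF < 2 {                          # timestamp-only line: the current block ends here
    if (f) print                  # keep it only when the block was being printed
    f = 0                         # clear the flag for the next block
    next
}
!f { f = ($3 ~ ("/" tgt "$")) }   # while the flag is unset, test field 3 for "/tgt" at its end
f                                 # print every line while the flag is set
' "$log")

printf '%s\n' "$out"
rm -f "$log"
```

On this sample the output is the two amd64 lines followed by the 2023-12-29T16:05:20.5505699Z line; changing tgt to arm64 selects the other block instead.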

I don't know why you thought setting RS to \n\n would work for you; it fails because your input contains no empty lines, so that setting is unrelated to your problem.
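One way to check whether a blank-line separator can match at all is to count the lines that have no fields versus the lines that have exactly one. This is a sketch on a hypothetical excerpt of the question's log; the names log and counts are illustrative.

```shell
#!/bin/sh
# Hypothetical excerpt of the log from the question.
log=$(mktemp)
cat > "$log" <<'EOF'
2023-12-29T16:05:20.3032116Z
2023-12-29T16:05:20.3040485Z #10 [linux/arm64 builder 1/8] WORKDIR /app
2023-12-29T16:05:20.4084773Z #10 DONE 0.8s
2023-12-29T16:05:20.4085104Z
2023-12-29T16:05:20.4085552Z #11 [linux/amd64 builder 1/8] WORKDIR /app
2023-12-29T16:05:20.5499792Z #11 DONE 0.1s
2023-12-29T16:05:20.5505699Z
EOF

# First number: lines with no fields (blank); second: lines with exactly one field.
counts=$(awk 'NF == 0 { blank++ } NF == 1 { single++ } END { print blank + 0, single + 0 }' "$log")
echo "blank single: $counts"
rm -f "$log"
```

Here the blank count is 0 and the single-field count is 3: every separator line still carries a timestamp, so two newlines are never adjacent and RS='\n\n' has nothing to split on.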

Given your comments, it sounds like this is what you're looking for (using GNU awk for multi-char RS, RT, and \S/\s):

$ awk -v RS='\n\\S+\\s*\n' -v ORS= '/amd64/{print $0 RT}' file
2023-12-29T16:05:20.4085552Z #11 [linux/amd64 builder 1/8] WORKDIR /app
2023-12-29T16:05:20.5499792Z #11 DONE 0.1s
2023-12-29T16:05:20.5505699Z
2023-12-29T16:05:20.5509862Z #12 [linux/amd64 builder 2/8] RUN apk add --no-cache libc6-compat
2023-12-29T16:05:20.5512029Z #12 0.138 fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/main/x86_64/APKINDEX.tar.gz
2023-12-29T16:05:20.6982466Z #12 ...
2023-12-29T16:05:20.6983744Z

$ awk -v RS='\n\\S+\\s*\n' -v ORS= '/arm64/{print $0 RT}' file
2023-12-29T16:05:20.3032116Z
2023-12-29T16:05:20.3040485Z #10 [linux/arm64 builder 1/8] WORKDIR /app
2023-12-29T16:05:20.4084773Z #10 DONE 0.8s
2023-12-29T16:05:20.4085104Z
2023-12-29T16:05:21.2474882Z #16 [linux/arm64 runner 2/7] RUN addgroup -S -g 1001 nodejs
2023-12-29T16:05:21.3971789Z #16 ...
2023-12-29T16:05:21.3972318Z
