Return strings within parameters sed/grep/awk/gawk

I need help returning all the data in a log file that falls between two specific delimiters. Our logs usually look like the one below:

2018-04-17 03:59:29,243 TRACE [xml] This is just a test.
2018-04-17 13:22:24,230 INFO [properties] I believe this is another test.
2018-04-18 03:48:07,043 ERROR [properties] (Thread-13) UpdateType: more data coming here; ProcessId: 5010
2018-04-17 13:22:24,230 INFO [log] I need to retrieve this string here
and also this one as it is part of the same text
2018-04-17 13:22:24,230 INFO [det] I believe this is another test.

If I grep for "here" I get only the line containing the word, but I actually need to retrieve the whole entry, including its continuation lines. The line breaks are probably contributing to my problem as well:

2018-04-17 13:22:24,230 INFO [log] I need to retrieve this string here
and also this one as it is part of the same text

There could be several occurrences of "here" in the log file. I tried to do this with sed, but I can't find the right way to express the delimiters, which I think should be the whole date.

I'd really appreciate your help with this.

New example after Karakfa's comments

2018-04-17 03:48:07,044 INFO  [passpoint-logger] (Thread-19) ERFG|1.0||ID:414d512049584450414153541541871985165165130312020203aa4b|Thread-19|||2018-04-17 03:48:07|out-1||out-1|
2018-04-17 03:59:29,243 TRACE [xml] (Thread-19) RAW MED XML: <?xml version="1.0" encoding="UTF-8" standalone="yes"?><MED:MED_PMT_Tmp_Notif xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://services.xxx.com/POQ/v01" xmlns:POQ="http://services.xxx.com/POQ/v01" xmlns:MED="http://services.xxx.com/MED/v1.2" version="1.2.3" messageID="15290140135778972043" Updat584ype="PGML" xsi:schemaLocation="http://services.xxx.com/MED/v1.2 MED_PMT_v.1.2.3.xsd">
    <MED_Space xmlns:ns2="http://services.xxx.com/MED/v1.2" xmlns:ns4="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns3="http://services.xxx.com/POQ_Header/v01" status="AVAIL" dest="MQX" aircraftType="DH8" aircraftConfig="120">
        <Space_ID partition="584" orig="ADD3" messageCreate="2018-04-17T03:59:29.202-05:00">
            <Space carrier="584" date="2018-04-18">0108</Space>
        </Space_ID>
        <DepartAndArrive estDep="2018-04-18T18:10:00+03:00" schedDep="2018-04-18T18:10:00+03:00" estArrival="2018-04-18T19:30:00+03:00" schedArrival="2018-04-18T19:30:00+03:00"/>
        <Sched_OandD orig="ADD3" dest="MQX"/>
    </MED_Space>
    <TRX_Record xmlns:ns2="http://services.xxx.com/MED/v1.2" xmlns:ns4="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns3="http://services.xxx.com/POQ_Header/v01">
        <TRX_ID FILCreate="2018-04-17T03:59:00-05:00" resID="1">TFRSVL</TRX_ID>
        <Space>
            <Inds revenue="1"/>
            <Identification nameID="1" dHS_ID="TFRSVL001" gender="X">
                <Name_First>SMITH MR</Name_First>
                <Name_Last>P584ER</Name_Last>
                <TT tier="0"/>
            </Identification>
                <TRXType>F</TRXType>
            <SRiuyx>0</SRiuyx>
            <GroupRes>1</GroupRes>
            <SystemInstances inventory="H">Y</SystemInstances>
            <OandD_FIL orig="ADD3" dest="MQX"/>
            <Store="584">0108</Store>
            <CodingSpec="584">0108</CodingSpec>
        </Space>
    </TRX_Record>
        <ns2:TRX_Count xmlns:ns2="http://services.xxx.com/MED/v1.2" xmlns:ns4="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns3="http://services.xxx.com/POQ_Header/v01">1</ns2:TRX_Count>
    <ns2:Transaction_D584ails xmlns:ns2="http://services.xxx.com/MED/v1.2" xmlns:ns4="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns3="http://services.xxx.com/POQ_Header/v01" sourceID="TPF">
        <Client_Entry_Info authRSX="54" agx="S4" code="ADD3">RESTORE AMEND:NEW-FIL/AFAX-UPDATED</Client_Entry_Info>
    </ns2:Transaction_D584ails>
</MED:MED_PMT_Tmp_Notif>
2018-04-17 03:59:29,244 INFO  [properties] (Thread-19) Updat584ype: PGML ; ProcessId: ##MISSING##

The following command does not return the whole record for this example:

awk -v RS='(^|\n)[0-9 :,-]+' '/TFRSVL/{print rs,$0} {rs=RT}' file


Answer by karakfa (best answer)

With GNU awk you can use a multi-character (regexp) record separator:

$ awk -v RS='(^|\n)[0-9 :,-]+' '/here/{print rs,$0} {rs=RT}' file

2018-04-18 03:48:07,043  ERROR [properties] (Thread-13) UpdateType: more data coming here; ProcessId: 5010

2018-04-17 13:22:24,230  INFO [log] I need to retrieve this string here
and also this one as it is part of the same text

NB: I cheated here by building the record separator from the set of characters that appear in the timestamp. You can spell out the timestamp format exactly to eliminate false positives, i.e. continuation lines that happen to start with those characters. Alternatively, add the log levels (TRACE, INFO, ERROR) to the match as well.

Answer by Ed Morton

Assuming every record starts with a timestamp, then a string of all upper-case letters, then another string within square brackets:

$ cat tst.awk
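# A line starting with "timestamp LEVEL [type] " begins a new record: flush the previous one.
/^[0-9]{4}(-[0-9]{2}){2} [0-9]{2}(:[0-9]{2}){2},[0-9]{3} [[:upper:]]+ \[[^][]+\] / { prt() }
# Append the current line to the record being accumulated.
{ rec = (rec=="" ? "" : rec ORS) $0 }
# Flush the final record at end of input.
END { prt() }

# Print the accumulated record if it matches the regexp passed with -v, then reset it.
function prt() {
    if (rec ~ regexp) {
        print rec
        print "----"
    }
    rec = ""
}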

$ awk -v regexp='here' -f tst.awk file
2018-04-18 03:48:07,043 ERROR [properties] (Thread-13) UpdateType: more data coming here; ProcessId: 5010
----
2018-04-17 13:22:24,230 INFO [log] I need to retrieve this string here
and also this one as it is part of the same text
----

You can change the starting regexp to something else if that's not restrictive enough, e.g. if the text within a record can contain a line that starts with a string matching that same regexp (though I don't know how you'd actually deal with that, given what you've shown us so far).
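One adjustment the question's second example seems to need: it pads the level with two spaces (`INFO  [passpoint-logger]`), which a regexp requiring exactly one space before the bracket would not accept as a record start. Here is a sketch of a relaxed starting regexp, assuming levels are upper-case ASCII and the padding is spaces only. The regexp is built as a string without interval expressions so that it should also run on awks that lack them; padded.log and the variable names d2 and start are made up for illustration.

```shell
# Hypothetical sample with both "INFO  [" (two spaces) and "TRACE [" (one space).
cat > padded.log <<'EOF'
2018-04-17 03:48:07,044 INFO  [passpoint-logger] (Thread-19) first record
2018-04-17 03:59:29,243 TRACE [xml] (Thread-19) RAW MED XML: <a>
    <TRX_ID>TFRSVL</TRX_ID>
</a>
2018-04-17 03:59:29,244 INFO  [properties] (Thread-19) last record
EOF

out=$(awk -v regexp='TFRSVL' '
BEGIN {
    d2 = "[0-9][0-9]"
    # timestamp, 1+ spaces, LEVEL, 1+ spaces, [type], space
    start = "^" d2 d2 "-" d2 "-" d2 " " d2 ":" d2 ":" d2 "," d2 "[0-9] +[A-Z]+ +\\[[^]]+\\] "
}
$0 ~ start { prt() }                      # new record begins: flush the previous one
{ rec = (rec == "" ? "" : rec ORS) $0 }   # accumulate lines of the current record
END { prt() }                             # flush the last record

function prt() {
    if (rec ~ regexp) {
        print rec
        print "----"
    }
    rec = ""
}
' padded.log)

printf '%s\n' "$out"
```

With the padded lines recognised as record starts, the output should be just the three-line TRACE record followed by the `----` separator, without the trailing INFO line glued onto it.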

Also, think about what this is doing:

$ cat tst.awk
/^[0-9]{4}(-[0-9]{2}){2} [0-9]{2}(:[0-9]{2}){2},[0-9]{3} [[:upper:]]+ \[[^][]+\] / { prt() }
{ rec = (rec=="" ? "" : rec ORS) $0 }
END { prt() }

function prt(   flds,recDate,recTime,recPrio,recType,recText) {
    split(rec,flds)
    recDate = flds[1]
    recTime = flds[2]
    recPrio = flds[3]
    recType = flds[4]
    gsub(/[][]/,"",recType)
    recText = rec
    sub(/([^[:space:]]+ ){4}/,"",recText)
    gsub(/[[:space:]]+/," ",recText)

    if (NR > 1) {
        if ( date=="" || date==recDate ) {
            printf "date = <%s>\n", recDate
            printf "time = <%s>\n", recTime
            printf "prio = <%s>\n", recPrio
            printf "type = <%s>\n", recType
            printf "text = <%s>\n", recText
            print "----"
        }
    }
    rec = ""
}

.

$ awk -v date='2018-04-18' -f tst.awk file
date = <2018-04-18>
time = <03:48:07,043>
prio = <ERROR>
type = <properties>
text = <(Thread-13) UpdateType: more data coming here; ProcessId: 5010>
----

.

$ awk -f tst.awk file
date = <2018-04-17>
time = <03:59:29,243>
prio = <TRACE>
type = <xml>
text = <This is just a test.>
----
date = <2018-04-17>
time = <13:22:24,230>
prio = <INFO>
type = <properties>
text = <I believe this is another test.>
----
date = <2018-04-18>
time = <03:48:07,043>
prio = <ERROR>
type = <properties>
text = <(Thread-13) UpdateType: more data coming here; ProcessId: 5010>
----
date = <2018-04-17>
time = <13:22:24,230>
prio = <INFO>
type = <log>
text = <I need to retrieve this string here and also this one as it is part of the same text>
----
date = <2018-04-17>
time = <13:22:24,230>
prio = <INFO>
type = <det>
text = <I believe this is another test.>
----

Then imagine how easily you could build precise queries on specific fields of your log records with that approach, generate CSVs for import into Excel, etc.
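As a rough illustration of the CSV idea, here is a sketch that reuses the same record-accumulating structure and emits one quoted CSV row per record. It assumes the five-field layout shown above (date, time, level, type, text). Every field is quoted because the time itself contains a comma. The helper q(), the file app.log and the column order are my own choices for illustration, not something taken from the scripts above.

```shell
# Hypothetical sample based on the log lines in the question.
cat > app.log <<'EOF'
2018-04-17 03:59:29,243 TRACE [xml] This is just a test.
2018-04-17 13:22:24,230 INFO [log] I need to retrieve this string here
and also this one as it is part of the same text
2018-04-18 03:48:07,043 ERROR [properties] (Thread-13) UpdateType: more data coming here; ProcessId: 5010
EOF

csv=$(awk '
BEGIN {
    d2 = "[0-9][0-9]"
    start = "^" d2 d2 "-" d2 "-" d2 " " d2 ":" d2 ":" d2 "," d2 "[0-9] +[A-Z]+ +\\[[^]]+\\] "
}
$0 ~ start { prt() }
{ rec = (rec == "" ? "" : rec ORS) $0 }
END { prt() }

# Quote a field for CSV: double any embedded quotes, then wrap in quotes.
function q(s) { gsub(/"/, "\"\"", s); return "\"" s "\"" }

function prt(   flds, type, text, i) {
    if (rec == "") return
    split(rec, flds)
    type = flds[4]
    gsub(/\[|\]/, "", type)                  # strip the square brackets
    text = rec
    for (i = 1; i <= 4; i++)                 # drop date, time, level, type
        sub(/^[^ \t\n]+[ \t\n]+/, "", text)
    gsub(/[ \t\n]+/, " ", text)              # fold continuation lines into one
    print q(flds[1]) "," q(flds[2]) "," q(flds[3]) "," q(type) "," q(text)
    rec = ""
}
' app.log)

printf '%s\n' "$csv"
```

Each log record, including the two-line one, should come out as a single CSV row that a spreadsheet can import directly.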