awk, sort and uniq combination does not count properly

This is my file.log:

{"log":"2023-08-20 10:06:59 [DNSHandler:ProxyResolver] Request: [86.104.44.22:63761] (udp) / 'a1834.dscg2.akamai.NET.' (A)\n","stream":"stdout","time":"2023-08-20T10:07:02.868627873Z"}
{"log":"2023-08-20 10:06:59 [DNSHandler:ProxyResolver] Request: [86.104.44.22:62682] (udp) / 'th.bing.com.' (A)\n","stream":"stdout","time":"2023-08-20T10:07:02.868688906Z"}
{"log":"2023-08-20 10:06:59 [DNSHandler:ProxyResolver] Request: [86.104.44.22:62248] (udp) / 'th.bing.com.' (HTTPS)\n","stream":"stdout","time":"2023-08-20T10:07:02.868705921Z"}
{"log":"2023-08-20 10:06:59 [DNSHandler:ProxyResolver] Request: [86.104.44.22:62719] (udp) / 'a1834.dscg2.akamai.NET.' (HTTPS)\n","stream":"stdout","time":"2023-08-20T10:07:02.868721225Z"}
{"log":"2023-08-20 10:06:59 [DNSHandler:ProxyResolver] Reply: [86.104.44.22:62682] (udp) / 'th.bing.com.' (A) / RRs: CNAME,CNAME,CNAME,A,A,A,A,A,A,A,A,A\n","stream":"stdout","time":"2023-08-20T10:07:02.868739879Z"}
{"log":"2023-08-20 10:06:59 [DNSHandler:ProxyResolver] Reply: [86.104.44.22:62836] (udp) / 'www.bing.com.' (HTTPS) / RRs: CNAME,CNAME,CNAME\n","stream":"stdout","time":"2023-08-20T10:07:02.868760977Z"}
{"log":"2023-08-20 10:06:59 [DNSHandler:ProxyResolver] Reply: [86.104.44.22:63761] (udp) / 'a1834.dscg2.akamai.NET.' (A) / RRs: A,A\n","stream":"stdout","time":"2023-08-20T10:07:02.868810466Z"}
{"log":"2023-08-20 10:06:59 [DNSHandler:ProxyResolver] Reply: [86.104.44.22:62719] (udp) / 'a1834.dscg2.akamai.NET.' (HTTPS) / RRs: \n","stream":"stdout","time":"2023-08-20T10:07:02.868831775Z"}
{"log":"2023-08-20 10:06:59 [DNSHandler:ProxyResolver] Reply: [86.104.44.22:62248] (udp) / 'th.bing.com.' (HTTPS) / RRs: CNAME,CNAME,CNAME\n","stream":"stdout","time":"2023-08-20T10:07:02.868896564Z"}
{"log":"2023-08-20 10:07:00 [DNSHandler:ProxyResolver] Request: [86.104.44.22:63587] (udp) / 'hlb.apr-52dd2-0.edgecastdns.NET.' (A)\n","stream":"stdout","time":"2023-08-20T10:07:02.868912596Z"}
{"log":"2023-08-20 10:07:00 [DNSHandler:ProxyResolver] Reply: [86.104.44.22:63587] (udp) / 'hlb.apr-52dd2-0.edgecastdns.NET.' (A) / RRs: CNAME,A\n","stream":"stdout","time":"2023-08-20T10:07:02.868926565Z"}
{"log":"2023-08-20 10:07:00 [DNSHandler:ProxyResolver] Request: [86.104.44.22:63487] (udp) / 'www.bing.com.' (HTTPS)\n","stream":"stdout","time":"2023-08-20T10:07:02.868940116Z"}
{"log":"2023-08-20 10:07:00 [DNSHandler:ProxyResolver] Reply: [86.104.44.22:63487] (udp) / 'www.bing.com.' (HTTPS) / RRs: CNAME,CNAME\n","stream":"stdout","time":"2023-08-20T10:07:02.868953663Z"}
{"log":"2023-08-20 10:07:00 [DNSHandler:ProxyResolver] Request: [84.241.34.239:53210] (udp) / 'yunqos.gamesafe.qq.com.' (A)\n","stream":"stdout","time":"2023-08-20T10:07:02.868967065Z"}
{"log":"2023-08-20 10:07:00 [DNSHandler:ProxyResolver] Request: [86.104.44.22:63412] (udp) / 'th.bing.com.' (HTTPS)\n","stream":"stdout","time":"2023-08-20T10:07:02.868980656Z"}
{"log":"2023-08-20 10:07:00 [DNSHandler:ProxyResolver] Reply: [86.104.44.22:63412] (udp) / 'th.bing.com.' (HTTPS) / RRs: CNAME,CNAME,CNAME\n","stream":"stdout","time":"2023-08-20T10:07:02.86899756Z"}
{"log":"2023-08-20 10:07:00 [DNSHandler:ProxyResolver] Request: [86.104.44.22:62452] (udp) / 'onedscolprdwus08.westus.cloudapp.azure.com.' (A)\n","stream":"stdout","time":"2023-08-20T10:07:02.869020234Z"}
{"log":"2023-08-20 10:07:00 [DNSHandler:ProxyResolver] Reply: [86.104.44.22:62452] (udp) / 'onedscolprdwus08.westus.cloudapp.azure.com.' (A) / RRs: A\n","stream":"stdout","time":"2023-08-20T10:07:02.86904008Z"}
{"log":"2023-08-20 10:07:00 [DNSHandler:ProxyResolver] Request: [86.104.44.22:64304] (udp) / 'www.bing.com.' (HTTPS)\n","stream":"stdout","time":"2023-08-20T10:07:02.869059471Z"}
{"log":"2023-08-20 10:07:00 [DNSHandler:ProxyResolver] Reply: [86.104.44.22:64304] (udp) / 'www.bing.com.' (HTTPS) / RRs: CNAME,CNAME,CNAME\n","stream":"stdout","time":"2023-08-20T10:07:02.86907888Z"}
{"log":"2023-08-20 10:07:00 [DNSHandler:ProxyResolver] Request: [84.241.34.239:49270] (udp) / 'update.eset.com.' (A)\n","stream":"stdout","time":"2023-08-20T10:07:02.869098025Z"}
{"log":"2023-08-20 10:07:00 [DNSHandler:ProxyResolver] Reply: [84.241.34.239:49270] (udp) / 'update.eset.com.' (A) / RRs: CNAME,A\n","stream":"stdout","time":"2023-08-20T10:07:02.869118703Z"}
{"log":"2023-08-20 10:07:00 [DNSHandler:ProxyResolver] Reply: [84.241.34.239:53210] (udp) / 'yunqos.gamesafe.qq.com.' (A) / RRs: A,A,A,A,A,A,A,A,A,A\n","stream":"stdout","time":"2023-08-20T10:07:02.869133298Z"}
{"log":"2023-08-20 10:07:00 [DNSHandler:ProxyResolver] Request: [94.182.110.194:51644] (udp) / 'raja-bot.utravs.com.' (A)\n","stream":"stdout","time":"2023-08-20T10:07:02.869147079Z"}
{"log":"2023-08-20 10:07:00 [DNSHandler:ProxyResolver] Reply: [94.182.110.194:51644] (udp) / 'raja-bot.utravs.com.' (A) / NXDOMAIN\n","stream":"stdout","time":"2023-08-20T10:07:02.869225953Z"}
{"log":"2023-08-20 10:07:00 [DNSHandler:ProxyResolver] Request: [84.241.34.239:56795] (udp) / 'dns.msftncsi.com.' (A)\n","stream":"stdout","time":"2023-08-20T10:07:02.869242518Z"}
{"log":"2023-08-20 10:07:00 [DNSHandler:ProxyResolver] Reply: [84.241.34.239:56795] (udp) / 'dns.msftncsi.com.' (A) / RRs: A\n","stream":"stdout","time":"2023-08-20T10:07:02.86925624Z"}
{"log":"2023-08-20 10:07:00 [DNSHandler:ProxyResolver] Request: [86.104.44.22:62881] (udp) / 'wns.notify.trafficmanager.NET.' (A)\n","stream":"stdout","time":"2023-08-20T10:07:02.869269544Z"}
{"log":"2023-08-20 10:07:00 [DNSHandler:ProxyResolver] Reply: [86.104.44.22:62881] (udp) / 'wns.notify.trafficmanager.NET.' (A) / RRs: A\n","stream":"stdout","time":"2023-08-20T10:07:02.869283006Z"}
{"log":"2023-08-20 10:07:00 [DNSHandler:ProxyResolver] Request: [84.241.34.239:61310] (udp) / 'upd.es-eset.com.' (A)\n","stream":"stdout","time":"2023-08-20T10:07:02.869323226Z"}

I'm trying to count how often each domain name (together with its IP) occurs, using the following command:

awk '/DNSHandler:ProxyResolver\] Request|Reply/ {print $5, $8}' file.log | sort -k2 | uniq -c | sort -r

This is the current output:

      2 [94.182.110.194:51644] 'raja-bot.utravs.com.'
      2 [86.104.44.22:64304] 'www.bing.com.'
      2 [86.104.44.22:63761] 'a1834.dscg2.akamai.NET.'
      2 [86.104.44.22:63587] 'hlb.apr-52dd2-0.edgecastdns.NET.'
      2 [86.104.44.22:63487] 'www.bing.com.'
      2 [86.104.44.22:63412] 'th.bing.com.'
      2 [86.104.44.22:62881] 'wns.notify.trafficmanager.NET.'
      1 [86.104.44.22:62836] 'www.bing.com.'
      2 [86.104.44.22:62719] 'a1834.dscg2.akamai.NET.'
      2 [86.104.44.22:62682] 'th.bing.com.'
      2 [86.104.44.22:62452] 'onedscolprdwus08.westus.cloudapp.azure.com.'
      2 [86.104.44.22:62248] 'th.bing.com.'
      1 [84.241.34.239:61310] 'upd.es-eset.com.'
      2 [84.241.34.239:56795] 'dns.msftncsi.com.'
      2 [84.241.34.239:53210] 'yunqos.gamesafe.qq.com.'
      2 [84.241.34.239:49270] 'update.eset.com.'

As you can see, www.bing.com, for example, appears 2 times in Request lines and 3 times in Reply lines, but the counts seem to be split by the IP:Port column.

The expected output:

      6 [86.104.44.22:63412] 'th.bing.com.'
      5 [86.104.44.22:64304] 'www.bing.com.'
      4 [86.104.44.22:63761] 'a1834.dscg2.akamai.NET.'
      2 [94.182.110.194:51644] 'raja-bot.utravs.com.'
      2 [86.104.44.22:63587] 'hlb.apr-52dd2-0.edgecastdns.NET.'
      2 [86.104.44.22:62881] 'wns.notify.trafficmanager.NET.'
      2 [86.104.44.22:62452] 'onedscolprdwus08.westus.cloudapp.azure.com.'
      2 [84.241.34.239:56795] 'dns.msftncsi.com.'
      2 [84.241.34.239:53210] 'yunqos.gamesafe.qq.com.'
      2 [84.241.34.239:49270] 'update.eset.com.'
      1 [84.241.34.239:61310] 'upd.es-eset.com.'

It seems that because the IP:Port (specifically the port) differs between these lines, each one is counted as a different line.

I want to count them based only on the second column, which is the domain name.

There are 3 best solutions below

0
Saeed On

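This is how I solved the problem. The -f1 option makes uniq skip the first field (IP:Port) when comparing lines, so lines are grouped by domain name only: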

awk '/DNSHandler:ProxyResolver\] Request|Reply/ {print $5, $8}' file.log | sort -k2 | uniq -c -f1 | sort -k1r

Output:

      6 [86.104.44.22:62248] 'th.bing.com.'
      5 [86.104.44.22:62836] 'www.bing.com.'
      4 [86.104.44.22:62719] 'a1834.dscg2.akamai.NET.'
      2 [94.182.110.194:51644] 'raja-bot.utravs.com.'
      2 [86.104.44.22:63587] 'hlb.apr-52dd2-0.edgecastdns.NET.'
      2 [86.104.44.22:62881] 'wns.notify.trafficmanager.NET.'
      2 [86.104.44.22:62452] 'onedscolprdwus08.westus.cloudapp.azure.com.'
      2 [84.241.34.239:56795] 'dns.msftncsi.com.'
      2 [84.241.34.239:53210] 'yunqos.gamesafe.qq.com.'
      2 [84.241.34.239:49270] 'update.eset.com.'
      1 [84.241.34.239:61310] 'upd.es-eset.com.'
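
The difference between the two uniq calls can be sketched on a few made-up lines. The `sample` function and its addresses are hypothetical, not taken from file.log, and `sort -rn` is used here as one way to order the result numerically by count:

```shell
# Hypothetical two-column input: "[ip:port] 'domain'", already sorted by domain.
sample() {
  printf '%s\n' \
    "[10.0.0.1:1111] 'a.example.'" \
    "[10.0.0.1:2222] 'a.example.'" \
    "[10.0.0.1:3333] 'b.example.'"
}

# Whole-line comparison: the ports differ, so no two lines are merged.
whole_line=$(sample | uniq -c)

# Skip field 1 (IP:Port) when comparing: adjacent lines are grouped by domain.
# uniq prints the first line of each group, so that line's port is the one shown.
by_domain=$(sample | uniq -c -f1 | sort -rn)

printf '%s\n---\n%s\n' "$whole_line" "$by_domain"
```

Because uniq only merges adjacent lines, the input still has to be sorted by the compared field first, which is what `sort -k2` does in the pipeline above.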
2
pmf On

Note that the input is JSON (one object per line), so you might want to use a JSON processor to parse it and extract the parts needed. Here's an example using jq:

jq -sr 'map(.log/" " | [.[4,7]])
  |  group_by(last)  | sort_by(-length)[]
  | [length, last[]] | @tsv
' file.log

Output:

6   [86.104.44.22:63412]    'th.bing.com.'
5   [86.104.44.22:64304]    'www.bing.com.'
4   [86.104.44.22:62719]    'a1834.dscg2.akamai.NET.'
2   [84.241.34.239:56795]   'dns.msftncsi.com.'
2   [86.104.44.22:63587]    'hlb.apr-52dd2-0.edgecastdns.NET.'
2   [86.104.44.22:62452]    'onedscolprdwus08.westus.cloudapp.azure.com.'
2   [94.182.110.194:51644]  'raja-bot.utravs.com.'
2   [84.241.34.239:49270]   'update.eset.com.'
2   [86.104.44.22:62881]    'wns.notify.trafficmanager.NET.'
2   [84.241.34.239:53210]   'yunqos.gamesafe.qq.com.'
1   [84.241.34.239:61310]   'upd.es-eset.com.'
3
markp-fuso On

Assumptions/Understandings:

  • accumulate counts based on a combination of the ip + domain
  • a unique ip + domain pair may have multiple ports; we just need to display one of the ports; which port we display is unimportant [?? if this is the case then why display any ports ??]

Pulling everything but a sort call into awk:

awk '
BEGIN { regex = "DNSHandler:ProxyResolver]$" }

$3 ~ regex && \
$4 ~ /Reply:|Request:/ { split($5,a,"[:[]")         # a[2] == ip address
                         counts[a[2]][$8]++         # counts[ip][domain]
                         ports[a[2]]= $5            # store latest port; will overwrite previous port
                       }
END                    { for (ip in counts)
                             for (domain in counts[ip])
                                 print counts[ip][domain],ports[ip],domain
                       }
' file.log | sort -k1r

NOTES:

  • requires GNU awk for multidimensional arrays (aka arrays of arrays)
  • with GNU awk the sorting can be performed within the awk script (i.e., the sort -k1r call could be eliminated), but in this case the sort -k1r is a bit more straightforward
  • I've copied sort -k1r from OP's answer; I'm assuming OP has confirmed this is sufficient to generate the desired result

This generates:

6 [86.104.44.22:62881] 'th.bing.com.'
5 [86.104.44.22:62881] 'www.bing.com.'
4 [86.104.44.22:62881] 'a1834.dscg2.akamai.NET.'
2 [94.182.110.194:51644] 'raja-bot.utravs.com.'
2 [86.104.44.22:62881] 'wns.notify.trafficmanager.NET.'
2 [86.104.44.22:62881] 'onedscolprdwus08.westus.cloudapp.azure.com.'
2 [86.104.44.22:62881] 'hlb.apr-52dd2-0.edgecastdns.NET.'
2 [84.241.34.239:61310] 'yunqos.gamesafe.qq.com.'
2 [84.241.34.239:61310] 'update.eset.com.'
2 [84.241.34.239:61310] 'dns.msftncsi.com.'
1 [84.241.34.239:61310] 'upd.es-eset.com.'
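
Where GNU awk is not available, the same counting can be sketched with a composite string key instead of arrays of arrays. This is a sketch under assumptions, not the script above: the function name `count_pairs` is made up, the IP is taken as everything before the first colon (so IPv4 only), and the latest port is remembered per ip + domain pair rather than per ip:

```shell
# POSIX awk sketch: count per (ip, domain) pair using a composite key.
# count_pairs is a hypothetical helper name; pass the log file as its argument.
count_pairs() {
  awk '
    $3 ~ /DNSHandler:ProxyResolver\]$/ && $4 ~ /^(Reply|Request):$/ {
      split($5, a, ":")               # a[1] == "[ip", a[2] == "port]"
      ip  = substr(a[1], 2)           # drop the leading "["
      key = ip SUBSEP $8              # composite key: ip + domain
      counts[key]++
      port[key] = $5                  # latest [ip:port] seen for this pair
    }
    END {
      for (key in counts) {
        split(key, part, SUBSEP)      # part[1] == ip, part[2] == domain
        print counts[key], port[key], part[2]
      }
    }
  ' "$@" | sort -k1,1nr               # numeric, descending, on the count only
}
```

Called as `count_pairs file.log`. The sort key is restricted to the first field and compared numerically because the counts printed by awk are not padded: `sort -k1r` compares them as text, so a count of 10 would be listed below a count of 2.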