Returning lines from a JSON log which match some content when reading from a CSV?


Using a Python script, I want to check whether the IP and timestamp values from my CSV file appear in the line entries of a JSON log file, and if so, write that JSON log entry to another file. I'd like to make it generic so it works for any IP address. Here's what the sample CSV file looks like:

"clientip",""destip","dest_hostname","timestamp"
 "127.0.0.1","0.0.0.0","randomhost","2023-09-09T04:18:22.542Z"

A sample line entry from the JSON log file:

{"log": "09-Sept-2023 rate-limit: info: client @xyz 127.0.0.1, "stream":"stderr", "time": 2023-09-09T04:18:22.542Z"}

It's the lines from the JSON log file that I want written to the output.txt file when there's a match. The JSON file doesn't have the same fields and organization as the CSV (clientip, destip, dest_hostname, timestamp), but I was hoping I could still return the JSON log lines that match on clientip (like 127.0.0.1 in "info: client @xyz 127.0.0.1" above), and maybe on the timestamp as well.
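Concretely, the kind of matching I mean could be sketched like this, treating each log entry as plain text and checking whether a CSV row's clientip and timestamp both appear as substrings (the inline strings below stand in for the real CSV and log files, which would be read with open() instead):

```python
import csv
import io

# Inline stand-ins for the CSV file and the JSON log file
csv_text = ('"clientip","destip","dest_hostname","timestamp"\n'
            '"127.0.0.1","0.0.0.0","randomhost","2023-09-09T04:18:22.542Z"\n')
log_text = ('{"log": "09-Sept-2023 rate-limit: info: client @xyz 127.0.0.1", '
            '"stream": "stderr", "time": "2023-09-09T04:18:22.542Z"}\n')

rows = list(csv.DictReader(io.StringIO(csv_text)))

# A log line matches if it contains both the clientip and the timestamp
# of some CSV row as plain substrings -- no JSON parsing needed
matches = [line for line in io.StringIO(log_text)
           if any(r["clientip"] in line and r["timestamp"] in line
                  for r in rows)]

for line in matches:
    print(line, end="")  # or write the matching lines to output.txt
```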

I tried shell previously but could not get any matches. The command join file.csv xyz-json.log > output.txt didn't yield anything, and neither did awk with an NR==FNR condition.

That's why I'm trying to get this done in Python now. I'm new to Python as well, but this is roughly what I had in mind, as a sketch:

import csv

for csv_row in csv_rows:
    for log_line in json_log_lines:
        if values_match(csv_row, log_line):
            print(log_line)

I would appreciate any help/assistance with this!


There are 2 best solutions below


One possibility is to read both the CSV and the JSON files into dataframes, extract the IP address from the JSON log's text with a regular expression, then do an inner merge on IP and time, and output the rows that remain after the merge:

import pandas as pd

dfc = pd.read_csv('test.csv')
dfj = pd.read_json('test.jsonl', lines=True)
# Pull the first IPv4-looking token out of each log message
dfj['ip'] = dfj['log'].str.extract(r'(\d+(?:\.\d+){3})')
res = dfj.merge(dfc, left_on=['ip', 'time'],
                right_on=['clientip', 'timestamp'], how='inner')
res[['log', 'stream', 'time']].to_json('result.jsonl', orient='records', lines=True)

Here is some code structure to get you started. I had to take some liberties with the CSV and JSON you provided, as they don't appear to be well formed (as already mentioned in the comments), so I coerced them into something usable.

import io
import csv
import json

csv_str = '"clientip","destip","dest_hostname","timestamp"\n' + \
    '"127.0.0.1","0.0.0.0","randomhost","2023-09-09T04:18:22.542Z"'
json_str = '[{"log": "09-Sept-2023", "rate-limit": "somethingelse?", ' + \
    '"info": "client @xyz 127.0.0.1", "stream":"stderr", ' + \
    '"time": "2023-09-09T04:18:22.542Z"}]'

# https://stackoverflow.com/a/3305964
# Read the CSV data (this can read from the file directly instead)
csv_data = csv.DictReader(io.StringIO(csv_str))
# Read the JSON data (this can also be read from the file directly instead)
json_data = json.loads(json_str)

# Check each line in the CSV
for csv_line in csv_data:
    # Check each line in the JSON
    for json_line in json_data:
        # Check if this line matches, and act if it did
        if csv_line['timestamp'] == json_line['time']:
            # Do whatever you want with the information
            print(json_line)

If you are able to post better examples of the CSV and JSON, I can update this to match, in case the structures I guessed are inaccurate. Feel free to ask for additional guidance on any part of this.
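For instance, a variant of the loop above that also requires the clientip inside the info text, in addition to the timestamp match (the info and time field names are guesses based on the coerced sample):

```python
import json

def find_matches(csv_rows, json_records):
    """Return JSON records whose 'info' text contains a CSV row's
    clientip and whose 'time' equals that row's timestamp."""
    hits = []
    for rec in json_records:
        for row in csv_rows:
            if (row["clientip"] in rec.get("info", "")
                    and rec.get("time") == row["timestamp"]):
                hits.append(rec)
                break  # one matching CSV row is enough
    return hits

# The same coerced sample data as above
csv_rows = [{"clientip": "127.0.0.1", "destip": "0.0.0.0",
             "dest_hostname": "randomhost",
             "timestamp": "2023-09-09T04:18:22.542Z"}]
json_records = json.loads(
    '[{"log": "09-Sept-2023", "info": "client @xyz 127.0.0.1", '
    '"stream": "stderr", "time": "2023-09-09T04:18:22.542Z"}]')

matched = find_matches(csv_rows, json_records)
```

Each matched record could then be written out with json.dump, one per line, to produce the output file.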