We are using pandas to read a CSV in our Python Lambda code. We have it set up so that it warns about and skips bad lines instead of erroring out - this is the code for that:
import pandas

dataframe = pandas.read_csv(
    filepath_or_buffer=our_filepath_here,
    error_bad_lines=False,  # skip malformed lines instead of raising
    warn_bad_lines=True,    # emit a warning for each skipped line
)
This is partially working - it outputs the warnings and still processes the file successfully by skipping over the bad lines. However, now I am trying to add a CloudWatch metric filter on this so that we can track how many bad lines we are seeing.
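Concretely, I was going to set up something like this with boto3 (the log group, filter, and metric names here are placeholders, not our real ones):

import boto3

logs = boto3.client("logs")

# Placeholder names - substitute the real log group and namespace.
logs.put_metric_filter(
    logGroupName="/aws/lambda/our-csv-lambda",
    filterName="bad-csv-lines",
    filterPattern='"Skipping line"',
    metricTransformations=[
        {
            "metricName": "BadCsvLines",
            "metricNamespace": "OurApp/CsvIngest",
            "metricValue": "1",  # counts matching log events, not occurrences within one
        }
    ],
)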
I was planning to use this example, which counts log events. However, it looks like that approach would only count the number of log events that match the pattern, and pandas is putting all of the failures into a single event, in a rather ugly way, like this in the Lambda log:
2020-10-15T13:43:23.943-07:00 START RequestId: 14f054bb-aa9e-4a86-be87-fb46087a7b43 Version: $LATEST
2020-10-15T13:43:24.189-07:00 b'Skipping line 7: expected 17 fields, saw 20\nSkipping line 11: expected 17 fields, saw 20\n'
2020-10-15T13:43:24.705-07:00 END RequestId: 14f054bb-aa9e-4a86-be87-fb46087a7b43
Ideally, we'd want these to appear as separate log entries / lines, but if not, is there at least a way to count how many times the pattern appears within a single event if we can't split them up? So far I've had no luck finding anything that supports this. The closest I've found is this example on counting occurrences of a term, but it counts the number of events that contain the term, not how many occurrences exist within a single event.
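If splitting them up is the only way, the best idea I've had so far is to capture stderr around the read_csv call and re-log each warning on its own line. This is an untested sketch, and it assumes the parser writes these warnings straight to the stderr file descriptor (the b'...' prefix in our log suggests it does, which is also why a Python-level redirect_stderr probably wouldn't catch them):

import os
import sys
import tempfile

import pandas

def read_csv_logging_bad_lines(path):
    """read_csv with bad-line warnings re-emitted one per log line."""
    sys.stderr.flush()
    saved_fd = os.dup(2)  # keep the original stderr file descriptor
    with tempfile.TemporaryFile() as captured:
        os.dup2(captured.fileno(), 2)  # route fd 2 into the temp file
        try:
            dataframe = pandas.read_csv(
                filepath_or_buffer=path,
                error_bad_lines=False,
                warn_bad_lines=True,
            )
        finally:
            os.dup2(saved_fd, 2)  # restore stderr
            os.close(saved_fd)
        captured.seek(0)
        warnings = captured.read().decode("utf-8", errors="replace")
    # One print per warning -> one CloudWatch log event per bad line,
    # which a plain metric filter could then count.
    for line in warnings.splitlines():
        if line:
            print(line)
    return dataframe

With each warning on its own log line, the simple metric filter above should count them correctly - but I'd still prefer a metric-filter-side answer if one exists.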