Best technique to transform log file data for analysis using R or Python


I want to convert log files to a format which can be read in R for further analysis.

While looking for a solution I came across regular expressions, RecordBreaker, OpenRefine (formerly Google Refine), and R packages such as stringr and dplyr.

I tried OpenRefine and it seemed useful, but I would still like more guidance, since log files are said to be the real big data.

The data looks like this:

M 8000000 NADR     14273 18:17:43.22 STC35256 00000291  DSNT375I  +HPN2 PLAN=DISTSERV WITH 026
 D                                         026 00000291          CORRELATION-ID=db2jcc_appli
 D                                         026 00000291          CONNECTION-ID=SERVER
 D                                         026 00000291          LUW-ID=G93FF023.DB11.CDD5C8DE241F=29839
 D                                         026 00000291
 D                                         026 00000291  THREAD-INFO=SAPHPNDB:9.63.240.123:SAPHPNDB:db2jcc_application:DYNAMIC
 D                                         026 00000291  :46835:*:*
 D                                         026 00000291          IS DEADLOCKED WITH PLAN=DISTSERV WITH
 D                                         026 00000291          CORRELATION-ID=db2jcc_appli
 D                                         026 00000291          CONNECTION-ID=SERVER
 D                                         026 00000291          LUW-ID=G93FF07C.EE5F.CDD5C82B2305=29799
 D                                         026 00000291
 D                                         026 00000291  THREAD-INFO=SAPHPNDB:9.63.240.33:SAPHPNDB:db2jcc_application:DYNAMIC:
 D                                         026 00000291  46835:*:*
 E                                         026 00000291          ON MEMBER HPN2
............................................................................

The underlying structure is as follows:

  1. Each record starts with an M line and ends with an E line.

  2. The D lines are the variables that give more information about a single record. In the first record shown in the log text above, which starts with M and ends with E, the D lines in between provide information such as the correlation ID, the connection ID, etc.

So the log excerpt above should become one row in a data table, with the D's as the variables.
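To illustrate the structure described above, here is a minimal Python sketch that groups raw lines into records, one list of lines per M...E block. The function name `split_records` and the rule "the first whitespace-separated token is the marker" are assumptions based only on the excerpt shown; a real log may need a stricter test.

```python
def split_records(lines):
    """Yield one list of lines per record (an M line through its E line)."""
    current = None
    for line in lines:
        tokens = line.split(None, 1)
        marker = tokens[0] if tokens else ""
        if marker == "M":
            # An M line opens a new record
            current = [line]
        elif current is not None and marker in ("D", "E"):
            current.append(line)
            if marker == "E":
                # An E line closes the record
                yield current
                current = None
        # Anything else (blank lines, separator rows of dots) is ignored


sample = [
    "M 8000000 NADR     14273 18:17:43.22 STC35256 00000291  DSNT375I  +HPN2 PLAN=DISTSERV WITH 026",
    " D    026 00000291          CORRELATION-ID=db2jcc_appli",
    " D    026 00000291          CONNECTION-ID=SERVER",
    " E    026 00000291          ON MEMBER HPN2",
    "............................................",
]
records = list(split_records(sample))
print(len(records), "record(s);", len(records[0]), "lines in the first")
```

Each yielded list can then be turned into a single row, which is the "one record, one row" shape asked for here.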

  [1]: https://i.stack.imgur.com/hw9zY.png

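Possible solution:

# Read the log, one element per line
data <- readLines("data1.txt")

# Pattern for the header (M) line; the dot in the time stamp is escaped so it matches a literal "."
pattern <- "(M\\s+\\d+\\s+)(\\w+\\s+)(\\d+\\s+)(\\d+:\\d+:\\d+\\.\\d+\\s+)(\\w+\\s+)(\\d+\\s+)(\\w+\\s+)(\\+\\w+\\s+\\w+(\\=|\\s+)\\w+\\s+\\w+\\s+\\d+)"

# Element 1 of each match is the full match; elements 2:9 are the eight outer groups
matches <- regmatches(data, regexec(pattern, data))
parts <- do.call(rbind, lapply(matches, `[`, 2:9))
colnames(parts) <- c("ID1","ID2","Date","Time","ID3","ID4","ID5","description")
parts <- as.data.frame(parts)

# Lines that do not match (the D and E lines) come out as all-NA rows; drop them
parts1 <- na.omit(parts)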

There is 1 best solution below

LauriK On

Well, you could do it one log row at a time. Pseudocode would be something like this:

IF logrow.record == 'D' AND logrow.type == 'CORRELATION' THEN
  current.record$correlation = logrow.value
ELSE IF logrow.record == 'E' THEN
  all.records[n+1] = current.record
ELSE IF logrow.record == 'M' THEN
  current.record = empty new record
  current.record$ID = logrow.value
END

Basically, if it's M, you start a new record. If it's E, you end the current one. And if it's D, you add data to the current record based on the other information present.
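A runnable Python sketch of that row-by-row approach might look like the following. It is only an illustration under assumptions: the function name `parse_log`, the header column names (borrowed from the question's `colnames`), and the `KEY=VALUE` pattern are all guesses from the one excerpt shown. Because a deadlock record names two threads, keys such as CORRELATION-ID occur twice, so repeated keys get a numeric suffix instead of being overwritten.

```python
import re

# KEY=VALUE tokens such as CORRELATION-ID=db2jcc_appli.
# The allowed key characters are an assumption based on the sample.
KV = re.compile(r"(?<!\S)([A-Z][A-Z0-9-]*)=(\S+)")

# Hypothetical names for the M-line tokens, mirroring the question's columns
HEADER = ("id1", "id2", "date", "time", "id3", "id4", "id5")


def parse_log(lines):
    """Return one dict per M...E record."""
    records, current = [], None
    for line in lines:
        fields = line.split()
        marker = fields[0] if fields else ""
        if marker == "M":
            # M starts a new record and supplies the header columns
            current = dict(zip(HEADER, fields[1:8]))
        elif marker not in ("D", "E") or current is None:
            continue  # separator rows, blank lines, stray D/E lines
        # M, D and E lines may all carry KEY=VALUE data
        for key, value in KV.findall(line):
            name, n = key, 1
            while name in current:  # second thread: PLAN_2, LUW-ID_2, ...
                n += 1
                name = f"{key}_{n}"
            current[name] = value
        if marker == "E":
            # E ends the current record
            records.append(current)
            current = None
    return records


sample = [
    "M 8000000 NADR     14273 18:17:43.22 STC35256 00000291  DSNT375I  +HPN2 PLAN=DISTSERV WITH 026",
    " D    026 00000291          CORRELATION-ID=db2jcc_appli",
    " D    026 00000291          LUW-ID=G93FF023.DB11.CDD5C8DE241F=29839",
    " D    026 00000291          IS DEADLOCKED WITH PLAN=DISTSERV WITH",
    " D    026 00000291          CORRELATION-ID=db2jcc_appli",
    " E    026 00000291          ON MEMBER HPN2",
]
rows = parse_log(sample)
print(rows[0])
```

The resulting list of dicts can be written out with `csv.DictWriter` and read into R with `read.csv`, or passed to `pandas.DataFrame`. Note that this sketch does not rejoin values that wrap onto the next line, such as the THREAD-INFO continuation in the sample; that would need an extra rule.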

It's not going to be too easy, but not too hard either. Start with one record, create a good number of intermediate variables, and take one step at a time.