I want to convert log files into a format that can be read into R for further analysis.
Things I came across while looking for a solution: regex, RecordBreaker, OpenRefine (formerly Google Refine), and R packages such as stringr and dplyr.
I tried OpenRefine and it seemed useful, but I would still like more guidance, since log files are said to be the real big data.
The data looks like this:
M 8000000 NADR 14273 18:17:43.22 STC35256 00000291 DSNT375I +HPN2 PLAN=DISTSERV WITH 026
D 026 00000291 CORRELATION-ID=db2jcc_appli
D 026 00000291 CONNECTION-ID=SERVER
D 026 00000291 LUW-ID=G93FF023.DB11.CDD5C8DE241F=29839
D 026 00000291
D 026 00000291 THREAD-INFO=SAPHPNDB:9.63.240.123:SAPHPNDB:db2jcc_application:DYNAMIC
D 026 00000291 :46835:*:*
D 026 00000291 IS DEADLOCKED WITH PLAN=DISTSERV WITH
D 026 00000291 CORRELATION-ID=db2jcc_appli
D 026 00000291 CONNECTION-ID=SERVER
D 026 00000291 LUW-ID=G93FF07C.EE5F.CDD5C82B2305=29799
D 026 00000291
D 026 00000291 THREAD-INFO=SAPHPNDB:9.63.240.33:SAPHPNDB:db2jcc_application:DYNAMIC:
D 026 00000291 46835:*:*
E 026 00000291 ON MEMBER HPN2
............................................................................
The underlying structure is as follows:
Each record starts with an M line and ends with an E line.
The D lines are the variables that give more information about a single record. So the first record in the log text above starts with M, ends with E, and in between the D lines provide information such as the correlation ID, connection ID, etc.
So the log excerpt above should become one row in a data table, with the D lines as the variables.
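To make the M/D/E structure concrete, here is a minimal sketch in base R of how the lines could be grouped into records. It assumes that the first character of each line is the line type and that every record begins with an M line. The sample is the record above, abbreviated and simply repeated twice so that the split is visible; `record_lines`, `record_id` and `records` are names made up for this illustration.

```r
# Abbreviated copy of the sample record (M header, two D lines, E trailer).
record_lines <- c(
  "M 8000000 NADR 14273 18:17:43.22 STC35256 00000291 DSNT375I +HPN2 PLAN=DISTSERV WITH 026",
  "D 026 00000291 CORRELATION-ID=db2jcc_appli",
  "D 026 00000291 CONNECTION-ID=SERVER",
  "E 026 00000291 ON MEMBER HPN2"
)

# Repeat the same record twice, purely to show the grouping.
log_lines <- rep(record_lines, times = 2)

# Assumption: the first character identifies the line type (M, D or E).
line_type <- substr(log_lines, 1, 1)

# Every M line starts a new record, so a running count of M lines
# gives each line the number of the record it belongs to.
record_id <- cumsum(line_type == "M")

# One list element per record, each holding that record's raw lines.
records <- split(log_lines, record_id)

print(length(records))
print(table(record_id, line_type))
```

Each element of `records` then holds the lines that should end up as one row of the final table.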
[1]: https://i.stack.imgur.com/hw9zY.png
Possible solution:
data <- readLines("data1.txt")
# Pattern for the M (header) line; the dot in the time field is escaped so it only matches a literal "."
pattern <- "(M\\s+\\d+\\s+)(\\w+\\s+)(\\d+\\s+)(\\d+:\\d+:\\d+\\.\\d+\\s+)(\\w+\\s+)(\\d+\\s+)(\\w+\\s+)(\\+\\w+\\s+\\w+(\\=|\\s+)\\w+\\s+\\w+\\s+\\d+)"
m <- regexec(pattern, data)
matches <- regmatches(data, m)
# Element 1 is the full match, elements 2-9 are the eight fields; lines that do not match yield a row of NAs
parts <- do.call(rbind, lapply(matches, `[`, 2:9))
colnames(parts) <- c("ID1","ID2","Date","Time","ID3","ID4","ID5","description")
parts <- as.data.frame(parts)
# Drop the NA rows left behind by the D and E lines
parts1 <- na.omit(parts)
Well, you could do it one log line at a time. In pseudocode terms, the logic would be something like this:
if the line starts with M, you start a new record. If it starts with E, you end the current one. And if it starts with D, you add data to the current record based on the other information present on that line.
It's not going to be too easy, but not too hard either. Start with one record, create a good number of intermediate variables, and take one step at a time.
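As a rough illustration of that line-by-line approach, here is a sketch in base R. It is not a finished parser, and it rests on assumptions read off the single sample record: D lines carry `KEY=VALUE` payloads after two numeric fields, a payload without spaces and without a leading `KEY=` continues the previous value (as with THREAD-INFO), and repeated keys are numbered `_1`, `_2` in order of appearance. Hyphens in keys are turned into underscores to give easy column names, and free-text D lines such as "IS DEADLOCKED WITH ..." are skipped. `parse_records` and the `header` and `trailer` columns are names invented here.

```r
# Sample record copied from the log excerpt in the question.
log_lines <- c(
  "M 8000000 NADR 14273 18:17:43.22 STC35256 00000291 DSNT375I +HPN2 PLAN=DISTSERV WITH 026",
  "D 026 00000291 CORRELATION-ID=db2jcc_appli",
  "D 026 00000291 CONNECTION-ID=SERVER",
  "D 026 00000291 LUW-ID=G93FF023.DB11.CDD5C8DE241F=29839",
  "D 026 00000291",
  "D 026 00000291 THREAD-INFO=SAPHPNDB:9.63.240.123:SAPHPNDB:db2jcc_application:DYNAMIC",
  "D 026 00000291 :46835:*:*",
  "D 026 00000291 IS DEADLOCKED WITH PLAN=DISTSERV WITH",
  "D 026 00000291 CORRELATION-ID=db2jcc_appli",
  "D 026 00000291 CONNECTION-ID=SERVER",
  "D 026 00000291 LUW-ID=G93FF07C.EE5F.CDD5C82B2305=29799",
  "D 026 00000291",
  "D 026 00000291 THREAD-INFO=SAPHPNDB:9.63.240.33:SAPHPNDB:db2jcc_application:DYNAMIC:",
  "D 026 00000291 46835:*:*",
  "E 026 00000291 ON MEMBER HPN2"
)

# Hypothetical helper: walk the lines once and build one row per M..E record.
parse_records <- function(lines) {
  records  <- list()
  current  <- NULL  # named character vector for the record being built
  last_key <- NULL  # most recent KEY, so continuation lines can be appended
  for (line in lines) {
    type <- substr(line, 1, 1)
    if (type == "M") {
      # M: start a new record and keep the header text as one field
      current  <- c(header = sub("^M\\s+", "", line))
      last_key <- NULL
    } else if (type == "D" && !is.null(current)) {
      # D: strip the line type and the two numeric fields
      payload <- trimws(sub("^D\\s+\\d+\\s+\\d+\\s*", "", line))
      if (grepl("^[A-Z-]+=", payload)) {
        key   <- gsub("-", "_", sub("=.*$", "", payload))
        value <- sub("^[^=]*=", "", payload)
        # number repeated keys: CORRELATION_ID_1, CORRELATION_ID_2, ...
        n <- sum(sub("_\\d+$", "", names(current)) == key) + 1
        last_key <- paste0(key, "_", n)
        current[last_key] <- value
      } else if (!is.null(last_key) && nzchar(payload) && !grepl("\\s", payload)) {
        # assumed continuation line: glue it onto the previous value
        current[last_key] <- paste0(current[last_key], payload)
      }
    } else if (type == "E" && !is.null(current)) {
      # E: close the record and store it
      current["trailer"] <- trimws(sub("^E\\s+\\d+\\s+\\d+\\s*", "", line))
      records[[length(records) + 1]] <- current
      current <- NULL
    }
  }
  if (length(records) == 0) return(data.frame())
  # Align all records on the union of their field names (missing ones become NA)
  all_keys <- unique(unlist(lapply(records, names)))
  mat <- do.call(rbind, lapply(records, function(r) unname(r[all_keys])))
  colnames(mat) <- all_keys
  as.data.frame(mat, stringsAsFactors = FALSE)
}

df <- parse_records(log_lines)
str(df)
```

With a real file, `log_lines` would come from `readLines()`. The `header` column still holds the whole M line, so it could be split further with a regular expression like the one in the question.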