How to improve the performance of reading, cleansing & analysing log files in R?

The task at hand is to extract data for analysis, reporting & modelling purposes from application-specific log files. The challenge with these files is that they are of a 'semi-structured' nature, as well as of varying size (a few kBytes up to several MBytes), and their structure cannot be changed.

After analysing quite a fair number of these log files, four major categories of entries could be identified.

  • java exception/error message (plain text; spread across several adjacent rows);
  • status messages (1st entry structured, subsequent entries plain text spread across several adjacent rows);
  • markup/xml code (1st entry structured, subsequent entries plain text spread across several adjacent rows);
  • status messages (structured; spread across several (not necessarily) adjacent rows).

The structured part of the log file comprises the columns as set forth below.

  • date;
  • time;
  • reference number;
  • class;
  • endpoint;
  • description/text.

Example

<Date> <Time> <RefNo> <Class> <Endpoint> <Description>
2016-12-20 04:04:04.004 12345678 [my.java.class1] (host1-001) start pattern #1
2016-12-20 04:04:04.014 12345678 [my.java.class1] (host1-001) pattern #1-2
2016-12-20 04:04:04.104 12345679 [my.java.class1] (host1-001) pattern #1-3
2016-12-20 04:04:05.004 12345680 [my.java.class2] (host1-001) pattern #1-4
2016-12-20 04:04:06.004 12345681 [my.java.class2] (host1-001) end pattern #1
2016-12-21 05:05:04.005 12389741 [my.java.class3] (host2-003) start pattern #2
2016-12-21 05:05:04.016 12390000 [my.java.class3] (host2-003) pattern #2-2
2016-12-21 05:05:04.108 12390001 [my.java.class4] (host2-003) pattern #2-3
2016-12-21 05:05:04.110 12389741 [my.java.class7] (host3-004) start pattern #3
2016-12-21 05:05:04.115 12390000 [my.java.class7] (host3-004) pattern #3-2
2016-12-21 05:05:05.007 12390002 [my.java.class5] (host2-003) pattern #2-4
2016-12-21 05:05:06.003 12390010 [my.java.class6] (host2-003) end pattern #2
2016-12-21 05:05:06.012 12390010 [my.java.class8] (host3-004) end pattern #3
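
For reference, the structured entries can be split into the six columns with a single vectorised regular expression, roughly like this (a sketch only; the regex is an assumption derived from the sample lines above):

```r
library(data.table)

# two sample lines in the format shown above
lines <- c(
  "2016-12-20 04:04:04.004 12345678 [my.java.class1] (host1-001) start pattern #1",
  "2016-12-20 04:04:04.014 12345678 [my.java.class1] (host1-001) pattern #1-2")

# split each line into the six structured fields in one vectorised call
# (lines that do not match the pattern would need separate handling)
m <- regmatches(lines, regexec(
    "^(\\S+) (\\S+) (\\S+) \\[([^]]+)\\] \\(([^)]+)\\) (.*)$", lines))
dt <- as.data.table(do.call(rbind, lapply(m, `[`, -1)))
setnames(dt, c("Date", "Time", "RefNo", "Class", "Endpoint", "Description"))
```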

Each *status sequence* starts with a specific *start pattern* and ends subsequently--but not necessarily adjacently--with a corresponding *end pattern*. When focusing on a particular *status sequence*, it is of the utmost importance to maintain the chronological order while at the same time picking only rows with the matching (identical) *endpoint*.

Having had a closer look at the *description/text* of the various structured status messages, 20 different *pattern sequences* could be spotted, each of which has to be merged and transformed into a single-row observation comprising a whole range of variables.

The intended target structure consists of the following variables:

  • timestamp (format yyyy-mm-dd hh:mm:ss.tsec);
  • reference number;
  • status;
  • class;
  • endpoint;
  • tenant;
  • product;
  • service call;
  • flag 1;
  • flag 2;
  • flag 3;
  • flag 4;
  • ip address 1;
  • ip address 2;
  • total time (in minutes);
  • time 1 (in milliseconds);
  • time 2 (in milliseconds);
  • time 3 (in milliseconds);
  • length;
  • file name;
  • file size;
  • meta data size;
  • return value;
  • error text;
  • note.

By filtering the entries and applying the relevant pattern, the required information gets extracted from the corresponding rows of a *status sequence*.

As each of the aforementioned *pattern sequences* differs from the others, and can occur multiple times in a single log file, a specific function has been created for each such *pattern sequence*.

library(data.table)   # the functions below operate on data.tables

clnPattern <- function(dt=NULL, cn=NA) {
    pat <- c(startPattern, 
             patternOne,
             patternTwo,
             patternThree,
             patternFour,
             endPattern)
    patLen <- length(pat)
    p <- NULL
    allEP <- NULL
    currEP <- NULL
    rowIdx <- list()
    wipeIdx <- NULL
    seqOk <- NULL
    i <- NULL
    j <- NULL

    tmpDT <- data.table(NULL)
    retDT <- data.table(NULL)

    # cleanse pattern entries
    if (length(dt) > 0) {
        # cycle over each endpoint separately, but only if at
        # least the starting pattern can be found
        if ((length(allEP <- unique(dt[, Endpoint])) > 0) &
            (sum(grepl(pat[1], dt[, Description], ignore.case=TRUE, perl=TRUE)) > 0)) {
            for (currEP in allEP) {
                p <- 1
                if (length(rowIdx[[p]] <- grep(pat[1], 
                    dt[Endpoint == currEP,Description],ignore.case=TRUE,  perl=TRUE)) > 0) {

                    # search for each subsequent pattern respectively
                    for (p in 2:patLen) {
                        rowIdx[[p]] <- grep(pat[p], dt[Endpoint == currEP,
                                            Description],
                                            ignore.case=TRUE, perl=TRUE)
                    } # for

                    # check for pattern sequence & cleanse
                    for (i in 1:length(rowIdx[[1]])) {
                        seqOk <- TRUE
                        wipeIdx <- dt[Endpoint == currEP, RowIdx][rowIdx[[1]][i]]

                        for (p in 2:patLen) {
                            seqOk <- (seqOk &
                                      ((rowIdx[[1]][i] + p - 1) %in% rowIdx[[p]]))
                            wipeIdx <- c(wipeIdx,
                                         dt[Endpoint == currEP, RowIdx][rowIdx[[1]][i] + p - 1])
                        } # for

                        # pattern sequence found
                        if (seqOk) {
                            # consolidate sequence into one single row
                            tmpDT <- dt[Endpoint == currEP][rowIdx[[1]][i]]

                            setColClass(tmpDT)

                            vTimeStamp1=as.POSIXct(strptime(paste(
                                dt[Endpoint == currEP, Date][rowIdx[[1]][i]],
                                   gsub(",", ".",
                                        dt[Endpoint == currEP, Time][rowIdx[[1]][i]],fixed=TRUE)), 
                                "%Y-%m-%d %H:%M:%OS"))
                            vTimeStamp2=as.POSIXct(strptime(paste(
                                dt[Endpoint == currEP, Date][rowIdx[[1]][i] + patLen - 1],
                                   gsub(",", ".",
                                        dt[Endpoint == currEP, Time][rowIdx[[1]][i] + patLen - 1],fixed=TRUE)), 
                                "%Y-%m-%d %H:%M:%OS"))

                            tmpDT[1, `:=`(
                                    TimeStamp=vTimeStamp1,
                                    Var1=getFromVariable(dt[Endpoint == currEP][rowIdx[[1]][i]], 
                                        "Description", startPattern, posVar1),
                                    Var2=getFromVariable(dt[Endpoint == currEP][rowIdx[[1]][i]], 
                                        "Description", startPattern,  posVar2),
                                    Var3=as.integer(getFromVariable(dt[Endpoint == currEP][rowIdx[[1]][i]], 
                                        "Description", startPattern, posVar3)),
                                    Var4=getFromVariable(dt[Endpoint == currEP][rowIdx[[1]][i] + 1], 
                                        "Description", patternOne, posVar4),
                                    Var5=getFromVariable(dt[Endpoint == currEP][rowIdx[[1]][i] + 2], 
                                        "Description", patternTwo,  posVar5),
                                    Var6=as.double(getFromVariable(dt[Endpoint == currEP][rowIdx[[1]][i] + 3], 
                                        "Description", patternThree, posVar6)),
                                    Var7=getFromVariable(dt[Endpoint == currEP][rowIdx[[1]][i] + 4], 
                                        "Description", patternThree, posVar7),
                                    Var8=as.double(getFromVariable(dt[Endpoint == currEP][rowIdx[[1]][i] + 5], 
                                        "Description", patternFour,  posVar8)),
                                    TmTotal=as.numeric(vTimeStamp2 - vTimeStamp1)
                                )]

                            # add to resulting data.table
                            retDT <- rbindlist(list(retDT, tmpDT))

                            # mark rows as processed
                            if (length(wipeIdx) > 0) {
                                dt[(RowIdx %in% wipeIdx),
                                    `:=`(Date=NA, Time=NA, RefNo=NA, Class=NA, Description=NA)]
                            } # if
                        } # if
                    } # for
                } # if
            } # for
        } # if

        # re-shape resulting data.table
        if ((length(retDT) > 0) & (length(cn) > 0)) {
            retDT <- retDT[, .SD, .SDcols=cn]
        } # if
    } # if

    # return resulting data.table
    return(retDT)
} # clnPattern

Within this function another user-defined function--*getFromVariable*--is called to extract the relevant pieces of the *description/text* and stash them away into corresponding variables/columns.

getFromVariable <- function(dt=NULL, varName="", srchPat="", getPat="") {
    # local variables & constants
    retVal <- NULL
    # search for matching observations/rows, and extract parts from variable
    if ((length(dt) > 0) & (nchar(varName) > 0) & 
        (nchar(srchPat) > 0) & (nchar(getPat) > 0)) {
        # prepare if not a data.table
        if (!is.data.table(dt)) {
            dt <- as.data.table(dt)
            setnames(dt, 1, varName)
        } # if
        # search & extract
        if (varName %in% colnames(dt)) {
            retVal <- trimws(gsub(srchPat, getPat, 
                                  dt[grep(srchPat, 
                                          dt[, varName, with=FALSE][[1]], 
                                          ignore.case=TRUE, perl=TRUE), 
                                     varName, with=FALSE][[1]], 
                                  ignore.case=TRUE, perl=TRUE))
        } # if
    } # if
    # return result vector
    return(retVal)
} # getFromVariable

Taken together, *clnPattern* basically carries out two tasks:

  • searching for, merging & extracting the relevant bits & bobs from the *dt* data.table;
  • marking the thus-processed rows in the *dt* data.table by setting some variables to *NA*.

The extracted data is returned to the caller as a *data.table*. The caller--after having processed all of the defined *pattern sequences*--removes all of the marked (*NA*) rows from the original *dt* data.table, and assembles the returned pattern-specific data.tables into one single, combined data.table (the basis for the actual data analysis and modelling).
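
The assembly step amounts to something along these lines (a sketch with made-up per-pattern results; *rbindlist* with *fill=TRUE* fills pattern-specific columns with *NA*):

```r
library(data.table)

# hypothetical per-pattern results as returned by the cleansing functions
resList <- list(
  data.table(RefNo = "12345678", Var1 = "a", TmTotal = 2.0),
  data.table(RefNo = "12389741", Var2 = "b", TmTotal = 1.9))

# combine into one table; fill = TRUE adds NA for columns
# that only some pattern-specific tables contain
combined <- rbindlist(resList, use.names = TRUE, fill = TRUE)

# drop the rows marked as processed from the original table
# (sketch; dt would be the original log data.table)
# dt <- dt[!is.na(Date)]
```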

Lacking a more comprehensive insight into *R* and *data.table*, the approach taken relies on *for* loops, in full awareness that such constructs should be avoided wherever possible.

Processing a log file of a few kBytes takes only a few seconds, but as soon as the file size reaches the MByte range, processing time increases dramatically, up to several minutes. As 20 different patterns are processed for a single log file, the average overall time amounts to roughly 1 hour per log file.

What I would be interested to learn is where I can improve the approach taken and optimise it as much as possible--or maybe head back to the drawing board--in order to get a better grasp of *R* and *data.table*, cut down processing time and memory usage, and improve performance significantly.
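
One direction I have been toying with, as a replacement for the innermost loops, is to compute one logical vector per pattern up front and detect complete sequences per endpoint via data.table's *shift()*. A rough sketch (the function name and the assumption that the patterns sit on strictly consecutive rows are mine):

```r
library(data.table)

# sketch: find the starting rows of complete pattern sequences, i.e.
# rows where pat[1]..pat[n] match on n consecutive rows of one and
# the same endpoint -- no explicit inner loops
findSeqStarts <- function(dt, pat) {
  # one logical vector per pattern, computed vectorised over all rows
  hits <- lapply(pat, function(p)
    grepl(p, dt$Description, ignore.case = TRUE, perl = TRUE))
  n <- length(pat)
  # per endpoint, lead the k-th pattern's hits by (k - 1) rows, so a
  # complete sequence collapses to all-TRUE on its starting row
  # (adds a seqStart column to dt by reference)
  dt[, seqStart := Reduce(`&`, lapply(seq_len(n), function(k)
        shift(hits[[k]][.I], n = k - 1L, type = "lead", fill = FALSE))),
     by = Endpoint]
  # return the global row numbers of the sequence starts
  dt[seqStart == TRUE, which = TRUE]
}
```

With the starting rows known, the per-sequence field extraction could then be done on the whole subset in one go instead of row by row.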

--Sil68
