Improving QCC statistics calculations in R

I need to perform QCC tests on subsets of data in a loop. Plotting is not important, but calculating the LCL and UCL and tagging data points that are out of limits or violate the Shewhart rules is.
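
For reference, here is a minimal sketch of the qcc output the approach relies on, using the first few RETAIL quantities from the sample data below:

library(qcc)

spc <- qcc(c(55, 57, 18, 20, 7, 15, 24, 21), type = "xbar.one", plot = FALSE)
spc$limits                      # one-row matrix with columns LCL and UCL
spc$violations$beyond.limits    # indices of points outside the control limits
spc$violations$violating.runs   # indices of points in rule-violating runs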

The input data is organized in a DF as shown below:

    TS  CATEGORY    KEYWORD CHANNEL QTY
    2013_Q1 ABC WIDGET1 RETAIL  55
    2013_Q2 ABC WIDGET1 RETAIL  57
    2013_Q3 ABC WIDGET1 RETAIL  18
    2013_Q4 ABC WIDGET1 RETAIL  20
    2014_Q1 ABC WIDGET1 RETAIL  7
    2014_Q2 ABC WIDGET1 RETAIL  15
    2014_Q3 ABC WIDGET1 RETAIL  24
    2014_Q4 ABC WIDGET1 RETAIL  21
    2015_Q1 ABC WIDGET1 RETAIL  43
    2015_Q2 ABC WIDGET1 RETAIL  70
    2015_Q3 ABC WIDGET1 RETAIL  51
    2015_Q4 ABC WIDGET1 RETAIL  83
    2013_Q1 ABC WIDGET1 ONLINE  31
    2013_Q2 ABC WIDGET1 ONLINE  37
    2013_Q3 ABC WIDGET1 ONLINE  31
    2013_Q4 ABC WIDGET1 ONLINE  56
    2014_Q1 ABC WIDGET1 ONLINE  56
    2014_Q2 ABC WIDGET1 ONLINE  62
    2014_Q3 ABC WIDGET1 ONLINE  55
    2014_Q4 ABC WIDGET1 ONLINE  86
    2015_Q1 ABC WIDGET1 ONLINE  79
    2015_Q2 ABC WIDGET1 ONLINE  79
    2015_Q3 ABC WIDGET1 ONLINE  62
    2015_Q4 ABC WIDGET1 ONLINE  83
    2013_Q1 ABC WIDGET1 AUCTION 2
    2013_Q2 ABC WIDGET1 AUCTION 0
    2013_Q3 ABC WIDGET1 AUCTION 2
    2013_Q4 ABC WIDGET1 AUCTION 1
    2014_Q1 ABC WIDGET1 AUCTION 3
    2014_Q2 ABC WIDGET1 AUCTION 4
    2014_Q3 ABC WIDGET1 AUCTION 3
    2014_Q4 ABC WIDGET1 AUCTION 2
    2015_Q1 ABC WIDGET1 AUCTION 6
    2015_Q2 ABC WIDGET1 AUCTION 2
    2015_Q3 ABC WIDGET1 AUCTION 1
    2015_Q4 ABC WIDGET1 AUCTION 2

I have been able to get the code to work using a loop, as follows:

  • Determine the unique groups (keys) in the data set based on CATEGORY, KEYWORD, and CHANNEL
  • Order the data by increasing TS (for the control chart)
  • Loop through the keys
  • Select a subset
  • Perform the qcc calculations
  • Update DF with the results, i.e. oos (out of spec), vlt (violating points), lcl, and ucl

The performance is fine for small data sets but degrades badly as the data set grows large (> 100,000 rows).

Any ideas to change the logic would be appreciated.

Below is the R code:

library(qcc)

# read data into DF
DF <- read.csv("SPCQty1.csv",header=TRUE,na.strings = "null")

# create ID row to use for later updates
DF$ID <- 1:nrow(DF)

# Create additional columns for later use
# these will be populated after calling qcc function for each group
DF$oos <- NA
DF$vlt <- NA
DF$ucl <- NA
DF$lcl <- NA

# determine unique groups in data set
keys <- unique(DF[,c('CATEGORY','KEYWORD','CHANNEL')])
len <- nrow(keys)

# perform stats on each set
for (i in 1:len)
{
  g1 <- as.data.frame.array(keys[i,]["CATEGORY"])[,"CATEGORY"]
  g2 <- as.data.frame.array(keys[i,]["KEYWORD"])[,"KEYWORD"]
  g3 <- as.data.frame.array(keys[i,]["CHANNEL"])[,"CHANNEL"]

  # select the subset  
  tmp <- subset(DF, CATEGORY == g1 & KEYWORD == g2 & CHANNEL == g3)
  # sort by TS for control chart
  spcdata <- tmp[order(tmp$TS),]

  # generate control chart stats

  spc <- qcc(spcdata$QTY, type="xbar.one", plot = FALSE)

  # get statistics object generated by qcc
  stats <- spc$statistics
  indices <- 1:length(stats)

  # get UCL and LCL   
  limits <- spc$limits
  lcl <- limits[,1]
  ucl <- limits[,2]

  # violating runs  
  violations <- spc$violations

  # create a data frame of the qcc stats  
  qc.data <- data.frame(df.indices = indices, df.statistics = as.vector(stats), ID = spcdata$ID)

  # detect violating runs
  index.r <- rep(NA, length(violations$violating.runs))
  if (length(violations$violating.runs) > 0) {
    index.r <- violations$violating.runs
    # Create a data frame for violating run points.
    df.runs <- data.frame(x.r = qc.data$ID[index.r], vlt = "Y")
    idx <- df.runs$x.r
    DF$vlt[DF$ID %in% idx] <- "Y"
  }

  # detect beyond limits points
  index.b <- rep(NA, length(violations$beyond.limits))
  if (length(violations$beyond.limits) > 0) {
    index.b <- violations$beyond.limits
    # Create a data frame to tag beyond limit points.
    df.beyond <- data.frame(x.b = qc.data$ID[index.b], oos = "Y")
    idx <- df.beyond$x.b
    DF$oos[DF$ID %in% idx] <- "Y"
  }

  idx <- qc.data$ID
  DF$ucl[DF$ID %in% idx] <- ucl
  DF$lcl[DF$ID %in% idx] <- lcl
} 

DF[is.na(DF)] <- ""
# DF will now have 5 additional columns - ID, oos, vlt, ucl and lcl

Answer:

I noticed your code creates a large number of temporary variables (e.g. index.r, index.b, etc.). If the array lengths are the same, there is no need to track the indexes.

library(qcc)
# read data into DF
DF <- read.csv("sample.csv",header=TRUE,na.strings = "null")

# Create additional columns for later use
# these will be populated after calling qcc function for each group
DF$oos <- NA
DF$vlt <- NA
DF$ucl <- NA
DF$lcl <- NA

# determine unique groups in data set
keys <- unique(DF[,c('CATEGORY','KEYWORD','CHANNEL')])
len <- nrow(keys)
dfnew <- data.frame()

# perform stats on each set
for (i in 1:len)
{
   # select the subset  
   tmp <- subset(DF, CATEGORY == keys$CATEGORY[i] & KEYWORD == keys$KEYWORD[i] & CHANNEL == keys$CHANNEL[i])
   # generate control chart stats
   spc <- qcc(tmp$QTY, type="xbar.one", plot = FALSE)

   # get UCL and LCL
   tmp$lcl <- spc$limits[,1]
   tmp$ucl <- spc$limits[,2]
   # get violations
   tmp$vlt[spc$violations$violating.runs] <- "Y"
   tmp$oos[spc$violations$beyond.limits] <- "Y"
   # add the group onto the result data frame
   dfnew <- rbind(dfnew, tmp)
} 
dfnew[is.na(dfnew)] <- ""
#Sort as needed
print(dfnew)

A new data frame, dfnew, holds the final results. This simplified version is easier to read and should give some performance improvement, though I can't quantify that with the limited data. It also assumes the data is presorted before the loop. The next improvement would be to eliminate the loop altogether and replace it with one of the *apply functions; also look into data.table, which could speed up the subsetting.
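
To make those last suggestions concrete, here is a minimal sketch of a loop-free version. It assumes the data.table package and the column names from the sample data (CATEGORY, KEYWORD, CHANNEL, TS, QTY); qcc_cols is a hypothetical helper name, and the qcc fields it reads are the same ones used above:

library(qcc)
library(data.table)

DT <- as.data.table(DF)
# sort once, up front, instead of inside the loop
setorder(DT, CATEGORY, KEYWORD, CHANNEL, TS)

# hypothetical helper: run qcc on one group's QTY vector and
# return the four new columns for that group
qcc_cols <- function(qty) {
  spc <- qcc(qty, type = "xbar.one", plot = FALSE)
  v <- spc$violations
  n <- length(qty)
  list(lcl = rep(spc$limits[, 1], n),
       ucl = rep(spc$limits[, 2], n),
       vlt = replace(rep(NA_character_, n), v$violating.runs, "Y"),
       oos = replace(rep(NA_character_, n), v$beyond.limits, "Y"))
}

# one grouped pass: no explicit loop, no repeated subset(), no rbind()
DT[, c("lcl", "ucl", "vlt", "oos") := qcc_cols(QTY),
   by = .(CATEGORY, KEYWORD, CHANNEL)]

data.table adds the four columns by reference in a single grouped pass. This also avoids growing dfnew with rbind() on every iteration, which copies the accumulated result each time and is itself a significant cost on large data sets.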