I need to perform a qcc test on subsets of data in a loop. Plotting is not important, but calculating the LCL and UCL and tagging the data points that are beyond the limits or violate the Shewhart rules is.
The input data is organized in a data frame as shown below:
TS       CATEGORY  KEYWORD  CHANNEL  QTY
2013_Q1  ABC       WIDGET1  RETAIL    55
2013_Q2  ABC       WIDGET1  RETAIL    57
2013_Q3  ABC       WIDGET1  RETAIL    18
2013_Q4  ABC       WIDGET1  RETAIL    20
2014_Q1  ABC       WIDGET1  RETAIL     7
2014_Q2  ABC       WIDGET1  RETAIL    15
2014_Q3  ABC       WIDGET1  RETAIL    24
2014_Q4  ABC       WIDGET1  RETAIL    21
2015_Q1  ABC       WIDGET1  RETAIL    43
2015_Q2  ABC       WIDGET1  RETAIL    70
2015_Q3  ABC       WIDGET1  RETAIL    51
2015_Q4  ABC       WIDGET1  RETAIL    83
2013_Q1  ABC       WIDGET1  ONLINE    31
2013_Q2  ABC       WIDGET1  ONLINE    37
2013_Q3  ABC       WIDGET1  ONLINE    31
2013_Q4  ABC       WIDGET1  ONLINE    56
2014_Q1  ABC       WIDGET1  ONLINE    56
2014_Q2  ABC       WIDGET1  ONLINE    62
2014_Q3  ABC       WIDGET1  ONLINE    55
2014_Q4  ABC       WIDGET1  ONLINE    86
2015_Q1  ABC       WIDGET1  ONLINE    79
2015_Q2  ABC       WIDGET1  ONLINE    79
2015_Q3  ABC       WIDGET1  ONLINE    62
2015_Q4  ABC       WIDGET1  ONLINE    83
2013_Q1  ABC       WIDGET1  AUCTION    2
2013_Q2  ABC       WIDGET1  AUCTION    0
2013_Q3  ABC       WIDGET1  AUCTION    2
2013_Q4  ABC       WIDGET1  AUCTION    1
2014_Q1  ABC       WIDGET1  AUCTION    3
2014_Q2  ABC       WIDGET1  AUCTION    4
2014_Q3  ABC       WIDGET1  AUCTION    3
2014_Q4  ABC       WIDGET1  AUCTION    2
2015_Q1  ABC       WIDGET1  AUCTION    6
2015_Q2  ABC       WIDGET1  AUCTION    2
2015_Q3  ABC       WIDGET1  AUCTION    1
2015_Q4  ABC       WIDGET1  AUCTION    2
I have been able to get the code to work using loops as follows:
- determine the unique groups (keys) in the data set based on CATEGORY, KEYWORD and CHANNEL
- order the data by increasing TS (for the control chart)
- loop through the keys
- select a subset
- perform the qcc calculations
- update DF with the results, i.e. oos (out of spec), vlt (violating points), lcl and ucl
The performance is fine for small data sets but quite poor as the data set gets large (> 100,000 rows).
Any ideas to change the logic would be appreciated.
Below is the R code:
library(qcc)
# read data into DF
DF <- read.csv("SPCQty1.csv",header=TRUE,na.strings = "null")
# create ID row to use for later updates
DF$ID <- 1:nrow(DF)
# Create additional columns for later use
# these will be populated after calling qcc function for each group
DF$oos <- NA
DF$vlt <- NA
DF$ucl <- NA
DF$lcl <- NA
# determine unique groups in data set
# determine unique groups in data set
# (the sample data uses CATEGORY for the first key column)
keys <- unique(DF[, c("CATEGORY", "KEYWORD", "CHANNEL")])
len <- nrow(keys)
# perform stats on each set
for (i in 1:len)
{
g1 <- keys$CATEGORY[i]
g2 <- keys$KEYWORD[i]
g3 <- keys$CHANNEL[i]
# select the subset
tmp <- subset(DF, CATEGORY == g1 & KEYWORD == g2 & CHANNEL == g3)
# sort by TS for control chart
spcdata <- tmp[order(tmp$TS),]
# generate control chart stats
spc <- qcc(spcdata$QTY, type="xbar.one", plot = FALSE)
# get statistics object generated by qcc
stats <- spc$statistics
indices <- 1:length(stats)
# get UCL and LCL
limits <- spc$limits
lcl <- limits[,1]
ucl <- limits[,2]
# violating runs
violations <- spc$violations
# create a data frame of the qcc stats
qc.data <- data.frame(indices = indices, statistics = as.vector(stats), ID = spcdata$ID)
# detect violating runs
if (length(violations$violating.runs) > 0) {
index.r <- violations$violating.runs
# Create a data frame for violating run points.
df.runs <- data.frame(x.r = qc.data$ID[index.r], vlt = "Y")
idx <- df.runs$x.r
DF$vlt[DF$ID %in% idx]<- "Y"
}
# detect beyond limits points
if (length(violations$beyond.limits) > 0) {
index.b <- violations$beyond.limits
# Create a data frame to tag beyond limit points.
df.beyond <- data.frame(x.b = qc.data$ID[index.b], oos = "Y")
idx <- df.beyond$x.b
DF$oos[DF$ID %in% idx]<- "Y"
}
idx <- qc.data$ID
DF$ucl[DF$ID %in% idx] <- ucl
DF$lcl[DF$ID %in% idx] <- lcl
}
DF[is.na(DF)] <- ""
# DF will now have 5 additional columns - ID, oos, vlt, ucl and lcl
I noticed your code creates a large number of temporary variables (e.g. index.r, index.b, etc.). If the array lengths are the same, there is no need to track the indices.
A new data frame, dfnew, holds the final results. This simplified version is easier to read and should give some performance improvement, though I can't quantify it with the limited data. This version also assumes the data is presorted before the loop. The next improvement would be to eliminate the loop altogether and replace it with one of the *apply functions. Also look into data.table, which could improve the performance of the subsetting.
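A minimal sketch of the *apply idea, assuming the column names from the question (CATEGORY, KEYWORD, CHANNEL, TS, QTY) and an existing DF; the helper name spc_one is just illustrative:

```r
library(qcc)

# process one group: sort by TS, run qcc, and tag the rows
spc_one <- function(grp) {
  grp <- grp[order(grp$TS), ]
  spc <- qcc(grp$QTY, type = "xbar.one", plot = FALSE)
  grp$lcl <- spc$limits[, "LCL"]
  grp$ucl <- spc$limits[, "UCL"]
  grp$oos <- NA
  grp$vlt <- NA
  v <- spc$violations
  if (length(v$violating.runs) > 0) grp$vlt[v$violating.runs] <- "Y"
  if (length(v$beyond.limits) > 0)  grp$oos[v$beyond.limits] <- "Y"
  grp
}

# split once on the three keys, apply per group, bind the results back
dfnew <- do.call(rbind,
  lapply(split(DF, DF[c("CATEGORY", "KEYWORD", "CHANNEL")], drop = TRUE),
         spc_one))
```

split() does the grouping in a single pass, so there is no repeated subset() scan over the whole data frame on every iteration; a data.table version would follow the same pattern with its by= grouping.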