I have many data frames similar to this one:
times <- c("2015-12-30 20:00:00", "2016-01-06 20:00:00",
           "2016-01-08 20:00:00", "2016-01-11 20:00:00",
           "2016-01-13 20:00:00", "2016-01-14 20:00:00",
           "2016-01-15 20:00:00", "2016-01-18 20:00:00",
           "2016-01-20 20:00:00", "2016-01-21 20:00:00",
           "2016-01-25 20:00:00")
counts <- c(7, 14, 61, 1, 2, 66, 10, 35, 1, 304, 2)
df <- data.frame(timestamp = as.POSIXct(times, format = "%Y-%m-%d %H:%M:%S",
                                        tz = "Pacific/Auckland"),
                 count = counts)
I am trying to identify outliers in data sets similar to the one above. Looking at the normal Q-Q plot and the histogram, it is obvious that this sample is not from a normal distribution.
hist(df$count)
qqnorm(df$count)
qqline(df$count)
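If you want a numeric check alongside the plots, a Shapiro-Wilk test (my addition here, not part of the original workflow) makes the non-normality explicit:
shapiro.test(df$count)  # p-value far below 0.05, so normality is rejected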
Next, I apply a Box-Cox power transform to bring the data closer to a normal distribution.
# Estimate lambda by maximum likelihood, then apply the Box-Cox transform
lambda <- geoR::boxcoxfit(df$count)$lambda
df$transformed <- car::bcPower(df$count, lambda = lambda)
# Note: bcPower maps count == 1 to 0 for every lambda, since (1^lambda - 1)/lambda = 0
Note: I am aware of other ways of finding a Box-Cox transformation parameter, such as the forecast or car packages. There are also methods that use an extended family of Box-Cox transformations and optimise over its parameters, as in https://stats.stackexchange.com/a/35717/101902. One reason I am not using forecast is that, in most cases, my data is not equidistant and does not carry typical time-series properties. The other is that I need to automate the process, and any method that blindly fits a GLM or LM returns nothing useful. The package alternatives look roughly like the sketch below.
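(A minimal sketch of those alternatives, for reference only; the method arguments are my assumptions, not what I used above.)
lambda_fc  <- forecast::BoxCox.lambda(df$count, method = "loglik")  # log-likelihood estimate
lambda_car <- coef(car::powerTransform(df$count))                   # MLE via car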
Next, I calculate z-scores on the transformed data.
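The z-score step is not shown above; a minimal version, which reproduces the zscore column below, is a plain standardisation with the sample mean and standard deviation:
df$zscore <- as.vector(scale(df$transformed))
With that, the data frame looks like this: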
timestamp count transformed zscore
1 2015-12-30 20:00:00 7 1.7922836 -0.14446864
2 2016-01-06 20:00:00 14 2.3618561 0.22598616
3 2016-01-08 20:00:00 61 3.4646761 0.94326978
4 2016-01-11 20:00:00 1 0.0000000 -1.31018523
5 2016-01-13 20:00:00 2 0.6729577 -0.87248782
6 2016-01-14 20:00:00 66 3.5198741 0.97917102
7 2016-01-15 20:00:00 10 2.0895953 0.04890541
8 2016-01-18 20:00:00 35 3.0646823 0.68311037
9 2016-01-20 20:00:00 1 0.0000000 -1.31018523
10 2016-01-21 20:00:00 304 4.5195550 1.62937200
11 2016-01-25 20:00:00 2 0.6729577 -0.87248782
Although the transformed data is closer to a normal distribution, the repeated values of 1 skew the standardisation, so the one clear outlier is not detected at all. Most articles, blog posts, and similar media on standardising data never talk about these extreme cases.
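To make the failure concrete: with the usual |z| > 2 cutoff (the threshold is my choice; any common value gives the same result here), nothing is flagged, even though 304 dwarfs every other count:
df[abs(df$zscore) > 2, ]  # zero rows; even count == 304 only reaches z = 1.63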
When I started typing this question I was going to ask whether there are other transformations that can handle repeated 1s, but I realised it doesn't matter.
How would you handle many occurrences of the same value in a data set? Especially when they sit at one of the two extremes, i.e. they are the minimum or the maximum of the data set.