Transforming Data with Repeated Values


I have many data frames similar to the one below:

times = c("2015-12-30 20:00:00", "2016-01-06 20:00:00", 
          "2016-01-08 20:00:00", "2016-01-11 20:00:00",
          "2016-01-13 20:00:00", "2016-01-14 20:00:00", 
          "2016-01-15 20:00:00", "2016-01-18 20:00:00",
          "2016-01-20 20:00:00", "2016-01-21 20:00:00", 
          "2016-01-25 20:00:00")
counts = c(7, 14, 61, 1, 2, 66, 10, 35, 1, 304, 2)
df <- data.frame(timestamp = as.POSIXct(times, format = "%Y-%m-%d %H:%M:%S",
                                        tz = "Pacific/Auckland"),
                 count = counts)

I am trying to identify outliers in data sets similar to the one above. Looking at the normal Q-Q plot and the histogram, it is obvious that this sample is not from a normal distribution.

hist(df$count)

[Histogram of df$count]

qqnorm(df$count)
qqline(df$count)

[Q-Q plot of df$count]
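
As a quick complement to the plots, a formal normality check can be run with base R's Shapiro-Wilk test (this step is an addition for illustration, not part of my actual workflow):

# Formal normality check to complement the histogram and Q-Q plot
shapiro.test(df$count)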

Next, I apply a Box-Cox power transformation to bring the data closer to a normal distribution.

lambda <- geoR::boxcoxfit(df$count)$lambda
df$transformed <- car::bcPower(df$count, lambda=lambda)

Note: I am aware of other ways of finding a Box-Cox transformation parameter, such as using the forecast or car packages. There are also methods that use an extended family of Box-Cox transformations and optimise over its parameters, as in the https://stats.stackexchange.com/a/35717/101902 answer. One reason I am not using forecast is that, in most cases, my data is not equidistant and does not have typical time-series properties. The other is that I need to automate the process: any method that blindly fits a GLM or LM just returns nothing useful.
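
The z-scores in the table below are computed on the transformed values; since that code is not shown above, here is a minimal sketch of the step using base R's scale():

# Standardise the transformed counts: (x - mean(x)) / sd(x)
df$zscore <- as.numeric(scale(df$transformed))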

After transforming the data and calculating z-scores on the transformed data, we get

             timestamp count transformed      zscore
1  2015-12-30 20:00:00     7   1.7922836 -0.14446864
2  2016-01-06 20:00:00    14   2.3618561  0.22598616
3  2016-01-08 20:00:00    61   3.4646761  0.94326978
4  2016-01-11 20:00:00     1   0.0000000 -1.31018523
5  2016-01-13 20:00:00     2   0.6729577 -0.87248782
6  2016-01-14 20:00:00    66   3.5198741  0.97917102
7  2016-01-15 20:00:00    10   2.0895953  0.04890541
8  2016-01-18 20:00:00    35   3.0646823  0.68311037
9  2016-01-20 20:00:00     1   0.0000000 -1.31018523
10 2016-01-21 20:00:00   304   4.5195550  1.62937200
11 2016-01-25 20:00:00     2   0.6729577 -0.87248782

Although the transformed data is closer to normally distributed, the repeated 1s skew the standardisation, so the clear outlier (304) is not detected at all. Most articles, blog posts, or similar media on standardising data never talk about these extreme cases.
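
To make this concrete: with a conventional cut-off such as |z| > 2 (an arbitrary threshold chosen purely for illustration), nothing is flagged, not even the 304 count:

# Flag observations whose absolute z-score exceeds the threshold;
# for this sample the result has zero rows, i.e. no outliers detected
threshold <- 2
df[abs(df$zscore) > threshold, ]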

When I started typing this question I was going to ask whether there are other transformations that can handle repeated 1s, but I realised it doesn't matter.

How would you handle having many of the same value in a data set? Especially when those values sit at one of the extremes, i.e. they are the minimum or maximum of the data set.
