R's caTools sample.split Results Incorrect

I'd like to preface my question by stating that this appears to be a common issue:

  1. Incorrect splitting of data using sample.split in R and issue with logistic regression
  2. SplitRatio results with sample.split (caTools)

Yet, I cannot fix my problem using the solutions recommended in the first question, and the second was never answered.

In the following code, I would expect 100 observations for each of the four results, as obviously 100/150 = 2/3:

library(caTools)
set.seed(123)

# split on the first column, then keep the TRUE rows by logical indexing
isample <- sample.split(iris[, 1], SplitRatio = 2/3, group = NULL)
iris2 <- iris[isample, ]

# same split, but subset the data frame with subset()
isample2 <- sample.split(iris[, 1], SplitRatio = 2/3, group = NULL)
iris3 <- subset(iris, isample2 == TRUE)

# pass the column as a vector and keep only that vector
isample3 <- sample.split(iris$Sepal.Length, SplitRatio = 2/3, group = NULL)
sepal.length2 <- iris[isample3, 1]

# same, but subset the vector with subset()
isample4 <- sample.split(iris$Sepal.Length, SplitRatio = 2/3, group = NULL)
sepal.length3 <- subset(iris[, 1], isample4 == TRUE)

However, I get 104 observations in both iris2 and iris3, and likewise in the vectors sepal.length2 and sepal.length3. I make sure to draw a new sample each time so this isn't just something odd with rounding in a single draw. Using column 2 or column 3 of iris returns 100 observations, but using column 5 returns 99. Why does changing the column change the number of observations? A common error with this function is to accidentally pass it the entire data frame, so that it splits based on the columns, but here I am careful to pass it a vector each time. In the last two examples I pass it a vector and then subset a vector, and it still does not give 100.
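
For reference, the counts above can be checked directly by summing the returned logical mask (summing a logical vector counts the TRUE entries; the totals in the comments are the ones described above):

sum(sample.split(iris[, 1], SplitRatio = 2/3))   # 104 (the overcount above)
sum(sample.split(iris[, 2], SplitRatio = 2/3))   # 100
sum(sample.split(iris[, 3], SplitRatio = 2/3))   # 100
sum(sample.split(iris[, 5], SplitRatio = 2/3))   # 99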

If it helps, I'm running R 3.6.0 and caTools 1.18.0 on OS X. I normally would use the sample or sample.int function, so I am not all that familiar with caTools.

1 Answer

BEST ANSWER

After doing some searching and a little testing with the package source available [here], I have come to realize that this comes from the accumulation of rounding errors in how the authors wrote the function. The loop starting for (iU in 1:nU) rounds the number of random draws for each label, n = round(length(idx) * rat). So with a ratio of 2/3, a label that occurs 4 times contributes round(4 * 2/3) = 3 rows, and one that occurs 8 times contributes round(8 * 2/3) = 5. Over the course of the loop these per-label roundings accumulate into the overcount seen above.
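
To see the accumulation without stepping through the source, the per-label rounding can be mimicked directly. This is only a minimal sketch of the stratified branch under the round(length(idx) * rat) behaviour quoted above; it reproduces the split sizes, not which rows get chosen:

rat <- 2/3

# occurrences of each unique label in the vector being split
counts <- table(iris$Sepal.Length)

# per-label draw sizes, rounded the same way as the loop quoted above
sum(round(counts * rat))                 # 104 -- the observed size of iris2

# same idea for column 5: round(50 * 2/3) = 33 per species
sum(round(table(iris$Species) * rat))    # 99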

Re-reading the sample.split documentation, it says: "Split data from vector Y into two sets in predefined ratio while preserving relative ratios of different labels in Y." So my conclusion is that the function prioritizes preserving the ratio of each unique label over hitting the overall split ratio exactly: it tries to put 2/3 of the occurrences of 5.3, 2/3 of the occurrences of 4.9, and so on into the training set. Users of this function presumably prefer a slightly imprecise training/testing split in exchange for a more reliable test error, since the relative frequency of each label is preserved in both sets. Because the function is intended for classification labels, I conclude that I should avoid using it on vectors with many unique values, such as a continuous variable.
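
As a quick sanity check of that reading (a sketch only; which rows are selected varies with the seed, but the per-class counts should not), splitting on an actual class label keeps the label ratios as advertised:

# each species should keep round(50 * 2/3) = 33 of its 50 rows
mask <- sample.split(iris$Species, SplitRatio = 2/3)
table(iris$Species, mask)    # 17 FALSE / 33 TRUE per species, 99 TRUE overall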