I am trying to split my data into training, test and validation sets. I have two treatment groups, Control and TP, and within each group a secondary variable called Bio, numbered 1-4 in both groups.
I need the split to respect both the treatment group (Control or TP) and Bio, so that if Control 1 is in the training set, all of the Control 1 samples and all of the TP 1 samples are in the training set as well. My example data below has equal numbers in each Bio grouping (3 per group), but that is not true of the rest of the data, where different Bio groups contain different numbers of samples.
Please see a minimal data set below:
Sample Treatment Bio 285.945846 286.9638976 288.1004758 288.8109355
Control1_A13 Control 1 0.003535191 0.001777255 0.004729780 0.002364995
Control1_A14 Control 1 0.005063256 0.000110063 0.006249624 0.001041584
Control1_A15 Control 1 0.004262099 0.000836256 0.004277461 0.002699177
Control2_B13 Control 2 0.002411720 0.000466887 0.001129674 0.001109870
Control2_B14 Control 2 0.003085647 0.001831629 0.002482230 0.000000000
Control2_B15 Control 2 0.001996473 0.001060616 0.003995243 0.001369387
Control3_C13 Control 3 0.000299744 0.000851944 0.002808119 0.004065315
Control3_C14 Control 3 0.003187073 0.000591202 0.006833653 0.001713096
Control3_C15 Control 3 0.003692511 0.000262144 0.004673039 0.000126174
Control4_D13 Control 4 0.003369294 0.001087459 0.005171894 0.000675702
Control4_D14 Control 4 0.003818057 0.000838719 0.005513885 0.000458708
Control4_D15 Control 4 0.002572840 0.000257058 0.003537029 0.000009040
LX2+TP1_E1 TP 1 0.003347067 0.001231945 0.008181087 0.004436654
LX2+TP1_E2 TP 1 0.001552547 0.001463769 0.008864838 0.002728083
LX2+TP1_E3 TP 1 0.003224648 0.000812735 0.008518836 0.004303950
LX2+TP2_F1 TP 2 0.001705551 0.000182659 0.000911028 0.000240785
LX2+TP2_F2 TP 2 0.000760944 0.000759464 0.002486596 0.002377735
LX2+TP2_F3 TP 2 0.001034440 0.000647382 0.008146538 0.001028800
LX2+TP3_G1 TP 3 0.003660741 0.001260433 0.008046637 0.003182006
LX2+TP3_G2 TP 3 0.001802459 0.000547580 0.004882082 0.004121552
LX2+TP3_G3 TP 3 0.003590003 0.000089100 0.002801237 0.000403527
LX2+TP4_H1 TP 4 0.002831592 0.001534135 0.009151124 0.003021942
LX2+TP4_H2 TP 4 0.001863099 0.000959953 0.008284829 0.005169246
LX2+TP4_H3 TP 4 0.005649448 0.001959382 0.011814467 0.004110110
I have tried 2 different methods to do this:
- Method 1
library(caret)  # provides createDataPartition()
set.seed(1234)
# 60% of the rows go to training, stratified on Treatment
inTraining <- createDataPartition(vis_data2$Treatment, p = 0.6, list = FALSE)
training.set <- vis_data2[inTraining, ]
Totalvalidation.set <- vis_data2[-inTraining, ]
# Partition the remaining 40% again: 20% testing and 20% validation
inValidation <- createDataPartition(Totalvalidation.set$Treatment, p = 0.5, list = FALSE)
testing.set <- Totalvalidation.set[inValidation, ]
validation.set <- Totalvalidation.set[-inValidation, ]
However, this doesn't take my second variable, the Bio grouping, into account.
- Method 2
library(caTools)  # provides sample.split()
set.seed(1)
# Split into training and validation data sets
Y1 <- vis_data2[, 1]  # defining treatment/variable column
g1 <- vis_data2[, 3]  # defines group column
final_vis_data <- sample.split(Y1, SplitRatio = 0.5, group = g1)
table(Y1, final_vis_data)  # get correct split ratios
split(final_vis_data, g1)  # while keeping samples with the same group label together
full_train_set <- vis_data2[final_vis_data, ]
test.set <- vis_data2[!final_vis_data, ]
# Split training data set into training and testing data sets
Y2 <- full_train_set[, 1]  # defining treatment/variable column
g2 <- full_train_set[, 3]  # defines group column
final_vis_data2 <- sample.split(Y2, SplitRatio = 0.5, group = g2)
table(Y2, final_vis_data2)  # get correct split ratios
split(final_vis_data2, g2)  # while keeping samples with the same group label together
test.set <- full_train_set[final_vis_data2, 1:3]
validation.set <- full_train_set[!final_vis_data2, 1:3]
However, when I run this I often get NA values in my validation.index, and when I check the split the Bio groups often haven't been kept together.
How can I get this to work?
This answer uses functions from rsample and does not use caret's partitioning function. It will hopefully help you create an initial split for model fitting.
To demonstrate splitting test data as you described for validation sets I needed to make some extra groups.
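A minimal sketch of that idea, assuming a recent rsample release that provides group_initial_split(). The toy data frame, its values, and the choice of 10 Bio groups with unequal replicate counts are made up for illustration; substitute vis_data2 for toy in practice. Each call keeps all rows sharing a Bio number in the same set, and a second call on the held-out rows produces the test and validation sets.

```r
library(rsample)

set.seed(1234)

# Hypothetical stand-in for vis_data2: 10 Bio groups per treatment,
# with unequal numbers of replicates per Bio group.
reps <- rep(c(3, 2, 4, 3, 3), times = 2)  # replicates for Bio 1..10
toy <- data.frame(
  Treatment = rep(c("Control", "TP"), each = sum(reps)),
  Bio       = rep(rep(1:10, times = reps), times = 2)
)
toy$Sample    <- paste0(toy$Treatment, toy$Bio, "_", seq_len(nrow(toy)))
toy$intensity <- runif(nrow(toy))  # placeholder measurement column

# Step 1: roughly 60% of rows to training, moving whole Bio groups only.
split_1      <- group_initial_split(toy, group = Bio, prop = 0.6)
training.set <- training(split_1)
holdout      <- testing(split_1)

# Step 2: split the held-out Bio groups roughly in half.
# The "training" side of this second split is used as the test set.
split_2        <- group_initial_split(holdout, group = Bio, prop = 0.5)
testing.set    <- training(split_2)
validation.set <- testing(split_2)

# Inspect which Bio groups landed in which set.
table(training.set$Bio, training.set$Treatment)
table(testing.set$Bio, testing.set$Treatment)
table(validation.set$Bio, validation.set$Treatment)
```

Because every Bio number occurs in both Control and TP, grouping on Bio alone keeps Control k and TP k in the same set. The proportions are only approximate, since whole Bio groups are moved rather than individual rows.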
Created on 2023-08-17 with reprex v2.0.2