Separating Train, Validation and Test sets using ImageDataGenerator from Keras for a CNN


My data came already split into train, validation and test sets.

I have one folder for each of them, organized like this:

Test
    Class1
    Class0
Val
    Class1
    Class0
Train
    Class1
    Class0

Then I defined the paths as follows:

import os

# Define paths

train_dir = os.path.join(PATH, 'train')
val_dir = os.path.join(PATH, 'val')
test_dir = os.path.join(PATH, 'test')

# Specify them by class

train_safe_dir = os.path.join(train_dir, 'class1')  
train_malicious_dir = os.path.join(train_dir, 'class0')  
val_safe_dir = os.path.join(val_dir, 'class1')  
val_malicious_dir = os.path.join(val_dir, 'class0')
test_safe_dir = os.path.join(test_dir, 'class1')  
test_malicious_dir = os.path.join(test_dir, 'class0')

Then I used the ImageDataGenerator as follows:

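# Assumed import path; with standalone Keras it would be keras.preprocessing.image
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale=1./255)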
val_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(batch_size=batch_size,
                                                directory=train_dir,
                                                shuffle=False,
                                                target_size=(IMG_H, IMG_W),
                                                class_mode='binary')
val_generator = val_datagen.flow_from_directory(batch_size=batch_size,
                                            directory=val_dir,
                                            target_size=(IMG_H, IMG_W),
                                            class_mode='binary')
test_generator = test_datagen.flow_from_directory(batch_size=batch_size,
                                                directory=test_dir,
                                                shuffle=False,
                                                target_size=(IMG_H, IMG_W),
                                                class_mode='binary')

Is this correct when I want to evaluate on the test data? Am I introducing data leakage somehow? If it's not correct, what would be the right way to handle the test data? Thank you so much!

I'm not sure whether the test data should instead sit in a single folder without the class separation, but when I tried that I got a really low accuracy that didn't make sense. Any suggestion is appreciated!

# The generator already sets the batch size, so batch_size is not passed again
results = CNN.evaluate(test_generator)

There is 1 answer below


Your code has a few problems.

1) First of all, the lines below

train_safe_dir = os.path.join(train_dir, 'class1')  
train_malicious_dir = os.path.join(train_dir, 'class0')  
val_safe_dir = os.path.join(val_dir, 'class1')  
val_malicious_dir = os.path.join(val_dir, 'class0')
test_safe_dir = os.path.join(test_dir, 'class1')  
test_malicious_dir = os.path.join(test_dir, 'class0')

are never used. flow_from_directory only takes the parent directory of each split and finds the class subfolders itself, so you can delete these lines.

2) Secondly, your validation and test folders must be organized into class subfolders exactly like the training folder. Think of tabular data: you would not throw away the output labels of the test set before evaluating on it. Here the subfolders in each directory play the role of the class labels.

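3) Thirdly, the low accuracy you saw with the flat test folder follows from point 2: without class subfolders the generator has no correct labels to score against. Beyond that, you apply no image augmentation to your training dataset, so it would not be surprising if your model overfits the training data. Note also that your training generator uses shuffle=False, which feeds the images class by class; training batches are normally shuffled.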
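As a rough sketch of point 3, augmentation would be applied to the training generator only, while validation and test data are just rescaled. The parameter values below are illustrative and untuned, whether flips or shifts make sense depends on your images, `make_datagens` is a hypothetical helper, and the `tensorflow.keras` import path is an assumption about your installation:

```python
# Illustrative settings only; every value here is an assumption, not a recommendation.
TRAIN_AUGMENTATION = dict(
    rescale=1.0 / 255,        # same scaling as in the question
    rotation_range=15,        # random rotations of up to 15 degrees
    width_shift_range=0.1,    # random horizontal shifts (fraction of width)
    height_shift_range=0.1,   # random vertical shifts (fraction of height)
    zoom_range=0.1,           # random zoom in/out by up to 10%
    horizontal_flip=True,     # random left-right flips
)

# Validation and test images are only rescaled, never augmented.
EVAL_PREPROCESSING = dict(rescale=1.0 / 255)


def make_datagens():
    """Build (train_datagen, eval_datagen); hypothetical helper for this sketch."""
    # Imported lazily so the settings above can be inspected without TensorFlow.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    train_datagen = ImageDataGenerator(**TRAIN_AUGMENTATION)
    eval_datagen = ImageDataGenerator(**EVAL_PREPROCESSING)
    return train_datagen, eval_datagen
```

The same `eval_datagen` can serve both `val_dir` and `test_dir`, since it holds no state learned from data.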

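Fortunately, your code does not leak any information from the training data into the validation or test data: each generator reads only its own directory, and rescale=1./255 is a fixed constant rather than a statistic computed from the data.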
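If you want to double-check the folders themselves, both points can be tested before training. The sketch below uses only the standard library and is not part of Keras: `class_indices_from_folders` and `shared_files` are hypothetical helpers. The first mimics how flow_from_directory assigns labels (subfolder names in alphanumeric order), the second looks for byte-identical files that appear in two splits, so it will not catch resized or re-encoded copies. The demo runs on a throwaway directory tree, not on your data:

```python
import hashlib
import os
import tempfile


def class_indices_from_folders(split_dir):
    """Map each class subfolder name to an integer label, in sorted order."""
    names = sorted(
        name for name in os.listdir(split_dir)
        if os.path.isdir(os.path.join(split_dir, name))
    )
    return {name: index for index, name in enumerate(names)}


def file_hashes(split_dir):
    """Return {sha256 digest: path} for every file under split_dir."""
    hashes = {}
    for root, _dirs, files in os.walk(split_dir):
        for file_name in files:
            path = os.path.join(root, file_name)
            with open(path, "rb") as handle:
                hashes[hashlib.sha256(handle.read()).hexdigest()] = path
    return hashes


def shared_files(split_a, split_b):
    """List (path_in_a, path_in_b) pairs whose contents are byte-identical."""
    hashes_a, hashes_b = file_hashes(split_a), file_hashes(split_b)
    common = sorted(hashes_a.keys() & hashes_b.keys())
    return [(hashes_a[digest], hashes_b[digest]) for digest in common]


# Demo on a temporary tree that mirrors the layout in the question.
with tempfile.TemporaryDirectory() as demo_root:
    demo_files = {
        ("train", "class0", "a.png"): b"image-a",
        ("train", "class1", "b.png"): b"image-b",
        ("test", "class0", "copy.png"): b"image-a",  # deliberate duplicate of a.png
        ("test", "class1", "c.png"): b"image-c",
    }
    for (split, class_name, file_name), content in demo_files.items():
        folder = os.path.join(demo_root, split, class_name)
        os.makedirs(folder, exist_ok=True)
        with open(os.path.join(folder, file_name), "wb") as handle:
            handle.write(content)
    os.makedirs(os.path.join(demo_root, "flat_test"))  # split with no class subfolders

    train_labels = class_indices_from_folders(os.path.join(demo_root, "train"))
    test_labels = class_indices_from_folders(os.path.join(demo_root, "test"))
    flat_labels = class_indices_from_folders(os.path.join(demo_root, "flat_test"))
    duplicates = [
        (os.path.basename(a), os.path.basename(b))
        for a, b in shared_files(
            os.path.join(demo_root, "train"), os.path.join(demo_root, "test")
        )
    ]
```

On your real data, the mapping Keras actually uses is available as `train_generator.class_indices`, and it should be identical for the train, validation and test generators.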