Converting ImageDataGenerator.flow_from_directory output to a dataset that can be used with KFold


I am trying to use cross-validation for a model that classifies images into 3 classes. I use the following code to import the images:

train_datagen = ImageDataGenerator(rescale=1./255)
data = train_datagen.flow_from_directory(directory=train_path,
                                       target_size=(300,205), batch_size=8, 
                                       color_mode='grayscale',class_mode='categorical')

This worked fine for training and testing the model until I tried to use KFold from sklearn.model_selection. All the examples I find online use plain NumPy arrays, whereas my images come with class labels: flow_from_directory returns a DirectoryIterator, and I could not find a way to convert it into arrays that can be passed to the kfold.split function.

I tried the following approaches (please bear in mind that I am new to classification models):

np_data = data.next()

num_folds = 5
kfold = KFold(n_splits=num_folds, shuffle=True)
for train, test in kfold.split(np_data):

Then I get: ValueError: Cannot have number of splits n_splits=5 greater than the number of samples: n_samples=2.

I believe I get this ValueError because np_data holds 2 nested arrays, the first for the images and the second for their classes.
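That reading can be checked without any image files. The following is a minimal sketch, assuming NumPy and scikit-learn are installed; the zero-filled arrays are placeholders with the same shapes as one batch of 8 images, not real data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder batch (assumption): 8 grayscale 300x205 images, 3 one-hot classes.
x = np.zeros((8, 300, 205, 1), dtype=np.float32)
y = np.eye(3, dtype=np.float32)[np.arange(8) % 3]
np_data = (x, y)  # same structure as one batch from the iterator

print(len(np_data))  # 2: the tuple holds two arrays
print(len(x))        # 8: the images array holds eight samples

kfold = KFold(n_splits=5, shuffle=True)
try:
    # split() is lazy, so the check only runs when the first fold is requested.
    next(kfold.split(np_data))
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

KFold counts samples along the first axis of whatever it is given, so the tuple is treated as 2 samples rather than 8.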

I could shuffle and split only the images, but without knowing which class each image belongs to I cannot train my model properly. I tried following the guide in this link, but the training and test data there seem to be imported in a different way from mine. I then came across this as well, but it did not really help with my situation either.

I have no idea what I am missing, any additional help will be much appreciated.

Lastly I have tried doing:

x, y = data.next()
for train, test in kfold.split(x, y):
     ...

This gives me the following error when it begins the first epoch of the first fold:

ValueError: No gradients provided for any variable: ['conv2d/kernel:0', 'conv2d/bias:0', 'conv2d_1/kernel:0', 'conv2d_1/bias:0', 'conv2d_2/kernel:0', 'conv2d_2/bias:0', 'conv2d_3/kernel:0', 'conv2d_3/bias:0', 'dense/kernel:0', 'dense/bias:0', 'dense_1/kernel:0', 'dense_1/bias:0'].

Best answer

The last ValueError occurred because I had left the labels out of the model.fit() call. The following worked for me.

After importing the images with ImageDataGenerator.flow_from_directory(...), x, y = data.next() yields one batch of images and their labels as the x and y arrays. Then:

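kfold = KFold(n_splits=num_folds, shuffle=True)

fold_no = 1
for train, test in kfold.split(x, y):
    # Build a fresh model for each fold so no weights carry over between folds
    model = keras.models.Sequential(.....)
    model.fit(x[train], y[train], epochs=epochs)
    ...
    # Evaluate on the held-out fold; scores holds the loss followed by the metrics
    scores = model.evaluate(x[test], y[test], verbose=0)
    ...
    fold_no = fold_no + 1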

I also used this print line to keep track of the scores:

print(f'Score for fold {fold_no}: {model.metrics_names[0]} of {scores[0]}; {model.metrics_names[1]} of {scores[1]*100}%')

Additionally, the loss and accuracy of each fold can be appended to two separate lists (created empty before the loop) and averaged once all folds have run.

acc_per_fold.append(scores[1] * 100)
loss_per_fold.append(scores[0])

The two lines above go inside the for loop (for train, test in kfold.split(x, y):), and the lines below go after it.

print("\n\n Overall accuracy: " + str(np.average(acc_per_fold)))
print("Overall loss: " + str(np.average(loss_per_fold)))
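The bookkeeping around the loop can be sketched end to end. This is a minimal illustration under stated assumptions, not the code from the answer: random arrays stand in for the images and labels, and a majority-class guess stands in for model.fit and model.evaluate so that it runs without TensorFlow. The names fold_sizes and majority and the random_state=0 setting are made up for the example.

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in data (assumption): 20 grayscale 300x205 "images", 3 one-hot classes.
rng = np.random.default_rng(0)
x = rng.random((20, 300, 205, 1), dtype=np.float32)
y = np.eye(3, dtype=np.float32)[rng.integers(0, 3, size=20)]

num_folds = 5
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=0)

acc_per_fold = []  # must exist before the loop starts
fold_sizes = []

for fold_no, (train, test) in enumerate(kfold.split(x, y), start=1):
    # train and test are integer index arrays; the same indices select
    # matching rows from x and y, so images stay paired with their labels.
    x_train, y_train = x[train], y[train]
    x_test, y_test = x[test], y[test]

    # Stand-in for model.fit / model.evaluate: always predict the most
    # common training class and measure accuracy on the held-out fold.
    majority = y_train.sum(axis=0).argmax()
    acc = float((y_test.argmax(axis=1) == majority).mean())

    acc_per_fold.append(acc * 100)
    fold_sizes.append((len(train), len(test)))
    print(f"Fold {fold_no}: accuracy {acc_per_fold[-1]:.1f}%")

print("Fold sizes (train, test):", fold_sizes)
print("Overall accuracy:", np.mean(acc_per_fold))
```

With a real Keras model, the two stand-in lines would be replaced by building, compiling, fitting and evaluating the model inside the loop.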

Second answer

On this line

x, y = data.next()

This yields only a single batch of data, because batch_size=8. If you print len(data.next()[0]), you will see that the size is 8. We probably do not want to use just a single batch, so I first extracted all of the data into NumPy arrays.

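# Draw each batch exactly once so that images and labels stay paired,
# even when the iterator reshuffles between epochs (shuffle=True is the default)
batches = [train_generator.next() for _ in range(len(train_generator))]
x = np.concatenate([images for images, labels in batches])
y = np.concatenate([labels for images, labels in batches])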
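When the iterator shuffles (the default for flow_from_directory), images and labels have to come from the same next() call to stay paired. Below is a minimal sketch of gathering every batch in a single pass. FakeBatchIterator is a hypothetical stand-in written for this example, not the real Keras DirectoryIterator, so the snippet runs without TensorFlow or image files.

```python
import numpy as np

class FakeBatchIterator:
    """Hypothetical stand-in for a shuffling batch iterator (not a Keras class)."""

    def __init__(self, images, labels, batch_size, seed=0):
        self.images, self.labels = images, labels
        self.batch_size = batch_size
        self.rng = np.random.default_rng(seed)
        self.order = self.rng.permutation(len(images))
        self.pos = 0

    def __len__(self):
        # Number of batches per epoch, rounding up for a final partial batch.
        return -(-len(self.images) // self.batch_size)

    def next(self):
        if self.pos >= len(self.images):
            # Start a new epoch with a fresh shuffle.
            self.order = self.rng.permutation(len(self.images))
            self.pos = 0
        idx = self.order[self.pos:self.pos + self.batch_size]
        self.pos += self.batch_size
        return self.images[idx], self.labels[idx]

# "Image" i holds the value i and carries label i, so pairing is easy to check.
images = np.arange(20, dtype=np.float32).reshape(20, 1)
labels = np.arange(20)
gen = FakeBatchIterator(images, labels, batch_size=8)

# One pass over the epoch: each next() call supplies both images and labels.
batches = [gen.next() for _ in range(len(gen))]
x = np.concatenate([b[0] for b in batches])
y = np.concatenate([b[1] for b in batches])

print(x.shape, y.shape)
print("pairs aligned:", bool(np.all(x[:, 0] == y)))
```

Calling next() in two separate passes, one for images and one for labels, would cross an epoch boundary and could pair each image with a label from a different shuffle.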

Then I applied KFold's split:

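kf = KFold(n_splits=k, shuffle=True)  # k is the number of folds
for fold_no, (train, test) in enumerate(kf.split(x, y), start=1):
    print(f"Fold {fold_no}/{k}")
    # Rebuild and compile model here so that each fold starts from fresh weights
    history = model.fit(x[train], y[train], epochs=5, validation_data=(x[test], y[test]))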

@Jacqueline Thanks, your code helped me a lot.