Keras data generator: model.predict returns the same values repeated


I have implemented a CNN-based regression model that uses a data generator to handle the huge amount of data I have. Training and evaluation work well, but there is an issue with prediction. For example, if I want to predict values for a test dataset of 50 samples, I call model.predict with a batch size of 5. The problem is that model.predict returns the same 5 values repeated 10 times, instead of 50 different values. The same thing happens if I change the batch size to 1: it returns one value 50 times.

To work around this, I used a full batch size (50 in my example), and it worked. But I can't use this approach on my whole test set because it is far too large.

Do you have any other solution, or what is the problem in my approach?

My data generator code:

import numpy as np
import keras

class DataGenerator(keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, list_IDs, data_X, data_Z, target_y, batch_size=32, dim1=(120,120),
                 dim2 = 80, n_channels=1, shuffle=True):
        'Initialization'
        self.dim1 = dim1
        self.dim2 = dim2
        self.batch_size = batch_size
        self.data_X = data_X
        self.data_Z = data_Z
        self.target_y = target_y
        self.list_IDs = list_IDs
        self.n_channels = n_channels
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(len(self.list_IDs) / self.batch_size))

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in range(len(indexes))]

        # Generate data
        ([X, Z], y) = self.__data_generation(list_IDs_temp)

        return ([X, Z], y)

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.list_IDs))
        if self.shuffle == True:
            np.random.shuffle(self.indexes)

    def __data_generation(self, list_IDs_temp):
        'Generates data containing batch_size samples' # X : (n_samples, *dim, n_channels)
        # Initialization
        X = np.empty((self.batch_size, *self.dim1, self.n_channels))
        Z = np.empty((self.batch_size, self.dim2))
        y = np.empty((self.batch_size))

        # Generate data
        for i, ID in enumerate(list_IDs_temp):
            # Store sample
            X[i,] = np.load('data/' + self.data_X + ID + '.npy')
            Z[i,] = np.load('data/' + self.data_Z + ID + '.npy')

            # Store target
            y[i] = np.load('data/' + self.target_y + ID + '.npy')

        return ([X, Z], y)

How I call model.predict():

predict_params = {'data_X': 'images',
                  'data_Z': 'factors',
                  'target_y': 'True_values',
                  'batch_size': 5,
                  'dim1': (120,120),
                  'dim2': 80,
                  'n_channels': 1,
                  'shuffle': False}

# Prediction generator
prediction_generator = DataGenerator(test_index, **predict_params)

prediction_results = model.predict(prediction_generator, steps = 1, verbose=1)

3 Answers

Accepted Answer

If we look at your __getitem__ function, we can see this code:

        list_IDs_temp = [self.list_IDs[k] for k in range(len(indexes))]

This line always returns the same IDs, because len(indexes) is the same for every batch (at least as long as all batches contain an equal number of samples), so range(len(indexes)) just loops over the first few indexes every time, regardless of which batch is requested.
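
To make the failure mode concrete, here is a minimal standalone sketch (with made-up IDs) showing that the list comprehension ignores index entirely and yields the first batch over and over:

import numpy as np

list_IDs = ['id0', 'id1', 'id2', 'id3', 'id4',
            'id5', 'id6', 'id7', 'id8', 'id9']    # hypothetical IDs
all_indexes = np.arange(len(list_IDs))
batch_size = 5

for index in range(2):                            # two different batches
    indexes = all_indexes[index*batch_size:(index+1)*batch_size]
    # Buggy line: range(len(indexes)) is always range(5)
    list_IDs_temp = [list_IDs[k] for k in range(len(indexes))]
    print(index, list_IDs_temp)

# 0 ['id0', 'id1', 'id2', 'id3', 'id4']
# 1 ['id0', 'id1', 'id2', 'id3', 'id4']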

You are already extracting the indexes of the current batch on the line above; the bug is that the list comprehension then ignores them. Iterate over those indexes instead of range(len(indexes)). The following code should work:

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        # Map the batch's indexes to the actual IDs
        list_IDs_temp = [self.list_IDs[k] for k in indexes]

        # Generate data
        ([X, Z], y) = self.__data_generation(list_IDs_temp)

        return ([X, Z], y)

Check whether this code gives you different results. Note that the predictions themselves may still be poor: if you trained with this same generator, the model will effectively have seen only those same few data points during training as well.
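
As a quick sanity check (a sketch, assuming a 50-sample test set and the fixed generator with shuffle=False), you can confirm the repetition is gone by counting distinct predictions:

import numpy as np

preds = model.predict(prediction_generator, verbose=1)
print(preds.shape)                           # expect 50 predictions
print(len(np.unique(np.round(preds, 6))))    # should be close to 50, not 5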

Answer 2

You also need to change steps so that model.predict covers your whole test set: with steps=1 you only get one batch of predictions. Set it to the number of batches. Try:

# Assuming test_index is a list
prediction_results = model.predict(prediction_generator,
                                   steps=len(test_index) // predict_params['batch_size'],
                                   verbose=1)
Answer 3

When you use a generator, you specify a batch size, and model.predict produces batch_size predictions per step. If you set steps=1, that is all the predictions you will get. To set steps, divide the number of samples you have by the batch size: for example, with 50 images and a batch size of 5, set steps to 10. Ideally you want to go through your test set exactly once.

The code below determines a batch size and step count that do exactly that. b_max is a value you select to cap the batch size; set it based on your memory size to avoid an OOM (out of memory) error. length is the number of test samples you have.

length = 500   # number of test samples
b_max = 80     # maximum batch size your memory allows

# Largest divisor of length that does not exceed b_max
batch_size = sorted([int(length/n) for n in range(1, length+1)
                     if length % n == 0 and length/n <= b_max], reverse=True)[0]
steps = int(length/batch_size)

The result is batch_size=50 and steps=10. Note that if length is a prime number, the result will be batch_size=1 and steps=length.
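
Wrapped as a small helper (a sketch; the helper name is mine), both the divisor case and the prime fallback are easy to check:

def fit_batch(length, b_max):
    'Largest divisor of length that is <= b_max, and the matching step count'
    batch_size = max(n for n in range(1, b_max + 1) if length % n == 0)
    return batch_size, length // batch_size

print(fit_batch(500, 80))   # (50, 10)
print(fit_batch(499, 80))   # (1, 499) -- 499 is prime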