Image augmentation with SMOTE oversampling as batches without running out of RAM


I am trying to train a neural network on an imbalanced image dataset, and I am using Colab. I found this code on Kaggle, which uses the Keras ImageDataGenerator for augmentation and SMOTE to oversample the data:

Augmentation:

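from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation settings
ZOOM = [.99, 1.01]
BRIGHT_RANGE = [0.8, 1.2]
HORZ_FLIP = True
FILL_MODE = "constant"
DATA_FORMAT = "channels_last"

# WORK_DIR (image folder) and DIM (target size) are not shown in the snippet
work_dr = ImageDataGenerator(rescale=1./255, brightness_range=BRIGHT_RANGE, zoom_range=ZOOM, data_format=DATA_FORMAT, fill_mode=FILL_MODE, horizontal_flip=HORZ_FLIP)

# batch_size is larger than the dataset, so one batch holds every image
train_data_gen = work_dr.flow_from_directory(directory=WORK_DIR, target_size=DIM, batch_size=6500, shuffle=False)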

The author then calls next() on the generator to load the images:

train_data, train_labels = next(train_data_gen)
print(train_data.shape, train_labels.shape)

This gives the following output:

(6400, 176, 176, 3) (6400, 4)

At this point it has already consumed about 70% of my RAM on Colab, not to mention the time taken to load the images. Note that the batch size is set to 6500, which is larger than the 6400 images in the dataset, so a single call to next() loads everything at once. If I set it to something like 32 or 64, only the first batch is loaded when I call next(). Then, to oversample the data, he uses SMOTE:

# Oversample the data, since the classes are imbalanced
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)

train_data, train_labels = sm.fit_resample(train_data.reshape(-1, IMG_SIZE * IMG_SIZE * 3), train_labels)

train_data = train_data.reshape(-1, IMG_SIZE, IMG_SIZE, 3)

print(train_data.shape, train_labels.shape)

This should give the following output:

(12800, 176, 176, 3) (12800, 4)

But instead it exhausts my memory and Colab crashes due to a RAM shortage. I am not very good at coding, so I am having difficulty implementing what I want. What I want is to feed batches of augmented and oversampled data to my neural network without loading the entire dataset at once, thus saving memory. Is there a way to do this? If so, could you please show me how?
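For a rough sense of scale, the array shapes quoted above can be turned into byte counts. This is only a back-of-the-envelope sketch: the helper name array_bytes is made up for illustration, and the element sizes are assumptions (4 bytes for float32, 8 bytes for float64), since the snippets do not show which dtype the arrays end up with. It also ignores any temporary copies made during resampling.

```python
from math import prod


def array_bytes(shape, bytes_per_element):
    """Return the raw size in bytes of a dense array with the given shape."""
    return prod(shape) * bytes_per_element


GIB = 2 ** 30

# Shapes taken from the outputs quoted in the question.
before = (6400, 176, 176, 3)    # images loaded by next()
after = (12800, 176, 176, 3)    # expected shape after SMOTE

for name, shape in [("before SMOTE", before), ("after SMOTE", after)]:
    for dtype_name, size in [("float32", 4), ("float64", 8)]:  # assumed dtypes
        gib = array_bytes(shape, size) / GIB
        print(f"{name:13s} {dtype_name}: {gib:.2f} GiB")
```

Whatever the dtype, the original and the resampled arrays both have to exist in memory for at least part of the resampling step, so the peak usage is higher than either figure alone.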

There is 1 answer below

I came across the same problem. You can copy the code into a notebook on Kaggle itself, where it runs very smoothly. Hope this helps!
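If moving to Kaggle is not an option, one direction that keeps memory low is to balance the classes while sampling file paths, and only load the images for one batch at a time. The sketch below is an assumption-heavy illustration, not the Kaggle author's method: it swaps SMOTE for plain random oversampling (minority images are repeated, no synthetic images are interpolated), and the function name balanced_batches and the file names are hypothetical.

```python
import random
from collections import defaultdict
from itertools import islice


def balanced_batches(samples, batch_size, seed=42):
    """Yield endless batches of (path, label) pairs with classes drawn uniformly.

    samples: list of (path, label) pairs; only these small tuples stay in memory.
    Minority classes are oversampled by repetition (random oversampling),
    which is NOT SMOTE.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append(path)
    labels = sorted(by_class)
    while True:
        batch = []
        for _ in range(batch_size):
            label = rng.choice(labels)          # pick a class uniformly
            path = rng.choice(by_class[label])  # then a file within that class
            batch.append((path, label))
        yield batch


# Hypothetical file list: class "a" is 9x more common than class "b".
samples = [(f"a_{i}.jpg", "a") for i in range(90)]
samples += [(f"b_{i}.jpg", "b") for i in range(10)]

batches = list(islice(balanced_batches(samples, batch_size=32), 100))
counts = defaultdict(int)
for batch in batches:
    for _, label in batch:
        counts[label] += 1
print(dict(counts))  # both classes appear in roughly equal numbers
```

Each batch of paths would then be loaded from disk and augmented before being passed to the network, so only batch_size images are in RAM at any moment. Because repeated minority images are augmented differently each time they are drawn, the network still sees some variation, although this is not equivalent to the interpolation that SMOTE performs.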