How to pair matrices that are approximately the same in another numpy array

Background

I have the following code that works like a charm and is used to make pairs for a Siamese network:

import numpy as np

def make_pairs(images, labels):

    # initialize two empty lists to hold the (image, image) pairs and
    # labels to indicate if a pair is positive or negative
    pairImages = []
    pairLabels = []

    # calculate the total number of classes present in the dataset
    # and then build a list of indexes for each class label that
    # provides the indexes for all examples with a given label.
    # np.unique finds all unique class labels in our labels list, and
    # taking the len of its output yields the total number of unique class labels.
    # In the case of the MNIST dataset, there are 10 unique class labels,
    # corresponding to the digits 0-9.
    numClasses = len(np.unique(labels))

    # idx holds, for each class, the indexes of the examples with that label
    idx = [np.where(labels == i)[0] for i in range(0, numClasses)]

    # let's now start generating our positive and negative pairs
    for idxA in range(len(images)):

        # grab the current image and label belonging to the current
        # iteration
        currentImage = images[idxA]
        label = labels[idxA]

        # randomly pick an image that belongs to the *same* class
        # label
        idxB = np.random.choice(idx[label])
        posImage = images[idxB]

        # prepare a positive pair and update the images and labels
        # lists, respectively
        pairImages.append([currentImage, posImage])
        pairLabels.append([1])

        # grab the indices for each of the class labels *not* equal to
        # the current label and randomly pick an image corresponding
        # to a label *not* equal to the current label
        negIdx = np.where(labels != label)[0]
        negImage = images[np.random.choice(negIdx)]

        # prepare a negative pair of images and update our lists
        pairImages.append([currentImage, negImage])
        pairLabels.append([0])

    # return a 2-tuple of our image pairs and labels
    return (np.array(pairImages), np.array(pairLabels))

This code builds pairs for each image in the MNIST dataset. For each image, it builds one positive pair by randomly selecting another image of the same class (label), and one negative pair by randomly selecting an image of a different class (label). Running the code gives the following shapes for the two returned arrays:

# load the MNIST dataset (the step that scales pixel values to [0, 1] is not shown in this excerpt)
print("[INFO] loading MNIST dataset...")
(trainX, trainY), (testX, testY) = mnist.load_data()

# build the positive and negative image pairs
print("[INFO] preparing positive and negative pairs...")
(pairTrain, labelTrain) = make_pairs(trainX, trainY)
(pairTest, labelTest) = make_pairs(testX, testY)

>>> print(pairTrain.shape)
(120000, 2, 28, 28)
>>> print(labelTrain.shape)
(120000, 1)

My Dataset

I want to do something a little different with another dataset. Suppose that I have another dataset of 5600 RGB images with 28x28x3 dimensions, as can be seen below:

>>> images2.shape
(5600, 28, 28, 3)

I have another array, let's call it labels2, that holds one of 8 labels for each of the 5600 images, with 700 images per label, as can be seen below:

>>> labels2.shape
(5600,)

>>> len(np.unique(labels2))
8

>>> (labels2==0).sum()
700
>>> (labels2==1).sum()
700
>>> (labels2==2).sum()
700
...

What I want to do

My dataset is not an MNIST dataset, so the images from the same class are not so similar. I would like to build pairs that are approximately the same in the following manner:

  1. For each image in my dataset, I want to do the following:

    1.1. Calculate the similarity through MSE between that image and all the others in the dataset.

    1.2 For the set of MSEs of images with the same label as that image, select the images with the 7 smallest MSEs and build 7 pairs, containing that image plus the 7 closest MSE images. These pairs represent the images from the same class for my Siamese Network.

    1.3 For the set of MSEs of images with labels different from that image's label, select, for each different label, only the one image with the smallest MSE. Since there are 7 labels different from the label of that image, this gives 7 more pairs for that image.

As my dataset has 5600 images of size 28x28x3 and I build 14 pairs for each image (7 of the same class and 7 of different classes), I am expecting a pairTrain matrix of size (78400, 2, 28, 28, 3).

What I did

I have the following code that does exactly what I want:

def make_pairs(images, labels):

    # initialize two empty lists to hold the (image, image) pairs and
    # labels to indicate if a pair is positive or negative
    pairImages = []
    pairLabels = []

    # In my dataset, there are 8 unique class labels, corresponding to the classes 0-7.
    numClasses = len(np.unique(labels))

    # progress counter
    k = 0

    # let's now start generating our positive and negative pairs for each image in the dataset
    for idxA in range(len(images)):
        print("Image "+str(k)+ " out of " +str(len(images)))
        k=k+1

        # For each image, I need to store the MSE between it and all the others
        mse_all=[]

        # Get each image and its label
        currentImage = images[idxA]
        label = labels[idxA]

        # Now we need to iterate through all the other images
        for idxB in range(len(images)):
            candidateImage = images[idxB]
            # Calculate the mse and store all mses
            mse=np.mean(candidateImage - currentImage)**2
            mse_all.append(mse)

        mse_all=np.array(mse_all)

        # When we have finished calculating the mse between the currentImage and all the others,
        # let's add 7 pairs that have the smallest mse in the case of images from the
        # same class and 1 pair for each different class

        # For all classes, do
        for i in range(0,numClasses):

            # get indices of images for that class
            idxs=[np.where(labels == i)[0]]

            # Get images of that class
            imgs=images[np.array(idxs)]
            imgs=np.squeeze(imgs, axis=0)

            # get MSEs between the currentImage and all the others of that class
            mse_that_class=mse_all[np.array(idxs)]
            mse_that_class=np.squeeze(mse_that_class, axis=0)

            # if the class is the same class of that image
            if i==label:
                # Get indices of that class that have the 7 smallest MSEs
                indices_sorted = np.argpartition(mse_that_class, numClasses-1)

            else:
                # Otherwise, get only the smallest MSE
                indices_sorted = np.argpartition(mse_that_class, 1)

            # Now, let's pair them
            for j in range(0,indices_sorted.shape[0]):

                image_to_pair=imgs[indices_sorted[j], :, :, :]
                pairImages.append([currentImage, image_to_pair])

                if i==label:
                    pairLabels.append([1])
                else:
                    pairLabels.append([0])
        del image_to_pair, currentImage, label, mse_that_class, imgs, indices_sorted, idxs, mse_all
    return (np.array(pairImages), np.array(pairLabels))

My problem

The problem with my code is that it simply freezes my computer when it reaches the pair construction for image number 2200. I tried to clean up the variables after each loop iteration, as you can see in the code above (del image_to_pair, currentImage, label, mse_that_class, imgs, indices_sorted, idxs, mse_all). The puzzling part is that a (120000, 2, 28, 28) pairImages matrix was built without difficulty, but a (78400, 2, 28, 28, 3) one was not. So:

  1. Is this a possible memory problem?
  2. Can I clean up more variables in my code in order to make it work?
  3. Should I drop the last dimension of my pairImages matrix so that it is smaller than in the first example and therefore works?
  4. Is there an easier way to solve my problem?

You can find the functional code and input matrices HERE

There are 3 best solutions below

Answer 1 (best answer)

You can try running gc.collect() at the start of each loop iteration to force a garbage-collection pass. Be aware, though, that in CPython most objects are freed as soon as their reference count drops to zero, and gc.collect() mainly reclaims reference cycles, so it may not help much here. It's also not clear to me that your current del statement is doing what you want it to: del only removes the names and decrements the refcounts, so the image data is not freed as long as something else, such as the pairImages list, still references it.

78400 * 2 * 28 * 28 * 3 = 368,793,600 elements, which must be multiplied by the size of each element in bytes, so this does look like a memory issue to me. My guess is that the freezing is the computer switching from RAM to a swap file on the drive, and using a swap file this intensively will make any computer grind to a halt.
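To make the element count concrete, here is a small sketch that converts it into bytes for a few common dtypes. The dtype of the asker's arrays is not stated in the question, so the three dtypes below are only illustrative assumptions.

```python
import numpy as np

# Shape of the pair array the question expects to build.
shape = (78400, 2, 28, 28, 3)
n_elements = int(np.prod(shape, dtype=np.int64))

# Size in GB (10**9 bytes) for a few candidate dtypes (assumed, not from the question).
sizes_gb = {
    np.dtype(dt).name: n_elements * np.dtype(dt).itemsize / 1e9
    for dt in (np.uint8, np.float32, np.float64)
}

for name, gb in sizes_gb.items():
    print(f"{name:>8}: {gb:.2f} GB")
```

So the final array alone ranges from roughly 0.37 GB as uint8 to roughly 2.95 GB as float64, before counting any intermediate lists or copies.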

Your images should also be loaded one at a time via a generator instead of packed into an array.

import gc
gc.collect()
filenames = ["a.jpg", "b.jpg"]
labels = ["a", "b"]

def image_loader(filenames):  # a generator function: calling it returns a generator
    for f in filenames:
        gc.collect()  # make super sure we're freeing the memory from the last image
        image = load_or_something(f)  # placeholder for your own image-loading code
        yield image

# make_pairs would have to be adapted to iterate over images instead of indexing them
make_pairs(image_loader(filenames), labels)

Generators behave like lists in for loops, with the difference that each item is produced on demand instead of being held in memory all at once. Unlike a list, a generator can be iterated only once and supports neither len() nor indexing, so make_pairs would need to be adapted to work with one. (It's a bit technical, but tl;dr: it's a list-maker-thing that only loads the images on the fly.)
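A minimal sketch of how a generator differs from a list in practice. The numbers() function is a hypothetical stand-in for an image loader and is not part of the original code.

```python
def numbers():
    """Tiny generator function standing in for an image loader (hypothetical)."""
    for i in range(3):
        yield i

gen = numbers()
first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # the generator is already exhausted

# Generators have no length, so code that calls len(images) needs a list or array.
try:
    len(numbers())
    has_len = True
except TypeError:
    has_len = False

print(first_pass, second_pass, has_len)
```

This matters for the code in the question, which both calls len(images) and loops over the images many times, so it cannot consume a generator as-is.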

Answer 2

I believe you can simplify this part of your code, which should help with the run time as well.

#Now we need to iterate through all the other images    
for idxB in range(len(images)):
    candidateImage = images[idxB]
    #Calculate the mse and store all mses
    mse=np.mean(candidateImage - currentImage)**2
    mse_all.append(mse)

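Instead of iterating through your data with a for loop, you can do the following and let NumPy broadcasting do the work. Note that this computes the mean of the squared differences, which is the usual definition of MSE, whereas np.mean(...)**2 in the loop above squares the mean difference.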

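# assuming images is a numpy array with shape (5600, 28, 28, 3)
# if images is uint8, cast it first (e.g. images.astype(np.float32)) so the subtraction does not wrap around
mse_all = np.mean( ((images - currentImage)**2).reshape(images.shape[0],-1), axis=1 )
# mse_all.shape is (5600,)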
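
As a sanity check on the broadcasting approach, the sketch below compares it with an explicit loop on a small random stand-in dataset (an assumption, since the real images are not available here). It also shows how the mean of squared differences differs from the squared mean difference, and why uint8 data should be cast before subtracting.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-in for the real dataset (assumed uint8 pixel values).
images = rng.integers(0, 256, size=(20, 28, 28, 3), dtype=np.uint8)

# Cast to float first: uint8 subtraction wraps around instead of going negative.
wrapped = np.array([3], dtype=np.uint8) - np.array([5], dtype=np.uint8)  # array([254])

as_float = images.astype(np.float64)
current = as_float[0]

# Explicit loop: mean of squared differences for every image.
mse_loop = np.array([np.mean((img - current) ** 2) for img in as_float])

# Broadcasting version: same quantity without a Python loop.
mse_vec = np.mean(((as_float - current) ** 2).reshape(len(as_float), -1), axis=1)

# What np.mean(diff)**2 computes instead: the squared mean difference.
squared_mean = np.mean(as_float - current, axis=(1, 2, 3)) ** 2

print(np.allclose(mse_loop, mse_vec))
```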

Answer 3

Some possible issues and optimizations

Apart from forcing the garbage collector to free unused memory (which, from my attempts, does not seem to resolve your problem), I think there are other issues in your code, unless I have misunderstood what is happening.

By looking at the following snippet:

# Now add the most similar images to that block: 7 from the same class and 1 from each of the other classes.
   for j in range(0,indices_sorted.shape[0]):

It seems that you are iterating j over range(0, indices_sorted.shape[0]), where indices_sorted.shape[0] is always 700, and I'm not sure that this is what you want. I think you only need j from 0 to 6. In addition, the pixel values in the images never exceed 255, if I understood correctly. If so, there is a second optimization that could be added: you could force the uint8 dtype. In short, I think you could refactor to something similar to this:

for i in range(0,numClasses):
    # ...
    # ...
    #print(indices_sorted.shape) # -> (700,)
    #print(indices_sorted.shape[0]) # -> always 700! So why range(0, 700)?
    # Now add the most similar images to that block: 7 from the same class and 1 from each of the other classes.
    # Note: range(0, 7) keeps 7 images for every class; use 1 when i != label to match the question exactly.
    for j in range(0, 7):
        image_to_pair=np.array(imgs[indices_sorted[j], :, :, :], dtype=np.uint8)
        pairImages.append([currentImage, image_to_pair])
        
        if i==label:
            pairLabels.append([1])
        else:
            pairLabels.append([0])
    del imgs, idxs
# When we have finished finding the pairs for that image, let's clean up (requires import gc)
del image_to_pair, currentImage, label, mse_that_class, indices_sorted, mse_all
gc.collect()
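
The reason indices_sorted always has 700 entries is that np.argpartition returns a full-length index array, and only its first k entries are guaranteed to point at the k smallest values. A minimal sketch with made-up MSE values (the numbers are illustrative only):

```python
import numpy as np

# Made-up MSE values for one class (illustrative only).
mse_that_class = np.array([9.0, 1.0, 7.0, 3.0, 8.0, 2.0, 6.0, 4.0, 5.0, 0.5])
k = 3

# argpartition returns an index for every element, not just the k smallest.
part = np.argpartition(mse_that_class, k)

# Only the first k entries are the k smallest, in no particular order.
smallest_k = part[:k]

# Sort those k entries if the order matters.
smallest_k_sorted = smallest_k[np.argsort(mse_that_class[smallest_k])]

print(part.shape, smallest_k_sorted)
```

Looping over the whole result, as the original inner loop does, therefore pairs the current image with every image of the class rather than with the closest few.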

A simple and intuitive hint

While running some tests, I saw that if you comment out the line

    pairImages.append([currentImage, image_to_pair])

the memory footprint is almost zero.

Other notes

As an additional note, I moved the del of imgs and idxs inside the for loop over i. The major improvement, however, seems to come from forcing the correct type:

    image_to_pair=np.array(imgs[indices_sorted[j], :, :, :], dtype=np.uint8)
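
One possible contributor to this improvement, offered as a guess rather than something measured here: indexing with an integer and slices returns a view that keeps the whole parent array alive, whereas np.array(...) makes an independent copy by default. A minimal sketch with a zero-filled stand-in for imgs:

```python
import numpy as np

# Zero-filled stand-in for the 700 images of one class.
imgs = np.zeros((700, 28, 28, 3), dtype=np.uint8)

# Basic indexing returns a view that references the whole parent array.
view = imgs[5, :, :, :]

# np.array copies by default, so this owns only its own 28*28*3 bytes.
copy = np.array(imgs[5, :, :, :], dtype=np.uint8)

print(view.base is imgs, copy.base is None)
print(imgs.nbytes, copy.nbytes)
```

If a list holds views like this, deleting the name imgs does not release the 700-image buffer, because each view still references it.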

The results

According to my test, with the original code the memory usage at k = 100 and k = 200 is 642351104 bytes and 783355904 bytes respectively, so memory usage grows by about 134.5 MB over 100 iterations. After applying the above modifications, the figures are 345239552 bytes and 362336256 bytes, an increase of only about 16.3 MB (taking 1 MB as 2^20 bytes).
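
Putting the suggestions from the answers together, here is one possible sketch of the pairing described in the question. It is not the asker's or any answerer's code. The function name make_pairs_mse and the n_pos parameter are hypothetical, and it assumes that an image should never be paired with itself and that every class has more than n_pos images. The output is preallocated in the input dtype instead of being grown as a Python list.

```python
import numpy as np


def make_pairs_mse(images, labels, n_pos=7):
    """Pair each image with its n_pos nearest same-class images and with the
    single nearest image of every other class, using MSE as the distance."""
    classes = np.unique(labels)
    class_idx = {c: np.where(labels == c)[0] for c in classes}
    per_image = n_pos + len(classes) - 1
    total = len(images) * per_image

    # Preallocate the outputs instead of appending to lists.
    pairs = np.empty((total, 2) + images.shape[1:], dtype=images.dtype)
    pair_labels = np.empty((total, 1), dtype=np.uint8)

    # Float copy used only for distances (avoids uint8 wraparound).
    flat = images.reshape(len(images), -1).astype(np.float32)

    row = 0
    for a in range(len(images)):
        mse = np.mean((flat - flat[a]) ** 2, axis=1)
        mse[a] = np.inf  # assumption: never pair an image with itself
        for c in classes:
            idx = class_idx[c]
            same = c == labels[a]
            k = n_pos if same else 1
            # Indices (into images) of the k smallest MSEs within class c.
            nearest = idx[np.argpartition(mse[idx], k - 1)[:k]]
            pairs[row:row + k, 0] = images[a]
            pairs[row:row + k, 1] = images[nearest]
            pair_labels[row:row + k, 0] = int(same)
            row += k
    return pairs, pair_labels


# Toy usage: 4 classes with 5 random images each, 2 positives per image.
rng = np.random.default_rng(0)
toy_images = rng.integers(0, 256, size=(20, 4, 4, 3), dtype=np.uint8)
toy_labels = np.repeat(np.arange(4), 5)
toy_pairs, toy_pair_labels = make_pairs_mse(toy_images, toy_labels, n_pos=2)
print(toy_pairs.shape, toy_pair_labels.shape)
```

With 5600 uint8 images in 8 classes and n_pos=7, the same function would return arrays of shape (78400, 2, 28, 28, 3) and (78400, 1), matching the size expected in the question.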