I was wondering whether there are existing solutions for "resumable" computations with numpy.
Let me explain: I have a folder with a large number of grayscale images over which I need to compute a sort of histogram using the numpy.unique function. My code looks like this:
from os import listdir
from os.path import isfile, join
import numpy as np
import matplotlib.image as img
import matplotlib.pyplot as plt
# storing all the images' names that need to be processed into a list:
work_dir = 'path/to/my/images'
images = [join(work_dir, f) for f in listdir(work_dir) if isfile(join(work_dir, f))]
# allocating array that will contain the images' data:
nz = len(images)
first_image = img.imread(images[0])
nx, ny = first_image.shape
volume = np.zeros((nx, ny, nz), first_image.dtype)
print(volume.shape, nx*ny*nz, volume.dtype)
# loading the images into the allocated array:
for i in range(nz):
    volume[:,:,i] = img.imread(images[i])
# computing the histogram as the number of occurrences of each unique value in volume:
values, counts = np.unique(volume, return_counts=True)
plt.plot(values, counts)
The problem is that my computer doesn't have enough RAM to hold the volume, values, and counts arrays at the same time.
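To give a rough sense of scale (these dimensions are hypothetical, not my actual data):

# back-of-the-envelope estimate for the volume array alone,
# assuming 2000 images of 2048 x 2048 uint16 pixels:
nx, ny, nz = 2048, 2048, 2000
bytes_needed = nx * ny * nz * 2     # 2 bytes per uint16 pixel
print(bytes_needed / 1024**3)       # ~15.6 GiB, before values and counts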
So is there an already existing solution that would look like this:
from os import listdir
from os.path import isfile, join
import numpy as np
import matplotlib.image as img
import matplotlib.pyplot as plt
# storing all the images' names that need to be processed into a list:
work_dir = 'path/to/my/images'
images = [join(work_dir, f) for f in listdir(work_dir) if isfile(join(work_dir, f))]
# computing the histogram as the number of occurrences of each unique value in the first image:
values, counts = np.unique(img.imread(images[0]), return_counts=True)
# updating values and counts to include data from the other images:
for i in range(1, len(images)):
    old_values, old_counts = values, counts
    values, counts = update_unique(img.imread(images[i]), old_values, old_counts, return_counts=True)
plt.plot(values, counts)
I would rather avoid having to implement something myself because of time constraints. I am also open to alternatives that do not use numpy or even Python.
I've had a little free time, so I tried to figure out how to do this on my own. I'm posting it here in case someone is interested in doing something similar. I believe my solution is general enough to be reused for other tasks that need to aggregate results from several separate computations while resolving duplicates.
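Here is a minimal sketch of the merge step (merge_unique and update_unique are just illustrative helper names, not an existing numpy API; np.unique and np.add.at are the only numpy functions involved):

import numpy as np

def merge_unique(values_a, counts_a, values_b, counts_b):
    # Concatenate both (values, counts) pairs, then collapse duplicate
    # values by summing their counts at the inverse indices of np.unique.
    values = np.concatenate([values_a, values_b])
    counts = np.concatenate([counts_a, counts_b])
    merged_values, inverse = np.unique(values, return_inverse=True)
    merged_counts = np.zeros(len(merged_values), dtype=counts.dtype)
    np.add.at(merged_counts, inverse, counts)
    return merged_values, merged_counts

def update_unique(image, old_values, old_counts):
    # Histogram a single image, then fold it into the running totals.
    new_values, new_counts = np.unique(image, return_counts=True)
    return merge_unique(old_values, old_counts, new_values, new_counts)

With this, only one image plus the (small) running values and counts arrays need to be in memory at any time:

values, counts = np.unique(img.imread(images[0]), return_counts=True)
for i in range(1, len(images)):
    values, counts = update_unique(img.imread(images[i]), values, counts)
plt.plot(values, counts)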