I have a 3D array (z, y, x) with shape=(92, 4800, 4800) where each value along axis 0 represents a different point in time. The acquisition of values in the time domain failed in a few instances, causing some values to be np.NaN. In other instances no values were acquired at all and all values along z are np.NaN.

What is the most efficient way to use linear interpolation to fill np.NaN along axis 0, disregarding instances where all values are np.NaN?
Here is a working example of what I'm doing that employs the pandas wrapper to scipy.interpolate.interp1d. On the original dataset this takes around 2 seconds per slice, meaning the whole array is processed in about 2.6 hours. The reduced-size example dataset below takes around 9.5 seconds.
import numpy as np
import pandas as pd
# create example data, original is (92, 4800, 4800)
test_arr = np.random.randint(low=-10000, high=10000, size=(92, 480, 480))
test_arr[1:90:7, :, :] = -32768 # NaN fill value in original data
test_arr[:, 1:90:6, 1:90:8] = -32768
def interpolate_nan(arr, method="linear", limit=3):
    """Return array interpolated along time axis to fill missing values."""
    result = np.zeros_like(arr, dtype=np.int16)
    for i in range(arr.shape[1]):
        # slice along y axis, interpolate with pandas wrapper to interp1d
        line_stack = pd.DataFrame(data=arr[:, i, :], dtype=np.float32)
        # -32768 is the NaN fill value in the original data
        line_stack.replace(to_replace=-32768, value=np.NaN, inplace=True)
        line_stack.interpolate(method=method, axis=0, inplace=True, limit=limit)
        line_stack.replace(to_replace=np.NaN, value=-32768, inplace=True)
        result[:, i, :] = line_stack.values.astype(np.int16)
    return result
Performance on my machine with the example dataset:
%timeit interpolate_nan(test_arr)
1 loops, best of 3: 9.51 s per loop
Edit:
I should clarify that the code produces my expected outcome. The question is: how can I optimize this process?
I recently solved this problem for my particular use case with the help of numba, and also did a little writeup on it. The result is about 20 times faster than my initial code.
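The writeup has the full details; as a rough illustration of the approach, here is a minimal sketch of a numba kernel for linear interpolation along axis 0. This is not the code from the writeup: the function name interp_nan_axis0, the interior-gap-only filling (trailing NaNs are left as-is, unlike pandas, which repeats the last valid value), and the simplified limit handling (fill at most limit consecutive NaNs at the start of each gap) are assumptions of this sketch. It expects a float array with np.NaN already substituted for the fill value; all-NaN pixels are skipped automatically because they contain no bounding valid samples.

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def interp_nan_axis0(arr, limit=3):
    """Fill interior NaN gaps along axis 0 by linear interpolation (sketch)."""
    nz, ny, nx = arr.shape
    out = arr.copy()
    for i in prange(ny):
        for j in range(nx):
            last_valid = -1  # index of last non-NaN sample seen
            for k in range(nz):
                if not np.isnan(arr[k, i, j]):
                    if last_valid >= 0 and k - last_valid > 1:
                        # NaN gap bounded by two valid samples: fill it
                        lo = arr[last_valid, i, j]
                        step = (arr[k, i, j] - lo) / (k - last_valid)
                        stop = min(k, last_valid + 1 + limit)
                        for m in range(last_valid + 1, stop):
                            out[m, i, j] = lo + step * (m - last_valid)
                    last_valid = k
    return out

Usage with the example data, converting the sentinel to np.NaN first:

data = test_arr.astype(np.float32)
data[data == -32768] = np.NaN
filled = interp_nan_axis0(data, limit=3)

With parallel=True, prange distributes the outer y loop across cores, which is likely where most of the speedup over the per-slice pandas calls comes from; the rest comes from avoiding the per-slice DataFrame construction and replace passes.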