Consensus/Cluster a set of variable length lists in Python?


I have a set of sensors measuring some temporal data. At any given time step, each sensor outputs either a 0 or a 1. A sensor will never output two 1s in consecutive time steps.

How can we find a best estimate of the true event times, given the available sensors?

For example, say four sensors each output a 1 at the following time indices.

A = [ 178,  511,  843, 1180, 1512, 1733]
B = [ 514,  846, 1182, 1515, 1736, 1937]
C = [ 182,  516,  848, 1517, 1738, 1939]
D = [ 179,  513,  845, 1181, 1513, 1735, 1936, 2124]

From visual inspection, I can see that:

  • A lost a value at the tail of the list
  • B lost a value at the head of the list
  • C lost a value in the middle of the list
  • D has an extra value at the tail of the list
# the None locations are not known to the consensus algorithm
a = [  178,  511,  843, 1180, 1512, 1733, None]
b = [ None,  514,  846, 1182, 1515, 1736, 1937]
c = [  182,  516,  848, None, 1517, 1738, 1939]
d = [  179,  513,  845, 1181, 1513, 1735, 1936] # 2124 removed

# Consensus: Average over columns with `None` removed
# rounded to the nearest integer
s = consensus((A,B,C,D))
s = [  180,  514,  846, 1181, 1514, 1736, 1937]
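
For concreteness, the column-averaging step described in the comments above might look like the sketch below, using the hand-aligned lists a–d from the previous block and assuming the readings have already been aligned into equal-length rows with `None` marking missing entries (that alignment is the hard part of the question). The helper name `column_consensus` is mine, purely for illustration.

from statistics import mean

def column_consensus(rows):
    # Average each column, skipping None, rounded to the nearest integer.
    return [round(mean(v for v in col if v is not None)) for col in zip(*rows)]

print(column_consensus([a, b, c, d]))
# [180, 514, 846, 1181, 1514, 1736, 1937]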

If we were to have two additional sensors E and F with the following values:

E = [ 2130 ]
F = [ 2121 ]
# these two sensors only have the one tail value
# therefore sensor D's extra reading is now part of consensus.
# All other values are unchanged.
s = consensus((A,B,C,D,E,F))
s = [  180,  514,  846, 1181, 1514, 1736, 1937, 2125]

Is there a solution to this problem that is not O(n^2)?


BEST ANSWER

Thank you to the two users in the comments who were able to lead me to a working solution.

EDIT: I hesitate to mark this as the definitive answer, because we lose the information that each reading came from a distinct sensor when we concatenate all the readings into one array. An iterative or dynamic programming approach that keeps track of the distance to the nearest value per sensor may also work (a rough sketch of that idea is at the end of this answer).

import numpy as np
from matplotlib import pyplot as plt
from sklearn.neighbors import KernelDensity
from scipy.signal import find_peaks

# Pool every reading from every sensor into a single 1-D sample.
concat = A + B + C + D
X = np.array(concat)[:, np.newaxis]

# Evaluation grid covering the full range of readings.
X_plot = np.linspace(0, 1.1 * X.max(), 1000)[:, np.newaxis]

# Kernel density estimate of the pooled readings: each group of
# nearby readings shows up as a peak in the estimated density.
kde = KernelDensity(bandwidth=2).fit(X)
log_dens = kde.score_samples(X_plot)
dens = np.exp(log_dens)

# The locations of the density peaks are the consensus estimates.
peaks, _ = find_peaks(dens)

plt.plot(X_plot[:, 0], dens)
plt.plot(X_plot[peaks], dens[peaks], "X")
plt.show()

print(tuple(int(i) for i in X_plot[peaks].squeeze()))
# (180, 514, 846, 1181, 1513, 1735, 1936, 2123)

[Plot: the estimated density over the pooled readings, with the detected peaks marked.]
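
For reference, here is a minimal sketch of the iterative idea mentioned in the edit above: sort all readings from all sensors, greedily start a new cluster whenever the gap to the previous reading exceeds a tolerance, and average each cluster. The function name merge_consensus and the tol threshold are illustrative choices of mine, not part of any library, and the sketch does not enforce that each sensor contributes at most one reading per cluster.

def merge_consensus(sensors, tol=50):
    # Pool and sort every reading from every sensor.
    readings = sorted(v for s in sensors for v in s)
    # Start a new cluster whenever the gap to the previous reading exceeds tol.
    clusters = [[readings[0]]]
    for v in readings[1:]:
        if v - clusters[-1][-1] <= tol:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    # The consensus is the rounded mean of each cluster.
    return [round(sum(c) / len(c)) for c in clusters]

print(merge_consensus([A, B, C, D, E, F]))
# [180, 514, 846, 1181, 1514, 1736, 1937, 2125] with the data above

Sorting dominates, so this runs in O(n log n) in the total number of readings, which addresses the O(n^2) concern in the question at the cost of having to pick a suitable tol.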