outlier detection using 2D spatial information


I have a list of sensor measurements for air quality with geo-coordinates, and I would like to implement outlier detection. The list of sensors is relatively small (~50).

The air quality can gradually change with distance, but abrupt local spikes are likely outliers. If one sensor in a group of closely located sensors shows a higher value, it could be an outlier. If the same higher value is also shown by more distant sensors, it might be OK.

Of course, I can ignore the coordinates and do simple outlier detection assuming a normal distribution, but I was hoping to do something more sophisticated. What would be a good statistical way to model this and implement outlier detection?
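For reference, the simple baseline mentioned above (ignoring coordinates, assuming roughly normal data) can be sketched as a z-score rule; the function name and threshold are illustrative:

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold
```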


2 Answers

Answer 1

The observation in the question (a higher value at a single sensor within a tight cluster is suspect, while the same value shared by more distant sensors may be fine) indicates that sensors closer to each other tend to have more similar values.

Tobler’s first law of geography - “everything is related to everything else, but near things are more related than distant things”

You can quantify an answer to this question. The focus should not be on the locations and values of individual outlier sensors. Instead, use global spatial autocorrelation to measure the degree to which sensors that are near each other tend to be alike.

As a first step, you will need to define neighbors for each sensor.
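The standard global spatial autocorrelation statistic is Moran's I. A minimal NumPy sketch (illustrative, assuming distinct sensor coordinates and inverse-distance weights between all pairs):

```python
import numpy as np

def morans_i(coords, values):
    """Global Moran's I with inverse-distance weights (w_ii = 0).
    Values near +1 mean nearby sensors are alike; values near 0 mean
    no spatial structure."""
    coords = np.asarray(coords, dtype=float)
    x = np.asarray(values, dtype=float)
    n = len(x)
    # pairwise distances; inverse-distance weights, zero on the diagonal
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    w = np.zeros_like(d)
    off = ~np.eye(n, dtype=bool)
    w[off] = 1.0 / d[off]
    z = x - x.mean()
    return (n / w.sum()) * (z @ w @ z) / (z @ z)
```

A strongly positive value supports modeling a sensor's expected value from its neighbors; a value near zero suggests the coordinates carry little information.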

Answer 2

I'd calculate a cost function, consisting of two costs:

1: cost_neighbors: the deviation of the sensor's value from an expected value. The expected value is a weighted average of the other sensors' values, with weights decreasing with distance.

2: cost_previous_step: how much the sensor's value changed compared to the previous time step. A large change in value leads to a large cost.

Here is some pseudo code describing how to calculate the costs:

expected_value = ((value_neighbor_0 / distance_neighbor_0) + (value_neighbor_1 / distance_neighbor_1) + ...) / ((1 / distance_neighbor_0) + (1 / distance_neighbor_1) + ...)

cost_neighbors = abs(expected_value-value)

cost_previous_step = abs(value@t - value@t-1)

total_cost = a*cost_neighbors + b*cost_previous_step

a and b are parameters that can be tuned to give each cost more or less impact. The total cost is then used to decide whether a sensor value is an outlier: the larger the cost, the more likely the value is an outlier.
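The pseudo code above can be sketched in Python roughly as follows (illustrative; it uses inverse-distance weights over all other sensors and normalizes by the sum of the weights so the expectation is a true weighted average):

```python
import numpy as np

def outlier_cost(values_t, values_prev, coords, a=1.0, b=1.0):
    """Per-sensor cost = a * deviation from the inverse-distance-weighted
    average of the other sensors + b * absolute change since the
    previous time step. Assumes distinct sensor coordinates."""
    x = np.asarray(values_t, dtype=float)
    x_prev = np.asarray(values_prev, dtype=float)
    coords = np.asarray(coords, dtype=float)
    n = len(x)
    # pairwise distances; inverse-distance weights, zero on the diagonal
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    w = np.zeros((n, n))
    off = ~np.eye(n, dtype=bool)
    w[off] = 1.0 / d[off]
    expected = (w @ x) / w.sum(axis=1)  # weighted average of the others
    cost_neighbors = np.abs(expected - x)
    cost_previous = np.abs(x - x_prev)
    return a * cost_neighbors + b * cost_previous
```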

To tune the weights and judge performance, you can plot the two costs for some labeled data points whose outlier status you already know.

cost_neighbors

|           X
|          X   X
|         
|o    o
|o o   o
|___o_____________  cost_previous_step

X= outlier
o= non-outlier

You can now either set the threshold by hand or create a small dataset with the labels and costs, and apply any sort of classifier function (e.g. SVM).
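A minimal sketch of the classifier route, assuming scikit-learn is available; the labeled costs below are made up for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# hypothetical labeled costs: columns = (cost_neighbors, cost_previous_step)
X = np.array([[0.1, 0.2], [0.3, 0.1], [0.2, 0.4],   # non-outliers
              [2.5, 1.8], [3.0, 2.2], [2.8, 1.5]])  # outliers
y = np.array([0, 0, 0, 1, 1, 1])

# a linear SVM learns the separating threshold in cost space
clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([[0.2, 0.3], [2.7, 2.0]]))  # -> [0 1]
```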

If you use Python, an easy way to find neighbors and their distances is scipy.spatial.cKDTree.
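For example, a small illustrative query (k=3 includes each point's match with itself in column 0, which is then dropped):

```python
import numpy as np
from scipy.spatial import cKDTree

# hypothetical sensor coordinates
coords = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0]])
tree = cKDTree(coords)

# 2 nearest neighbors of each sensor
dist, idx = tree.query(coords, k=3)
dist, idx = dist[:, 1:], idx[:, 1:]  # drop the self-match in column 0
```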