Geospatial to count NN within Radius - in Big Data

620 Views Asked by At

I have successfully done this in python using cKDTree and multiprocessing over 20+ threads, but it really hits a wall when trying to scale it up to millions of rows. I'm wondering how someone could do this in Hive or Spark.

1 dataset is a dataset of all customers and their GPS coords. Another dataset is of all fire hydrants and their GPS coords. Like so:

tblCustomer (100MM rows)

+-------------+-------------------+--------------------+
| customer_id | customer_latitude | customer_longitude |
+-------------+-------------------+--------------------+
| 123         | 42.123456         | -56.123456         |
+-------------+-------------------+--------------------+
| 456         | 44.123456         | -55.123456         |
+-------------+-------------------+--------------------+

tblFireHydrants (50MM rows)

+----------------+------------------+-------------------+
| firehydrant_id | hydrant_latitude | hydrant_longitude |
+----------------+------------------+-------------------+
| 123456         | 42.987654        | -55.984657        |
+----------------+------------------+-------------------+
| 456233         | 45.569841        | -55.978946        |
+----------------+------------------+-------------------+

The goal is to query how many fire hydrants are with radius r (meters) of each customer_id.

After I repeat it for a few times for various distances, the final result will look like this:

+-------------+----------------------+----------------------+-----------------------+
| customer_id | hydrants_within_100m | hydrants_within_500m | hydrants_within_1000m |
+-------------+----------------------+----------------------+-----------------------+
| 123456      | 0                    | 1                    | 6                     |
+-------------+----------------------+----------------------+-----------------------+
| 456233      | 1                    | 1                    | 9                     |
+-------------+----------------------+----------------------+-----------------------+

I could start from scratch and try to use KDTrees in scala, or I noticed there are some geospatial UDF's for Hive-MapReduce that might work. I'm not really sure.

Looking for any suggestion of where to start, hopefully not starting from scratch.

1

There are 1 best solutions below

6
On

If I understand correctly, you shouldn't use kNN-queries. kNN-Queries return the closest 'k' hydrants. The problem is that you have to set 'k' quite high to be sure that you don't miss out any hydrants within the minimum distance. However, using a high 'k' may return lots of hydrants that are not within your required distance.

I suggest using window queries or, if available, range queries. Range queries just return everything within a given range. This would be ideal, this feature is often not supported in kd-trees.

Window queries would be a good alternative. A window query typically returns all data within a specified axis-aligned rectangle. You have to make the window large enough to ensure that it returns all possible hydrants in the required distance. Afterwards, you would have to filter out all hydrants that are in the rectangle, but exceed your maximum distance, i.e. hydrants that are near the corners of the query window.

Range and window queries should be much more efficient than kd-queries, even if it may return additional results (in the case of window queries).

To put it another way, a naive kNN algorithm can be thought of as a series of window queries. The algorithm starts with a small window and then makes it larger until it finds 'k' elements. In the best case, the algorithm performs just one window query and find the right number of results (not too few, not too many), but in the typical case it has to perform many such queries. I don't know how the cKDTree works, but I strongly would assume that window/range queries are much cheaper than kNN queries.