As the title says, I'm trying to calculate the shortest distance between two large sets of longitude/latitude points. I have a set of homes and a set of stores, and for each home I'm trying to determine which store is closest within a 20 mile radius.
The SQL I wrote works, but it doesn't scale as the number of homes grows. I'm trying to find a way to do this calculation efficiently. Even if it takes a few hours to run, I'd be happy with that, as I can get away with running it on a monthly basis. As it stands, however, the query would run for days against the full volume of homes in my database.
What I've Tried So Far
- Using guidance from this question, I made use of Oracle's SDO_GEOM package to perform the distance calculation.
- In terms of efficiency, I followed the recommendation in this guide and set up an index on each long/lat column, plus a where-clause that limits each comparison to a rough 20 mile bounding box. The idea is to filter out-of-range longs/lats right off the bat and avoid superfluous distance calculations.
- I can add parallelism to the query (sketched below), but I feel as though this is a brute-force way of decreasing the runtime. While supplementing with parallelism is workable, I want a solution that makes the query itself run efficiently before I throw processors at it.
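For reference, the parallel variant just adds a statement-level hint. A minimal sketch against the tables defined under Data Setup below; the degree of 8 is a placeholder, not a tuned value:

select /*+ parallel(8) */ count(*)
from tmp_homes h
cross join tmp_stores s
;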
Data Setup
I'm working on an Oracle 19c database that has two datasets:
1. A list of HOME_IDs and their associated longitudes and latitudes
create table tmp_homes (
    home_id   number not null,
    home_long float  not null,
    home_lat  float  not null,
    primary key (home_id)
) nologging compress pctfree 0
;
This list can be hundreds of thousands of records in size.
An index is set up for each long/lat column.
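For reference, these are plain b-tree indexes along the following lines (the index names here are placeholders):

create index tmp_homes_long_ix on tmp_homes (home_long);
create index tmp_homes_lat_ix on tmp_homes (home_lat);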
2. A list of STORE_IDs and their associated longitudes and latitudes
create table tmp_stores (
    store_id   number not null,
    store_long float  not null,
    store_lat  float  not null,
    primary key (store_id)
) nologging compress pctfree 0
;
This list is roughly one thousand records in size.
An index is set up for each long/lat column.
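The store indexes follow the same pattern (names again placeholders):

create index tmp_stores_long_ix on tmp_stores (store_long);
create index tmp_stores_lat_ix on tmp_stores (store_lat);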
Query
create table tmp_homes_to_stores compress nologging pctfree 0 as
select *
from (
    select
        h.home_id,
        s.store_id,
        sdo_geom.sdo_distance(
            sdo_geometry(2001, 4326, sdo_point_type(h.home_long, h.home_lat, null), null, null),
            sdo_geometry(2001, 4326, sdo_point_type(s.store_long, s.store_lat, null), null, null),
            0.01,     -- tolerance; meters, since SRID 4326 is geodetic
            'unit=KM' -- return the distance in kilometers
        ) as distance,
        s.radius
    from tmp_homes h
    cross join (
        select store_id, store_long, store_lat,
               32.1869      as radius,        -- 20.00 miles in km
               111.045      as distance_unit, -- km per degree of latitude
               0.0174532925 as deg2rad        -- pi / 180
               --, 57.2957795 as rad2deg
        from tmp_stores
    ) s
    -- rough bounding-box prefilter so most pairs never reach sdo_distance
    where h.home_lat between s.store_lat - (s.radius / s.distance_unit)
                         and s.store_lat + (s.radius / s.distance_unit)
      and h.home_long between s.store_long - (s.radius / (s.distance_unit * cos(s.deg2rad * s.store_lat)))
                          and s.store_long + (s.radius / (s.distance_unit * cos(s.deg2rad * s.store_lat)))
)
where distance <= radius -- 32.1869 km = 20.00 mi
;
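To make the bounding-box prefilter concrete, here is the delta math evaluated at an assumed store latitude of 40 degrees (the latitude is arbitrary, picked only for illustration):

-- lat delta:  32.1869 / 111.045                      = ~0.2899 degrees
-- long delta: 32.1869 / (111.045 * cos(40 * pi/180)) = ~0.3784 degrees
select 32.1869 / 111.045 as lat_delta_deg,
       32.1869 / (111.045 * cos(0.0174532925 * 40)) as long_delta_deg
from dual
;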
This query works well if I run it for a handful of records. Unfortunately, the moment I test it on a sizeable fraction of my working data, it takes hours to run. What modifications or tricks can I employ to make this query run significantly faster?
Note
The query in its current state returns every STORE_ID within a 20 mile radius of each HOME_ID. The next step is to order the output by distance within each HOME_ID and keep the record with the shortest distance to a store. For reference, that query looks like this:
select home_id, store_id, distance
from (
    select
        hs.*,
        row_number() over (partition by home_id order by distance asc) as distance_rank
    from tmp_homes_to_stores hs
)
where distance_rank = 1
;
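For completeness, the two steps can be folded into a single statement so the intermediate table is skipped entirely. This is just the same logic combined (the table name tmp_closest_stores is a placeholder), not a performance fix in itself:

create table tmp_closest_stores compress nologging pctfree 0 as
select home_id, store_id, distance
from (
    select
        home_id,
        store_id,
        distance,
        row_number() over (partition by home_id order by distance) as distance_rank
    from (
        select
            h.home_id,
            s.store_id,
            sdo_geom.sdo_distance(
                sdo_geometry(2001, 4326, sdo_point_type(h.home_long, h.home_lat, null), null, null),
                sdo_geometry(2001, 4326, sdo_point_type(s.store_long, s.store_lat, null), null, null),
                0.01,
                'unit=KM'
            ) as distance,
            s.radius
        from tmp_homes h
        cross join (
            select store_id, store_long, store_lat,
                   32.1869 as radius, 111.045 as distance_unit, 0.0174532925 as deg2rad
            from tmp_stores
        ) s
        where h.home_lat between s.store_lat - (s.radius / s.distance_unit)
                             and s.store_lat + (s.radius / s.distance_unit)
          and h.home_long between s.store_long - (s.radius / (s.distance_unit * cos(s.deg2rad * s.store_lat)))
                              and s.store_long + (s.radius / (s.distance_unit * cos(s.deg2rad * s.store_lat)))
    )
    where distance <= radius
)
where distance_rank = 1
;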