How Can I Efficiently Calculate The Shortest Distance Between Many Longs/Lats (Oracle SQL)?

As the title says, I'm trying to calculate the shortest distance between many longitude and latitude points. I have a set of homes and a set of stores. For each home, I am trying to determine which store is the closest within a 20-mile radius.

The SQL I wrote works, but it doesn't scale well as more homes are added. I'm trying to find a way to do this calculation efficiently. Even if it takes a few hours to run, I would be happy with that, as I can get away with running this on a monthly basis. As it stands, however, this query would run for days if I ran it against the full volume of homes in my database.

What I've Tried So Far

  • Using guidance from this question, I made use of Oracle's SDO_GEOM package to perform the distance calculation.
  • For efficiency, I followed the recommendation in this guide: I set up an index on each long/lat column and added a where-clause bounding box that limits candidates to a 20-mile radius, filtering out out-of-range longs/lats up front and thus reducing superfluous distance calculations (a worked example of this filter follows the list).
  • I can add parallelism to the query, but that feels like a brute-force way of decreasing runtime. While supplementing with parallelism is workable, I want a solution that makes the query itself run efficiently before I throw processors at it.
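
To make the bounding-box filter concrete, here is the rough arithmetic behind it (the 40°N latitude is just an illustrative value, not taken from my data):

select
    32.1869 / 111.045 as lat_margin_deg,                             -- ~0.29 degrees of latitude covers 20 miles
    32.1869 / (111.045 * cos(40 * 0.0174532925)) as long_margin_deg  -- ~0.38 degrees of longitude at 40N; widens toward the poles
from dual;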

Data Setup

I'm working on an Oracle 19c database that has two datasets:

1. A list of HOME_IDs and their associated longitudes and latitudes

create table tmp_homes (
    home_id number not null,
    home_long float not null,
    home_lat float not null,
    primary key(home_id)
) nologging compress pctfree 0
;

This list can be hundreds of thousands of records in size.

An index is set up for each long/lat column.
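
For reference, the index DDL looks something like this (the index names are placeholders of my own):

create index tmp_homes_long_idx on tmp_homes (home_long);
create index tmp_homes_lat_idx on tmp_homes (home_lat);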

2. A list of STORE_IDs and their associated longitudes and latitudes

create table tmp_stores (
    store_id number not null,
    store_long float not null,
    store_lat float not null,
    primary key(store_id)
) nologging compress pctfree 0
;

This list is roughly one thousand records in size.

An index is set up for each long/lat column.
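
The DDL mirrors the indexes on the homes table (again, the index names are placeholders):

create index tmp_stores_long_idx on tmp_stores (store_long);
create index tmp_stores_lat_idx on tmp_stores (store_lat);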

Query

create table tmp_homes_to_stores compress nologging pctfree 0 as
select *
from (
    select
    h.home_id,
    s.store_id,
    sdo_geom.sdo_distance(
      sdo_geometry(2001, 4326, sdo_point_type(h.home_long, h.home_lat, null), null, null),
      sdo_geometry(2001, 4326, sdo_point_type(s.store_long, s.store_lat, null), null, null),
      0.01,
      'unit=KM'
    ) as distance,
    s.radius
    from tmp_homes h
    cross join (
        select store_id, store_long, store_lat, 32.1869 as radius, 111.045 as distance_unit, 0.0174532925 as deg2rad--, 57.2957795 as rad2deg
        from tmp_stores
    ) s
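    -- Rough degree-based bounding box around each store: +/- (radius / 111.045) degrees of latitude and
    -- +/- (radius / (111.045 * cos(store latitude))) degrees of longitude, pruning pairs before sdo_distance runs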
    where h.home_lat between s.store_lat - (s.radius / s.distance_unit) and s.store_lat + (s.radius / s.distance_unit)
    and h.home_long between s.store_long - (s.radius / (s.distance_unit * cos(s.deg2rad * (s.store_lat)))) and s.store_long + (s.radius / (s.distance_unit * cos(s.deg2rad * (s.store_lat))))
)
where distance <= radius -- 32.1869km = 20.00mi
;

This query works well if I run it for a handful of records. Unfortunately, the moment I test it on a sizeable fraction of my working data, it takes hours to run. What modifications or tricks can I employ to make this query run significantly faster?

Note

The query in its current state returns every STORE_ID within a 20-mile radius of each HOME_ID. The next step is to order the output by distance for each HOME_ID and choose the record with the shortest distance to a store. For reference, that query looks like this:

select home_id, store_id, distance
from (
    select
    hs.*,
    row_number() over(partition by home_id order by distance asc) as distance_rank
    from tmp_homes_to_stores hs
)
where distance_rank = 1
;