I have two data frames, STORE_LIST_A (50,000 rows) and STORE_LIST_B (30,000 rows). Both contain the same three columns: STORE_ID, LATITUDE and LONGITUDE.
| STORE_ID | LATITUDE | LONGITUDE |
|---|---|---|
| ABPS693P | 8.0951349 | 77.5472305 |
| ABTS730P | 8.1024627 | 77.5581893 |
What I want to do is calculate the distance of each store in STORE_LIST_A to every store in STORE_LIST_B, and keep only those pairs which are within 1,000 metres of each other.
I have tried to do this using a nested loop where each store in STORE_LIST_A is compared to each store in STORE_LIST_B, and each pair within 1,000 metres of each other is added to an empty data frame using rbind (sketched below).
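A simplified sketch of that loop (the distance function, geosphere::distHaversine, is just an assumption here; my real code is messier but equivalent):

```r
library(geosphere)

pairs <- data.frame()
for (i in seq_len(nrow(STORE_LIST_A))) {
  for (j in seq_len(nrow(STORE_LIST_B))) {
    # great-circle distance in metres between store i of A and store j of B
    d <- distHaversine(c(STORE_LIST_A$LONGITUDE[i], STORE_LIST_A$LATITUDE[i]),
                       c(STORE_LIST_B$LONGITUDE[j], STORE_LIST_B$LATITUDE[j]))
    if (d <= 1000) {
      # growing a data frame with rbind inside the loop is part of what makes this so slow
      pairs <- rbind(pairs, data.frame(STORE_ID_A = STORE_LIST_A$STORE_ID[i],
                                       STORE_ID_B = STORE_LIST_B$STORE_ID[j],
                                       DISTANCE_M = d))
    }
  }
}
```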
But the problem with this approach is that it takes days for this loop to run, since there are 50,000 rows in STORE_LIST_A and 30,000 rows in STORE_LIST_B, so essentially there are 50,000 × 30,000 = 1.5 billion distance calculations. Can someone please suggest a computationally efficient way to deal with this problem?
OK, here's an attempt :) I ran into memory limits a few times, but that probably depends on how close together your stores are. I found a nice data set with all companies in San Francisco, so I ran the code fully on 30k × 50k combinations, but with my criterion set to roughly 250 m apart.
**libraries used**
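A minimal sketch of the two packages this approach relies on, data.table for foverlaps and geosphere for the exact distances at the end:

```r
library(data.table)   # fast keyed joins and foverlaps()
library(geosphere)    # distHaversine() for the final, exact distance check
```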
**just to prepare a similar data set**
Let's play around: with n of 10000 we recreate your sizes; make it smaller to just test the code. After this we have the sample data to play with.
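A sketch of that set-up; I simulate coordinates here instead of pointing at the San Francisco file, so the exact values (and the column layout of STORE_LIST_A/STORE_LIST_B) are assumptions:

```r
set.seed(42)
n <- 10000   # with n = 10000 we get 50k x 30k rows; lower it to test quickly

# simulated store lists spread over roughly one degree of latitude/longitude
STORE_LIST_A <- data.table(
  STORE_ID  = sprintf("A%05d", seq_len(5 * n)),
  LATITUDE  = runif(5 * n, 8.0, 9.0),
  LONGITUDE = runif(5 * n, 77.0, 78.0)
)
STORE_LIST_B <- data.table(
  STORE_ID  = sprintf("B%05d", seq_len(3 * n)),
  LATITUDE  = runif(3 * n, 8.0, 9.0),
  LONGITUDE = runif(3 * n, 77.0, 78.0)
)
```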
**solution**
This looks a bit redundant, but roughly a 0.009-degree difference in latitude equals about 1 km, and we are not interested in any distance calculation exceeding that. Note that on the SF data I made the window smaller, as we would otherwise have had too many combinations! I decided to use foverlaps, as post-filtering a cartesian join resulted in 2.5 billion combinations, good luck :) Note that in A we set the ranges and in B we "clone" the start/end. We basically split the data and handle the latitudes and longitudes separately: if either one of them is more than 1 km apart, the distance between A and B will be too!
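In sketch form (the column names and the exact padding are my choices, not something you have to copy):

```r
delta <- 0.009   # ~0.009 degrees of latitude is about 1 km; pad it a bit for a safety margin

# latitude: A carries the +/- 1 km windows, B carries zero-width "point" intervals
A_lat <- data.table(ID_A  = STORE_LIST_A$STORE_ID,
                    start = STORE_LIST_A$LATITUDE - delta,
                    end   = STORE_LIST_A$LATITUDE + delta)
B_lat <- data.table(ID_B  = STORE_LIST_B$STORE_ID,
                    start = STORE_LIST_B$LATITUDE,
                    end   = STORE_LIST_B$LATITUDE)
setkey(B_lat, start, end)
lat_pairs <- foverlaps(A_lat, B_lat, type = "any", nomatch = 0L)[, .(ID_A, ID_B)]

# longitude: same idea (near the equator 0.009 degrees of longitude is also roughly 1 km)
A_lon <- data.table(ID_A  = STORE_LIST_A$STORE_ID,
                    start = STORE_LIST_A$LONGITUDE - delta,
                    end   = STORE_LIST_A$LONGITUDE + delta)
B_lon <- data.table(ID_B  = STORE_LIST_B$STORE_ID,
                    start = STORE_LIST_B$LONGITUDE,
                    end   = STORE_LIST_B$LONGITUDE)
setkey(B_lon, start, end)
lon_pairs <- foverlaps(A_lon, B_lon, type = "any", nomatch = 0L)[, .(ID_A, ID_B)]
```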
Now we have to inner join the two, as the distance is only small enough if both the latitude and longitude differences are small enough. On this data that resulted in "only" 5.3M candidate pairs for which you have to calculate the distance.
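Which is a one-liner on the hypothetical pair tables above:

```r
# a pair survives only if it is within ~1 km on BOTH axes
candidates <- merge(lat_pairs, lon_pairs, by = c("ID_A", "ID_B"))
```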
Do some cleaning up and filter on the actually calculated distances. Note that this is still pretty slow, but we only have to calculate it for the 5.6M candidate cases here and not for 1,500M :)
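A sketch of that last step, attaching the coordinates back to the candidate pairs and applying the real 1,000 m cut-off (distHaversine is vectorised, so this is one call over all candidates):

```r
# attach the coordinates of both stores to every candidate pair
candidates <- merge(candidates,
                    STORE_LIST_A[, .(ID_A = STORE_ID, LAT_A = LATITUDE, LON_A = LONGITUDE)],
                    by = "ID_A")
candidates <- merge(candidates,
                    STORE_LIST_B[, .(ID_B = STORE_ID, LAT_B = LATITUDE, LON_B = LONGITUDE)],
                    by = "ID_B")

# exact great-circle distance in metres, then keep only the pairs within 1,000 m
candidates[, DIST_M := distHaversine(cbind(LON_A, LAT_A), cbind(LON_B, LAT_B))]
result <- candidates[DIST_M <= 1000]
```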
**sample output (from a smaller test run)**