I have two datasets: one with some locations (lat, lon), that's test, and one with the lat/lon information of all zip codes in NYC, that's test2.
test <- structure(list(trip_count = 1:10, dropoff_longitude = c(-73.959862,
-73.882202, -73.934113, -73.992203, -74.00563, -73.975189, -73.97448,
-73.974838, -73.981377, -73.955093), dropoff_latitude = c(40.773617,
40.744175, 40.715923, 40.749203, 40.726158, 40.729824, 40.763599,
40.754135, 40.759987, 40.765224)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
test2 <- structure(list(latitude = c(40.853017, 40.791586, 40.762174,
40.706903, 40.825727, 40.739022, 40.750824, 40.673138, 40.815559,
40.754591), longitude = c(-73.91214, -73.94575, -73.94917, -73.82973,
-73.81752, -73.98205, -73.99289, -73.81443, -73.90771, -73.976238
), borough = c("Bronx", "Manhattan", "Manhattan", "Queens", "Bronx",
"Manhattan", "Manhattan", "Queens", "Bronx", "Manhattan")), class = "data.frame", row.names = c(NA,
-10L))
I am now trying to join these two datasets so that in the end I get one borough for every trip_count. So far I have used difference_left_join for that, like this:
test %>% fuzzyjoin::difference_left_join(test2,by = c("dropoff_longitude" = "longitude" , "dropoff_latitude" = "latitude"), max_dist = 0.01)
Even though this approach works, as the datasets get larger the join creates a lot of multiple matches, so I end up with a dataset that is sometimes ten times as large as the initial one, test. Does anyone have a different approach to solving this without creating multiple matches? Or is there any way I can force the join to always use just one match for every row in test? I would highly appreciate it!
EDIT: Solving the problem in "R dplyr left join - multiple returned values and new rows: how to ask for the first match only?" would also solve mine. So maybe one of you has an idea about that!
You could use the geo_join functions to return the distance between matches and then filter down to the closest match. You may want to adjust the value of "max_dist" down to reduce the number of matches; that should improve performance but may generate too many NAs.
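A minimal sketch of that idea, assuming the fuzzyjoin, geosphere and dplyr packages are installed. The column name "dist" and the 2-mile max_dist are illustrative choices, not values from the question, and the data are just the first rows of test and test2 repeated so the snippet runs on its own. The `by` order (longitude, then latitude) mirrors the call in the question; it is worth sanity-checking a few of the returned distances on your own data.

```r
library(dplyr)
library(fuzzyjoin)

# First rows of the question's data, repeated so the sketch is self-contained
test <- data.frame(
  trip_count        = 1:4,
  dropoff_longitude = c(-73.959862, -73.882202, -73.934113, -73.992203),
  dropoff_latitude  = c(40.773617, 40.744175, 40.715923, 40.749203)
)
test2 <- data.frame(
  latitude  = c(40.791586, 40.762174, 40.706903, 40.739022, 40.750824),
  longitude = c(-73.94575, -73.94917, -73.82973, -73.98205, -73.99289),
  borough   = c("Manhattan", "Manhattan", "Queens", "Manhattan", "Manhattan")
)

closest <- test %>%
  geo_left_join(
    test2,
    by = c("dropoff_longitude" = "longitude", "dropoff_latitude" = "latitude"),
    max_dist = 2,          # assumed threshold; geo_join measures in miles by default
    distance_col = "dist"  # assumed column name for the match distance
  ) %>%
  # sort each trip's candidates by distance (NA distances sort last) ...
  arrange(trip_count, dist) %>%
  # ... and keep only the first, i.e. closest, candidate per trip
  distinct(trip_count, .keep_all = TRUE) %>%
  select(trip_count, dropoff_longitude, dropoff_latitude, borough, dist)

closest
```

Because this is a left join, a trip with no zip code within max_dist still appears once, with NA for borough and dist. The arrange() plus distinct() step is also one way to get "first match only" out of any join that returns several rows per key.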
Update
Rounding to 3 decimal places is at most a 70 meter/230 ft offset. Rounding to fewer decimal digits reduces the number of unique points but increases the maximum offset.
Here is how I would handle rounding the drop-off location and performing the join. It adds complexity, but may help with the memory issues. I have not considered the group_by function here, but that could also work.
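The rounding approach could look roughly like the sketch below, again assuming fuzzyjoin, geosphere and dplyr are installed. The helper columns lon_r and lat_r, the "dist" column and the 2-mile max_dist are hypothetical names and values chosen for illustration. The fuzzy join runs only on the unique rounded points, and the result is then attached to every trip with an ordinary exact join.

```r
library(dplyr)
library(fuzzyjoin)

# First rows of the question's data, repeated so the sketch is self-contained
test <- data.frame(
  trip_count        = 1:4,
  dropoff_longitude = c(-73.959862, -73.882202, -73.934113, -73.992203),
  dropoff_latitude  = c(40.773617, 40.744175, 40.715923, 40.749203)
)
test2 <- data.frame(
  latitude  = c(40.791586, 40.762174, 40.706903, 40.739022, 40.750824),
  longitude = c(-73.94575, -73.94917, -73.82973, -73.98205, -73.99289),
  borough   = c("Manhattan", "Manhattan", "Queens", "Manhattan", "Manhattan")
)

# 1. Round the drop-off coordinates (3 decimals, as discussed above)
test_r <- test %>%
  mutate(lon_r = round(dropoff_longitude, 3),
         lat_r = round(dropoff_latitude, 3))

# 2. Fuzzy-join only the unique rounded points and keep the closest zip code
lookup <- test_r %>%
  distinct(lon_r, lat_r) %>%
  geo_left_join(
    test2,
    by = c("lon_r" = "longitude", "lat_r" = "latitude"),
    max_dist = 2,          # assumed threshold, in miles
    distance_col = "dist"  # assumed column name
  ) %>%
  arrange(lon_r, lat_r, dist) %>%
  distinct(lon_r, lat_r, .keep_all = TRUE) %>%
  select(lon_r, lat_r, borough)

# 3. Attach the borough to every trip with an exact join on the rounded values
result <- test_r %>%
  left_join(lookup, by = c("lon_r", "lat_r")) %>%
  select(-lon_r, -lat_r)

result
```

On real trip data many drop-offs share the same rounded point, so step 2 works on far fewer rows than the full table, which is where the memory saving would come from.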