I have two large dataframes of longitude/latitude coordinates, CoastalStates_Tax and CoastalStates, which are mostly the same except CoastalStates_Tax has a few million more coordinates. I want to figure out which rows in CoastalStates_Tax are not in CoastalStates, but I still need to be able to track the indices of the missing rows wherever they fall in the Tax dataset.

This is what CoastalStates_Tax looks like:
RecordID_b PROPERTY LEVEL LONGITUDE PROPERTY LEVEL LATITUDE
1 132381977 -77.06421 39.16937
2 132381978 -77.18106 39.08811
3 132381979 -77.03353 39.02414
4 132381980 -77.09930 39.00716
5 132381981 -77.25450 39.10422
6 132381982 -77.02797 39.08087
And CoastalStates:
RecordID PROPERTY LEVEL LONGITUDE PROPERTY LEVEL LATITUDE
1 1 -80.24787 25.85063
2 2 -80.14940 25.84582
3 3 -80.13115 25.85699
4 4 -80.37275 25.77741
5 5 -80.12095 25.82633
6 6 -80.39949 25.73273
I tried using the dplyr anti_join function with anti_join(CoastalStates_Tax, CoastalStates, by = c("PROPERTY LEVEL LONGITUDE", "PROPERTY LEVEL LATITUDE")), but it only gives me 4,635,393 rows. The difference in row counts between the two datasets is 4,637,029, so I'm missing about 1,600 rows and I can't figure out why. Am I misusing anti_join, and if so, are there other ways to go about this?
My suggestion would be to round both longitude and latitude to 3 decimal places (accurate to roughly 110 meters) and then convert them to character before joining on those columns. A join on doubles requires exact equality, so two coordinates that differ only past the displayed precision (e.g. from floating-point noise introduced when the datasets were produced) will never match, which could account for the missing ~1,600 rows. An attempt could look like this:
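A minimal sketch of the idea, using small made-up dataframes in place of the real datasets (the column names match the question; the coordinate values here are illustrative):

```r
library(dplyr)

# Toy stand-ins for the real datasets (check.names = FALSE keeps the spaces
# in the column names, as in the question)
CoastalStates_Tax <- data.frame(
  RecordID_b = 1:3,
  `PROPERTY LEVEL LONGITUDE` = c(-77.06421, -77.18106, -99.00001),
  `PROPERTY LEVEL LATITUDE`  = c(39.16937, 39.08811, 40.00001),
  check.names = FALSE
)
CoastalStates <- data.frame(
  RecordID = 1:2,
  # same first two points, but with tiny floating-point differences
  `PROPERTY LEVEL LONGITUDE` = c(-77.064210000001, -77.18106),
  `PROPERTY LEVEL LATITUDE`  = c(39.169370000001, 39.08811),
  check.names = FALSE
)

# Build character join keys from the rounded coordinates
prep <- function(df) {
  df %>%
    mutate(
      lon_key = as.character(round(`PROPERTY LEVEL LONGITUDE`, 3)),
      lat_key = as.character(round(`PROPERTY LEVEL LATITUDE`, 3))
    )
}

# anti_join on the rounded keys; RecordID_b survives, so you can still
# locate the unmatched rows in the Tax dataset
missing_rows <- anti_join(prep(CoastalStates_Tax), prep(CoastalStates),
                          by = c("lon_key", "lat_key"))
missing_rows$RecordID_b
```

One caveat with this approach: rounding can also collapse genuinely distinct points that happen to fall within ~110 meters of each other, so the choice of 3 decimal places is a trade-off between tolerating floating-point noise and accidentally matching nearby but different properties.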