Inconsistent results when matching closest lat/long points in R using sf and st_distance()


I have a large dataset in which each row is a station. For each station, I need to find the closest station from the same year that used a different type of equipment. I then want to either combine the matched rows into a new dataset, with the lat/long and other station info for both stations of a pair side by side in the same row, or create some kind of index so I know which rows are related. I have managed to do this by following this answer and plotted the result, but some stations appear to have been linked to stations that are obviously not the closest. I can't tell whether this is due to the way I have plotted the data or the way I have joined the closest stations, so I would appreciate any pointers. I would also be interested in a more efficient approach.

Thanks in advance for any help!!

Example code:

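library(ggplot2)
library(plotly)
library(sf)
library(ggOceanMaps) # provides basemap(), used in the projected plot
library(ggspatial)   # provides geom_spatial_point() and geom_spatial_segment()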

#data
set.seed(123)
latitude <- runif(100, 72, 81)
longitude <- runif(100, 20, 60)
gear <- factor(sample(1:2, 100, replace = TRUE))
year <- factor(sample(c(2020, 2021), 100, replace = TRUE))
orig.data <- data.frame(latitude, longitude, gear, year)


orig.data$lat<-orig.data$latitude # duplicating lat/long columns 
orig.data$lon<-orig.data$longitude
df = st_as_sf(orig.data, coords=5:6) # making last 2 columns sf coordinates
# creating distance matrix
dm = st_distance(df)
ijd = data.frame(expand.grid(i=1:nrow(dm), j=1:nrow(dm)))
ijd$distance = c(dm)

# these following lines are a clunky way of copying the important info for each station pair
ijd$year.i = df$year[ijd$i] 
ijd$year.j = df$year[ijd$j]
ijd$gear.i = df$gear[ijd$i]
ijd$gear.j = df$gear[ijd$j]
ijd$latitude.j = df$latitude[ijd$j]
ijd$longitude.j = df$longitude[ijd$j]
ijd$latitude.i = df$latitude[ijd$i]
ijd$longitude.i = df$longitude[ijd$i]

# Keep only pairs from the same year that used different gear.
# Requiring different gear also ensures a point can't be a nearest neighbour of itself.
ijd = ijd[ijd$year.i == ijd$year.j,]
ijd = ijd[ijd$gear.i != ijd$gear.j,]

# selecting the closest stations
# Split into data frames for each i point.
ijd.split = split(ijd, ijd$i)

nearest = function(d){
  d = d[order(d$distance),]
  d[1:min(c(nrow(d),1)),]
}

dn = lapply(ijd.split,nearest)
nnij = do.call(rbind, dn)

# keeping only pairs that start at a gear-1 station, so each gear-1 station appears once
nnij2<-subset(nnij, as.factor(gear.i)==1)

# plotting closest stations using 'geom_segment'
# plot clearly shows some stations are joined to ones further away than the logical 'closest' station
ggplot(data = nnij2, aes(x = longitude.i, y = latitude.i, shape = gear.i)) +
  geom_point() +
  geom_point(data = nnij2, aes(x = longitude.j, y = latitude.j, shape = gear.j)) +
  geom_segment(data = nnij2, aes(x = longitude.i, y = latitude.i, xend = longitude.j, yend = latitude.j, colour = distance)) +
  facet_wrap(~year.i)

# issue persists when projecting coordinates
ggplotly(basemap(limits=c(25,40,72,79))+
           geom_spatial_point(data = nnij2, aes(x = longitude.i, y = latitude.i, shape = gear.i)) +
           geom_spatial_point(data = nnij2, aes(x = longitude.j, y = latitude.j, shape = gear.j))+
           geom_spatial_segment(data = nnij2, aes(x = longitude.i, y = latitude.i, xend = longitude.j, yend = latitude.j, colour = distance))+
           facet_wrap(~year.i))

The red arrows in the image highlight one of the questionable joins - the top point should have been joined to the one on the right, but instead has been linked to the one below.
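
A possible explanation, offered as an assumption rather than a confirmed diagnosis: st_as_sf() is called above without a crs, so st_distance() treats the coordinates as planar and returns straight-line distances in degrees. At 72 to 81 degrees north, a degree of longitude covers far less ground than a degree of latitude, so the pair that is closest in degrees need not be the closest on the ground. The sketch below assumes the coordinates are WGS84 (EPSG:4326), passes them in longitude, latitude order, and uses st_nearest_feature() within each year instead of the full distance matrix. The names pts, id, pair_year, pairs and distance_m are made up for this illustration.

```r
library(sf)

# same example data as above
set.seed(123)
latitude <- runif(100, 72, 81)
longitude <- runif(100, 20, 60)
gear <- factor(sample(1:2, 100, replace = TRUE))
year <- factor(sample(c(2020, 2021), 100, replace = TRUE))
orig.data <- data.frame(latitude, longitude, gear, year)

# Assumption: the coordinates are WGS84 lon/lat (EPSG:4326).
# coords must be given in x, y order (longitude first);
# remove = FALSE keeps the original columns for plotting.
pts <- st_as_sf(orig.data, coords = c("longitude", "latitude"),
                crs = 4326, remove = FALSE)
pts$id <- seq_len(nrow(pts)) # row index back into orig.data

# For one year, pair each gear-1 station with its nearest gear-2 station
pair_year <- function(d) {
  a <- d[d$gear == 1, ]
  b <- d[d$gear == 2, ]
  if (nrow(a) == 0 || nrow(b) == 0) return(NULL)
  idx <- st_nearest_feature(a, b) # for each row of a, the index of the nearest row of b
  data.frame(
    year = a$year,
    id.i = a$id, longitude.i = a$longitude, latitude.i = a$latitude,
    id.j = b$id[idx], longitude.j = b$longitude[idx], latitude.j = b$latitude[idx],
    # with a geographic CRS, st_distance() returns metres
    distance_m = as.numeric(st_distance(a, b[idx, ], by_element = TRUE))
  )
}

pairs <- do.call(rbind, lapply(split(pts, pts$year), pair_year))
head(pairs)
```

Each row of pairs holds both stations of a pair side by side, and id.i and id.j index back into orig.data, which covers both output options described in the question. The .i and .j columns can be passed to the geom_segment() call above in place of nnij2 to check whether the questionable joins disappear.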

