I have a Spark DataFrame with latitude and longitude where I'm trying to calculate the distance between the coordinates and a polyline.

The dataframe I'm working with is large (about 10 billion observations) and is loaded with spark_read_parquet(), so it starts out as a Spark dataframe. I'm wondering how to convert it to a Spark spatial dataframe so that, using geospark, I can apply sf-style methods to calculate the distance from the points to the line.

(The ultimate goal is to calculate the distance from each lat/lon in the Spark dataframe to a single polyline, so if there is another way to go about this, that works too.)
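For reference, the real data is loaded roughly like this (the path here is just a placeholder):

s_df <- spark_read_parquet(sc, name = "points", path = "path/to/points_parquet")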

The code below sets up an example with dummy data:

## Setup
library(dplyr)
library(sparklyr)
library(geospark)
library(sf)

sc <- spark_connect(master="local")
register_gis(sc)

## Make dummy dataframe data
df <- data.frame(uid = 1:10,
                 latitude = 1:10,
                 longitude = 1:10)

s_df <- sdf_copy_to(sc = sc, x = df, overwrite = TRUE)

## Make dummy polyline data
polyl <- rbind(c(0,3),c(0,4),c(1,5),c(2,5)) %>% st_linestring()
polyl_sf <- data.frame(id = 1, geom = st_sfc(polyl)) %>% st_as_sf(crs = 4326)

The code below converts the local dataframe to a spatial dataframe (sf object):

## Convert from dataframe to spatial dataframe
df %>% 
  st_as_sf(coords = c("longitude", "latitude"),
           crs = "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0")

But the equivalent call on the Spark dataframe returns the error below:

## Attempt to convert from Spark dataframe to Spark spatial dataframe
s_df %>% 
  st_as_sf(coords = c("longitude", "latitude"),
           crs = "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0")
Error in UseMethod("st_as_sf") : 
  no applicable method for 'st_as_sf' applied to an object of class "c('tbl_spark', 'tbl_sql', 'tbl_lazy', 'tbl')"

Or should this somehow be done using st_point instead?
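
In case it clarifies what I'm imagining: after register_gis(sc), I was picturing something like the sketch below, pushing the polyline down as WKT and calling the registered GIS functions inside mutate(). The exact function names and argument types here are guesses on my part, and I haven't gotten this to work:

## Hypothetical sketch (untested) -- function names / argument types are guesses
polyl_wkt <- st_as_text(polyl)   # polyline as WKT, e.g. "LINESTRING (0 3, 0 4, 1 5, 2 5)"

s_df %>%
  mutate(pt   = st_point(as.numeric(longitude), as.numeric(latitude)),
         line = st_geomfromwkt(!!polyl_wkt),
         dist = st_distance(pt, line))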
