Generating a 'geometry' column based on lat/long from a geoparquet dataset, before running Spark SQL


I would greatly appreciate any comments and help on a seemingly trivial issue I have.

I am running a Spark SQL geospatial query from a PySpark client.

One of the input datasets used later in the SQL is a set of points in geoparquet format, stored in Amazon S3. It is loaded as a pyspark.sql.DataFrame object like this:

input = spark.read.parquet("S3 URI")

Later on I would run a SQL query containing something like:

ST_Contains(ST_GeomFromWKT(input.geometry), ...)
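
For context, the complete query would look roughly like this (the polygons table and its geom column are placeholders for the other input dataset, and I am assuming the ST_ functions are available because Apache Sedona is registered on the Spark session):

# Register the points DataFrame so it can be referenced in SQL;
# 'polygons' stands in for the other dataset, registered the same way.
input.createOrReplaceTempView("points")

result = spark.sql("""
    SELECT p.*
    FROM points p JOIN polygons g
      ON ST_Contains(g.geom, ST_GeomFromWKT(p.geometry))
""")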

Unfortunately, the input dataset was created with latitude and longitude columns (both in degrees) but no 'geometry' column. So I was trying to add the 'geometry' column to the PySpark DataFrame right before running the SQL query.

I realized that, due to the immutable nature of the DataFrame type, none of the following approaches works:

input['geometry'] = [Point(xy) for xy in zip(input.location_longitude, input.location_latitude)]

input['geometry'] = input.apply(lambda row:Point(row['location_longitude'],row['location_latitude']), axis=1)

input = input.withColumn('geometry', Point(input.location_longitude, input.location_latitude))

input = input.withColumn('geometry', "Point (" + str(input.location_longitude) + " " + str(input.location_latitude) + ")" )

I wonder if it is still possible to populate a 'geometry' column on the PySpark DataFrame on the fly, without having to recreate the geoparquet input dataset with a WKT-based geometry column. I realize that even if it is possible, it could be an expensive operation.
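
In case it helps clarify what I am after, the kind of thing I imagine might work (untested sketches on my part, using the column names from my dataset and assuming Sedona's SQL functions can be called through expr) is either building a WKT string with column functions, or creating the geometry directly with ST_Point:

from pyspark.sql import functions as F

# Sketch 1: build a WKT string column from the lat/long columns, so the
# later ST_GeomFromWKT(geometry) call in the SQL would keep working as-is.
input = input.withColumn(
    "geometry",
    F.format_string("POINT (%s %s)", "location_longitude", "location_latitude"),
)

# Sketch 2: create a real geometry column with Sedona's ST_Point (this assumes
# Sedona is registered on the session); with this one the later
# ST_GeomFromWKT wrapper would no longer be needed.
input = input.withColumn(
    "geometry",
    F.expr("ST_Point(location_longitude, location_latitude)"),
)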

Thank you very much!
