I would greatly appreciate any comments and help on a seemingly trivial issue I have.
I am running SparkSQL geospatial query from a PySpark Client.
One of the input datasets later used in the SQL is a set of points in GeoParquet format, stored in Amazon S3. It is loaded as a pyspark.sql.DataFrame object like this:
input = spark.read.parquet("S3 URI")
Later on I would run a SQL query that contains something like:
ST_Contains(ST_GeomFromWKT(input.geometry), ...)
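For completeness, the query is executed roughly like this (a sketch only; the view names, the polygons table and its geom column, the argument order, and the prior registration of Sedona's SQL functions on the SparkSession are placeholders of mine, since only a fragment is shown above):

# register the DataFrame so it can be referenced from spark.sql (view name is hypothetical)
input.createOrReplaceTempView("points")
result = spark.sql("""
    SELECT p.*
    FROM polygons AS g, points AS p
    WHERE ST_Contains(g.geom, ST_GeomFromWKT(p.geometry))
""")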
Unfortunately, the input dataset was created with latitude and longitude columns (both in degrees) but no 'geometry' column. So I tried to add a 'geometry' column to the PySpark DataFrame right before running the SQL query.
I realized that none of the following works, because a PySpark DataFrame is immutable and does not support pandas-style item assignment or row-wise apply, and a plain Python object or string built on the driver is not a valid Spark column expression:
input['geometry'] = [Point(xy) for xy in zip(input.location_longitude, input.location_latitude)]
input['geometry'] = input.apply(lambda row:Point(row['location_longitude'],row['location_latitude']), axis=1)
input = input.withColumn('geometry', Point(input.location_longitude, input.location_latitude))
input = input.withColumn('geometry', "Point (" + str(input.location_longitude) + " " + str(input.location_latitude) + ")" )
I wonder if it is still possible to add a 'geometry' column to the PySpark DataFrame on-the-fly, without having to regenerate the GeoParquet input dataset with a WKT-based geometry column. I realize that even if it is possible, it could be an expensive operation.
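For example, I imagine something along these lines might work, though I have not been able to verify either variant (building WKT via concat, and the availability of ST_Point, which would require Sedona's SQL functions to be registered on the session, are assumptions on my part; the variable names are just for illustration):

from pyspark.sql import functions as F

# Option 1 (untested assumption): build a WKT string from the coordinate columns,
# so the existing ST_GeomFromWKT(...) call in the SQL query can parse it.
input_wkt = input.withColumn(
    "geometry",
    F.concat(
        F.lit("POINT ("),
        F.col("location_longitude").cast("string"),
        F.lit(" "),
        F.col("location_latitude").cast("string"),
        F.lit(")"),
    ),
)

# Option 2 (untested assumption): if Sedona's SQL functions are registered on the
# SparkSession, ST_Point builds the geometry directly from x (longitude) and y (latitude),
# so no WKT round-trip would be needed.
input_geom = input.withColumn(
    "geometry",
    F.expr("ST_Point(location_longitude, location_latitude)"),
)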
Thank you very much!