Unable to read shapefile using sedona and pyspark, for file in hdfs


UPDATE:

Passing the shapefile folder instead of the .shp file, and using the jar sedona-spark-shaded-3.0_2.12-1.4.0.jar, did the trick. Thanks to Jia Yu - Apache Sedona!

I have the naturalearth_lowres shapefile stored in HDFS, and I am trying to read it with the following spark-submit command and Python script:

spark-submit --jars /usr/local/spark/jars/sedona-core-3.0_2.13-1.4.0.jar,/usr/local/spark/jars/sedona-python-adapter-3.0_2.12-1.4.0.jar,/usr/local/spark/jars/sedona-sql-3.0_2.12-1.4.0.jar,/usr/local/spark/jars/geotools-wrapper-1.4.0-28.2.jar,/usr/local/spark/jars/sedona-python-adapter-3.0_2.12-1.4.0.jar pyspark_read_sedona.py

from pyspark.sql import SparkSession
from sedona.utils.adapter import Adapter
from sedona.register import SedonaRegistrator
from sedona.utils import SedonaKryoRegistrator, KryoSerializer

spark = SparkSession. \
    builder. \
    appName("NaturalEarthCities"). \
    config("spark.serializer", KryoSerializer.getName). \
    config("spark.kryo.registrator", SedonaKryoRegistrator.getName). \
    getOrCreate()

SedonaRegistrator.registerAll(spark)

from sedona.core.formatMapper.shapefileParser import ShapefileReader

shapefile_location = "hdfs:/naturalearth_lowres/naturalearth_lowres.shp"

spatial_rdd = ShapefileReader.readToGeometryRDD(spark.sparkContext, shapefile_location)

spatial_df = Adapter.toDf(spatial_rdd, spark)

spatial_df.createOrReplaceTempView("naturalearth_cities")

result_df = spark.sql("SELECT * FROM naturalearth_cities")
result_df.show()

but, I'm getting the following error:

Traceback (most recent call last):
  File "/home/bigdata/ronnit/pyspark_read_sedona.py", line 21, in <module>
    spatial_rdd = ShapefileReader.readToGeometryRDD(spark.sparkContext, shapefile_location)
  File "/home/bigdata/anaconda3/lib/python3.7/site-packages/sedona/core/formatMapper/shapefileParser/shape_file_reader.py", line 42, in readToGeometryRDD
    inputPath
  File "/usr/local/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
  File "/usr/local/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.sedona.core.formatMapper.shapefileParser.ShapefileReader.readToGeometryRDD.
: java.lang.ArrayIndexOutOfBoundsException: 0
    at scala.collection.mutable.WrappedArray$ofRef.apply(WrappedArray.scala:193)
    at scala.collection.convert.Wrappers$SeqWrapper.get(Wrappers.scala:74)
    at org.apache.sedona.core.formatMapper.shapefileParser.ShapefileReader.readFieldNames(ShapefileReader.java:188)
    at org.apache.sedona.core.formatMapper.shapefileParser.ShapefileReader.readToGeometryRDD(ShapefileReader.java:82)
    at org.apache.sedona.core.formatMapper.shapefileParser.ShapefileReader.readToGeometryRDD(ShapefileReader.java:66)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.lang.Thread.run(Thread.java:745)

My setup configurations:

Apache Sedona:

Name: apache-sedona
Version: 1.4.0
Summary: Apache Sedona is a cluster computing system for processing large-scale spatial data
Home-page: https://sedona.apache.org
Author: Apache Sedona
Author-email: [email protected]
License: Apache License v2.0
Location: /home/bigdata/anaconda3/lib/python3.7/site-packages
Requires: shapely, attrs

PySpark:

Name: pyspark
Version: 3.3.0
Summary: Apache Spark Python API
Home-page: https://github.com/apache/spark/tree/master/python
Author: Spark Developers
Author-email: [email protected]
License: http://www.apache.org/licenses/LICENSE-2.0
Location: /home/bigdata/anaconda3/lib/python3.7/site-packages
Requires: py4j
Required-by: geospark

I'm trying to read the shapefile using Sedona and run spatial queries on top of it.

I read that the ArrayIndexOutOfBoundsException in this case occurs when code tries to access index 0 of an empty array. I tried the following to rule out any issue with the file:

  1. Checked the file path provided, which was correct.
  2. Checked the file contents, which I was able to read and print using geopandas.
  3. Ensured the correct dependencies were installed.
  4. Checked that the file permissions were sufficient (rw-r--r--).
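One more check worth adding (a hypothetical helper, not part of the original post): a shapefile is only readable if its sidecar files sit next to the .shp, so a quick check of a local copy of the folder (e.g. after `hdfs dfs -get`) can rule out a missing .shx or .dbf:

```python
import os

# Hypothetical helper: a "shapefile" is really a bundle of files, and
# Sedona's reader needs at least the .shp, .shx, and .dbf components.
REQUIRED_EXTENSIONS = {".shp", ".shx", ".dbf"}

def missing_shapefile_parts(folder):
    """Return the required extensions that are missing from `folder`."""
    present = {os.path.splitext(name)[1].lower() for name in os.listdir(folder)}
    return sorted(REQUIRED_EXTENSIONS - present)
```

An empty list means all required parts are present; anything else names the missing components.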

Please let me know if anything else needs to be added to address this.

There is 1 answer below.

First of all, only the following jars are needed:

/usr/local/spark/jars/geotools-wrapper-1.4.0-28.2.jar,/usr/local/spark/jars/sedona-spark-shaded-3.0_2.12-1.4.0.jar
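The asker's spark-submit line can then be reduced to something like the following (a sketch assuming the jars live under /usr/local/spark/jars, as in the question; note that the Scala version, 2.12 here, must match across all jars):

```shell
# Only the shaded Sedona jar and the geotools wrapper are needed.
spark-submit \
  --jars /usr/local/spark/jars/geotools-wrapper-1.4.0-28.2.jar,/usr/local/spark/jars/sedona-spark-shaded-3.0_2.12-1.4.0.jar \
  pyspark_read_sedona.py
```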

Secondly, the path of your shapefile is wrong. See here: https://sedona.apache.org/1.4.1/tutorial/rdd/#from-shapefile

Given the following shapefile structure:

- shapefile1
- shapefile2
- myshapefile
    - myshapefile.shp
    - myshapefile.shx
    - myshapefile.dbf
    - myshapefile...
    - ...

The Python code should be:

from sedona.core.formatMapper.shapefileParser import ShapefileReader

shape_file_location="hdfs://Download/myshapefile"

ShapefileReader.readToGeometryRDD(sc, shape_file_location)

In a nutshell, the path should point to the shapefile folder, not the .shp file.
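Applied to the path from the question, the fix is simply to drop the file name and pass the enclosing folder (a minimal sketch; the variable names are mine):

```python
import posixpath

# Path from the question, pointing at the .shp file itself.
shp_path = "hdfs:/naturalearth_lowres/naturalearth_lowres.shp"

# Sedona expects the folder containing the .shp/.shx/.dbf files,
# so strip off the file name component.
folder_path = posixpath.dirname(shp_path)
print(folder_path)  # hdfs:/naturalearth_lowres

# That folder path is what goes into the reader, e.g.:
# spatial_rdd = ShapefileReader.readToGeometryRDD(spark.sparkContext, folder_path)
```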