Pyspark: why does the ST_intersects function return duplicated rows?

785 Views Asked by At

I am using the ST_Intersects function of geospark to make the intersection between points and polygons.

queryOverlap = """
        SELECT p.ID, z.COUNTYNS as zone, p.date, timestamp, p.point
        FROM gpsPingTable as p, zoneShapes as z
        WHERE ST_Intersects(p.point, z.geometry)
    """
pingsDay = spark.sql(queryOverlap)
pingsDay.show()

Why does return for each row a duplicate?

+--------------------+--------+----------+-------------------+--------------------+
|                  ID|    zone|      date|          timestamp|               point|
+--------------------+--------+----------+-------------------+--------------------+
|45cdaabc-a804-46b...|01529224|2020-03-17|2020-03-17 12:29:24|POINT (-122.38825...|
|45cdaabc-a804-46b...|01529224|2020-03-17|2020-03-17 12:29:24|POINT (-122.38825...|
|45cdaabc-a804-46b...|01529224|2020-03-18|2020-03-18 11:21:27|POINT (-122.38851...|
|45cdaabc-a804-46b...|01529224|2020-03-18|2020-03-18 11:21:27|POINT (-122.38851...|
|aae0bb4e-4899-4ce...|01531402|2020-03-18|2020-03-18 06:58:03|POINT (-122.23097...|
|aae0bb4e-4899-4ce...|01531402|2020-03-18|2020-03-18 06:58:03|POINT (-122.23097...|
|f9b58c70-0665-4f5...|01531928|2020-03-17|2020-03-17 17:32:46|POINT (-119.43811...|
|f9b58c70-0665-4f5...|01531928|2020-03-17|2020-03-17 17:32:46|POINT (-119.43811...|
|f9b58c70-0665-4f5...|01531928|2020-03-18|2020-03-18 08:21:34|POINT (-119.41080...|
|f9b58c70-0665-4f5...|01531928|2020-03-18|2020-03-18 08:21:34|POINT (-119.41080...|
|f9b58c70-0665-4f5...|01531928|2020-03-19|2020-03-19 00:26:43|POINT (-119.43623...|
|f9b58c70-0665-4f5...|01531928|2020-03-19|2020-03-19 00:26:43|POINT (-119.43623...|
|fb768b89-b92a-4f0...|01531402|2020-03-18|2020-03-18 06:30:43|POINT (-122.22106...|
|fb768b89-b92a-4f0...|01531402|2020-03-18|2020-03-18 06:30:43|POINT (-122.22106...|
|fb768b89-b92a-4f0...|01531402|2020-03-18|2020-03-18 07:57:47|POINT (-122.22102...|
|fb768b89-b92a-4f0...|01531402|2020-03-18|2020-03-18 07:57:47|POINT (-122.22102...|
|a32f727d-566b-4ad...|01529224|2020-03-18|2020-03-18 14:38:13|POINT (-122.59499...|
|a32f727d-566b-4ad...|01529224|2020-03-18|2020-03-18 14:38:13|POINT (-122.59499...|
|ad7e4d7e-f8e5-45b...|01529224|2020-03-18|2020-03-18 07:58:51|POINT (-122.14959...|
|ad7e4d7e-f8e5-45b...|01529224|2020-03-18|2020-03-18 07:58:51|POINT (-122.14959...|
+--------------------+--------+----------+-------------------+--------------------+
1

There are 1 best solutions below

0
On

The most obvious reason would be if the points or zones in source tables are not unique. If there are duplicate points or zones, you obviously get duplicates

Check source tables for uniqueness:

SELECT p.ID, p.date count(*) c
FROM gpsPingTable as p 
GROUP BY ID, data HAVING c > 1

This will report duplicate points. And this will report duplicate zones:

SELECT z.COUNTYNS as zone, COUNT(*) c
FROM zoneShapes as z
GROUP BY zone HAVING c > 1