I am new to Python and also Spark. I've an pair RDD containing (key, List) but some of the values are duplicate. RDD is of the form (zipCode,streets) I want a pair RDD which does not contain duplicates. I am trying to achieve it using python. Can anyone please help on this.
(zipcode, streets)
streetsGroupedByZipCode = zipCodeStreetsPairTuple.groupByKey()
dayGroupedHosts.take(2)
[(123456, <pyspark.resultiterable.ResultIterable at 0xb00518ec>),
(523900, <pyspark.resultiterable.ResultIterable at 0xb005192c>)]
zipToUniqueStreets = streetsGroupedByZipCode.map(lambda (x,y):(x,y.distinct()))
Above one does not work
I'd do something like this :
distinct on a tuple doesn't work as you said, so group your list by tuple, and get only the first element at the end.
gives :