I would like to work with RDD pairs of Tuple2<byte[], obj>, but byte[]s with the same contents are considered different values because their references differ. I didn't see any way to pass in a custom comparer. I could convert the byte[] into a String with an explicit charset, but I'm wondering if there's a more efficient way.
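To illustrate the problem outside Spark: byte[] inherits identity-based equals and hashCode from Object, so two arrays with identical contents land under two distinct keys in any hash-based structure. A minimal standalone demonstration (class name is my own):

```java
import java.util.HashMap;
import java.util.Map;

public class ByteArrayKeyDemo {
    public static void main(String[] args) {
        byte[] a = {1, 2, 3};
        byte[] b = {1, 2, 3};

        // byte[] does not override equals/hashCode: identity, not contents
        System.out.println(a.equals(b));   // false
        System.out.println(a == b);        // false

        // So identical contents end up under two distinct keys
        Map<byte[], Integer> counts = new HashMap<>();
        counts.merge(a, 1, Integer::sum);
        counts.merge(b, 1, Integer::sum);
        System.out.println(counts.size()); // 2, not 1
    }
}
```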
Custom comparers are insufficient because Spark uses the hashCode of the objects to organize keys in partitions. (At least the HashPartitioner will do that; you could provide a custom partitioner that can deal with arrays.) Wrapping the array to provide proper equals and hashCode should address the issue. A lightweight wrapper should do the trick:

A quick test: