I have user hobby data(RDD[Map[String, Int]]) like:
("food" -> 3, "music" -> 1),
("food" -> 2),
("game" -> 5, "twitch" -> 3, "food" -> 3)
I want to calculate stats of them, and represent the stats as Map[String, Array[Int]] while the array size is 5, like:
("food" -> Array(0, 1, 2, 0, 0),
"music" -> Array(1, 0, 0, 0, 0),
"game" -> Array(0, 0, 0, 0, 1),
"twitch" -> Array(0, 0, 1, 0 ,0))
foldLeft seems to be the right solution, but RDD cannot use it, and the data is too big to convert to List/Array to use foldLeft, how could I do this job?
The trick is to replace the Array in your example by a class that contains the statistic you want for some part of the data, and that can be combined with another instance of the same statistic (covering other part of the data) to provide the statistic on the whole data.
For instance, if you have a statistic that covers the data 3, 3, 2 and 5, I gather it would look something like
(0, 1, 2, 0, 1)
and if you have another instance covering the data 3,4,4 it would look like(0, 0, 1, 2,0)
. Now all you have to do is define a+
operation that let you combine(0, 1, 2, 0, 1) + (0, 0, 1, 2, 0) = (0,1,3,2,1)
, covering the data 3,3,2,5 and 3,4,4.Let's just do that, and call the class
StatMonoid
:This class contains the sequence of 5 counters, and define a
+
operation that let it be combined with other counters.We also need a convenience method to build it, this could be a constructor in
StatMonoid
, in the companion object, or just a plain method, as you prefer:This allows us to easily compute instance of the statistic covering one single piece of data, for example:
And it also allows us to combine them together simply by adding them:
Now to apply this to your example, let's assume we have the data you mention as an RDD of Map:
All we need to do to find the stat for each kind of food, is to flatten the data to get ("foodId" -> id) tuples, transform each id into an instance of
StatMonoid
above, and finally combine them all together for each kind of food:Which yields:
Now, for the side story, if you wonder why I call the class
StateMonoid
it's simply because... it is a monoid :D, and a very common and handy one, called product . In short, monoids are just thingies that can be combined with each other in associative fashion, they are super common when developing in Spark since they naturally define operations that can be executed in parallel on the distributed slaves, and gathered together into a final result.