I'm new to Big Data and Apache Spark (and an undergrad doing work under a supervisor).
Is it possible to apply a function (i.e. a spline) to only partitions of the RDD? I'm trying to implement some of the work in the paper here.
The book "Learning Spark" seems to indicate that this is possible, but doesn't explain how.
"If you instead have many small datasets on which you want to train different learning models, it would be better to use a single- node learning library (e.g., Weka or SciKit-Learn) on each node, perhaps calling it in parallel across nodes using a Spark
map()."
Im not 100% sure that I am following, but there are a number of partition methods, such as
mapPartitions. These operators hand you theIteratoron each node, and you can do whatever you want to the data and pass it back through a newIterator