I know that k-means clustering is one of the simplest unsupervised learning algorithms. Looking at the source code of the streaming k-means clustering packaged in MLlib, I find the terms training data, test data, predict, and train.
This makes me think that streaming k-means might be supervised. So, is this algorithm supervised or unsupervised?
This is a code example of using streaming k-means:
package org.apache.spark.examples.mllib

import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingKMeansExample {
  def main(args: Array[String]): Unit = {
    if (args.length != 5) {
      System.err.println(
        "Usage: StreamingKMeansExample " +
          "<trainingDir> <testDir> <batchDuration> <numClusters> <numDimensions>")
      System.exit(1)
    }
    // Run locally with two threads ("localhost" is not a valid master URL).
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingKMeansExample")
    val ssc = new StreamingContext(conf, Seconds(args(2).toLong))
    // Training stream: one vector per line.
    val trainingData = ssc.textFileStream(args(0)).map(Vectors.parse)
    // Test stream: one labeled point per line.
    val testData = ssc.textFileStream(args(1)).map(LabeledPoint.parse)
    val model = new StreamingKMeans()
      .setK(args(3).toInt)
      .setDecayFactor(1.0)
      .setRandomCenters(args(4).toInt, 0.0)
    model.trainOn(trainingData)
    // Prints (label, predicted cluster index) pairs for each batch.
    model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
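It depends on how you use the terms, but k-means, including the streaming version, is generally classified as unsupervised. In MLlib, trainOn consumes unlabeled vectors and only updates the cluster centers, while predictOnValues assigns each point to its nearest center and passes the label through as a key. The labels are never used to fit the model.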
This applies to k-means in general, and I believe Spark follows the same approach: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
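To make the "train" and "predict" terminology concrete, here is a minimal plain-Scala sketch of one streaming k-means step, with no Spark dependency. It is not MLlib's code: the names StreamingKMeansSketch, predict, and update are made up for illustration, and the update uses the decay-weighted centroid rule described in the Spark documentation, without MLlib's extra handling of dying clusters. The point to notice is that neither function ever takes a label.

```scala
// Illustrative sketch only; these names are hypothetical, not MLlib APIs.
object StreamingKMeansSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // "Predict": index of the nearest center. No label is involved.
  def predict(centers: Array[Point], p: Point): Int =
    centers.indices.minBy(i => sqDist(centers(i), p))

  // "Train" on one batch: each center becomes the weighted average of its
  // old position (weight = decay * previous weight) and the batch points
  // assigned to it. Only feature vectors are used.
  def update(
      centers: Array[Point],
      weights: Array[Double],
      batch: Seq[Point],
      decay: Double): (Array[Point], Array[Double]) = {
    val assigned = batch.groupBy(p => predict(centers, p))
    val updated = centers.indices.map { i =>
      val oldW = weights(i) * decay
      assigned.get(i) match {
        case None => (centers(i), oldW) // no points this batch: only discount the weight
        case Some(pts) =>
          val m = pts.size.toDouble
          // Component-wise sum of the points assigned to center i.
          val sum = pts.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
          val moved = centers(i).zip(sum).map { case (c, s) => (c * oldW + s) / (oldW + m) }
          (moved, oldW + m)
      }
    }
    (updated.map(_._1).toArray, updated.map(_._2).toArray)
  }

  def main(args: Array[String]): Unit = {
    val centers = Array(Array(0.0, 0.0), Array(10.0, 10.0))
    val weights = Array(1.0, 1.0)
    val batch = Seq(Array(1.0, 1.0), Array(9.0, 11.0), Array(11.0, 9.0))
    val (newCenters, _) = update(centers, weights, batch, decay = 1.0)
    newCenters.foreach(c => println(c.mkString("[", ", ", "]")))
    println(predict(newCenters, Array(8.0, 8.0)))
  }
}
```

In the question's example, the label in each LabeledPoint plays the role of a key that travels alongside the prediction, so you can compare the predicted cluster index with a known grouping afterwards. It has no counterpart in the sketch above because the algorithm itself does not need it.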