I know that k-means clustering is the one of simplest unsupervised learning algorithm. Looking at the source code of streaming k-means clustering packaged in MLlib, I find the terms: training data, test data, predict, and train.
This makes me think that this streaming K-means might be supervised. So, is this algorithm supervised or unsupervised?
This is a code example of using streaming k-means:
package org.apache.spark.examples.mllib
import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.{Seconds, StreamingContext}
object StreamingKMeansExample {
def main(args: Array[String]) {
if (args.length != 5) {
System.err.println( "Usage: StreamingKMeansExample " +
"<trainingDir> <testDir> <batchDuration> <numClusters> <numDimensions>")
System.exit(1)
}
val conf = new SparkConf().setMaster("localhost").setAppName
("StreamingKMeansExample")
val ssc = new StreamingContext(conf, Seconds(args(2).toLong))
val trainingData = ssc.textFileStream(args(0)).map(Vectors.parse)
val testData = ssc.textFileStream(args(1)).map(LabeledPoint.parse)
val model = new StreamingKMeans().setK(args(3).toInt)
.setDecayFactor(1.0)
.setRandomCenters(args(4).toInt, 0.0)
model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
ssc.start()
ssc.awaitTermination()
}
}
K-means (streaming or regular) is a clustering algorithm. Clustering algorithms are by definition unsupervised. That is, you don't know the natural groupings (labels) of your data and you want to automatically group similar entities together.
The term
trainhere refers to "learning" the clusters (centroids).The term
predictrefers to predicting which cluster a new point belongs to.