Is the streaming k-means clustering in Spark's MLlib library supervised or unsupervised?


I know that k-means clustering is one of the simplest unsupervised learning algorithms. Looking at the source code of the streaming k-means clustering packaged in MLlib, I find the terms training data, test data, predict, and train.

This makes me think that streaming k-means might be supervised. So, is this algorithm supervised or unsupervised?

This is a code example of using streaming k-means:

package org.apache.spark.examples.mllib

import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.{Seconds, StreamingContext}


object StreamingKMeansExample {

    def main(args: Array[String]): Unit = {
        if (args.length != 5) {
            System.err.println("Usage: StreamingKMeansExample " +
                "<trainingDir> <testDir> <batchDuration> <numClusters> <numDimensions>")
            System.exit(1)
        }

        // "local[2]" runs Spark locally with two threads; "localhost" is not a valid master URL
        val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingKMeansExample")
        val ssc = new StreamingContext(conf, Seconds(args(2).toLong))

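        // Training stream: unlabeled feature vectors, e.g. "[1.0,2.0,3.0]"
        val trainingData = ssc.textFileStream(args(0)).map(Vectors.parse)
        // Test stream: labeled points, e.g. "(1.0,[1.0,2.0,3.0])"; the label is only carried along as a key
        val testData = ssc.textFileStream(args(1)).map(LabeledPoint.parse)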

        val model = new StreamingKMeans()
            .setK(args(3).toInt)
            .setDecayFactor(1.0)
            .setRandomCenters(args(4).toInt, 0.0)

        model.trainOn(trainingData)
        model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

        ssc.start()
        ssc.awaitTermination()
    }
}

There are 2 best solutions below


It depends, but most would classify k-means as unsupervised.

Other than the number of clusters, which you specify, k-means "learns" the clusters on its own, without any information about which cluster an observation belongs to. It can be made semi-supervised, for example when a few known labels are used to seed or constrain the clusters, which is why the answer is "it depends".

This describes k-means in general; I believe Spark's implementation follows the same approach: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
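
To make the "learns the clusters on its own" point concrete, here is a minimal sketch of plain k-means (Lloyd's algorithm) in Scala with no Spark dependency. It is not Spark's implementation; the object name `KMeansSketch`, its method names, and the choice of the first k points as initial centers are assumptions made for illustration. Note that `train` receives only the points and k, never a label.

```scala
object KMeansSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center closest to p.
  def nearest(centers: Seq[Point], p: Point): Int =
    centers.indices.minBy(i => sqDist(centers(i), p))

  // Lloyd's algorithm: the only inputs are unlabeled points and k.
  // Assumption: the first k points are used as initial centers so the
  // sketch is deterministic; real implementations initialize differently.
  def train(points: Seq[Point], k: Int, iterations: Int): Seq[Point] = {
    var centers: Seq[Point] = points.take(k)
    for (_ <- 1 to iterations) {
      // Assignment step: group every point under its nearest center.
      val groups = points.groupBy(p => nearest(centers, p))
      // Update step: move each center to the mean of its assigned points.
      centers = centers.indices.map { i =>
        groups.get(i) match {
          case Some(members) =>
            val dim = members.head.length
            Array.tabulate(dim)(d => members.map(_(d)).sum / members.size)
          case None => centers(i) // an empty cluster keeps its old center
        }
      }
    }
    centers
  }
}
```

Calling `KMeansSketch.train(points, 2, 5)` on two well-separated groups of points returns one center per group, even though nothing told the algorithm which point belongs where.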


K-means (streaming or regular) is a clustering algorithm. Clustering algorithms are by definition unsupervised. That is, you don't know the natural groupings (labels) of your data and you want to automatically group similar entities together.

The term train here refers to "learning" the clusters (centroids).

The term predict refers to predicting which cluster a new point belongs to.
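
As a rough illustration of those two terms in the streaming setting, here is a sketch in plain Scala (no Spark dependency). It is not MLlib's code: `StreamingKMeansSketch`, `Model`, and `trainOnBatch` are hypothetical names, and the per-batch update follows my reading of the rule described in the MLlib documentation, with the way the weight is carried forward for a decay factor below 1.0 being a simplifying assumption. With a decay factor of 1.0, as in the question's code, each center is simply the running mean of all points assigned to it so far.

```scala
object StreamingKMeansSketch {
  type Point = Array[Double]

  // One center and one weight (number of points seen so far) per cluster.
  final case class Model(centers: Seq[Point], weights: Seq[Double])

  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def nearest(centers: Seq[Point], p: Point): Int =
    centers.indices.minBy(i => sqDist(centers(i), p))

  // "Train": fold one batch of unlabeled points into the current centers.
  // newCenter = (oldCenter * n * decay + sum of batch points) / (n * decay + m)
  def trainOnBatch(model: Model, batch: Seq[Point], decay: Double): Model = {
    val groups = batch.groupBy(p => nearest(model.centers, p))
    val updated = model.centers.indices.map { i =>
      val c = model.centers(i)
      val n = model.weights(i) * decay // discounted weight of past data (assumption)
      groups.get(i) match {
        case Some(members) =>
          val m = members.size.toDouble
          val sum = Array.tabulate(c.length)(d => members.map(_(d)).sum)
          (Array.tabulate(c.length)(d => (c(d) * n + sum(d)) / (n + m)), n + m)
        case None => (c, n) // no new points: the center stays where it is
      }
    }
    Model(updated.map(_._1), updated.map(_._2))
  }

  // "Predict": assign each value to its nearest center. The key is only
  // passed through to the output and never influences the clustering.
  def predictOnValues[K](model: Model, data: Seq[(K, Point)]): Seq[(K, Int)] =
    data.map { case (key, p) => (key, nearest(model.centers, p)) }
}
```

This mirrors how the question's example uses `lp.label`: it is handed to `predictOnValues` as a key so that each prediction can be matched back to its input, not as a target the model learns from.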