I wondered where the equivalent of NamedVector (from Mahout) was in Spark’s MLlib. It’s certainly useful to have some way of identifying a vector from some other fashion than the data itself. When running mahout I always use the NamedVector class when clustering (or the -nv option from the command line), and always wondered why anyone would do otherwise because without which it makes locating the vectors difficult.

R users will recall a nifty feature that elegantly solves the named

vector problem: the rowname, that can be used to identify rows in a

matrix (or dataframe) that we use to send to our machine learning

algorithms. This is better than spending a column in our dataframe

for identifying data.

In the case of Spark, there is no such class or capability. There is

Vectors.dense, and Vectors.sparse, each of which takes only items of

type double. No identifiers are allowed, unless one wants to use up

one item of the vector as a numeric key. This is a bad idea; as that

key will then be treated as a variable, and it’s **not** a variable.

The reasoning behind this, I think, is that a NamedVector is

essentially a key-value pair, with the name as a key and the Vector as

a value. Spark does have key values pairs with a 2-item Tuple with

the first item as key, and the second item as the value. So, in

theory, there’s no need for a named vector.

To avoid confusion, there is also something totally different called a

labeled point. LabeledPoint to be exact. This does include a label,

but this label is used as the outcome variable of a classification or

regression training (or any other supervised algorithm). For example,

for a spam classifier, the outcome variable would be 0.0 = “not spam”

and 1.0 = “spam”. It’s not used to identify the vector. And, it like

everything else about the vector it is a double type — despite the

term “label.”

At first glance, it seems like being able to name or key our vectors

would be useful. This is especially true in clustering, where we want

to keep track of which of our clustered points matches with each cluster.

For classification and regression problems it’s not SO critical, as we’re

really more concerned with training the model to predict future data

rather than get the values from our training set.

It seems that Spark’s creators decided to make clustering run a bit

like classification by training a “model” that can then run predict(),

which to those with prior experience in machine learning is a little

odd. After all, isn’t clustering an unsupervised algorithm? We don’t

usually train a model in unsupervised learning, the word “train” after

all implies supervised learning.

The advantage of this approach is that clustering ends up working just

like the prediction models, in fact clustering “models” can then

double as prediction models once trained. And, being able to do the

prediction removes the need for storing a name with the vector. Or

even storing vectors at all along with the model.

So with kmeans we can do something like this.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
// assume we have a csv file with name, feature1, feature2 val data = sc.textFile("data.csv") val NamesAndVectors = data.map { s => val splitData = s.split(',') val doubles = splitData.drop(1).map(_.toDouble) val vectors = Vectors.dense(doubles) (splitData(0), vectors) } // Cluster the data into two classes using KMeans val numClusters = 2 // Value of K in Kmeans val clusters = KMeans.train(NamesAndVectors.map(case (k,v) => v), numClusters, 20) val groupedClusters = NamesandData.groupBy{rdd => clusters.predict(rdd._2)} |

Note that we’re calling predict() at the end.

This is fine for kmeans, as the final work performing the predict() is

pretty simple, just has to find the closest centroid. Still, it seems

like we’re doing work twice, once to do the clustering, and again to

get the cluster assignment, which the original output of kmeans should

have given us.

In fact, KMeansModel doesn’t give us a list of cluster membership from

the trained model. It has exactly two members:

k : Double // Number of cluster centroids

val clusterCenters : Array[Array[Double]]

//An array of the cluster centers (centroids)

And two methods:

predict(point: Array[Double]) : Int

// Gives us an integer for cluster membership

computeCost(data: RDD[Array[Double]]) : Double

// Calculate WSSSE (cost) of data

That’s it. So despite the fact that the vectors must have gone into

making the model, they aren’t saved with the model. The only way to

get them is to call predict(). Note how this is totally different

from the way mahout works.

So that’s the reason. Hope this clears up how clustering vectors works in MLib.

It was a bit confusing for me too — I think my past knowledge of mahout hurt me.

Cheers!