Why Spark’s MLlib Clustering does not have a named vector (NamedVector) class

I wondered where the equivalent of NamedVector (from Mahout) was in Spark’s MLlib. It’s certainly useful to have some way of identifying a vector other than by the data itself. When running Mahout I always use the NamedVector class when clustering (or the -nv option from the command line), and I always wondered why anyone would do otherwise, because without it, locating the vectors afterward is difficult.

R users will recall a nifty feature that elegantly solves the named
vector problem: the rowname, which can be used to identify rows in a
matrix (or dataframe) before we send it to our machine learning
algorithms. This is better than spending a column of our dataframe
on identifiers.

In the case of Spark, there is no such class or capability. There is
Vectors.dense and Vectors.sparse, each of which takes only items of
type double. No identifiers are allowed, unless one wants to use up
one item of the vector as a numeric key. This is a bad idea, as that
key would then be treated as a variable, and it’s not a variable.

The reasoning behind this, I think, is that a NamedVector is
essentially a key-value pair, with the name as the key and the vector as
the value. Spark does have key-value pairs: a 2-item tuple with
the first item as the key and the second item as the value. So, in
theory, there’s no need for a named vector.
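To make that concrete, here is a minimal sketch of the key-value idea in plain Scala (the names and values are invented for illustration); in Spark proper this would typically be an RDD of such pairs, e.g. via sc.parallelize:

```scala
// A "named vector" is just a (key, value) pair: the name as the key,
// the vector's values as the value. The name never occupies a slot
// in the data itself, so it can't pollute the math.
val namedVectors = Seq(
  ("doc-1", Array(1.0, 0.0, 3.0)),
  ("doc-2", Array(0.0, 2.0, 1.0))
)

// The name rides along with the data wherever the pair goes.
val (name, values) = namedVectors.head
```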

To avoid confusion: there is also something totally different called a
labeled point. LabeledPoint, to be exact. This does include a label,
but the label is used as the outcome variable of a classification or
regression training (or any other supervised algorithm). For example,
for a spam classifier, the outcome variable would be 0.0 = “not spam”
and 1.0 = “spam”. It’s not used to identify the vector. And, like
everything else in the vector, it is of type double, despite the
term “label.”
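For example (a sketch; the feature values are made up), the spam classifier above would build its LabeledPoints like this, with the label carrying the outcome, not an identity:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// The label is the outcome variable, stored as a double:
// 1.0 = "spam", 0.0 = "not spam". It does not identify the vector.
val spam    = LabeledPoint(1.0, Vectors.dense(2.0, 7.0, 1.0))
val notSpam = LabeledPoint(0.0, Vectors.dense(0.0, 1.0, 0.0))
```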

At first glance, it seems like being able to name or key our vectors
would be useful. This is especially true in clustering, where we want
to keep track of which cluster each of our points ends up in.
For classification and regression problems it’s not SO critical, as we’re
really more concerned with training the model to predict future data
than with looking up values from our training set.

It seems that Spark’s creators decided to make clustering run a bit
like classification by training a “model” that can then run predict(),
which to those with prior experience in machine learning is a little
odd. After all, isn’t clustering an unsupervised algorithm? We don’t
usually train a model in unsupervised learning; the word “train,” after
all, implies supervised learning.

The advantage of this approach is that clustering ends up working just
like the prediction models; in fact, clustering “models” can
double as prediction models once trained. And being able to do the
prediction removes the need to store a name with the vector, or
even to store the vectors along with the model at all.

So with kmeans we can do something like this.
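(A self-contained sketch with the RDD-based MLlib API; the data points and the parameters — two clusters, up to 20 iterations — are invented for illustration, and local mode stands in for a real cluster.)

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Local-mode context, just for the example.
val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("kmeans-demo"))

// Two obvious blobs of points.
val data = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

// "Train" the clustering model: k = 2 clusters, up to 20 iterations.
val model = KMeans.train(data, 2, 20)

// predict() assigns a point to its nearest centroid.
val cluster = model.predict(Vectors.dense(0.05, 0.05))
println(s"point belongs to cluster $cluster")
sc.stop()
```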

Note that we’re calling predict() at the end.

This is fine for kmeans, as the final work performed by predict() is
pretty simple: it just has to find the closest centroid. Still, it seems
like we’re doing work twice, once to do the clustering and again to
get the cluster assignment, which the original output of kmeans should
have given us.

In fact, KMeansModel doesn’t give us a list of cluster membership from
the trained model. It has exactly two members:

val k : Int // Number of cluster centroids
val clusterCenters : Array[Array[Double]]
// An array of the cluster centers (centroids)

And two methods:

predict(point: Array[Double]) : Int
// Gives us an integer for cluster membership
computeCost(data: RDD[Array[Double]]) : Double
// Calculate WSSSE (cost) of data

That’s it. So despite the fact that the vectors must have gone into
making the model, they aren’t saved with it. The only way to
get their cluster assignments back is to call predict(). Note how this
is totally different from the way Mahout works.
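So if we do want Mahout-style named results, the pattern is to carry the name in a pair RDD and map predict() over the values. A self-contained sketch (the names, points, and parameters are invented):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("named-kmeans"))

// Keep the "name" as the key of a pair RDD; MLlib never sees it.
val named = sc.parallelize(Seq(
  ("a", Vectors.dense(0.0, 0.0)),
  ("b", Vectors.dense(0.1, 0.1)),
  ("c", Vectors.dense(9.0, 9.0))))

// Train on the values only, then map predict() back over the pairs.
val model = KMeans.train(named.values, 2, 20)
val membership = named.mapValues(model.predict).collect().toMap
// membership is a Map[String, Int] of name -> cluster id
sc.stop()
```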

So that’s the reason. Hope this clears up how clustering vectors work in MLlib.
It was a bit confusing for me too; I think my past knowledge of Mahout hurt me.


