R: error installing R package devtools on Linux

While attempting to install the R package “devtools” on my Ubuntu 14.04 laptop, I ran into the following error:


> install.packages("devtools")
Installing package into ‘/home/tfox/R/x86_64-pc-linux-gnu-library/3.2’
(as ‘lib’ is unspecified)
also installing the dependencies ‘mime’, ‘bitops’, ‘brew’, ‘httr’, ‘RCurl’, ‘memoise’, ‘whisker’, ‘evaluate’, ‘rstudioapi’, ‘jsonlite’, ‘roxygen2’

... SNIP ...

* installing *source* package ‘RCurl’ ...
** package ‘RCurl’ successfully unpacked and MD5 sums checked
checking for curl-config... no
Cannot find curl-config
ERROR: configuration failed for package ‘RCurl’
* removing ‘/home/tfox/R/x86_64-pc-linux-gnu-library/3.2/RCurl’
Warning in install.packages :
installation of package ‘RCurl’ had non-zero exit status

ERROR: dependency ‘RCurl’ is not available for package ‘httr’
* removing ‘/home/tfox/R/x86_64-pc-linux-gnu-library/3.2/httr’
Warning in install.packages :
installation of package ‘httr’ had non-zero exit status
ERROR: dependencies ‘httr’, ‘RCurl’ are not available for package ‘devtools’
* removing ‘/home/tfox/R/x86_64-pc-linux-gnu-library/3.2/devtools’
Warning in install.packages :
installation of package ‘devtools’ had non-zero exit status

The error indicates that we’re missing an OS-level dependency: RCurl needs curl-config, which is provided by the development version of libcurl. Let’s install it and try again.

For Ubuntu / Mint / Debian:


$ sudo apt-get install libcurl4-gnutls-dev

For CentOS / Fedora / RHEL:


$ sudo yum -y install libcurl libcurl-devel

Once the libcurl development package is installed, we can re-run install.packages("devtools"), and this time it installs fine.

Starting a Hadoop Cluster in EC2 — Starting the Cluster

For those experimenting with Hadoop, the quickest way to get going is to spin up a cluster in Amazon AWS’s Elastic Compute Cloud, better known as EC2. Of course, this is not just for beginners, as many companies, particularly smaller ones, rely heavily on AWS.

I am going to walk through the step-by-step process of setting up a 4-node Hadoop cluster on EC2. The first part is simply getting the instances up from an image. Of course, the absolute easiest way to go would be to use a pre-built Hortonworks or Cloudera image, but we’ll focus here on using a generic CentOS Linux image and going from there.

There’s really nothing in this part that is intrinsic to Hadoop; we’re just setting up four identical CentOS nodes.

First, sign up for Amazon AWS, and provide your billing information, etc.

Once you’ve done that, go to http://aws.amazon.com/ and log in.

Once there, navigate to the EC2 page as shown.

ec2-1-marked

Then click “Launch Instance.”

ec2-2-marked

Click on “AMI Marketplace”. Select CentOS.

ec2-3-marked

On the next screen, select an xlarge size or larger; Hadoop needs more resources than the smaller instance types provide.

ec2-4-marked

Select the number of instances (in this case, 4).  Press “Review and Launch.”

ec2-5-marked

Select the amount of storage needed. If you are simply testing Hadoop, the default should be fine.

ec2-6-marked

Select a tag for the instances, as shown. The tag will apply to all of the instances, with a number appended.

ec2-7-marked

Confirm the use of SSD.

ec2-8-marked

Review the parameters set for the new instances.

ec2-9-marked

Following this, you’ll need to create a key pair for AWS, unless you already have one (I’m assuming here that you don’t). When you create a new key pair, the private key file will be downloaded to your machine.

ec2-10

Open a shell window (if you run Linux), chmod the key to 400, and move it somewhere safe; your ~/.ssh directory is a good place:


$ cd ~/Downloads
$ chmod 400 my-aws-key-pair.pem
$ mv my-aws-key-pair.pem ~/.ssh/

ec2-11

When you launch the instances, you may see this warning message. It shouldn’t affect your instances, however. Wait a minute, then navigate back to EC2 to see the running instances.

ec2-error-1

As you can see, your instances are running, but their status is “Initializing.” It will take some time for them to be available. Five minutes or so should be enough.

ec2-12

Scroll over to locate the public IP address for each instance. You will also need the public domain names, as well as the internal domain names and internal IP addresses. Every EC2 instance has two IP addresses and two domain names: one internal and one external. Use the internal addresses when connecting from one EC2 instance to another, since that traffic incurs no bandwidth charges. From the outside world (such as your PC or Mac), use the external addresses; these are what you’ll use when you SSH in and when you view web pages hosted by the machines from your local machine.

ec2-14

After you select an instance, its details are shown at the bottom of the page. This is where you retrieve the internal IP address, the internal domain name, and the public domain name.

ec2-15

Now you can use your newly generated SSH key to connect to one of your new instances. In this case the username is centos, although for other images the default user may be root or ec2-user. From the command line, it looks like this:


$ ssh -i ~/.ssh/my-aws-key-pair.pem centos@your.public.ip.address

Windows users can use PuTTY or Bitvise as an SSH client. Mac users can use the command line as shown, or another Mac SSH client.

You can see the prompt once you are in. From here, you can proceed with installing Hadoop, or whatever other tools you prefer.

ec2-16

Eventually, you’ll probably terminate your instances by right-clicking and selecting “Terminate.” Once you do, you’ll see this. Don’t do it before it’s necessary, however.

ec2-17

Proceed onward to our next tutorial on installing Hadoop (once it becomes available).

Building Spark with Maven: the PermGen space error and Java heap space error

Spark doesn’t build properly with Maven out of the box. While this is clearly stated on the Building Spark page of the documentation, it’s easy to miss.

You’ll get an error like the following:

$ mvn clean package
.....

[INFO] Compiling 203 Scala sources and 9 Java sources to /Users/me/Development/spark/core/target/scala-2.10/classes...
[ERROR] PermGen space -> [Help 1]

[INFO] Compiling 203 Scala sources and 9 Java sources to /Users/me/Development/spark/core/target/scala-2.10/classes...
[ERROR] Java heap space -> [Help 1]

Basically, this says that the PermGen space and/or the Java heap space have been exceeded. Java heap space issues are so common that I instinctively bump the heap up when I see them (e.g., -Xmx2g), but the PermGen space has to be set separately.

Fortunately, the solution is pretty straightforward: give Maven more memory before building.
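The Building Spark documentation recommends doing this through MAVEN_OPTS. At the time (Spark 1.x on Java 7, where PermGen still exists), the suggested settings looked roughly like this; the exact sizes vary by version, so check the docs for yours:


$ export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
$ mvn clean package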

That should fix the problem and allow Spark to be built properly. I also had issues building Spark in my home directory, which is encrypted. That was a second issue, and I posted a workaround for it in an earlier post.

Clustering with K-Means with Spark and MLlib

MLlib provides a parallelized k-means implementation (using an initialization scheme called k-means||, a parallelized variant of k-means++), which gives us an efficient clustering algorithm on Spark.  Clustering is an unsupervised machine learning technique that helps us discover natural patterns in data.

What is k-means?  It’s about the simplest clustering algorithm out there.  It starts with a given number (k) of centroids, typically placed randomly, then repeatedly assigns each point to its nearest centroid and moves each centroid to the mean of its assigned points, iterating until the centroids converge.  Each centroid and the points assigned to it form a cluster.  It works great for data which is clusterable by circles/spheres (actually hyperspheres).  For data which has more convoluted patterns, such as rings and other shapes, we can use hierarchical clustering instead.

The other limitation is that we have to know k in advance.  Sometimes we know what we’re looking for, but usually we don’t, which means we often end up running the clustering many times and measuring how well the resulting clusters fit the data.

If we have a dataset we want to cluster, the first step is to convert it to one of MLlib’s vector classes.   MLlib offers two: Vectors.dense and Vectors.sparse.  The latter is very good for one-hot encoding (is_red, is_blue, is_green, etc.), and especially for encoding text vectors, such as tf-idf or word2vec.  I’ll talk in another post about how to vectorize text.
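For a concrete sense of the two constructors, here’s a minimal sketch (the values are made up purely for illustration):


import org.apache.spark.mllib.linalg.Vectors

// dense: every component is stored explicitly
val dense = Vectors.dense(21.0, 6.0, 160.0)

// sparse: total length, the indices of the non-zero entries, and their values;
// a natural fit for one-hot or tf-idf style features
val sparse = Vectors.sparse(5, Array(0, 3), Array(1.0, 1.0))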

For now, let’s just use Vectors.dense, with a dataset near and dear to R users: mtcars.  It’s one of R’s standard example datasets, giving a few statistics on a handful of car models.  We can extract mtcars from R with write.csv and work from the resulting mtcars.csv, removing the header row for simplicity.  Of course, it’d be silly to use Spark for such a tiny dataset (as we could easily just use R), but it serves the purpose of an example.
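Here’s a rough sketch of what that looks like from the spark-shell, assuming the header-less mtcars.csv sits in the working directory (the file path and variable names are just illustrative):


import org.apache.spark.mllib.linalg.Vectors

// each line of the header-less CSV looks like:
// "Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
val raw = sc.textFile("mtcars.csv")

// drop the quoted car name in the first column and parse the rest as doubles
val vectors = raw.map { line =>
  Vectors.dense(line.split(",").tail.map(_.toDouble))
}
vectors.cache()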

Great. Now we have some vectors.   We had to drop the name associated with each car and so our vectors are nameless — more on that later.

Now we need to make a KMeansModel object.   This may seem strange at first glance: in R and Mahout there’s no model associated with k-means, since there’s no training involved in an unsupervised ML algorithm.  Probably for the sake of consistency, MLlib treats k-means as a model that has to be “trained” with data and can then be applied to new data using predict(), as if it were performing classification. While odd, this is actually a bonus because it easily allows us to use our clusters as a classification model for unseen data.
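Continuing from the vectors above, the training call is a one-liner; a sketch (the k of 2 and the 20-iteration cap are just the values used for this walkthrough):


import org.apache.spark.mllib.clustering.KMeans

// fit a KMeansModel to the mtcars vectors: k = 2, at most 20 iterations
val clusters = KMeans.train(vectors, 2, 20)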

So clusters in this case is the KMeansModel object.  We chose a k value of 2, which probably isn’t going to give good results with this dataset.  How do we check that? We can use computeCost().
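A minimal sketch, reusing the clusters model and vectors RDD from above:


// sum of squared distances from each point to its assigned cluster center
val wssse = clusters.computeCost(vectors)
println(s"WSSSE for k=2: $wssse")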

The Spark documentation calls this cost the WSSSE (Within Set Sum of Squared Errors).   Typically it decreases as k gets higher, but higher values of k may not produce very useful clusters (lots of clusters-of-one, for instance).

Intuitively, we should set k just before the point where diminishing returns set in, an approach sometimes called the “elbow method.” But we should also watch for where we start getting lots of small, meaningless clusters.
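One way to eyeball the elbow is to train a model over a range of k values and print the cost for each; a rough sketch (the range here is arbitrary):


import org.apache.spark.mllib.clustering.KMeans

// recompute the model and its cost for several candidate values of k
(2 to 10).foreach { k =>
  val model = KMeans.train(vectors, k, 20)
  println(s"k=$k  WSSSE=${model.computeCost(vectors)}")
}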

So now we have a KMeansModel fit with our value of k. What does that give us? It assigns a number to each cluster (for k=2, just 0 and 1), but remember that we dropped the name from each vector. So we know which cluster each vector is in, but how do we relate that back to the original data?  Having done this exercise in Mahout, I went looking for the NamedVector class, which unfortunately doesn’t exist in Spark; the Spark team apparently doesn’t feel one is needed.

In Spark, the right way to do this is to join the vectors back to the original data. To do that, we need to create pairs of names and vectors.
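A sketch of that, again reusing the raw RDD and the clusters model from the earlier snippets (the variable names are just illustrative):


import org.apache.spark.mllib.linalg.Vectors

// pair each car's name with its feature vector
val namedVectors = raw.map { line =>
  val fields = line.split(",")
  (fields.head.replace("\"", ""), Vectors.dense(fields.tail.map(_.toDouble)))
}

// look up the cluster assignment for each named vector
val assignments = namedVectors.mapValues(v => clusters.predict(v))
assignments.collect().foreach { case (name, cluster) =>
  println(s"$name is in cluster $cluster")
}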

So that gives us our clustering results. As we said before, we can call predict() on new data that we might have, to see which cluster it would correspond to.
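For instance, a minimal sketch using made-up numbers for a hypothetical new car (the eleven values follow the same column order as mtcars):


// predict the cluster for a previously unseen vector
val newCar = Vectors.dense(30.0, 4.0, 95.0, 63.0, 3.9, 1.9, 20.0, 1.0, 1.0, 4.0, 1.0)
println(s"new car falls in cluster ${clusters.predict(newCar)}")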

The new data doesn’t actually change the model, however; it is frozen in time until we train a new one. There is another class, StreamingKMeans, which will actually adjust the clusters to new data, so we can use it in a streaming fashion. We’ll talk about that another time.

Installing Spark on encrypted file systems: Resolving the “File Name Too Long” error.

I mostly run Linux on the systems I work on. While I prefer CentOS for server and cloud instances, due to better compatibility with Hadoop, I’m partial to Ubuntu, and especially Linux Mint, as my workstation Linux, particularly on my laptop.

Ubuntu and Mint both use eCryptfs (http://ecryptfs.org) for home folder encryption, which is optional. I use it because I don’t like the idea of my files being accessible should I lose track of my laptop.

While eCryptfs is great, it creates some complications when building Spark, because it apparently has limitations with very long filenames.

When I try to build Spark (for example, with Maven), I get an error after much Maven spam:


$ mvn -DskipTests clean package
....
....
[error] uncaught exception during compilation: java.io.IOException
[error] File name too long

One easy workaround is not to build Spark in the home folder; it doesn’t need to be there, and putting it in /usr/local, /opt, or somewhere similar is probably fine. But for most people the home folder is the first place it’s going to go, so it’s nice to have another workaround.

There is a jira on the subject here:
https://issues.apache.org/jira/browse/SPARK-4820

The solution is to edit pom.xml, and add the following lines in compile options:

<arg>-Xmax-classfile-name</arg>
<arg>128</arg>

For me, on Spark 1.3.0, that worked out to line 1130 of pom.xml.

If you use sbt instead of Maven to build, the solution per the JIRA is similar:

scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),