For those experimenting with Hadoop, the quickest way to get going is to spin up a cluster in Amazon AWS’s Elastic Compute Cloud, better known as EC2. Of course, this is not just for beginners, as many companies, particularly smaller ones, rely heavily on AWS.
I am going to walk through the step-by-step process of setting up a 4-node Hadoop cluster on EC2. The first part is simply to set up an image. Of course, the absolute easiest way to go would be to use a pre-built Hortonworks or Cloudera image, but we’ll focus here on starting from a generic CentOS Linux image.
Nothing in this part is intrinsic to Hadoop. This is just setting up four identical CentOS nodes.
First, sign up for Amazon AWS, and provide your billing information, etc.
Once you’ve done that, go to http://aws.amazon.com/ and log in.
Once there, navigate to the EC2 page as shown.
On the EC2 page, click “Launch Instance”.
Click on “AWS Marketplace”, then find and select CentOS.
On the next screen, select an xlarge instance size or larger. Hadoop’s daemons are memory-hungry, and smaller sizes tend to struggle.
Select the number of instances (in this case, 4). Press “Review and Launch.”
Select the amount of storage needed. If you are simply testing Hadoop, the default should be fine.
Select a tag for the instance, as shown. The tag will apply to all the instances, with a number appended.
Confirm the use of SSD.
Review the parameters set for the new instances.
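For readers who prefer the command line, the console steps above can be roughly approximated with the AWS CLI. This is only a sketch: it assumes the AWS CLI is installed and configured, and the function name launch_cluster and its three arguments (AMI ID, instance type, key pair name) are placeholders that do not come from this walkthrough.

```shell
# Sketch only: assumes the AWS CLI is installed and configured (aws configure).
# launch_cluster is a hypothetical helper; its arguments are values you supply.
launch_cluster() {
  ami_id="$1"         # ID of the CentOS AMI you picked
  instance_type="$2"  # an xlarge (or larger) instance type
  key_name="$3"       # name of a key pair that already exists in your account
  aws ec2 run-instances \
    --image-id "$ami_id" \
    --count 4 \
    --instance-type "$instance_type" \
    --key-name "$key_name"
}
```

Storage and tags are left at their defaults here; the console remains the simpler route if you want to follow the screens described in this post.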
Following this, you’ll need to create a key pair for AWS, unless you already have one. I’m assuming here that you don’t. Once you create a new key pair, the private key is downloaded to your computer as a .pem file.
Open up a shell window (if you run Linux) and go to the directory where the key was saved. Change the key’s permissions to 400, then move it somewhere safe. Your .ssh directory is a good place.
$ cd ~/Downloads
$ chmod 400 my-aws-key-pair.pem
$ mv my-aws-key-pair.pem ~/.ssh/
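The chmod matters because ssh refuses to use a private key that other users can read. A small check like the one below can confirm the permissions took effect. The helper names file_mode and key_is_private are made up for this example; the stat flags differ between Linux and macOS, so the sketch tries both.

```shell
# Print the octal permission bits of a file.
# Tries GNU stat (Linux) first, then falls back to BSD stat (macOS).
file_mode() {
  stat -c '%a' "$1" 2>/dev/null || stat -f '%Lp' "$1"
}

# Succeed only if the file is readable by its owner and by nobody else (mode 400).
key_is_private() {
  [ "$(file_mode "$1")" = "400" ]
}
```

For example, `key_is_private ~/.ssh/my-aws-key-pair.pem && echo ok` should print ok once the commands above have been run.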
After launching the instances, you may see this warning message. However, it shouldn’t impact your instances. Wait a minute, and then select EC2 once again to see the running instances.
As you can see, your instances are running, but their status is “Initializing.” It will take some time for them to be available. Five minutes or so should be enough.
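Rather than watching the console, you can have the AWS CLI block until the status checks pass. This is a sketch that assumes the AWS CLI is installed and configured; wait_until_ready is a hypothetical wrapper name, and the instance IDs (the i-... values listed in the console) are yours to fill in.

```shell
# Sketch only: assumes a configured AWS CLI.
# Pass the instance IDs from the EC2 console as arguments.
wait_until_ready() {
  # Returns once every listed instance passes its status checks,
  # or fails if the CLI gives up waiting.
  aws ec2 wait instance-status-ok --instance-ids "$@"
}
```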
Scroll over to locate the public IP address for each instance. You will also need the public domain name, as well as the internal domain name and internal IP address. All EC2 instances have two IP addresses and domain names: internal and external. Internal addresses should be used from inside EC2 instances, as you will not need to pay bandwidth charges. From the outside world (like your PC/Mac), you will need to use the external addresses. You will use these when you SSH, and when you view web pages hosted by the machines from your local machine.
After selecting an instance, its details will be shown at the bottom. It is here that you retrieve the internal IP address, the public IP address, and the public domain name.
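If clicking through four instances gets tedious, the same four values can be pulled in one go with the AWS CLI. The sketch below assumes the CLI is installed and configured for the region your instances run in; list_addresses is just an illustrative name.

```shell
# Sketch only: assumes a configured AWS CLI.
# Prints one row per running instance with its external and internal
# IP addresses and domain names.
list_addresses() {
  aws ec2 describe-instances \
    --filters "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].[InstanceId,PublicIpAddress,PublicDnsName,PrivateIpAddress,PrivateDnsName]' \
    --output table
}
```

Note that this lists every running instance in the account and region, not only the four launched here.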
Now you can use your newly generated key to ssh to one of your new instances. In this case, the username is centos, although on some other images the default username may be root or ec2-user. From the command line, it looks like this:
$ ssh -i ~/.ssh/my-aws-key-pair.pem centos@<public-domain-name>
Windows users can use PuTTY or Bitvise as an SSH client. Mac users can use the command line as shown, or other Mac ssh clients.
Once you are in, you will see the shell prompt. From here you can proceed with the installation of Hadoop, or whatever other tools you prefer.
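Since there are four nodes, it can be worth confirming that all of them accept the key before moving on. Here is a minimal sketch; check_nodes is a hypothetical helper, the centos username follows this walkthrough, and the hosts you pass in are the public domain names or IP addresses of your own instances.

```shell
# Run "hostname" on each node to confirm that ssh access works.
# Usage: check_nodes <key-file> <host> [<host> ...]
check_nodes() {
  key="$1"; shift
  for host in "$@"; do
    # ConnectTimeout keeps an unreachable node from stalling the loop.
    ssh -i "$key" -o ConnectTimeout=10 "centos@$host" hostname \
      || echo "could not reach $host" >&2
  done
}
```

On the first connection to each node, ssh will ask you to confirm the host key, so expect a prompt per host.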
Eventually, you’ll probably terminate your instances by right-clicking and selecting terminate. Once you do, you’ll see this. Don’t do this before it’s necessary, however.
Proceed onward to our next tutorial on installing Hadoop (once it becomes available).