Many of you may be interested in how to get going with Spark. Let’s look at a walkthrough.
Step 0: Get your system ready
You need a working JDK installed, if you don’t already have one. We recommend at least Java 7 (Java 8 is the safer choice with Spark 2.x).
Now you need to download Scala (and sbt) as well, if you don’t have them already.
Use the following link: https://www.scala-lang.org/download/
Once ready, open a shell window (or Windows command prompt) and make sure scala and sbt are both installed and on your path:
$ scala -version
$ sbt sbtVersion
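It’s also worth confirming the JDK itself is on your path (the exact version output will vary with your install):
$ java -version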
Windows users have a few extra steps.
First, download winutils.exe from https://github.com/steveloughran/winutils/raw/master/hadoop-2.6.0/bin/winutils.exe
Put this in a bin directory and add that directory to your path (for example C:\winutils\bin\). This provides support for running Hadoop libraries on Windows. For more info, please see https://wiki.apache.org/hadoop/WindowsProblems
Run the following commands (again, Windows users only). They set HADOOP_HOME, create the Hive scratch directory, and relax its permissions:
> setx HADOOP_HOME c:\winutils
> mkdir c:\tmp\hive
> c:\winutils\bin\winutils.exe chmod 777 c:\tmp\hive
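You can sanity-check the variable from a new command prompt (setx only takes effect in new sessions):
> echo %HADOOP_HOME%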
Ok, now we should be ready to run Spark.
Step 1: Download Spark
You can download the latest Spark from http://spark.apache.org/downloads.html
This walkthrough uses Spark 2.1.0 built for Hadoop 2.7. Grab the spark-2.1.0-bin-hadoop2.7.tgz package from the downloads page, for example (the Apache archive keeps older releases):
$ wget https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
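Then extract the archive (Windows users can unpack the .tgz with a tool such as 7-Zip):
$ tar xvf spark-2.1.0-bin-hadoop2.7.tgz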
Once extracted, I like to rename the Spark directory to simply spark and keep it in my home directory. Mac and Linux users can do that as follows:
$ mv spark-2.1.0-bin-hadoop2.7 ~/spark
Windows users can do something similar from a command prompt (move, unlike rename, can relocate the directory):
> move spark-2.1.0-bin-hadoop2.7 c:\spark
Step 2: Run Spark
You can start a standalone master and worker as follows. Note that the launch scripts live in sbin, not bin:
$ ~/spark/sbin/start-all.sh #Mac/Linux
The standalone launch scripts are shell scripts, so there is no start-all.exe on Windows. Windows users can instead start the master and a worker by hand from two command prompts, giving the worker the master URL that the master prints on startup (usually spark://<your-hostname>:7077):
> c:\spark\bin\spark-class org.apache.spark.deploy.master.Master
> c:\spark\bin\spark-class org.apache.spark.deploy.worker.Worker spark://<your-hostname>:7077
Spark is now running on your machine!
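If you want to double-check that the daemons really started, jps (which ships with the JDK) should list a Master and a Worker process:
$ jps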
Step 3: Check out the Spark UI
Go to http://localhost:8080 in your browser. This is your Spark master’s web UI.
Check the following things out:
- How many Masters are running? How many Workers?
- What nodes are they running on?
- What is the memory availability?
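Finally, a quick way to confirm the whole stack works end to end is to open the Spark shell against your new master and run a tiny job. The master URL below is just a placeholder; use the spark://... URL shown at the top of the UI:
$ ~/spark/bin/spark-shell --master spark://<your-hostname>:7077
scala> sc.parallelize(1 to 100).reduce(_ + _)
The job should come back with 5050, and your shell session should appear under Running Applications in the master UI.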