Using the Spark Shell

Please see my previous post, Getting Started With Spark (included below), for getting Spark installed and running.

In this post we will go over using the Spark shell.

Step 1:  Running the Spark shell

Start the Spark shell by running:

$ ~/spark/bin/spark-shell

or, on Windows:

c:\spark\bin\spark-shell.cmd

You should see something like this:
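Roughly like this, that is (version numbers, addresses, and the app id will depend on your install):

Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-...).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

scala>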

 

Step 2: Check out the Spark UI

Now that your shell is started, you should be able to browse to http://localhost:4040 and check out the Spark UI.  Note that this is a different port (4040 vs 8080) from the previous example.

Step 3: The Spark context

Within the Spark shell, the variable sc is the SparkContext. Type sc at the Scala prompt and see what happens. Your output might look like this:
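scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@1a2b3c4d

(The res number and the object id after the @ will differ on your machine.)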

To see all the methods available on the sc variable, type sc. and hit TAB twice. This will show all the available methods on sc. (This only works in the Scala shell for now.)

Try the following:

==> Print the application name: sc.appName

==> Find the ‘Spark master’ for the shell: sc.master
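A sketch of what these might return in a local shell (the exact values depend on how the shell was started):

scala> sc.appName
res1: String = Spark shell

scala> sc.master
res2: String = local[*]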


Step 4: Load a file

Let’s load an example file, README.md. For this walkthrough its contents are:

twinkle twinkle little star
how I wonder what you are
up above the world so high
like a diamond in the sky
twinkle twinkle little star

Let’s load the file:
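A minimal sketch, assuming README.md sits in the directory where you started the shell:

scala> val f = sc.textFile("README.md")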

==> What is the ‘type’ of f ?
hint : type f on the console

==> Inspect the Spark Shell UI on port 4040. Do you see any processing done? Why (or why not)?

==> Print the first line / record from the RDD
hint : f.first()

==> Again, inspect the Spark Shell UI on port 4040. Do you see any processing done? Why (or why not)?

==> Print the first 3 lines of the RDD
hint : f.take(???) (provide the correct argument to take function)

==> Again, inspect the Spark Shell UI on port 4040. Do you see any processing done? Why (or why not)?

==> Print all the content from the file
hint : f.collect()

==> How many lines are in the file?
hint : f.count()

==> Inspect the ‘Jobs’ section in the Shell UI (in the browser)
Also inspect the event timeline
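For reference, here is roughly how the whole sequence might look with the five-line example file above (res numbers and formatting will vary):

scala> f.first()
res3: String = twinkle twinkle little star

scala> f.take(3)
res4: Array[String] = Array(twinkle twinkle little star, how I wonder what you are, up above the world so high)

scala> f.collect()
res5: Array[String] = Array(twinkle twinkle little star, how I wonder what you are, up above the world so high, like a diamond in the sky, twinkle twinkle little star)

scala> f.count()
res6: Long = 5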

Getting Started With Spark

Many of you may be interested in how to get going with Spark.   Let’s look at a walkthrough.

Step 0: Get your system ready

You need to download a working JDK if you don’t already have one.  We recommend at least Java 7.

Now, you need to download Scala as well, if you don’t have it already.

Use the following link: https://www.scala-lang.org/download/

Once ready, open a shell window (or Windows command prompt) and make sure scala and sbt are both installed and on your path.

$ scala

$ sbt

Windows users have a few extra steps.

First, download winutils.exe from https://github.com/steveloughran/winutils/raw/master/hadoop-2.6.0/bin/winutils.exe

Put this in a bin directory and add it to your path (for example, C:\winutils\bin\).  This provides support for running Hadoop libraries on Windows.  For more info, please see this link:  https://wiki.apache.org/hadoop/WindowsProblems

Run the following command (again, Windows users only). Create the directory C:\tmp\hive first if it does not already exist, then:

C:\winutils\bin\winutils.exe chmod 777 C:\tmp\hive

Also set the HADOOP_HOME environment variable so Spark can find winutils:

set HADOOP_HOME=c:\winutils

Ok, now we should be ready to run Spark.

Step 1:  Download Spark

You can download the latest spark from http://spark.apache.org/downloads.html

Here is a command you can use (this pulls the 2.1.0 release from the Apache archive):

$ wget https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
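Once the download finishes, unpack the archive (assuming the .tgz landed in your current directory):

$ tar xzf spark-2.1.0-bin-hadoop2.7.tgz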

Once it is unpacked, I like to put the Spark directory in my home directory, named spark.  Mac and Linux users can do that as follows:

$ mv spark-2.1.0-bin-hadoop2.7 ~/spark

Windows users can do something similar from a command prompt

> move spark-2.1.0-bin-hadoop2.7 c:\spark

Step 2: Run Spark

You can run Spark (a standalone master and worker) as follows:

$ ~/spark/sbin/start-all.sh   # Mac/Linux

Windows builds do not include a start-all script; Windows users can start the master and a worker by hand:

c:\spark\bin\spark-class org.apache.spark.deploy.master.Master
c:\spark\bin\spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077

(Give the worker the master URL shown near the top of the master’s log output; it is usually spark://<your-hostname>:7077.)

Spark is now running on your machine!

Step 3: Check out the Spark UI

Go to localhost:8080. This will be your Spark master.  It should look something like this.

Check the following things out:

  1. How many Masters are running?  How Many Workers?
  2. What nodes are they running on?
  3. What is the memory availability?