Running PySpark with Jupyter Notebooks

Many users would like to run Jupyter notebooks with PySpark. Unfortunately, it’s not as simple as just starting “jupyter notebook”. That will open a Jupyter notebook server, but PySpark code will not run.

What we need to do is run Jupyter from within PySpark. To do that, we can set the following environment variables:
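The standard way to do this (assuming a typical PySpark installation) is with two environment variables that tell PySpark which driver program to launch:

```shell
# Tell PySpark to start Jupyter as its driver program
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
```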

One way to run this would be the following:
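For example, setting the driver variables inline for a single launch (this assumes pyspark lives in the same ~/spark directory as the spark-shell script used later in this post; adjust the path for your setup):

```shell
# Launch PySpark with Jupyter as the driver, in one command
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS='notebook' ~/spark/pyspark
```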

This will open a Jupyter notebook in your browser, in which PySpark code will run against your PySpark shell.

If you’re running on a cluster, you can pass the appropriate arguments to connect to the cluster.
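A sketch of what that might look like for a standalone cluster (the master URL here is hypothetical; substitute your own master host and port):

```shell
# Launch PySpark against a cluster master (URL is hypothetical)
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS='notebook' \
  ~/spark/pyspark --master spark://your-master-host:7077
```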



Using the Spark Shell

Please see my previous post for using Spark.

We are going to go over using the spark shell.

Step 1:  Running the Spark shell

Start the Spark shell, by running

$ ~/spark/spark-shell



You should see something like this:


Step 2: Check out spark UI

Now that your shell is started, you should be able to browse to http://localhost:4040 and check out the Spark UI. Note that this is a different port (4040 vs. 8080) from the previous example.

Step 3: Spark context

Within the Spark shell, the variable sc is the SparkContext. Type sc at the Scala prompt and see what happens. Your output might look like this:

To see all the methods available on the sc variable, type sc. and press TAB twice. This will show all the available methods on sc. (This only works in the Scala shell for now.)

Try the following:

==> Print the application name: sc.appName

==> Find the ‘Spark master’ for the shell: sc.master
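At the scala> prompt, the two exercises above look like this (the actual values shown will depend on your setup):

```scala
sc.appName   // application name, e.g. "Spark shell"
sc.master    // master URL, e.g. "local[*]" when running locally
```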

Step 4: Load a file

Let’s load an example file:

twinkle twinkle little star
how I wonder what you are
up above the world so high
like a diamond in the sky
twinkle twinkle little star

Let’s load the file:
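A minimal sketch in the Scala shell (the filename twinkle.txt is an assumption; use whatever path you saved the lyrics to):

```scala
// Load the text file as an RDD of lines (filename is hypothetical)
val f = sc.textFile("twinkle.txt")
```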

==> What is the ‘type’ of f ?
hint : type f on the console

==> Inspect Spark Shell UI on port 4040, do you see any processing done? Why (not)?

==> Print the first line / record from RDD
hint : f.first()

==> Again, inspect Spark Shell UI on port 4040, do you see any processing done? Why (not)?

==> Print first 3 lines of RDD
hint : f.take(???) (provide the correct argument to take function)

==> Again, inspect Spark Shell UI on port 4040, do you see any processing done? Why (not)?

==> Print all the content from the file
hint : f.collect()

==> How many lines are in the file?
hint : f.count()

==> Inspect the ‘Jobs’ section in the Shell UI (in your browser)
Also inspect the event timeline

Welcome to the wonderful world of Machine Learning and Big Data!

I’m Timothy Fox, consultant, trainer, and enthusiast. I’m passionate about machine learning and big data, and I’m fortunate enough to be able to pursue my passions on a daily basis.

I’ve long been a fan of R and Python, which I’ve used, in addition to my background in Java, to solve challenging problems. Not having a stats or math background (other than that of an engineering student), I’ve learned how to use these tools to accomplish such tasks.

Four years ago, I discovered Hadoop, an amazing framework combining distributed data with distributed processing, neatly solving many of the most vexing distributed-systems problems we learned about in university.

Hadoop isn’t going anywhere, but much has been built on top of that platform; Hadoop is now a diverse ecosystem with many components. One of the most exciting is Spark, and its MLlib component in particular has captured my interest. Many of my blog posts in the near term will be about MLlib.

So that’s this blog, in a nutshell.