Running PySpark with Jupyter Notebooks

Many users would like to run Jupyter notebooks with PySpark. Unfortunately, it’s not as simple as just starting “jupyter notebook”. That will open a Jupyter notebook server, but PySpark code will not run.

What we need to do is run Jupyter from within PySpark. To do that, we can set the following environment variables:
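The standard way to do this (assuming a typical PySpark installation) is with two environment variables that tell PySpark which driver program to launch:

```shell
# Tell PySpark to start Jupyter as its driver program
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
```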

One way to run this would be the following:
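For example, setting the driver variables inline for a single launch (this assumes pyspark lives in the same ~/spark directory as the spark-shell script used later in this post; adjust the path for your setup):

```shell
# Launch PySpark with Jupyter as the driver, in one command
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS='notebook' ~/spark/pyspark
```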

This will open a Jupyter notebook in your browser, in which PySpark code will run against your PySpark shell.

If you’re running on a cluster, you can pass the appropriate arguments to connect to the cluster.
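A sketch of what that might look like for a standalone cluster (the master URL here is hypothetical; substitute your own master host and port):

```shell
# Launch PySpark against a cluster master (URL is hypothetical)
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS='notebook' \
  ~/spark/pyspark --master spark://your-master-host:7077
```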



Using the Spark Shell

Please see my previous post for using Spark.

We are going to go over using the spark shell.

Step 1:  Running the Spark shell

Start the Spark shell, by running

$ ~/spark/spark-shell



You should see something like this:


Step 2: Check out spark UI

Now that your shell is started, you should be able to browse to http://localhost:4040 and check out the Spark UI. Note that this is a different port (4040 vs. 8080) from the previous example.

Step 3: Spark context

Within the Spark shell, the variable sc is the SparkContext. Type sc at the Scala prompt and see what happens. Your output might look like this:

To see all the methods available on the sc variable, type sc. and press TAB twice. This will show all the available methods on sc. (This only works in the Scala shell for now.)

Try the following:

==> Print the application name: sc.appName

==> Find the ‘Spark master’ for the shell: sc.master
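At the scala> prompt, the two exercises above look like this (the actual values shown will depend on your setup):

```scala
sc.appName   // application name, e.g. "Spark shell"
sc.master    // master URL, e.g. "local[*]" when running locally
```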

Step 4: Load a file

Let’s load an example file:

twinkle twinkle little star
how I wonder what you are
up above the world so high
like a diamond in the sky
twinkle twinkle little star

Let’s load the file:
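A minimal sketch in the Scala shell (the filename twinkle.txt is an assumption; use whatever path you saved the lyrics to):

```scala
// Load the text file as an RDD of lines (filename is hypothetical)
val f = sc.textFile("twinkle.txt")
```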

==> What is the ‘type’ of f ?
hint : type f on the console

==> Inspect Spark Shell UI on port 4040, do you see any processing done? Why (not)?

==> Print the first line / record from RDD
hint : f.first()

==> Again, inspect Spark Shell UI on port 4040, do you see any processing done? Why (not)?

==> Print first 3 lines of RDD
hint : f.take(???) (provide the correct argument to take function)

==> Again, inspect Spark Shell UI on port 4040, do you see any processing done? Why (not)?

==> Print all the content from the file
hint : f.collect()

==> How many lines are in the file?
hint : f.count()

==> Inspect the ‘Jobs’ section in the Shell UI (in your browser)
Also inspect the event timeline

Welcome to the wonderful world of Machine Learning and Big Data!

I’m Timothy Fox, consultant, trainer, and enthusiast. I’m passionate about machine learning and big data, and I’m fortunate enough to be able to pursue my passions on a daily basis.

I’ve long been a fan of R and Python, which I’ve used, in addition to my background in Java, to solve challenging problems. Not having a stats or math background (other than that of an engineering student), I’ve learned how to use these tools to accomplish such tasks.

Four years ago, I discovered Hadoop, an amazing framework combining distributed data with distributed processing, neatly solving many of the most vexing distributed-systems problems we learned about in university.

Hadoop isn’t going anywhere, but much has been built on top of that platform; Hadoop is now a diverse ecosystem with many components. One of the most exciting is Spark, and its MLlib component in particular has captured my interest. Many of my blog posts in the near term will be about MLlib.

So that’s this blog, in a nutshell.