Jupyter with Spark Setup

While recently doing a PoC for work, I learned a thing or two about Jupyter and Spark. Here is a summary of the environment I used:

  • jupyter version 4.2.1
  • spark version 2.0.2 with hadoop2.7
  • python version 2.7.12
  • docker version 1.11.2, build b9f10c9/1.11.2

Simplified version – Docker

# pull the docker image from jupyter org
docker pull jupyter/all-spark-notebook
# run the image with the name & data dir you want to mount
docker run -d --name hans-notebook -v <local_src_dir>:/home/data -p 8888:8888 jupyter/all-spark-notebook

A slightly more complicated version – manual setup

  1. Manual installation of each component
    #create virtualenv
    virtualenv jupyter_poc
    source ./jupyter_poc/bin/activate
    #install jupyter
    pip install jupyter
    #generate jupyter config file by running
    jupyter notebook --generate-config
    #download spark
    wget http://d3kbcqa49mib13.cloudfront.net/spark-2.0.2-bin-hadoop2.7.tgz
    #extract the archive
    tar -zxvf spark-2.0.2-bin-hadoop2.7.tgz
  2. In the Jupyter config file ~/.jupyter/jupyter_notebook_config.py, we need some simple initial settings

    #listen on all interfaces, not just localhost
    c.NotebookApp.ip = '*'
    #disable browser opening since it's not localhost dev
    c.NotebookApp.open_browser = False
    #specify the port
    c.NotebookApp.port = 8888
    #temporarily disable the token (fine for a throwaway PoC; set c.NotebookApp.password for anything exposed)
    c.NotebookApp.token = u''
  3. After downloading Spark, we need to export a few environment variables in ~/.bash_profile (or ~/.bashrc)

    export SPARK_HOME=$HOME/spark-2.0.2-bin-hadoop2.7
    export PATH=$SPARK_HOME/bin:$PATH
    export PYSPARK_SUBMIT_ARGS='--master local[*] pyspark-shell'
  4. We also need to introduce PySpark into the Jupyter Python environment in ~/.ipython/profile_default/startup/00-default-setup.py

    import glob
    import os
    import sys

    spark_home = os.environ.get("SPARK_HOME")
    if not spark_home:
        raise ValueError('SPARK_HOME environment variable is not set')
    # put Spark's Python bindings and its bundled py4j on the path
    # (glob avoids hard-coding the py4j version, which varies by Spark release)
    sys.path.insert(0, os.path.join(spark_home, 'python'))
    sys.path.insert(0, glob.glob(os.path.join(spark_home, 'python/lib/py4j-*-src.zip'))[0])
    # start a SparkContext the same way the pyspark shell does (execfile is Python 2)
    execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
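The path wiring in the startup script above can be factored into a small helper so the py4j version never needs to be hard-coded. This is just an illustrative sketch (`pyspark_paths` is a hypothetical name, not part of Spark), using only the standard library:

```python
import glob
import os


def pyspark_paths(spark_home):
    """Return the sys.path entries PySpark needs under spark_home."""
    python_dir = os.path.join(spark_home, 'python')
    # the py4j version bundled with Spark changes between releases,
    # so discover the zip instead of hard-coding its name
    py4j_zips = glob.glob(os.path.join(python_dir, 'lib', 'py4j-*-src.zip'))
    if not py4j_zips:
        raise ValueError('no py4j source zip found under %s' % spark_home)
    return [python_dir, py4j_zips[0]]
```

The startup script would then `sys.path.insert(0, p)` for each returned entry before calling `execfile`.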

After this setup, just open <ip>:8888 in the browser, and the Jupyter welcome page should show up.
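One note on the `--master local[*]` setting from step 3: the `*` means Spark runs locally with one worker thread per logical CPU core, while `local[N]` would pin the count to N. A quick standard-library check (plain Python, no Spark needed) of what `*` resolves to on a given box:

```python
import multiprocessing

# `local[*]` launches one Spark worker thread per logical core;
# this prints the number `*` resolves to on this machine
cores = multiprocessing.cpu_count()
print(cores)
```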
Happy coding with Spark! 😛