Accessing PySpark from a Jupyter Notebook

Tags: Jupyter, Spark

Published 4 Jul 2017 12:00

It’d be great to interact with PySpark from a Jupyter Notebook. This post describes how to set that up. It assumes that you’ve already installed Spark.

  1. Install the findspark package.

     ```bash
     pip3 install findspark
     ```

  2. Make sure that the `SPARK_HOME` environment variable is defined.
  3. Launch a Jupyter Notebook.

     ```bash
     jupyter notebook
     ```

  4. Import the findspark package, use `findspark.init()` to locate the Spark process, and then load the `pyspark` module. See below for a simple example.

A Jupyter notebook using the findspark and pyspark packages.