Accessing PySpark from a Jupyter Notebook

It’d be great to interact with PySpark from a Jupyter Notebook. This post describes how to get that set up. It assumes that you’ve already installed Spark.

  1. Install the findspark package.
    pip3 install findspark
  2. Make sure that the SPARK_HOME environment variable is defined, so that findspark can locate your Spark installation.
  3. Launch a Jupyter Notebook.
    jupyter notebook
  4. Import the findspark package, call findspark.init() to locate the Spark installation, and then import the pyspark module. See below for a simple example.
A Jupyter notebook using the `findspark` and `pyspark` packages.