I’m busy experimenting with Spark. This is what I did to set up a local cluster on my Ubuntu machine. Before you embark on this you should first set up Hadoop.
- Download the latest release of Spark from the downloads page on the Apache Spark site.
- Unpack the archive.
tar -xvf spark-2.1.1-bin-hadoop2.7.tgz
- Move the resulting folder and create a symbolic link so that you can have multiple versions of Spark installed.
sudo mv spark-2.1.1-bin-hadoop2.7 /usr/local/
sudo ln -s /usr/local/spark-2.1.1-bin-hadoop2.7/ /usr/local/spark
cd /usr/local/spark
- Add SPARK_HOME to your environment.
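Assuming the symbolic link created above, the relevant lines (for example in ~/.bashrc) would look something like this:

export SPARK_HOME=/usr/local/spark
export PATH=$SPARK_HOME/bin:$PATH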
- Start a standalone master server. At this point you can browse to 127.0.0.1:8080 to view the status screen.
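The master is started with a script from the sbin directory (assuming SPARK_HOME is set as above):

$SPARK_HOME/sbin/start-master.sh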
- Start a worker process.
To get this to work I had to make an entry for my machine in /etc/hosts, mapping its host name to a local IP address.
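With that in place, the worker can be started with start-slave.sh, passing it the master URL shown on the status screen (the host name below is just a placeholder for your machine's name):

$SPARK_HOME/sbin/start-slave.sh spark://your-hostname:7077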
- Test out the Spark shell. You’ll note that this exposes the native Scala interface to Spark.
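The shell lives in the bin directory:

$SPARK_HOME/bin/spark-shell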
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> println("Spark shell is running")
Spark shell is running

scala>
To get this to work properly it might be necessary to first set up the path to the Hadoop libraries.
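Assuming HADOOP_HOME points at your Hadoop installation, that would amount to something like this (again, for example, in ~/.bashrc):

export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH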
- Maybe Scala is not your cup of tea and you’d prefer to use Python. No problem!
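The PySpark shell is launched in much the same way:

$SPARK_HOME/bin/pyspark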
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/

Using Python version 2.7.13 (default, Jan 19 2017 14:48:08)
SparkSession available as 'spark'.
>>>
Of course you’ll probably want to interact with Python via a Jupyter Notebook, in which case take a look at this.
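One common way to do that (a sketch of a typical setup, not necessarily what that post describes) is to tell PySpark to use Jupyter as its driver before launching it:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
$SPARK_HOME/bin/pyspark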
- Finally, if you prefer to work with R, that’s also catered for.
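The R front end has its own launcher in the bin directory:

$SPARK_HOME/bin/sparkR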
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/

SparkSession available as 'spark'.

> spark
Java ref type org.apache.spark.sql.SparkSession id 1
>
- When you are done you can shut down the slave and master Spark processes.
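The corresponding stop scripts live alongside the start scripts in sbin:

$SPARK_HOME/sbin/stop-slave.sh
$SPARK_HOME/sbin/stop-master.sh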