Install and configure Spark
Date: July 27, 2017 | Categories: Apache Spark

Download and configure on Mac/Windows

  1. Download the version of Spark that you want to work with from the Apache Spark downloads page.
  2. Copy the downloaded tgz file to a folder where you want it to reside. Either double-click the package or run the tar -xvzf /path/to/yourfile.tgz command, which will extract the Spark package.
  3. Navigate to the bin folder and start ./spark-shell; you should land at the Scala command prompt, as shown in the example below.
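
For example, on a Mac the whole sequence could look like the commands below (the file name spark-2.2.0-bin-hadoop2.7.tgz is just a sample release; substitute whatever version you actually downloaded):

cd ~/Downloads
tar -xvzf spark-2.2.0-bin-hadoop2.7.tgz        # extracts the package into a folder of the same name
mv spark-2.2.0-bin-hadoop2.7 ~/spark           # move it to wherever you want it to reside
cd ~/spark/bin
./spark-shell                                  # drops you into the Scala prompt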

For Windows, you will need to extract the tgz Spark package using 7-Zip, which can be downloaded freely, and then run bin\spark-shell.cmd. If everything goes fine, you have installed Spark successfully.

A Scala install is not needed for the Spark shell to run, as the Scala binaries are included in the prebuilt Spark package. But we will need to install Java 8. There are plenty of Java install blogs; please refer to one of them for installing and configuring Java on either Mac or Windows.
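
A quick way to confirm Java 8 is in place from the terminal (the JAVA_HOME line uses the macOS java_home helper and is only a sketch; adjust the path on Windows):

java -version                                        # should report something like 1.8.0_xxx
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)    # macOS only: point JAVA_HOME at the Java 8 install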

As we will be focusing on the Java API of Spark, I'd recommend installing the latest Eclipse IDE and Maven packages too. You are good if you have Maven installed within Eclipse alone. If you wish to run your pom.xml from the command line, then you need Maven on your OS as well. Again, there are plenty of good blogs covering this topic; please refer to one of them.
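
If you do install Maven on the OS as well, a quick sanity check from the terminal looks like this (the project path is just a placeholder for wherever your pom.xml lives):

mvn -version                     # confirms Maven and the JDK it picked up
cd /path/to/your-spark-project   # hypothetical folder containing your pom.xml
mvn clean package                # builds the jar you will later submit to the cluster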

Setting up a standalone Spark Cluster

We will need the Spark cluster setup as we will be submitting our Java Spark jobs to the cluster. We will set up a cluster that has 2 slave (worker) instances.

On Mac

If you have Homebrew (brew) configured, then all you need to do is run:

brew install apache-spark

Your Spark binaries/package get installed in the /usr/local/Cellar/apache-spark folder. Set up SPARK_HOME now:

vi ~/.bashrc

export SPARK_HOME=/usr/local/Cellar/apache-spark/$version/libexec
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
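
Reload the profile and verify that the variables resolve ($version above is a placeholder for whatever version directory brew actually created):

source ~/.bashrc
echo $SPARK_HOME               # should print the libexec path of the installed version
which spark-shell              # should resolve to $SPARK_HOME/bin/spark-shell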

Once you have installed the binaries, either via the manual download method or via brew, proceed to the next steps, which will help us set up a local Spark cluster with 2 workers and 1 master.

Open the <SPARK_HOME>/conf/slaves file in a text editor and add “localhost” on a new line.

Add the following to your <SPARK_HOME>/conf/spark-env.sh file:

export SPARK_WORKER_MEMORY=1g
export SPARK_EXECUTOR_MEMORY=512m
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_CORES=1
export SPARK_WORKER_DIR=/PathToSparkDataDir/
SPARK_LOCAL_IP=127.0.0.1
SPARK_MASTER_IP=127.0.0.1

Note: Both the slaves and spark-env.sh files are already present in the conf directory as templates; rename slaves.template to slaves and spark-env.sh.template to spark-env.sh. SPARK_WORKER_INSTANCES gives us two worker instances on the localhost machine. Executor and worker memory settings are also defined here. We will see more on what Worker, Executor, etc. mean later on.
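
For reference, the renaming can also be done from the terminal, roughly like this (a sketch assuming SPARK_HOME points at your install as configured above):

cd $SPARK_HOME/conf
cp slaves.template slaves
echo "localhost" >> slaves                   # one worker host per line
cp spark-env.sh.template spark-env.sh        # then add the export lines shown above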

Let's start the master first by running <SPARK_HOME>/sbin/start-master.sh. If you can access http://127.0.0.1:8080/ then your master is up and running. Now we need to start the slaves: <SPARK_HOME>/sbin/start-slave.sh spark://127.0.0.1:7077. Under the Workers section in the master UI at http://127.0.0.1:8080/ you should see two worker instances with their worker IDs. Our cluster is now up and running. This type of cluster setup is called a standalone cluster.
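
Putting it together, the start-up plus an optional smoke test could look like the following (the examples jar name differs per Spark/Scala release, hence the wildcard; SparkPi is just the sample job that ships with Spark):

$SPARK_HOME/sbin/start-master.sh                          # master UI at http://127.0.0.1:8080/
$SPARK_HOME/sbin/start-slave.sh spark://127.0.0.1:7077    # starts the two configured worker instances
# optional: submit the bundled SparkPi example to the standalone master
$SPARK_HOME/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://127.0.0.1:7077 \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 10
$SPARK_HOME/sbin/stop-slave.sh && $SPARK_HOME/sbin/stop-master.sh   # tears the cluster back down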

On Windows

The above scripts will not be able to start the cluster on Windows. To get your cluster up and running on Windows, use spark-class.cmd directly:

In one command prompt window, run <SPARK_HOME>/bin/spark-class.cmd org.apache.spark.deploy.master.Master
In a second command prompt, run <SPARK_HOME>/bin/spark-class.cmd org.apache.spark.deploy.worker.Worker spark://127.0.0.1:7077