Download and configure on Mac/Windows
- Download the version of Spark that you want to work with from the Apache Spark downloads page (spark.apache.org/downloads.html)
- Copy the downloaded .tgz file to the folder where you want Spark to reside. Either double-click the package or run the
tar -xvzf /path/to/yourfile.tgz command, which will extract the Spark package.
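To see those tar flags in action without touching your Spark download, here is a throwaway demonstration; the /tmp paths and file names below are made up purely for illustration:

```shell
# Illustrative only: build a tiny sample .tgz, then extract it with the
# same flags you would use on the Spark package (paths are made up).
mkdir -p /tmp/spark-tar-demo/pkg /tmp/spark-tar-demo/extracted
echo "sample" > /tmp/spark-tar-demo/pkg/README.md
tar -czf /tmp/spark-tar-demo/sample.tgz -C /tmp/spark-tar-demo pkg

# -x extract, -v verbose listing, -z gunzip, -f archive file
tar -xvzf /tmp/spark-tar-demo/sample.tgz -C /tmp/spark-tar-demo/extracted
ls /tmp/spark-tar-demo/extracted/pkg
```

For the real package, the same `-xvzf` invocation extracts a `spark-<version>-bin-<hadoop-version>` folder next to the archive.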
- Navigate to the bin folder and run ./spark-shell; you should land at the Scala prompt.
On Windows, you will need to extract the .tgz Spark package using 7-Zip, which can be downloaded for free, and then run bin\spark-shell.cmd. If everything goes fine, you have installed Spark successfully.
A Scala install is not needed for the Spark shell to run, as the Scala binaries are included in the prebuilt Spark package. We will, however, need to install Java 8. There are plenty of Java installation guides; refer to one of them for installing and configuring Java on either Mac or Windows.
As we will be focusing on the Java API of Spark, I’d recommend installing the latest Eclipse IDE and Maven as well. Having Maven available inside Eclipse alone is enough; if you wish to run your pom.xml from the command line, you will need Maven installed on your OS too. Again, there are plenty of good guides covering this topic.
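As a sketch of what that Maven setup leads to, a Java Spark project’s pom.xml typically declares the spark-core artifact; the version and Scala suffix below are assumptions, so match them to the Spark package you downloaded:

```xml
<!-- Illustrative dependency; use the version/Scala suffix matching your download -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.4.8</version>
</dependency>
```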
Setting up a standalone Spark Cluster
We will need the Spark cluster set up, as we will be submitting our Java Spark jobs to it. We will set up a cluster with two slave (worker) nodes.
If you have Homebrew configured, then all you need to do is run:
brew install apache-spark
Your Spark binaries/package get installed in the
/usr/local/Cellar/apache-spark folder. Set up SPARK_HOME now by adding the following to your ~/.bashrc:
export SPARK_HOME=/usr/local/Cellar/apache-spark/$version/libexec
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Once you have installed the binaries, either via the manual download method or via brew, proceed to the next steps, which will help us set up a local Spark cluster with two workers and one master.
Open the <SPARK_HOME>/conf/slaves file in a text editor and add “localhost” on a new line.
Then add your worker settings (SPARK_WORKER_INSTANCES and the worker/executor memory options) to your <SPARK_HOME>/conf/spark-env.sh file.
Note: Templates for both the slaves and spark-env files are already present in the conf directory; you will have to rename them from their .template versions to slaves and spark-env.sh respectively. SPARK_WORKER_INSTANCES here gives us two worker instances on the localhost machine. Executor and worker memory configurations are also defined here. We will see more on what Workers, Executors, etc. are later.
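As a minimal sketch of what that spark-env.sh might contain (the memory values here are assumptions; tune them to your machine):

```shell
# <SPARK_HOME>/conf/spark-env.sh
export SPARK_WORKER_INSTANCES=2      # two worker instances on localhost
export SPARK_WORKER_MEMORY=1g        # memory each worker can hand out (assumed value)
export SPARK_EXECUTOR_MEMORY=512m    # memory per executor (assumed value)
```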
Let’s start the master first by running
<SPARK_HOME>/sbin/start-master.sh. If you can access
http://127.0.0.1:8080/ then your master is up and running. Now we need to start the slaves:
<SPARK_HOME>/sbin/start-slave.sh spark://127.0.0.1:7077. Under the Workers section of the master UI at
http://127.0.0.1:8080/ you should see two worker instances with their worker IDs. So, our cluster is up and running. This type of cluster setup is called a standalone cluster.
The above scripts will not be able to start the cluster on Windows. To get your cluster up and running on Windows, use the spark-class.cmd script to launch the master and worker classes directly.
In one command prompt window run
<SPARK_HOME>\bin\spark-class.cmd org.apache.spark.deploy.master.Master
In a second command prompt run
<SPARK_HOME>\bin\spark-class.cmd org.apache.spark.deploy.worker.Worker spark://127.0.0.1:7077