If you haven’t read the previous article about MapReduce, I’d highly recommend reading it, because it sets a good foundation for appreciating Spark’s existence.
Apache Spark – Introduction
I want to get to the practical exercises quickly, and there are enough resources on the internet explaining the theoretical view of the framework, so I will keep the introduction short while covering the important high-level details. Apache Spark is a cluster computing platform designed to be fast and general purpose. Its greatest strength lies in in-memory data processing: it supports interactive and iterative in-memory computation. A developer programs against the API in a language of choice from the several languages Spark supports (Java, Scala, Python, etc.).
We will cover other important topics in coming posts; for simplicity, this post sticks to getting familiar with the Spark architecture.
Apache Spark Ecosystem
Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, and more. Spark Core is also home to the API that defines resilient distributed datasets (RDDs), which are Spark’s main programming abstraction.
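To make the RDD abstraction concrete, here is a minimal Spark Core sketch in Scala. It assumes a local-mode setup; the app name and the numbers are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddExample {
  def main(args: Array[String]): Unit = {
    // Run locally using all available cores (no cluster needed)
    val conf = new SparkConf().setAppName("rdd-example").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Distribute a local collection as an RDD, then transform and aggregate it
    val numbers = sc.parallelize(1 to 100)
    val sumOfSquares = numbers.map(n => n * n).reduce(_ + _)
    println(s"Sum of squares: $sumOfSquares")

    sc.stop()
  }
}
```

Note that `map` is a lazy transformation; nothing is actually computed until the `reduce` action forces it.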
Spark SQL allows for working with structured data. It lets you query data via SQL and supports many data sources, including Hive tables, Parquet, and JSON.
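As a quick illustration, the sketch below uses the SparkSession entry point (Spark 2.x and later) to load a hypothetical people.json file and query it with SQL; the file name and its name/age schema are assumptions for the example.

```scala
import org.apache.spark.sql.SparkSession

object SqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql-example")
      .master("local[*]")
      .getOrCreate()

    // people.json is a hypothetical file with one JSON object per line
    val people = spark.read.json("people.json")
    people.createOrReplaceTempView("people")

    // Query the DataFrame with plain SQL
    spark.sql("SELECT name, age FROM people WHERE age > 21").show()

    spark.stop()
  }
}
```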
Spark Streaming is a Spark component that enables processing of live streams of data.
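Here is a minimal sketch of the classic streaming word count, assuming text arrives on a local TCP socket (which you could feed with `nc -lk 9999`):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingExample {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the socket receiver, one for processing
    val conf = new SparkConf().setAppName("streaming-example").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

    // Count words in each batch of lines read from the socket
    val lines = ssc.socketTextStream("localhost", 9999)
    val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```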
Spark comes with a library containing common machine learning (ML) functionality, called MLlib.
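For a taste of MLlib, here is a small sketch that clusters a made-up set of 2-D points with k-means using the DataFrame-based API; the data points are invented for illustration.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object KMeansExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mllib-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A tiny made-up dataset: two obvious clusters of 2-D points
    val points = Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
    ).map(Tuple1.apply).toDF("features")

    // Fit a k-means model with two clusters and print the cluster centers
    val model = new KMeans().setK(2).setSeed(1L).fit(points)
    model.clusterCenters.foreach(println)

    spark.stop()
  }
}
```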
GraphX is a library for manipulating graphs (e.g., a social network’s friend graph) and performing graph-parallel computations.
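Continuing the friend-graph example, here is a minimal GraphX sketch that builds a tiny graph and ranks its vertices with PageRank; the names and friendships are made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("graphx-example").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Vertices carry names; edges represent friendships
    val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "friend"), Edge(2L, 3L, "friend")))
    val graph = Graph(vertices, edges)

    // Rank vertices by influence using PageRank (0.001 is the convergence tolerance)
    graph.pageRank(0.001).vertices.collect().foreach(println)

    sc.stop()
  }
}
```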
So, Spark is a unified stack containing multiple closely integrated components, with Spark Core as the driving force for everything that depends on it. Let’s put all these components in context and see what Spark can do for us:
- It can be seen as a unified computation engine
- It can perform deep, iterative, and complex analytics
- It can efficiently query and analyze structured data via its SQL API
- Near-real-time analytics are possible with Spark Streaming
- It supports machine learning and graph-oriented computations
- It works with or without a Hadoop cluster
- It provides Scala and Python shells, which make it extremely easy to get up and running with Spark (see the short session after this list)
- It offers many languages to choose from, and it is one of the most active Apache projects right now
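As an example of how quickly the shell gets you going, here is a short spark-shell (Scala) session; README.md is just a stand-in for any text file on your machine.

```
$ ./bin/spark-shell
scala> val lines = sc.textFile("README.md")
scala> lines.filter(_.contains("Spark")).count()
```

The shell pre-creates the SparkContext as `sc`, so you can start working with RDDs immediately.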