Spark 2.0 provides a more mature ecosystem, a unified data abstraction API, and notable performance improvements, along with some non-backward-compatible changes. Here we look at some important things to learn and remember before migrating an existing Spark project to Spark 2.0. The following is not a complete list, but it gives a general idea of what's new in Spark 2.0 and how it relates to an application built on Spark 1.x.
- org.apache.spark.sql.SparkSession: In earlier versions we created SparkContext and SQLContext separately, but Spark 2.0 introduces SparkSession as the single entry point to programming Spark with the Dataset API. Both SparkContext (sc) and SQLContext are reachable from the SparkSession. Although SparkSession encapsulates SparkContext and SQLContext, you can still use the two older APIs individually; however, the Spark documentation recommends using SparkSession.
- To create a SparkSession, the following snippet can be used: SparkSession.builder().getOrCreate();
- A Java Spark context can then be obtained from it as follows: JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
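Putting the two snippets above together, a minimal sketch of session setup (the app name and master URL are placeholders, not from the original):

```java
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

// Build (or reuse) the single Spark 2.0 entry point.
SparkSession sparkSession = SparkSession.builder()
        .appName("migration-demo")   // illustrative name
        .master("local[*]")          // illustrative master URL
        .getOrCreate();

// The older contexts remain reachable through the session.
JavaSparkContext jsc =
        JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
```

The later snippets in this list assume this sparkSession (and jsc) are in scope.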
- org.apache.spark.sql.DataFrame is replaced by the unified org.apache.spark.sql.Dataset API. In fact, the DataFrame class is no longer available in the Spark 2.0 Java API (in Scala it survives only as a type alias for Dataset[Row]). When you change your Spark version from 1.x to 2.0, you'll see compilation errors in all the places where you previously used DataFrames. The Dataset API was added in Spark 1.6 but becomes the primary data abstraction API in Spark 2.0.
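For Java code the fix is usually a mechanical type substitution; a minimal sketch (the input path is illustrative):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Spark 1.x:
// DataFrame df = sqlContext.read().json("people.json");

// Spark 2.0: Dataset<Row> takes the place of DataFrame in Java.
Dataset<Row> df = sparkSession.read().json("people.json");
df.show();
```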
- A DataFrame could only hold data as untyped rows, whereas Dataset is a unified API that can hold both typed JVM objects and untyped rows while still making use of the optimized execution engine. Datasets also use the same efficient off-heap storage mechanism as the DataFrame API. The following snippet converts JSON records of person details into a JVM-managed Dataset of persons.
- Dataset<Person> people = sparkSession.read().json("…").as(Encoders.bean(Person.class));
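For the snippet above to compile, Person must be a plain Java bean; the class below is a hypothetical example (its fields are assumptions for illustration):

```java
import java.io.Serializable;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

// Encoders.bean() maps JSON fields onto matching getters/setters.
public class Person implements Serializable {
    private String name;
    private int age;
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
}

// Each JSON line such as {"name":"Ann","age":32} becomes one Person.
Dataset<Person> people = sparkSession.read()
        .json("people.json")                 // illustrative path
        .as(Encoders.bean(Person.class));
```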
- DataFrame.registerTempTable("TABLE_NAME") is deprecated and should be replaced with Dataset.createOrReplaceTempView("TABLE_NAME"); SQL queries can then be executed against this view.
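A minimal sketch of the replacement pattern, reusing the people Dataset from above (the view and column names are illustrative):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Spark 1.x: df.registerTempTable("people");  // deprecated in 2.0
people.createOrReplaceTempView("people");

// SQL runs through the session itself; no separate SQLContext needed.
Dataset<Row> adults = sparkSession.sql(
        "SELECT name FROM people WHERE age >= 18");
adults.show();
```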
- Spark 2.0 is solid, complete, and improved, but migration gets tricky because the companion APIs for working with DataFrames, such as DataFrameReader and DataFrameWriter (e.g., read().json("…")), are still present yet now work internally with Datasets. It is worth the effort to compare the source code of these APIs across the Spark 1.x and Spark 2.0 versions.
- All the operations (transformations/actions) that you could perform on DataFrames are applicable to Datasets, as shown below.
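For instance, a familiar DataFrame-style filter/select/count chain compiles unchanged against a Dataset (column names follow the hypothetical Person bean above):

```java
import static org.apache.spark.sql.functions.col;

// The same chained transformations and the count() action you used
// on a DataFrame in 1.x work on a Dataset in 2.0.
long adultCount = people
        .filter(col("age").geq(18))
        .select("name")
        .count();
```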
- RDDs can still be created using sparkSession.sparkContext().textFile(…) (or via a JavaSparkContext, as shown below), which is roughly equivalent to the Dataset API way of reading text with sparkSession.read().text(…). In simpler terms, the first approach gives you an RDD of Strings (a JavaRDD<String> when created through a JavaSparkContext), while the second gives you a Dataset<Row>.
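Side by side, assuming the sparkSession and jsc created earlier (the path is illustrative):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// RDD route: one String per line of the file.
JavaRDD<String> lines = jsc.textFile("input.txt");

// Dataset route: one Row per line, held in a single "value" column.
Dataset<Row> rows = sparkSession.read().text("input.txt");
```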
- RDDs are still the core abstraction and are fully supported in Spark 2.0. However, the Dataset's ability to process both JVM-typed and untyped data largely eliminates the need to program against RDDs. Where the RDD API is a must, it remains available. Everything that was done using RDDs is possible with the unified Dataset API, though in many cases programmers will need to call different transformations/actions, and converting between the two is straightforward, as sketched below.
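Crossing the boundary works in both directions; a sketch reusing the lines RDD and rows Dataset from the previous snippet:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Dataset -> RDD: drop back to the RDD API when you must.
JavaRDD<Row> asRdd = rows.toJavaRDD();

// RDD -> Dataset: lift an existing RDD into the unified API.
Dataset<String> asDataset =
        sparkSession.createDataset(lines.rdd(), Encoders.STRING());
```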
- In Spark 2.0, SparkSession is used to register UDFs instead of SQLContext, via sparkSession.udf().register(…); a fuller sketch follows.
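A fuller sketch of UDF registration (the UDF name and logic are illustrative):

```java
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

// Register a one-argument UDF with the session instead of a SQLContext.
sparkSession.udf().register("strLen",
        (UDF1<String, Integer>) s -> s == null ? 0 : s.length(),
        DataTypes.IntegerType);

// Once registered, the UDF is callable from SQL against any temp view.
sparkSession.sql("SELECT strLen(name) FROM people").show();
```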
- Iterable<T> replaced with Iterator<T>: for Java functional classes such as FlatMapFunction and PairFlatMapFunction, Spark 2.0 changes the return type of the call() method that the user implements from Iterable<T> to Iterator<T>.
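A minimal sketch of the change using FlatMapFunction, reusing the lines RDD from above:

```java
import java.util.Arrays;
import java.util.Iterator;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

// Spark 1.x signature: public Iterable<String> call(String line)
// Spark 2.0 signature: call() must return an Iterator instead.
JavaRDD<String> words = lines.flatMap(
        new FlatMapFunction<String, String>() {
            @Override
            public Iterator<String> call(String line) {
                return Arrays.asList(line.split(" ")).iterator();
            }
        });
```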
- Some external data-ingestion connectors that were built into Spark, such as streaming-flume and streaming-twitter, have been moved out of Spark and can still be brought in as dependencies. This move lets these connectors be developed independently of Spark releases.
- CSV support is built into Spark 2.0; in earlier 1.x versions we used spark-csv as a separate dependency.
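A minimal sketch of the built-in reader (the path and options are illustrative):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// No spark-csv dependency needed in 2.0; CSV is a first-class format.
Dataset<Row> csv = sparkSession.read()
        .option("header", "true")        // first line holds column names
        .option("inferSchema", "true")   // guess column types from the data
        .csv("people.csv");
```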
- From Spark 2.1.0 onward, support for Java 7 is deprecated; this shouldn't matter much if you are planning to move only to Spark 2.0 for now.