Apache Spark DataFrame
So, let's recall RDDs (Resilient Distributed Datasets). An RDD is an immutable distributed collection of objects, exposed as an interface. We have also seen how to apply transformations in the previous post. They are amazing, as they give us the flexibility to deal with almost any kind of data: unstructured, semi-structured, and structured. RDDs are the core data abstraction in Spark.
What is a DataFrame?
Quoting Databricks, “In Spark, a DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.”
Why DataFrames? How do they improve on RDDs?
In a nutshell, a DataFrame is schema and data together. JSON is a good analogy: a JSON document always carries its data and its schema together, the schema being the nested tree of keys. A relational table is another good example.
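As a minimal sketch of this idea (assuming the Spark 2.x Java API, where a DataFrame is a Dataset<Row>, and a hypothetical people.json input file), Spark can infer the schema straight from the JSON keys while loading the data:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SchemaWithData {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SchemaWithData")
                .master("local[*]")
                .getOrCreate();

        // Each JSON record carries its own keys, so Spark can infer the
        // schema while loading. "people.json" is a hypothetical input file.
        Dataset<Row> df = spark.read().json("people.json");

        df.printSchema(); // the inferred schema, e.g. root |-- age |-- name
        df.show();        // the data itself, organized into named columns

        spark.stop();
    }
}
```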
Allowing Spark to manage the schema and pass only data between nodes avoids expensive Java serialization. Spark can serialize the data into off-heap storage in a binary format and then perform many transformations directly on this off-heap memory, avoiding the garbage-collection cost of constructing an individual object for each row in the data set.
DataFrames support transformations and actions too, but with a more relational flavor. One can also create a DataFrame from an RDD, provided a schema is supplied for it. Transformations and actions on DataFrames are processed by an optimizer called Catalyst, which builds a tree of operators, optimizes the query, and generates the byte-code to run the job; you are not required to pass arbitrary functions as with RDDs. This brings us to Spark SQL, a SQL-on-Hadoop tool: SQL programmers can leverage this API and write SQL queries directly against DataFrames, as the sketch below illustrates.
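Here is a rough sketch of both points, an RDD converted to a DataFrame by supplying a schema, and then queried with plain SQL (hypothetical column names and data; again assuming the Spark 2.x Java API):

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class RddToDataFrame {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("RddToDataFrame")
                .master("local[*]")
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // An RDD of rows -- plain data, no schema attached yet.
        JavaRDD<Row> rows = jsc.parallelize(Arrays.asList(
                RowFactory.create("Alice", 34),
                RowFactory.create("Bob", 45)));

        // The schema we pass in alongside the RDD (hypothetical columns).
        StructType schema = DataTypes.createStructType(new StructField[] {
                DataTypes.createStructField("name", DataTypes.StringType, false),
                DataTypes.createStructField("age", DataTypes.IntegerType, false)
        });

        // RDD + schema = DataFrame.
        Dataset<Row> people = spark.createDataFrame(rows, schema);

        // Register it as a view and let Catalyst optimize a plain SQL query.
        people.createOrReplaceTempView("people");
        spark.sql("SELECT name FROM people WHERE age > 40").show();

        spark.stop();
    }
}
```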
Let's do some Spark Java programming against the DataFrame API.
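As a warm-up, here is a minimal, self-contained sketch (hypothetical Person bean and data; Spark 2.x Java API assumed) that builds a DataFrame from plain Java objects and applies relational-style transformations instead of arbitrary functions:

```java
import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;

import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataFrameBasics {
    // A small Java bean Spark can derive a schema from (hypothetical type).
    public static class Person {
        private String name;
        private int age;
        public Person(String name, int age) { this.name = name; this.age = age; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("DataFrameBasics")
                .master("local[*]")
                .getOrCreate();

        // Build a DataFrame from a list of beans; the schema comes from Person.
        Dataset<Row> people = spark.createDataFrame(Arrays.asList(
                new Person("Alice", 34),
                new Person("Bob", 45),
                new Person("Cara", 29)), Person.class);

        people.select(col("name"), col("age"))  // projection, like SELECT
              .filter(col("age").gt(30))        // predicate, like WHERE
              .show();                          // action: print the result

        people.agg(avg(col("age")).alias("avg_age"))  // aggregation over all rows
              .show();

        spark.stop();
    }
}
```

Notice that select, filter, and agg describe *what* we want, and Catalyst decides *how* to run it, which is exactly the relational flavor described above.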