Tag: rdd

Migrating to Spark 2.0
Date: October 4, 2017 Categories: Apache Spark

Spark 2.0 provides a more mature ecosystem and a unified data abstraction API, and it sets new performance benchmarks, along with some non-backward-compatible changes. Here, we look at some important things to learn and remember before migrating existing Spark projects to Spark 2.0. The following is not a complete list of points but presents…

Relational Set Theory Transformation
Date: August 12, 2017 Categories: Apache Spark

Relation/Set Theory transformations

We will work with the following program to understand the three important set-theory-based transformations.

package com.mishudi.learn.spark.dataframe;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class RelationalOrSetTheoryTransformations {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("RelationalOrSetTheoryTransformations");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        //…
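The excerpt is cut off before it names the three transformations. As a rough illustration only, the sketch below assumes they are the usual union, intersection, and subtract (difference), and models their semantics with plain Java collections rather than Spark. The class name SetTheorySketch and its methods are hypothetical and do not come from the original post. In Spark the same operations run on distributed data, and the order of the results is not guaranteed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class: a local, in-memory model of three set-style
// transformations. This is NOT Spark code; it only illustrates the semantics.
public class SetTheorySketch {

    // Union: concatenates both inputs and keeps duplicates
    // (an RDD union likewise does not de-duplicate).
    public static List<Integer> union(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    // Intersection: the distinct elements that appear in both inputs.
    public static List<Integer> intersection(List<Integer> a, List<Integer> b) {
        Set<Integer> out = new LinkedHashSet<>(a);
        out.retainAll(b);
        return new ArrayList<>(out);
    }

    // Subtract: the elements of a that do not appear in b.
    public static List<Integer> subtract(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>(a);
        out.removeAll(b);
        return out;
    }

    public static void main(String[] args) {
        List<Integer> left = Arrays.asList(1, 2, 3, 4);
        List<Integer> right = Arrays.asList(3, 4, 5);

        System.out.println(union(left, right));        // [1, 2, 3, 4, 3, 4, 5]
        System.out.println(intersection(left, right)); // [3, 4]
        System.out.println(subtract(left, right));     // [1, 2]
    }
}
```

Note that union is the only one of the three that can return duplicates here; applying a distinct step afterwards would give the mathematical set union.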

Apache Spark Transformation – DataFrame
Date: August 7, 2017 Categories: Apache Spark

A DataFrame can be created from any structured dataset, such as JSON, a relational table, Parquet, or an existing RDD with a defined schema. The following program creates a DataFrame and queries it using SQL. Here is the JSON we will work with; copy the following lines into a file and save it in <SPARK_HOME>/bin…

Spark DataFrame
Date: August 7, 2017 Categories: Apache Spark

Apache Spark DataFrame

So, let's recall the RDD (Resilient Distributed Dataset). It is an immutable, distributed collection of objects, and it is an interface. We have also seen how to apply transformations in the previous post. They are amazing, as they give us the flexibility to deal with almost any kind of data: unstructured, semi-structured, and structured…

Apache Spark
Date: July 26, 2017 Categories: Apache Spark

If you haven't read the previous article about MapReduce, I'd highly recommend reading it, because it sets a good foundation for appreciating Spark's existence.

Apache Spark – Introduction

I want to get to the practical exercises quickly, and I think there are enough resources on the internet to explain the theoretical view of the framework…
