Quick introduction to Apache Livy
Apache Livy is a service that enables access to a Spark cluster over a REST interface. It enables easy submission of Spark jobs or snippets of Spark code, synchronous or asynchronous result retrieval, as well as Spark Context management, all via a simple REST interface or an RPC client library. There is…
Tag: Spark
Apache Spark introduced Dataset API that unified the programming experience, improving upon the performance/experience and reducing the learning curve for spark developers. This is a great link to get familiar with Dataset. If the link doesn’t work at when you are reading this post, google is your friend. I want to save time and get…
Apache Spark Transformation – DataFrame
A DataFrame can be created from any structured dataset such as JSON, a relational table, Parquet, or an existing RDD with a defined schema. The following program creates a DataFrame and queries it using SQL. Here is the JSON we will use to play with; copy the following lines into a file and save it in <SPARK_HOME>/bin…
Apache Spark DataFrame
So, let’s recall the RDD (Resilient Distributed Dataset). It is an immutable distributed collection of objects; it is an interface. OK! We have also seen how to apply transformations in the previous post. They are amazing, as they give us all the flexibility to deal with almost any kind of data: unstructured, semi-structured, and structured…
Apache Spark Transformations
In this post we will focus on general Apache Spark transformations on RDDs. We will keep it simple but try to go as deep as we can. A download link is provided at the bottom so you can run the programs and try them with your own input. The goal is to get familiar…
The Word Count program in Java we saw here was written using lambda expressions, which are supported in Java 8. So, we passed functions as arguments to our transformation calls such as mapToPair() and reduceByKey(). In this post we will try to write more detailed implementations of the lambda expressions that we used, as these are still fairly new…
This is a simple exercise, and the following are the steps for setting up a Maven project in Eclipse: Create a new Maven project in Eclipse as shown below: From the Package Explorer view, go to New -> Other -> Maven -> select Maven Project -> fill in the group id, artifact id, and package name, and click Finish. You should…
Download and configure on Mac/Windows
Download the version of Spark that you want to work on from here. Copy the downloaded tgz file to a folder where you want it to reside. Either double-click the package or run the tar -xvzf /path/to/yourfile.tgz command, which will extract the Spark package. Navigate to the bin folder and start ./spark-shell…
If you haven’t read the previous article about MapReduce, I’d highly recommend reading it because it will set a good foundation for appreciating Spark’s existence.
Apache Spark – Introduction
I want to get to the practical exercises quickly, and I think there are enough resources on the internet to explain the theoretical view of the framework…
MapReduce – Quick Intro
If you are reading this page, then I assume you have heard about MapReduce. Let us understand the MR framework quickly, as an understanding of it is much needed to appreciate Apache Spark. MapReduce is the core, de facto data processing framework of Apache Hadoop. The beauty of this framework was…