MapReduce – Quick Intro
If you are reading this page, I assume you have heard about MapReduce. Let us quickly understand the MR framework, because understanding it is essential to appreciating Apache Spark. MapReduce is the de facto core data processing framework of Apache Hadoop. The beauty of this framework was that a developer would program against the API and worry only about the business logic, while the framework did all the heavy lifting: parallelizing execution, fault tolerance, data locality, scaling, and so on. I was delighted to get an opportunity to work on this great framework.

In a nutshell, developers are responsible for writing a map() and a reduce() function. These two act as a pair, since the output of map() is passed as input to reduce(). If you were processing a log file, then in a default setting each line would be sent as an input to map() until all lines of that specific block of the file are processed. All the output from map() is then sent as input to reduce(). The framework hands reduce() a <key, (collection of values)> pair, and it guarantees that all the values of a key will go to the same reducer; the partitioner does this magic. When the reduce() method finishes successfully, the job ends.
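The shape of this map()/reduce() pair is easiest to see with word count. Below is a plain-Python simulation of the dataflow, not the Hadoop API; the names map_fn, reduce_fn, and run_mapreduce are made up for illustration:

```python
from collections import defaultdict

def map_fn(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    """Sum all the counts the framework grouped under one key."""
    return (key, sum(values))

def run_mapreduce(lines):
    # Shuffle: group every emitted value by key, mimicking the
    # framework's guarantee that all values of a key reach the
    # same reduce() call.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(run_mapreduce(["spark is fast", "hadoop is batch"]))
# {'spark': 1, 'is': 2, 'fast': 1, 'hadoop': 1, 'batch': 1}
```

In real Hadoop the grouping happens across machines, of course; this sketch only shows the contract the developer programs against.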
That was a very quick intro to the flow of an MR job, and it is still fascinating. So where are the problems, and why is Spark becoming more and more dominant? Here are some internal details that may help:
- Each map instance processes an input split, so it has to read it from disk. The number of mappers is directly proportional to the number of input splits.
- Each map writes its output to an in-memory circular buffer; when the buffer is about 80% full, it starts spilling to disk
- So intermediate map outputs are also written back to disk, though to local disk rather than HDFS
- When all mappers are done processing, the Application Master signals the reducers to start. A reducer, using its event fetcher threads, pulls the output of every mapper to the node it is running on
- So the reducer transfers all mapper outputs over the network, which is expensive
- The reducer then performs the shuffle and sort
- During the sort phase, if the reducer cannot sort entirely in memory, it spills data to disk, which is yet another round of I/O
- Finally, the reducer's output is again written to disk
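The routing that the steps above rely on (which reducer receives which key) is, by default, just a hash partition. A minimal sketch, with Python's hash() standing in for Java's hashCode() and num_reducers as an assumed parameter:

```python
def partition(key, num_reducers):
    # Default Hadoop behavior is hash-based: every occurrence of a
    # key maps to the same reducer index, which is what guarantees
    # that all values of one key meet in one reduce() call.
    return hash(key) % num_reducers

# Every mapper applies the same function, so a given key always
# lands on the same reducer regardless of which mapper emitted it.
assert partition("spark", 4) == partition("spark", 4)
```

This is also why the number of reducers fixes the number of output partitions: the partition index is computed from the key alone.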
I have tried to highlight the pain points; any good programmer can point to all the possible optimizations one can implement to reduce data transfer and I/O. Depending on the use case, I'd agree with all the best practices you might be thinking of. These are highlights of MR, not the entire framework in action :-)
So, to be precise, the problems we have:
- A lot of I/O is going on, which is a major bottleneck
- If you have to make several different passes over the data, you have to write a series of separate MR jobs
- There is little to no in-memory data processing
- Processing is batch oriented
- One more important thing: MR needs a Hadoop ecosystem to work, where YARN is typically the cluster manager
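The iteration problem above (several passes means a chain of jobs) can be caricatured in a few lines. This sketch assumes only that each "job" must materialize its full output on disk before the next job can start; mr_job and read_job_output are hypothetical names:

```python
import json
import os
import tempfile

def mr_job(records, fn):
    """One 'job': apply fn to every record, then persist the full
    result to disk, because in MR the next job can only start from
    the previous job's on-disk output."""
    out = [fn(r) for r in records]
    fd, path = tempfile.mkstemp(suffix=".json")
    os.close(fd)
    with open(path, "w") as f:
        json.dump(out, f)
    return path

def read_job_output(path):
    """The next job in the chain re-reads the data from disk."""
    with open(path) as f:
        data = json.load(f)
    os.remove(path)
    return data

# Three passes over the data = three jobs = three disk round trips.
data = [1, 2, 3]
for _ in range(3):
    path = mr_job(data, lambda x: x * 2)
    data = read_job_output(path)
print(data)  # [8, 16, 24]
```

Spark's pitch is essentially that the intermediate `json.dump`/`json.load` round trips disappear: the working set can stay in memory between passes.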