Why is Spark better than MapReduce?


In this blog, I will list the benefits of using Spark over MapReduce.

In my previous blog, I shared my understanding of Hadoop. Don't miss it if you haven't read it yet: Click here to read it.

Basic difference

MapReduce was developed during the early and mid-2000s, when RAM was still expensive and most CPUs were 32-bit. It was therefore designed to rely heavily on disk I/O. Spark (the RDD, to be exact), on the other hand, was built in the era of 64-bit computers that could address terabytes of RAM, which had become far cheaper. Spark is thus first and foremost an in-memory technology, and hence a lot faster. In Spark, RAM utilization is usually maxed out.


I will list some points that will give you a detailed comparison:

Problem of Replication in MapReduce

One of the main limitations of MapReduce is that it persists the full dataset to HDFS after running each job. This is very expensive because it incurs replication, i.e., several copies of the dataset's worth of disk I/O and a similar amount of network I/O. In Spark, when the output of one operation needs to be fed into another, Spark passes the data along directly without writing it to persistent storage.
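To make this concrete, here is a minimal sketch of a chained Spark pipeline (the HDFS paths and app name are placeholders made up for illustration). Intermediate results flow from one transformation to the next without being written back to HDFS; only the final result is persisted:

```scala
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PipelineSketch")   // hypothetical app name
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // One read from HDFS; every step after that passes data in memory
    // (or via the shuffle), never back through HDFS.
    sc.textFile("hdfs:///data/input")          // placeholder path
      .flatMap(_.split("\\s+"))                // tokenize
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)                      // shuffle, but no HDFS write between stages
      .saveAsTextFile("hdfs:///data/output")   // only the final result hits HDFS

    spark.stop()
  }
}
```

Running the same word count as two chained MapReduce jobs would force the output of the first job through HDFS, with full replication, before the second job could read it.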

In-memory caching abstraction

The main innovation of Spark was to introduce an in-memory caching abstraction. This makes Spark ideal for workloads where multiple operations access the same input data. Users can instruct Spark to cache input data sets in memory, so they don’t need to be read from disk for each operation.
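Here is a minimal sketch of that caching behaviour (the input path is a placeholder). After the first action materializes the dataset, subsequent actions are served from the in-memory copy instead of going back to disk:

```scala
import org.apache.spark.sql.SparkSession

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CacheSketch")      // hypothetical app name
      .master("local[*]")
      .getOrCreate()

    // Placeholder path; the logs are read from disk only once.
    val logs = spark.read.textFile("hdfs:///data/access-logs")

    // Ask Spark to keep the dataset in memory after it is first computed.
    logs.cache()

    // The first action reads from disk and fills the cache;
    // the second is served from memory.
    val total  = logs.count()
    val errors = logs.filter(_.contains("ERROR")).count()

    println(s"total=$total, errors=$errors")
    spark.stop()
  }
}
```

Without the cache() call, each action would re-read the file from HDFS.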

Spark launches tasks much faster

The primary advantage Spark has here is that it can launch tasks much faster. MapReduce starts a new JVM for each task, which can take seconds once you account for loading JARs, parsing configuration XML, and so on. Spark keeps an executor JVM running on each node, so launching a task is simply a matter of making an RPC to it and passing a Runnable to a thread pool, which takes single-digit milliseconds.
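The executor model is easy to picture with a plain JVM thread pool. The sketch below is only an analogy, not Spark's actual scheduler code: the long-lived pool stands in for a running executor JVM, and submitting a Runnable stands in for the RPC that hands it a task:

```scala
import java.util.concurrent.Executors

object ExecutorAnalogy {
  def main(args: Array[String]): Unit = {
    // Stands in for a long-lived executor JVM: it is started once
    // and then sits ready to accept work.
    val pool = Executors.newFixedThreadPool(4)

    // "Launching a task" is just handing a Runnable to the pool:
    // no new JVM, no JAR loading, no XML parsing.
    for (i <- 1 to 8) {
      pool.submit(new Runnable {
        def run(): Unit = println(s"task $i ran on ${Thread.currentThread().getName}")
      })
    }

    pool.shutdown()
  }
}
```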

Lastly, a common misconception probably worth mentioning is that Spark somehow runs entirely in memory while MapReduce does not. This is simply not the case. Spark’s shuffle implementation works very similarly to MapReduce’s: each record is serialized and written out to disk on the map side and then fetched and deserialized on the reduce side.

About the Author


Priyabrat Bishwal

is a Data Engineer at Societe Generale Global Solution Centre. He is a big data enthusiast, passionate about data science and machine learning, and is currently pursuing an M.Tech in Data Science from BITS Pilani. He likes writing blogs and is always eager to help students from a science background. You can reach out to Priyabrat at [email protected]. For more details, follow him on his LinkedIn page.