1. Compare Spark vs Hadoop MapReduce






Criteria | Hadoop MapReduce | Spark
Scalability | Scales to a large number of nodes | Highly scalable; Spark clusters of 8,000 nodes have been reported
Memory | Does not make full use of the memory of the Hadoop cluster | Keeps data in memory through RDDs
Disk usage | Disk oriented | Caches data in memory, which keeps latency low
Processing | Supports only batch processing | Supports near-real-time processing through Spark Streaming
Installation | Bound to Hadoop | Not bound to Hadoop

Criteria | Hadoop MapReduce | Spark
Streaming engine | MapReduce | Apache Spark Streaming (micro-batches)
Data flow | MapReduce | Directed acyclic graph (DAG)
Computation model | MapReduce batch-oriented model | Collect and process
Performance | Slow due to batch processing | Fast; up to about 100 times faster than Hadoop MapReduce for in-memory workloads
Fault tolerance | Highly fault tolerant | Recovery is available without extra code; lost partitions can be recovered
Interactivity | No interactive mode other than Pig and Hive | Has interactive modes
Difficulty | Tough to learn | Easier to learn because of its high-level modules
Data caching | Hard disk | In-memory
Iterative jobs | Average | Excellent
Independent of Hadoop | No | Yes
Machine learning applications | Average | Excellent
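
The micro-batch approach that Spark uses for streaming can be illustrated with a small sketch. This is plain Python, not Spark's API: the `micro_batches` and `process_batch` helpers and the batch size are hypothetical, and Spark Streaming groups records by time interval rather than by count. It only shows the idea that a stream is cut into small batches and each batch is handled with ordinary batch logic.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(stream: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Cut a (possibly unbounded) stream into lists of at most batch_size records."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return  # stream exhausted
        yield batch


def process_batch(batch: List[int]) -> int:
    """Ordinary batch logic applied to one micro-batch (here, a sum)."""
    return sum(batch)


events = range(1, 8)  # stand-in for an incoming event stream: 1..7
results = [process_batch(b) for b in micro_batches(events, batch_size=3)]
print(results)  # one result per micro-batch
```

Because each micro-batch is processed like a small batch job, the same batch-style code can serve both stored and streaming data in this model.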

Simplicity, flexibility, and performance are the major advantages of using Spark over Hadoop MapReduce.

  • Spark can be up to 100 times faster than Hadoop MapReduce for big data processing because it keeps data in memory, in Resilient Distributed Datasets (RDDs).
  • Spark is easier to program because it comes with an interactive mode.
  • When something goes wrong, it recovers lost data by recomputing it from the RDD lineage graph.
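
The in-memory caching and lineage-based recovery described above can be sketched with a toy model. This is plain Python rather than Spark code: `ToyRDD`, `lose_cache`, and the `computations` counter are hypothetical names introduced for illustration. Unlike real Spark, which caches an RDD only when asked (for example with `cache()` or `persist()`), this sketch caches every computed result to keep the example short.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Hypothetical stand-in for an RDD: it remembers how to rebuild its data."""

    def __init__(self,
                 source: Optional[List[int]] = None,
                 parent: Optional["ToyRDD"] = None,
                 fn: Optional[Callable[[int], int]] = None) -> None:
        self.source = source   # set only on the root dataset
        self.parent = parent   # lineage: the dataset this one was derived from
        self.fn = fn           # lineage: the transformation applied to the parent
        self.cached: Optional[List[int]] = None
        self.computations = 0  # how many times the data was actually computed

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        """Record a transformation; nothing is computed yet."""
        return ToyRDD(parent=self, fn=fn)

    def collect(self) -> List[int]:
        """Return the data, from memory if cached, otherwise from the lineage."""
        if self.cached is not None:
            return self.cached
        self.computations += 1
        if self.parent is None:
            data = list(self.source)
        else:
            data = [self.fn(x) for x in self.parent.collect()]
        self.cached = data
        return data

    def lose_cache(self) -> None:
        """Simulate losing the in-memory copy, for example when a node fails."""
        self.cached = None


base = ToyRDD(source=[1, 2, 3])
squared = base.map(lambda x: x * x)

first = squared.collect()      # computed from the lineage
second = squared.collect()     # served from memory, no recomputation
squared.lose_cache()           # simulated failure
recovered = squared.collect()  # rebuilt from the recorded lineage
print(first, recovered, squared.computations)
```

The second `collect()` call does no work because the result is already in memory, which is why iterative jobs benefit from caching. After the simulated failure, the data is rebuilt from the recorded transformation instead of being restored from a replica.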