SPARK vs HADOOP
1. Compare Spark vs Hadoop MapReduce
|Hadoop MapReduce||Apache Spark|
|Runs on a large number of nodes||Highly scalable (Spark clusters of 8,000 nodes)|
|Does not make full use of the Hadoop cluster's memory.||Keeps data in memory through the use of RDDs.|
|MapReduce is disk-oriented.||Spark caches data in memory, which keeps latency low.|
|Only batch processing is supported.||Supports near-real-time processing through Spark Streaming.|
|Is bound to Hadoop.||Is not bound to Hadoop.|

|Criteria||Hadoop MapReduce||Apache Spark|
|Streaming engine||None (MapReduce is batch only)||Spark Streaming (micro-batches)|
|Data flow||Map-Reduce||Directed acyclic graph (DAG)|
|Computation model||Batch-oriented Map-Reduce model||Collect and process|
|Performance||Slower, because intermediate results are written to disk between stages||Fast; reported as up to 100 times faster than Hadoop MapReduce for in-memory workloads|
|Fault tolerance||Highly fault tolerant||Recovery is available without extra code; lost partitions can be recovered|
|Interactivity||No interactive mode other than through Pig and Hive||Has interactive modes|
|Difficulty||Tough to learn||Easier, thanks to its high-level modules|
|Data caching||Hard disk||In-memory|
|Iterative jobs||Average||Excellent|
|Independent of Hadoop||No||Yes|
|Machine learning applications||Average||Excellent|
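
The micro-batch idea from the streaming comparison can be pictured with a small sketch. This is plain Python, not the Spark Streaming API: the function `micro_batches`, the parameter `batch_interval`, and the sample events are all hypothetical names chosen for illustration. It only shows how a continuous stream can be cut into small time windows that are each processed as an ordinary batch.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into fixed-width time windows.

    Each window would then be processed as one small batch job.
    Toy model only; this is not Spark's API.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        # Integer division assigns the event to its window index.
        batches[int(timestamp // batch_interval)].append(value)
    # Return the non-empty windows in time order.
    return [batches[index] for index in sorted(batches)]


# Hypothetical events: (arrival time in seconds, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]

for batch in micro_batches(events, batch_interval=1.0):
    print(len(batch), batch)
```

A shorter `batch_interval` lowers latency at the cost of more, smaller batches, which is the usual trade-off in a micro-batch design.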
Simplicity, flexibility, and performance are the major advantages of using Spark over Hadoop MapReduce.
- Spark can be up to 100 times faster than Hadoop MapReduce for big data processing because it keeps the data in memory, in Resilient Distributed Datasets (RDDs).
- Spark is easier to program as it comes with an interactive mode.
- It provides complete recovery using the lineage graph: whenever something goes wrong, lost partitions are recomputed from the transformations that produced them.
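
To make the lineage idea concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's implementation or API: the class `ToyRDD` and its method `lose_cache` are hypothetical names invented for this example. Each dataset remembers its parent and the transformation that produced it, so a lost in-memory copy can be rebuilt by replaying those steps.

```python
class ToyRDD:
    """Toy stand-in for an RDD: remembers how it was derived (its lineage)."""

    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # base data, only set on the root dataset
        self.parent = parent        # the dataset this one was derived from
        self.transform = transform  # function applied to the parent's data
        self.cached = None          # in-memory copy, which may be lost

    def map(self, fn):
        # Record the transformation instead of running it immediately.
        return ToyRDD(parent=self, transform=lambda data: [fn(x) for x in data])

    def filter(self, pred):
        return ToyRDD(parent=self, transform=lambda data: [x for x in data if pred(x)])

    def collect(self):
        if self.cached is None:
            # Nothing in memory: rebuild the data by replaying the lineage.
            if self.parent is None:
                self.cached = list(self.source)
            else:
                self.cached = self.transform(self.parent.collect())
        return self.cached

    def lose_cache(self):
        # Simulate a failure that wipes the in-memory data.
        self.cached = None


numbers = ToyRDD(source=range(1, 6))
evens_squared = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

first = evens_squared.collect()
evens_squared.lose_cache()            # simulated failure
recovered = evens_squared.collect()   # recomputed from the lineage
print(first, recovered)
```

Because only the recipe is needed to rebuild the data, no replica of the lost result has to be kept; the cached copy is purely an optimization for repeated access.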