SPARK vs HADOOP
1. Compare Spark vs Hadoop MapReduce
Criteria | Hadoop MapReduce | Spark |
Scalability | Scales out to a large number of nodes | Highly scalable; Spark clusters of about 8,000 nodes have been reported |
Memory | Does not make full use of the memory of the Hadoop cluster. | Keeps data in memory through RDDs. |
Disk usage | MapReduce is disk oriented. | Spark caches data in memory and ensures low latency. |
Processing | Only batch processing is supported. | Supports near-real-time processing through Spark Streaming. |
Installation | Is bound to Hadoop. | Is not bound to Hadoop. |
Streaming Engine | MapReduce | Apache Spark Streaming (micro-batches) |
Data Flow | MapReduce | Directed Acyclic Graph (DAG) |
Computation Model | MapReduce batch-oriented model | Collect and process |
Performance | Slow due to disk-based batch processing | Fast; up to about 100 times faster than Hadoop MapReduce for in-memory workloads |
Fault Tolerance | Highly fault tolerant due to MapReduce | Recovery is available without extra code; lost partitions can be recovered |
Interactivity | No interactive mode other than Pig and Hive | Has interactive modes |
Difficulty | Tough to learn | Easier to learn because of its high-level modules |
Data caching | Hard disk | In-memory |
Perform iterative jobs | Average | Excellent |
Independent of Hadoop | No | Yes |
Machine learning applications | Average | Excellent |
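
The "micro batches" entry in the Streaming Engine row can be illustrated with a small sketch. This is plain Python, not the Spark Streaming API: the helper names `micro_batches` and `word_count` are hypothetical, and batching by record count is a simplifying assumption (Spark Streaming groups records by time interval). The idea it shows is that a stream is cut into small batches and each one is handled by an ordinary batch job.

```python
from collections import Counter
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield consecutive lists of at most batch_size records (hypothetical helper)."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def word_count(batch):
    """An ordinary batch job: count the words in one micro-batch."""
    return Counter(word for line in batch for word in line.split())


# Assumed sample input standing in for an unbounded stream of lines.
lines = ["spark streaming", "hadoop mapreduce", "spark rdd", "spark dag", "hadoop hdfs"]

running_total = Counter()
for batch in micro_batches(lines, batch_size=2):
    # Each micro-batch is processed with the same batch logic,
    # and the results are folded into a running total.
    running_total.update(word_count(batch))
```

The same `word_count` function would work unchanged on a full data set, which is roughly why a batch engine can be reused for streaming when the stream is sliced into small batches.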
Simplicity, Flexibility and Performance are the major advantages of using Spark over Hadoop.
- Spark can be up to 100 times faster than Hadoop MapReduce for big data processing because it keeps data in memory, in Resilient Distributed Datasets (RDDs).
- Spark is easier to program as it comes with an interactive mode.
- It provides complete recovery by recomputing lost data from the lineage graph whenever something goes wrong.
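
The in-memory and lineage points above can be sketched with a toy model. This is not Spark code: `ToyRDD` and its methods are hypothetical names chosen for illustration, and the model ignores distribution, shuffles and scheduling. It only shows the idea that a data set remembers the transformations that produced it, so a lost in-memory partition can be rebuilt from the source data.

```python
class ToyRDD:
    """Toy stand-in for an RDD: source partitions plus the lineage to rebuild results."""

    def __init__(self, source_partitions, lineage=()):
        self.source = source_partitions   # durable input, e.g. blocks on disk
        self.lineage = tuple(lineage)     # ordered per-element functions
        self.cache = {}                   # partition index -> computed list (in memory)

    def map(self, fn):
        # A transformation only extends the lineage; nothing is computed yet.
        return ToyRDD(self.source, self.lineage + (fn,))

    def compute_partition(self, i):
        # Replay the recorded lineage on one source partition.
        data = list(self.source[i])
        for fn in self.lineage:
            data = [fn(x) for x in data]
        return data

    def get_partition(self, i):
        # Serve from memory when possible, otherwise recompute from lineage.
        if i not in self.cache:
            self.cache[i] = self.compute_partition(i)
        return self.cache[i]

    def collect(self):
        return [x for i in range(len(self.source)) for x in self.get_partition(i)]


rdd = ToyRDD([[1, 2], [3, 4]]).map(lambda x: x * 10).map(lambda x: x + 1)
first = rdd.collect()        # computes and caches both partitions

del rdd.cache[1]             # simulate losing one in-memory partition
recovered = rdd.collect()    # partition 1 is rebuilt from its lineage
```

Only the lost partition is recomputed here; partition 0 is still served from the in-memory cache, which is also why repeated passes over the same data (iterative jobs) avoid going back to disk.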