SPARK BASICS-1
.What is Spark? Why Spark?
- Spark is often described as a third-generation distributed data processing platform.
- It is a unified big data solution for all big data processing problems such as batch, interactive, and streaming processing.
- Apache Spark is an open-source framework used mainly for Big Data analysis, machine learning and real-time processing.
- Apache Spark is a cluster computing framework which runs on a cluster of commodity hardware and performs data unification, i.e., reading and writing a wide variety of data from multiple sources.
- In Spark, a task is an operation that can be a map task or a reduce task.
- The SparkContext handles the execution of the job and also provides APIs in different languages, i.e., Scala, Java, and Python, to develop applications, with faster execution compared to MapReduce.
- Apache Spark is an Open Source Project from the Apache Software Foundation.
- Apache Spark is a data processing engine and is being used in data processing and data analytics.
- It has inbuilt libraries for Machine Learning, Graph Processing, and SQL Querying.
- Spark is horizontally scalable and is very efficient in terms of speed when compared to big data giant Hadoop.
- The framework provides a fully functional interface for programmers and developers – this interface greatly aids complex cluster programming and machine learning tasks.
- Spark is a fast, easy-to-use and flexible data processing framework.
- It has an advanced execution engine supporting cyclic data flow and in-memory computing.
- Spark can run on Hadoop, standalone or in the cloud and is capable of accessing diverse data sources including HDFS, HBase, Cassandra and others.
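A minimal sketch of what a Spark application looks like in practice (PySpark, assuming pyspark is installed; the input path and column names are hypothetical):

    # Minimal PySpark application: create a session, load data, aggregate, show.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("spark-basics-demo")
             .getOrCreate())

    # Read a CSV file into a DataFrame (the path is a placeholder).
    df = spark.read.csv("data/sales.csv", header=True, inferSchema=True)

    # A simple aggregation: total amount per country.
    totals = df.groupBy("country").agg(F.sum("amount").alias("total_amount"))

    totals.show()   # triggers the actual computation
    spark.stop()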
.What are the key features of Apache Spark that you like?
- Spark provides advanced analytic options like graph algorithms, machine learning, streaming data, etc
- It has built-in APIs in multiple languages like Java, Scala, Python and R
- It has good performance gains, as it helps run an application in the Hadoop cluster ten times faster on disk and 100 times faster in memory.
- Allows Integration with Hadoop and files included in HDFS.
- Spark has an interactive language shell as it has an independent Scala (the language in which Spark is written) interpreter
- Spark consists of RDD’s (Resilient Distributed Datasets), which can be cached across computing nodes in a cluster.
- Spark supports multiple analytic tools that are used for interactive query analysis, real-time analysis, and graph processing.
- Support for Several Programming Languages –
- Spark code can be written in any of the four programming languages, namely Java, Python, R, and Scala.
- It also provides high-level APIs in these programming languages. Additionally, Apache Spark provides shells in Python and Scala.
- The Python shell is launched with ./bin/pyspark, while the Scala shell is launched with ./bin/spark-shell.
- Lazy Evaluation – Apache Spark makes use of lazy evaluation: transformations are not executed until an action actually requires a result (see the sketch after this list).
- Machine Learning – For big data processing, Apache Spark’s MLlib machine learning component is useful. It eliminates the need for using separate engines for processing and machine learning.
- Multiple Format Support –
- Apache Spark provides support for multiple data sources, including Cassandra, Hive, JSON, and Parquet.
- The Data Sources API offers a pluggable mechanism for accessing structured data via Spark SQL.
- These data sources can be much more than simple pipes that convert data and pull it into Spark.
- Real-Time Computation – Spark is designed especially for meeting massive scalability requirements. Thanks to its in-memory computation, Spark’s computation is real-time and has low latency.
- Speed –
- For large-scale data processing, Spark can be up to 100 times faster than Hadoop MapReduce.
- Apache Spark is able to achieve this tremendous speed via controlled partitioning.
- The distributed, general-purpose cluster-computing framework manages data by means of partitions that help in parallelizing distributed data processing with minimal network traffic.
- Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and 10x faster even when running on disk.
- Spark makes this possible by reducing the number of reads/writes to disk. It stores intermediate processing data in memory.
- It uses the concept of a Resilient Distributed Dataset (RDD), which allows it to transparently store data in memory and persist it to disk only when needed.
- This helps reduce most of the disk reads and writes – the main time-consuming factors – in data processing.
- Hadoop Integration –
- Spark offers smooth connectivity with Hadoop.
- In addition to being a potential replacement for the Hadoop MapReduce functions, Spark is able to run on top of an existing Hadoop cluster by means of YARN for resource scheduling.
- Combines SQL, streaming, and complex analytics:
- In addition to simple “map” and “reduce” operations, Spark supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms out of the box.
- Not only that, users can combine all these capabilities seamlessly in a single workflow.
- Ease of Use: Spark lets you quickly write applications in Java, Scala, or Python. This helps developers create and run their applications in programming languages they already know, and makes it easy to build parallel apps.
- Runs Everywhere:
- Spark runs on Hadoop, Mesos, standalone, or in the cloud.
- It can access diverse data sources including HDFS, Cassandra, HBase, S3.
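A small PySpark sketch of the lazy-evaluation behaviour mentioned above (the data is made up; transformations only build a plan, and the final action runs it):

    # Lazy evaluation: transformations build a lineage; nothing runs until
    # an action (count, collect, take, show, ...) is called.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(1_000_000))

    evens   = rdd.filter(lambda x: x % 2 == 0)   # transformation: not executed yet
    squares = evens.map(lambda x: x * x)         # transformation: still not executed

    # Only this action triggers the whole pipeline, in a single pass.
    print(squares.take(5))   # [0, 4, 16, 36, 64]

    spark.stop()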
.What are some of the more notable features of Spark?
- Speed, multi-format support, and inbuilt libraries.
- Allows Integration with Hadoop and files included in HDFS.
- Spark has an interactive language shell as it has an independent Scala (the language in which Spark is written) interpreter.
- Spark consists of RDD’s (Resilient Distributed Datasets), which can be cached across computing nodes in a cluster.
- Spark supports multiple analytic tools that are used for interactive query analysis, real-time analysis, and graph processing.
.Clarify the key highlights of Apache Spark.
- Polyglot
- Speed
- Multiple Format Support
- Lazy Evaluation
- Real Time Computation
- Hadoop Integration
- Machine Learning
.Features, Pros and Cons of Apache Spark
Apache Spark is a tool for large-scale data processing and execution. It also offers high-level operators that make it easy to develop parallel applications. The most prominent Apache Spark features are:
- Polyglot – supports multiple programming languages
- Speed
- Support for multiple data formats
- Real-time computation
- Efficient machine learning
Pros:
- In-memory computation.
- Reusability of Spark code.
- Offers fault tolerance.
Cons:
- Its in-memory approach can become a bottleneck when it comes to cost efficiency.
- Compared to Apache Flink, Apache Spark has higher latency.
.Can you explain how you can use Apache Spark along with Hadoop?
- Having compatibility with Hadoop is one of the leading advantages of Apache Spark.
- The duo makes up for a powerful tech pair.
- Using Apache Spark and Hadoop allows for making use of Spark’s unparalleled processing power in line with the best of Hadoop’s HDFS and YARN abilities.
- Following are the ways of using Hadoop Components with Apache Spark:
- Batch & Real-Time Processing – MapReduce and Spark can be used together where the former handles the batch processing and the latter is responsible for real-time processing
- HDFS – Spark is able to run on top of the HDFS for leveraging the distributed replicated storage
- MapReduce – It is possible to use Apache Spark along with MapReduce in the same Hadoop cluster or independently as a processing framework
- YARN – Spark applications can run on YARN
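A hedged PySpark sketch of reading data that lives in HDFS; the namenode address and path below are placeholders. On a YARN cluster, such an application would typically be submitted with spark-submit using --master yarn.

    # Reading data stored in HDFS from a Spark application.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hdfs-read-demo")
             .getOrCreate())

    # Spark leverages HDFS's distributed, replicated storage transparently:
    # each executor reads the blocks that are local (or close) to it.
    logs = spark.read.text("hdfs://namenode:8020/data/logs/2024/part-*.log")

    errors = logs.filter(logs.value.contains("ERROR"))
    print(errors.count())

    spark.stop()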
.What are the downsides of Spark?
Spark relies heavily on memory, and developers working with it can make the following mistakes:
- Running everything on the local node instead of distributing work over the cluster.
- Hitting some web service too many times, because the calls are made from many distributed tasks at once.
- The first problem is well tackled by the Hadoop MapReduce paradigm, as it ensures that the data your code is churning at any point in time is fairly small, so you cannot make the mistake of trying to handle the whole dataset on a single node.
- The second mistake is possible in MapReduce too: while writing MapReduce, a user may hit a service from inside map() or reduce() too many times. This overloading of a service is also possible while using Spark.
- Doesn’t have a built-in file management system. Hence, it needs to be integrated with other platforms like Hadoop for benefiting from a file management system
- Higher latency and, consequently, lower throughput
- No support for true real-time data stream processing. The live data stream is partitioned into batches in Apache Spark and, after processing, is converted back into batches. Hence, Spark Streaming is micro-batch processing and not truly real-time data processing
- Lesser number of algorithms available
- Spark streaming doesn’t support record-based window criteria
- The work needs to be distributed over multiple nodes instead of running everything on a single node
- While using Apache Spark for cost-efficient processing of big data, its ‘in-memory’ ability becomes a bottleneck
.What are the disadvantages of using Apache Spark over Hadoop MapReduce?
- Apache Spark does not scale well for compute-intensive jobs and consumes a large amount of system resources.
- Apache Spark’s in-memory capability at times becomes a major roadblock for cost-efficient processing of big data.
- Also, Spark does not have its own file management system and hence needs to be integrated with other cloud-based data platforms or Apache Hadoop.
.What advantages does Spark offer over Hadoop MapReduce?
- Enhanced Speed – MapReduce makes use of persistent storage for carrying out any of the data processing tasks. On the contrary, Spark uses in-memory processing that offers about 10 to 100 times faster processing than Hadoop MapReduce.
- Multitasking – Hadoop only supports batch processing via inbuilt libraries. Apache Spark, on the other end, comes with built-in libraries for performing multiple tasks from the same core, including batch processing, interactive SQL queries, machine learning, and streaming.
- No Disk-Dependency – While Hadoop MapReduce is highly disk-dependent, Spark mostly uses caching and in-memory data storage.
- Iterative Computation – Performing computations several times on the same dataset is termed iterative computation. Spark is capable of iterative computation while Hadoop MapReduce isn’t (see the caching sketch after this list).
- Spark provides inbuilt libraries for most common tasks, as compared to MapReduce.
- Spark can perform multiple computations on the same dataset.
- MapReduce makes use of persistent storage for any of the data processing tasks.
- Unlike Hadoop, Spark provides built-in libraries to perform multiple tasks from the same core, such as batch processing, streaming, machine learning, and interactive SQL queries. However, Hadoop only supports batch processing.
- Spark is really fast. As per their claims, it runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. It aptly utilizes RAM to produce the faster results.
- In the MapReduce paradigm, you write many MapReduce jobs and then tie these jobs together using Oozie or shell scripts. This mechanism is very time consuming, and each MapReduce job has heavy latency.
- And quite often, translating the output of one MR job into the input of another MR job might require writing additional code, because Oozie may not suffice.
- In Spark, you can basically do everything using single application/console (pyspark or scala console) and get the results immediately.
- Switching between ‘running something on the cluster’ and ‘doing something locally’ is fairly easy and straightforward. This also leads to less context switching for the developer and more productivity.
- Spark is roughly equivalent to MapReduce and Oozie put together.
- Spark is capable of performing computations multiple times on the same dataset. This is called iterative computation while there is no iterative computing implemented by Hadoop.
- Due to the availability of in-memory processing, Spark implements the processing around 10-100x faster than Hadoop MapReduce, whereas MapReduce makes use of persistent storage for any of the data processing tasks.
- Hadoop is highly disk-dependent whereas Spark promotes caching and in-memory data storage.
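A small PySpark sketch of the caching/iterative-computation point above: the dataset is cached once and then re-traversed in memory by each subsequent action (the data and the loop are made up for illustration):

    # Iterative computation: cache the RDD in memory once, then reuse it
    # across repeated passes without re-reading the input each time (a
    # chain of MapReduce jobs would persist intermediate data to HDFS).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("iterative-demo").getOrCreate()
    sc = spark.sparkContext

    points = sc.parallelize([(1.0, 2.0), (2.0, 1.5), (3.0, 3.5), (4.0, 5.0)]).cache()

    for step in range(5):
        # each action re-traverses the cached RDD in memory, not from disk
        total = points.map(lambda p: p[0] * p[0] + p[1] * p[1]).sum()
        print(f"pass {step}: sum of squared norms = {total}")

    spark.stop()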
.Is there is a point of learning MapReduce, then?
Yes. For the following reason:
- MapReduce is a paradigm used by many big data tools, including Spark. So, understanding the MapReduce paradigm and how to convert a problem into a series of MR tasks is very important.
- When the data grows beyond what can fit into the memory on your cluster, the Hadoop Map-Reduce paradigm is still very relevant.
- Almost every other tool, such as Hive or Pig, converts its query into MapReduce phases. If you understand MapReduce, then you will be able to optimize your queries better.
.Which one will you choose for a project –Hadoop MapReduce or Apache Spark?
- The answer to this question depends on the given project scenario – as it is known that Spark makes use of memory instead of network and disk I/O.
- However, Spark uses a large amount of RAM and requires a dedicated machine to produce effective results.
- So the decision to use Hadoop or Spark varies dynamically with the requirements of the project and budget of the organization.
.When should you choose Apache Spark?
- When the application needs to scale.
- When the application needs both batch and real-time processing of records.
- When the application needs to connect to multiple data stores like Apache Cassandra, Apache HBase, SQL databases, etc.
- When the application should be able to query structured datasets cumulatively present across different database platforms.
.How can you compare Hadoop and Spark in terms of ease of use?
- Hadoop MapReduce requires programming in Java which is difficult, though Pig and Hive make it considerably easier. Learning Pig and Hive syntax takes time.
- Spark has interactive APIs for different languages like Java, Python or Scala and also includes Spark SQL (which evolved from Shark) for SQL lovers – making it comparatively easier to use than Hadoop.
.How is Spark different from MapReduce? Is Spark faster than MapReduce?
Yes, Spark is faster than MapReduce.
- There is no tight coupling in Spark, i.e., there is no mandatory rule that reduce must come after map.
- Spark tries to keep the data “in-memory” as much as possible.
- In MapReduce, the intermediate data is stored in HDFS, and hence it takes a longer time to get the data from a source; this is not the case with Spark.
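A PySpark sketch of the “no mandatory map-then-reduce ordering” point: transformations can be chained freely, and intermediate results stay in memory (the input strings are made up):

    # Transformations can be composed in any order; there is no fixed
    # map -> reduce alternation, and nothing is written to HDFS in between.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pipeline-demo").getOrCreate()
    sc = spark.sparkContext

    lines = sc.parallelize([
        "spark makes pipelines easy",
        "map and reduce need not alternate",
        "spark keeps data in memory",
    ])

    result = (lines.flatMap(lambda l: l.split())   # split into words
                   .filter(lambda w: len(w) > 4)   # filter before any "reduce"
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b)
                   .filter(lambda kv: kv[1] >= 1)  # and filter again after it
                   .collect())

    print(result)
    spark.stop()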
.List some use cases where Spark outperforms Hadoop in processing.
- Sensor Data Processing – Apache Spark’s in-memory computing works best here, as data is retrieved and combined from different sources.
- Spark is preferred over Hadoop for real time querying of data
- Stream Processing – For processing logs and detecting frauds in live streams for alerts, Apache Spark is the best solution.
Real-Time Processing: Spark is favored over Hadoop for real-time querying of data, for example:
- Stock market analysis,
- Banking,
- Healthcare,
- Telecommunications, and so on.
Big Data Processing: Spark runs up to many times faster than Hadoop when it comes to processing medium and large-sized datasets.
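A hedged Structured Streaming sketch of the stream-processing use case: a running word count over a socket source (the host and port are placeholders; this is one illustrative approach, not a production pipeline):

    # Structured Streaming: read lines from a socket, count words, and keep
    # printing the updated counts to the console.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("stream-demo").getOrCreate()

    lines = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    words  = lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())

    query.awaitTermination()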
.How does Spark use Hadoop?
Spark has its own cluster management for computation and mainly uses Hadoop for storage.
.Why is Spark faster than Hadoop?
Spark is so fast because it uses a
- state-of-the-art DAG scheduler,
- a query optimizer, and
- a physical execution engine.
.Which programming languages could be used for Spark Application Development?
One can use the following programming languages:
- Java
- Scala
- Python
- Clojure
- R (Using SparkR)
- SQL (Using SparkSQL)
Also, by piping data through other commands, we should be able to use all kinds of programming languages or binaries.
The Scala shell can be easily accessed through ./bin/spark-shell, and the Python shell through ./bin/pyspark.
Among them, Scala is the most popular because Apache Spark is written in Scala.
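A short sketch of using the interactive PySpark shell, including the pipe() mechanism that lets external commands or binaries take part in a job (the tr command is only an illustration and assumes a Unix-like environment):

    # Typed at the interactive PySpark shell (started with ./bin/pyspark),
    # where `sc` and `spark` are already defined.
    rdd = sc.parallelize(["alpha", "beta", "gamma", "delta"])

    # pipe() streams each partition's elements through an external process,
    # which is how arbitrary languages/binaries can participate in a job.
    upper = rdd.pipe("tr 'a-z' 'A-Z'")

    print(upper.collect())   # ['ALPHA', 'BETA', 'GAMMA', 'DELTA']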
.What file systems does Spark support?
The following three file systems are supported by Spark:
- Hadoop Distributed File System (HDFS)
- Local file system
- Amazon S3
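A PySpark sketch showing that the same read API covers all three file systems, with only the URI scheme changing (all paths are placeholders; S3 access additionally assumes the hadoop-aws connector is available):

    # The URI scheme selects the underlying file system.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("fs-demo").getOrCreate()

    local_df = spark.read.text("file:///tmp/sample.txt")                 # local file system
    hdfs_df  = spark.read.text("hdfs://namenode:8020/data/sample.txt")   # HDFS
    s3_df    = spark.read.text("s3a://my-bucket/data/sample.txt")        # Amazon S3

    print(local_df.count())
    spark.stop()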
.Which data sources can Spark access?
Spark can access data from hundreds of sources. Some of them are:
- HDFS
- Apache Cassandra
- Apache HBase
- Apache Hive
- Parquet file
- JSON Datasets
- SparkSQL is a Spark component that supports querying data either via SQL or via the Hive Query Language.
- It originated as the Apache Hive port to run on top of Spark (in place of MapReduce) and is now integrated with the Spark stack.
- In addition to providing support for various data sources, it makes it possible to weave SQL queries with code transformations which results in a very powerful tool.
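A PySpark sketch of Spark SQL over multiple data sources, mixing SQL with DataFrame transformations (file paths, schemas, and column names are hypothetical):

    # Load JSON and Parquet sources, register them as views, and query with SQL.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-demo").getOrCreate()

    users  = spark.read.json("data/users.json")
    orders = spark.read.parquet("data/orders.parquet")

    users.createOrReplaceTempView("users")
    orders.createOrReplaceTempView("orders")

    top_spenders = spark.sql("""
        SELECT u.name, SUM(o.amount) AS total
        FROM users u JOIN orders o ON u.id = o.user_id
        GROUP BY u.name
        ORDER BY total DESC
        LIMIT 10
    """)

    # SQL results are ordinary DataFrames, so code transformations can follow.
    top_spenders.filter("total > 1000").show()
    spark.stop()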
.How can you minimize data transfers when working with Spark?
The different ways in which data transfers can be minimized when working with Apache Spark are:
Broadcast and Accumulator variables.
.What are the different levels of persistence in Apache Spark?
- Apache Spark automatically persists the intermediary data from various shuffle operations; however, it is often suggested that users call the persist() method on the RDD if they plan to reuse it.
- Spark has various persistence levels to store the RDDs on disk, in memory, or as a combination of both, with different replication levels.
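A PySpark sketch of choosing a persistence level explicitly with persist() (the data is made up; MEMORY_AND_DISK is just one of the available levels):

    # persist() lets you choose where the RDD is kept between reuses:
    # memory only, memory and disk, serialized, with or without replication.
    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("persist-demo").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(100000)).map(lambda x: x * x)

    rdd.persist(StorageLevel.MEMORY_AND_DISK)   # spill to disk if memory is tight
    print(rdd.count())    # first action materialises and persists the data
    print(rdd.sum())      # later actions reuse the persisted copy

    rdd.unpersist()
    spark.stop()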
.What are the disadvantages of using Apache Spark over Hadoop MapReduce?
- Apache Spark’s in-memory capability at times becomes a significant barrier to cost-effective processing of big data.
- Likewise, Spark does not have its own file management system and consequently needs to be integrated with other cloud-based data platforms or Apache Hadoop.
.What is the most commonly used programming language in Spark?
- A great representation of the basic interview questions on Spark, this one should be a no-brainer.
- Even though there are plenty of developers that like to use Python, Scala remains the most commonly used language for Spark.
.What is the advantage of Spark’s lazy evaluation?
Apache Spark uses lazy evaluation for the following benefits:
- Applying transformations on an RDD, or loading data into an RDD, is not executed immediately; it waits until Spark sees an action. Transformations on RDDs and storing data in RDDs are lazily evaluated. Resources are used in a better way because Spark uses lazy evaluation.
- Lazy evaluation optimizes disk and memory usage in Spark.
- The operations are triggered only when the data is required, which reduces overhead.
.Explain the popular use cases of Apache Spark.
Apache Spark is mainly used for
- Iterative machine learning.
- Interactive data analytics and processing.
- Stream processing
- Sensor data processing
.Is Apache Spark a good fit for Reinforcement learning?
No. Apache Spark works well only for simple machine learning algorithms like clustering, regression, classification.
.How does Spark handle distributed processing?
- Spark provides an abstraction to the distributed processing through Spark RDD API.
- A general user does not need to worry about how data is processed in a distributed cluster.
- There are some exceptions, though. When you optimize an application for performance, you should understand which operations and actions require data transfer between nodes.
.While processing data from HDFS, does Spark execute code near the data?
Yes, it does in most cases. It creates the executors on or near the nodes that contain the data.
.To use Spark on an existing Hadoop cluster, do we need to install Spark on all nodes of Hadoop?
Since Spark runs as an application on top of YARN, it utilizes YARN for the execution of its commands over the cluster’s nodes. So, you do not need to install Spark on all nodes. When a job is submitted, the required Spark libraries are distributed temporarily to the nodes on which execution is needed.
.Can you explain how to minimize data transfers while working with Spark?
Minimizing data transfers as well as avoiding shuffling helps in writing Spark programs capable of running reliably and fast. Several ways for minimizing data transfers while working with Apache Spark are:
- Avoiding shuffles – avoid ByKey operations, repartition, and other operations responsible for triggering shuffles
- Using Accumulators – Accumulators provide a way for updating the values of variables while executing the same in parallel
- Using Broadcast Variables – A broadcast variable helps in enhancing the efficiency of joins between small and large RDDs
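A PySpark sketch of the broadcast-variable and accumulator points above (the lookup table and records are made up for illustration):

    # Broadcast a small lookup table instead of shuffling it into a join,
    # and use an accumulator to count bad records on the executors.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("transfer-demo").getOrCreate()
    sc = spark.sparkContext

    country_names = sc.broadcast({"DE": "Germany", "FR": "France", "IN": "India"})
    bad_records   = sc.accumulator(0)

    def enrich(record):
        code, amount = record
        name = country_names.value.get(code)
        if name is None:
            bad_records.add(1)   # counted on the executors, summed on the driver
        return (name or "unknown", amount)

    data = sc.parallelize([("DE", 10.0), ("FR", 7.5), ("XX", 1.0), ("IN", 3.2)])
    print(data.map(enrich).collect())
    print("records with unknown country:", bad_records.value)

    spark.stop()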
.What are the common mistakes developers make when running Spark applications?
- Hitting a web service too many times, because calls are issued from many distributed tasks.
- Running everything on the local node instead of distributing it.
Developers need to be careful with this, as Spark makes use of memory for processing.
- Not maintaining the required size of shuffle blocks.
- Spark developers often make mistakes when managing directed acyclic graphs (DAGs).