SPARK-3

SPARK BASICS-1

 


 

.What is Spark? Why Spark?

  • Spark is the third-generation distributed data processing platform.
  • It's a unified big data solution for all big data processing problems such as batch, interactive and streaming processing.
  • Apache Spark is an open-source framework used mainly for Big Data analysis, machine learning and real-time processing.
  • Apache Spark is a cluster computing framework which runs on a cluster of commodity hardware and performs data unification, i.e., reading and writing a wide variety of data from multiple sources.
  • In Spark, a task is an operation that can be a map task or a reduce task.
  • Spark Context handles the execution of the job and also provides APIs in different languages, i.e., Scala, Java and Python, to develop applications, with faster execution compared to MapReduce.
  • Apache Spark is an Open Source Project from the Apache Software Foundation.
  • Apache Spark is a data processing engine and is being used in data processing and data analytics.
  • It has inbuilt libraries for Machine Learning, Graph Processing, and SQL Querying.
  • Spark is horizontally scalable and is very efficient in terms of speed when compared to big data giant Hadoop.
  • The framework provides a fully-functional interface for programmers and developers – this interface does a great job in aiding in various complex cluster programming and machine learning tasks.
  • Spark is a fast, easy-to-use and flexible data processing framework.
  • It has an advanced DAG execution engine supporting acyclic data flow and in-memory computing.
  • Spark can run on Hadoop, standalone or in the cloud and is capable of accessing diverse data sources including HDFS, HBase, Cassandra and others.

.What are the key features of Apache Spark that you like?

  • Spark provides advanced analytic options like graph algorithms, machine learning, streaming data, etc
  • It has built-in APIs in multiple languages like Java, Scala, Python and R
  • It has good performance gains, as it helps run an application in the Hadoop cluster ten times faster on disk and 100 times faster in memory.
  • Allows Integration with Hadoop and files included in HDFS.
  • Spark has an interactive language shell as it has an independent Scala (the language in which Spark is written) interpreter
  • Spark consists of RDDs (Resilient Distributed Datasets), which can be cached across computing nodes in a cluster.
  • Spark supports multiple analytic tools that are used for interactive query analysis, real-time analysis and graph processing.
  • Support for Several Programming Languages
    • Spark code can be written in any of the four programming languages, namely Java, Python, R, and Scala.
    • It also provides high-level APIs in these programming languages. Additionally, Apache Spark provides shells in Python and Scala.
    • The Python shell is accessed by running ./bin/pyspark, while the Scala shell is accessed by running ./bin/spark-shell.
  • Lazy Evaluation – Apache Spark makes use of lazy evaluation, i.e., it delays evaluation until it becomes absolutely necessary.
  • Machine Learning – For big data processing, Apache Spark’s MLlib machine learning component is useful. It eliminates the need for separate engines for processing and machine learning.
  • Multiple Format Support
    • Apache Spark provides support for multiple data sources, including Cassandra, Hive, JSON, and Parquet.
    • The Data Sources API offers a pluggable mechanism for accessing structured data via Spark SQL.
    • These data sources can be much more than simple pipes that convert data and pull it into Spark.
  • Real-Time Computation – Spark is designed especially to meet massive scalability requirements. Thanks to its in-memory computation, Spark’s computation is real-time and has low latency.
  • Speed
    • For large-scale data processing, Spark can be up to 100 times faster than Hadoop MapReduce.
    • Apache Spark is able to achieve this tremendous speed via controlled partitioning.
    • The distributed, general-purpose cluster-computing framework manages data by means of partitions that help in parallelizing distributed data processing with minimal network traffic.
    • Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and 10x faster even when running on disk.
    • Spark makes this possible by reducing the number of reads/writes to disk; it stores the intermediate processing data in memory.
    • It uses the concept of a Resilient Distributed Dataset (RDD), which allows it to transparently store data in memory and persist it to disk only when it is needed.
    • This helps reduce most of the disk reads and writes – the main time-consuming factors – in data processing (see the sketch after this list).
  • Hadoop Integration
    • Spark offers smooth connectivity with Hadoop.
    • In addition to being a potential replacement for the Hadoop MapReduce functions, Spark is able to run on top of an existing Hadoop cluster, using YARN for resource scheduling.
  • Combines SQL, streaming, and complex analytics:
    •  In addition to simple “map” and “reduce” operations,
    • Spark supports
      • SQL queries,
      • streaming data, and
      • complex analytics such as machine learning and graph algorithms out-of-the-box.
    • Not only that, users can combine all these capabilities seamlessly in a single workflow.
  • Ease of Use: Spark lets you quickly write applications in Java, Scala, or Python. This helps developers create and run applications in programming languages they are already familiar with, and makes it easy to build parallel apps.
  • Runs Everywhere: 
    • Spark runs on Hadoop, Mesos, standalone, or in the cloud.
    • It can access diverse data sources including HDFS, Cassandra, HBase, S3.
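
As a minimal illustration of the in-memory behaviour described in the Speed bullet above, the sketch below caches a DataFrame so that repeated actions reuse data already held in memory instead of re-reading it from disk. The file path and the status column are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

# Hypothetical input path and schema; replace with a real dataset.
events = spark.read.csv("hdfs:///data/events.csv", header=True, inferSchema=True)

# cache() asks Spark to keep the DataFrame in memory after the first action,
# so later computations avoid re-reading the file from disk.
events.cache()

print(events.count())                          # first action: reads from disk and fills the cache
print(events.filter("status = 'ok'").count())  # subsequent actions reuse the in-memory data

spark.stop()
```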

.What are some of the more notable features of Spark?

  • Speed, multi-format support, and inbuilt libraries.
  • Allows integration with Hadoop and files included in HDFS.
  • Spark has an interactive language shell, as it ships with an independent interpreter for Scala (the language in which Spark is written).
  • Spark consists of RDDs (Resilient Distributed Datasets), which can be cached across the computing nodes in a cluster.
  • Spark supports multiple analytic tools that are used for interactive query analysis, real-time analysis and graph processing.

Explain the key features of Apache Spark.

  • Polyglot
  • Speed
  • Multiple Format Support
  • Lazy Evaluation
  • Real Time Computation
  • Hadoop Integration
  • Machine Learning

.Features, Pros and Cons of Apache Spark

Apache Spark is a tool for large-scale data processing and execution. It offers high-level operators that make it easy to develop parallel applications. The most prominent Apache Spark features are:

  • Polyglot – support for multiple languages
  • High processing speed
  • Support for multiple data formats
  • Real-time computation on data
  • Efficient machine learning

Pros associated

  • In-memory computation.
  • Reusability of Spark code.
  • Fault tolerance.

Cons associated

  • Sometimes it becomes a bottleneck when it comes to cost efficiency.
  • As compared to Apache Flink, Apache Spark has higher latency.

.Can you explain how you can use Apache Spark along with Hadoop?

  • Having compatibility with Hadoop is one of the leading advantages of Apache Spark.
  • The duo makes up for a powerful tech pair.
  • Using Apache Spark and Hadoop allows for making use of Spark’s unparalleled processing power in line with the best of Hadoop’s HDFS and YARN abilities.
  • Following are the ways of using Hadoop Components with Apache Spark:
    • Batch & Real-Time Processing – MapReduce and Spark can be used together where the former handles the batch processing and the latter is responsible for real-time processing
    • HDFS – Spark is able to run on top of HDFS to leverage the distributed replicated storage (see the sketch below)
    • MapReduce – It is possible to use Apache Spark along with MapReduce in the same Hadoop cluster or independently as a processing framework
    • YARN – Spark applications can run on YARN
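
To make the HDFS point above concrete, here is a hedged PySpark sketch that reads a text file from HDFS, filters it, and writes the result back to HDFS. The paths are hypothetical, and in practice the YARN master is normally supplied externally via spark-submit --master yarn rather than in code.

```python
from pyspark.sql import SparkSession

# The cluster manager (e.g. YARN) is usually chosen at submit time:
#   spark-submit --master yarn this_script.py
spark = SparkSession.builder.appName("hdfs-demo").getOrCreate()

# Hypothetical HDFS input path.
logs = spark.read.text("hdfs:///user/demo/raw_logs/")

# Keep only the lines containing "ERROR".
errors = logs.filter(logs.value.contains("ERROR"))

# Write the result back to HDFS, relying on its replicated storage.
errors.write.mode("overwrite").text("hdfs:///user/demo/error_logs/")

spark.stop()
```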

.What are the downsides of Spark?

Spark utilizes a lot of memory, and developers commonly make the following mistakes:

  • They end up running everything on the local node instead of distributing work over to the cluster.
  • They hit some web service too many times by using multiple clusters.
  • The first problem is well tackled by the Hadoop MapReduce paradigm, as it ensures that the data your code is churning is fairly small at any point in time, so you cannot make the mistake of trying to handle the whole data on a single node.
  • The second mistake is possible in MapReduce too: while writing MapReduce code, a user may hit a service from inside map() or reduce() too many times. This overloading of a service is also possible while using Spark.
  • Doesn’t have a built-in file management system. Hence, it needs to be integrated with other platforms like Hadoop for benefiting from a file management system
  • Higher latency and, consequently, lower throughput
  • No support for true real-time data stream processing. The live data stream is partitioned into batches in Apache Spark, and the results after processing are again returned as batches. Hence, Spark Streaming is micro-batch processing and not truly real-time data processing
  • Lesser number of algorithms available
  • Spark streaming doesn’t support record-based window criteria
  • The work needs to be distributed over multiple clusters instead of running everything on a single node
  • While using Apache Spark for cost-efficient processing of big data, its ‘in-memory’ ability becomes a bottleneck

What are the disadvantages of using Apache Spark over Hadoop MapReduce?

  • Apache Spark does not scale well for compute-intensive jobs and consumes a large amount of system resources.
  • Apache Spark’s in-memory capability at times becomes a major roadblock for cost-efficient processing of big data.
  • Also, Spark does not have its own file management system and hence needs to be integrated with other cloud-based data platforms or Apache Hadoop.

.What advantages does Spark offer over Hadoop MapReduce?

  • Enhanced Speed – MapReduce makes use of persistent storage for carrying out any of the data processing tasks. On the contrary, Spark uses in-memory processing that offers about 10 to 100 times faster processing than the Hadoop MapReduce.
  • Multitasking – Hadoop only supports batch processing via inbuilt libraries. Apache Spark, on the other end, comes with built-in libraries for performing multiple tasks from the same core, including batch processing, interactive SQL queries, machine learning, and streaming.
  • No Disk-Dependency – While Hadoop MapReduce is highly disk-dependent, Spark mostly uses caching and in-memory data storage.
  • Iterative Computation – Performing computations several times on the same dataset is termed as iterative computation. Spark is capable of iterative computation while Hadoop MapReduce isn’t.
  • Spark provides inbuilt libraries for most multidimensional tasks, unlike MapReduce.
  • Spark can perform multiple computations on the same dataset.
  • MapReduce makes use of persistent storage for all of its data processing tasks.
  • Unlike Hadoop, Spark provides built-in libraries to perform multiple tasks from the same core, such as batch processing, streaming, machine learning and interactive SQL queries, whereas Hadoop only supports batch processing.
  • Spark is really fast. As per their claims, it runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. It aptly utilizes RAM to produce faster results.
  • In the MapReduce paradigm, you write many MapReduce tasks and then tie these tasks together using Oozie or shell scripts. This mechanism is very time-consuming, and MapReduce tasks have heavy latency.
  • Quite often, translating the output of one MR job into the input of another MR job requires writing additional code, because Oozie may not suffice.
  • In Spark, you can basically do everything in a single application/console (pyspark or the Scala console) and get the results immediately (see the sketch after this list).
  • Switching between ‘running something on a cluster’ and ‘doing something locally’ is fairly easy and straightforward. This also leads to less context switching for the developer and more productivity.
  • Spark is roughly equivalent to MapReduce and Oozie put together.
  • Spark is capable of performing computations multiple times on the same dataset; this is called iterative computation, while there is no iterative computing implemented by Hadoop.
  • Hadoop is highly disk-dependent whereas Spark promotes caching and in-memory data storage.
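
The sketch below illustrates the "single application instead of a chain of MR jobs tied together with Oozie" point: a load, filter, aggregate and write pipeline expressed in one PySpark program. The dataset, column names and paths are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("single-app-pipeline").getOrCreate()

# Step 1: load (hypothetical path and schema).
orders = spark.read.json("hdfs:///data/orders/")

# Step 2: filter and aggregate — in MapReduce these could easily be separate jobs.
daily_revenue = (orders
                 .filter(F.col("status") == "completed")
                 .groupBy("order_date")
                 .agg(F.sum("amount").alias("revenue")))

# Step 3: write the result; intermediate results are not written to HDFS between steps.
daily_revenue.write.mode("overwrite").parquet("hdfs:///data/daily_revenue/")

spark.stop()
```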

.Is there a point in learning MapReduce, then?

Yes. For the following reason:

  • MapReduce is a paradigm used by many big data tools, including Spark. So, understanding the MapReduce paradigm and how to convert a problem into a series of MR tasks is very important.
  • When the data grows beyond what can fit into the memory on your cluster, the Hadoop Map-Reduce paradigm is still very relevant.
  • Almost every other tool, such as Hive or Pig, converts its queries into MapReduce phases. If you understand MapReduce, you will be able to optimize your queries better.

.Which one will you choose for a project –Hadoop MapReduce or Apache Spark?

  • The answer to this question depends on the given project scenario – as it is known that Spark makes use of memory instead of network and disk I/O.
  • However, Spark uses a large amount of RAM and requires dedicated machines to produce effective results.
  • So the decision to use Hadoop or Spark varies dynamically with the requirements of the project and budget of the organization. 

.When should you choose Apache Spark?

  • When the application needs to scale.
  • When the application needs both batch and real-time processing of records.
  • When the application needs to connect to multiple data stores like Apache Cassandra, Apache HBase, SQL databases, etc.
  • When the application should be able to query structured datasets cumulatively present across different database platforms.

.How can you compare Hadoop and Spark in terms of ease of use?

  • Hadoop MapReduce requires programming in Java which is difficult, though Pig and Hive make it considerably easier. Learning Pig and Hive syntax takes time.
  • Spark has interactive APIs for different languages like Java, Python or Scala and also includes Spark SQL (formerly Shark) for SQL lovers – making it comparatively easier to use than Hadoop.

.How is Spark different from MapReduce? Is Spark faster than MapReduce?

  Yes, Spark is faster than MapReduce.

  • There is no tight coupling in Spark i.e., there is no mandatory rule that reduce must come after map.

  • Spark tries to keep the data “in-memory” as much as possible.

    • In MapReduce, the intermediate data will be stored in HDFS and hence takes longer time to get the data from a source but this is not the case with Spark.

 .List some use cases where Spark outperforms Hadoop in processing.

  • Sensor Data Processing –Apache Spark’s ‘In-memory computing’ works best here, as data is retrieved and combined from different sources.
  • Spark is preferred over Hadoop for real time querying of data
  • Stream Processing – For processing logs and detecting frauds in live streams for alerts, Apache Spark is the best solution.

Real-Time Processing: Spark is favored over Hadoop for real-time querying of data, for example in:

  • Stock market analysis,
  • Banking,
  • Healthcare,
  • Telecommunications, and so on.

Big Data Processing: Spark runs many times faster than Hadoop when it comes to processing medium and large-sized datasets.

.How Spark uses Hadoop?

Spark has its own cluster management for computation and mainly uses Hadoop for storage.

.Why is Spark faster than Hadoop?

Spark is so fast because it uses a

  • state-of-the-art DAG scheduler,
  • a query optimizer, and
  • a physical execution engine.

 .Which programming languages could be used for Spark Application Development?
One can use the following programming languages:

  • Java
  • Scala
  • Python
  • Clojure
  • R (Using SparkR)
  • SQL (Using SparkSQL)

Also, by piping the data through other commands, we should be able to use all kinds of programming languages or binaries.

 The Scala shell can be accessed by running ./bin/spark-shell and the Python shell by running ./bin/pyspark.

 Among them, Scala is the most popular because Apache Spark is written in Scala.

 

.What file systems does Spark support?

The following three file systems are supported by Spark:

  • Hadoop Distributed File System (HDFS)
  • Local file system
  • Amazon S3

 .Which data sources can Spark access?
Spark can access data from hundreds of sources. Some of them are :

  • HDFS
  • Apache Cassandra
  • Apache HBase
  • Apache Hive
  • Parquet file
  • JSON Datasets
  • SparkSQL is a Spark component that supports querying data either via SQL or via the Hive Query Language.
  • It originated as the Apache Hive port to run on top of Spark (in place of MapReduce) and is now integrated with the Spark stack.
  • In addition to providing support for various data sources, it makes it possible to weave SQL queries with code transformations, which results in a very powerful tool (see the sketch below).
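
As a hedged sketch of the Spark SQL point above, the snippet below loads a JSON data source, registers it as a temporary view, and mixes a SQL query with a regular DataFrame transformation. The file path and fields are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# Load a JSON data source (hypothetical path) via the Data Sources API.
people = spark.read.json("hdfs:///data/people.json")

# Register it as a temporary view and query it with plain SQL.
people.createOrReplaceTempView("people")
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")

# SQL results are ordinary DataFrames, so they can be woven with code transformations.
adults.filter(adults.age < 65).show()

spark.stop()
```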

.How can you minimize data transfers when working with Spark?

The various ways in which data transfers can be minimized when working with Apache Spark are:

Broadcast and accumulator variables (covered in more detail below).

.What are the different levels of persistence in Apache Spark?

  • Apache Spark automatically persists the intermediary data from various shuffle operations; however, it is often suggested that users call the persist() method on an RDD if they plan to reuse it.
  • Spark has various persistence levels to store the RDDs on disk or in memory, or as a combination of both, with different replication levels (see the sketch below).
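
A minimal sketch of choosing an explicit persistence level with the RDD API; the numbers are placeholder data.

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="persistence-demo")

# Placeholder RDD.
numbers = sc.parallelize(range(1_000_000))
squares = numbers.map(lambda x: x * x)

# Explicit persistence level: keep partitions in memory and spill to disk
# when memory is insufficient. Other levels include MEMORY_ONLY, DISK_ONLY
# and replicated variants such as MEMORY_AND_DISK_2.
squares.persist(StorageLevel.MEMORY_AND_DISK)

print(squares.count())  # materializes and stores the RDD
print(squares.sum())    # reuses the persisted partitions

sc.stop()
```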

What are the disadvantages of using Apache Spark over Hadoop MapReduce?

  • Apache Spark’s in-memory capability at times becomes a major barrier for cost-efficient processing of big data.
  • Likewise, Spark does not have its own file management system and consequently needs to be integrated with other cloud-based data platforms or Apache Hadoop.

What is the most commonly used programming language used in Spark?

  • A great representation of the basic interview questions on Spark, this one should be a no-brainer.
  • Even though there are plenty of developers that like to use Python, Scala remains the most commonly used language for Spark.

Can you use Apache Spark alongside Hadoop?

They make a powerful pair together. Spark runs on top of Hadoop’s HDFS, Hadoop’s MapReduce can be used alongside Spark, and many Spark applications run on YARN. Used together, MapReduce and Spark provide batch and real-time processing respectively.

What are the demerits of Spark?

  • Spark uses more storage compared to Hadoop.
  • Developers should be cautious while running applications in Spark.
  • The work needs to be distributed over multiple clusters.
  • The “in-memory” capability becomes a bottleneck for cost-efficient processing.
  • Spark consumes a vast amount of memory.


What are the limitations of Apache Spark?
The limitations of Apache Spark are:

1. No File Management System
Apache Spark relies on other platforms like Hadoop or some other cloud-based platform for a file management system. This is one of the major issues with Apache Spark.

2. Latency
Compared with alternatives such as Apache Flink, Apache Spark has higher latency.

3. No support for Real-Time Processing
In Spark Streaming, the arriving live stream of data is divided into batches of a pre-defined interval, and each batch of data is treated as a Spark Resilient Distributed Dataset (RDD). These RDDs are then processed using operations like map, reduce and join, and the results of these operations are returned in batches. Thus, it is not real-time processing; Spark provides near real-time processing of live data. Micro-batch processing is what takes place in Spark Streaming (see the sketch below).
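
The sketch below shows the micro-batch model in (legacy DStream-based) Spark Streaming: the batch interval passed to StreamingContext decides how the live stream is cut into batches. The socket source on localhost:9999 is purely illustrative.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="micro-batch-demo")

# The second argument is the batch interval: the live stream is cut into
# 5-second batches, and each batch is processed as an RDD — near real-time,
# not true record-at-a-time streaming.
ssc = StreamingContext(sc, 5)

lines = ssc.socketTextStream("localhost", 9999)  # illustrative source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```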

4. Manual Optimization
Manual optimization is required to optimize Spark jobs, and the optimizations are adequate only for specific datasets. We need to handle partitioning and caching manually if we want them to be correct in Spark.

5. Fewer Algorithms
Spark MLlib lags behind in terms of the number of available algorithms, for example Tanimoto distance.

6. Window Criteria
Spark does not support record based window criteria. It only has time-based window criteria.

7. Iterative Processing
In Spark, the data iterates in batches and each iteration is scheduled and executed separately.

8. Expensive
When we want cost-efficient processing of big data, the in-memory capability can become a bottleneck, as keeping data in memory is quite expensive. Memory consumption is then very high, and it is not handled in a user-friendly manner. The cost of Spark is quite high because Apache Spark requires lots of RAM to run in-memory.

In summary:

1) No real-time processing: Spark offers near real-time processing of live data.
2) Its “in-memory” capability can become a bottleneck when it comes to cost-efficient processing of big data.
3) It does not have its own file management system, so it needs to be integrated with Hadoop or another cloud-based data platform.
4) It consumes a lot of memory, and issues around memory consumption are not handled in a user-friendly manner.
5) The cost of Spark is sky high, as storing a large amount of data in memory is expensive.
6) Manual optimization is required for correct partitioning and caching of data in Spark.
7) In Spark, the data iterates in batches and each iteration is scheduled and executed separately.
8) Transferring data from an RDBMS to HDFS (and vice versa) using Spark is not a mature approach. We can read the data in parallel from an RDBMS and represent it as a DataFrame, but it is a rigid way of doing it: for the Spark JDBC API we have to provide a lower bound, an upper bound and a partition column (this partitioning is not the same as Hive partitioning) to read the data in parallel into a Spark DataFrame. Sqoop is a more mature way of doing this.

.What is Spark SQL?

Spark SQL is a Spark interface for working with structured and semi-structured data (data that has defined fields, i.e. tables). It provides abstraction layers called DataFrame and Dataset through which we can work with data easily. One can say that a DataFrame is like a table in a relational database. Spark SQL can read and write data in a variety of structured and semi-structured formats like Parquet, JSON and Hive. Using Spark SQL inside a Spark application is the best way to use it: this empowers us to load data and query it with SQL, and we can also combine it with “regular” program code in Python, Java or Scala.

Spark SQL is a module in Apache Spark for structured and semi-structured data processing. The interface provided by Spark SQL gives Spark more information about the structure of the data and the computation being performed on it. It integrates relational processing with Spark’s functional programming and offers much tighter integration of relational processing with procedural processing through declarative DataFrame APIs. The DataFrame and Dataset APIs are the ways to interact with Spark SQL.

.Define the various types of transformations in Apache Spark Streaming.

Transformations in Spark Streaming fall into two categories: stateless transformations (such as map, filter and reduceByKey), where the processing of each batch does not depend on the data of previous batches, and stateful transformations (such as updateStateByKey and windowed operations), which use data or intermediate results from previous batches.

. What are all the file formats supported by Spark?

Avro, Parquet, JSON, XML, CSV, TSV, Snappy-compressed files, ORC and RC are the file formats supported by Spark.
Raw files as well as structured file formats are supported by Spark for efficient reading.

. What are all the internal daemons used in Spark?

ACLs, BlockManager, MemoryStore, DAGScheduler, SparkContext, Driver, Worker, Executor, Tasks.

.What is the advantage of Spark’s lazy evaluation?

Apache Spark uses lazy evaluation in order to gain the following benefits:

  • Applying transformations on an RDD, or “loading data into an RDD”, is not executed immediately; nothing happens until Spark sees an action. Transformations on RDDs and storing data in an RDD are lazily evaluated. Resources are used in a better way because Spark uses lazy evaluation.
  • Lazy evaluation optimizes disk and memory usage in Spark.
  • Operations are triggered only when the data is required, which reduces overhead (see the sketch below).
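
A minimal sketch of lazy evaluation with the RDD API: the transformations only record lineage, and nothing runs until the action at the end. The numbers are placeholders.

```python
from pyspark import SparkContext

sc = SparkContext(appName="lazy-eval-demo")

data = sc.parallelize(range(10))

# Transformations only build the lineage graph; no computation happens yet.
doubled = data.map(lambda x: x * 2)
evens = doubled.filter(lambda x: x % 4 == 0)

# The action triggers the whole pipeline in a single pass over the data.
print(evens.collect())

sc.stop()
```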

  Explain the popular use cases of Apache Spark

 

Apache Spark is mainly used for

  • Iterative machine learning.
  • Interactive data analytics and processing.
  • Stream processing
  • Sensor data processing

 .Is Apache Spark a good fit for Reinforcement learning?

No. Apache Spark works well only for simple machine learning algorithms like clustering, regression and classification (see the sketch below).
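
As an example of the kind of “simple” machine learning workload Spark handles well, here is a hedged k-means clustering sketch using MLlib; the toy points are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-kmeans-demo").getOrCreate()

# Toy two-dimensional points (placeholder data).
points = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)],
    ["features"])

# Fit a k-means model with two clusters and print the cluster centres.
model = KMeans(k=2, seed=42).fit(points)
print(model.clusterCenters())

spark.stop()
```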

 .How does Spark handle distributed processing?

  • Spark provides an abstraction to the distributed processing through Spark RDD API.
  • A general user does not need to worry about how data is processed in a distributed cluster.
  • There are some exceptions, though. When you optimize an application for performance, you should understand which operations and actions require data transfer between nodes.

 

 While processing data from HDFS, does Spark execute code near the data?

Yes, it does in most cases. It creates the executors near the nodes that contain the data.

 To use Spark on an existing Hadoop Cluster, do we need to install Spark on all nodes of Hadoop?

Since Spark runs as an application on top of YARN, it utilizes YARN for the execution of its commands over the cluster’s nodes. So, you do not need to install Spark on all nodes. When a job is submitted, the Spark runtime libraries are shipped temporarily to the nodes on which execution is needed.

Can you explain how to minimize data transfers while working with Spark?
Minimizing data transfers as well as avoiding shuffling helps in writing Spark programs capable of running reliably and fast. Several ways of minimizing data transfers while working with Apache Spark are:

  • Avoiding ByKey operations, repartition, and other operations responsible for triggering shuffles.
  • Using accumulators – accumulators provide a way of updating the values of variables while executing in parallel.
  • Using broadcast variables – a broadcast variable helps enhance the efficiency of joins between small and large RDDs (see the sketch below).
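
The sketch below shows one common form of the broadcast idea: a broadcast-join hint that ships a small lookup table to every executor so the large table is not shuffled across the network. The tables and columns are placeholder data.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()

# Large fact table and small dimension table (placeholder data).
orders = spark.createDataFrame(
    [(1, "US", 30.0), (2, "DE", 12.5), (3, "US", 7.0)],
    ["order_id", "country_code", "amount"])
countries = spark.createDataFrame(
    [("US", "United States"), ("DE", "Germany")],
    ["country_code", "country_name"])

# broadcast() hints Spark to send the small table to every executor,
# so the join avoids shuffling the large table.
orders.join(broadcast(countries), "country_code").show()

spark.stop()
```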

What are the common mistakes developers make when running Spark applications?

  • Hitting the web service several times by using multiple clusters.
  • Run everything on the local node instead of distributing it.

Developers need to be careful with this, as Spark makes use of memory for processing.

  • Maintaining the required size of shuffle blocks.
  • Spark developers often make mistakes when managing directed acyclic graphs (DAGs).