SPARK-2

SPARK STAGE

Spark Executor

SPARK DAG

SPARK CLUSTER MANAGER

——————————————————————————————————————————————————————————

SPARK STAGE

[Figure: Spark Stage – an introduction to the physical execution plan]

  • A stage is a step in the physical execution plan; it is a physical unit of execution.
  • Stages in Spark are of two types: ShuffleMapStage and ResultStage.
  • A stage is a set of parallel tasks, one task per partition.
  • In other words, each job gets divided into smaller sets of tasks called stages.
  • Stages depend on one another, much like the map and reduce stages in MapReduce (see the sketch after this list).
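The following minimal sketch (hypothetical RDD names and data) shows a job that the scheduler splits into two stages: the shuffle introduced by reduceByKey marks the stage boundary, and each stage runs one task per partition.

import org.apache.spark.sql.SparkSession

object StageDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("StageDemo").master("local[2]").getOrCreate()
    val sc = spark.sparkContext

    val words  = sc.parallelize(Seq("a", "b", "a", "c", "b"), numSlices = 2)
    val counts = words.map(w => (w, 1))   // runs in the first stage (the map side of the shuffle)
      .reduceByKey(_ + _)                 // shuffle dependency, so a second stage is created
    counts.collect()                      // the action submits one job that runs as two stages

    spark.stop()
  }
}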

[Figure: Submitting a job in Spark]

 

  • A Spark stage can have many dependent parent stages.
  • However, a stage can only work on the partitions of a single RDD.
  • The boundary of a stage in Spark is marked by shuffle dependencies.
  • Ultimately, submission of a Spark stage triggers the execution of its series of dependent parent stages.
  • Every stage also records a first Job Id, the id of the job that first submitted the stage.

 

 


Types of Spark Stages

Stages in Apache Spark fall into two categories:

1. ShuffleMapStage in Spark

2. ResultStage in Spark

1. ShuffleMapStage in Spark

ShuffleMapStage is an intermediate stage in the physical execution of the DAG.

  • It produces data for one or more following stages.
  • In a job using Adaptive Query Planning / Adaptive Scheduling, it can also be the final stage, and it is possible to submit it independently as a Spark job for Adaptive Query Planning.
  • At execution time, a ShuffleMapStage saves map output files, which reduce tasks can later fetch.
  • When all map outputs are available, the ShuffleMapStage is considered ready.
  • Output locations can sometimes be missing; this means the corresponding partitions have either not been calculated yet or have been lost.
  • To track how many shuffle map outputs are available, the stage uses the outputLocs and _numAvailableOutputs internal registries.
  • A ShuffleMapStage is an input for the following Spark stages in the DAG of stages; it is the map side of a shuffle dependency.
  • A ShuffleMapStage can contain multiple pipelined operations, such as map and filter, before the shuffle operation.
  • A single ShuffleMapStage can be shared among different jobs.

 

[Figure: DAG Scheduler]

2. ResultStage in Spark

  • A ResultStage is created by running a function on an RDD when a Spark action is executed in a user program.
  • It is the final stage in a Spark job.
  • A ResultStage applies the function to one or many partitions of the target RDD.
  • It thereby computes the result of the action.

[Figure: Graph of Spark stages]

Getting StageInfo For Most Recent Attempt

The latestInfo method returns the StageInfo of the most recent attempt:
latestInfo: StageInfo

Stage Contract

It is a private[scheduler] abstract contract.
abstract class Stage {
  def findMissingPartitions(): Seq[Int]
}
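As a simplified, hypothetical illustration of this contract (not Spark's actual internal code), a stage must be able to report which of its partitions still need to be computed:

// Hypothetical names, for illustration only.
abstract class DemoStage(val numPartitions: Int) {
  def findMissingPartitions(): Seq[Int]
}

// For a shuffle-map-like stage, the missing partitions are those whose
// map output is not yet available.
class DemoShuffleMapStage(numPartitions: Int,
                          availableOutputs: Set[Int]) extends DemoStage(numPartitions) {
  override def findMissingPartitions(): Seq[Int] =
    (0 until numPartitions).filterNot(availableOutputs.contains)
}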

Method To Create New Apache Spark Stage

The basic method for creating a new stage attempt in Spark is:

makeNewStageAttempt(
  numPartitionsToCompute: Int,
  taskLocalityPreferences: Seq[Seq[TaskLocation]] = Seq.empty): Unit

  • Basically, it creates a new TaskMetrics.
  • With the help of the RDD's SparkContext, it registers the internal accumulators.
  • It uses the same Spark RDD that was defined when the stage was created.
  • It sets latestInfo to a new StageInfo built from the stage's nextAttemptId, numPartitionsToCompute, and taskLocalityPreferences, and increments the nextAttemptId counter.
  • An important thing to note is that this method is used only when DAGScheduler submits missing tasks for a Spark stage.

Spark Executor

In Apache Spark, the distributed agent responsible for executing tasks is what we call the Spark executor.


 What is Spark Executor

  • Basically, executors in Spark are worker processes.
  • They are in charge of running the individual tasks of a given Spark job. We launch them at the start of a Spark application.
  • An executor then typically runs for the entire lifetime of the application.
  • As soon as executors have run a task, they send the results to the driver.
  • Executors also provide in-memory storage for Spark RDDs that are cached by user programs, through the Block Manager.
  • Running executors for the complete lifespan of the application implies static allocation of executors; however, we can also opt for dynamic allocation.
  • Moreover, with the help of the Heartbeat Sender Thread, an executor sends metrics and heartbeats to the driver.
  • One advantage is that we can have as many executors in Spark as data nodes, and as many cores as the cluster provides.
  • An Apache Spark executor is described by its id, hostname, environment (SparkEnv), and classpath.
  • An important point to note is that executor backends exclusively manage executors in Spark.
  • Executors have two jobs: they run the tasks that make up the application and return results to the driver, and they provide in-memory storage for RDDs that are cached by user programs.

When a SparkContext connects to a cluster manager, it acquires executors on the cluster's nodes. Executors are Spark processes that run computations and store data on the worker nodes. Finally, the SparkContext sends tasks to the executors for execution.
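As a hedged sketch (the property names are standard Spark configuration keys, but the values shown are only illustrative, not recommendations), executor resources and the static/dynamic allocation choice can be set when the application is built:

import org.apache.spark.sql.SparkSession

// Illustrative executor sizing; tune the values to your cluster.
val spark = SparkSession.builder()
  .appName("ExecutorSizingDemo")
  .config("spark.executor.instances", "4")            // static allocation: number of executors (YARN/Kubernetes)
  .config("spark.executor.cores", "2")                // cores per executor
  .config("spark.executor.memory", "2g")              // heap memory per executor
  .config("spark.dynamicAllocation.enabled", "false") // set to true to prefer dynamic allocation instead
  .getOrCreate()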

 

[Figure: HeartbeatReceiver's heartbeat message handler]

Conditions to Create Spark Executor

The conditions under which an executor is created in Spark are:

  • When CoarseGrainedExecutorBackend receives a RegisteredExecutor message (Spark Standalone and YARN only).
  • When Mesos's MesosExecutorBackend is registered with Spark.
  • When a LocalEndpoint is created for local mode.

Creating Spark Executor Instance

A Spark executor is created from the following:

  • The executor ID.
  • SparkEnv, through which we can access the local MetricsSystem and BlockManager, as well as the local serializer.
  • The executor's hostname.
  • A collection of user-defined JARs to add to the tasks' classpath (empty by default).
  • A flag indicating whether it runs in local or cluster mode (disabled by default, i.e. cluster mode is preferred).

 

When creation is successful, the following INFO message appears in the logs:
INFO Executor: Starting executor ID [executorId] on host [executorHostname]

Heartbeater — Heartbeat Sender Thread

Basically, the heartbeater is a daemon ScheduledThreadPoolExecutor with a single thread.
We call this thread pool the driver-heartbeater.
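A small sketch of the related settings (the values shown are the documented defaults, listed only for illustration): the heartbeat interval is configurable and should stay well below the network timeout.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HeartbeatDemo")
  .config("spark.executor.heartbeatInterval", "10s") // how often each executor heartbeats to the driver (default 10s)
  .config("spark.network.timeout", "120s")           // should be significantly larger than the heartbeat interval (default 120s)
  .getOrCreate()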

Launching Task — launchTask Method

Using this method, we execute the input serializedTask concurrently.

launchTask(
  context: ExecutorBackend,
  taskId: Long,
  attemptNumber: Int,
  taskName: String,
  serializedTask: ByteBuffer): Unit

[Figure: Launching tasks on an executor using TaskRunners]

  • Internally, launchTask creates a TaskRunner and registers it in the runningTasks internal registry under its taskId.
  • The TaskRunner is then executed on the “Executor task launch worker” thread pool.

“Executor Task Launch Worker” Thread Pool — threadPool Property

  • Tasks are launched on threadPool, a daemon cached thread pool whose threads are identified by a task launch worker id.
  • threadPool is created at the same time as the Spark executor and is shut down when the executor stops.

SPARK DAG

  • (Directed Acyclic Graph) DAG in Apache Spark is a set of Vertices and Edges, where vertices represent the RDDs and the edges represent the Operation to be applied on RDD.
  • In Spark DAG, every edge directs from earlier to later in the sequence.
  • When an action is called, the created DAG is submitted to the DAG Scheduler, which further splits the graph into stages of tasks.

[Figure: DAG visualization]

What is DAG in Apache Spark?

  • A DAG is a finite directed graph with no directed cycles.
  • There are finitely many vertices and edges, where each edge is directed from one vertex to another.
  • It contains a sequence of vertices such that every edge is directed from earlier to later in the sequence.
  • It is a strict generalization of the MapReduce model.
  • DAG operations allow better global optimization than systems like MapReduce.
  • The picture of the DAG becomes clearer in more complex jobs.
  • The Apache Spark DAG UI allows the user to dive into a stage and expand its details.
  • In the stage view, the details of all RDDs belonging to that stage are expanded.
  • The scheduler splits the Spark RDD operations into stages based on the transformations applied.
  • Each stage is comprised of tasks, based on the partitions of the RDD, which perform the same computation in parallel.

The graph here refers to navigation, and directed and acyclic refers to how it is done.

Explain Directed Acyclic Graph in Spark.
What is the function of Directed Acyclic Graph in Spark?

  • In mathematical terms, a Directed Acyclic Graph is a directed graph with no cycles.
  • The DAG is a graph that contains the set of all operations applied to an RDD.
  • When any action is called on an RDD, Spark creates the DAG and submits it to the DAG scheduler.
  • Only after the DAG is built does Spark create the query optimization plan.
  • The DAG scheduler divides operators into stages of tasks.
  • A stage is comprised of tasks based on partitions of the input data.
  • The DAG scheduler pipelines operators together.
  • Fault tolerance is achieved in Spark using the Directed Acyclic Graph.
  • Query optimization is possible in Spark through the use of the DAG, which gives better performance.

What are DAG and stages in Spark processing?

For any Spark program, the overall execution plan is determined by the DAG scheduler.

Execution of each operation is optimized per stage.

 

Need for a Directed Acyclic Graph in Spark

The limitations of Hadoop MapReduce were a key reason to introduce the DAG in Spark. Computation in MapReduce proceeds in three steps:

  • The data is read from HDFS.
  • Map and Reduce operations are applied.
  • The computed result is written back to HDFS.
  • Each MapReduce operation is independent of the others, and Hadoop has no idea which MapReduce job comes next.
  • Sometimes, for iterative work, it is unnecessary to read and write back the intermediate result between two map-reduce jobs.
  • In such cases, the memory in stable storage (HDFS) or on disk gets wasted.
  • In a multi-step job, every job is blocked from the beginning until the previous job completes.
  • As a result, complex computations can take a long time even with small data volumes.

While in Spark, a DAG (Directed Acyclic Graph) of consecutive computation stages is formed.

In this way, we optimize the execution plan, e.g. to minimize shuffling data around. In contrast, it is done manually in MapReduce by tuning each MapReduce step.

How does the DAG work in Spark?

  • The interpreter is the first layer: using a Scala interpreter, Spark interprets the code with some modifications.
  • Spark creates an operator graph when you enter your code in the Spark console.
  • When we call an action on a Spark RDD, Spark submits the operator graph to the DAG Scheduler.
  • The DAG Scheduler divides the operators into stages of tasks.
  • A stage contains tasks based on the partitions of the input data.
  • The DAG scheduler pipelines operators together. For example, map operators are scheduled in a single stage.
  • The stages are passed on to the Task Scheduler.
  • It launches tasks through the cluster manager.
  • The dependencies between stages are unknown to the task scheduler.
  • The Workers execute the tasks on the worker nodes.

The image below briefly describes how the DAG works during Spark job execution.

[Figure: Internals of job execution in Apache Spark]

At a higher level, we can apply two types of RDD transformations:

  • narrow transformations (e.g. map(), filter()), and
  • wide transformations (e.g. reduceByKey()).
  • Narrow transformations do not require shuffling data across partitions, so consecutive narrow transformations are grouped into a single stage, while wide transformations shuffle data.
  • Hence, wide transformations result in stage boundaries.
  • Each RDD maintains a pointer to one or more parents, along with metadata about the type of relationship it has with each parent.
  • For example, if we call val b = a.map() on an RDD, RDD b keeps a reference to its parent RDD a; that is the RDD lineage (see the sketch after this list).
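A minimal sketch of the point above, using hypothetical RDDs a, b, and c: the narrow map() and filter() are pipelined into one stage, while the wide reduceByKey() introduces a shuffle and therefore a stage boundary.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("NarrowWideDemo").master("local[2]").getOrCreate()
val sc = spark.sparkContext

val a = sc.parallelize(1 to 100, 4)
val b = a.map(x => (x % 10, x))  // narrow: b keeps a reference to its parent a (RDD lineage)
  .filter(_._2 > 5)              // narrow: pipelined into the same stage as the map
val c = b.reduceByKey(_ + _)     // wide: shuffles data, so a new stage begins here

c.count()                        // the action triggers the two-stage job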

 

How does the DAG function in Spark?

  • When an action is called on a Spark RDD at a high level, Spark submits the lineage graph to the DAG Scheduler.
  • Operations are divided into stages of tasks in the DAG Scheduler.
  • A stage contains tasks based on the partitions of the input data.
  • The DAG scheduler pipelines operators together.
  • It dispatches tasks through the cluster manager.
  • The dependencies between stages are unknown to the task scheduler.
  • The Workers execute the tasks on the worker nodes.
  • A Directed Acyclic Graph (DAG) is a graph data structure whose edges are directional and which has no loops or cycles.
  • People use DAGs all the time; take the example of getting ready for the office.

DAG is a way of representing dependencies between objects.

It is widely used in computing. The examples where it is used in computing are:

  1. Build tools such as Apache Ant, Apache Maven, make, sbt
  2. Tasks Dependencies in project management – Microsoft Project
  3. The data model of Git

How to attain fault tolerance in Spark?
Is Apache Spark fault tolerant? if yes, how?
The basic semantics of fault tolerance in Apache Spark is that all Spark RDDs are immutable. Spark remembers the dependencies between every RDD involved in the operations through the lineage graph created in the DAG, and in the event of any failure, Spark refers to the lineage graph to re-apply the same operations and complete the tasks.

There are two types of failures – worker or driver failure. If a worker fails, the executors on that worker node are killed, along with the data in their memory. Using the lineage graph, those tasks are completed on other worker nodes. The data is also replicated to other worker nodes to achieve fault tolerance. There are two cases:

1. Data received and replicated – Data is received from the source and replicated across worker nodes. In the case of any failure, the data replication helps achieve fault tolerance.

2. Data received but not yet replicated – Data is received from the source but buffered for replication. In the case of any failure, the data needs to be retrieved again from the source.

For stream inputs based on receivers, the fault tolerance is based on the type of receiver:

  • Reliable receiver – Once the data is received and replicated, an acknowledgment is sent to the source. If the receiver fails, the source does not receive an acknowledgment for the data it sent; when the receiver is restarted, the source resends the data, so fault tolerance is achieved.
  • Unreliable receiver – The received data is not acknowledged to the source. If a failure occurs, the source does not know whether the data has been received, and it will not resend it, so there is data loss.

To overcome this data-loss scenario, Write Ahead Logging (WAL) was introduced in Apache Spark 1.2. With WAL enabled, the intention of the operation is first noted down in a log file, so that if the driver fails and is restarted, the noted operations in that log file can be re-applied to the data. For sources that read streaming data, like Kafka or Flume, receivers receive the data, which is stored in the executor's memory. With WAL enabled, this received data is also stored in the log files.

WAL can be enabled as follows (see the sketch below):

  • Set the checkpoint directory, using streamingContext.checkpoint(path).
  • Enable WAL logging, by setting spark.streaming.receiver.writeAheadLog.enable to true.
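A minimal sketch of both steps (the checkpoint path and batch interval are illustrative placeholders, not values from the text):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("WalDemo")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true") // enable the write-ahead log for receiver-based sources

val ssc = new StreamingContext(conf, Seconds(10))   // 10-second batch interval (illustrative)
ssc.checkpoint("hdfs:///tmp/spark-checkpoints")     // checkpoint directory (illustrative path)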

How to Achieve Fault Tolerance through DAG?

  • An RDD is split into partitions, and each node operates on a partition at any point in time. A series of Scala functions executes on a partition of the RDD.
  • These operations compose together, and the Spark execution engine views them as a DAG (Directed Acyclic Graph).
  • Suppose a node crashes in the middle of an operation, say O3, which depends on operation O2, which in turn depends on O1.
  • The cluster manager finds out that the node is dead and assigns another node to continue processing.
  • This node operates on the particular partition of the RDD and re-executes the series of operations it has to run (O1 -> O2 -> O3), so there is no data loss.

Working of the DAG Optimizer in Spark

  • We optimize the DAG in Apache Spark by rearranging and combining operators wherever possible.
  • For example, suppose we submit a Spark job which has a map() operation followed by a filter() operation.
  • The DAG Optimizer can rearrange the order of these operators, since filtering reduces the number of records that have to undergo the map operation.

Advantages of DAG in Spark

There are multiple advantages of the Spark DAG; let's discuss them one by one:

  • A lost RDD can be recovered using the Directed Acyclic Graph.
  • MapReduce has just two steps, map and reduce, whereas a DAG has multiple levels, so the DAG is more flexible for executing SQL queries.
  • The DAG helps to achieve fault tolerance, so we can recover lost data.
  • It allows better global optimization than a system like Hadoop MapReduce.
  • The DAG in Apache Spark is an alternative to the MapReduce model.
  • It is a programming style used in distributed systems.

What is lineage in Spark? How is fault tolerance accomplished in Spark using the Lineage Graph?

  • Whenever a series of transformations are performed on an RDD, they are not evaluated immediately, but lazily.
  • When a new RDD is created from an existing RDD, all the dependencies between the RDDs are logged in a graph.
  • This graph is known as the lineage graph.
  • Consider the scenario below; the lineage graph of these operations looks like:
  • First RDD
  • Second RDD (applying map)
  • Third RDD (applying filter)
  • Fourth step (applying count, which is an action)
  • This lineage graph will be helpful in case any partition of the data is lost.
  • Set spark.logLineage to true to have the RDD lineage graph (rdd.toDebugString) printed in the logs (see the sketch after this list).
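A small sketch of the scenario above (the RDD names are illustrative); toDebugString prints the lineage graph of the final RDD:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("LineageDemo").master("local[2]").getOrCreate()
val sc = spark.sparkContext

val first  = sc.parallelize(1 to 10)  // first RDD
val second = first.map(_ * 2)         // second RDD (applying map)
val third  = second.filter(_ > 4)     // third RDD (applying filter)

println(third.toDebugString)          // prints the lineage (dependency) graph of third
println(third.count())                // count is an action; it triggers the job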

SPARK CLUSTER MANAGER

[Figure: Apache Spark cluster managers – YARN, Mesos, Standalone]

[Figure: Apache Spark compatibility with Hadoop]

 

Explain the different cluster managers in Apache Spark

The 3 different cluster managers supported in Apache Spark are:

  • YARN: responsible for resource management in Hadoop.
  • Apache Mesos –
    • Has rich resource-scheduling capabilities and is well suited to run Spark alongside other applications.
    • It is advantageous when several users run interactive shells because it scales down the CPU allocation between commands.
    • A generalized, commonly used cluster manager that also runs Hadoop MapReduce and other applications.
  • Standalone deployments –
    • A basic manager for setting up a cluster.
    • Well suited for new deployments which only run Spark and are easy to set up.

On which platforms can Apache Spark run?

Spark can run on the following platforms:

  • YARN (Hadoop):
    • Since YARN can handle any kind of workload, Spark can run on YARN.
    • There are two modes of execution:
    • one in which the Spark driver is executed inside a container on a node, and
    • a second in which the Spark driver is executed on the client machine.
    • This is the most common way of using Spark.
  • Apache Mesos:
    • Mesos is an open-source resource manager.
    • Spark can run on Mesos.
  • EC2:
    • If you do not want to manage the hardware yourself, you can run Spark on top of Amazon EC2.
    • This makes Spark suitable for various organizations.
  • Standalone:
    • If you have no resource manager installed in your organization, you can use standalone mode.
    • Basically, Spark provides its own resource manager.
    • All you have to do is install Spark on every node of the cluster, inform each node about all the other nodes, and start the cluster.
    • The nodes then start communicating with each other and running jobs.

 

YARN-client and YARN-cluster (suited to a master-slave architecture)
Mesos (suited to a master-master architecture; container orchestration)
Kubernetes (container orchestration)

Different Running Modes of Apache Spark
Apache Spark can be run in the following three modes:

  •  Local mode
  •  Standalone mode
  •  Cluster mode

We can launch a Spark application in four modes:

1) Local Mode (local[*], local, local[2], etc.)
-> When you launch spark-shell without a master/configuration argument, it launches in local mode
spark-shell --master local[1]
-> spark-submit --class com.df.SparkWordCount SparkWC.jar local[1]

2) Spark Standalone cluster manager (client/cluster mode):
-> spark-shell --master spark://hduser:7077
-> spark-submit --class com.df.SparkWordCount SparkWC.jar spark://hduser:7077

3) YARN mode (client/cluster mode):
-> spark-shell --master yarn
(or)
-> spark-shell --master yarn --deploy-mode client

Both of the above commands are the same.
To launch a Spark application in cluster mode, we have to use the spark-submit command. We cannot run yarn-cluster mode via spark-shell, because when we run a Spark application in cluster mode the driver program runs as part of the application master container/process. So it is not possible to run cluster mode via spark-shell.
-> spark-submit --class com.df.SparkWordCount SparkWC.jar yarn-client
-> spark-submit --class com.df.SparkWordCount SparkWC.jar yarn-cluster

4) Mesos mode:
-> spark-shell --master mesos://HOST:5050

What is Standalone mode?

  • In standalone mode, Spark uses a Master daemon which coordinates the efforts of the Workers, which run the executors.
  • Standalone mode is the default, but it cannot be used on secure clusters.
  • When you submit an application, you can choose how much memory its executors will use, as well as the total number of cores across all executors.

What are client mode and cluster mode?

  • Each application has a driver process which coordinates its execution.
  • This process can run in the foreground (client mode) or in the background (cluster mode).
  • Client mode is a little simpler, but cluster mode allows you to easily log out after starting a Spark application without terminating the application.

How to run spark in Standalone client mode?

spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --deploy-mode client \
  --master spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT \
  $SPARK_HOME/examples/lib/spark-examples_version.jar 10

How to run spark in Standalone cluster mode?

spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --deploy-mode cluster \
  --master spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT \
  $SPARK_HOME/examples/lib/spark-examples_version.jar 10

How to run spark in YARN client mode?

spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --deploy-mode client \
  --master yarn \
  $SPARK_HOME/examples/lib/spark-examples_version.jar 10

How to run spark in YARN cluster mode?

spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --deploy-mode cluster \
  --master yarn \
  $SPARK_HOME/examples/lib/spark-examples_version.jar 10

What is Yarn?

 

  • Similar to Hadoop, YARN is one of the key features in Spark, providing a central resource-management platform to deliver scalable operations across the cluster.
  • Running Spark on YARN requires a binary distribution of Spark that is built with YARN support.
  • YARN is a distributed container manager, like Mesos, whereas Spark is a data-processing tool.
  • Spark can run on YARN the same way Hadoop MapReduce can run on YARN.
  • YARN is mainly concerned with resource management, but it is also used to operate across Spark clusters, since it is very scalable.

What is YARN mode?

  • In YARN mode, the YARN ResourceManager performs the functions of the Spark Master.
  • The functions of the Workers are performed by the YARN NodeManager daemons, which run the executors.
  • YARN mode is slightly more complex to set up, but it supports security.

What are the various modes in which Spark runs on YARN? (Local vs Client vs Cluster Mode)

Apache Spark has two basic parts:

  1. Spark Driver: Which controls what to execute where
  2. Executor: Which actually executes the logic

While running Spark on YARN, the executors always run inside containers, but the driver can run either on the machine the user is using or inside one of the containers. The first option is known as YARN client mode, the second as cluster mode. The following diagrams should give you a good idea:

YARN client mode: The driver runs on the machine from which the client is connected.

[Figure: Spark YARN client mode]

YARN cluster mode: The driver runs inside the cluster.

[Figure: Spark YARN cluster mode]

Local mode: This is for when you do not want to use a cluster and instead want to run everything on a single machine. The driver application and the Spark application are both on the same machine as the user.

 

Do you have to install Spark on all nodes of a YARN cluster?

  • No, because Spark runs on top of YARN.
  • Spark runs independently of where it is installed.
  • Spark has a few options to use YARN when dispatching jobs to the cluster, rather than its own built-in manager or Mesos.
  • Further, there are a few configurations for running on YARN.
  • They include master, deploy-mode, driver-memory, executor-memory, executor-cores, and queue.

Is it possible to run Apache Spark on Apache Mesos?

Yes, Apache Spark can be run on the hardware clusters managed by Mesos.

Is it possible to run Spark and Mesos along with Hadoop?

  • Yes, it is possible to run Spark and Mesos with Hadoop by launching each of these as a separate service on the machines.
  • Mesos acts as a unified scheduler that assigns tasks to either Spark or Hadoop.

When running Spark applications, is it necessary to install Spark on all the nodes of YARN cluster?

Spark need not be installed when running a job under YARN or Mesos because Spark can execute on top of YARN or Mesos clusters without affecting any change to the cluster.

There are several configurations for running Spark on YARN, such as master, deploy-mode, driver-memory, etc.

Since Spark runs on top of YARN, it utilizes YARN for the execution of its commands over the cluster's nodes.
So, you just have to install Spark on one node.

 

What are the benefits of using Spark with Apache Mesos?

  • It renders scalable partitioning among various Spark instances and
  • dynamic partitioning between Spark and other big data frameworks.

How can Spark be connected to Apache Mesos?

To connect Spark with Mesos, the step-by-step procedure is (see the sketch after this list):

  1. Configure the Spark driver program to connect to Apache Mesos.
  2. Put the Spark binary package in a location accessible by Mesos.
  3. Alternatively, install Apache Spark in the same location as Apache Mesos.
  4. Configure the spark.mesos.executor.home property to point to the location where Apache Spark is installed.
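A hedged sketch of these steps in code (the master URL, package location, and install path are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MesosDemo")
  .master("mesos://host:5050")                                  // Mesos master URL (placeholder)
  .config("spark.executor.uri", "hdfs:///dist/spark-x.y.z.tgz") // Spark binary package in a location Mesos can reach
  .config("spark.mesos.executor.home", "/opt/spark")            // where Spark is installed on the Mesos agents
  .getOrCreate()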

 

How can we launch a Spark application on YARN?
Explain the technique to launch Apache Spark over Hadoop YARN.

  • Apache Spark has two modes of running applications on YARN: cluster and client.
  • Use spark-submit or spark-shell with --master yarn-cluster or --master yarn-client.

 To run Spark Applications, should we install Spark on all the nodes of a Mesos cluster?

Spark programs can be executed on top of Mesos. So, there is no need to install Spark on the nodes of a Mesos cluster, to run spark applications.

 

What is the Spark UI and how do you monitor a Spark job?

  • Jobs – to view all the Spark jobs
  • Stages – to check the DAGs and stages in Spark
  • Storage – to check all the cached RDDs
  • Streaming – to check the statistics of streaming batches
  • Spark history server – to check the logs of finished Spark jobs (event logging must be enabled, as in the sketch below)
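As a small sketch (the log directory is an illustrative path), the history server can only show a finished job if the application wrote event logs:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HistoryDemo")
  .config("spark.eventLog.enabled", "true")             // write event logs so finished jobs appear in the history server
  .config("spark.eventLog.dir", "hdfs:///spark-events") // must match the directory the history server reads from
  .getOrCreate()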

How do you establish a connection between Apache Spark and Apache Mesos?
A connection to Mesos can be established in two ways.

  • The Spark driver program has to be configured to establish a connection to Mesos.
  • Also, the location of the Spark binaries should be accessible to Apache Mesos.
    The other way to connect to Mesos is to install Spark in the same location as Mesos and configure the spark.mesos.executor.home property to point to the location where Spark is installed.

On which platforms can Spark run?
Apart from its own standalone cluster mode, Spark can run on:

  • Hadoop YARN
  • EC2
  • Mesos
  • Kubernetes