Explain in brief the Apache Spark architecture and its components.
At a high level, the Spark architecture can be viewed as the following stack of layers:
|5) Interactive Shells or Job Submission Layer|
|4) API Binding: Python, Java, Scala, R, SQL|
|3) Libraries: MLlib, GraphX, Spark Streaming|
|2) Spark Core (RDD & Operations on it)|
|1) Spark Driver -> Executor|
|0) Scheduler or Resource Manager|
0) Scheduler or Resource Manager:
- At the bottom is the resource manager.
- This resource manager can be external, such as YARN or Mesos.
- Or it can be internal, if Spark is running in standalone mode.
- The role of this layer is to provide an environment in which the program can run in a distributed manner.
- For example, YARN (Yet Another Resource Negotiator) launches an application master and the containers in which the executors run.
1) Spark Driver -> Executor:
- One level above the scheduler is the Spark code that talks to the scheduler to get work executed.
- This layer does the real work of execution.
- The Spark driver is part of this layer. On YARN in cluster deploy mode it runs inside the application master; in client deploy mode it runs in the submitting process.
- The Spark driver dictates what to execute, and the executors execute the logic.
2) Spark Core (RDD & Operations on it):
- Spark Core is the layer that provides most of the basic functionality.
- This layer provides abstractions such as the RDD and the execution of transformations and actions on it.
3) Libraries: MLlib, GraphX, Spark Streaming, DataFrames:
- Additional domain-specific functionality on top of Spark Core is provided by libraries such as MLlib, Spark Streaming, GraphX, and DataFrames or Spark SQL.
4) API Bindings:
- The language bindings (Python, Java, Scala, R, SQL) internally call the same API from different languages.
5) Interactive Shells or Job Submission Layer:
- The job submission APIs provide a way to submit bundled code.
- This layer also provides interactive shells (PySpark, SparkR, etc.), also called REPLs (Read-Eval-Print Loops), to process data interactively.
Explain the core components of a distributed Spark application.
- An Apache Spark application consists of two kinds of programs: a driver program and worker programs.
- A cluster manager sits in between and mediates between the driver and the worker nodes.
- The SparkContext keeps in touch with the worker nodes with the help of the cluster manager.
- The SparkContext acts like a master and the Spark workers act like slaves.
- Workers contain the executors that run the job. If any dependencies or arguments have to be passed, the SparkContext takes care of that.
- RDDs reside on the Spark executors.
- You can also run Spark applications locally using threads. If you want to take advantage of a distributed environment, you can use S3, HDFS, or any other storage system.
- Driver: the process that runs the main() method of the program, creates RDDs, and performs transformations and actions on them.
- The Spark driver is the process that runs the SparkContext.
- It is in charge of converting the application into a directed graph of individual steps to execute on the cluster.
- There is one driver for each application.
- Executor: the worker processes that run the individual tasks of a Spark job.
- Cluster Manager: a pluggable component in Spark that is used to launch executors and drivers.
- The cluster manager allows Spark to run on top of external managers such as Apache Mesos or YARN.
Explain the common workflow of a Spark program.
- The first step in a Spark program is to create input RDDs from external data.
- Use RDD transformations such as filter() to create new, transformed RDDs based on the business logic.
- persist() any intermediate RDDs that may have to be reused later.
- Launch RDD actions such as first() and count() to begin the parallel computation, which Spark then optimizes and executes.
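The four steps above can be sketched in PySpark roughly as follows. This is a minimal sketch, not a definitive implementation: the names `is_error` and `run_workflow` and the idea of filtering log lines are assumptions made for illustration. The function takes an already-created SparkContext, so the block itself does not need to import PySpark.

```python
from typing import Any, Tuple


def is_error(line: str) -> bool:
    """Hypothetical business-logic predicate: keep only error lines."""
    return "ERROR" in line


def run_workflow(sc: Any, path: str) -> Tuple[int, str]:
    """Run the four workflow steps on an existing SparkContext `sc`."""
    # 1. Create an input RDD from external data (one element per line).
    lines = sc.textFile(path)

    # 2. Transformation: lazy, it only records how to derive the new RDD.
    errors = lines.filter(is_error)

    # 3. Persist the intermediate RDD because two actions below reuse it.
    errors.persist()

    # 4. Actions trigger the actual parallel computation.
    total = errors.count()
    first_error = errors.first() if total > 0 else ""
    return total, first_error


# Usage, assuming PySpark is installed and "app.log" (a placeholder) exists:
#   from pyspark import SparkContext
#   sc = SparkContext("local[2]", "workflow-demo")
#   print(run_workflow(sc, "app.log"))
#   sc.stop()
```

Nothing is computed until count() is called; persist() only pays off because first() afterwards reuses the same filtered RDD.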
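How do you submit a Spark job?
- spark-submit --class org.apache.spark.examples.ClassJobName --master yarn --deploy-mode client --driver-memory 4g --num-executors 2 --executor-memory 2g --executor-cores 10 <application-jar>
- In the above sample:
- --class is the main class of the application.
- --master specifies the cluster manager to connect to (here, YARN).
- --deploy-mode client runs the driver on the machine from which the job is submitted.
- --driver-memory is the amount of memory allocated to the driver.
- --executor-memory is the amount of memory allocated to each executor.
- --num-executors is the total number of executors to launch on the worker nodes.
- --executor-cores is the number of cores, and therefore the number of concurrent tasks, each executor can use.
- <application-jar> stands for the path to the bundled application JAR, which spark-submit expects after the options.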
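When jobs are submitted from a script, the same command can be assembled as an argument list. The helper below is a hedged sketch: `build_spark_submit` is a hypothetical name, and `app.jar` is a placeholder for the application JAR, which the sample above does not name. Only the flags already shown in the sample are used.

```python
import shlex
from typing import List


def build_spark_submit(
    main_class: str,
    app_jar: str,
    master: str = "yarn",
    deploy_mode: str = "client",
    driver_memory: str = "4g",
    num_executors: int = 2,
    executor_memory: str = "2g",
    executor_cores: int = 10,
) -> List[str]:
    """Return the spark-submit argument vector; options precede the JAR."""
    return [
        "spark-submit",
        "--class", main_class,
        "--master", master,
        "--deploy-mode", deploy_mode,
        "--driver-memory", driver_memory,
        "--num-executors", str(num_executors),
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        app_jar,  # placeholder path, not taken from the original sample
    ]


cmd = build_spark_submit("org.apache.spark.examples.ClassJobName", "app.jar")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where spark-submit is on the PATH.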
What are the steps that occur when you run a Spark application on a cluster?
- The user submits an application using spark-submit.
- spark-submit launches the driver program and invokes the main() method specified by the user.
- The driver program contacts the cluster manager to ask for resources to launch executors.
- The cluster manager launches executors on behalf of the driver program.
- The driver process runs through the user application. Based on the RDD actions and transformations in the program, the driver sends work to executors in the form of tasks.
- Tasks are run on executor processes to compute and save results.
- If the driver's main() method exits or it calls SparkContext.stop(), it terminates the executors and releases resources from the cluster manager.
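The task flow in these steps can be imitated with a toy model in plain Python. This is not Spark code: a thread pool stands in for the executors, and the functions `split_into_partitions`, `task`, and `run_application` are hypothetical names chosen for illustration. It only shows the shape of the interaction, where the driver cuts the job into tasks, executors run them, and the resources are released at the end.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_into_partitions(data: List[int], n: int) -> List[List[int]]:
    """Driver side: slice the input into n interleaved partitions."""
    return [data[i::n] for i in range(n)]


def task(partition: List[int]) -> int:
    """Executor side: one task computes a partial result for one partition."""
    return sum(x * x for x in partition)


def run_application(data: List[int], num_executors: int = 2) -> int:
    # Stand-in for the cluster manager launching executors for the driver.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        # The driver turns the job into one task per partition and sends
        # the tasks to the executors.
        partitions = split_into_partitions(data, num_executors * 2)
        partial_results = list(executors.map(task, partitions))
    # Leaving the with-block plays the role of SparkContext.stop():
    # the executors are shut down and their resources released.
    return sum(partial_results)


print(run_application(list(range(10))))
```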
What are the roles and responsibilities of worker nodes in an Apache Spark cluster?
Is a worker node in Spark the same as a slave node?
- A worker node is a node that runs the application code in the cluster.
- Yes, a worker node is the same as a slave node.
- The master node assigns work, and the worker nodes actually perform the assigned tasks.
- Worker nodes process the data stored on the node and report their available resources to the master.
- Based on resource availability, the master schedules tasks.
- Apache Spark follows a master/slave architecture, with one master (driver) process and one or more slave (worker) processes.
- The master is the driver that runs the main() program in which the SparkContext is created.
- It then interacts with the cluster manager to schedule the job execution and perform the tasks.
- A worker consists of processes that can run in parallel to perform the tasks scheduled by the driver program.
- These processes are called executors.
- Whenever a client runs the application code, the driver program instantiates the SparkContext and converts the transformations and actions into a logical DAG of execution.
- This logical DAG is then converted into a physical execution plan, which is broken down into smaller physical execution units (stages, which are in turn split into tasks).
- The driver then interacts with the cluster manager to negotiate the resources required to perform the tasks of the application code.
- The cluster manager then interacts with each of the worker nodes to determine how many executors are running on each of them.
- The role of worker nodes/executors:
- Perform the data processing for the application code.
- Read data from and write data to external sources.
- Store the computation results in memory or on disk.
- By default, the executors run throughout the lifetime of the Spark application.
- This is static allocation of executors.
- Alternatively, the number of executors can be scaled up and down while the application runs, depending on the workload.
- This is dynamic allocation of executors.
- Before tasks are executed, the executors register with the driver program through the cluster manager, so that the driver knows how many executors are available to perform the scheduled tasks.
- The executors then start executing the tasks scheduled by the driver program.
- Whenever a worker node fails, the tasks it was supposed to perform are automatically reassigned to other worker nodes.
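The difference between static and dynamic allocation usually comes down to a handful of configuration properties. The sketch below lists the standard Spark property names for both modes; the numeric values are illustrative assumptions, not recommendations, and `to_conf_flags` is a hypothetical helper that turns a property map into `--conf` arguments for spark-submit. Dynamic allocation is commonly paired with the external shuffle service, which is why that property appears here.

```python
from typing import Dict, List

# Static allocation: a fixed number of executors for the whole application.
STATIC_ALLOCATION: Dict[str, str] = {
    "spark.executor.instances": "2",
}

# Dynamic allocation: Spark adds and removes executors with the workload.
# The bounds below are example values only.
DYNAMIC_ALLOCATION: Dict[str, str] = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "10",
    "spark.shuffle.service.enabled": "true",
}


def to_conf_flags(conf: Dict[str, str]) -> List[str]:
    """Render properties as spark-submit `--conf key=value` arguments."""
    flags: List[str] = []
    for key, value in sorted(conf.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags


print(to_conf_flags(DYNAMIC_ALLOCATION))
```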