SPARK-1

SPARK vs HADOOP

SPARK ARCHITECTURE

SPARK ECOSYSTEM

SPARK COMPONENT

SPARK CORE

SPARK DRIVER

SPARK CONTEXT

SparkSession

——————————————————————————————————————————————————————————

SPARK vs HADOOP

1. Compare Spark vs Hadoop MapReduce

 


Hadoop MapReduce vs Apache Spark

  • Scalability: Hadoop MapReduce scales out to a large number of nodes; Spark is also highly scalable (Spark clusters of up to about 8,000 nodes have been run).
  • Memory: MapReduce does not leverage the memory of the Hadoop cluster to the maximum; Spark keeps data in memory through RDDs.
  • Disk usage: MapReduce is disk oriented; Spark caches data in-memory and ensures low latency.
  • Processing: MapReduce supports only batch processing; Spark also supports real-time processing through Spark Streaming.
  • Installation: MapReduce is bound to Hadoop; Spark is not bound to Hadoop.
  • Streaming engine: MapReduce has none; Spark Streaming processes streams as micro-batches.
  • Data flow: MapReduce chains Map and Reduce stages; Spark builds a Directed Acyclic Graph (DAG).
  • Computation model: MapReduce uses a batch-oriented model; Spark uses a collect-and-process model.
  • Performance: MapReduce is slow due to batch processing; Spark is fast, up to about 100 times faster for in-memory workloads.
  • Fault tolerance: MapReduce is highly fault tolerant through re-execution of Map/Reduce tasks; Spark offers recovery without extra code by recomputing lost partitions from lineage.
  • Interactivity: Other than Pig and Hive, MapReduce has no interactive mode; Spark has interactive shells.
  • Difficulty: MapReduce is tough to learn; Spark's high-level modules make it easier.
  • Data caching: MapReduce caches on hard disk; Spark caches in-memory.
  • Iterative jobs: MapReduce is average; Spark is excellent.
  • Independent of Hadoop: MapReduce, no; Spark, yes.
  • Machine learning applications: MapReduce is average; Spark (with MLlib) is excellent.

Simplicity, Flexibility and Performance are the major advantages of using Spark over Hadoop.

  • Spark can be up to 100 times faster than Hadoop for big data processing because it stores data in-memory, placing it in Resilient Distributed Datasets (RDDs).
  • Spark is easier to program as it comes with an interactive mode.
  • It provides complete recovery using lineage graph whenever something goes wrong.

 

 

 

SPARK ARCHITECTURE

.Explain in brief the architecture of Spark.

. Explain the Apache Spark Architecture. What are the parts of the Spark architecture?


At the architecture level, from a macro perspective, Spark might look like this:

Spark Architecture

5) Interactive Shells or Job Submission Layer
4) API Binding: Python, Java, Scala, R, SQL
3) Libraries: MLLib, GraphX, Spark Streaming
2) Spark Core (RDD & Operations on it)
1) Spark Driver -> Executor
0) Scheduler or Resource Manager

0) Scheduler or Resource Manager:

At the bottom is the resource manager.

  • This resource manager could be external, such as YARN or Mesos.
  • Or it could be internal, if Spark is running in standalone mode.
  • The role of this layer is to provide a playground in which the program can run distributively.
  • For example, YARN (Yet Another Resource Negotiator) would create an application master and executors for any submitted application.

1) Spark Driver -> Executor:

  • One level above the scheduler is the actual Spark code that talks to the scheduler to get the application executed.
  • This piece of code does the real work of execution.
  • The Spark Driver, which would run inside the application master, is part of this layer.
  • The Spark Driver dictates what to execute and the executors execute the logic.

2) Spark Core (RDD & Operations on it):

  • Spark Core is the layer which provides the maximum functionality.
  • This layer provides abstract concepts such as the RDD and the execution of transformations and actions.

3) Libraries: MLlib, GraphX, Spark Streaming, DataFrames:

Additional domain-specific functionality on top of Spark Core is provided by various libraries such as MLlib, Spark Streaming, GraphX, and DataFrames/Spark SQL.

4) API Bindings: the Python, Java, Scala, R and SQL bindings internally call the same core API from different languages.

5) Interactive Shells or Job Submission Layer:

  • The job submission APIs provide a way to submit bundled code.
  • It also provides interactive programs (PySpark, SparkR etc.) that are also called REPL or Read-Eval-Print-Loop to process data interactively.


.Explain about the core components of a distributed Spark application.

  • An Apache Spark application contains two programs, namely a

    • Driver program and

    • Workers program.

  • A cluster manager sits in between to interact with these two parts of the cluster.

  • Spark Context will keep in touch with the worker nodes with the help of Cluster Manager.

  • Spark Context is like a master and Spark workers are like slaves.

  • Workers contain the executors to run the job. If any dependencies or arguments have to be passed then Spark Context will take care of that.

  • RDDs will reside on the Spark Executors.

  • You can also run Spark applications locally using a thread, and if you want to take advantage of distributed environments you can take the help of S3, HDFS or any other storage system.

  • Driver – The process that
    • runs the main() method of the program,
    • creates RDDs, and
    • performs transformations and actions on them.
    • The Spark driver is the process running the SparkContext.
    • It is in charge of converting the application into a directed graph of individual steps to execute on the cluster.
    • There is one driver for each application.
  • Executor –The worker processes that run the individual tasks of a Spark job.
  • Cluster Manager-A pluggable component in Spark,
    • to launch Executors and Drivers.
    • The cluster manager allows Spark to run on top of other external managers like Apache Mesos or YARN.

.Explain About The Common Workflow Of A Spark Program

  • The foremost step in a Spark program involves creating input RDDs from external data.
  • Use various RDD transformations like filter() to create new transformed RDDs based on the business logic.
  • persist() any intermediate RDDs which might have to be reused in future.
  • Launch various RDD actions like first() and count() to begin parallel computation, which will then be optimized and executed by Spark. (A minimal sketch of this workflow follows below.)
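
A minimal sketch of that workflow in Java; the local master and the input path are illustrative assumptions, not part of the original notes:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.storage.StorageLevel;

    public class WorkflowSketch {
        public static void main(String[] args) {
            // Local master and input path are assumptions for illustration only.
            SparkConf conf = new SparkConf().setAppName("WorkflowSketch").setMaster("local[2]");
            JavaSparkContext jsc = new JavaSparkContext(conf);

            // 1) Create an input RDD from external data.
            JavaRDD<String> lines = jsc.textFile("data/app.log");

            // 2) Transform: keep only the lines relevant to the business logic.
            JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR"));

            // 3) Persist an intermediate RDD that will be reused.
            errors.persist(StorageLevel.MEMORY_ONLY());

            // 4) Actions trigger the actual parallel computation.
            long errorCount = errors.count();
            System.out.println("Number of ERROR lines: " + errorCount);

            jsc.stop();
        }
    }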

How to submit a Spark job?

Use spark-submit and follow a command of the following form (the application jar and its arguments come after the options):
  • spark-submit --class org.apache.spark.examples.ClassJobName --master yarn --deploy-mode client --driver-memory 4g --num-executors 2 --executor-memory 2g --executor-cores 10 <application-jar>
  • In the above sample:
    • --master selects the cluster manager
    • --driver-memory is the memory allocated to the driver
    • --executor-memory is the memory allocated to each executor
    • --num-executors is the total number of executors running on the worker nodes
    • --executor-cores is the number of cores each executor can use, i.e. how many tasks it can run concurrently

.What are the steps that occur when you run a Spark application on a cluster?

 

The user submits an application using spark-submit.

  • Spark-submit launches the driver program and invokes the main() method specified by the user.
  • The driver program contacts the cluster manager to ask for resources to launch executors.
  • The cluster manager launches executors on behalf of the driver program.
  • The driver process runs through the user application. Based on the RDD actions and transformations in the program, the driver sends work to executors in the form of tasks.
  • Tasks are run on executor processes to compute and save results.
  • If the driver’s main() method exits or it calls SparkContext.stop(), it will terminate the executors and release resources from the cluster manager.


What are the roles and responsibilities of worker nodes in the apache spark cluster?

Is the Worker Node in Spark the same as the Slave Node?

  • Worker node refers to a node which runs the application code in the cluster.
  • The Worker Node is the Slave Node.
  • The master node assigns work and the worker nodes actually perform the assigned tasks.
  • Worker nodes process the data stored on the node,
  • and they report their resources to the master.
  • Based on resource availability, the master schedules tasks.
  • Apache Spark follows a master/slave architecture, with one master or driver process and more than one slave or worker processes.
    • The master is the driver that runs the main() program where the Spark context is created.
    • It then interacts with the cluster manager to schedule the job execution and perform the tasks.
    • The worker consists of processes that can run in parallel to perform the tasks scheduled by the driver program.
    • These processes are called executors.
  • Whenever a client runs the application code, the driver program instantiates the SparkContext and converts the transformations and actions into a logical DAG of execution.
    • This logical DAG is then converted into a physical execution plan, which is then broken down into smaller physical execution units.
  • The driver then interacts with the cluster manager to negotiate the resources required to perform the tasks of the application code.
  • The cluster manager then interacts with each of the worker nodes to understand the number of executors running in each of them.
  • The role of worker nodes/executors:
    • Perform the data processing for the application code
    • Read from and write the data to the external sources
    • Store the computation results in memory or on disk.

 

  • The executors can run throughout the lifetime of the Spark application; this is a static allocation of executors.
  • The user can also decide how many executors are required to run the tasks, depending on the workload; this is a dynamic allocation of executors.
  • Before the execution of tasks, the executors are registered with the driver program through the cluster manager, so that the driver knows how many executors are running to perform the scheduled tasks.
  • The executors then start executing the tasks scheduled by the driver program on the worker nodes through the cluster manager.
  • Whenever any of the worker nodes fails, its tasks are automatically reallocated to other worker nodes.

 


.How do you configure Spark Application?

  • A Spark application can be configured using properties that are set directly on a SparkConf object passed during SparkContext initialization (a minimal sketch follows the list below).
  • The following are some of the properties that can be configured for a Spark application:
  • Spark Application Name
  • Number of Spark Driver Cores
  • Spark Driver’s Maximum Result Size
  • Spark Driver’s Memory
  • Spark Executors’ Memory
  • Spark Extra Listeners
  • Spark Local Directory
  • Log Spark Configuration
  • Spark Master
  • Deploy Mode of Spark Driver
  • Log Application Information
  • Spark Driver Supervise Action

Reference: Configure Spark Application
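
A minimal sketch of setting a few of these properties on a SparkConf object; the app name, master and property values are illustrative assumptions:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ConfigSketch {
        public static void main(String[] args) {
            // Each setter/property below maps to one of the configurable items listed above.
            SparkConf conf = new SparkConf()
                    .setAppName("ConfigSketch")              // Spark Application Name
                    .setMaster("local[2]")                   // Spark Master (assumption: local run)
                    .set("spark.driver.cores", "1")          // Number of Spark Driver Cores
                    .set("spark.driver.memory", "2g")        // Spark Driver's Memory
                    .set("spark.executor.memory", "2g")      // Spark Executors' Memory
                    .set("spark.logConf", "true");           // Log Spark Configuration

            JavaSparkContext jsc = new JavaSparkContext(conf);
            System.out.println("Configured application: " + jsc.appName());
            jsc.stop();
        }
    }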


SPARK ECOSYSTEM


SPARK COMPONENT

Apache Spark Components


.What are the main components of Spark?

 Explain about the major libraries that constitute the Spark Ecosystem

What are the various libraries available on top of Apache Spark?

  • Spark Core:
    •  Spark Core contains the basic functionality of Spark, including components for
      • task scheduling,
      • memory management,
      • fault recovery,
      • interacting with storage systems, and more.
    • Spark Core is also home to the API that defines RDDs,
    • It is the base engine for large-scale parallel and distributed data processing.
  • Spark SQL:
    • Aimed at developers working with structured data.
    • Spark SQL is Spark’s package for working with structured data.
    • It allows querying data via SQL as well as HQL (the Hive Query Language); see the sketch after this list.
  • Spark Streaming
    • This library is used to process real time streaming data.
    • Examples of data streams include logfiles generated by production web servers.
    • It is a very simple library that listens on unbounded data sets or the datasets where data is continuously flowing.
    • The processing pauses and waits for data to come if the source isn’t providing any.
    • This library converts the incoming data stream into RDDs containing the data collected over each n-second interval (a micro-batch) and then runs the provided operations on those RDDs.
  • MLlib:
    • Spark comes with a library containing common machine learning (ML) functionality, called MLlib.
    • Machine learning library in Spark for commonly used learning algorithms like clustering, regression, classification, etc.
    • Performs machine learning in Apache Spark.
    • It is the machine learning library provided by Spark.
    • It basically has all the algorithms, internally wired to use Spark Core (RDD operations) and the required data structures.
    • For example, it provides ways to translate a matrix into RDDs and to express recommendation algorithms as sequences of transformations and actions.
    • MLlib provides machine learning algorithms that can run in parallel on many computers.
  • GraphX:
    • GraphX is a library for generating and computing graphs, manipulating huge graph data structures (e.g., a social network’s friend graph) and performing graph-parallel computations.
    • Spark API for graph parallel computations with basic operators like
      • joinVertices,
      • subgraph,
      • aggregateMessages, etc.
    • It converts graphs into RDDs internally.
    • Various algorithms such as PageRank on graphs are internally converted into operations on RDDs.
  • SparkR
    • Brings R programming to the Spark engine.
  • BlinkDB
    • Enables interactive queries over massive data and is a common part of the Spark ecosystem. GraphX, SparkR, and BlinkDB are in the incubation stage.
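
A minimal Spark SQL sketch in Java; the local master and the JSON file path are illustrative assumptions:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SparkSqlSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("SparkSqlSketch")
                    .master("local[*]")   // assumption: local run for illustration
                    .getOrCreate();

            // Assumption: a JSON file of people records exists at this placeholder path.
            Dataset<Row> people = spark.read().json("data/people.json");
            people.createOrReplaceTempView("people");

            // Query the structured data via SQL.
            Dataset<Row> adults = spark.sql("SELECT name, age FROM people WHERE age >= 18");
            adults.show();

            spark.stop();
        }
    }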

.What is Spark MLlib?

  • MLlib is Spark’s scalable Machine Learning inbuilt library.
  • The library contains many machine learning algorithms and utilities to transform the data and extract useful information or inference from the data. 

.How is MLlib scalable?
MLlib builds on Spark’s DataFrames API, so its algorithms run on distributed data and scale out across the cluster.

. What kinds of machine learning use cases does MLlib solve?
MLlib contains common learning algorithms that can solve problems like the following (a small sketch follows the list):

  • Clustering
  • Classification
  • Regression
  • Recommendation
  • Topic Modelling
  • Frequent itemsets
  • Association rules
  • Sequential pattern mining 
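
A minimal clustering sketch with MLlib’s DataFrame-based API; the local master and the libsvm sample-data path are illustrative assumptions:

    import org.apache.spark.ml.clustering.KMeans;
    import org.apache.spark.ml.clustering.KMeansModel;
    import org.apache.spark.ml.linalg.Vector;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class KMeansSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("KMeansSketch")
                    .master("local[*]")   // assumption: local run for illustration
                    .getOrCreate();

            // Assumption: libsvm-formatted sample data at this placeholder path.
            Dataset<Row> dataset = spark.read().format("libsvm").load("data/sample_kmeans_data.txt");

            // Cluster the feature vectors into two groups.
            KMeans kmeans = new KMeans().setK(2).setSeed(1L);
            KMeansModel model = kmeans.fit(dataset);

            for (Vector center : model.clusterCenters()) {
                System.out.println("Cluster center: " + center);
            }
            spark.stop();
        }
    }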

.How is machine learning implemented in Spark?

  • MLlib is the scalable machine learning library provided by Spark.
  • It aims at making machine learning easy and scalable, with common learning algorithms and use cases like clustering, regression, filtering, dimensionality reduction, and the like.

 What is Spark MLlib?

  • Mahout is a machine learning library for Hadoop; similarly, MLlib is a Spark library. MLlib provides different algorithms, and those algorithms scale out on the cluster for data processing.
  • Most of the data scientists use this MLlib library.

. What is GraphX?

  • Many times you have to process data in the form of graphs, because you have to do some analysis on it.
  • GraphX performs graph computation in Spark on data that is present in files or in RDDs.
  • GraphX is built on top of Spark Core, so it has all the capabilities of Apache Spark, like fault tolerance and scaling, and it also ships with many built-in graph algorithms. GraphX unifies ETL, exploratory analysis and iterative graph computation within a single system.
  • You can view the same data as both graphs and collections, transform and join graphs with RDDs efficiently, and write custom iterative algorithms using the Pregel API.
  • GraphX competes on performance with the fastest graph systems while retaining Spark’s flexibility, fault tolerance and ease of use.
  • Spark uses GraphX for graph processing to build and transform interactive graphs.
  • The GraphX component enables programmers to reason about structured data at scale.
  • GraphX is a Spark API for manipulating graphs and collections.

. Is there any API available for implementing graphs in Spark?

  • GraphX is the API used for implementing graphs and graph-parallel computing in Apache Spark.
  • It extends the Spark RDD with a Resilient Distributed Property Graph.
  • It is a directed multi-graph that can have several edges in parallel.
  • Each edge and vertex of the Resilient Distributed Property Graph has user-defined properties associated with it. The parallel edges allow for multiple relationships between the same vertices.
  • In order to support graph computation, GraphX exposes a set of fundamental operators, such as joinVertices, mapReduceTriplets, and subgraph, and an optimized variant of the Pregel API.
  • The GraphX component also includes an increasing collection of graph algorithms and builders for simplifying graph analytics tasks.

What is the usage of GraphX module in Spark?

  • GraphX is a graph processing library.
  • It can be used to build and transform interactive graphs.
  • Many algorithms are available with GraphX library. PageRank is one of them.

 

SPARK CORE

Define Apache Spark Core.

  • Spark Core is the fundamental unit of the whole Spark project.
  • It provides all sorts of functionality, such as
    • task dispatching,
    • input-output operations,
    • memory management,
    • fault tolerance,
    • scheduling and
    • monitoring jobs, and
    • interacting with storage systems; these are primary functionalities of Spark.
  • Apart from this, Spark also provides basic connectivity with data sources, for example HBase, Amazon S3 and HDFS.
  • Spark makes use of a special data structure known as the RDD (Resilient Distributed Dataset). Spark Core is home to the API that defines and manipulates RDDs.
  • Spark Core is a distributed execution engine with all the other functionality attached on top of it, for example MLlib, Spark SQL, GraphX and Spark Streaming.
  • Thus, it allows diverse workloads on a single platform.
  • Spark Core is the base engine of the Apache Spark framework.

The key features of Apache Spark Core are:

  • It is in charge of essential I/O functionalities.
  • Significant in programming and observing the role of the Spark cluster.
  • Task dispatching.
  • Fault recovery.
  • It overcomes the snag of MapReduce by using in-memory computation.

 

What is Spark engine responsibility?

  • The Spark engine schedules, distributes and monitors the data application across the Spark cluster.
  • Generally, the Spark engine is concerned with establishing, spreading (distributing) and then monitoring the various sets of data spread around the various clusters.

Spark Core is the main engine responsible for all of the processes happening within Spark. Keeping that in mind, you probably won’t be surprised to know that it has a bunch of duties – monitoring, memory and storage management, and task scheduling, just to name a few.


SPARK DRIVER

What is role of Driver program in Spark Application ?

  • The driver program is responsible for launching various parallel operations on the cluster.
  • The driver program contains the application’s main() function.
  • It is the process which runs the
    • user code, which in turn
    • creates the SparkContext object,
    • creates RDDs and
    • performs transformation and action operations on RDDs.
  • The driver program accesses Apache Spark through a SparkContext object, which represents a connection to the computing cluster (from Spark 2.0 onwards we can access the SparkContext object through a SparkSession).
  • The driver program is responsible for converting the user program into units of physical execution called tasks.
  • It also defines distributed datasets on the cluster, and we can apply different operations on them (transformations and actions).
  • The Spark program creates a logical plan called a Directed Acyclic Graph (DAG), which is converted into a physical execution plan by the driver when the driver program runs.
  • The Spark Driver is the program that runs on the master node and declares transformations and actions on data RDDs.
  • In simple terms, the driver in Spark creates the SparkContext, connected to a given Spark Master.
  • The driver also delivers the RDD graphs to the Master, where the standalone cluster manager runs.

Where does the Spark Driver run on YARN?

  • If you submit a job with --deploy-mode client (or the legacy --master yarn-client), the Spark driver runs on the client’s machine.
  • If you submit a job with --deploy-mode cluster (or the legacy --master yarn-cluster), the Spark driver runs inside a YARN container (the application master). Example commands follow below.
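
Hedged examples of the two modes (the class name and jar are placeholders, not from the original notes):

  • Driver on the client machine: spark-submit --master yarn --deploy-mode client --class com.example.MyApp myapp.jar
  • Driver inside a YARN container: spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp myapp.jar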

SPARK CONTEXT


.What is Spark Context?

  • A SparkContext instance
    • sets up internal services for a Spark application and
    • establishes a connection to a Spark execution environment.
  • The SparkContext is created by the Spark Driver application.
  • The SparkContext is the entry point to Spark.
  • Using the SparkContext you create RDDs, which provide various ways of churning the data.
  • When a programmer creates RDDs, the SparkContext connects to the Spark cluster on their behalf.
  • The SparkContext tells Spark how to access the cluster.
  • A SparkConf object is the key ingredient the programmer uses to create the application's SparkContext.

What are the various ways to create contexts in Spark?

  • SparkContext
  • SQLContext
  • SparkSession
  • sqlContext.sparkContext

.How does Spark Context in Spark Application pick the value for Spark Master?

That can be done in two ways.

  • Create a new SparkConf object and set the master using its setMaster() method.
    • This Spark configuration object is passed as an argument while creating the new SparkContext:

      SparkConf conf = new SparkConf().setAppName("JavaKMeansExample")
              .setMaster("local[2]")
              .set("spark.executor.memory", "3g")
              .set("spark.driver.memory", "3g");
      JavaSparkContext jsc = new JavaSparkContext(conf);

  • The <apache-installation-directory>/conf/spark-env.sh file, located locally on the machine, contains information regarding the Spark environment configuration.
    • Spark Master is one of the parameters that can be provided in this configuration file.

SparkSession

.What is the need for SparkSession in Spark?

Starting from Apache Spark 2.0, Spark Session is the new entry point for Spark applications.

Prior to 2.0, SparkContext was the entry point for Spark jobs. RDD was one of the main APIs then, and it was created and manipulated using the SparkContext. For every other API, a different context was required – for SQL, the SQLContext; for Streaming, the StreamingContext; for Hive, the HiveContext.

But from 2.0, RDD along with DataSet and its subset DataFrame APIs are becoming the standard APIs and are a basic unit of data abstraction in Spark. All of the user defined code will be written and evaluated against the DataSet and DataFrame APIs as well as RDD.

So, there is a need for a new entry point built to handle these new APIs, which is why SparkSession was introduced. SparkSession also includes all the APIs available in the different contexts – SparkContext, SQLContext, StreamingContext and HiveContext. A minimal sketch follows.
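
A minimal sketch of creating a SparkSession and reaching the older SparkContext through it; the local master is an illustrative assumption:

    import org.apache.spark.SparkContext;
    import org.apache.spark.sql.SparkSession;

    public class SparkSessionSketch {
        public static void main(String[] args) {
            // Single unified entry point since Spark 2.0.
            SparkSession spark = SparkSession.builder()
                    .appName("SparkSessionSketch")
                    .master("local[*]")   // assumption: local run for illustration
                    .getOrCreate();

            // The older entry point is still reachable through the session.
            SparkContext sc = spark.sparkContext();
            System.out.println("Running Spark " + sc.version());

            // DataFrame/SQL work goes through the session directly.
            spark.sql("SELECT 1 AS id").show();

            spark.stop();
        }
    }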