SPARK BASICS-2
How can you launch Spark jobs inside Hadoop MapReduce?
Using SIMR (Spark in MapReduce), users can run any Spark job inside MapReduce without requiring any admin rights.
Is it mandatory to start Hadoop to run a Spark application?
- No, it is not mandatory, but since Spark has no storage layer of its own, it can use the local file system to store data.
- You can load data from the local file system and process it; Hadoop or HDFS is not mandatory to run a Spark application.
What is the File System API?
- The FS API can read data from different storage systems such as:
- HDFS,
- S3, or
- the local file system.
- Spark uses the FS API to read data from different storage engines, as in the sketch below.
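A minimal sketch of this, assuming an existing SparkContext sc; the paths, bucket and host names are hypothetical:

// The URI scheme tells Spark which file system implementation to use.
val fromHdfs = sc.textFile("hdfs://namenode:8020/data/input.txt")
val fromS3 = sc.textFile("s3a://my-bucket/data/input.txt") // requires the Hadoop AWS module on the classpath
val fromLocal = sc.textFile("file:///home/user/input.txt")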
What are the different methods to run Spark over Apache Hadoop?
- Instead of MapReduce, we can use Spark on top of the Hadoop ecosystem.
- Spark with HDFS: you can read and write data in HDFS.
- Spark with Hive: you can read, analyse and write data back to Hive.
What are the optimizations that a developer can make while working with Spark?
Spark is memory intensive; whatever you do, it does in memory.
- Firstly, you can adjust how long Spark will wait before it times out on each of the data locality levels (process local -> node local -> rack local -> any).
- Filter out data as early as possible. For caching, choose wisely from the various storage levels.
- Tune the number of partitions in Spark. See the sketch below.
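A minimal sketch of these knobs, assuming the configuration is set before the SparkContext is created; the path and the values shown are placeholders, not recommendations:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Wait at most 3 seconds for a more data-local scheduling slot before falling back a locality level.
val conf = new SparkConf().setAppName("tuning-demo").set("spark.locality.wait", "3s")
val sc = new SparkContext(conf)

val rdd = sc.textFile("hdfs:///data/input.txt")
val filtered = rdd.filter(_.nonEmpty)          // filter out data as early as possible
filtered.persist(StorageLevel.MEMORY_AND_DISK) // choose the storage level deliberately when caching
val tuned = filtered.repartition(200)          // tune the number of partitions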
What are the abstractions of Apache Spark?
There are several abstractions in Apache Spark:
1. RDD:
An RDD (Resilient Distributed Dataset) is a read-only, partitioned collection of records. It is Spark's core abstraction and the fundamental data structure of Spark. It allows in-memory computations on large clusters, in a fault-tolerant manner.
2. DataFrame:
A DataFrame is a Dataset organized into named columns. DataFrames are equivalent to a table in a relational database or a data frame in R/Python; in other words, a relational table with good optimization techniques. It is an immutable distributed collection of data that lets developers impose a structure onto a distributed collection of data, allowing higher-level abstraction.
A DataFrame is an abstraction which gives a schema view of data: it gives us a view of the data as columns with column names and type information, so we can think of the data in a DataFrame like a table in a database.
Like RDDs, execution on a DataFrame is lazily triggered, which offers a huge performance benefit.
3. Spark Streaming:
It is a core extension of Spark that allows real-time stream processing from several sources, for example Flume and Kafka. These sources feed a unified, continuous stream abstraction that can be used alongside interactive and batch queries. It offers scalable, high-throughput and fault-tolerant processing.
4. GraphX:
It is one more example of a specialized data abstraction. It enables developers to analyze social networks and other graphs, alongside Excel-like two-dimensional data.
5. SparkR:
SparkR is a package for the R language that enables R users to leverage the power of Spark from the R shell.
How is Apache Spark different from Distributed Storage Management?
Some of the differences between an RDD and Distributed Storage are as follows:
- A Resilient Distributed Dataset (RDD) is the primary data abstraction of the Apache Spark framework, whereas Distributed Storage is simply a file system which works across multiple nodes.
- RDDs hold data in memory during computation (and persist it only when explicitly cached), whereas Distributed Storage keeps data in persistent storage.
- An RDD can re-compute itself in the case of failure or data loss, whereas data lost from a Distributed Storage system is gone forever (unless there is an internal replication system).
Define journaling in Apache Spark.
How is fault-tolerance achieved through the write-ahead log in Spark?
There are two types of failures in any Apache Spark job – Either the driver failure or the worker failure.
When any worker node fails, the executor processes running in that worker node will be killed, and the tasks which were scheduled on that worker node will be automatically moved to any of the other running worker nodes, and the tasks will be accomplished.
When the driver or master node fails, all of the associated worker nodes running the executors will be killed, along with the data in each of the executors' memory. In the case of files being read from reliable and fault-tolerant file systems like HDFS, zero data loss is always guaranteed, as the data is ready to be read at any time from the file system. Checkpointing also ensures fault tolerance in Spark by periodically saving the application data at specific intervals.
In the case of a Spark Streaming application, zero data loss is not always guaranteed, as the data is buffered in the executors' memory until it gets processed. If the driver fails, all of the executors are killed along with the data in their memory, and the data cannot be recovered.
To overcome this data loss scenario, Write Ahead Logging (WAL) was introduced in Apache Spark 1.2. With WAL enabled, the intention of the operation is first noted down in a log file, so that if the driver fails and is restarted, the noted operations in that log file can be applied to the data. For sources that read streaming data, like Kafka or Flume, receivers receive the data and store it in the executor's memory. With WAL enabled, the received data is also written to the log files.
WAL can be enabled by performing the steps below (see the sketch that follows):
1. Setting the checkpoint directory, using streamingContext.checkpoint(path)
2. Enabling WAL logging, by setting spark.streaming.receiver.writeAheadLog.enable to true.
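A minimal sketch of both steps, assuming a receiver-based Spark Streaming application; the application name and checkpoint path are hypothetical:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("wal-demo")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true") // step 2: enable the write-ahead log
val ssc = new StreamingContext(conf, Seconds(10))               // 10-second batch interval
ssc.checkpoint("hdfs:///checkpoints/wal-demo")                  // step 1: checkpoint directory for metadata and the WAL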
Define the roles of the file system in any framework?
In order to manage data on a computer, one has to interact with the file system directly or indirectly.
When we install Hadoop on a computer, there are actually two file systems on the machine:
(1) the Local File System, and
(2) HDFS (Hadoop Distributed File System).
HDFS sits on top of the Local File System.
Following are the general functions of a file system (be it Local or HDFS):
- Control the data access mechanism (i.e. how data is stored and retrieved)
- Manage the metadata about files/folders (i.e. created date, size, etc.)
- Grant access permissions and manage security
- Efficiently manage the storage space
The file system manages user data, that is, how data is read, written and accessed.
The FS is the structure behind how our system stores and organizes data.
The FS stores all the information about a file (metadata), like the file name, the length of the contents of the file, and the location of the file in the folder hierarchy, separate from the contents of the file itself.
The main role of the FS is to make sure that, whoever or whatever is accessing the data and whatever action is taken, the structure remains consistent.
What is Speculative Execution in Spark?
How do you enable Speculative Execution in Spark?
Is there any drawback of Speculative Execution?
> One more point: speculative execution does not stop the slow-running task; it launches a new copy of the task in parallel.
Tabular form (Spark property >> default value >> description):
spark.speculation >> false >> Enables (true) or disables (false) speculative execution of tasks.
spark.speculation.interval >> 100ms >> How often Spark checks for tasks to speculate.
spark.speculation.multiplier >> 1.5 >> How many times slower than the median a task must be before it is considered for speculation.
spark.speculation.quantile >> 0.75 >> The fraction of tasks that must be complete before speculation is enabled for a stage.
A speculative task in Apache Spark is launched for a task that runs slower than the rest of the tasks in the job. It is a health-check process that identifies tasks running slower than the median of the successfully completed tasks in the task set. Such tasks are submitted to another worker: Spark runs the new copy in parallel rather than shutting down the slow task.
In cluster deploy mode, the check runs as a thread in TaskSchedulerImpl when spark.speculation is enabled. It executes periodically, every spark.speculation.interval, after the initial spark.speculation.interval passes.
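A minimal sketch of enabling speculation with the properties from the table above, assuming the configuration is applied before the SparkContext is created; the values shown are the defaults, used here only for illustration:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")            // turn speculative execution on
  .set("spark.speculation.interval", "100ms")  // how often to check for slow tasks
  .set("spark.speculation.multiplier", "1.5")  // how many times slower than the median a task must be
  .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish before checking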
Can Spark be deployed without Hadoop, or do we need to use Spark and Hadoop together?
Yes, Apache Spark can run without Hadoop, standalone, or in the cloud. Spark doesn’t need a Hadoop cluster to work. Spark can read and then process data from other file systems as well. HDFS is just one of the file systems that Spark supports.
Spark does not have any storage layer, so it relies on one of the distributed storage systems for distributed computing like HDFS, Cassandra etc.
However, there are a lot of advantages to running Spark on top of Hadoop (HDFS for storage + YARN as resource manager), but it is not a mandatory requirement. Spark is meant for distributed computing: the data is distributed across the computers, and Hadoop's distributed file system HDFS is used to store data that does not fit in memory.
One more reason for using Hadoop with Spark is that both are open source and integrate with each other rather easily compared to other data storage systems.
—————–
What mistakes do developers generally commit while using Apache Spark?
1) Management of DAGs: people often make mistakes in controlling the DAG. Always try to use reduceByKey instead of groupByKey: the two can perform almost the same job, but groupByKey shuffles far more data across the network, so prefer reduceByKey wherever possible (see the sketch below). Also try to keep the map side as small as possible, avoid unnecessary shuffles, keep away from data skew, and do not waste too much time on partitioning.
2) Maintain the required size of the shuffle blocks.
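A minimal sketch contrasting the two operations, assuming an existing SparkContext sc:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// groupByKey ships every value across the network before aggregating:
val grouped = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values within each partition first, so far less data is shuffled:
val reduced = pairs.reduceByKey(_ + _)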
——————–
By default, how many partitions are created in an RDD in Apache Spark?
By default, Spark creates one partition for each block of the file (for HDFS).
The default HDFS block size is 64 MB (Hadoop version 1) / 128 MB (Hadoop version 2).
However, one can explicitly specify the number of partitions to be created.
Example 1: the number of partitions is not specified.
val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt")
Example 2: the following code creates an RDD with 10 partitions, since we specify the number of partitions.
val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt", 10)
One can query the number of partitions in the following way:
rdd1.partitions.length
OR
rdd1.getNumPartitions
The best-case scenario is to create the RDD such that:
number of cores in the cluster = number of partitions
val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt")
Consider that the size of wc-data.txt is 1280 MB and the default block size is 128 MB. So there will be 10 blocks created and 10 default partitions (1 per block).
For better performance, we can increase the number of partitions on each block. The code below creates 20 partitions over the 10 blocks (2 partitions per block). Performance will improve, but we need to make sure that each worker node is running at least 2 cores.
val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt", 20)
—
Why do we need compression, and what are the different compression formats supported?
In Big Data, compression saves storage space and reduces network overhead.
One can specify a compression codec while writing data to HDFS (Hadoop format).
One can also read compressed data; for that, too, a compression codec is used (see the sketch below).
Following are the different compression formats supported in Big Data:
* gzip
* lzo
* bzip2
* Zlib
* Snappy
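A minimal sketch of writing and reading gzip-compressed text data, assuming an existing SparkContext sc; the paths are hypothetical:

import org.apache.hadoop.io.compress.GzipCodec

val rdd = sc.textFile("/home/hdadmin/wc-data.txt")
rdd.saveAsTextFile("/tmp/wc-data-gz", classOf[GzipCodec]) // writes gzip-compressed part files
val readBack = sc.textFile("/tmp/wc-data-gz")             // compressed input is decompressed transparently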
————————–
What is Shark?
- Shark is a tool, developed for people who come from a database background, to access Scala MLlib capabilities through a Hive-like SQL interface.
- The Shark tool helps data users run Hive on Spark, offering compatibility with the Hive metastore, queries and data.
- Hive contains significant support for Apache Spark, wherein Hive execution is configured to use Spark:
hive> set spark.home=/location/to/sparkHome;
hive> set hive.execution.engine=spark;
Hive on Spark supports Spark on YARN mode by default.
- Hive is a component of Hortonworks’ Data Platform (HDP).
- Hive provides an SQL-like interface to data stored in the HDP.
- Spark users will automatically get the complete set of Hive’s rich features, including any new features that Hive might introduce in the future.
- The main task in implementing the Spark execution engine for Hive lies in query planning, where Hive operator plans from the semantic analyzer are translated into a task plan that Spark can execute.
- It also includes query execution, where the generated Spark plan gets actually executed in the Spark cluster.
How can you connect Hive to Spark SQL?
- Place the hive-site.xml file in the conf directory of Spark.
- Then, with the help of the SparkSession object, we can construct a DataFrame as:
result = spark.sql("select * from <hive_table>")
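Equivalently, a minimal sketch in Scala, assuming hive-site.xml is already in Spark's conf directory; the application and table names are hypothetical:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-demo")
  .enableHiveSupport()   // picks up the Hive metastore configured in hive-site.xml
  .getOrCreate()

val result = spark.sql("select * from my_hive_table")
result.show()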
Can structured data be queried in Spark? If so, how?
Yes, structured data can be queried using:
- SQL
- DataFrame API
How is SparkSQL different from HQL and SQL?
- SparkSQL is a special component on the Spark Core engine that supports SQL and Hive Query Language without changing any syntax.
- It is possible to join a SQL table and an HQL table.
What is a Sparse Vector?
- A sparse vector is used for storing non-zero entries in order to save space.
- It has two parallel arrays: one for indices and the other for values.
An example of a sparse vector is as follows:
Vectors.sparse(7,Array(0,1,2,3,4,5,6),Array(1650d,50000d,800d,3.0,3.0,2009,95054))
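For comparison, a minimal sketch of a vector that is genuinely sparse, with the required import; the values are illustrative only:

import org.apache.spark.mllib.linalg.Vectors

// A 7-element vector with non-zero values only at indices 0, 2 and 4.
val sv = Vectors.sparse(7, Array(0, 2, 4), Array(1.0, 3.0, 5.0))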
How do you handle data shuffles in Spark?
How do you start the Spark History Server?
What are UDFs and how do you use them?
What is the role of the Spark engine?
What is the use of Spark Environment Parameters? How do you configure them?
- Spark Environment Parameters affect the behavior, working and memory usage of nodes in a cluster.
- These parameters can be configured using the local config file spark-env.sh, located at <apache-installation-directory>/conf/spark-env.sh.
What does it mean by Columnar Storage Format?
When converting tabular or structured data into a stream of bytes, we can store it either row-wise or column-wise.
In row-wise storage, we first store the first row and then the second row, and so on. In column-wise storage, we first store the first column and then the second column, and so on (see the sketch below).
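A minimal sketch, assuming an existing SparkSession spark and DataFrame df; Parquet is a columnar format commonly used with Spark, and the path and column name are hypothetical:

df.write.parquet("/tmp/output-parquet")                  // data is laid out column-wise on disk
val columnar = spark.read.parquet("/tmp/output-parquet")
columnar.select("some_column").show()                    // a columnar layout lets Spark read only the columns it needs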

What is the significance of Sliding Window operation?
- Sliding Window controls transmission of data packets between various computer networks.
- Spark Streaming library provides windowed computations where the transformations on RDDs are applied over a sliding window of data.
- Whenever the window slides, the RDDs that fall within the particular window are combined and operated upon to produce new RDDs of the windowed DStream.
- In Spark Streaming, you have to specify the batch interval. For example, say your batch interval is 10 seconds: Spark will then process whatever data it received in the last 10 seconds, i.e. the last batch interval.
- With a sliding window, you can additionally specify how many of the last batches have to be processed together (the window length) and how often the window moves forward (the slide interval). For example, you may want to process the last 3 batches every time 2 new batches arrive. A sketch is given below.
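A minimal sketch of a windowed computation, assuming an existing StreamingContext ssc with a 10-second batch interval and a DStream lines of text lines:

import org.apache.spark.streaming.Seconds

val words = lines.flatMap(_.split(" ")).map(word => (word, 1))
// Window length 30 seconds (the last 3 batches), sliding every 20 seconds (every 2 new batches).
val windowedCounts = words.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(20))
windowedCounts.print()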
What is Catalyst framework?
- Catalyst framework is a new optimization framework present in Spark SQL.
- It allows Spark to automatically transform SQL queries by adding new optimizations to build a faster processing system.
Name a few companies that use Apache Spark in production.
- Conviva
- Shopify
- OpenTable
Which spark library allows reliable file sharing at memory speed across different cluster frameworks?
Tachyon
Why is BlinkDB used?
- BlinkDB is a query engine for executing interactive SQL queries on huge volumes of data and renders query results marked with meaningful error bars.
- BlinkDB helps users balance ‘query accuracy’ with response time.
What is the PageRank Algorithm?
- PageRank is one of the algorithms in GraphX.
- PageRank measures the importance of each vertex in a graph, assuming an edge from u to v represents an endorsement of v's importance by u.
- For example, on Twitter, if a user is followed by many other users, that particular user will be ranked highly.
- GraphX comes with static and dynamic implementations of PageRank as methods on the PageRank object (see the sketch below).
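A minimal sketch, assuming an existing SparkContext sc; the edge-list path is hypothetical and the tolerance value is illustrative:

import org.apache.spark.graphx.GraphLoader

// Each line of the hypothetical edge-list file contains a source and a destination vertex id.
val graph = GraphLoader.edgeListFile(sc, "/data/followers.txt")
val ranks = graph.pageRank(0.0001).vertices                    // dynamic PageRank, run until ranks converge within 0.0001
ranks.sortBy(_._2, ascending = false).take(5).foreach(println) // the five most important vertices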
What is Distributed?
- The data in an RDD is automatically distributed across different parallel computing nodes.
What is Cacheable?
- Spark keeps the data in memory for computation rather than going to disk, so Spark can cache the data and be up to 100 times faster than Hadoop.
What are actions?
- An action brings the data back from an RDD to the local machine.
- An action's execution is the result of all previously created transformations. reduce() is an action that applies the passed function repeatedly until one value is left. The take() action brings values from the RDD to the local node. A sketch is given below.
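A minimal sketch of the two actions mentioned above, assuming an existing SparkContext sc:

val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
val sum = nums.reduce(_ + _)    // action: applies the function repeatedly until one value is left (15)
val firstThree = nums.take(3)   // action: brings the first three elements back to the driver (Array(1, 2, 3))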
What is Executor memory?
- You can configure this using the --executor-memory argument to spark-submit.
- Each application will have at most one executor on each worker, so this setting controls how much of that worker's memory the application will claim.
- By default, this setting is 1 GB; you will likely want to increase it on most servers.
What is the maximum number of total cores?
- This is the total number of cores used across all executors for an application. By default, this is unlimited; that is, the application will launch executors on every available node in the cluster.
- For a multi-user workload, you should instead ask users to cap their usage. You can set this value through the --total-executor-cores argument to spark-submit, or by configuring spark.cores.max in your Spark configuration file (see the sketch below).
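A minimal sketch setting both resource properties programmatically, assuming the configuration is built before the SparkContext is created; the values are illustrative:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("resource-demo")
  .set("spark.executor.memory", "4g") // equivalent to --executor-memory 4G on spark-submit
  .set("spark.cores.max", "8")        // equivalent to --total-executor-cores 8 on spark-submit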
What is Piping?
- Spark provides a pipe() method on RDDs.
- Spark’s pipe() lets us write parts of jobs using any language we want as long as it can read and write to Unix standard streams.
- With pipe(), you can write a transformation of an RDD that reads each RDD element from standard input as a String, manipulates that String however you like, and then writes the result(s) as Strings to standard output (see the sketch below).
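A minimal sketch, assuming an existing SparkContext sc and an external script /path/to/transform.sh (hypothetical) that reads lines from standard input and writes transformed lines to standard output:

val data = sc.parallelize(Seq("alpha", "beta", "gamma"))
val piped = data.pipe("/path/to/transform.sh") // each element is written to the script's stdin, one per line
piped.collect().foreach(println)               // the script's stdout lines become the elements of the new RDD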
What are Row objects?
- Row objects represent records inside SchemaRDDs, and are simply fixed-length arrays of fields.
- Row objects have a number of getter functions to obtain the value of each field given its index.
- The standard getter, get (or apply in Scala), takes a column number and returns an Object type (or Any in Scala) that we are responsible for casting to the correct type.
- For Boolean, Byte, Double, Float, Int, Long, Short, and String, there is a getType() method (getBoolean, getInt, getString, and so on) which returns that type. For example, getString(0) would return field 0 as a String.
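A minimal sketch of the typed getters, assuming an existing DataFrame df whose first two columns are a String name and an Int age (the column layout is hypothetical):

import org.apache.spark.sql.Row

df.collect().foreach { (row: Row) =>
  val name = row.getString(0) // typed getter by column index
  val age = row.getInt(1)
  println(s"$name is $age years old")
}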
How can you trigger automatic clean-ups in Spark to handle accumulated metadata?
- You can trigger the clean-ups by setting the parameter ‘spark.cleaner.ttl’
or
- by dividing the long running jobs into different batches and writing the intermediary results to the disk.
Can you use Spark to access and analyse data stored in Cassandra databases?
- Yes, it is possible to use Apache Spark for accessing as well as analyzing data stored in Cassandra databases using the Spark Cassandra Connector.
- The connector needs to be added to the Spark project; with it, a Spark executor talks to a local Cassandra node and will query only local data.
- Connecting Cassandra with Apache Spark makes queries faster by reducing the use of the network to send data between Spark executors and Cassandra nodes. A sketch is given below.
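A minimal sketch using the Spark Cassandra Connector, assuming the spark-cassandra-connector package is on the classpath and an existing SparkContext sc configured with spark.cassandra.connection.host; the keyspace and table names are hypothetical:

import com.datastax.spark.connector._

val rows = sc.cassandraTable("my_keyspace", "my_table") // an RDD backed by the Cassandra table
println(rows.count())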
How does Spark store the data?
Spark is a processing engine; there is no storage engine. It can retrieve data from any storage engine like HDFS, S3 and other data resources.
Hadoop uses replication to achieve fault tolerance. How is this achieved in Apache Spark?
The data storage model in Apache Spark is based on RDDs. RDDs help achieve fault tolerance through lineage: an RDD always has the information on how it was built from other datasets. If any partition of an RDD is lost due to failure, lineage helps rebuild only that particular lost partition.
How does Spark achieve fault tolerance?
Spark stores data in-memory whereas Hadoop stores data on disk. Hadoop uses replication to achieve fault tolerance, whereas Spark uses a different data storage model, the RDD. RDDs achieve fault tolerance through the notion of lineage: if a partition of an RDD is lost, the RDD has enough information to rebuild just that partition. This removes the need for replication to achieve fault tolerance.
What is ‘SCC’?
Although this abbreviation isn’t very commonly used (thus resulting in rather difficult surrounding interview questions on Spark), you might encounter such a question.
SCC stands for “Spark Cassandra Connector”. It is a tool that Spark uses to access the information (data) located in various Cassandra databases.
Is it normal to run all of your processes on a localized node?
No, it is not. This is one of the most common mistakes that Spark developers make – especially when they’re just starting. You should always try to distribute your data flow – this will both hasten the process and make it more fluid.
Does the File System API have a usage in Spark?
Indeed, it does. This particular API allows Spark to read and compose the data from various storage areas (devices).
What is Immutable?
- Once something is created and assigned a value, it is not possible to change it; this property is called immutability.
- Spark's RDDs are immutable by default:
- they do not allow updates and modifications.
- Please note that the data collection is not immutable, but the data value is immutable.
What is ‘immutability’?
As the name probably implies, when an item is immutable, it cannot be changed or altered in any way once it is fully created and has an assigned value.
This being one of the Apache Spark interview questions which allow some sort of elaboration, you could also add that by default, Spark (as a framework) has this feature. However, this does not apply to the processes of collecting data – only their assigned values.
What is PySpark?
- PySpark is a cluster computing framework which runs on a cluster of commodity hardware and performs data unification, i.e. reading and writing a wide variety of data from different sources.
- In Spark, a task is a unit of work that can be a map task or a reduce task.
- The Spark Context handles the execution of the job and also provides APIs in different languages, i.e. Scala, Java and Python, to develop applications, giving faster execution compared to MapReduce.
What is the connection between Job, Task and Stage?
Task
A task is a unit of work that is sent to the executor. Each stage has some tasks, one task per partition. The same task is executed over different partitions of the RDD.
Job
The job is a parallel computation consisting of multiple tasks that get spawned in response to actions in Apache Spark.
Stage
Each job gets divided into smaller sets of tasks, called stages, that depend on one another. Stages are known as computational boundaries. All computation cannot be done in a single stage; it is achieved over multiple stages. A sketch is given below.
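A minimal sketch of how the three relate, assuming an existing SparkContext sc; the comments describe the usual breakdown for this kind of computation:

// reduceByKey introduces a shuffle, so the collect() action triggers one job
// that is split into two stages; each stage runs one task per partition (4 here).
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)), 4)
val counts = pairs.reduceByKey(_ + _)
counts.collect() // action -> job -> stages -> tasks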