SPARKSQL
DATASET
DATAFRAME
PARTITION
BROADCAST VARIABLE
DSTREAM
PARQUET FILE
ACCUMULATOR
SPARK SESSION
————————————————————————————————————–
SPARKSQL

.What is “Spark SQL”?
- Spark SQL is a Spark interface to work with structured as well as semi-structured data.
- It has the capability to load data from multiple structured sources like “text files”, JSON files, Parquet files, among others.
- Spark SQL provides a special type of RDD called SchemaRDD.
- These are row objects, where each object represents a record.
- Spark SQL, which grew out of the earlier Shark project, is a module introduced in Spark to work with structured data and perform structured data processing.
- Through this module, Spark executes relational SQL queries on the data.
- The core of the component supports an altogether different RDD called SchemaRDD, composed of row objects and schema objects that define the data type of each column in the row.
- It is similar to a table in relational database.
- Spark SQL is a module in Apache Spark that integrates relational processing (e.g., declarative queries and optimized storage) with Spark's procedural programming API. Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API. Second, it includes a highly extensible optimizer, Catalyst. Big data applications require a mix of processing techniques, data sources and storage formats. The earliest systems designed for these workloads, such as MapReduce, gave users a powerful but low-level procedural programming interface. Programming such systems was onerous and required manual tuning by the user to achieve high performance. As a result, various new systems sought to provide a more productive user experience by offering relational interfaces to big data. Systems like Pig, Hive and Shark all take advantage of declarative queries to provide richer automatic optimizations.
.Can we do real-time processing using Spark SQL?
Not directly, but we can register an existing RDD as a SQL table and trigger SQL queries on top of it.
.List the functions of Spark SQL.
- Loading data from a variety of structured sources
- Querying data using SQL statements, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC). For instance, using business intelligence tools like Tableau
- Providing rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, expose custom functions in SQL, and more
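A minimal sketch in Scala illustrating these functions, assuming an existing SparkSession named spark and a made-up JSON file path:

import spark.implicits._

val people = spark.read.json("/tmp/people.json")        // load data from a structured source
people.createOrReplaceTempView("people")                // expose it to SQL
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")   // query with SQL
adults.show()

// results mix freely with regular Scala code and other DataFrames
adults.filter($"age" < 65).count()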
How is SQL implemented in Spark?
PARTITION
What are Partitions?
- A partition is a logical division of the data; the idea is derived from MapReduce's 'split'.
- Partitions are logical chunks of data derived specifically for processing; working with small chunks of data supports scalability and speeds up the process. Input data, intermediate data, and output data are all partitioned RDDs.
- As the name suggests, partition is a smaller and logical division of data similar to ‘split’ in MapReduce.
- Partitioning is the process to derive logical units of data to speed up the processing process.
- Everything in Spark is a partitioned RDD.
- It is a smaller division of data.
- It represents a logical portion of a big distributed data set.
- Partitioning is the procedure used to derive logical units of data.
- Spark handles data through these partitions, which help parallelize distributed processing with minimal network traffic.
- Spark attempts to read data into an RDD from nodes that are close to it.
- Because Spark usually accesses distributed, partitioned data, it creates partitions to hold the data chunks and speed up transformations.
- Everything in Spark is a partitioned RDD.
- Partitions (also known as slices earlier) are the parts of RDD.
- Each partition is generally located on a different machine.
- Spark runs a task for each partition during the computation.
- If you are loading data from HDFS using textFile(), it creates one partition per HDFS block (64 MB in Hadoop 1, 128 MB in Hadoop 2, by default).
- Though you can change the number of partitions by specifying the second argument in the textFile() function.
- If you are loading data from an existing memory using sc.parallelize(), you can enforce your number of partitions by passing the second argument.
- You can change the number of partitions later using repartition().
- If you want certain operations to consume whole partitions at a time, you can use mapPartitions().
- Partitioning is done in order to simplify data processing, as partitions are the logical distribution of the entire data.
- It is similar to the split in MapReduce.
- In order to enhance the processing speed, this logical distribution is carried out.
- Everything in Apache Spark is a partitioned RDD.
- A partition is a small piece of a bigger chunk of data. Partitions are logical: they are used in Spark to manage data so that minimum network traffic is generated.
- This is another one of those Spark interview questions that allows some elaboration: you could also add that partitioning is the process used to derive the aforementioned small pieces of data from larger chunks, thus allowing the network to run at the highest speed possible.
How does Spark partition the data?
- Spark uses the MapReduce API (InputFormat) to partition the data. In the InputFormat we can set the number of partitions. By default the HDFS block size is the partition size (for best performance), but it is possible to change the partition size, just like the split size.
.Repartition and coalesce difference?
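The question above is left unanswered in the source; as a rough sketch of the usual distinction, repartition() performs a full shuffle and can either increase or decrease the number of partitions, whereas coalesce() avoids a full shuffle and is normally used only to reduce them (assuming an existing SparkContext sc and the sample file used elsewhere in this section):

val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt", 10)

val more = rdd1.repartition(20)    // full shuffle; partition count can go up or down
val fewer = rdd1.coalesce(5)       // narrow dependency; merges partitions without a full shuffle

println(more.getNumPartitions)     // 20
println(fewer.getNumPartitions)    // 5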
PARTITIONS :
Partitions, also known as 'splits' in HDFS, are logical chunks of a data set, which may be in the range of petabytes or terabytes, distributed across the cluster.
By Default, Spark creates one Partition for each block of the file (For HDFS)
Default block size for an HDFS block is 64 MB (Hadoop Version 1) / 128 MB (Hadoop Version 2), and so is the split size.
However, one can explicitly specify the number of partitions to be created.
Partitions are basically used to speed up the data processing.
PARTITIONER :
An object that defines how the elements in a key-value pair RDD are partitioned by key. Maps each key to a partition ID, from 0 to (number of partitions – 1)
Partitioner captures the data distribution at the output. A scheduler can optimize future operations based on the type of partitioner (i.e., if we perform any operation, say a transformation or action, that requires shuffling across nodes, we may need the partitioner; please refer to the reduceByKey() transformation in the forum).
Basically there are three types of partitioners in Spark:
(1) Hash Partitioner (2) Range Partitioner (3) Custom Partitioner (one can write their own)
Property Name : spark.default.parallelism
Default Value: For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:
•Local mode: number of cores on the local machine
•Mesos fine-grained mode: 8
•Others: total number of cores on all executor nodes or 2, whichever is larger
Meaning : Default number of partitions in RDDs returned by transformations like join.
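As a minimal sketch (the application name and local master below are assumptions), the property can be set explicitly through SparkConf:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("parallelism-demo")
  .setMaster("local[4]")
  .set("spark.default.parallelism", "8")   // override the default described above

val sc = new SparkContext(conf)
val rdd = sc.parallelize(1 to 100)         // no parent RDD, so the configured value applies
println(rdd.getNumPartitions)              // 8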
Partition in Spark is similar to split in HDFS. A partition in Spark is a logical division of data stored on a node in the cluster. They are the basic units of parallelism in Apache Spark. RDDs are a collection of partitions. When some actions are executed, a task is launched per partition.
By default, partitions are automatically created by the framework. However, the number of partitions in Spark are configurable to suit the needs.
For the number of partitions, if spark.default.parallelism is set, then we should use the value from SparkContext defaultParallelism; otherwise we should use the maximum number of upstream partitions. Unless spark.default.parallelism is set, the number of partitions will be the same as that of the largest upstream RDD, as this is least likely to cause out-of-memory errors.
A partitioner is an object that defines how the elements in a key-value pair RDD are partitioned by key, maps each key to a partition ID from 0 to numPartitions – 1. It captures the data distribution at the output. With the help of partitioner, the scheduler can optimize the future operations. The contract of partitioner ensures that records for a given key have to reside on a single partition.
We should choose a partitioner to use for cogroup-like operations. If any of the RDDs already has a partitioner, we should choose that one. Otherwise, we use a default HashPartitioner.
There are three types of partitioners in Spark :
a) Hash Partitioner b) Range Partitioner c) Custom Partitioner
Hash Partitioner: Hash-partitioning attempts to spread the data evenly across various partitions based on the key.
Range Partitioner: In the range-partitioning method, tuples having keys within the same range will appear on the same machine.
RDDs can be created with specific partitioning in two ways :
i) Providing explicit partitioner by calling partitionBy method on an RDD
ii) Applying transformations that return RDDs with specific partitioners, as in the sketch below.
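A minimal sketch of both approaches, assuming an existing SparkContext named sc:

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))

// i) explicit partitioner via partitionBy
val hashed = pairs.partitionBy(new HashPartitioner(4))
println(hashed.partitioner)        // Some(HashPartitioner)

// ii) transformations such as reduceByKey return RDDs with a specific partitioner
val reduced = pairs.reduceByKey(_ + _, 4)
println(reduced.partitioner)       // also a HashPartitioner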
By Default, how many partitions are created in RDD in Apache Spark?
By Default, Spark creates one Partition for each block of the file (For HDFS)
Default block size for HDFS block is 64 MB (Hadoop Version 1) / 128 MB (Hadoop Version 2).
However, one can explicitly specify the number of partitions to be created.
Example1:
Number of partitions is not specified:
val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt")
Example2:
The following code creates an RDD of 10 partitions, since we specify the number of partitions:
val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt", 10)
One can query the number of partitions in the following way:
rdd1.partitions.length
OR
rdd1.getNumPartitions
The best-case scenario is to create the RDD in the following way:
number of cores in cluster = number of partitions
val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt")
Consider that the size of wc-data.txt is 1280 MB and the default block size is 128 MB. So there will be 10 blocks created and 10 default partitions (1 per block).
For better performance, we can increase the number of partitions on each block. The code below will create 20 partitions on 10 blocks (2 partitions/block). Performance will be improved, but we need to make sure that each node is running on at least 2 cores.
val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt", 20)
BROADCAST VARIABLE
Why is there a need for broadcast variables when working with Apache Spark?
- These are read only variables, present in-memory cache on every machine.
- When working with Spark, usage of broadcast variables eliminates the necessity to ship copies of a variable for every task, so data can be processed faster.
- Broadcast variables help in storing a lookup table inside the memory which enhances the retrieval efficiency when compared to an RDD lookup ().
- Quite often we have to send certain data such as a machine learning model to every node.
- The most efficient way of sending the data to all of the nodes is by the use of broadcast variables.
- Although you could refer to a regular variable that gets copied to every task, a broadcast variable is far more efficient.
- It is loaded into memory only on the nodes where it is required, and only when it is required, not all the time.
- It is sort of a read-only cache similar to distributed cache provided by Hadoop MapReduce.
- Broadcast variables let the programmer keep a read-only variable cached on each machine, rather than shipping a copy of it with tasks.
- Spark supports 2 types of shared variables called broadcast variables (like Hadoop distributed cache) and accumulators (like Hadoop counters).
- Broadcast variables are stored as array buffers and send read-only values to worker nodes.
- Rather than shipping a copy of a variable with tasks, a broadcast variable helps keep a read-only cached version of the variable on each machine.
- Broadcast variables are also used to provide every node with a copy of a large input dataset.
- Apache Spark tries to distribute the broadcast variable using efficient broadcast algorithms to reduce communication cost.
- Using broadcast variables eradicates the need of shipping copies of a variable for each task. Hence, data can be processed quickly. Compared to an RDD lookup(), broadcast variables assist in storing a lookup table inside the memory that enhances the retrieval efficiency.
- Spark’s second type of shared variable, broadcast variables, allows the program to efficiently send a large, read-only value to all the worker nodes for use in one or more Spark operations. They come in handy, for example, if your application needs to send a large, read-only lookup table to all the nodes.
Explain shared variable in Spark.
What is need of Shared variable in Apache Spark?
Shared variables are nothing but the variables that can be used in parallel operations. By default, when Apache Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only "added" to, such as counters and sums.
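A minimal sketch of a broadcast variable in Scala, assuming an existing SparkContext named sc; the lookup table and codes are made-up illustration data:

val countryNames = Map("IN" -> "India", "US" -> "United States")
val namesBc = sc.broadcast(countryNames)    // shipped to each worker node once, not with every task

val resolved = sc.parallelize(Seq("IN", "US", "IN"))
  .map(code => namesBc.value.getOrElse(code, "unknown"))   // read-only access on the workers

resolved.collect().foreach(println)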
DSTREAM
What is a DStream?
- Discretized Stream (DStream) is a sequence of Resilient Distributed Datasets (RDDs) that represent a stream of data.
- DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume.
- DStreams have two operations –
- Transformations that produce a new DStream.
- Output operations that write data to an external system.
An Apache Spark Discretized Stream is a collection of RDDs in sequence.
Essentially, it represents a stream of data, or a collection of RDDs divided into small batches. Moreover, DStreams are built on Spark RDDs, Spark's core data abstraction. This also allows Spark Streaming to integrate seamlessly with other Apache Spark components, for example Spark MLlib and Spark SQL.
Explain the kinds of transformations in Spark Streaming DStream.
Different transformations in DStream in Apache Spark Streaming are:
1-map(func) — Return a new DStream by passing each element of the source DStream through a function func.
2-flatMap(func) — Similar to map, but each input item can be mapped to 0 or more output items.
3-filter(func) — Return a new DStream by selecting only the records of the source DStream on which func returns true.
4-repartition(numPartitions) — Changes the level of parallelism in this DStream by creating more or fewer partitions.
5-union(otherStream) — Return a new DStream that contains the union of the elements in the source DStream and
otherDStream.
6-count() — Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
7-reduce(func)— Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one).
8-countByValue() — When called on a DStream of elements of type K, Return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.
9-reduceByKey(func, [numTasks])— When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function.
10-join(otherStream, [numTasks]) — When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
11-cogroup(otherStream, [numTasks]) — When called on DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.
12-transform(func) — Return a new DStream by applying an RDD-to-RDD function to every RDD of the source DStream.
13-updateStateByKey(func) — Return a new “state” DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key.
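A minimal word-count sketch chaining a few of these transformations, assuming a socket text source on localhost:9999 (for example, one started with nc -lk 9999):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("dstream-demo")
val ssc = new StreamingContext(conf, Seconds(5))

val lines = ssc.socketTextStream("localhost", 9999)       // input DStream
val words = lines.flatMap(_.split(" "))                   // flatMap transformation
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)   // map and reduceByKey transformations

counts.print()                                            // output operation
ssc.start()
ssc.awaitTermination()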
—————–
What is the starvation scenario in Spark Streaming?
In Apache Spark Streaming, the application requires at least two cores because one core will be used to run the receiver, and at least one more core is necessary for processing the received data. The system cannot process the data if the number of allocated cores is not greater than the number of receivers for the cluster.
Therefore, while running locally or while using a cluster, we need at least 2 cores to be allocated to our application.
Now let's come to the starvation scenario in Spark Streaming:
it refers to the type of problem where some tasks are not able to execute at all while others make progress, for example when all available cores are occupied by receivers and none are left for processing.
Explain the level of parallelism in Spark Streaming.
In order to reduce the processing time, one needs to increase the parallelism.
> In Spark Streaming, there are three ways to increase the parallelism:
(1) Increase the number of receivers: If there are too many records for a single receiver (single machine) to read in and distribute, that becomes a bottleneck, so we can increase the number of receivers depending on the scenario.
(2) Re-partition the received data: If one is not in a position to increase the number of receivers, redistribute the data by re-partitioning.
(3) Increase parallelism in aggregation: for aggregation operations such as reduceByKey(), a higher level of parallelism can be requested through the numTasks argument.
What are the different input sources for Spark Streaming?
> TCP Sockets
> Stream of files
> ActorStream
> Apache Kafka
> Apache Flume (Push-based receiver or Pull-based receiver)
> Kinesis
> Above are the various input sources (receivers) through which data streams can make their way into the streaming application.
PARQUET FILE
What do you understand by the Parquet file?
Parquet is a columnar format that is supported by several data processing systems. With it, Spark SQL performs both read as well as write operations. Having columnar storage has the following advantages:
- Able to fetch specific columns for access
- Consumes less space
- Follows type-specific encoding
- Limited I/O operations
- Offers better-summarized data
Parquet is a columnar format file supported by many other data processing systems. Spark SQL performs both read and write operations with Parquet files and considers it one of the best big data analytics formats so far.
It is a file of columnar format supported by numerous data processing systems. It is used to perform both read and write operations in Spark SQL and is one of the most desirable data analytics formats. Its columnar approach comes with the following advantages:
- Columnar storage limits I/O operations.
- Specific columns can be fetched as needed.
- Columnar storage consumes less space.
- Better-summarized information and type-specific encoding.
What is the advantage of a Parquet file?
Parquet file is a columnar format file that helps –
- Limit I/O operations
- Consumes less space
- Fetches only required columns.
- Spark SQL performs both read and write operations with Parquet files and considers it one of the best big data analytics formats so far.
What are the benefits of using parquet file-format in Apache Spark?
Parquet is a columnar format supported by many data processing systems. The benefits of having a columnar storage are –
1- Columnar storage limits IO operations.
2- Columnar storage can fetch specific columns that you need to access.
3-Columnar storage consumes less space.
4- Columnar storage gives better-summarized data and follows type-specific encoding.
Parquet is an open source file format for Hadoop. Parquet stores nested data structures in a flat columnar format; compared to a traditional approach where data is stored row by row, Parquet is more efficient in terms of storage and performance.
There are several advantages to columnar formats:
1)Organizing by column allows for better compression, as data is more homogeneous. The space savings are very noticeable at the scale of a Hadoop cluster.
2)I/O will be reduced as we can efficiently scan only a subset of the columns while reading the data. Better compression also reduces the bandwidth required to read the input.
3)As we store data of the same type in each column, we can use encoding better suited to the modern processors’ pipeline by making instruction branching more predictable
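A minimal read/write sketch, assuming an existing SparkSession named spark and made-up file paths:

val df = spark.read.json("/tmp/people.json")      // any structured source
df.write.parquet("/tmp/people.parquet")           // stored in columnar, compressed form

val people = spark.read.parquet("/tmp/people.parquet")
people.select("name").show()                      // only the selected column is scanned from disk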
ACCUMULATOR
What are Accumulators?
- Accumulators are write-only variables that are initialized once and sent to the workers.
- These workers update them based on the logic written and send them back to the driver, which aggregates or processes them based on that logic.
- Only the driver can access the accumulator's value. For tasks, accumulators are write-only. For example, an accumulator can be used to count the number of errors seen in an RDD across workers.
- Accumulators are sometimes described as Spark's offline debuggers.
- Spark accumulators are similar to Hadoop counters; you can use accumulators to count the number of events and track what is happening during a job.
- Only the driver program can read an accumulator value, not the tasks.
Accumulators are write-only variables that are initialized once and sent to the workers.
On the basis of the logic written, these workers update them and send them back to the driver, which processes them on the basis of that logic.
Only the driver can access the accumulator's value.
- Accumulators provide a simple syntax for aggregating values from worker nodes back to the driver program.
- One of the most common uses of accumulators is to count events that occur during job execution for debugging purposes.
What are Accumulators in Spark?
Accumulators are write-only variables that are initialized once and sent to the workers. These workers update them based on the logic written and send them back to the driver, which aggregates or processes them based on that logic.
Only the driver can access the accumulator's value. For tasks, accumulators are write-only. For instance, they are used to count the number of errors seen in an RDD across workers.
- An accumulator is a good way to continuously gather data from a Spark process such as the progress of an application.
- The accumulator receives data from all the nodes in parallel efficiently.
- Therefore, only operations for which the order of operands does not matter are valid for accumulators. Such functions are generally known as associative operations.
- An accumulator is a kind of central variable to which every node can emit data.
What are broadcast variables and accumulators?
Broadcast variable:
If we have a huge dataset, instead of shipping a copy of the data set for each task, we can use a broadcast variable, which is copied to each node once and shares the same data for every task in that node. Broadcast variables help to provide a large data set to each node.
Accumulator:
Spark functions use variables defined in the driver program, and local copies of the variables are generated on the workers. Accumulators are shared variables that help to update variables in parallel during execution and share the results from the workers back to the driver.
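A minimal sketch of an accumulator used to count bad records, assuming an existing SparkContext named sc; the input lines are made-up illustration data:

val badRecords = sc.longAccumulator("badRecords")

val lines = sc.parallelize(Seq("42", "7", "oops", "13"))
val numbers = lines.flatMap { s =>
  if (s.forall(_.isDigit)) Some(s.toInt)
  else { badRecords.add(1); None }                  // tasks can only add to the accumulator
}

numbers.collect()                                   // the action triggers the computation
println(s"Bad records seen: ${badRecords.value}")   // only the driver reads the total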
SPARK SESSION
What is the need for SparkSession in Spark?
- Starting from Apache Spark 2.0, Spark Session is the new entry point for Spark applications.
- Prior to 2.0, SparkContext was the entry point for Spark jobs. RDD was one of the main APIs then, and it was created and manipulated using the Spark Context. For the other APIs, different contexts were required: for SQL, an SQL Context was required; for Streaming, a Streaming Context was required; for Hive, a Hive Context was required.
- But from 2.0, RDD along with DataSet and its subset DataFrame APIs are becoming the standard APIs and are a basic unit of data abstraction in Spark. All of the user defined code will be written and evaluated against the DataSet and DataFrame APIs as well as RDD.
- So, there was a need for a new entry point built for handling these new APIs, which is why Spark Session was introduced. Spark Session also includes all the APIs available in the different contexts: Spark Context, SQL Context, Streaming Context, Hive Context.
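A minimal sketch of creating a SparkSession in Spark 2.0+; the app name and local master are assumptions:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-session-demo")
  .master("local[*]")
  .getOrCreate()               // enableHiveSupport() can be added to replace the old HiveContext

val sc = spark.sparkContext                       // the underlying SparkContext is still reachable
val df = spark.range(10).toDF("id")               // Dataset/DataFrame API
df.createOrReplaceTempView("ids")
spark.sql("SELECT count(*) FROM ids").show()      // SQL, formerly via SQLContext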