SPARK-7

SPARKSQL

DATASET

DATAFRAME

PARTITION

BROADCAST VARIABLE

DSTREAM

PARQUET FILE

ACCUMULATOR

SPARK SESSION 

 


 


SPARKSQL

 


 

 

[Figure: Features of Spark SQL]

 

 


 

[Figure: Spark SQL optimization]

What is “Spark SQL”?

 

  • Spark SQL is a Spark interface to work with structured as well as semi-structured data.
  • It has the capability to load data from multiple structured sources like “text files”, JSON files, Parquet files, among others.
  • Spark SQL provides a special type of RDD called SchemaRDD.
  • These are row objects, where each object represents a record.
  • Spark SQL (which evolved from the earlier Shark project) is the module introduced in Spark to work with structured data and perform structured data processing.
  • Through this module, Spark executes relational SQL queries on the data.
  • The core of the component supports an altogether different RDD called SchemaRDD, composed of row objects and schema objects that define the data type of each column in the row.
  • It is similar to a table in relational database.
  • Spark SQL is a module in Apache Spark that integrates relational processing (e.g., declarative queries and optimized storage) with Spark’s procedural programming API. Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API. Second, it includes a highly extensible optimizer, Catalyst. Big data applications require a mix of processing techniques, data sources and storage formats. The earliest systems designed for these workloads, such as MapReduce, gave users a powerful but low-level procedural programming interface. Programming such systems was onerous and required manual optimization by the user to achieve high performance. As a result, multiple new systems sought to provide a more productive user experience by offering relational interfaces to big data. Systems like Pig, Hive and Shark all take advantage of declarative queries to provide richer automatic optimizations.

Can we do real-time processing using Spark SQL?

Not directly but we can register an existing RDD as a SQL table and trigger SQL queries on top of that.
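As a minimal sketch (assuming Spark 2.x with a local SparkSession; the Person case class and the sample rows are only illustrative), an existing RDD can be given a schema, registered as a temporary SQL table, and queried like this:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RddAsSqlTable").master("local[*]").getOrCreate()
import spark.implicits._

case class Person(name: String, age: Int)
val rdd = spark.sparkContext.parallelize(Seq(Person("Alice", 30), Person("Bob", 17)))

val df = rdd.toDF()                              // give the RDD a schema
df.createOrReplaceTempView("people")             // register it as a SQL table (temp view)
spark.sql("SELECT name FROM people WHERE age >= 18").show()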

 

 

 

List the functions of Spark SQL.

  • Loading data from a variety of structured sources
  • Querying data using SQL statements, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC). For instance, using business intelligence tools like Tableau
  • Providing rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, expose custom functions in SQL, and more

How is SQL implemented in Spark?

Spark SQL integrates relational processing with Spark’s functional programming API. It supports querying data using SQL as well as the Hive Query Language. In addition, it supports many data sources and makes it possible to weave SQL queries with code transformations, which results in a very powerful tool.

Spark SQL libraries:

The Data Source API, the DataFrame API, the SQL interpreter and optimizer, and the SQL service.

DATASET

Basically, there are three different ways to represent data in Apache Spark: we can represent it through an RDD, we can use DataFrames, or we can use Datasets. Let’s discuss each of them in detail:

1. RDD
RDD refers to “Resilient Distributed Dataset”. RDD is the core abstraction and fundamental data structure of Apache Spark. It is an immutable collection of objects which is computed on the different nodes of the cluster. Because RDDs are immutable, we cannot make changes to an existing RDD; instead we apply operations such as transformations and actions on them. RDDs perform in-memory computations on large clusters in a fault-tolerant manner. Basically, there are three ways to create RDDs in Spark: from data in stable storage, from other RDDs, and by parallelizing an already existing collection in the driver program.
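The three ways of creating RDDs mentioned above can be sketched as follows (assuming an existing SparkContext named sc, as in spark-shell; the HDFS path is only illustrative):

val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))        // parallelizing an existing collection
val fromStorage    = sc.textFile("hdfs:///data/wc-data.txt")   // data in stable storage
val fromExisting   = fromCollection.map(_ * 2)                 // transformation on an existing RDD
fromExisting.collect()                                         // action: Array(2, 4, 6, 8, 10)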

2. DataFrame
In a DataFrame, data is organized into named columns, similar to a table in a relational database. A DataFrame is also an immutable distributed collection of data. It allows developers to impose a structure onto a distributed collection of data, allowing a higher-level abstraction.

3. Spark Dataset APIs
It is an extension of the DataFrame API. It provides a type-safe, object-oriented programming interface. It takes advantage of Spark’s Catalyst optimizer by exposing data fields and expressions to the query planner.

Different ways of representing data in Spark:

RDD: Spark revolves around the concept of a resilient distributed dataset (RDD),
which is a fault-tolerant collection of elements that can be operated on in parallel.
There are three ways to create RDDs:
1) parallelizing an existing collection in your driver program
2) referencing a dataset in an external storage system,
such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat
3) creating an RDD from already existing RDDs:
by applying a transformation operation on existing RDDs we can create a new RDD.

DataFrame: a DataFrame is an abstraction which gives a schema view of data.
That means it gives us a view of data as columns with column names and type info;
we can think of data in a data frame like a table in a database.
Like RDD, execution in DataFrame is also lazily triggered. It offers a huge performance
improvement over RDDs because of two powerful features:
1. Custom memory management: data is stored in off-heap memory in binary format.
This saves a lot of memory space, and there is no garbage collection overhead involved.
By knowing the schema of data in advance and storing it efficiently in binary format,
expensive Java serialization is also avoided.
2. Optimized execution plans: query plans are created for execution using the Spark Catalyst
optimizer. After an optimized execution plan is prepared through several steps,
the final execution still happens internally on RDDs, but that is completely hidden from the
users.

Dataset: Datasets in Apache Spark are an extension of the DataFrame API which provides a
type-safe, object-oriented programming interface.
A Dataset takes advantage of Spark’s Catalyst optimizer by exposing expressions and
data fields to a query planner.

Dataset and DataFrame internally do their final execution on RDD objects only, but the difference
is that users do not write code to create the RDD collections and have no direct control over the RDDs.

Define Dataset in Apache Spark.

A Dataset is an immutable collection of objects that are mapped to a relational schema. They are strongly typed in nature.
At the core of the Dataset API there is an encoder, which is responsible for converting between JVM objects and
the tabular representation. The tabular representation is stored using Spark’s internal binary format, which allows operations to be carried out on serialized data and improves memory utilization. The API also supports automatically generating encoders for a wide variety of types, including primitive types (e.g. String, Integer, Long) and Scala case classes. It offers many functional transformations (e.g. map, flatMap, filter).

1) Static typing –
With the static typing feature of Dataset, a developer can catch errors at compile time (which saves time and cost).
2) Run-time safety –
Dataset APIs are all expressed as lambda functions and JVM typed objects, so any mismatch of typed parameters is
detected at compile time. Analysis errors can also be detected at compile time when using Datasets,
hence saving developer time and cost.
3) Performance and optimization –
Dataset APIs are built on top of the Spark SQL engine; they use Catalyst to generate an optimized logical and physical query plan, providing space and speed efficiency.
4) For processing demands like high-level expressions, filters, maps, aggregation, averages, sums,
SQL queries, columnar access and also for use of lambda functions on semi-structured data, Datasets are best.
5) Datasets provide rich semantics, high-level abstractions, and domain-specific APIs.
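A minimal Dataset sketch, assuming a SparkSession named spark (the Employee case class and values are only illustrative), showing the typed, compile-time-checked API described above:

import spark.implicits._

case class Employee(name: String, salary: Double)

val ds = Seq(Employee("Asha", 52000.0), Employee("Ravi", 61000.0)).toDS()

// Typed transformations expressed as lambdas over Employee objects
val wellPaid = ds.filter(e => e.salary > 55000.0)
wellPaid.map(_.name).show()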

DATAFRAME

 



 

Explain DataFrames in Spark?
DataFrame consists of two words, data and frame, meaning data has to fit into some kind of frame. We can understand a frame as being like the schema of a relational database.

In Spark, a DataFrame is a collection of data distributed over the network with some schema. We can understand it as data formatted in a row/column manner. A DataFrame can be created from Hive data, JSON files, CSV, structured data, or raw data that can be framed as structured data. We can also create a DataFrame from an RDD if some schema can be applied to that RDD.
A temporary view or table can also be created from a DataFrame, as it has data and a schema. We can also run SQL queries on the created table/view to get results faster.
It is also evaluated lazily (lazy evaluation) for better resource utilization.
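A short sketch of the above, assuming a SparkSession named spark (the file path, column names and sample data are only illustrative):

import spark.implicits._

val jsonDF = spark.read.json("employees.json")                 // DataFrame from a JSON source

val rdd    = spark.sparkContext.parallelize(Seq(("HR", 4), ("Sales", 9)))
val rddDF  = rdd.toDF("dept", "headcount")                     // apply a schema to an RDD

rddDF.createOrReplaceTempView("dept_counts")                   // temporary view for SQL
spark.sql("SELECT dept FROM dept_counts WHERE headcount > 5").show()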
Following are the Benefits of DataFrames.

1. DataFrame is a distributed collection of data. In DataFrames, data is organized into named columns.

2. They are conceptually similar to a table in a relational database. Also, they have richer optimizations.

3. DataFrames empower SQL queries and the DataFrame API.

4. We can process both structured and unstructured data formats through it, such as Avro, CSV, Elasticsearch, and Cassandra. It also deals with storage systems such as HDFS, Hive tables, MySQL, etc.

5. In DataFrames, Catalyst supports optimization (the Catalyst optimizer). There are general libraries available to represent trees. DataFrame uses Catalyst tree transformation in four phases:

– Analyze logical plan to solve references
– Logical plan optimization
– Physical planning
– Code generation to compile part of a query to Java bytecode.

6. The DataFrame APIs are available in various programming languages, for example Java, Scala, Python, and R.

7. It provides Hive compatibility. We can run unmodified Hive queries on existing Hive warehouse.

8. It can scale from kilobytes of data on a single laptop to petabytes of data on a large cluster.

9. DataFrame provides easy integration with Big data tools and framework via Spark core.


PARTITION


 

What are Partitions?

  • A partition is a logical division of the data; this idea is derived from MapReduce (split).
  • Logical units of data are derived specifically to process the data. Small chunks of data also support scalability and speed up the process. Input data, intermediate data, and output data are all partitioned RDDs.
  • As the name suggests, partition is a smaller and logical division of data  similar to ‘split’ in MapReduce.
  • Partitioning is the process to derive logical units of data to speed up the processing process.
  • Everything in Spark is a partitioned RDD.
  • It is a smaller division of data.
  • It represents a logical portion of a big distributed data set.
  • Partitioning is the procedure used to derive logical units of data.
  • Spark handles data with these partitions and aid in parallelizing distributed processing with negligible network traffic.
  • Spark attempts to read data from close nodes into an RDD.
  • Meanwhile, since Spark typically accesses distributed partitioned data, it creates partitions to hold the data chunks and to improve the performance of transformations.
  • Everything in Spark is a partitioned RDD.
  • Partitions (also known as slices earlier) are the parts of RDD.
  • Each partition is generally located on a different machine.
  • Spark runs a task for each partition during the computation.
  • If you are loading data from HDFS using textFile(), it would create one partition per block of HDFS(64MB typically).
  • Though you can change the number of partitions by specifying the second argument in the textFile() function.
  • If you are loading data from an existing memory using sc.parallelize(), you can enforce your number of partitions by passing the second argument.
  • You can change the number of partitions later using repartition().
  • If you want certain operations to consume whole partitions at a time, you can use mapPartitions().
  • Partitions are done in order to simplify the data as they are the logical distribution of entire data.
  • It is similar to the split in MapReduce.
  • In order to enhance the processing speed, this logical distribution is carried out.
  • Everything in Apache Spark is a partitioned RDD.
  • A partition is a small part of a bigger chunk of data. Partitions are based on logic – they are used in Spark to manage data so that the minimum network traffic is incurred.
  • This being another one of those Spark interview questions that allow some elaboration, you could also add that the process of partitioning is used to derive the before-mentioned small pieces of data from larger chunks, thus allowing the network to run at the highest speed possible.

How does Spark partition the data?

  • Spark uses the MapReduce API (InputFormat) to partition the data. In the input format we can create a number of partitions. By default, the HDFS block size is the partition size (for best performance), but it is possible to change the partition size, like a split.

What are Partitions?

A partition is a logical division of the data; this idea is derived from MapReduce (split). Logical units of data are specifically derived to process the data. Small chunks of data also support scalability and speed up the process. Input data, intermediate data, and output data are all partitioned RDDs.

Why are partitions immutable?

Every transformation generates a new partition. Partitions use the HDFS API so that a partition is immutable, distributed, and fault-tolerant. A partition is also aware of data locality.

What is the role of coalesce() and repartition() in Spark?

  • Both coalesce and repartition are used to modify the number of partitions in an RDD but Coalesce avoids full shuffle.
  • If you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions and this does not require a shuffle.
  • Repartition performs a coalesce with a shuffle. Repartition will result in the specified number of partitions with the data distributed using a hash partitioner.

How do you specify the number of partitions while creating an RDD? What are the functions?

  • You can specify the number of partitions while creating an RDD either by using sc.textFile or by using the parallelize function, as follows:
  • val rdd = sc.parallelize(data, 4)
  • val data = sc.textFile("path", 4)

What is the difference between repartition and coalesce?

a. Using repartition, Spark can increase or decrease the number of partitions of data.
b. Using coalesce, Spark can only reduce the number of partitions of the input data.
c. Repartition is less efficient than coalesce when reducing the number of partitions, because it always performs a full shuffle (see the sketch below).
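A small sketch of the difference, assuming an existing SparkContext named sc (the sizes and partition counts are only illustrative):

val rdd = sc.parallelize(1 to 1000000, 1000)   // start with 1000 partitions

val fewer = rdd.coalesce(100)      // no full shuffle: each of the 100 new partitions absorbs ~10 old ones
val more  = rdd.repartition(2000)  // full shuffle: data is hash-distributed across 2000 partitions

println(fewer.getNumPartitions)    // 100
println(more.getNumPartitions)     // 2000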

PARTITIONS:

Partitions, also known as ‘splits’ in HDFS, are logical chunks of a data set which may be in the range of petabytes or terabytes and are distributed across the cluster.
By default, Spark creates one partition for each block of the file (for HDFS).
The default block size for an HDFS block is 64 MB (Hadoop Version 1) / 128 MB (Hadoop Version 2), and so is the split size.
However, one can explicitly specify the number of partitions to be created.
Partitions are basically used to speed up the data processing.

PARTITIONER :

An object that defines how the elements in a key-value pair RDD are partitioned by key. Maps each key to a partition ID, from 0 to (number of partitions – 1)
Partitioner captures the data distribution at the output. A scheduler can optimize future operations based on the type of partitioner (i.e. if we perform any operation, say a transformation or action, which requires shuffling across nodes, we may need the partitioner; see for example the reduceByKey() transformation).
Basically there are three types of partitioners in Spark:
(1) Hash-Partitioner (2) Range-Partitioner (3) One can make its Custom Partitioner

Property Name : spark.default.parallelism
Default Value: For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:
•Local mode: number of cores on the local machine
•Mesos fine-grained mode: 8
•Others: total number of cores on all executor nodes or 2, whichever is larger
Meaning : Default number of partitions in RDDs returned by transformations like join.

Partition in Spark is similar to split in HDFS. A partition in Spark is a logical division of data stored on a node in the cluster. Partitions are the basic units of parallelism in Apache Spark. RDDs are a collection of partitions. When an action is executed, a task is launched per partition.

By default, partitions are automatically created by the framework. However, the number of partitions in Spark are configurable to suit the needs.
For the number of partitions, if spark.default.parallelism is set, then we should use the value from SparkContext defaultParallelism, otherwise we should use the maximum number of upstream partitions. Unless spark.default.parallelism is set, the number of partitions will be the same as that of the largest upstream RDD, as this is least likely to cause out-of-memory errors.

A partitioner is an object that defines how the elements in a key-value pair RDD are partitioned by key, maps each key to a partition ID from 0 to numPartitions – 1. It captures the data distribution at the output. With the help of partitioner, the scheduler can optimize the future operations. The contract of partitioner ensures that records for a given key have to reside on a single partition.
We should choose which partitioner to use for cogroup-like operations. If any of the RDDs already has a partitioner, we should choose that one. Otherwise, we use a default HashPartitioner.

There are three types of partitioners in Spark :
a) Hash Partitioner b) Range Partitioner c) Custom Partitioner
Hash – Partitioner : Hash- partitioning attempts to spread the data evenly across various partitions based on the key.
Range – Partitioner : In Range- Partitioning method , tuples having keys with same range will appear on the same machine.

RDDs can be created with specific partitioning in two ways :
i) Providing explicit partitioner by calling partitionBy method on an RDD
ii) Applying transformations that return RDDs with specific partitioners .
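Both ways can be sketched as follows, assuming an existing SparkContext named sc (the keys and values are only illustrative):

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))

// (i) explicit partitioner via partitionBy
val hashed = pairs.partitionBy(new HashPartitioner(4)).persist()

// (ii) transformations such as reduceByKey return RDDs that carry a partitioner
val counts = hashed.reduceByKey(_ + _)
println(counts.partitioner)        // Some(org.apache.spark.HashPartitioner@...)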

By default, how many partitions are created in an RDD in Apache Spark?
By Default, Spark creates one Partition for each block of the file (For HDFS)
Default block size for HDFS block is 64 MB (Hadoop Version 1) / 128 MB (Hadoop Version 2).
However, one can explicitly specify the number of partitions to be created.
Example1:

The number of partitions is not specified:
val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt")
Example2:

The following code creates an RDD with 10 partitions, since we specify the number of partitions:
val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt", 10)
One can query the number of partitions in the following way:

rdd1.partitions.length

OR

rdd1.getNumPartitions
In the best-case scenario, we should create the RDD in the following way:

number of cores in the cluster = number of partitions

val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt")

Consider that the size of wc-data.txt is 1280 MB and the default block size is 128 MB. So there will be 10 blocks created and 10 default partitions (1 per block).

For better performance, we can increase the number of partitions on each block. The code below will create 20 partitions on 10 blocks (2 partitions per block). Performance will be improved, but we need to make sure that each worker node has at least 2 cores.

val rdd1 = sc.textFile("/home/hdadmin/wc-data.txt", 20)

BROADCAST VARIABLE

Why is there a need for broadcast variables when working with Apache Spark?

  • These are read only variables, present in-memory cache on every machine.
  • When working with Spark, usage of broadcast variables eliminates the necessity to ship copies of a variable for every task, so data can be processed faster.
  • Broadcast variables help in storing a lookup table inside the memory which enhances the retrieval efficiency when compared to an RDD lookup ().
  • Quite often we have to send certain data such as a machine learning model to every node.
  • The most efficient way of sending the data to all of the nodes is by the use of broadcast variables.
  • Even though you could refer to a regular variable, which would get copied to every task, a broadcast variable is far more efficient.
  • It is loaded into memory only on the nodes where it is required, and only when it is required, not all the time.
  • It is sort of a read-only cache similar to distributed cache provided by Hadoop MapReduce.
  • Broadcast variables let the programmer keep a read-only variable cached on each machine, rather than shipping a copy of it with tasks.
  • Spark supports 2 types of shared variables called broadcast variables (like Hadoop distributed cache) and accumulators (like Hadoop counters).
  • Broadcast variables are stored as Array Buffers, which send read-only values to worker nodes.
  • Rather than shipping a copy of a variable with tasks, a broadcast variable helps in keeping a read-only cached version of the variable on each machine.
  • Broadcast variables are also used to provide every node with a copy of a large input dataset.
  • Apache Spark tries to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
  • Using broadcast variables eradicates the need of shipping copies of a variable for each task. Hence, data can be processed quickly. Compared to an RDD lookup(), broadcast variables assist in storing a lookup table inside the memory that enhances the retrieval efficiency.
  • Spark’s second type of shared variable, broadcast variables, allows the program to efficiently send a large, read-only value to all the worker nodes for use in one or more Spark operations. They come in handy, for example, if your application needs to send a large, read-only lookup table to all the nodes.
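A minimal sketch of a broadcast lookup table, assuming an existing SparkContext named sc (the lookup contents are only illustrative):

val countryLookup = Map("IN" -> "India", "US" -> "United States", "DE" -> "Germany")
val bcLookup = sc.broadcast(countryLookup)          // shipped once per executor, read-only

val codes = sc.parallelize(Seq("IN", "DE", "IN", "US"))
val names = codes.map(code => bcLookup.value.getOrElse(code, "Unknown"))
names.collect().foreach(println)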

Explain shared variable in Spark.
What is need of Shared variable in Apache Spark?
Shared variables are nothing but the variables that can be used in parallel operations. By default, when Apache Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums

DSTREAM


 

What is a DStream?

  • Discretized Stream is a sequence of Resilient Distributed Datasets that represent a stream of data.
  • DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume.
  • DStreams have two operations –
    • Transformations that produce a new DStream.
    • Output operations that write data to an external system.

Apache Spark Discretized Stream (DStream) is a collection of RDDs in sequence.

Essentially, it represents a stream of data or a collection of RDDs divided into small batches. Moreover, DStreams are built on Spark RDDs, Spark’s core data abstraction. This also allows Spark Streaming to integrate seamlessly with other Apache Spark components, such as Spark MLlib and Spark SQL.

Explain the kinds of transformations on a DStream in Spark Streaming.
Different transformations in DStream in Apache Spark Streaming are:

1-map(func) — Return a new DStream by passing each element of the source DStream through a function func.

2-flatMap(func) — Similar to map, but each input item can be mapped to 0 or more output items.

3-filter(func) — Return a new DStream by selecting only the records of the source DStream on which func returns true.

4-repartition(numPartitions) — Changes the level of parallelism in this DStream by creating more or fewer partitions.

5-union(otherStream) — Return a new DStream that contains the union of the elements in the source DStream and
otherDStream.

6-count() — Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.

7-reduce(func)— Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one).

8-countByValue() — When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.

9-reduceByKey(func, [numTasks])— When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function.

10-join(otherStream, [numTasks]) — When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.

11-cogroup(otherStream, [numTasks]) — When called on DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.

12-transform(func) — Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream.

13-updateStateByKey(func) — Return a new “state” DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key.
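Several of these transformations appear in the classic socket word-count sketch below (the host, port and batch interval are only illustrative assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamWordCount") // at least 2 cores: receiver + processing
val ssc  = new StreamingContext(conf, Seconds(5))

val lines  = ssc.socketTextStream("localhost", 9999)
val words  = lines.flatMap(_.split(" "))                    // flatMap transformation
val counts = words.map(word => (word, 1)).reduceByKey(_ + _) // map + reduceByKey transformations
counts.print()                                              // output operation

ssc.start()
ssc.awaitTermination()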
What is the starvation scenario in Spark Streaming?
In Spark Streaming, the application requires at least two cores because one core is used to run the receiver and at least one core is necessary for processing the received data. The system cannot process the data if the number of allocated cores is not greater than the number of receivers in the cluster.

Therefore, while running locally or on a cluster, we need at least 2 cores to be allocated to our application.

Now let’s come to the starvation scenario in Spark Streaming:
it refers to the type of problem where some tasks are not able to execute at all while others make progress, for example when the receivers occupy all the available cores and no core is left for processing.
Explain the level of parallelism in spark streaming.
In order to reduce the processing time, one needs to increase the parallelism.
> In Spark Streaming, there are three ways to increase the parallelism:
(1) Increase the number of receivers: if there are too many records for a single receiver (single machine) to read in and distribute, that becomes a bottleneck, so we can increase the number of receivers depending on the scenario.
(2) Re-partition the received data: if one is not in a position to increase the number of receivers, redistribute the data by re-partitioning.
(3) Increase parallelism in aggregation: pass a higher level of parallelism (for example the numTasks argument of reduceByKey) to aggregation operations.
What are the different input sources for Spark Streaming?
> TCP Sockets
> Stream of files
> ActorStream
> Apache Kafka
> Apache Flume (Push-based receiver or Pull-based receiver)
> Kinesis
> The above are the various input sources (receivers) through which a data stream can make its way into a streaming application.

 

 

PARQUET FILE

 

What do you understand by the Parquet file?
Parquet is a columnar format that is supported by several data processing systems. With it, Spark SQL performs both read as well as write operations. Having columnar storage has the following advantages:

  • Able to fetch specific columns for access
  • Consumes less space
  • Follows type-specific encoding
  • Limited I/O operations
  • Offers better-summarized data

Parquet is a columnar format file supported by many other data processing systems. Spark SQL performs both read and write operations with Parquet files and considers it to be one of the best big data analytics formats so far.


It is a file of columnar format supported by numerous data processing systems. It is used to perform both read and write operations in Spark SQL. It is one of the most desirable data analytics formats. Its columnar approach comes with the following advantages:

  • Columnar storage confines IO operations.
  • Fetching of specific columns.
  • Columnar storage consumes less space.
  • Better-summarized information and type-specific encoding.

What is the advantage of a Parquet file?

Parquet file is a columnar format file that helps –

  • Limit I/O operations
  • Consumes less space
  • Fetches only required columns.
  • Spark SQL performs both read and write operations with Parquet files and considers it to be one of the best big data analytics formats so far.

What are the benefits of using parquet file-format in Apache Spark?

Parquet is a columnar format supported by many data processing systems. The benefits of having columnar storage are –

1- Columnar storage limits IO operations.

2- Columnar storage can fetch specific columns that you need to access.

3-Columnar storage consumes less space.

4- Columnar storage gives better-summarized data and follows type-specific encoding.
Parquet is an open source file format for Hadoop. Parquet stores nested data structures in a flat columnar format; compared to a traditional approach where data is stored in a row-oriented manner, Parquet is more efficient in terms of storage and performance.

There are several advantages to columnar formats:

1)Organizing by column allows for better compression, as data is more homogeneous. The space savings are very noticeable at the scale of a Hadoop cluster.
2)I/O will be reduced as we can efficiently scan only a subset of the columns while reading the data. Better compression also reduces the bandwidth required to read the input.
3)As we store data of the same type in each column, we can use encoding better suited to the modern processors’ pipeline by making instruction branching more predictable.
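A short read/write sketch with Spark SQL, assuming a SparkSession named spark (paths, column names and data are only illustrative):

import spark.implicits._

val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")
df.write.mode("overwrite").parquet("/tmp/people.parquet")   // stored in columnar, compressed form

val parquetDF = spark.read.parquet("/tmp/people.parquet")
parquetDF.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 26").show()  // only the needed column is scanned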

ACCUMULATOR

What are Accumulators?

  • Accumulators are the write only variables which are initialized once and sent to the workers.
  • These workers will update based on the logic written and sent back to the driver which will aggregate or process based on the logic.
  • Only the driver can access the accumulator’s value. For tasks, accumulators are write-only. For example, an accumulator can be used to count the number of errors seen in an RDD across workers.
  • Accumulators are sometimes described as Spark’s off-line debuggers.
  • Spark accumulators are similar to Hadoop counters, to count the number of events and what’s happening during job you can use accumulators.
  • Only the driver program can read an accumulator value, not the tasks.
  • The write-only variables which are initialized once and sent to the workers are accumulators.

    Based on the logic written, these workers update them and send them back to the driver, which processes the result on the basis of that logic.

    Only the driver can access the accumulator’s value.

  • Accumulators provide a simple syntax for aggregating values from worker nodes back to the driver program.
  • One of the most common uses of accumulators is to count events that occur during job execution for debugging purposes.

What are Accumulators in Spark?

Accumulators are sometimes described as Spark’s off-line debuggers. Spark accumulators are similar to Hadoop counters: to count the number of events and to see what is happening during a job, you can use accumulators. Only the driver program can read an accumulator’s value, not the tasks.

 

Accumulators are write-only variables which are initialized once and sent to the workers. These workers update them based on the logic written, and the updates are sent back to the driver, which aggregates or processes them based on that logic.

Only the driver can access the accumulator’s value. For tasks, accumulators are write-only. For instance, they are used to count the number of errors seen in an RDD across workers.

 

  • An accumulator is a good way to continuously gather data from a Spark process such as the progress of an application.
  • The accumulator receives data from all the nodes in parallel efficiently.
  • Therefore, only operations where the order of the operands does not matter are valid for accumulators. Such functions are generally known as associative operations.
  • An accumulator is a kind of central variable to which every node can emit data.
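A minimal sketch of counting bad records with an accumulator (Spark 2.x API), assuming an existing SparkContext named sc; the sample data and the "bad record" rule are only illustrative:

val badRecords = sc.longAccumulator("badRecords")

val lines = sc.parallelize(Seq("10", "20", "oops", "30"))
val parsed = lines.flatMap { s =>
  try Some(s.toInt)
  catch { case _: NumberFormatException => badRecords.add(1); None }
}

parsed.count()                 // an action forces the tasks to run
println(badRecords.value)      // read on the driver only: 1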

What are broadcast variables and accumulators?

Broadcast variable:

If we have an enormous dataset, instead of transferring a copy of the data set for each task, we can use a broadcast variable which is copied to each node only once and shares the same data for every task on that node. Broadcast variables help to provide a large data set to every node.

Accumulator:

Spark functions use variables defined in the driver program, and local copies of the variables are produced on the workers. Accumulators are shared variables which help to update variables in parallel during execution and share the results from the workers back to the driver.

SPARK SESSION 

What is the need for SparkSession in Spark?

  • Starting from Apache Spark 2.0, Spark Session is the new entry point for Spark applications.
  • Prior to 2.0, SparkContext was the entry point for Spark jobs. RDD was one of the main APIs then, and it was created and manipulated using the Spark Context. For all other APIs, different contexts were required: for SQL, an SQLContext was required; for Streaming, a StreamingContext was required; for Hive, a HiveContext was required.
  • But from 2.0, RDD along with the Dataset and DataFrame APIs are becoming the standard APIs and the basic units of data abstraction in Spark. All user-defined code is written and evaluated against the Dataset and DataFrame APIs as well as RDDs.
  • So, there was a need for a new entry point built for handling these new APIs, which is why SparkSession has been introduced. SparkSession also includes all the APIs available in the different contexts – SparkContext, SQLContext, StreamingContext, and HiveContext.
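A minimal SparkSession sketch (Spark 2.x+); the application name, master URL and file path are only illustrative, and enableHiveSupport() requires Hive classes on the classpath:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MyApp")
  .master("local[*]")
  .enableHiveSupport()                    // optional: replaces the old HiveContext
  .getOrCreate()

val sc = spark.sparkContext               // the underlying SparkContext is still available
val df = spark.read.json("people.json")   // SQLContext-style functionality through the session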