Spark Streaming 









What is Spark Streaming?

  • When data flows continuously and you want to process it as soon as it arrives, you can use Spark Streaming.
  • It is the API for stream processing of live data, provided as an extension of the core Spark API.
  • Data can flow in from Kafka, Flume, Kinesis, TCP sockets, HDFS, and similar sources, and you can apply complex processing to it before pushing it to its destination.
  • Destinations can be file systems, databases, or live dashboards.
  • It is similar to batch processing, because the input stream is divided into small batches.
  • It enables high-throughput, fault-tolerant stream processing of live data streams.
  • The fundamental stream unit is the DStream, which is essentially a series of RDDs (Resilient Distributed Datasets) used to process the real-time data.

How do you create a stream in Spark?

a. DStream
b. Structured Streaming
c. Direct stream

Explain the Spark Streaming architecture.

  • Spark Streaming uses a “micro-batch” architecture, where Spark Streaming receives data from various input sources and groups it into small batches.
  • New batches are created at regular time intervals.
  • At the beginning of each time interval a new batch is created, and any data that arrives during that interval gets added to that batch.
  • At the end of the time interval the batch is done growing.
  • The size of the time intervals is determined by a parameter called the batch interval.
  • Each input batch forms an RDD and is processed using Spark jobs to create other RDDs.
  • The processed results can then be pushed out to external systems in batches.
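
The batching rule described above can be sketched in a few lines. This is a plain-Python illustration, not the Spark API: the function name `micro_batches`, the `(timestamp, value)` event format, and the sample data are all assumptions made for the example.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into batches of `batch_interval` seconds.

    Hypothetical helper: a plain-Python stand-in for micro-batching,
    not a Spark Streaming function.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        # Integer division selects the time interval the event arrived in.
        batch_index = int(timestamp // batch_interval)
        batches[batch_index].append(value)
    # Return the batches in time order. In this sketch, intervals that
    # received no data are simply skipped.
    return [batches[i] for i in sorted(batches)]


# Made-up events: (arrival time in seconds, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
print(micro_batches(events, batch_interval=1.0))
```

With a one-second batch interval, the events that arrive within the same second end up in the same batch. A larger interval produces fewer, bigger batches.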


How does the Spark Streaming API work?

  • The programmer sets a batch interval in the configuration; all data that arrives in Spark within that interval is grouped into one batch.
  • The input stream (DStream) flows into Spark Streaming.
  • The framework breaks the stream into small chunks called batches and feeds them to the Spark engine for processing.
  • The Spark Streaming API passes those batches to the core engine.
  • The core engine generates the final results as a stream of batches.
  • The output is therefore also in batches. This lets the same engine process both streaming data and batch data.

How does Spark Streaming work?

  • Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
  • Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.
  • DStreams can be created either from input data streams from sources such as Kafka and Flume, or by applying high-level operations on other DStreams.
  • Internally, a DStream is represented as a sequence of RDDs.

When do we use Spark Streaming?

Spark Streaming is an API for real-time processing of streaming data.

Spark Streaming gathers streaming data from different sources, such as:

  • web server log files,
  • social media data,
  • stock market data, or
  • Hadoop ecosystem tools such as Flume and Kafka.

  • Spark Streaming is responsible for scalable, uninterrupted processing of streaming data.
  • It is an extension of the core Spark API and is commonly used by Big Data developers.

Name some sources from which the Spark Streaming component can process real-time data.

  • Apache Flume
  • Apache Kafka
  • Amazon Kinesis
  • Apache NiFi

 Name some companies that are already using Spark Streaming.

  • Uber
  • Netflix
  • Pinterest

What is the bottom layer of abstraction in the Spark Streaming API?

  • The DStream (discretized stream) is the basic abstraction offered by Spark Streaming.

Please explain DStreams in Spark.

  • DStream is short for Discretized Stream.
  • It is the basic abstraction offered by Spark Streaming and represents a continuous stream of data.
  • A DStream is obtained either directly from a data source or as a processed stream generated by transforming an input stream.
  • A DStream is represented by a continuous series of RDDs, where each RDD contains the data from one time interval.
  • An operation applied to a DStream is analogous to applying the same operation to each of the underlying RDDs.

A DStream supports two types of operations:

  • Transformations, which produce a new DStream
  • Output operations, which write data to an external system

  • DStreams can be created from various input sources, including Apache Kafka, Apache Flume, and HDFS.
  • Spark Streaming also supports a range of DStream transformations.
  • Much as Spark is built on the concept of RDDs, Spark Streaming is built on the abstraction of DStreams, or discretized streams.
  • A DStream is a sequence of data arriving over time. Internally, each DStream is represented as a sequence of RDDs, one arriving at each time step.
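
The two kinds of operations can be illustrated with a toy model. This is a minimal sketch in plain Python, not the Spark API: the class `ToyDStream` and its method `foreach_batch` are hypothetical names, and each inner list merely stands in for one RDD. (In Spark itself, the comparable output operation on a DStream is `foreachRDD`.)

```python
class ToyDStream:
    """Toy stand-in for a DStream: a list of batches, one per time step.

    Hypothetical class for illustration only; each inner list plays
    the role of one RDD.
    """

    def __init__(self, batches):
        self.batches = batches

    def map(self, fn):
        # Transformation: apply fn inside every batch, return a NEW stream.
        return ToyDStream([[fn(x) for x in batch] for batch in self.batches])

    def filter(self, pred):
        # Transformation: keep matching elements, return a NEW stream.
        return ToyDStream([[x for x in batch if pred(x)] for batch in self.batches])

    def foreach_batch(self, sink):
        # Output operation: hand each batch to an external sink.
        for batch in self.batches:
            sink(batch)


stream = ToyDStream([[1, 2, 3], [4, 5], [6]])

# Chained transformations; the original stream is left unchanged.
result = stream.map(lambda x: x * 2).filter(lambda x: x % 4 == 0)

# A list stands in for the external system here.
written = []
result.foreach_batch(written.append)
print(written)
```

Each transformation is applied batch by batch, which mirrors the idea that an operation on a DStream is applied to every underlying RDD.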

What do you understand by receivers in Spark Streaming?

  • Receivers are special entities in Spark Streaming that consume data from various data sources and move it into Apache Spark.
  • Receivers are usually created by streaming contexts as long-running tasks on various executors and are scheduled in a round-robin manner, with each receiver occupying a single core.

How will you calculate the number of executors required to do real-time processing using Apache Spark? What factors need to be considered when deciding on the number of nodes for real-time processing?

The number of nodes can be decided by benchmarking the hardware and weighing several factors: the required throughput (including network speed), memory usage, the cluster manager in use (YARN, Standalone, or Mesos), and the other jobs running under that cluster manager alongside Spark.

What is the difference between transform and map on a DStream?

  • The transform function in Spark Streaming lets developers apply Apache Spark transformations to the underlying RDDs of the stream.
  • The map function performs an element-to-element transformation and can itself be implemented using transform.
  • In short, map works on the elements of a DStream, while transform lets developers work with the RDDs of the DStream.
  • map is an element-level transformation, whereas transform is an RDD-level transformation.
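
The difference in granularity can be sketched as follows. This is plain Python rather than Spark code: `dstream_map` and `dstream_transform` are hypothetical helpers, and each inner list stands in for one RDD of the stream.

```python
def dstream_map(batches, fn):
    # map: fn sees one ELEMENT at a time.
    return [[fn(x) for x in batch] for batch in batches]


def dstream_transform(batches, rdd_fn):
    # transform: rdd_fn sees one whole BATCH (the stand-in "RDD") at a time.
    return [rdd_fn(batch) for batch in batches]


batches = [[3, 1, 2], [5, 4]]

# Element-level work fits map.
squared = dstream_map(batches, lambda x: x * x)

# Batch-level work, such as sorting, needs the whole batch, so it fits transform.
sorted_batches = dstream_transform(batches, sorted)

# map can be expressed through transform, but not the other way round.
map_via_transform = dstream_transform(batches, lambda batch: [x * x for x in batch])

print(squared)
print(sorted_batches)
print(map_via_transform == squared)
```

Sorting cannot be written as a per-element function, which is why an operation of that kind calls for transform rather than map.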

Data Frame

What is a Data Frame?

  • A data frame is like a table: its data is organized into named columns.
  • You can create a data frame from a file or from
    • tables in Hive,
    • external SQL databases,
    • NoSQL databases, or
    • existing RDDs.


Why are partitions immutable?

  • Every transformation generates a new partition.
  • Partitions use the HDFS API, so each partition is
    • immutable,
    • distributed, and
    • fault tolerant.
  • Partitions are also aware of data locality.

What are map and flatMap in Spark?

  • map processes one specific line or row of the data at a time and returns exactly one output item for it.
  • In flatMap, each input item can be mapped to multiple output items (so the function should return a Seq rather than a single item).
  • flatMap is therefore most often used with functions that return arrays or sequences of elements.

Both map and flatMap are applied to each element of an RDD. The only difference is that the function passed to map must return exactly one value, while the function passed to flatMap can return a list of values. So flatMap can convert one element into multiple elements of the RDD, while map always produces the same number of elements as its input.

For example, if we load an RDD from a text file, each element is a sentence. To convert this RDD into an RDD of words, we use flatMap with a function that splits a string into an array of words. If we only need to clean up each sentence or change its case, we use map instead of flatMap. See the diagram below.
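
The sentence-to-words case can also be sketched in plain Python. These are not Spark calls: `rdd_map` and `rdd_flat_map` are hypothetical helpers that only mimic the behavior described above, with a list standing in for the RDD and made-up sentences as data.

```python
from itertools import chain


def rdd_map(elements, fn):
    # map: exactly one output element per input element.
    return [fn(e) for e in elements]


def rdd_flat_map(elements, fn):
    # flatMap: fn returns a sequence per element; the sequences are flattened.
    return list(chain.from_iterable(fn(e) for e in elements))


# Stand-in for an RDD loaded from a text file: one sentence per element.
sentences = ["Spark Streaming is fast", "Hello world"]

upper = rdd_map(sentences, str.upper)        # still 2 elements
words = rdd_flat_map(sentences, str.split)   # 6 elements, one per word
nested = rdd_map(sentences, str.split)       # 2 elements, each a list of words

print(upper)
print(words)
print(nested)
```

Using map with a splitting function keeps one element per sentence, each holding a list of words. Using flatMap flattens those lists into a single collection of words.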