Apache Spark Components






What are the main components of Spark?

Explain the major libraries that constitute the Spark Ecosystem.

What are the various libraries available on top of Apache Spark?

  • Spark Core:
    •  Spark Core contains the basic functionality of Spark, including components for
      • task scheduling,
      • memory management,
      • fault recovery,
      • interacting with storage systems, and more.
    • Spark Core is also home to the API that defines RDDs (Resilient Distributed Datasets).
    • It is the base engine for large-scale parallel and distributed data processing.
  • Spark SQL:
    • Spark SQL is Spark’s package for working with structured data.
    • It allows developers to query data via SQL as well as HQL (the Hive Query Language).
  • Spark Streaming
    • This library is used to process real time streaming data.
    • Examples of data streams include logfiles generated by production web servers.
    • It is a simple library that listens on unbounded data sets, i.e., datasets into which data is continuously flowing.
    • If the source stops providing data, processing pauses and waits for new data to arrive.
    • The library collects the incoming stream into batches of “n” seconds of data, converts each batch into an RDD, and then runs the provided operations on those RDDs.
  • MLlib:
    • Spark comes with a library containing common machine learning (ML) functionality, called MLlib.
    • Machine learning library in Spark for commonly used learning algorithms like clustering, regression, classification, etc.
    • It performs machine learning in Apache Spark.
    • Its algorithms are internally wired to use Spark Core (RDD operations) and the data structures they require.
    • For example, it provides ways to represent a matrix as an RDD and to express recommendation algorithms as sequences of transformations and actions.
    • MLlib’s algorithms can therefore run in parallel across many machines.
  • GraphX:
    • GraphX is a library for generating and computing graphs, manipulating huge graph data structures (e.g., a social network’s friend graph), and performing graph-parallel computations.
    • Spark API for graph parallel computations with basic operators like
      • joinVertices,
      • subgraph,
      • aggregateMessages, etc.
    • Internally, GraphX represents graphs as RDDs.
    • Algorithms such as PageRank on graphs are internally converted into operations on RDDs.
  • SparkR
    • SparkR provides an R frontend so that R programmers can run analyses on the Spark engine.
  • BlinkDB
    • An approximate query engine enabling interactive queries over massive data. GraphX, SparkR, and BlinkDB are in the incubation stage.
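The Spark SQL bullets above can be illustrated with a minimal Scala sketch; the session settings, table name, and data below are invented for the example.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; app name is arbitrary.
val spark = SparkSession.builder()
  .appName("sql-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// A small structured dataset with made-up columns.
val df = Seq(("alice", 34), ("bob", 45), ("carol", 29)).toDF("name", "age")
df.createOrReplaceTempView("people")

// Query the same structured data via SQL.
val adults = spark.sql("SELECT name FROM people WHERE age > 30")
adults.show()
```

The same query could be written with the DataFrame API (`df.filter($"age" > 30)`); the SQL form is shown because the bullet emphasizes querying via SQL/HQL.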
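The MLlib bullets above note that its algorithms run as RDD operations. A minimal sketch with the RDD-based k-means API, using an invented toy dataset of two well-separated point clouds:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val sc = new SparkContext(
  new SparkConf().setAppName("mllib-sketch").setMaster("local[*]"))

// Two well-separated point clouds, expressed as an RDD of vectors.
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
))

// Train k-means with k = 2 and up to 20 iterations; the algorithm
// executes as Spark Core transformations and actions over the RDD.
val model = KMeans.train(points, 2, 20)
```

The trained model exposes `clusterCenters` and `predict` for assigning new points to clusters.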

What is GraphX?

  • Often you have to process data in the form of graphs in order to run analyses on it.
  • GraphX performs graph computation in Spark on data that is present in files or in RDDs.
  • GraphX is built on top of Spark Core, so it inherits all of Apache Spark’s capabilities, such as fault tolerance and scaling, and it ships with many built-in graph algorithms. GraphX unifies ETL, exploratory analysis, and iterative graph computation within a single system.
  • You can view the same data as both graphs and collections, transform and join graphs with RDDs efficiently, and write custom iterative algorithms using the Pregel API.
  • GraphX competes on performance with the fastest graph systems while retaining Spark’s flexibility, fault tolerance, and ease of use.
  • Spark uses GraphX for graph processing, to build and transform interactive graphs.
  • The GraphX component enables programmers to reason about structured data at scale.
  • GraphX is a Spark API for manipulating graphs and collections.
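The “same data as both graphs and collections” point above can be sketched in Scala; the friend graph below is a made-up three-user example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

val sc = new SparkContext(
  new SparkConf().setAppName("graphx-sketch").setMaster("local[*]"))

// A toy friend graph: vertices carry names, edges carry a relationship label.
val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val friendships = sc.parallelize(Seq(
  Edge(1L, 2L, "friend"), Edge(2L, 3L, "friend"), Edge(1L, 3L, "friend")
))
val graph = Graph(users, friendships)

// Collection view: the vertices are an ordinary RDD of (id, name) pairs.
val names = graph.vertices.map(_._2).collect().toSet

// Graph view: per-vertex degree (in-degree + out-degree).
val degrees = graph.degrees.collect().toMap
```

Here each user touches two friendship edges, so every vertex has degree 2.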


 Is there any API available for implementing graphs in Spark?

  • GraphX is the API used for implementing graphs and graph-parallel computing in Apache Spark.
  • It extends the Spark RDD with a Resilient Distributed Property Graph.
  • The property graph is a directed multi-graph that can have multiple edges in parallel.
  • Each edge and vertex of the Resilient Distributed Property Graph has user-defined properties associated with it. The parallel edges allow for multiple relationships between the same vertices.
  • In order to support graph computation, GraphX exposes a set of fundamental operators, such as joinVertices, mapReduceTriplets, and subgraph, and an optimized variant of the Pregel API.
  • The GraphX component also includes an increasing collection of graph algorithms and builders for simplifying graph analytics tasks.
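Two of the fundamental operators named above, `aggregateMessages` and `subgraph`, can be sketched on a small invented graph: `aggregateMessages` computes in-degrees by sending a message along each edge, and `subgraph` filters edges by a predicate.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

val sc = new SparkContext(
  new SparkConf().setAppName("graphx-ops").setMaster("local[*]"))

// A tiny directed graph with three vertices and three edges.
val graph = Graph(
  sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c"))),
  sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(1L, 3L, 1), Edge(2L, 3L, 1)))
)

// In-degree via aggregateMessages: every edge sends the message "1" to its
// destination vertex, and messages are merged per vertex by summation.
val inDegrees = graph.aggregateMessages[Int](
  ctx => ctx.sendToDst(1),
  (a, b) => a + b
)

// subgraph: keep only the edges leaving vertex 1.
val sub = graph.subgraph(epred = t => t.srcId == 1L)
```

Vertex 3 receives two messages (from 1 and 2), vertex 2 receives one, and vertex 1 receives none, so it is absent from the result.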

What is the usage of GraphX module in Spark?

  • GraphX is a graph processing library.
  • It can be used to build and transform interactive graphs.
  • Many algorithms are available with GraphX library. PageRank is one of them.
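The PageRank algorithm mentioned above is available directly on a GraphX graph. A minimal sketch on an invented three-vertex link graph, where two vertices link to vertex 3:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

val sc = new SparkContext(
  new SparkConf().setAppName("pagerank-sketch").setMaster("local[*]"))

// A tiny link graph: vertices 1 and 2 link to 3, and 3 links back to 1.
val graph = Graph(
  sc.parallelize(Seq((1L, ""), (2L, ""), (3L, ""))),
  sc.parallelize(Seq(Edge(1L, 3L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
)

// Run PageRank until the ranks converge within the given tolerance.
val ranks = graph.pageRank(0.0001).vertices.collect().toMap
```

Vertex 3, which receives the most links, ends up with the highest rank; internally the iterations run as operations on RDDs, as noted earlier.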