58.What is RDD?
- RDDs (Resilient Distributed Datasets) are basic abstraction in Apache Sparkthat represent the data coming into the system in object format.
- RDDs are used for in-memory computations on large clusters, in a fault tolerant manner.
- RDDs are read-only portioned, collection of records,
- that are –
- Immutable – RDDs cannot be altered.
- Resilient – If a node holding the partition fails the other node takes the data.
If you have large amount of data, and is not necessarily stored in a single system, all the data can be distributed across all the nodes and one subset of data is called as a partition which will be processed by a particular task.
RDD’s are very close to input splits in MapReduce.
- RDD is a collection of partitioned data that satisfies these properties.
- lazily evaluated,
- In the event that you have enormous measure of information, and isn’t really put away in a solitary framework, every one of the information can be dispersed over every one of the hubs and one subset of information is called as a parcelwhich will be prepared by a specific assignment.
- RDD’s are exceptionally near information parts in MapReduce.
. It is a representation of data located on a network which is
- Immutable– You can operate on the rdd to produce another rdd but you can’t alter it.
- Partitioned / Parallel – The data located on RDD is operated in parallel. Any operation on RDD is done using multiple nodes.
- Resilience– If one of the node hosting the partition fails, another nodes takes its data.
- RDD provides two kinds of operations:
- Transformations and Actions.
- RDD provides the abstraction for distributed computing across nodes in Spark Cluster.
- You can always think of RDD as a big array which is under the hood spread over many computers which are completely abstracted.
- RDD is made up many partitions each partition on different computers.
- RDD can hold data of any type from any supported programming language such as Python, Java, Scala.
- The case where RDD’s each element is a tuple – made up of (key, value) is called Pair RDD.
- PairRDDs provides extra functionalities such as “group by” and joins.
- RDD is generally lazily computed i.e. it is not computed unless an action on it is called.
- RDD is either prepared out of another RDD or it is loaded from a data source.
- In case, it is loaded from another data source it has a binding between the actual data storage and partitions.
- So, RDD is essentially a pointer to actual data, not data unless it is cached.
- If a machine that holds a partition of RDD dies, the same partition is regenerated using the lineage graph of RDD.
- If there is a certain RDD that you require very frequently, you can cacheit so that it is readily available instead of re-computation every time.
- Please note that the cached RDD will be available only during the lifetime of the application.
- If it is costly to recreate the RDD every time, you may want to persist it to the disc.
- RDD can be stored at various data storage (such as HDFS, database etc.) in many formats.
There are primarily two types of RDD:
- Parallelized Collections : The existing RDD’s running parallel with one another.
- Hadoop datasets : perform function on each file record in HDFS or other storage system
Generally, RDDs support two types of operations – actions and transformations.
- There are two ways to create RDD. One while loading data from a source.
- Second, by operating on existing RDD.
- And an action causes the computation from an RDD to yield the result.
- The diagram below shows the relationship between RDD, transformations, actions and value/result.
- While loading Data from Source – When an RDD is prepared by loading data from some source (HDFS, Cassandra, in-memory), the machines which exist nearer to the data are assigned for the creation of partitions. These partitions would hold the parts of mappings or pointers to the actual data. When we are loading data from the memory (for example, by using parallelize), the partitions would hold the actual data instead of pointers or mapping
- By converting an in-memory array of objects – An in-memory object can be converted to an RDD using parallelize.
- By operating on existing RDD –
- An RDD is immutable.
- We can’t change an existing RDD.
- We can only form a new RDD based on the previous RDD by operating on it.
- When operating on existing RDD, a new RDD is formed. These operations are also called transformations.
- The operation could also result in shuffling – moving data across the nodes.
- Some operations that do not cause shuffling: map, flatMap and filter.
- Examples of the operations that could result in shuffling are groupByKey, repartition, sortByKey, aggregateByKey, reduceByKey, distinct.
- Spark maintains the relationship between the RDD in the form of a DAG (Directed Acyclic Graph).
- When an action such reduce() or saveAsTextFile() is called, the whole graph is evaluated and the result is returned to the driver or saved to the location such as HDFS.
As the major logical data units in Apache Spark, RDD possesses a distributed collection of data. It is a read-only data structure and you cannot change the original format but it can always be transformed into a different form with the changes. The two operations which are supported by RDD are –
- Transformation – It creates a new RDD from the former one. They are executed only on demand.
- Actions – The final outcomes of the RDD computations are returned by actions.
64. How to create an RDD?
You can create an RDD from an in-memory data or from a data source such as HDFS.
You can load the data from memory using parallelize method of Spark Context in the following manner, in python:
myrdd = sc.parallelize([1,2,3,4,5]);
Here myrdd is the variable that represents an RDD created out of an in-memory object. “sc” is the sparkContext which is readily available if you are running in interactive mode using PySpark. Otherwise, you will have to import the SparkContext and initialize.
And to create RDD from a file in HDFS, use the following:
linesrdd = sc.textFile(“/data/file_hdfs.txt”);
This would create linesrdd by loading a file from HDFS. Please note that this will work only if your Spark is running on top of Yarn. In case, you want to load the data from external HDFS cluster, you might have to specify the protocol and name node:
linesrdd = sc.textFile(“hdfs://namenode_host/data/file_hdfs.txt”);
In the similar fashion, you can load data from third-party systems.
Spark provides two methods to create RDD:
- By parallelizing a collection in your Driver program. This makes use of SparkContext’s ‘parallelize’ methodval
IntellipaatData = Array(2,4,6,8,10) val distIntellipaatData = sc.parallelize(IntellipaatData)
- By loading an external dataset from external storage like HDFS, shared file system.
RDDs are of two types:
- Hadoop Datasets – Perform functions on each file record in HDFS (Hadoop Distributed File System) or other types of storage systems
- Parallelized Collections – Extant RDDs running parallel with one another
There are two ways of creating an RDD in Apache Spark:
- By parallelizing a collection in the Driver program. It makes use of SparkContext’s parallelize() method. For instance:
method val DataArray = Array(22,24,46,81,101) val DataRDD = sc.parallelize(DataArray)
- By means of loading an external dataset from some external storage, including HBase, HDFS, and shared file system
68.What do you understand by Pair RDD?
- Special operations can be performed on RDDs in Spark using key/value pairs and such RDDs are referred to as Pair RDDs.
- Pair RDDs allow users to access each key in parallel.
- They have a reduceByKey () method that collects data based on each key and a join () method that combines different RDDs together, based on the elements having the same key.
- Spark provides special operations on RDDs containing key/value pairs.
- These RDDs are called pair RDDs. Pair RDDs are a useful building block in many programs, as they expose operations that allow you to act on each key in parallel.
- For example, pair RDDs have a reduceByKey() method that can aggregate data separately for each key, and a join() method that can merge two RDDs together by grouping elements with the same key.
59.What do you know about RDD Persistence?
- Persistence means Caching.
- When an RDD is persisted, each node in the cluster stores the partitions (of the RDD) in memory (RAM).
- When there are multiple transformations or actions on an RDD, persistence helps to cut down the latency by the time required to load the data from file storage to memory.
61. What do you mean by the default storage level: MEMORY_ONLY?
- Default Storage Level – MEMORY_ONLY mean store RDD as deserialized Java objects in the JVM.
- If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they’re needed.
69.How RDD persist the data?
What are the various levels of persistence in Apache Spark?
There are two methods to persist the data, such as
persist()to persist permanently and
cache()to persist temporarily in the memory.
Different storage level options there such as
- DISK_ONLY and many more.
Both persist() and cache() uses different options depends on the task.
Apache Spark automatically persists the intermediary data from various shuffle operations, however it is often suggested that users call persist () method on the RDD in case they plan to reuse it.Spark has various persistence levels to store the RDDs on disk or in memory or as a combination of both with different replication levels.
The various storage/persistence levels in Spark are –
- MEMORY_AND_DISK_SER, DISK_ONLY
Although the intermediary data from different shuffle operations automatically persists in Spark, it is recommended to use the persist () method on the RDD if the data is to be reused.
Apache Spark features several persistence levels for storing the RDDs on disk, memory, or a combination of the two with distinct replication levels. These various persistence levels are:
- DISK_ONLY – Stores the RDD partitions only on the disk.
- MEMORY_AND_DISK – Stores RDD as deserialized Java objects in the JVM. In case the RDD isn’t able to fit in the memory, additional partitions are stored on the disk. These are read from here each time the requirement arises.
- MEMORY_ONLY_SER – Stores RDD as serialized Java objects with one byte array per partition.
- MEMORY_AND_DISK_SER – Identical to MEMORY_ONLY_SER with the exception of storing partitions not able to fit in the memory to the disk in place of recomputing them on the fly when required.
- MEMORY_ONLY – The default level, it stores the RDD as deserialized Java objects in the JVM. In case the RDD isn’t able to fit in the memory available, some partitions won’t be cached, resulting into recomputing the same on the fly every time they are required.
- OFF_HEAP – Works like MEMORY_ONLY_SER but stores the data in off-heap memory.
60.What is the difference between cache() and persist() for an RDD?
- cache() uses default storage level, i.e., MEMORY_ONLY.
- persist() can be provided with any of the possible storage levels.
62. If there is certain data that we want to use again and again in different transformations, what should improve the performance?
- RDD can be persisted or cached.
- There are various ways in which it can be persisted: in-memory, on disc etc.
- So, if there is a dataset that needs a good amount computing to arrive at, you should consider caching it.
- You can cache it to disc if preparing it again is far costlier than just reading from disc or it is very huge in size and would not fit in the RAM.
- You can cache it to memory if it can fit into the memory.
How RDD hang on the data?
There are two strategies to bear the data, for instance, hang on() to drive forward forever and hold() to proceed quickly in the memory. Unmistakable limit level decisions there, for instance, MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY and some more. Both endure() and hold() uses assorted choices depends upon the task.
- Spark is amazingly speedy. As indicated by their cases, it runs programs up to 100x speedier than Hadoop MapReduce in memory, or 10x faster on circle. It reasonably utilizes RAM to make the snappier results.
- In diagram perspective, you make many Map-reduce errands and a while later incorporate these assignments using Oozie/shell content. This framework is particularly monotonous and the guide reduce task has significant torpidity.
- And consistently, decoding the yield out of one MR work into the commitment of another MR occupation may require creating another code in light of the way that Oozie may not work.
- In Spark, you can basically do everything using single application/bolster (pyspark or scala comfort) and get the results in a split second. Trading between ‘Running something on group’ and ‘achieving something locally’ is truly straightforward and clear. This also prompts less setting switch of the fashioner and more prominent benefit.
- Spark kind of reciprocals to MapReduce and Oozie set up together.
What is DAG and Stage in spark processing?
FYI the above program, the overall execution plan is as per the DAG scheduler.
For each and every method execution is optimized as per the stages.
63. What happens to RDD when one of the nodes on which it is distributed goes down?
Since Spark knows how to prepare a certain data set because it is aware of various transformations and actions that have lead to the dataset, it will be able to apply the same transformations and actions to prepare the lost partition of the node which has gone down.
65. When we create an RDD, does it bring the data and load it into the memory?
- An RDD is made up of partitions which are located on multiple machines.
- The partition is only kept in memory if the data is being loaded from memory or the RDD has been cached/persisted into the memory.
- Otherwise, an RDD is just mapping of actual data and partitions.
66.How to save RDD?
There are few methods provided by Spark:
- Write the elements of the RDD as a text file (or set of text files) to the provided directory.
- The directory could be in the local filesystem, HDFS or any other file system.
- Each element of the dataset will be converted to text using toString() method on every element.
- And each element will be appended with newline character “\n”
- Write the elements of the dataset as a Hadoop SequenceFile.
- This works only on the key-value pair RDD which implement Hadoop’s Writeable interface.
- You can load sequence file using sc.sequenceFile().
- This simply saves data by serializing using standard java object serialization.
67.What do you understand by SchemaRDD?
- An RDD that consists of row objects (wrappers around basic string or integer arrays) with schema information about the type of data in each column.
- Row objects are just wrappers around arrays of basic types (e.g., integers and strings).
70.How would you determine the quantity of parcels while making a RDD? What are the capacities?
- You can determine the quantity of allotments while making a RDD either by utilizing the sc.textFile or by utilizing parallelize works as pursues:
- Val rdd = sc.parallelize(data,4)
- val information = sc.textFile(“path”,4)
What is the contrast between RDD , DataFrame and DataSets?
- It is the structure square of Spark. All Dataframes or Dataset is inside RDDs.
- It is lethargically assessed permanent gathering objects
- RDDS can be effectively reserved if a similar arrangement of information should be recomputed.
- Gives the construction see ( lines and segments ). It tends to be thought as a table in a database.
- Like RDD even dataframe is sluggishly assessed.
- It offers colossal execution due to a.) Custom Memory Management – Data is put away in off load memory in twofold arrangement .No refuse accumulation because of this.
- Optimized Execution Plan – Query plans are made utilizing Catalyst analyzer.
- DataFrame Limitations : Compile Time wellbeing , i.e no control of information is conceivable when the structure isn’t known.
DataSet: Expansion of DataFrame
- DataSet Feautures – Provides best encoding component and not at all like information edges supports arrange time security.
Dataframe is untyped (throw an exception at runtime in case of any error in the schema mismatch)
Dataset is typed(throw an exception at compile time in case of any error in the schema mismatch)
how to join two dataframes in spark?