Spark Interview Questions and Answers
1. What is Spark?
Answer: Spark is an engine for scheduling, distributing, and monitoring big data applications. It is a cluster computing platform designed to be fast and general purpose. Spark extends the popular MapReduce model. One of the main features Spark offers for speed is the ability to run computations in memory, but the system is also more efficient than MapReduce for complex applications running on disk.
2. What are the main components of Spark?
Spark Core: Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, and more. Spark Core is also home to the API that defines RDDs.
Spark SQL: Spark SQL is Spark’s package for working with structured data. It allows querying data via SQL as well as the Hive Query Language (HQL).
Spark Streaming: Spark Streaming is a Spark component that enables the processing of live streams of data. Examples of data streams include log files generated by production web servers.
MLlib: Spark comes with a library containing common machine learning (ML) functionality, called MLlib. MLlib provides multiple types of machine learning algorithms.
GraphX: GraphX is a library for manipulating graphs (e.g., a social network’s friend graph) and performing graph-parallel computations.
3. What are the benefits of Spark over MapReduce?
Answer: Because it can process data in memory, Spark can run workloads roughly 10-100x faster than Hadoop MapReduce, which relies on persistent storage for every data processing task.
Unlike Hadoop MapReduce, Spark provides built-in libraries for performing multiple kinds of work from the same core: batch processing, streaming, machine learning, and interactive SQL queries. Hadoop MapReduce supports only batch processing.
Hadoop MapReduce is highly disk-dependent, whereas Spark promotes caching and in-memory data storage. Spark can run computations on the same dataset multiple times without reloading it from disk, which is known as iterative computation. Hadoop MapReduce has no built-in support for iterative computation.
4. What are the steps that occur when you run a Spark application on a cluster?
Answer: The user submits an application using spark-submit.
- spark-submit launches the driver program and invokes the main() method specified by the user.
- The driver program contacts the cluster manager to ask for resources to launch executors.
- The cluster manager launches executors on behalf of the driver program.
- The driver process runs through the user application. Based on the RDD actions and transformations in the program, the driver sends work to executors in the form of tasks.
- Tasks are run on executor processes to compute and save results.
- If the driver’s main() method exits or it calls SparkContext.stop(), it will terminate the executors and release resources from the cluster manager.
5. How does Spark achieve fault tolerance?
Answer: Spark stores data in-memory whereas Hadoop stores data on disk. Hadoop uses replication to achieve fault tolerance whereas Spark uses a different data storage model, RDD. RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information to rebuild just that partition. This removes the need for replication to achieve fault tolerance.
6. Explain the popular use cases of Apache Spark?
Answer: Apache Spark is mainly used for
- Iterative machine learning.
- Interactive data analytics and processing.
- Stream processing.
- Sensor data processing.
7. How can you achieve high availability in Apache Spark?
Answer: High availability can be achieved in two ways:
- Implementing single-node recovery with the local file system.
- Using standby Masters with Apache ZooKeeper.
8. What do you understand by Lazy Evaluation?
Answer: Spark is intelligent in the way it operates on data. When you tell Spark to operate on a given dataset, it records the instructions so that it does not forget them, but it does nothing until it is asked for the final result. When a transformation such as map() is called on an RDD, the operation is not performed immediately. Transformations in Spark are not evaluated until you perform an action. This helps optimize the overall data processing workflow.
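The idea can be sketched in plain Python, without Spark. The `LazyDataset` class below is hypothetical and is not Spark's API: its `map` and `filter` only record the requested step, and nothing runs until the `collect` action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, not run."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: remember the step and return a new dataset.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also only recorded.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: only now are the recorded steps applied, in order.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []


def double(x):
    calls.append(x)  # side effect lets us observe when the work happens
    return x * 2


pipeline = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = list(calls)  # still empty: nothing has run yet
result = pipeline.collect()        # the action triggers the computation
```

Because the whole chain is known before any work starts, an engine has the chance to plan the steps together instead of materializing every intermediate result.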
9. What are the disadvantages of using Apache Spark over Hadoop MapReduce?
Answer: Apache Spark does not always scale well for compute-intensive jobs and consumes a large amount of system resources. Its in-memory capability can at times become a major roadblock to cost-efficient processing of big data. Also, Spark does not have its own file management system and hence needs to be integrated with Apache Hadoop or a cloud-based data platform.
10. What do you understand by Pair RDD?
Answer: Spark provides special operations on RDDs that contain key/value pairs; such RDDs are referred to as Pair RDDs. Pair RDDs allow users to act on each key in parallel. They have a reduceByKey() method that aggregates data for each key and a join() method that combines two RDDs based on elements that have the same key.
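As a rough illustration of what these two operations compute, here is a plain-Python sketch over lists of (key, value) tuples. The helper names `reduce_by_key` and `join` are hypothetical; they mimic the semantics on a single machine and are not Spark's implementation.

```python
from collections import defaultdict


def reduce_by_key(pairs, fn):
    """Combine all values that share a key using fn, like reduceByKey()."""
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return sorted(merged.items())


def join(left, right):
    """Inner join of two key/value lists on the key, like join()."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_by_key.get(key, [])
    ]


sales = [("apple", 3), ("pear", 1), ("apple", 2)]
prices = [("apple", 0.5), ("pear", 0.8)]

totals = reduce_by_key(sales, lambda a, b: a + b)
joined = join(totals, prices)
```

In Spark the same work is spread across partitions, so values for one key may be combined on several executors before the partial results are merged.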
11. What is the lineage graph?
Answer: RDDs in Spark generally depend on one or more other RDDs. The representation of these dependencies is known as the lineage graph. Lineage graph information is used to compute each RDD on demand, so that whenever part of a persisted RDD is lost, the lost data can be recomputed using the lineage graph.
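A minimal plain-Python sketch of lineage-based recovery follows. The `ToyRDD` class is hypothetical: each dataset remembers its parent and the function that derives it, so a lost partition can be rebuilt on its own. It assumes the source data itself is always available.

```python
class ToyRDD:
    """Toy dataset that remembers how it was derived from its parent."""

    def __init__(self, parent=None, fn=None, partitions=None):
        self.parent = parent
        self.fn = fn
        self.cache = dict(partitions or {})  # partition index -> records
        self.recomputed = []                 # indices rebuilt from lineage

    def map(self, fn):
        # The child starts with an empty cache and a link to its parent.
        return ToyRDD(parent=self, fn=fn)

    def compute(self, index):
        if index in self.cache:
            return self.cache[index]
        # Not cached: rebuild just this partition from the parent.
        rows = [self.fn(row) for row in self.parent.compute(index)]
        self.cache[index] = rows
        self.recomputed.append(index)
        return rows


source = ToyRDD(partitions={0: [1, 2], 1: [3, 4]})
squared = source.map(lambda x: x * x)
first_pass = [squared.compute(0), squared.compute(1)]

squared.recomputed.clear()
del squared.cache[1]           # simulate losing one cached partition
rebuilt = squared.compute(1)   # recovered through the lineage link
```

Only partition 1 is recomputed after the simulated loss; partition 0 is still served from the cache.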
12. What is a Sparse Vector?
Answer: A sparse vector is backed by two parallel arrays, one for indices and the other for values. Only the non-zero entries are stored, which saves space when most entries are zero.
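The two-array layout can be sketched in plain Python. The `ToySparseVector` class below is a hypothetical illustration of the storage idea and is not MLlib's own vector class.

```python
class ToySparseVector:
    """Stores only non-zero entries as parallel index and value arrays."""

    def __init__(self, size, indices, values):
        if len(indices) != len(values):
            raise ValueError("indices and values must have the same length")
        self.size = size
        self.indices = list(indices)
        self.values = list(values)

    @classmethod
    def from_dense(cls, dense):
        # Keep only the positions whose value is non-zero.
        pairs = [(i, v) for i, v in enumerate(dense) if v != 0]
        return cls(len(dense), [i for i, _ in pairs], [v for _, v in pairs])

    def to_dense(self):
        dense = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            dense[i] = v
        return dense


dense = [0.0, 3.0, 0.0, 0.0, 1.5]
vec = ToySparseVector.from_dense(dense)
```

Here five dense entries shrink to two stored index/value pairs, and `to_dense()` restores the original list.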
13. What is Spark Streaming?
Answer: Whenever data is flowing continuously and you want to process it as early as possible, you can take advantage of Spark Streaming. It is the API for stream processing of live data. Data can flow in from Kafka, Flume, TCP sockets, Kinesis, etc., and you can do complex processing on the data before pushing it to its destinations. Destinations can be file systems, databases, or dashboards.
14. What is Sliding Window?
Answer: In Spark Streaming, you have to specify the batch interval. For example, if your batch interval is 10 seconds, Spark processes whatever data arrived in the last 10 seconds, i.e., the last batch interval. With a sliding window, you can specify how many of the most recent batches are processed together (the window length) and how often the window is evaluated (the slide interval). For example, you might process the last 3 batches every time 2 new batches have arrived.
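The "last 3 batches every 2 new batches" example can be traced with a small plain-Python sketch. The function name `sliding_windows` and its parameters are hypothetical, and both settings are expressed as a number of batches rather than as durations.

```python
def sliding_windows(batches, window_length, slide_interval):
    """Return the batches each window would cover.

    A window is emitted after every `slide_interval` new batches and spans
    the most recent `window_length` batches available at that moment.
    """
    windows = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        windows.append(batches[start:end])
    return windows


batches = ["b1", "b2", "b3", "b4", "b5", "b6"]
windows = sliding_windows(batches, window_length=3, slide_interval=2)
```

The first window holds only two batches because a third has not arrived yet. After that, consecutive windows overlap by one batch, since the window is longer than the slide.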
15. List some use cases where Spark outperforms Hadoop in processing?
Answer:
- Sensor data processing: Apache Spark’s in-memory computing works best here, as data is retrieved and combined from different sources.
- Real-time querying: Spark is preferred over Hadoop for real-time querying of data.
- Stream processing: For processing logs and detecting fraud in live streams for alerts, Apache Spark is the best solution.
16. On which platforms can Apache Spark run?
Answer: Spark can run on the following platforms:
YARN (Hadoop): Since YARN can handle any kind of workload, Spark can run on YARN. There are two modes of execution: one in which the Spark driver runs inside a container on a cluster node, and one in which the Spark driver runs on the client machine. This is the most common way of using Spark.
Apache Mesos: Mesos is an open-source resource manager. Spark can run on Mesos.
EC2: If you do not want to manage the hardware yourself, you can run Spark on top of Amazon EC2. This makes Spark suitable for various organizations.
Standalone: If you have no resource manager installed in your organization, you can use standalone mode, in which Spark provides its own resource manager. All you have to do is install Spark on all nodes in the cluster, inform each node about the other nodes, and start the cluster. The nodes then start communicating with each other and run.
17. What is the best way of viewing this course?
Answer: You have to just watch the course from beginning to end. Once you go through all the videos, try to answer the questions in your own words. Also, mark the questions that you could not answer by yourself. Then, in the second pass go through only the difficult questions. After going through this course 2-3 times, you will be well prepared to face a technical interview in the Apache Spark field.
18. What are the topics covered in this course?
Answer: We cover a wide range of topics in this course. We have questions on Apache Spark, Spark architecture, tricky questions, etc.
19. What are Spark’s main features?
Answer: Speed: Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and 10x faster even when running on disk. Spark makes this possible by reducing the number of reads and writes to disk. It stores intermediate processing data in memory. It uses the concept of a Resilient Distributed Dataset (RDD), which allows it to transparently store data in memory and persist it to disk only when needed. This removes most of the disk reads and writes, the main time-consuming factors in data processing.
Combines SQL, streaming, and complex analytics: In addition to simple “map” and “reduce” operations, Spark supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms out of the box. Users can also combine all these capabilities seamlessly in a single workflow.
Ease of Use: Spark lets you quickly write applications in Java, Scala, or Python. This lets developers create and run applications in a programming language they already know and makes it easy to build parallel apps.
Runs Everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, S3.
20. Explain the Apache Spark Architecture?
Answer: An Apache Spark application contains two kinds of programs: a Driver program and Worker programs.
A cluster manager sits between these two kinds of cluster nodes. The SparkContext keeps in touch with the worker nodes with the help of the cluster manager.
The SparkContext acts like a master and the Spark workers act like slaves.
Workers contain the executors that run the job. If any dependencies or arguments have to be passed, the SparkContext takes care of that. RDDs reside on the Spark executors.
You can also run Spark applications locally using threads, and if you want to take advantage of a distributed environment you can use S3, HDFS, or any other storage system.
21. How will this course help me?
Answer: By attending this course, you do not have to spend time searching the Internet for Apache Spark interview questions. We have already compiled the list of most popular and latest Apache Spark Interview questions.
22. Explain Spark Streaming Architecture?
Answer: Spark Streaming uses a “micro-batch” architecture, where Spark Streaming receives data from various input sources and groups it into small batches. New batches are created at regular time intervals. At the beginning of each time interval, a new batch is created, and any data that arrives during that interval gets added to that batch. At the end of the time interval, the batch is done growing. The size of the time intervals is determined by a parameter called the batch interval. Each input batch forms an RDD and is processed using Spark jobs to create other RDDs. The processed results can then be pushed out to external systems in batches.
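The batching step can be sketched in plain Python. The function `micro_batches` and the sample events are hypothetical; timestamps are assumed to be seconds since the stream started, and each resulting list stands in for the RDD that would be built for that interval.

```python
def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    grouped = {}
    for timestamp, value in events:
        index = int(timestamp // batch_interval)  # which interval it fell in
        grouped.setdefault(index, []).append(value)
    if not grouped:
        return []
    # Intervals with no data still yield an (empty) batch.
    return [grouped.get(i, []) for i in range(max(grouped) + 1)]


events = [(0.5, "a"), (3.0, "b"), (10.2, "c"), (25.0, "d")]
batches = micro_batches(events, batch_interval=10)

# Each batch is then processed as one small job, here a simple count.
counts = [len(batch) for batch in batches]
```

A shorter batch interval lowers latency but produces more, smaller jobs; a longer one does the opposite.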
23. What is Piping?
Answer: Spark provides a pipe() method on RDDs. Spark’s pipe() lets us write parts of jobs using any language we want as long as it can read and write to Unix standard streams. With pipe(), you can write a transformation of an RDD that reads each RDD element from standard input as a String, manipulates that String however you like, and then writes the result(s) as Strings to standard output.
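The standard-streams contract described above can be imitated in plain Python with `subprocess`. This is only a single-machine sketch: `pipe_elements` is a hypothetical helper, and the external program is a small inline Python script standing in for a tool written in any language.

```python
import subprocess
import sys


def pipe_elements(elements, command):
    """Send each element as one line on stdin and return the stdout lines."""
    completed = subprocess.run(
        command,
        input="\n".join(elements) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.splitlines()


# External program: reads lines from stdin, writes upper-cased lines to stdout.
upper_script = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(line.strip().upper())\n"
)

result = pipe_elements(["spark", "pipe"], [sys.executable, "-c", upper_script])
```

The external program needs no knowledge of Spark; it only has to read lines and write lines.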
24. What file systems does Spark support, and what is YARN?
Answer: Hadoop Distributed File System (HDFS)
Local File system
YARN: As in Hadoop, YARN is one of the key features in Spark, providing a central resource management platform to deliver scalable operations across the cluster. Running Spark on YARN requires a binary distribution of Spark that is built with YARN support.
25. Is there any benefit of learning MapReduce, then?
Answer: Yes, MapReduce is a paradigm used by many big data tools including Spark as well. It is extremely relevant to use MapReduce when the data grows bigger and bigger. Most tools like Pig and Hive convert their queries into MapReduce phases to optimize them better.
26. What are Executors?
Answer: Spark executors are worker processes responsible for running the individual tasks in a given Spark job. Executors are launched once at the beginning of a Spark application and typically run for the entire lifetime of an application. Executors have two roles. First, they run the tasks that make up the application and return results to the driver. Second, they provide in-memory storage for RDDs that are cached by user programs.
27. What are the Row objects?
Answer: Row objects represent records inside SchemaRDDs, and are simply fixed-length arrays of fields. Row objects have a number of getter functions to obtain the value of each field given its index. The standard getter, get (or apply in Scala), takes a column number and returns an Object type (or Any in Scala) that we are responsible for casting to the correct type. For Boolean, Byte, Double, Float, Int, Long, Short, and String, there is a getType() method, which returns that type. For example, getString(0) would return field 0 as a string.
28. What is Receiver in Spark Streaming?
Answer: Every input DStream is associated with a Receiver object which receives the data from a source and stores it in Spark’s memory for processing.
29. Does Apache Spark provide checkpointing?
Answer: Lineage graphs can always be used to recover RDDs after a failure, but this is generally time-consuming if the RDDs have long lineage chains. Spark has an API for checkpointing, i.e., a REPLICATE flag to persist. However, the decision on which data to checkpoint is left to the user. Checkpoints are useful when the lineage graphs are long and have wide dependencies.
30. What is SparkContext?
Answer: SparkContext is the entry point to Spark. Using the SparkContext, you create RDDs, which provide various ways of processing data.
31. Why is Spark faster than MapReduce?
Answer: There are a few important reasons why Spark is faster than MapReduce, and some of them are below:
- There is no tight coupling in Spark, i.e., there is no mandatory rule that a reduce must come after a map.
- Spark tries to keep the data in memory as much as possible.
- In MapReduce, the intermediate data is stored in HDFS, so it takes longer to read the data back from that source; this is not the case with Spark.
32. What is the level of questions in this course?
Answer: This course contains questions suited to everyone from a Fresher to an Architect. The difficulty of the questions ranges from Fresher level to Experienced professional level.
33. What are DStreams?
Answer: Much like Spark is built on the concept of RDDs, Spark Streaming provides an abstraction called DStreams, or discretized streams. A DStream is a sequence of data arriving over time. Internally, each DStream is represented as a sequence of RDDs arriving at each time step. DStreams can be created from various input sources, such as Flume, Kafka, or HDFS. Once built, they offer two types of operations: transformations, which yield a new DStream, and output operations, which write data to an external system.
34. What is a SchemaRDD/DataFrame?
Answer: A SchemaRDD is an RDD composed of Row objects with additional schema information of the types in each column. Row objects are just wrappers around arrays of basic types (e.g., integers and strings).
35. What is Spark Executor?
Answer: When SparkContext connects to a cluster manager, it acquires executors on nodes in the cluster. Executors are Spark processes that run computations and store data on the worker nodes. The final tasks are then sent by SparkContext to the executors for execution.
36. What are Broadcast Variables?
Answer: Spark’s second type of shared variable, broadcast variables, allows the program to efficiently send a large, read-only value to all the worker nodes for use in one or more Spark operations. They come in handy, for example, if your application needs to send a large, read-only lookup table to all the nodes.
37. Difference between map() and flatMap()?
Answer: The map() transformation takes in a function and applies it to each element in the RDD with the result of the function being the new value of each element in the resulting RDD. Sometimes we want to produce multiple output elements for each input element. The operation to do this is called flatMap(). As with map(), the function we provide to flatMap() is called individually for each element in our input RDD. Instead of returning a single element, we return an iterator with our return values.
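The difference is easiest to see side by side. The plain-Python sketch below uses hypothetical helpers `map_elements` and `flat_map_elements` on an ordinary list; they mimic the semantics of the two transformations and are not Spark's API.

```python
from itertools import chain


def map_elements(data, fn):
    """One output element per input element, like map()."""
    return [fn(x) for x in data]


def flat_map_elements(data, fn):
    """fn returns an iterable per element; the results are flattened."""
    return list(chain.from_iterable(fn(x) for x in data))


lines = ["hello world", "hi"]

mapped = map_elements(lines, str.split)          # a list of word lists
flattened = flat_map_elements(lines, str.split)  # a single list of words
```

With `map` the result keeps one entry per line, each entry being a list of words. With `flatMap` the per-line lists are merged into one flat collection of words.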
38. What is Spark SQL?
Answer: Spark SQL, the successor to the earlier Shark project, is a module introduced in Spark for working with structured data and performing structured data processing. Through this module, Spark executes relational SQL queries on the data. The core of the component is a special kind of RDD called a SchemaRDD, composed of Row objects and a schema describing the data type of each column in the row. It is similar to a table in a relational database.
39. Define Spark Streaming. Does spark support stream processing?
Answer: Yes. Spark Streaming is an extension to the Spark API that allows stream processing of live data streams. Data from sources such as Flume and HDFS is streamed in, processed, and finally pushed out to file systems, live dashboards, and databases. It is similar to batch processing in that the input data is divided into small batches.
40. How does Spark Streaming work?
Answer: Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
41. What is Spark Driver?
Answer: The Spark driver is the program that runs on the master node of the cluster and declares transformations and actions on RDDs. In simple terms, the driver creates the SparkContext, connected to a given Spark Master. The driver also delivers the RDD graphs to the Master, where the standalone cluster manager runs.
42. What do you understand by Transformations in Spark?
Answer: Transformations are functions applied to an RDD that result in another RDD. A transformation does not execute until an action occurs. map() and filter() are examples of transformations: map() applies the function passed to it to each element of the RDD and produces another RDD, while filter() creates a new RDD by selecting the elements of the current RDD for which the function argument returns true.
43. What does a Spark Engine do?
Answer: Spark Engine is responsible for scheduling, distributing and monitoring the data application across the cluster.
44. What are the client mode and cluster mode?
Answer: Each application has a driver process that coordinates its execution. This process can run in the foreground (client mode) or in the background (cluster mode). Client mode is a little simpler, but cluster mode allows you to easily log out after starting a Spark application without terminating the application.
45. What is Apache Spark?
Answer: Spark is a fast, easy-to-use, and flexible data processing framework. Most data users know only SQL and are not good at programming. Shark is a tool developed for people from a database background to access Scala MLlib capabilities through a Hive-like SQL interface. The Shark tool helps data users run Hive on Spark, offering compatibility with the Hive metastore, queries, and data.
46. What is Executor memory?
Answer: You can configure this using the --executor-memory argument to spark-submit. Each application will have at most one executor on each worker, so this setting controls how much of that worker’s memory the application will claim. By default, this setting is 1 GB; you will likely want to increase it on most servers.
47. What is the maximum number of total cores?
Answer: This is the total number of cores used across all executors for an application. By default, this is unlimited; that is, the application will launch executors on every available node in the cluster. For a multiuser workload, you should instead ask users to cap their usage. You can set this value through the --total-executor-cores argument to spark-submit, or by configuring spark.cores.max in your Spark configuration file.
48. What is the Standalone mode?
Answer: In standalone mode, Spark uses a Master daemon which coordinates the efforts of the Workers, which run the executors. Standalone mode is the default, but it cannot be used on secure clusters. When you submit an application, you can choose how much memory its executors will use, as well as the total number of cores across all executors.
49. Explain the key features of Spark?
Answer: Allows Integration with Hadoop and files included in HDFS.
Spark has an interactive language shell, since it ships with its own interpreter for Scala (the language in which Spark is written).
Spark consists of RDDs (Resilient Distributed Datasets), which can be cached across computing nodes in a cluster.
Spark supports multiple analytic tools that are used for interactive query analysis, real-time analysis and graph processing.
50. What are the optimizations that the developer can make while working with spark?
1. Spark is memory intensive: whatever it does, it does in memory.
2. You can adjust how long Spark will wait before it times out on each of the data locality levels (process-local, node-local, rack-local, any).
3. Filter out data as early as possible. For caching, choose wisely from various storage levels.
4. Tune the number of partitions in spark.
51. Define Actions.
Answer: An action brings data back from an RDD to the local machine. Executing an action triggers all the previously created transformations. reduce() is an action that applies the function passed to it again and again until only one value is left. take(n) is an action that returns the first n elements of the RDD to the local node, while collect() returns all of them.
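What these actions return can be illustrated on an ordinary Python list. In Spark's RDD API, take(n) returns the first n elements and collect() returns every element. The helper names below are hypothetical stand-ins that mirror those semantics on one machine.

```python
from functools import reduce
from itertools import islice


def reduce_action(data, fn):
    """Apply fn pairwise until a single value is left, like reduce()."""
    return reduce(fn, data)


def take_action(data, n):
    """Return only the first n elements, like take(n)."""
    return list(islice(data, n))


def collect_action(data):
    """Return every element, like collect()."""
    return list(data)


values = [5, 1, 4, 2]

total = reduce_action(values, lambda a, b: a + b)
first_two = take_action(values, 2)
everything = collect_action(values)
```

On a large dataset, `take` is the safer choice for a quick look, because `collect` pulls every element back to the local machine.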