2. Materials/Methods. Differences between MapReduce and Apache Spark
MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster of computers. MapReduce consists of several components, including:
- JobTracker - the master node that manages all jobs and resources in the cluster.
- TaskTrackers - agents deployed on each machine in the cluster that run the map and reduce tasks.
- JobHistoryServer - a component that keeps track of completed jobs; it is usually deployed as a separate daemon or alongside the JobTracker (a minimal client configuration sketch follows Figure 1).
Figure 1. MapReduce architecture
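As a rough illustration of how these components fit together, the sketch below shows a client configuration for the classic (MRv1) architecture in Java; the host names, ports, and class name are assumptions for illustration only. A job submitted with such a configuration is coordinated by the JobTracker and executed by TaskTracker agents on the worker machines.

```java
// Minimal sketch of a classic (MRv1) Hadoop client configuration.
// Host names and ports are illustrative assumptions, not real endpoints.
import org.apache.hadoop.mapred.JobConf;

public class Mrv1ClientConfig {
    public static JobConf buildConf() {
        JobConf conf = new JobConf();
        // JobTracker: the master that manages jobs and resources in the cluster
        conf.set("mapred.job.tracker", "jobtracker.example.com:8021");
        // Default file system: the HDFS NameNode where job input and output live
        conf.set("fs.default.name", "hdfs://namenode.example.com:9000");
        // Jobs submitted with this configuration are split into map and reduce
        // tasks that the JobTracker assigns to TaskTrackers on the worker nodes.
        return conf;
    }
}
```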
Apache Hadoop is an open-source software framework. It is designed to scale applications from a single server to thousands of machines and to run them on clusters of commodity hardware. The Apache Hadoop platform is divided into two layers:
• Storage layer: Hadoop Distributed File System (HDFS)
• Processing layer: MapReduce
The storage layer, HDFS, is responsible for storing data, while MapReduce is responsible for processing data in the Hadoop cluster. MapReduce is a processing technique and programming model for distributed computing, based on the Java programming language, that scales to hundreds or thousands of servers in a Hadoop cluster. It is a powerful framework for processing large distributed sets of structured or unstructured data stored in HDFS [4]. A key strength of MapReduce is its scalability.
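To make the programming model concrete, the following is a minimal word-count sketch written against the standard Hadoop MapReduce Java API; the class and variable names are illustrative, not taken from the cited sources. The map function emits a (word, 1) pair for every token, and the reduce function sums the counts for each word.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

    // Map phase: split each input line into words and emit (word, 1)
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```

In a complete job, a small driver class would additionally configure the input and output paths in HDFS and submit the job to the cluster.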
Apache Spark is a cluster-computing technology designed for fast computation when processing large amounts of data. Spark is a distributed processing engine, but it does not come with a built-in cluster resource manager or distributed storage system; you must connect the cluster manager and storage system of your choice. It consists of Spark Core and a set of libraries, similar to those available for Hadoop [8]. The core is a distributed execution engine, and Spark provides APIs in Java, Scala, Python, and R for developing distributed applications. Additional libraries built on top of Spark Core support workloads such as streaming, SQL, graph processing, and machine learning. Spark is thus a data processing engine for both batch and streaming workloads, including SQL queries, graph processing, and machine learning. Spark can run standalone as well as under the Hadoop YARN cluster manager, which allows it to read existing Hadoop data.
Figure 2. Apache Spark technology
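For contrast, a rough sketch of the same word count in Spark's Java API (assuming the Spark 2.x Java API; the application name and file paths are placeholders) is shown below. The RDD transformations express the map and reduce steps in a few lines and are executed in memory where possible.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCountSketch {
    public static void main(String[] args) {
        // Local mode for illustration; on a cluster the master is supplied by
        // the chosen cluster manager (for example YARN).
        SparkConf conf = new SparkConf().setAppName("WordCountSketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs:///tmp/input.txt"); // placeholder path

        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // map step
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);                                    // reduce step

        counts.saveAsTextFile("hdfs:///tmp/output"); // placeholder path
        sc.stop();
    }
}
```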
The main differences between MapReduce and Apache Spark
- MapReduce is disk-based, while Apache Spark processes data primarily in memory and can spill to disk when needed (see the sketch after this list).
- MapReduce and Apache Spark are compatible in terms of the data types and data sources they support.
- A key difference is that MapReduce relies on persistent storage, while Spark builds on Resilient Distributed Datasets (RDDs).
- Hadoop MapReduce is designed for data that does not fit in memory, while Apache Spark performs better when the data fits in memory, especially on dedicated clusters.
- Hadoop MapReduce can be an economical option when run as a managed Hadoop service, while Apache Spark is cost-effective when large amounts of memory are available.
- Apache Spark and Hadoop MapReduce are both fault-tolerant, but Hadoop MapReduce is comparatively more fault-tolerant than Spark.
- Hadoop MapReduce requires basic Java programming skills, while Apache Spark is easier to program because it offers an interactive mode.
- Spark can perform batch processing 10-100 times faster than MapReduce, although both tools are used to process large amounts of data [7].
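A short sketch of where the in-memory model pays off is given below; the dataset path, class name, and number of iterations are illustrative assumptions. An RDD that is reused across iterations can be cached in memory, whereas a MapReduce pipeline would re-read the data from HDFS on every pass.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeCacheSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("IterativeCacheSketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load the data once and keep the parsed records in memory.
        JavaRDD<Double> values = sc.textFile("hdfs:///tmp/numbers.txt") // placeholder path
                .map(Double::parseDouble)
                .cache();

        // Each pass reuses the cached RDD instead of re-reading HDFS, which is
        // where Spark's advantage over disk-based MapReduce shows up.
        for (int i = 0; i < 5; i++) {
            double sum = values.reduce(Double::sum);
            System.out.println("iteration " + i + ": sum = " + sum);
        }

        sc.stop();
    }
}
```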