When big data frameworks come up in conversation, Apache Hadoop and Apache Spark are usually the first names mentioned. This post explains what they are and how they relate to each other.
Apache Hadoop is an open source big data framework written in Java for distributed storage and distributed processing of large datasets. At its core are two components: the Hadoop Distributed File System (HDFS), which stores data in replicated blocks across the machines in a cluster, and MapReduce, a programming model that processes that data in parallel where it lives.
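To make the MapReduce model concrete, here is a minimal word-count sketch using Hadoop Streaming, which runs ordinary scripts as the map and reduce steps (the canonical MapReduce API is Java; Streaming is used here only to keep the example short). The file names and paths are placeholders.

```python
#!/usr/bin/env python3
# mapper.py -- emit a (word, 1) pair for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Hadoop Streaming sorts mapper output by key before the
# reduce step, so counts for the same word arrive on consecutive lines.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

The job would be launched with something like `hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /in -output /out`, where the jar path and the HDFS input/output directories depend on your installation.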
Hadoop has been around since 2006 and has become a cornerstone of the big data arena. A slew of supporting tools has emerged around it to form the Hadoop ecosystem, including Apache Hive for SQL-like queries, Apache Pig for high-level data-flow scripts, Apache HBase for random-access storage on top of HDFS, and Apache ZooKeeper for cluster coordination.
The major disadvantage of Hadoop is that MapReduce is batch-oriented: each stage reads its input from disk and writes its output back to disk, which makes it a poor fit for real-time data processing.
Apache Spark is an open source big data framework created mainly to tackle the performance limitations of MapReduce-style computation; it became a top-level Apache project and reached its 1.0 release in 2014. Spark has grown increasingly popular among data scientists who want an answer quickly rather than waiting for a batch processing result the next day.
Spark reads data from other sources and stores it in a resilient distributed dataset (RDD), a read-only multiset of data items partitioned across a cluster of machines. Because RDDs can be cached in memory and reused across operations, Spark avoids much of the repeated disk I/O that dominates MapReduce jobs, and this is the main source of its speed. Spark is claimed to be up to 100 times faster than Hadoop MapReduce for some in-memory workloads, although some people are skeptical of that figure.
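A short PySpark sketch of that caching behavior, assuming a local Spark installation; the HDFS path is a placeholder:

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")

# Load a text file into an RDD and mark it for in-memory caching.
lines = sc.textFile("hdfs:///data/app.log").cache()

errors = lines.filter(lambda l: "ERROR" in l)
print(errors.count())   # first action: reads from storage and fills the cache
print(errors.take(5))   # later actions reuse the cached in-memory partitions

sc.stop()
```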
Spark is written in Scala and supports programs written in Scala, Python, Java, and R. Its ecosystem includes Spark SQL for structured data, Spark Streaming for near-real-time stream processing, MLlib for machine learning, and GraphX for graph processing.
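As a taste of Spark SQL, here is a minimal sketch that queries a hypothetical `people.json` file; the file name and its `name`/`age` fields are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Read a JSON file into a DataFrame and expose it as a SQL view.
people = spark.read.json("people.json")
people.createOrReplaceTempView("people")

adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()

spark.stop()
```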
Apache Spark does not have its own dedicated distributed storage solution; it relies on external systems such as HDFS, Amazon S3, or Apache Cassandra.
Hadoop and Spark are not mutually exclusive. A common deployment pairs them, using Hadoop's HDFS for storage and Spark for computation.
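A minimal sketch of that combination: the same word count as the Hadoop Streaming example above, expressed as a Spark job that reads from and writes to HDFS. The namenode address and paths are placeholders for a real cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-wordcount").getOrCreate()

counts = (spark.sparkContext
          .textFile("hdfs://namenode:9000/data/input.txt")  # read from HDFS
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))                  # compute with Spark

counts.saveAsTextFile("hdfs://namenode:9000/data/output")    # write back to HDFS
spark.stop()
```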