When big data frameworks come up in conversation, Apache Hadoop and Apache Spark are usually the first names mentioned. This post explains what they are and how they relate to each other.
Apache Hadoop is an open source big data framework written in Java for distributed storage and distributed processing of large datasets. At its core are two components: the Hadoop Distributed File System (HDFS), which stores data in replicated blocks across the machines in a cluster, and MapReduce, a programming model that processes that data in parallel where it lives.
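To make the MapReduce model concrete, here is a minimal word-count sketch using Hadoop Streaming, which runs ordinary scripts as the map and reduce steps (the canonical MapReduce API is Java; Streaming is used here only to keep the example short). The file names and paths are placeholders.

```python
#!/usr/bin/env python3
# mapper.py -- emit a (word, 1) pair for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Hadoop Streaming sorts mapper output by key before the
# reduce step, so counts for the same word arrive on consecutive lines.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

The job would be launched with something like `hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /in -output /out`, where the jar path and the HDFS input/output directories depend on your installation.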
Hadoop has been around since 2006 and has become a cornerstone of the big data arena. A slew of supporting tools has emerged around it to form the Hadoop ecosystem, including Apache Hive for SQL-like queries, Apache Pig for high-level data-flow scripts, Apache HBase for random-access storage on top of HDFS, and Apache ZooKeeper for cluster coordination.
The major disadvantage of Hadoop is that MapReduce is batch-oriented: each stage reads its input from disk and writes its output back to disk, which makes it a poor fit for real-time data processing.
Apache Spark is an open source big data framework created mainly to tackle the performance limitations of MapReduce-style computation; it became a top-level Apache project and reached its 1.0 release in 2014. Spark has grown increasingly popular among data scientists who want an answer quickly rather than waiting for a batch processing result the next day.
Spark reads data from other sources and stores it in a resilient distributed dataset (RDD), a read-only multiset of data items partitioned across a cluster of machines. Because RDDs can be cached in memory and reused across operations, Spark avoids much of the repeated disk I/O that dominates MapReduce jobs, and this is the main source of its speed. Spark is claimed to be up to 100 times faster than Hadoop MapReduce for some in-memory workloads, although some people are skeptical of that figure.
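A short PySpark sketch of that caching behavior, assuming a local Spark installation; the HDFS path is a placeholder:

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")

# Load a text file into an RDD and mark it for in-memory caching.
lines = sc.textFile("hdfs:///data/app.log").cache()

errors = lines.filter(lambda l: "ERROR" in l)
print(errors.count())   # first action: reads from storage and fills the cache
print(errors.take(5))   # later actions reuse the cached in-memory partitions

sc.stop()
```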
Spark is written in Scala and supports programs written in Scala, Python, Java, and R. Its ecosystem includes Spark SQL for structured data, Spark Streaming for near-real-time stream processing, MLlib for machine learning, and GraphX for graph processing.
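As a taste of Spark SQL, here is a minimal sketch that queries a hypothetical `people.json` file; the file name and its `name`/`age` fields are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Read a JSON file into a DataFrame and expose it as a SQL view.
people = spark.read.json("people.json")
people.createOrReplaceTempView("people")

adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()

spark.stop()
```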
Apache Spark does not have its own dedicated distributed storage solution; it relies on external systems such as HDFS, Amazon S3, or Apache Cassandra.
Hadoop and Spark are not mutually exclusive. A common deployment pairs them, using Hadoop's HDFS for storage and Spark for computation.
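A minimal sketch of that combination: the same word count as the Hadoop Streaming example above, expressed as a Spark job that reads from and writes to HDFS. The namenode address and paths are placeholders for a real cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-wordcount").getOrCreate()

counts = (spark.sparkContext
          .textFile("hdfs://namenode:9000/data/input.txt")  # read from HDFS
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))                  # compute with Spark

counts.saveAsTextFile("hdfs://namenode:9000/data/output")    # write back to HDFS
spark.stop()
```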