Apache Spark

Apache Spark began in 2009 as a research project at the University of California, Berkeley’s AMPLab, led by Matei Zaharia and collaborators, and was open-sourced in 2010. Its central idea was published at USENIX NSDI 2012 in the paper “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing” by Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. The paper introduced the Resilient Distributed Dataset (RDD), an immutable, partitioned collection of records that can be rebuilt from its lineage (the record of transformations used to derive it) after a failure. This allows data to be kept in memory and reused across many operations.

The motivation was that MapReduce-style frameworks forced programs to write intermediate results to disk between jobs, which was slow for the iterative algorithms common in machine learning and for interactive data analysis. By holding working sets in memory and tracking how each dataset was derived, Spark could recompute only the lost partitions instead of replicating data, providing fault tolerance while avoiding repeated disk writes.
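The lineage idea can be illustrated with a small, self-contained Python sketch. This is a toy model written for this article, not Spark's implementation or API: the class `ToyRDD` and its methods `from_list`, `map`, `partition`, `lose_partition`, and `collect` are hypothetical names, and the sketch assumes every computed partition is kept in memory (real Spark keeps partitions in memory only when asked to persist them).

```python
from typing import Callable, Dict, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD: partitioned, read-only, rebuilt from lineage."""

    def __init__(self, num_partitions: int,
                 compute: Callable[[int], List[int]],
                 parent: Optional["ToyRDD"] = None) -> None:
        self.num_partitions = num_partitions
        self._compute = compute              # recipe for deriving one partition
        self.parent = parent                 # lineage pointer (None for source data)
        self._cache: Dict[int, List[int]] = {}  # partitions currently in memory
        self.computed = 0                    # how many partition computations ran

    @classmethod
    def from_list(cls, data: List[int], num_partitions: int) -> "ToyRDD":
        # Split the input into contiguous chunks; the chunks play the role
        # of stable storage that can always be re-read.
        size = -(-len(data) // num_partitions)  # ceiling division
        chunks = [data[k * size:(k + 1) * size] for k in range(num_partitions)]
        return cls(num_partitions, lambda i: list(chunks[i]))

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # A transformation returns a new dataset that remembers its parent
        # and the function applied, rather than modifying data in place.
        parent = self
        return ToyRDD(self.num_partitions,
                      lambda i: [fn(x) for x in parent.partition(i)],
                      parent)

    def partition(self, i: int) -> List[int]:
        # Serve from memory if present; otherwise recompute from lineage.
        if i not in self._cache:
            self._cache[i] = self._compute(i)
            self.computed += 1
        return self._cache[i]

    def lose_partition(self, i: int) -> None:
        # Simulate a node failure that drops one in-memory partition.
        self._cache.pop(i, None)

    def collect(self) -> List[int]:
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


base = ToyRDD.from_list(list(range(6)), num_partitions=2)
squares = base.map(lambda x: x * x)

first = squares.collect()      # computes both partitions of both datasets

# Drop partition 1 everywhere, as if the machine holding it had failed.
squares.lose_partition(1)
base.lose_partition(1)

second = squares.collect()     # only partition 1 is recomputed, via the lineage chain
```

After the simulated failure, the second `collect` returns the same result as the first while recomputing a single partition of each dataset; partition 0 is served from memory and no copy of the data was ever replicated.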

Today Spark presents itself, in its own words, as a multi-language engine for executing data engineering, data science, and machine learning on single machines or clusters. The project’s site describes it as simple, fast, scalable, and unified, with support for Python, SQL, Scala, Java, and R, and built-in capabilities spanning batch and streaming data, distributed SQL analytics, large-scale data science, and machine learning.

Spark entered the Apache Software Foundation’s incubator in 2013 and graduated to a top-level Apache project in February 2014. It grew into one of the most widely used frameworks for distributed data processing, and its creators founded the company Databricks in 2013 to build on it.
