HDFS (Hadoop Distributed File System)

The Hadoop Distributed File System (HDFS) is the storage layer of Apache Hadoop. The Hadoop project page describes HDFS as the module that “provides high-throughput access to application data.” Its design document explains that HDFS is built to hold enormous files: “A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files.”

HDFS stores each file as a sequence of large blocks spread across the cluster. The design document notes that “a typical block size used by HDFS is 128 MB,” so “an HDFS file is chopped up into 128 MB chunks, and if possible, each chunk will reside on a different DataNode.” Splitting files this way lets many machines read and process different parts of the same file in parallel.

Reliability comes from replication. According to the design document, “the blocks of a file are replicated for fault tolerance,” and “the block size and replication factor are configurable per file.” When a machine holding some replicas fails, the cluster detects the loss and re-creates the missing copies elsewhere, so data survives the routine failure of individual nodes.

HDFS follows the design of the Google File System described in Google’s GFS paper, adapting the same master-and-workers, block-based, replicated approach into open source. As Hadoop’s storage foundation, HDFS is what higher-level systems such as Hive and HBase sit on top of.

HDFS (Hadoop Distributed File System)

Sources

Related