Apache Parquet

Apache Parquet is, in the words of its own documentation, “an open source, column-oriented data file format designed for efficient data storage and retrieval.” Where a row-oriented format writes each record’s fields together, Parquet groups the values of each column together on disk. This columnar layout is what makes it well suited to analytics: queries that touch only a few columns can read just those columns, and storing similar values together enables strong, per-column compression.

Parquet originated in 2013 as a collaboration between engineers at Twitter and Cloudera, who set out to bring efficient columnar storage to the Hadoop ecosystem in a way that was open and not tied to a single processing engine. The format’s specification is maintained in the parquet-format repository, which states that Parquet was “built from the ground up with complex nested data structures in mind, and uses the record shredding and assembly algorithm described in the Dremel paper.” That heritage lets Parquet represent deeply nested records, not just flat tables, while still storing them column by column.

The physical layout reflects these goals. A Parquet file is bracketed by a 4-byte magic number, “PAR1”, and organized into row groups that each contain column chunks, with file-level metadata describing the schema and the location and statistics of every chunk. Because the metadata records per-column statistics such as minimum and maximum values, readers can skip entire row groups that cannot match a query, an optimization often called predicate pushdown. The specification also notes that “Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings,” so different columns can use whatever encoding and compression fit their data best.

Parquet pairs naturally with in-memory columnar formats. Its on-disk columnar discipline maps closely onto Apache Arrow’s in-memory columnar layout, letting systems read Parquet files into Arrow buffers and operate on them with little conversion, so the same column-oriented model runs from storage through computation.

Over the following decade Parquet became the default file format for analytical data at rest. It is the storage layer beneath query engines such as Apache Spark, the format of choice for files in HDFS and cloud object stores, and the foundation that data lakes and the table formats built on top of them assume. By providing an open, engine-neutral, compression-friendly columnar format, Parquet became one of the quiet standards on which modern data engineering rests.

Sources

Related