Dask

Dask is, in the words of its own documentation, “a Python library for parallel and distributed computing.” Its repository carries the tagline “Parallel computing with task scheduling” and describes it as “a flexible parallel computing library for analytics.” Dask’s purpose is to let analysts and engineers process data and run computations that are too large or too slow for a single core or a single machine, while keeping the familiar interfaces of the Python scientific stack.

The project emerged from the scientific Python community and saw its first releases in 2015. Its key idea is that many large computations can be expressed as a graph of small tasks with dependencies between them. Dask builds these task graphs lazily, then hands them to a scheduler that executes the tasks in parallel, whether across the threads and processes of one computer or across many machines in a cluster. This separation between describing a computation and scheduling it is what lets the same code scale from a laptop to a large deployment.

Dask is best known for its high-level collections that mirror tools developers already use. Dask DataFrame presents a pandas-like interface backed by many smaller pandas DataFrames partitioned along the index, and Dask Array does the same for NumPy by tiling large arrays into chunks. There is also Dask Bag for semi-structured data and a Futures API for fine-grained, real-time task submission. Because these collections imitate the APIs of pandas and NumPy, code written against them can often process out-of-core datasets, those larger than memory, with few changes.

Beyond its own collections, Dask integrates with the wider ecosystem, including scikit-learn for parallel and distributed machine learning workflows. Its scheduler can run on a single machine for convenience or coordinate work across a cluster for scale, and it exposes a dashboard for observing how task graphs execute.

Dask occupied a distinctive niche in data engineering: rather than introducing a new programming model the way some cluster-computing frameworks did, it extended the existing NumPy and pandas idioms that Python data scientists already knew. This made it a natural on-ramp from single-machine analysis to parallel and distributed computing, complementing engines such as Apache Spark while staying close to native Python.

Sources

Related