Polars

Polars is a DataFrame library for working with structured, tabular data, designed first and foremost for speed. Its own documentation describes it as “a blazingly fast DataFrame library for manipulating structured data,” noting that “the core is written in Rust, and available for Python, R and NodeJS.” The project repository frames it as “an analytical query engine written for DataFrames,” emphasizing that it is “designed to be fast, easy to use and expressive.”

The library was created by Ritchie Vink, who began the project in 2020 and grew it into a widely used alternative to pandas. Polars was built from the start on the Apache Arrow columnar memory format. This lets it consume and produce Arrow data, often through zero-copy operations, and it stores each column's values contiguously in memory, a layout that suits CPU caches and vectorized instructions. Writing the core in Rust gave the project memory safety without a garbage collector, along with predictable, low-level control over performance.

A defining feature of Polars is its lazy API. Rather than executing each operation immediately, the lazy interface lets a user describe a whole pipeline of transformations, which Polars then analyzes and optimizes as a single query plan before running it. The engine applies optimizations such as predicate pushdown (applying filters as early as possible, ideally while the data is being scanned) and projection pushdown (reading only the columns the query actually uses), so that only the necessary data is read and computed. Polars also offers an eager API for interactive use, multi-threaded execution across CPU cores, SIMD acceleration, and a streaming mode for processing datasets larger than available RAM.

Polars exposes an expression-based syntax in which transformations are written as composable expressions over columns, which the query engine can reason about and parallelize. This contrasts with the more imperative, index-centric style of pandas, and it is part of what allows Polars to optimize and parallelize work that would otherwise run sequentially.

By combining Rust, the Arrow format, and a query optimizer, Polars brought ideas more often associated with databases and distributed engines into the single-machine DataFrame world. It became a prominent example of how the data-engineering ecosystem could reuse the shared Arrow substrate to build faster tools, and it stands alongside pandas as one of the most-used DataFrame libraries in the Python data stack.
