Apache Flink

Apache Flink describes itself, on its own project site, as “a framework and distributed processing engine for stateful computations over unbounded and bounded data streams.” It is designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale.

Flink’s defining trait is that it treats the stream as the primary abstraction. An unbounded stream has a start but no defined end and must be processed continuously as events arrive; a bounded stream, the kind a batch job consumes, is simply a stream that ends. Building on this model, Flink handles both real-time and batch workloads with the same engine and treats batch as a special case of streaming.
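The idea that batch is a special case of streaming can be illustrated without Flink at all. The following is a minimal plain-Python sketch, not Flink code: the names `running_totals`, `bounded_source`, and `unbounded_source` are hypothetical and chosen only for this illustration. It shows one operator, written once, consuming either a stream that ends or one that does not.

```python
from itertools import count, islice
from typing import Iterable, Iterator


def running_totals(events: Iterable[int]) -> Iterator[int]:
    """Emit the running sum after each event, one event at a time."""
    total = 0
    for value in events:
        total += value
        yield total


def bounded_source() -> Iterator[int]:
    """A bounded stream: it has a start and an end (what a batch job reads)."""
    return iter([3, 1, 4])


def unbounded_source() -> Iterator[int]:
    """An unbounded stream: it has a start but never ends."""
    return count(start=1)  # 1, 2, 3, ... forever


# Bounded input: the operator runs until the stream ends, then we have a final result.
batch_result = list(running_totals(bounded_source()))

# Unbounded input: the same operator, but only a prefix of the output can ever be observed.
first_five = list(islice(running_totals(unbounded_source()), 5))
```

The operator never asks whether its input will end. Only the caller knows, and that is the sense in which a batch job is a streaming job whose input happens to terminate.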

The project emphasizes correctness guarantees, including exactly-once state consistency: even after a failure and recovery, each event affects the application’s state exactly once, so the computed results are the same as if no failure had occurred. It offers layered APIs, from high-level SQL down through the DataStream API to the low-level ProcessFunction, and targets event-driven applications, stream and batch analytics, and data pipeline or ETL work.
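One common way to obtain exactly-once state consistency is to checkpoint the operator state together with the input position, and after a failure to restore both and replay from that position. The following is a toy, single-process sketch of that general idea in plain Python. It is not Flink’s API and does not model its distributed snapshotting; the function `count_words` and its parameters `checkpoint_every` and `fail_at` are hypothetical names introduced for this illustration.

```python
from typing import Dict, List, Optional, Tuple


def count_words(
    events: List[str],
    checkpoint_every: int,
    fail_at: Optional[int] = None,
) -> Dict[str, int]:
    """Count words, optionally simulating one crash just before offset `fail_at`."""
    state: Dict[str, int] = {}
    # A checkpoint stores the next offset to read AND a copy of the state, taken together.
    checkpoint: Tuple[int, Dict[str, int]] = (0, {})
    offset = 0
    failed = False

    while offset < len(events):
        if fail_at is not None and offset == fail_at and not failed:
            # Simulated crash: in-memory progress since the last checkpoint is lost.
            failed = True
            offset, saved_state = checkpoint
            state = dict(saved_state)  # restore state and rewind the input together
            continue

        word = events[offset]
        state[word] = state.get(word, 0) + 1
        offset += 1

        if offset % checkpoint_every == 0:
            checkpoint = (offset, dict(state))

    return state


events = ["a", "b", "a", "c", "a"]
without_failure = count_words(events, checkpoint_every=2)
with_failure = count_words(events, checkpoint_every=2, fail_at=3)
```

In this sketch the event at offset 2 is processed twice, once before the simulated crash and once during replay, yet it is counted only once in the final state because the state was rolled back to the same point as the input. The guarantee described here covers state; making external side effects exactly-once needs additional cooperation from the output system.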

Flink grew out of Stratosphere, a European research project based in Berlin, and became a top-level Apache Software Foundation project. It now sits alongside systems such as Apache Spark and Apache Kafka at the center of modern real-time data architectures.
