K-Means Clustering

K-means is the best-known algorithm for clustering: grouping data into a chosen number of clusters, k, so that points within a cluster are similar to each other. The name and one of the standard formulations come from James MacQueen’s 1967 paper “Some Methods for Classification and Analysis of Multivariate Observations,” published in the proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. A closely related procedure had been described earlier by Stuart Lloyd at Bell Labs.

The algorithm is simple and iterative. Start with k guessed cluster centers. Assign each data point to its nearest center. Move each center to the average position of the points assigned to it. Repeat until the assignments stop changing. The result partitions the data into k groups, each represented by its center. K-means is unsupervised: it finds structure in data without being given any labels.
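The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name kmeans, its parameters (max_iters, seed), the choice of Euclidean distance, and the rule that an empty cluster keeps its previous center are all assumptions made for this example.

```python
import math
import random


def kmeans(points, k, max_iters=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups.

    Returns (centers, labels), where labels[i] is the index of the
    center nearest to points[i]. Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    # Start with k guessed centers: here, k data points chosen at random.
    centers = rng.sample(points, k)
    labels = None

    for _ in range(max_iters):
        # Assign each point to its nearest center (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels

        # Move each center to the average position of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))

    return centers, labels


# Two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = kmeans(points, k=2)
```

On this toy data the first three points end up sharing one label and the last three the other. Which group receives label 0 depends on the random starting centers.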

Its weaknesses are well known. The user must choose k in advance, the outcome depends on the random starting centers, and the algorithm favors roughly spherical clusters of similar size. Even so, its speed and simplicity keep it in constant use for customer segmentation, for image compression, and as a quick first look at the structure of a dataset.
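The dependence on starting centers can be made concrete with a small, hedged example. The sketch below scores two sets of centers on the same four points using the within-cluster sum of squared distances, the quantity k-means tries to reduce. The function name within_cluster_ss and the toy data are assumptions chosen for illustration. Both sets of centers are stable: neither changes under another assign-and-average pass, yet one fits the data far better than the other.

```python
import math


def within_cluster_ss(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)


# Four corners of a wide rectangle, to be split into k = 2 clusters.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Left/right split: each center sits between two nearby corners.
good_centers = [(0.0, 0.5), (4.0, 0.5)]
# Top/bottom split: each center sits between two distant corners.
poor_centers = [(2.0, 0.0), (2.0, 1.0)]

good_score = within_cluster_ss(points, good_centers)
poor_score = within_cluster_ss(points, poor_centers)
print(good_score, poor_score)
```

A common way to soften this weakness is to run the algorithm from several random starts and keep the result with the lowest score.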

Why business readers should care: k-means is the default tool for turning a pile of unlabeled records - customers, transactions, sensor readings - into a handful of meaningful segments that teams can actually act on.
