Random Forests

“Random Forests” by Leo Breiman was published in the journal Machine Learning, volume 45, in 2001. It introduced one of the most reliable and widely used algorithms in all of machine learning.

A random forest is an ensemble of decision trees. Each tree is trained on a bootstrap sample of the data (a sample of the same size drawn with replacement), as in bagging, with one extra twist: at each split, the tree may consider only a random subset of the available features rather than all of them. This double dose of randomness makes the individual trees less correlated with one another, so that averaging their predictions or combining their votes cancels out more of each tree’s individual error. Breiman derived an upper bound on the forest’s generalization error that depends on how strong the individual trees are and how correlated they are: stronger, less correlated trees give a lower bound. He also showed that the error converges to a limit as trees are added, so adding more trees does not cause overfitting.
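The two sources of randomness can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Breiman's implementation: each "tree" is a single split (a decision stump) rather than a full tree, so the random feature subset is drawn once per tree instead of at every split. The toy data and the names fit_stump, fit_forest, predict and n_candidate_features are hypothetical, chosen for this example only.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def fit_stump(X, y, n_candidate_features, rng):
    """Fit a one-split tree, looking only at a random subset of the features."""
    n_features = len(X[0])
    # Randomness 2: only these randomly chosen features may be used to split.
    candidates = rng.sample(range(n_features), n_candidate_features)
    majority = Counter(y).most_common(1)[0][0]
    best = None  # (impurity, feature, threshold, left label, right label)
    for f in candidates:
        for threshold in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= threshold]
            right = [lab for row, lab in zip(X, y) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, threshold,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:
        # No usable split in this sample: always predict the majority class.
        return (0, float("inf"), majority, majority)
    return best[1:]


def fit_forest(X, y, n_trees=25, n_candidate_features=1, seed=0):
    """Train n_trees stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Randomness 1: bootstrap sample (draw len(X) rows with replacement).
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        X_boot = [X[i] for i in idx]
        y_boot = [y[i] for i in idx]
        forest.append(fit_stump(X_boot, y_boot, n_candidate_features, rng))
    return forest


def predict(forest, row):
    """Classify one row by majority vote over all trees."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]


# Toy data (made up for illustration): two features, two classes.
X = [[1, 10], [2, 11], [3, 12], [8, 20], [9, 21], [10, 22]]
y = [0, 0, 0, 1, 1, 1]

forest = fit_forest(X, y)
print(predict(forest, [1, 10]), predict(forest, [10, 22]))
```

Any single stump here can be wrong near the class boundary, because its bootstrap sample may have missed the rows that define it. The vote across many differently trained stumps is what makes the combined prediction stable.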

Random forests are fast to train, need little tuning, handle large numbers of features, and give useful estimates of which features matter most. For two decades they have been a default first choice for classification and regression on tabular data, alongside the gradient-boosting methods that sometimes edge them out on accuracy.

Why business readers should care: when a team needs a strong, low-fuss predictive model on ordinary business data and does not have time to hand-tune it, the random forest is the algorithm they most often reach for.

Sources

Related