Why Do Tree-Based Models Still Outperform Deep Learning on Tabular Data?

“Why do tree-based models still outperform deep learning on tabular data?” by Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux was presented at NeurIPS 2022. Amid a stream of papers claiming deep networks could finally beat tree ensembles on tabular data, this work ran a careful, standardized benchmark to check, and reached a sobering conclusion.

Across 45 datasets, the authors found that tree-based models such as gradient-boosted trees and random forests remained state of the art on medium-sized data, around ten thousand samples, even before accounting for the fact that they are faster and need less tuning. More valuable than the verdict was the explanation. The paper identifies three properties of tabular data that trip up neural networks: many features are uninformative, and trees are better at ignoring them; neural nets are biased toward overly smooth functions, while real tabular targets are often irregular; and standard neural nets are rotationally invariant, meaning they treat any rotation (mixing) of the feature space the same way, whereas tabular columns have a fixed, meaningful identity that trees exploit by splitting on one column at a time. The authors released their benchmark so future deep models could be tested fairly.
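To make the point about column identity concrete, here is a toy sketch in plain Python. It is not the paper's experiment: the data, the 45-degree angle, the seed, and the helper names `best_stump_accuracy` and `rotate` are all assumptions chosen for illustration. It fits the simplest tree-style rule, a single threshold on a single column, to data whose label depends on one column only, and then fits it again after the two columns have been rotated into each other.

```python
import math
import random


def best_stump_accuracy(X, y):
    """Best training accuracy of any single-feature threshold split (a depth-1 tree)."""
    n = len(y)
    total_ones = sum(y)
    best = 0.0
    for j in range(len(X[0])):
        # Sort sample indices by feature j, then sweep the threshold left to right.
        order = sorted(range(n), key=lambda i: X[i][j])
        ones_left = 0
        for k, i in enumerate(order[:-1], start=1):
            ones_left += y[i]
            # Predict 0 left of the threshold and 1 right of it.
            correct = (k - ones_left) + (total_ones - ones_left)
            # The opposite labelling scores n - correct, so keep the better one.
            best = max(best, correct / n, (n - correct) / n)
    return best


def rotate(X, degrees):
    """Rotate 2-D points counter-clockwise, mixing the two columns together."""
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    return [(c * a - s * b, s * a + c * b) for a, b in X]


rng = random.Random(0)  # fixed seed, chosen arbitrarily for this illustration
X = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(500)]
y = [1 if a > 0 else 0 for a, _ in X]  # the label depends on column 0 only

acc_original = best_stump_accuracy(X, y)
acc_rotated = best_stump_accuracy(rotate(X, 45), y)
print(f"original columns: {acc_original:.2f}")
print(f"rotated columns:  {acc_rotated:.2f}")
```

In the original columns a single split separates the classes perfectly. After rotation the class boundary runs diagonally, so no single-column threshold can match it, and accuracy in this setup falls to roughly three quarters. A learner that treats every rotation of the inputs alike has no way to take advantage of the fact that the original columns were the informative ones.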

The paper became the standard citation for the practical advice that, on typical business tables, you should reach for gradient boosting first.

Why business readers should care: it gives an evidence-based reason to default to proven tree ensembles for spreadsheet-style data rather than chasing deep learning hype.
