CatBoost: Unbiased Boosting with Categorical Features

“CatBoost: unbiased boosting with categorical features” by Liudmila Prokhorenkova and colleagues at Yandex was posted to arXiv in 2017 and presented at NeurIPS 2018. CatBoost is the third member, alongside XGBoost and LightGBM, of the family of gradient-boosting libraries that dominate tabular machine learning.

The paper identifies a subtle problem the authors call prediction shift, a form of target leakage. At every boosting step, standard gradient boosting estimates gradients with a model that was itself fitted on the same training examples, so its predictions on training points are distributed differently from its predictions on unseen points. CatBoost’s answer is ordered boosting, a permutation-based scheme in which the residual for each example is computed by a model trained only on the examples that precede it in a random ordering. The library applies the same idea to categorical features, such as city names or product categories: each category value is replaced by a target statistic computed only from earlier examples, which avoids the leakage of ordinary target encoding. This focus on categorical features is what gives CatBoost its name.
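
The leakage-free encoding of categorical features mentioned above can be sketched as follows. This is a minimal illustration in Python with NumPy, not CatBoost's actual implementation: the function names, the smoothing weight a, the prior, and the toy data are all assumptions made for the example. Each row is encoded with a smoothed mean of the targets of same-category rows that come earlier in a permutation, so a row's own target never enters its encoding. A naive encoder that uses every row is included for contrast.

```python
import numpy as np


def greedy_target_stat(categories, targets, prior, a=1.0):
    """Naive target encoding (leaky): each row's value is computed from
    every row of its category, including the row's own target."""
    categories = np.asarray(categories)
    targets = np.asarray(targets, dtype=float)
    encoded = np.empty(len(targets))
    for i, c in enumerate(categories):
        same = categories == c
        encoded[i] = (targets[same].sum() + a * prior) / (same.sum() + a)
    return encoded


def ordered_target_stat(categories, targets, prior, a=1.0, perm=None, seed=0):
    """Ordered target statistic: each row's value is computed only from
    same-category rows that appear earlier in the permutation.

    `a` (smoothing weight) and `prior` (fallback value for unseen
    categories) are illustrative parameters of this sketch.
    """
    targets = np.asarray(targets, dtype=float)
    if perm is None:
        perm = np.random.default_rng(seed).permutation(len(targets))
    sums, counts = {}, {}  # running target sum and row count per category
    encoded = np.empty(len(targets))
    for idx in perm:
        c = categories[idx]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        # Encode first, using history only ...
        encoded[idx] = (s + a * prior) / (n + a)
        # ... then add this row's target to the history.
        sums[c] = s + targets[idx]
        counts[c] = n + 1
    return encoded


# Hypothetical toy data: a city label and a binary outcome per row.
cities = ["A", "B", "A", "A", "B"]
clicked = [1, 0, 1, 0, 1]
prior = float(np.mean(clicked))  # overall target mean used as the prior

naive = greedy_target_stat(cities, clicked, prior)
ordered = ordered_target_stat(cities, clicked, prior, perm=[0, 1, 2, 3, 4])
```

In the naive version, every row of a category receives the same value, and that value depends on the row's own label. In the ordered version, the first row seen for each category falls back to the prior, and later rows see progressively more history. The paper additionally uses several permutations to reduce the variance this introduces for early rows; the sketch uses a single one for brevity.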

CatBoost is known for working well out of the box with little tuning, and for handling categorical data, a common pain point in business datasets, without manual encoding.

Why business readers should care: a large share of useful business variables are categories rather than numbers, and CatBoost reduces the hand-engineering needed to feed them into an accurate predictive model.

