XGBoost: A Scalable Tree Boosting System

“XGBoost: A Scalable Tree Boosting System” by Tianqi Chen and Carlos Guestrin was posted to arXiv in March 2016 and presented at the KDD conference that year. It describes an engineering-focused implementation of gradient-boosted decision trees that became one of the most widely used machine learning tools of the decade.

Gradient boosting builds a predictive model by adding decision trees one at a time, with each new tree trained to correct the errors of the ones before it. XGBoost made this idea fast and practical at scale. Its main contributions are a sparsity-aware algorithm that handles missing values and sparse data efficiently, a weighted quantile sketch that lets the system find good split points approximately on huge datasets, and low-level optimizations around cache access patterns, data compression, and sharding of data across disks for out-of-core computation. The authors report that the system scales beyond billions of examples using far fewer resources than earlier systems.
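To make the "one tree at a time" idea concrete, here is a minimal sketch of gradient boosting for squared error, written in plain Python. It is an illustration under simplifying assumptions, not XGBoost's algorithm or API: each "tree" is a single-split stump on one feature, and the sketch leaves out regularization, sparsity handling, and the quantile sketch. All function names and the toy data are hypothetical.

```python
# Toy gradient boosting with one-split "stumps" on a single numeric feature.
# Illustration only: names and data are made up, and this is not XGBoost's code.

def fit_stump(xs, residuals):
    """Pick the threshold that best fits the residuals with two constant values."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # the largest value cannot split the data
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: fall back to a constant prediction
        mean = sum(residuals) / len(residuals)
        return xs[0], mean, mean
    return best[1], best[2], best[3]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, n_trees=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the remaining errors."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # For squared error, the residuals are the negative gradient of the loss.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def predict_boosted(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((predict_boosted(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Made-up data: y grows roughly linearly with x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1]

one_tree = fit_boosted(xs, ys, n_trees=1)
many_trees = fit_boosted(xs, ys, n_trees=50)
print(mse(one_tree, xs, ys), mse(many_trees, xs, ys))
```

On this toy data the training error with 50 stumps is lower than with one, because every added stump is fitted to whatever the earlier ones still get wrong. A production system such as XGBoost grows deeper trees over many features and adds the engineering described above.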

For several years XGBoost was the algorithm behind a large share of winning entries in Kaggle data science competitions, especially on structured, tabular data of the kind most businesses actually have. The paper itself notes that 17 of the 29 winning solutions published on Kaggle's blog during 2015 used XGBoost. It remains a default choice when a team wants strong accuracy on spreadsheet-style data without the cost and complexity of deep learning.

Why business readers should care: most real corporate data lives in tables of rows and columns, and XGBoost is one of the most reliable tools for turning that data into accurate predictions of churn, fraud, demand, or risk.
