This is the third part in my series of blog posts about XGBoost, based on my 2024 GTC presentation. You can find Part 1 here and Part 2 here.
Today we want to talk about gradient boosted trees. Even though XGBoost has an option for purely linear boosting, it’s the non-linear version - built on decision trees - that gives the library its predictive power.
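To make that distinction concrete, here is a minimal sketch using XGBoost’s scikit-learn wrapper on synthetic data. The `booster` parameter switches between purely linear boosting and the default tree boosting; the data, target, and parameter values are invented for illustration only, and the exact scores will vary.

```python
# Contrast XGBoost's linear booster with its (default) tree booster on a
# synthetic target that contains a non-linear interaction.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=1000)  # interaction target

# Purely linear boosting: each round fits a linear model to the residuals.
linear_model = xgb.XGBRegressor(booster="gblinear", n_estimators=100)

# Tree boosting: each round fits a small decision tree to the residuals.
tree_model = xgb.XGBRegressor(booster="gbtree", n_estimators=100, max_depth=3)

for name, model in [("gblinear", linear_model), ("gbtree", tree_model)]:
    model.fit(X, y)
    print(name, "training R^2:", round(model.score(X, y), 3))
```

The linear booster cannot represent the interaction between the first two features, while the tree booster approximates it with successive splits, which is the gap in predictive power the paragraph above refers to.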
Decision trees are perhaps the simplest predictive algorithm to describe. Say you want to know if it’s raining outside. The only “feature” you have is whether it’s cloudy or not. You build an algorithm that predicts no rain when the sky is clear and rain when it’s overcast. Your prediction for the clear sky will be completely accurate, and the prediction for the overcast sky will be fairly accurate. You can then add other features - such as the density of the cloud coverage, wind conditions, temperature, and so on - and build an algorithm that uses all of those features to give you a simple yes or no answer. That’s essentially what a decision tree is. Decision trees are very easy to understand and implement for simple binary classification, but with a bit of work and ingenuity they can also handle regression problems.
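Here is a toy sketch of that rain example using scikit-learn’s `DecisionTreeClassifier`. The observations, feature values, and labels are all made up purely for illustration; the point is just to see the learned splits.

```python
# A tiny hand-made weather dataset and the decision tree it induces.
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: cloudy (0/1), cloud density (0-1), windy (0/1), temperature (C)
X = [
    [0, 0.0, 0, 25],
    [1, 0.2, 0, 22],
    [1, 0.8, 1, 18],
    [1, 0.9, 1, 15],
    [0, 0.1, 1, 27],
    [1, 0.7, 0, 17],
]
y = [0, 0, 1, 1, 0, 1]  # 1 = it rained, 0 = it did not

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Print the learned if/else rules, then classify a new observation.
print(export_text(tree, feature_names=["cloudy", "density", "windy", "temp"]))
print(tree.predict([[1, 0.95, 0, 16]]))  # dense overcast -> predicts rain
```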
The Benefits of Gradient Boosted Trees
Gradient boosted trees offer several advantages when dealing with real-world datasets, especially those that exhibit irregular or complex structures. Unlike some other methods, tree-based models are well-suited for capturing patterns in tabular data without requiring sophisticated preprocessing or extensive feature engineering. This adaptability makes them a practical and accessible choice for many tasks where data may not follow a consistent format.
Another key benefit is the minimal data preprocessing that trees require. Models like linear regression or neural networks often need normalized inputs, one-hot encoding, or other transformations to perform well. In contrast, gradient boosted trees can typically process raw data with little or no adjustment. This not only saves time but also reduces the risk of introducing errors during data preparation. Furthermore, many popular gradient boosting frameworks can handle missing values automatically, either by assigning them to a separate branch during training or learning optimal splits for null values. As a result, you often do not need to impute or remove missing records before modeling.
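As a small illustration of that last point, the sketch below trains an XGBoost classifier directly on raw synthetic data containing NaNs, with no scaling or imputation step. XGBoost routes missing values to a learned default branch at each split; the data and parameters here are invented for demonstration.

```python
# Train on raw features that contain missing values, without imputation.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Knock out roughly 10% of the entries to simulate missing records.
mask = rng.random(X.shape) < 0.10
X[mask] = np.nan

model = xgb.XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)  # NaNs are handled natively by the tree splits
print("training accuracy:", round(model.score(X, y), 3))
```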
Tree-based methods are also robust in the presence of outliers because they split the data at different threshold values rather than fitting a single global function. Large or anomalous data points do not disproportionately shift the model’s behavior, which helps maintain stable performance. In addition, tuning a gradient boosted tree model is relatively straightforward. You typically focus on hyperparameters such as the number of trees, tree depth, and learning rate. There is no need to design elaborate network architectures or layer configurations, as is required with many deep learning approaches.
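The tuning surface really is that small in many cases. Below is a minimal sketch of a grid search over the three hyperparameters mentioned above; the specific values are illustrative, not recommendations.

```python
# Sweep the typical GBT knobs: number of trees, tree depth, learning rate.
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 6],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    xgb.XGBRegressor(),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print("best parameters:", search.best_params_)
```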
An important byproduct of using gradient boosted trees is the insight they provide into feature importance. Because trees split data based on certain criteria, it is straightforward to see how often and how effectively particular features reduce model error. This information helps with both understanding the data and directing efforts toward feature selection and engineering. When processing large datasets, gradient boosting libraries can also take advantage of modern hardware, including GPUs, to train models faster. The parallel computations reduce training times significantly, making it possible to iterate quickly and refine results.
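A quick sketch of inspecting those importances, on synthetic data, is shown below. The gain-based score reports how much each feature reduced the training loss across all of its splits; feature names default to f0, f1, and so on when plain arrays are used.

```python
# Train a model, then rank features by the loss reduction ("gain") they provided.
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=6, n_informative=3,
                           random_state=0)

model = xgb.XGBClassifier(n_estimators=200, max_depth=4)
model.fit(X, y)

scores = model.get_booster().get_score(importance_type="gain")
for feature, gain in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(feature, round(gain, 3))

# On a machine with a compatible GPU, the same model can be trained on the GPU
# by passing device="cuda" (XGBoost 2.x); this is hardware-dependent, so it is
# left commented out here.
# gpu_model = xgb.XGBClassifier(n_estimators=200, max_depth=4, device="cuda")
```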
Finally, many gradient boosting libraries, such as XGBoost, LightGBM, and CatBoost, are easy to install, use, and maintain. They have extensive documentation, active user communities, and regular updates. This combination of simplicity, performance, and maintainability has made gradient boosted trees a popular and efficient choice in machine learning applications involving tabular data.
Comparing decision boundaries. Image taken from arXiv:2207.08815v1.
The Shortcomings of Gradient Boosted Trees
Gradient boosted trees rely on piecewise constant approximations and are therefore not inherently sensitive to linearities in the data. When a clear linear relationship exists, these models may require additional feature engineering—such as explicit interaction terms or polynomial expansions—to capture that structure effectively.
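The sketch below shows what that kind of feature engineering can look like in practice: an explicit interaction column is appended before training, and the two feature sets are compared on a held-out split. The data is invented and the exact scores will vary; it is only meant to illustrate the pattern.

```python
# Append an explicit interaction term as an engineered feature column.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 2))
y = 3.0 * X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=2000)

# Explicit feature engineering: add the product x0 * x1 as a third column.
X_eng = np.column_stack([X, X[:, 0] * X[:, 1]])

for name, features in [("raw features", X), ("with x0*x1 term", X_eng)]:
    X_tr, X_te, y_tr, y_te = train_test_split(features, y, random_state=0)
    model = xgb.XGBRegressor(n_estimators=200, max_depth=3).fit(X_tr, y_tr)
    print(name, "held-out R^2:", round(model.score(X_te, y_te), 3))
```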
Another limitation is their poor extrapolation ability. Gradient boosted trees learn from patterns observed within the training range, so predictions made on data that falls well outside this range can be unreliable. The models cannot extend learned trends in a way that accurately reflects behavior far beyond what was seen during training.
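This failure mode is easy to reproduce. In the minimal sketch below, a model is trained on a simple linear trend over x in [0, 10] and then asked to predict far outside that range; because each leaf outputs a constant, the predictions plateau near the largest target value seen during training instead of continuing the trend.

```python
# Demonstrate the extrapolation limit of tree ensembles on a linear trend.
import numpy as np
import xgboost as xgb

x_train = np.linspace(0, 10, 500).reshape(-1, 1)
y_train = 2.0 * x_train.ravel()  # y = 2x on the training range

model = xgb.XGBRegressor(n_estimators=200, max_depth=3)
model.fit(x_train, y_train)

# Inside the range (5, 10) the fit is good; outside it (20, 50) the
# predictions flatten out near y ~= 20 rather than following y = 2x.
x_test = np.array([[5.0], [10.0], [20.0], [50.0]])
print(model.predict(x_test))
```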
By design, boosting is a “shallow” method. Each iteration adds a small decision tree that is only a weak learner on its own. The strength of the final model comes from summing many of these weak learners, but the ensemble is built additively rather than by composing learned representations, which makes capturing complex, hierarchical patterns more difficult than with multi-layer methods such as deep neural networks.
Gradient boosted trees do not provide automatic feature engineering or selection mechanisms. They rely entirely on the given features and do not embed or transform them in the way some neural network architectures can. As a result, domain knowledge and thoughtful preprocessing are often critical for good performance.
Model sizes for gradient boosted trees can also be quite large. In some cases, thousands of individual trees must be stored to achieve the desired accuracy, which can lead to large files and increased memory requirements. This can pose challenges in environments where storage or memory is constrained.
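If you want to see this for yourself, the sketch below trains a deliberately large ensemble and checks its size on disk after serialization; the dataset, parameter values, and file name are all illustrative.

```python
# Measure the on-disk footprint of a large boosted-tree ensemble.
import os
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

model = xgb.XGBClassifier(n_estimators=2000, max_depth=8).fit(X, y)
model.save_model("big_model.json")

print("model size on disk:", round(os.path.getsize("big_model.json") / 1e6, 1), "MB")
```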
Finally, deploying gradient boosted tree models in production can be tricky. Integrating these models with existing systems may require specialized libraries for efficient inference. Performance considerations such as prediction latency and scaling to high volumes of requests can further complicate production deployments.
We Need More GBT Research
Even though GBTs are extremely capable and successful, there has been relatively little research on them. One main reason for the slow pace of research is that the basic gradient boosting algorithm has remained largely unchanged for many years. The core idea—building an ensemble of trees where each new tree corrects errors made by the previous ones—has proven very effective. Because the algorithm works well, many researchers have focused on applying it to real-world problems rather than rethinking the core method. Despite its success, there is still room for innovation, especially as new challenges in data and computational resources emerge. Small improvements or adjustments to the algorithm might lead to even better performance or more efficient training.
Another factor is that most popular GBT libraries are implemented in languages such as C, C++, or CUDA. These languages are chosen for their speed and efficiency, which are critical for handling large datasets. However, they are not as accessible as Python, which has become the language of choice for many researchers due to its ease of use and readability. When the primary implementations of GBTs are in lower-level languages, it can create a barrier for researchers who want to experiment with or extend the algorithms. This limits the community of researchers who can easily contribute new ideas and small improvements.
The third issue is that many GBT algorithms are not designed in a modular way. A modular codebase allows individual components of an algorithm to be modified or replaced without having to rewrite the entire system. In contrast, the typical GBT implementations are often monolithic. This means that making even small changes to the algorithm can require significant effort. A more modular design would enable researchers to experiment with different components, such as alternative loss functions, tree-building strategies, or regularization methods, without having to dive into a large and complex codebase. This could lead to more iterative improvements and faster innovation.
These challenges—limited evolution of the core algorithm, implementation barriers due to low-level languages, and a lack of modularity—contribute to why there has been relatively little recent research on gradient boosted trees. Addressing these issues could open the door to incremental improvements that build on the strong foundation of the current algorithms. More accessible and modular implementations would allow a wider range of researchers to experiment, share ideas, and develop new techniques that could make gradient boosted trees even more effective. In the long run, this kind of research can help push the boundaries of what these algorithms can achieve in various machine learning tasks.
Additionally, the small amount of research that is done on GBTs is often paywalled, not open sourced, or released once and never maintained. It would be great if academia gave extended grants to open source and maintain projects for X years after publication.