<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[XGBlog]]></title><description><![CDATA[My personal Substack]]></description><link>https://www.xgblog.ai</link><image><url>https://substackcdn.com/image/fetch/$s_!5cGh!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F784750bf-c764-473e-bd10-4e2d7f1353cc_400x400.jpeg</url><title>XGBlog</title><link>https://www.xgblog.ai</link></image><generator>Substack</generator><lastBuildDate>Tue, 28 Apr 2026 12:11:40 GMT</lastBuildDate><atom:link href="https://www.xgblog.ai/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Bojan Tunguz]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[xgblog@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[xgblog@substack.com]]></itunes:email><itunes:name><![CDATA[Bojan Tunguz]]></itunes:name></itunes:owner><itunes:author><![CDATA[Bojan Tunguz]]></itunes:author><googleplay:owner><![CDATA[xgblog@substack.com]]></googleplay:owner><googleplay:email><![CDATA[xgblog@substack.com]]></googleplay:email><googleplay:author><![CDATA[Bojan Tunguz]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Deep Learning and its Discontents]]></title><description><![CDATA[A few general musings inspired by a recent tweet]]></description><link>https://www.xgblog.ai/p/deep-learning-and-its-discontents</link><guid isPermaLink="false">https://www.xgblog.ai/p/deep-learning-and-its-discontents</guid><dc:creator><![CDATA[Bojan Tunguz]]></dc:creator><pubDate>Thu, 06 Nov 2025 13:54:02 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!HQoM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb791f2f-1ed5-4dfa-a4dc-43957e6f5efa_890x607.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Yesterday I came across a tweet that praised &#8220;Deep Learning&#8221;, a classic ML/AI textbook. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HQoM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb791f2f-1ed5-4dfa-a4dc-43957e6f5efa_890x607.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HQoM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb791f2f-1ed5-4dfa-a4dc-43957e6f5efa_890x607.png 424w, https://substackcdn.com/image/fetch/$s_!HQoM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb791f2f-1ed5-4dfa-a4dc-43957e6f5efa_890x607.png 848w, https://substackcdn.com/image/fetch/$s_!HQoM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb791f2f-1ed5-4dfa-a4dc-43957e6f5efa_890x607.png 1272w, https://substackcdn.com/image/fetch/$s_!HQoM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb791f2f-1ed5-4dfa-a4dc-43957e6f5efa_890x607.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HQoM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb791f2f-1ed5-4dfa-a4dc-43957e6f5efa_890x607.png" width="890" height="607" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cb791f2f-1ed5-4dfa-a4dc-43957e6f5efa_890x607.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:607,&quot;width&quot;:890,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1370727,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://bojan.substack.com/i/178176438?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb791f2f-1ed5-4dfa-a4dc-43957e6f5efa_890x607.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!HQoM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb791f2f-1ed5-4dfa-a4dc-43957e6f5efa_890x607.png 424w, https://substackcdn.com/image/fetch/$s_!HQoM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb791f2f-1ed5-4dfa-a4dc-43957e6f5efa_890x607.png 848w, https://substackcdn.com/image/fetch/$s_!HQoM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb791f2f-1ed5-4dfa-a4dc-43957e6f5efa_890x607.png 1272w, https://substackcdn.com/image/fetch/$s_!HQoM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb791f2f-1ed5-4dfa-a4dc-43957e6f5efa_890x607.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The author of the tweet credited that textbook for his own career success. I retweeted it, and commented, half tongue in cheek, that &#8220;I basically owe my career to ignoring this book.&#8221; Now, before we go any further, a disclaimer is in order: I have nothing against the author of the original tweet (I like him and consider him a friend), nor the authors of the above textbook. They are all exceedingly respectable researchers in the field, and have been rightly credited for some of the most important breakthroughs over the past few decades. </p><p>I actually started reading &#8220;Deep Learning&#8221; before it was even officially published - the authors were making the pdf version of the work in progress freely available online. 
At the time I was still relatively new to the world of Machine Learning, and was trying, on the side, to read some more foundational material. But even then I was mostly uninterested in applying the same approach to learning that I had used in my academic career - poring over intricate and arcane knowledge in order to master it through long, painstaking study. These days (as confirmed by the reactions to my tweet), the main issue most people have with &#8220;Deep Learning&#8221; is that it&#8217;s a purely theoretical book with no applications. My issue with it, however, is that it&#8217;s an absolutely terrible textbook. There are no worked-out examples, not even for the theoretical problems, and there are no exercises. You are just supposed to raw-dog the material and absorb it as you go along. Its pedagogical quality is pretty abysmal. </p><p>The second reason the textbook was not important to <strong>me</strong>, rather minor in the grand scheme of things, was that at some point I decided that for what I really cared about - predictive modeling and machine learning for tabular data - deep learning was not all that helpful or relevant. Right out of the box, neural networks don&#8217;t perform as well as tree-based algorithms. Neural networks can add predictive power, but primarily through ensembling with other models. Furthermore, in those domains non-algorithmic considerations have an even greater impact on your model. That is to say, things like data quality, feature selection, and feature engineering are far more relevant. I&#8217;ve already written about this in previous posts on this blog, and will probably go into more detail in subsequent posts. </p><p>In general, when it comes to any <strong>career</strong> in the tech industry, my biggest recommendation for getting good at technical skills is to practice those skills. 
You will gain much more from being able to solve immediate and relevant problems than from poring over theoretical knowledge. If you still have an itch to dig deeper into the theory later on, then by all means do so! But I would warn you against doing it as a way of &#8220;procrastinating&#8221; on acquiring relevant practical knowledge. </p>]]></content:encoded></item><item><title><![CDATA[Dealing With Missing Values, Part 3.]]></title><description><![CDATA[More Advanced Methods]]></description><link>https://www.xgblog.ai/p/dealing-with-missing-values-part-21b</link><guid isPermaLink="false">https://www.xgblog.ai/p/dealing-with-missing-values-part-21b</guid><dc:creator><![CDATA[Bojan Tunguz]]></dc:creator><pubDate>Sat, 07 Jun 2025 12:49:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!KiRB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45046c08-8621-47ef-9e6d-d2421a3b9323_15000x7850.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KiRB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45046c08-8621-47ef-9e6d-d2421a3b9323_15000x7850.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KiRB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45046c08-8621-47ef-9e6d-d2421a3b9323_15000x7850.jpeg 424w, https://substackcdn.com/image/fetch/$s_!KiRB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45046c08-8621-47ef-9e6d-d2421a3b9323_15000x7850.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!KiRB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45046c08-8621-47ef-9e6d-d2421a3b9323_15000x7850.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!KiRB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45046c08-8621-47ef-9e6d-d2421a3b9323_15000x7850.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KiRB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45046c08-8621-47ef-9e6d-d2421a3b9323_15000x7850.jpeg" width="1456" height="762" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/45046c08-8621-47ef-9e6d-d2421a3b9323_15000x7850.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:762,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:9673520,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.xgblog.ai/i/165405327?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45046c08-8621-47ef-9e6d-d2421a3b9323_15000x7850.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KiRB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45046c08-8621-47ef-9e6d-d2421a3b9323_15000x7850.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!KiRB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45046c08-8621-47ef-9e6d-d2421a3b9323_15000x7850.jpeg 848w, https://substackcdn.com/image/fetch/$s_!KiRB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45046c08-8621-47ef-9e6d-d2421a3b9323_15000x7850.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!KiRB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45046c08-8621-47ef-9e6d-d2421a3b9323_15000x7850.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is the third and last post on the topic of dealing with missing values in Data Science and Machine Learning. You can read the first part <a href="https://www.xgblog.ai/p/dealing-with-missing-values-part">here</a>, and the second part <a href="https://www.xgblog.ai/p/dealing-with-missing-values-part-951">here</a>. As I noted in my first post in this series, this turned out to be a <strong>much</strong> bigger topic than I had thought when I first wrote about it years ago, and what I intended to be a quick, short post ended up needing to be split into three parts. Even so, we have only scratched the surface of this big area, and you are urged to look up many other wonderful sources online. </p><p>So here are three more ways of dealing with missing values. </p><h3>Probabilistic &amp; Statistical Approaches (Bayesian/EM-style)</h3><p>When handling missing data, probabilistic and statistical approaches are valuable because they explicitly model missingness under a rigorous framework. Methods like Expectation-Maximization (EM) iteratively estimate missing values alongside the parameters of the underlying statistical model, offering a structured and statistically sound solution. One effective implementation in Python involves scikit-learn's <code>IterativeImputer</code> coupled with a <code>BayesianRidge</code> estimator. This combination iteratively imputes missing values by treating them as random variables, leveraging Bayesian inference to incorporate uncertainty and manage complex multivariate relationships. Bayesian methods specifically allow for incorporating prior information, which can enhance robustness, particularly in scenarios with limited data or small sample sizes. 
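<p>The combination just described can be sketched as follows. The toy data are assumed purely for illustration; note that <code>IterativeImputer</code> is still experimental in scikit-learn, so the <code>enable_iterative_imputer</code> import is required:</p>

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Toy data with missing entries in two correlated columns (roughly y = 2x).
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [np.nan, 8.0],
    [5.0, 10.0],
])

# EM-style iterative imputation with a Bayesian linear model:
# each feature is repeatedly regressed on the others until convergence.
imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)

print(X_imputed)  # no NaNs remain; imputed values roughly follow the linear pattern
```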
This can be especially beneficial when dealing with sensitive or critical datasets in fields such as healthcare, finance, or scientific research, where the cost of incorrect imputations can be high.</p><p>For instance, in Python, you might write:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kiHI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54fdb09-eeaf-46a6-95ee-fdba4ab35193_930x170.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kiHI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54fdb09-eeaf-46a6-95ee-fdba4ab35193_930x170.png 424w, https://substackcdn.com/image/fetch/$s_!kiHI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54fdb09-eeaf-46a6-95ee-fdba4ab35193_930x170.png 848w, https://substackcdn.com/image/fetch/$s_!kiHI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54fdb09-eeaf-46a6-95ee-fdba4ab35193_930x170.png 1272w, https://substackcdn.com/image/fetch/$s_!kiHI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54fdb09-eeaf-46a6-95ee-fdba4ab35193_930x170.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kiHI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54fdb09-eeaf-46a6-95ee-fdba4ab35193_930x170.png" width="930" height="170" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b54fdb09-eeaf-46a6-95ee-fdba4ab35193_930x170.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:170,&quot;width&quot;:930,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43753,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.xgblog.ai/i/165405327?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54fdb09-eeaf-46a6-95ee-fdba4ab35193_930x170.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kiHI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54fdb09-eeaf-46a6-95ee-fdba4ab35193_930x170.png 424w, https://substackcdn.com/image/fetch/$s_!kiHI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54fdb09-eeaf-46a6-95ee-fdba4ab35193_930x170.png 848w, https://substackcdn.com/image/fetch/$s_!kiHI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54fdb09-eeaf-46a6-95ee-fdba4ab35193_930x170.png 1272w, https://substackcdn.com/image/fetch/$s_!kiHI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb54fdb09-eeaf-46a6-95ee-fdba4ab35193_930x170.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><h2>Interpolation (Time-Series &amp; Ordered Data)</h2><p>Another effective method, particularly suitable for sequential or ordered data such as time-series, is interpolation. 
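<p>A minimal pandas sketch of the idea (the series values and timestamps are assumed for illustration):</p>

```python
import pandas as pd

# Hourly readings with a two-point gap; the index supplies the ordering.
s = pd.Series(
    [10.0, None, None, 16.0, 18.0],
    index=pd.date_range("2025-01-01", periods=5, freq="h"),
)

# Linear interpolation: fill each gap along the straight line between neighbors.
filled = s.interpolate(method="linear")
print(filled.tolist())  # [10.0, 12.0, 14.0, 16.0, 18.0]

# Smoother variants exist if desired (these require scipy), e.g.:
# s.interpolate(method="spline", order=2)
```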
Interpolation techniques fill in missing data by estimating values based on surrounding known data points, thus preserving continuity and inherent sequential patterns. Linear interpolation, which connects missing points using straight lines, is common due to its simplicity and effectiveness. However, more sophisticated methods like spline or polynomial interpolation can be utilized if smoother transitions are desired, although these might introduce artifacts, especially in noisy datasets. Moreover, interpolation techniques are highly context-sensitive; they assume that sequential order or time intervals between observations accurately reflect underlying trends or patterns. Therefore, careful assessment of data continuity and frequency is essential before selecting an interpolation method.</p><p>A straightforward interpolation example in Python might look like:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cr0h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb04455f4-e8c0-41e9-9a7f-fd77eb3ec0b8_574x181.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cr0h!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb04455f4-e8c0-41e9-9a7f-fd77eb3ec0b8_574x181.png 424w, https://substackcdn.com/image/fetch/$s_!cr0h!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb04455f4-e8c0-41e9-9a7f-fd77eb3ec0b8_574x181.png 848w, https://substackcdn.com/image/fetch/$s_!cr0h!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb04455f4-e8c0-41e9-9a7f-fd77eb3ec0b8_574x181.png 1272w, 
https://substackcdn.com/image/fetch/$s_!cr0h!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb04455f4-e8c0-41e9-9a7f-fd77eb3ec0b8_574x181.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cr0h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb04455f4-e8c0-41e9-9a7f-fd77eb3ec0b8_574x181.png" width="728" height="229.5609756097561" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b04455f4-e8c0-41e9-9a7f-fd77eb3ec0b8_574x181.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:181,&quot;width&quot;:574,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:26358,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.xgblog.ai/i/165405327?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb04455f4-e8c0-41e9-9a7f-fd77eb3ec0b8_574x181.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cr0h!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb04455f4-e8c0-41e9-9a7f-fd77eb3ec0b8_574x181.png 424w, https://substackcdn.com/image/fetch/$s_!cr0h!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb04455f4-e8c0-41e9-9a7f-fd77eb3ec0b8_574x181.png 848w, 
https://substackcdn.com/image/fetch/$s_!cr0h!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb04455f4-e8c0-41e9-9a7f-fd77eb3ec0b8_574x181.png 1272w, https://substackcdn.com/image/fetch/$s_!cr0h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb04455f4-e8c0-41e9-9a7f-fd77eb3ec0b8_574x181.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h2>Robust Model Design (Tree-Based &amp; Native Missing-Value Handling)</h2><p>Alternatively, some algorithms natively handle missing data, particularly tree-based models such as scikit-learn&#8217;s <code>HistGradientBoostingClassifier</code>, XGBoost, LightGBM, and CatBoost. These models inherently manage missing values during training, eliminating the need for a separate imputation step and often resulting in excellent predictive performance. They manage missing data by internally identifying optimal splits, taking missing values into account directly during model training. Although these methods simplify preprocessing, they limit your model selection flexibility by tying you to specific algorithms. 
Additionally, models with built-in handling of missing values may also yield more interpretable outcomes since they clearly illustrate the role of missingness within the predictive structure.</p><p>Here's a brief example:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!asRy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc069ece1-25bf-4f05-b183-e629961fb0b5_612x244.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!asRy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc069ece1-25bf-4f05-b183-e629961fb0b5_612x244.png 424w, https://substackcdn.com/image/fetch/$s_!asRy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc069ece1-25bf-4f05-b183-e629961fb0b5_612x244.png 848w, https://substackcdn.com/image/fetch/$s_!asRy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc069ece1-25bf-4f05-b183-e629961fb0b5_612x244.png 1272w, https://substackcdn.com/image/fetch/$s_!asRy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc069ece1-25bf-4f05-b183-e629961fb0b5_612x244.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!asRy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc069ece1-25bf-4f05-b183-e629961fb0b5_612x244.png" width="720" height="287.05882352941177" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c069ece1-25bf-4f05-b183-e629961fb0b5_612x244.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:244,&quot;width&quot;:612,&quot;resizeWidth&quot;:720,&quot;bytes&quot;:44079,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.xgblog.ai/i/165405327?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc069ece1-25bf-4f05-b183-e629961fb0b5_612x244.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!asRy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc069ece1-25bf-4f05-b183-e629961fb0b5_612x244.png 424w, https://substackcdn.com/image/fetch/$s_!asRy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc069ece1-25bf-4f05-b183-e629961fb0b5_612x244.png 848w, https://substackcdn.com/image/fetch/$s_!asRy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc069ece1-25bf-4f05-b183-e629961fb0b5_612x244.png 1272w, https://substackcdn.com/image/fetch/$s_!asRy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc069ece1-25bf-4f05-b183-e629961fb0b5_612x244.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 
20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>How to Choose the Right Method</h2><p>Choosing the right method for dealing with missing values involves thoughtful consideration of several factors. Understanding the underlying mechanism causing missingness is critical. Data that are Missing Completely at Random (MCAR) are typically easier to manage with simpler methods like mean or median imputation. Missing at Random (MAR) data may require more sophisticated techniques such as multivariate imputations (e.g., MICE), as these methods better capture complex relationships among variables. 
Data Missing Not at Random (MNAR), meanwhile, usually necessitate domain-specific insights or specialized statistical approaches, possibly including sensitivity analyses or explicitly modeling the missingness mechanism itself.</p><p>It is essential to evaluate how each method impacts your dataset's integrity and potential biases. While deletion methods are straightforward and computationally simple, they might substantially reduce the dataset's size and introduce bias, especially if data are not missing completely at random. Simple methods, such as mean or median imputation, can be useful initial steps to benchmark results and provide baseline performance. Progressing toward more sophisticated approaches like KNN or MICE is advisable as the complexity of the data and analysis requirements grow. These advanced methods, while computationally more intensive, typically yield higher quality imputations by capturing interdependencies among features.</p><p>Domain knowledge remains invaluable throughout this process. Incorporating expert insights can guide sensible thresholds or plausible imputations, significantly enhancing the validity and reliability of your analyses. Practical considerations, like data collection methods, measurement errors, and domain-specific data constraints, should always inform your choice of imputation strategy. Finally, remember that methods can be combined creatively. 
For instance, interpolating sequential data before using a robust tree-based model, or employing missingness indicators alongside probabilistic methods, can offer practical and effective solutions.</p><p>By carefully applying these strategies, you can effectively manage missing values, leading to more accurate, reliable, and insightful outcomes in your data science and machine learning projects.</p>]]></content:encoded></item><item><title><![CDATA[Dealing With Missing Values, Part 2.]]></title><description><![CDATA[Multivariate Imputation by Chained Equations, Indicator Variable Techniques, and Domain-Specific Rules]]></description><link>https://www.xgblog.ai/p/dealing-with-missing-values-part-951</link><guid isPermaLink="false">https://www.xgblog.ai/p/dealing-with-missing-values-part-951</guid><dc:creator><![CDATA[Bojan Tunguz]]></dc:creator><pubDate>Mon, 02 Jun 2025 12:18:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5cGh!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F784750bf-c764-473e-bd10-4e2d7f1353cc_400x400.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In our <a href="https://www.xgblog.ai/p/dealing-with-missing-values-part">previous post</a>, we introduced the topic of dealing with missing values in Data Science and Machine Learning. The topic turned out to be much larger than I had originally thought when deciding to rewrite a blog post I wrote years ago, so I decided to split it into several parts. The more I think about it, though, the more I am convinced that only a short book on the topic would actually do it full justice. 
</p><p></p><h3>Multivariate Imputation by Chained Equations (MICE)</h3><p>Multivariate Imputation by Chained Equations (MICE) is a sophisticated approach for handling missing data, especially effective when the missing values follow a complex pattern under the Missing At Random (MAR) assumption. Unlike simpler methods that handle each feature independently, MICE iteratively models each feature based on the others, capturing the inherent relationships within your dataset.</p><p>The MICE process works by first filling missing values with initial estimates, often simple ones like mean or median values. It then iteratively refines these estimates by modeling each feature with regression techniques, conditional on the others. This approach allows MICE to accurately preserve multivariate relationships and provide uncertainty estimates for the imputed values.</p><p>However, MICE requires careful tuning. It involves deciding on the number of iterations to run and handling the computational complexity that arises from modeling each feature iteratively.</p><p>Here is how you can implement MICE using Python's <code>scikit-learn</code>:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jr8X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f7e03ca-97d7-49d0-b195-cce870ec62f1_973x167.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jr8X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f7e03ca-97d7-49d0-b195-cce870ec62f1_973x167.png 424w, https://substackcdn.com/image/fetch/$s_!jr8X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f7e03ca-97d7-49d0-b195-cce870ec62f1_973x167.png 
848w, https://substackcdn.com/image/fetch/$s_!jr8X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f7e03ca-97d7-49d0-b195-cce870ec62f1_973x167.png 1272w, https://substackcdn.com/image/fetch/$s_!jr8X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f7e03ca-97d7-49d0-b195-cce870ec62f1_973x167.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jr8X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f7e03ca-97d7-49d0-b195-cce870ec62f1_973x167.png" width="973" height="167" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7f7e03ca-97d7-49d0-b195-cce870ec62f1_973x167.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:167,&quot;width&quot;:973,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:37660,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.xgblog.ai/i/165001956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f7e03ca-97d7-49d0-b195-cce870ec62f1_973x167.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jr8X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f7e03ca-97d7-49d0-b195-cce870ec62f1_973x167.png 424w, https://substackcdn.com/image/fetch/$s_!jr8X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f7e03ca-97d7-49d0-b195-cce870ec62f1_973x167.png 848w, 
https://substackcdn.com/image/fetch/$s_!jr8X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f7e03ca-97d7-49d0-b195-cce870ec62f1_973x167.png 1272w, https://substackcdn.com/image/fetch/$s_!jr8X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f7e03ca-97d7-49d0-b195-cce870ec62f1_973x167.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><h3>Indicator Variable Techniques</h3><p>Sometimes, the pattern of missing data itself can carry valuable information. In such scenarios, indicator variable techniques explicitly record the presence or absence of data as separate binary indicators. This method ensures that the imputation process and downstream models can learn directly from patterns of missingness.</p><p>Indicator variables help algorithms distinguish between originally observed and imputed values, which can significantly improve the performance of predictive models. 
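</p><p>A minimal sketch with scikit-learn&#8217;s <code>SimpleImputer</code>: the <code>add_indicator=True</code> flag appends one binary column per feature that had missing values during fitting. The toy frame and column names are made up for illustration:</p>

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25.0, np.nan, 40.0, np.nan],
                   "income": [50.0, 60.0, np.nan, 80.0]})

# Impute with the column mean and append binary missingness indicators
imputer = SimpleImputer(strategy="mean", add_indicator=True)
arr = imputer.fit_transform(df)

# indicator_.features_ lists the columns that received an indicator
indicator_names = [f"{df.columns[i]}_missing" for i in imputer.indicator_.features_]
out = pd.DataFrame(arr, columns=list(df.columns) + indicator_names)
```

<p>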
However, this technique also increases the dimensionality of the dataset, potentially leading to overfitting, especially in cases where missing data is sparse.</p><p>A common practice is to combine indicator variables with a straightforward imputation strategy like mean or mode imputation, as illustrated below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fcGI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F939e6260-8da3-4e22-ac95-1795c7d8d010_930x261.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fcGI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F939e6260-8da3-4e22-ac95-1795c7d8d010_930x261.png 424w, https://substackcdn.com/image/fetch/$s_!fcGI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F939e6260-8da3-4e22-ac95-1795c7d8d010_930x261.png 848w, https://substackcdn.com/image/fetch/$s_!fcGI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F939e6260-8da3-4e22-ac95-1795c7d8d010_930x261.png 1272w, https://substackcdn.com/image/fetch/$s_!fcGI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F939e6260-8da3-4e22-ac95-1795c7d8d010_930x261.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fcGI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F939e6260-8da3-4e22-ac95-1795c7d8d010_930x261.png" width="930" height="261" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/939e6260-8da3-4e22-ac95-1795c7d8d010_930x261.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:261,&quot;width&quot;:930,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:65696,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.xgblog.ai/i/165001956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F939e6260-8da3-4e22-ac95-1795c7d8d010_930x261.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fcGI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F939e6260-8da3-4e22-ac95-1795c7d8d010_930x261.png 424w, https://substackcdn.com/image/fetch/$s_!fcGI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F939e6260-8da3-4e22-ac95-1795c7d8d010_930x261.png 848w, https://substackcdn.com/image/fetch/$s_!fcGI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F939e6260-8da3-4e22-ac95-1795c7d8d010_930x261.png 1272w, https://substackcdn.com/image/fetch/$s_!fcGI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F939e6260-8da3-4e22-ac95-1795c7d8d010_930x261.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h3>Domain-Specific Rules</h3><p>Domain-specific imputation leverages expert knowledge or external contextual data to fill missing values. This method is highly effective when you have clear business logic or contextual insights guiding the imputation process.</p><p>For example, if you know from domain knowledge that all employees in the engineering department earn at least $70,000 annually, then this information should directly inform how you handle missing values in the "Salary" column. 
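</p><p>A pandas sketch of such a rule; the departments, figures, and the $70,000 floor below are purely illustrative:</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "department": ["engineering", "sales", "engineering", "sales"],
    "salary": [95000.0, np.nan, np.nan, 52000.0],
})

# Domain rule: engineering salaries are known to start at $70,000,
# so use that floor instead of a global mean or median
eng = df["department"] == "engineering"
df.loc[eng, "salary"] = df.loc[eng, "salary"].fillna(70000.0)

# Everyone else falls back to their own department's median
df["salary"] = df.groupby("department")["salary"].transform(
    lambda s: s.fillna(s.median()))
```

<p>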
Using domain-specific rules can improve the validity of your imputation significantly over general statistical approaches.</p><p>Here's a practical example of applying domain-specific imputation:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9kdX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60c0e9fc-11a8-4828-850b-2e31a89db64d_688x254.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9kdX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60c0e9fc-11a8-4828-850b-2e31a89db64d_688x254.png 424w, https://substackcdn.com/image/fetch/$s_!9kdX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60c0e9fc-11a8-4828-850b-2e31a89db64d_688x254.png 848w, https://substackcdn.com/image/fetch/$s_!9kdX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60c0e9fc-11a8-4828-850b-2e31a89db64d_688x254.png 1272w, https://substackcdn.com/image/fetch/$s_!9kdX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60c0e9fc-11a8-4828-850b-2e31a89db64d_688x254.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9kdX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60c0e9fc-11a8-4828-850b-2e31a89db64d_688x254.png" width="688" height="254" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/60c0e9fc-11a8-4828-850b-2e31a89db64d_688x254.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:254,&quot;width&quot;:688,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:45954,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.xgblog.ai/i/165001956?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60c0e9fc-11a8-4828-850b-2e31a89db64d_688x254.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9kdX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60c0e9fc-11a8-4828-850b-2e31a89db64d_688x254.png 424w, https://substackcdn.com/image/fetch/$s_!9kdX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60c0e9fc-11a8-4828-850b-2e31a89db64d_688x254.png 848w, https://substackcdn.com/image/fetch/$s_!9kdX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60c0e9fc-11a8-4828-850b-2e31a89db64d_688x254.png 1272w, https://substackcdn.com/image/fetch/$s_!9kdX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60c0e9fc-11a8-4828-850b-2e31a89db64d_688x254.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>This concludes the second part of this series. 
In the last part we&#8217;ll tackle probabilistic methods, interpolation, and tree-based methods.</p>]]></content:encoded></item><item><title><![CDATA[Dealing With Missing Values, Part 1.]]></title><description><![CDATA[A semi-comprehensive look at all the ways we deal with missing values in Data Science and Machine Learning]]></description><link>https://www.xgblog.ai/p/dealing-with-missing-values-part</link><guid isPermaLink="false">https://www.xgblog.ai/p/dealing-with-missing-values-part</guid><dc:creator><![CDATA[Bojan Tunguz]]></dc:creator><pubDate>Tue, 27 May 2025 12:53:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!brsT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4f0e39e-a462-4769-816d-edb98417f56c_941x319.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>No real-world data collection process is perfect, and we are often left with all sorts of noise in our dataset: incorrectly recorded values, non-recorded values, corrupted data, etc. If we are able to spot all those irregular points, oftentimes the best we can do is treat them as missing values. Missing values are a fact of life if you work in data science, machine learning, or any other field that relies on real-world data. Most of us hardly give those data points much thought, and when we do, we rely on many ready-made tools, algorithms, or rules of thumb to deal with them. However, to do them proper justice you sometimes need to dig deeper and make a judicious choice about what to do with them. And what you end up doing with them, as in many other circumstances in data science, can be boiled down to the trusted old phrase of &#8220;it depends&#8221;. Missing data can significantly impact the results of analyses and models, potentially leading to biased or misleading outcomes.</p><p>Many years ago I wrote a corporate blog post on this topic. 
The experience of writing that post taught me many valuable lessons about blogging - that blog post had to go through way too many chains of command before it saw the light of day, and eventually, like most other things about that startup, it disappeared. I&#8217;ve decided to revisit this topic, and try to do it even more justice this time around. And as I embarked on that journey, I soon realized that a single post will not do. So here is the first part in what I expect to be a three part series. </p><p>Let&#8217;s start by creating a dummy dataframe into which we can randomly insert various missing values.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!brsT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4f0e39e-a462-4769-816d-edb98417f56c_941x319.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!brsT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4f0e39e-a462-4769-816d-edb98417f56c_941x319.png 424w, https://substackcdn.com/image/fetch/$s_!brsT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4f0e39e-a462-4769-816d-edb98417f56c_941x319.png 848w, https://substackcdn.com/image/fetch/$s_!brsT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4f0e39e-a462-4769-816d-edb98417f56c_941x319.png 1272w, https://substackcdn.com/image/fetch/$s_!brsT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4f0e39e-a462-4769-816d-edb98417f56c_941x319.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!brsT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4f0e39e-a462-4769-816d-edb98417f56c_941x319.png" width="941" height="319" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d4f0e39e-a462-4769-816d-edb98417f56c_941x319.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:319,&quot;width&quot;:941,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:64757,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.xgblog.ai/i/164022334?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4f0e39e-a462-4769-816d-edb98417f56c_941x319.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!brsT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4f0e39e-a462-4769-816d-edb98417f56c_941x319.png 424w, https://substackcdn.com/image/fetch/$s_!brsT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4f0e39e-a462-4769-816d-edb98417f56c_941x319.png 848w, https://substackcdn.com/image/fetch/$s_!brsT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4f0e39e-a462-4769-816d-edb98417f56c_941x319.png 1272w, https://substackcdn.com/image/fetch/$s_!brsT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4f0e39e-a462-4769-816d-edb98417f56c_941x319.png 1456w" sizes="100vw" 
fetchpriority="high"></picture><div></div></div></a></figure></div><h3>Deletion-Based Methods</h3><p>Deletion-based methods handle missing data by completely removing rows or columns that contain missing values. This approach is straightforward and easy to apply, requiring no additional parameters or complex adjustments. However, it should only be used under specific circumstances.</p><p>Deletion-based methods are most suitable when the missing data occur completely at random (MCAR). In other words, the absence of certain data points must be entirely unrelated to any observable or hidden variables. 
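</p><p>In pandas, the basic deletion variants are one-liners; the toy frame below is illustrative:</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [4.0, 5.0, np.nan],
                   "c": [7.0, 8.0, 9.0]})

rows_dropped = df.dropna()         # listwise deletion: drop any row with a NaN
cols_dropped = df.dropna(axis=1)   # drop any column containing a NaN
thresh_kept = df.dropna(thresh=2)  # keep rows with at least 2 non-null values
```

<p>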
Additionally, these methods are practical only when dealing with large datasets, where losing some data points or variables will not significantly affect the overall dataset or analysis results.</p><p>The primary advantage of deletion-based methods is their simplicity. However, a notable drawback is the potential for significant data loss, particularly in datasets with many missing values. If the assumption of MCAR does not hold, results can become biased due to selective data removal.</p><p>Here is an example using Python and the Pandas library:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lVn9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b62a57-34e8-47b6-837f-f3ac90f1a2ad_531x181.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lVn9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b62a57-34e8-47b6-837f-f3ac90f1a2ad_531x181.png 424w, https://substackcdn.com/image/fetch/$s_!lVn9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b62a57-34e8-47b6-837f-f3ac90f1a2ad_531x181.png 848w, https://substackcdn.com/image/fetch/$s_!lVn9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b62a57-34e8-47b6-837f-f3ac90f1a2ad_531x181.png 1272w, https://substackcdn.com/image/fetch/$s_!lVn9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b62a57-34e8-47b6-837f-f3ac90f1a2ad_531x181.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!lVn9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b62a57-34e8-47b6-837f-f3ac90f1a2ad_531x181.png" width="531" height="181" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/60b62a57-34e8-47b6-837f-f3ac90f1a2ad_531x181.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:181,&quot;width&quot;:531,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:38489,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.xgblog.ai/i/164022334?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b62a57-34e8-47b6-837f-f3ac90f1a2ad_531x181.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lVn9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b62a57-34e8-47b6-837f-f3ac90f1a2ad_531x181.png 424w, https://substackcdn.com/image/fetch/$s_!lVn9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b62a57-34e8-47b6-837f-f3ac90f1a2ad_531x181.png 848w, https://substackcdn.com/image/fetch/$s_!lVn9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b62a57-34e8-47b6-837f-f3ac90f1a2ad_531x181.png 1272w, https://substackcdn.com/image/fetch/$s_!lVn9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b62a57-34e8-47b6-837f-f3ac90f1a2ad_531x181.png 1456w" sizes="100vw" 
loading="lazy"></picture><div></div></div></a></figure></div><p></p><h3>Simple Imputation (Mean/Median/Mode)</h3><p>Simple imputation involves replacing missing values with basic statistical measures such as the mean, median, or mode of existing data. This method is ideal when dealing with moderate levels of missingness, providing a quick and straightforward solution.</p><p>Mean imputation is typically applied to numeric data that is approximately normally distributed, while median imputation is recommended for skewed numeric data due to its robustness against outliers. Mode imputation is most commonly used for categorical variables, replacing missing entries with the most frequently occurring category.</p><p>The advantages of simple imputation include its speed and ease of implementation. Nevertheless, this approach can significantly underestimate data variance, potentially introducing biases into subsequent analyses or models.</p><p>The following Python code demonstrates simple imputation:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oAmS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d70f05e-cd96-4b83-bfaa-a5e852611a9d_1036x259.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oAmS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d70f05e-cd96-4b83-bfaa-a5e852611a9d_1036x259.png 424w, https://substackcdn.com/image/fetch/$s_!oAmS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d70f05e-cd96-4b83-bfaa-a5e852611a9d_1036x259.png 848w, 
https://substackcdn.com/image/fetch/$s_!oAmS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d70f05e-cd96-4b83-bfaa-a5e852611a9d_1036x259.png 1272w, https://substackcdn.com/image/fetch/$s_!oAmS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d70f05e-cd96-4b83-bfaa-a5e852611a9d_1036x259.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oAmS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d70f05e-cd96-4b83-bfaa-a5e852611a9d_1036x259.png" width="1036" height="259" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2d70f05e-cd96-4b83-bfaa-a5e852611a9d_1036x259.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:259,&quot;width&quot;:1036,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:54282,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.xgblog.ai/i/164022334?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d70f05e-cd96-4b83-bfaa-a5e852611a9d_1036x259.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oAmS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d70f05e-cd96-4b83-bfaa-a5e852611a9d_1036x259.png 424w, https://substackcdn.com/image/fetch/$s_!oAmS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d70f05e-cd96-4b83-bfaa-a5e852611a9d_1036x259.png 
848w, https://substackcdn.com/image/fetch/$s_!oAmS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d70f05e-cd96-4b83-bfaa-a5e852611a9d_1036x259.png 1272w, https://substackcdn.com/image/fetch/$s_!oAmS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d70f05e-cd96-4b83-bfaa-a5e852611a9d_1036x259.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p></p><h3>KNN &amp; Regression-Based Imputation</h3><p>More sophisticated methods for handling missing data include K-Nearest Neighbors (KNN) and regression-based imputation. These methods leverage relationships and patterns in the data to estimate missing values more accurately than simple statistical methods.</p><p>KNN imputation predicts missing values based on the closest neighboring data points. It is particularly useful when preserving the local structure of data is crucial, and linear assumptions may not hold. Regression-based imputation, on the other hand, predicts missing values using linear relationships among features, assuming the data exhibits linearity.</p><p>The advantage of these advanced methods is their potential for increased accuracy compared to simple imputation. However, they tend to be computationally more intensive. 
Additionally, these methods can introduce issues such as over-smoothing of data and potential leakage of target information if predictive models inadvertently include target-related features.</p><p>The following examples illustrate these techniques in Python:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!15kM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc9872bd-b8d1-4746-a86f-5a24b6a73023_940x365.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!15kM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc9872bd-b8d1-4746-a86f-5a24b6a73023_940x365.png 424w, https://substackcdn.com/image/fetch/$s_!15kM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc9872bd-b8d1-4746-a86f-5a24b6a73023_940x365.png 848w, https://substackcdn.com/image/fetch/$s_!15kM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc9872bd-b8d1-4746-a86f-5a24b6a73023_940x365.png 1272w, https://substackcdn.com/image/fetch/$s_!15kM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc9872bd-b8d1-4746-a86f-5a24b6a73023_940x365.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!15kM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc9872bd-b8d1-4746-a86f-5a24b6a73023_940x365.png" width="940" height="365" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc9872bd-b8d1-4746-a86f-5a24b6a73023_940x365.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:365,&quot;width&quot;:940,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:82586,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.xgblog.ai/i/164022334?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc9872bd-b8d1-4746-a86f-5a24b6a73023_940x365.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!15kM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc9872bd-b8d1-4746-a86f-5a24b6a73023_940x365.png 424w, https://substackcdn.com/image/fetch/$s_!15kM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc9872bd-b8d1-4746-a86f-5a24b6a73023_940x365.png 848w, https://substackcdn.com/image/fetch/$s_!15kM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc9872bd-b8d1-4746-a86f-5a24b6a73023_940x365.png 1272w, https://substackcdn.com/image/fetch/$s_!15kM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc9872bd-b8d1-4746-a86f-5a24b6a73023_940x365.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 
20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>This concludes the first part of this series. In the next two parts we&#8217;ll tackle some more advanced techniques. 
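As a copyable recap of the two imputation approaches shown in the screenshots above, here is a minimal sketch using scikit-learn's imputation API (assumes scikit-learn and pandas are installed; the toy column names and values are illustrative, not taken from the screenshots):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

raw = pd.DataFrame({
    "income": [52_000, np.nan, 61_000, 48_000, np.nan, 75_000],  # skewed numeric
    "age":    [34, 29, np.nan, 41, 38, np.nan],                  # roughly normal numeric
    "city":   ["NY", "SF", np.nan, "NY", "NY", "SF"],            # categorical
})

# Simple imputation: median for skewed numerics, mean for normal-ish ones,
# mode ("most_frequent") for categoricals.
df = raw.copy()
df["income"] = SimpleImputer(strategy="median").fit_transform(df[["income"]]).ravel()
df["age"] = SimpleImputer(strategy="mean").fit_transform(df[["age"]]).ravel()
df["city"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]]).ravel()

# KNN imputation on an untouched copy: each missing value is filled from the
# k nearest rows, preserving local structure (numeric columns only).
knn_filled = KNNImputer(n_neighbors=2).fit_transform(raw[["income", "age"]])

print(df)
print(knn_filled)
```

Regression-based imputation follows the same pattern; scikit-learn's `IterativeImputer` (imported from `sklearn.experimental`) is one common drop-in for it.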
</p><p></p>]]></content:encoded></item><item><title><![CDATA[TrainXGB - Train XGBoost in Browser]]></title><description><![CDATA[The simplest way to train an XGBoost model in GUI right in your browser]]></description><link>https://www.xgblog.ai/p/trainxgb-train-xgboost-in-browser</link><guid isPermaLink="false">https://www.xgblog.ai/p/trainxgb-train-xgboost-in-browser</guid><dc:creator><![CDATA[Bojan Tunguz]]></dc:creator><pubDate>Thu, 06 Mar 2025 13:05:48 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!e-UH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F295571d4-5e54-4a47-9d3f-d6a0eeb84786_1457x1017.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I can finally share a bit about one of the small projects that I&#8217;ve been working on recently - <a href="http://www.trainxgb.com">www.trainxgb.com</a>. It&#8217;s an in-browser app built from scratch with <a href="https://panel.holoviz.org">Panel</a> and <a href="https://pyscript.com">PyScript</a>. It allows you to train an XGBoost model using a simple GUI and leverages the powerful WebAssembly (WASM) in-browser compute environment. Just upload your CSV data file, choose the appropriate hyperparameters, and press the &#8220;Train XGBoost&#8221; button. No coding skills required, no account creation necessary. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!e-UH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F295571d4-5e54-4a47-9d3f-d6a0eeb84786_1457x1017.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e-UH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F295571d4-5e54-4a47-9d3f-d6a0eeb84786_1457x1017.png 424w, https://substackcdn.com/image/fetch/$s_!e-UH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F295571d4-5e54-4a47-9d3f-d6a0eeb84786_1457x1017.png 848w, https://substackcdn.com/image/fetch/$s_!e-UH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F295571d4-5e54-4a47-9d3f-d6a0eeb84786_1457x1017.png 1272w, https://substackcdn.com/image/fetch/$s_!e-UH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F295571d4-5e54-4a47-9d3f-d6a0eeb84786_1457x1017.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!e-UH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F295571d4-5e54-4a47-9d3f-d6a0eeb84786_1457x1017.png" width="1456" height="1016" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/295571d4-5e54-4a47-9d3f-d6a0eeb84786_1457x1017.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1016,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:130287,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.xgblog.ai/i/158510131?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F295571d4-5e54-4a47-9d3f-d6a0eeb84786_1457x1017.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!e-UH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F295571d4-5e54-4a47-9d3f-d6a0eeb84786_1457x1017.png 424w, https://substackcdn.com/image/fetch/$s_!e-UH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F295571d4-5e54-4a47-9d3f-d6a0eeb84786_1457x1017.png 848w, https://substackcdn.com/image/fetch/$s_!e-UH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F295571d4-5e54-4a47-9d3f-d6a0eeb84786_1457x1017.png 1272w, https://substackcdn.com/image/fetch/$s_!e-UH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F295571d4-5e54-4a47-9d3f-d6a0eeb84786_1457x1017.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is just the first iteration of this app, one that includes more or less all the features I wanted it to have. It will likely be pretty buggy, and potentially unstable. I&#8217;ve tried to test it as much as possible, and for the most part it worked as I&#8217;d wanted it to. I&#8217;ve tested it on Safari and Chrome on Macs and iPad, Firefox on Ubuntu, and Edge on Windows, and it worked on all of those devices. The app is a bit too big to run on your cell phone, but it might be possible to optimize it in the next few iterations. It should ideally be used with small-ish datasets, the kinds that you usually see in most Data Science and tabular data use cases. I&#8217;ve tested it with datasets up to 100 MB, and it worked fine. I&#8217;ve used it on MNIST, where training finished in a minute with 98% accuracy. 
<a href="http://www.trainxgb.com">Go ahead, just give it a try.</a></p><p>I first started fiddling with something like this app a few years ago, when PyScript first came out. I was able to create a *very* rudimentary app that just trained on a single built-in dataset &#8211; just the most basic proof of concept. At the time I wanted to build something more powerful, but I could not get any buy-in for it. Various other professional and personal circumstances prevented me from pursuing this project until now. </p><p>Another big obstacle was that all the technologies used for this project are relatively new and obscure, and I am not a developer. Even just a year ago it would have been a very steep development cycle to get a project like this one going. Fortunately, AI coding assistants have grown tremendously in power over the last few months. For most of this project I used OpenAI&#8217;s o1-pro model. As I mentioned, all the technologies in my stack are relatively nonstandard, and many AI assistants are not able to easily get a working version of the app on the first try. At one point I tried to marshal Claude to rewrite and streamline an early version of the app, but it decided that what I really needed was to have the whole app rewritten in JavaScript, with the XGBoost training run in the backend. (!!!) When Grok 3 came out it gave me a pretty decent output as well, but by then I was already committed to o1-pro and decided to stick with it. </p><p>In future posts I may write a bit more about this whole process and the tech stack I used, but for now I&#8217;d just like to have you play with the app and provide me with all the constructive feedback you have. 
</p>]]></content:encoded></item><item><title><![CDATA[XGBoost is All You Need - Part 7]]></title><description><![CDATA[Nontrivial use case 3: use of XGBoost for unsupervised tasks]]></description><link>https://www.xgblog.ai/p/xgboost-is-all-you-need-part-7</link><guid isPermaLink="false">https://www.xgblog.ai/p/xgboost-is-all-you-need-part-7</guid><dc:creator><![CDATA[Bojan Tunguz]]></dc:creator><pubDate>Thu, 27 Feb 2025 13:22:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!9Lti!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bb2558a-6753-4763-a4bc-d83e7b1c4fab_1743x1132.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is the seventh, and last,  installment in my series of posts on XGBoost, and the third on practical applications, based on <a href="https://www.nvidia.com/en-us/on-demand/session/gtc24-s62960/">my 2024 GTC presentation</a>. Previous posts can be found at the following links: <a href="https://www.xgblog.ai/p/xgboost-is-all-you-need">Part 1</a>, <a href="https://www.xgblog.ai/p/xgboost-is-all-you-need-0b4">Part 2</a>, <a href="https://www.xgblog.ai/p/xgboost-is-all-you-need-part-3-gradient">Part 3</a>, <a href="https://www.xgblog.ai/p/xgboost-is-all-you-need-part-4">Part 4</a>, <a href="https://www.xgblog.ai/p/xgboost-is-all-you-need-part-5">Part 5</a>, <a href="https://www.xgblog.ai/p/xgboost-is-all-you-need-part-6">Part 6.</a></p><h1>Challenges in Visualizing Tabular Data</h1><p>Tabular datasets are notoriously difficult to visualize and interpret compared to image or text data. Unlike an image (which a human can &#8220;see&#8221; patterns in) or a piece of text (which we can read and comprehend), a table of raw features doesn&#8217;t present an obvious structure to our senses. It&#8217;s often up to the analyst to dig through the numbers, plotting one or two features at a time to detect patterns. 
Even then, high-dimensional relationships can be elusive; you might create scatter plots or apply dimensionality reduction techniques, but capturing all interactions is tough. In short, tables may contain precise values, but they <strong>&#8220;do not tell a story&#8221; on their own &#8211; the reader must interpret the numerical data, find patterns, and draw conclusions</strong>&#8203;. This lack of an immediate visual narrative makes understanding complex feature relationships in tabular data challenging and motivates the search for better representations.</p><h2>Using Shapley Values from a Supervised Model for Unsupervised Clustering</h2><p>One way to tackle these interpretation challenges is to transform the tabular data into a more meaningful representation before trying to visualize or cluster it. Here&#8217;s where <strong>Shapley values</strong> come in. Shapley values are typically used in supervised machine learning to explain model predictions &#8211; they quantify how much each feature contributes to a particular prediction. The interesting twist is using these values <strong>outside</strong> their usual role: leveraging them for unsupervised tasks like clustering. The idea is straightforward: train a supervised model on your tabular data (using a relevant target variable), and then use the model to compute Shapley values for each instance. This converts your raw dataset into a new dataset of the same shape where each feature&#8217;s raw value is replaced by its contribution to the model&#8217;s output&#8203;. In essence, we are <strong>using a supervised learning step to inform an unsupervised analysis</strong>. Of course, this approach assumes you have a target variable to train the model in the first place &#8211; a notable departure from traditional clustering which is fully unsupervised. Fortunately, in many real-world scenarios an outcome of interest exists (e.g. 
a known class label or result we care about), making it feasible to apply this strategy&#8203;. By converting raw features into Shapley values, we inject domain-relevant signal (learned by the supervised model) into the data before performing clustering.</p><h2>Why Use Shapley Values for Clustering?</h2><p>Using Shapley values as a preprocessing step for clustering offers several important advantages over clustering on raw features. Fundamentally, Shapley values act as a feature transformation that preserves relationships relevant to a prediction target. This means the structure in the data that influenced the model is retained, while arbitrary scale differences and noise can be reduced&#8203;. In practice, clustering on Shapley-transformed data tends to yield groups that align with meaningful patterns (since points in the same cluster have similar feature contributions towards the outcome), rather than groups driven by just raw value similarity</p><p>Shapley values are expressed in the units of the model&#8217;s output (e.g. contribution to a prediction probability or log-odds). All features&#8217; contributions are on a comparable scale by construction&#8203;. This greatly reduces the risk of distance-based clustering algorithms being skewed by features simply because they have larger numerical ranges or different units (a common issue when, say, mixing revenue in dollars with age in years). In other words, the data is effectively self-normalized by the model&#8217;s predictions, so <strong>no manual feature scaling is required</strong>.</p><p>The Shapley value representation has essentially the <strong>same number of features as the original dataset</strong>&#8203;. Each original feature yields one Shapley value (per instance) representing its contribution. This means we are not creating an expanded feature space; instead, we&#8217;re replacing or augmenting the original features with their model-derived counterparts. 
You can plug these values into a clustering algorithm as a drop-in replacement for the original features without worrying about additional dimensions complicating the clustering process.</p><p>If your dataset has categorical features, the supervised model will handle them during training (for example, through label encoding). The resulting Shapley values for those features are numeric importance scores. Thus, <strong>there&#8217;s no need for separate encoding of categorical variables</strong> purely for the sake of clustering &#8211; the model + Shapley pipeline has already taken care of translating category levels into a consistent numerical contribution. This simplifies preprocessing, sparing us from choosing between one-hot, label encoding, or other schemes for mixed data.</p><p>When using tree-based models (like XGBoost or LightGBM) to generate Shapley values, missing values in the data can be handled natively by the model. For example, XGBoost will send an observation down a default branch if a feature value is missing&#8203;, meaning the model can still make a prediction without explicit imputation. The Shapley values computed from such a model inherently account for missingness as part of the feature&#8217;s contribution (or lack thereof). This eliminates the need to worry about imputation or special treatment of NaNs before clustering &#8211; the model&#8217;s logic has absorbed that complexity.</p><p>Because Shapley values put features on an equal footing (in terms of scale) and weight them by relevance, the clusters obtained from this transformed data tend to be more well-separated and meaningful. We are effectively clustering based on how observations <strong>behave</strong> relative to a target outcome, rather than on raw feature magnitudes. 
Using Shapley values for clustering can produce clearer groupings, where each cluster is defined by a distinct pattern of feature contributions rather than arbitrary numeric similarities.</p><h2>Interpreting Clusters with an Auxiliary XGBoost Model</h2><p>After clustering the data in this Shapley value space, the final step is understanding what defines each cluster in terms of the original features. We have clusters that were formed using the supervised model&#8217;s insights &#8211; now we want to translate those clusters back to domain language (original features and their values). A convenient way to do this is to build a second, interpretable model that predicts the cluster assignments. In practice, one can take the cluster labels as a new target variable and train a classifier (for example, an XGBoost model) to predict which cluster an instance belongs to&#8203;. Essentially, we treat the task &#8220;is this data point in Cluster A, B, C, etc.?&#8221; as a multi-class classification problem. By training an auxiliary XGBoost model on the original dataset with cluster IDs as labels, we obtain a model that encapsulates the differences between clusters. We can then apply Shapley value analysis (or examine feature importances) on this model to see which features most strongly drive the distinctions between clusters. This approach gives us human-interpretable explanations for each cluster: for instance, we might discover that <strong>Cluster 1</strong> is characterized by high values of Feature X and low values of Feature Y contributing to the outcome, whereas <strong>Cluster 2</strong> has the opposite pattern. In summary, this step closes the loop by using explainable AI tools on the clustering results themselves. 
The next section of this post will dive into this interpretability step, building the auxiliary XGBoost model and using it to <strong>explain what makes each cluster unique</strong>, so that we not only have robust clusters but also a clear understanding of their defining features.</p><p>The entire procedure for using XGBoost for simple visualization and clustering looks like this:</p><ul><li>For a <strong>supervised</strong> learning task &#8211; create a simple XGBoost model (don&#8217;t worry about predictive performance)</li><li>Get Shapley values for all data points</li><li>Use a simple dimensionality reduction scheme (t-SNE, UMAP) to visualize the features</li><li>Use a clustering algorithm to extract clusters</li><li>Other dimensionality reduction schemes are also possible: neural nets and autoencoders</li></ul><p>For illustration, here is what the Porto Seguro dataset (which we&#8217;ve been using in the previous couple of examples) looks like once we&#8217;ve applied the above procedure and reduced it to a two-dimensional space. Several clusters are clearly visible, as is the distinction between the positive and the negative class. Now, all of this is to some extent to be expected &#8211; after all, the visualization is based on features that were already engineered to perform well with a predictive model. Nonetheless, many important insights can still be gleaned. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9Lti!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bb2558a-6753-4763-a4bc-d83e7b1c4fab_1743x1132.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9Lti!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bb2558a-6753-4763-a4bc-d83e7b1c4fab_1743x1132.png 424w, https://substackcdn.com/image/fetch/$s_!9Lti!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bb2558a-6753-4763-a4bc-d83e7b1c4fab_1743x1132.png 848w, https://substackcdn.com/image/fetch/$s_!9Lti!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bb2558a-6753-4763-a4bc-d83e7b1c4fab_1743x1132.png 1272w, https://substackcdn.com/image/fetch/$s_!9Lti!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bb2558a-6753-4763-a4bc-d83e7b1c4fab_1743x1132.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9Lti!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bb2558a-6753-4763-a4bc-d83e7b1c4fab_1743x1132.png" width="1456" height="946" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5bb2558a-6753-4763-a4bc-d83e7b1c4fab_1743x1132.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:946,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1458223,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.xgblog.ai/i/157610040?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bb2558a-6753-4763-a4bc-d83e7b1c4fab_1743x1132.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9Lti!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bb2558a-6753-4763-a4bc-d83e7b1c4fab_1743x1132.png 424w, https://substackcdn.com/image/fetch/$s_!9Lti!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bb2558a-6753-4763-a4bc-d83e7b1c4fab_1743x1132.png 848w, https://substackcdn.com/image/fetch/$s_!9Lti!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bb2558a-6753-4763-a4bc-d83e7b1c4fab_1743x1132.png 1272w, https://substackcdn.com/image/fetch/$s_!9Lti!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bb2558a-6753-4763-a4bc-d83e7b1c4fab_1743x1132.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Why XGBoost and not simpler statistical methods?</h2><p>In principle, all of the above could be done with lots of careful statistical analysis. For example, after forming clusters, an analyst might examine which features differ the most across those groups using statistical measures or tests. Techniques exist to score features by how strongly they relate to cluster labels &#8211; for instance, measuring the variance or correlation of each feature with the cluster assignments. In practice, this could involve applying methods like analysis of variance or chi-square tests to each feature to determine its significance in distinguishing clusters&#8203;. The traditional route of interpreting clusters through manual statistical analysis often demands substantial time and specialized expertise. 
Designing the appropriate models or tests for each aspect of the data isn&#8217;t straightforward &#8211; it requires a strong background in statistics to choose suitable methods and a deep understanding of the subject matter to make sense of the results. Analysts might need to try multiple modeling approaches, account for interactions or non-linear effects by crafting new variables, and validate that each model is sound. All of this can be <strong>time-consuming and complex. </strong>Using XGBoost and Shapley values offers a more straightforward path to visualizing datasets, finding clusters, and interpreting clustering outcomes, leveraging computation to reduce manual effort. In practical terms, a data scientist can achieve in a short time what would otherwise require exhaustive statistical exploration &#8211; setting up a robust visualization, finding clusters in the dataset, and quickly identifying the key features that drive cluster formation and how they combine, without having to explicitly program each hypothesis. This scalable, off-the-shelf procedure allows one to visualize data and interpret clusters with far less effort, making advanced insight accessible even when time or deep statistical expertise is limited.</p><h1>Beyond supervised tasks</h1><p>The above procedure works well as long as we are dealing with a straightforward supervised problem upon which we can build our analysis. However, most visualization, clustering, and similar problems are <strong>purely</strong> unsupervised. The question then becomes how to apply this procedure to those situations. For instance, would it be possible to create an XGBoost-based autoencoder? And if so, could dataset embeddings be achieved with XGBoost? These are intriguing questions, and potential topics for further research. 
</p>]]></content:encoded></item><item><title><![CDATA[XGBoost is All You Need - Part 6]]></title><description><![CDATA[Nontrivial use case 2: use of Shapley values for feature selection and feature engineering]]></description><link>https://www.xgblog.ai/p/xgboost-is-all-you-need-part-6</link><guid isPermaLink="false">https://www.xgblog.ai/p/xgboost-is-all-you-need-part-6</guid><dc:creator><![CDATA[Bojan Tunguz]]></dc:creator><pubDate>Mon, 24 Feb 2025 13:22:51 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lPcW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0166cc54-460a-4fe7-bcbb-d217d6ae56cc_1198x668.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is the sixth installment in my series of posts on XGBoost, and the second on practical applications, based on my <a href="https://www.nvidia.com/en-us/on-demand/session/gtc24-s62960/">2024 GTC presentation</a>. Previous posts can be found at the following links: <a href="https://www.xgblog.ai/p/xgboost-is-all-you-need">Part 1</a>, <a href="https://www.xgblog.ai/p/xgboost-is-all-you-need-0b4">Part 2</a>, <a href="https://www.xgblog.ai/p/xgboost-is-all-you-need-part-3-gradient">Part 3</a>, <a href="https://www.xgblog.ai/p/xgboost-is-all-you-need-part-4">Part 4</a>, <a href="https://www.xgblog.ai/p/xgboost-is-all-you-need-part-5">Part 5</a>.</p><h1>Shapley Values - The power of interpretability</h1><p><strong>Shapley values</strong> are a unique, game-theoretic approach to attributing feature importance in machine learning models. Originating from cooperative game theory, Shapley values were designed to <strong>fairly distribute a coalition&#8217;s payoff</strong> among players based on their contributions. In the ML context, each feature value is treated as a &#8220;player&#8221; in a game where the model&#8217;s prediction is the total payout to be distributed. 
This provides a theoretically sound framework for feature attribution: Shapley values are the <em>only </em>solution that satisfies certain fairness properties (efficiency, symmetry, dummy, additivity) that ensure each feature gets credit proportional to its true contribution&#8203;. In contrast, simpler heuristics like the built-in feature importance of a tree model (e.g. Gini importance or split counts in random forests) lack such rigorous guarantees and can be biased by factors like feature scale or redundancy. By <strong>fairly allocating credit </strong>to features, Shapley values offer a more principled measure of importance for tasks like feature selection &#8211; helping identify which features genuinely drive the model&#8217;s predictions as opposed to those that appear important due to quirks of the training process.</p><p><strong>Shapley values provide local, instance-level explanations.</strong> Unlike global feature importance methods that yield a single importance score per feature (averaged over all data), Shapley values break down <strong>each individual prediction</strong> into contributions from each feature. In other words, for every data point, we get a personalized attribution of how each feature influenced that particular prediction. This local explanation property is extremely powerful: it means we can explain <em>why </em>the model made a specific prediction by looking at that data point&#8217;s Shapley values (e.g. &#8220;Feature A contributed +0.5 to the prediction, Feature B contributed -0.2,&#8221; etc.). SHAP (SHapley Additive exPlanations) is a popular framework that leverages Shapley values for per-instance explainability&#8203;. For example, suppose a tree model ranks &#8220;Income&#8221; as the most important feature overall for loan approval (global importance). Global methods tell us <strong>Income </strong>matters on average, but they won&#8217;t explain why two particular applicants with similar income got different outcomes. 
Shapley values, however, can reveal that for one applicant, <strong>Income </strong>had a high positive contribution, while for another applicant <strong>Income </strong>had a smaller effect and another feature (say, <strong>Debt</strong>) drove the prediction. This per-instance granularity contrasts with global importance metrics like permutation importance, which <strong>lack context for individual predictions</strong>. </p><p><strong>The downside is that exact Shapley value computation is notoriously expensive &#8212; the complexity grows </strong><em><strong>exponentially</strong></em><strong> with the number of features.</strong> Computing a feature&#8217;s Shapley value involves considering <strong>all possible subsets of features</strong> and measuring the feature&#8217;s contribution in each context. Formally, for <em>N</em> features, there are 2^N possible subsets (coalitions) of features, and evaluating all of them for every feature quickly becomes intractable as N grows. In fact, a straightforward calculation requires on the order of <em>N!</em> (factorial) evaluations to get exact Shapley values, which is <strong>computationally prohibitive</strong> except for very small feature sets&#8203;. This exponential explosion means that if you have, say, 20 features, the number of coalitions to examine is over 1 million; with 30 features it&#8217;s over a billion. The factorial growth of subsets makes exact Shapley calculations infeasible for high-dimensional datasets or complex models&#8203;. In practical terms, evaluating every combination of features would require an astronomical number of model runs, so <strong>exact Shapley values are rarely computed for real-world ML problems</strong>. Data scientists must therefore be mindful of this trade-off: while Shapley values are theoretically appealing for feature importance, <strong>naively computing them doesn&#8217;t scale</strong>. 
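The cost of the exact computation is easy to see in a brute-force sketch over all coalitions. The toy "model" below is purely additive (each feature contributes a fixed amount), which makes the answer checkable by hand: each feature's Shapley value comes out as exactly its own contribution.

```python
from itertools import combinations
from math import factorial

def exact_shapley(features, value):
    """Brute-force Shapley values: enumerates all 2^N coalitions per feature."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):                      # coalition sizes 0 .. n-1
            for S in combinations(others, k):
                S = set(S)
                # classic Shapley weight for a coalition of size k
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (value(S | {f}) - value(S))
        phi[f] = total
    return phi

# toy additive "model": each feature contributes a fixed amount to the output
contrib = {"income": 0.5, "debt": -0.2, "age": 0.1}
v = lambda S: sum(contrib[f] for f in S)
phi = exact_shapley(list(contrib), v)
```

The inner double loop is exactly the exponential blow-up described above: with 3 features it is trivial, but each added feature doubles the number of coalitions.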
This limitation has driven the development of clever approximation methods to make Shapley-based explanations usable in practice.</p><p><strong>Most Shapley value implementations rely on approximation techniques or model-specific assumptions to sidestep this combinatorial explosion.</strong> To make Shapley values feasible, researchers have developed algorithms that <em>approximate </em>the exact values with far fewer evaluations. A common strategy is <strong>Monte Carlo sampling</strong>: rather than enumerating all 2^N subsets, one can sample a large number of random feature orderings or subsets and estimate each feature&#8217;s average marginal contribution from those samples. The SHAP library popularized this approach, providing a unified framework for approximating Shapley values for any model by treating the explanation as a weighted linear regression problem (KernelSHAP). For tree-based models, there is an even more efficient method known as <strong>TreeSHAP</strong>, which exploits the tree&#8217;s structure to compute Shapley values in polynomial time instead of exponential time. TreeSHAP uses dynamic programming on decision paths to exactly compute Shapley contributions for tree ensemble models much faster than brute force. In practice, tools like the <strong>SHAP Python package </strong>use these techniques under the hood: for arbitrary models it uses sampling-based approximations, and for decision tree models (e.g. Random Forests, XGBoost) it uses the TreeSHAP algorithm for speed. The key point is that virtually all Shapley value software avoids the full exponential cost by leveraging <strong>approximations or optimizations</strong>. While these methods may introduce a bit of approximation error, they make it practical to get Shapley-based feature importances on datasets with dozens or even hundreds of features. 
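The Monte Carlo idea can be sketched in a few lines. The toy value function here is illustrative: the "prediction" is 1 only when both x1 and x2 are present, so the two features end up splitting the credit roughly equally while x3 gets none.

```python
import random

def mc_shapley(features, value, n_samples=2000, seed=0):
    """Monte Carlo Shapley estimate: average marginal contribution
    of each feature over randomly sampled feature orderings."""
    rng = random.Random(seed)
    phi = {f: 0.0 for f in features}
    for _ in range(n_samples):
        order = list(features)
        rng.shuffle(order)
        coalition = set()
        for f in order:                      # add features one by one
            before = value(coalition)
            coalition.add(f)
            phi[f] += value(coalition) - before
    return {f: total / n_samples for f, total in phi.items()}

# toy model with a pure interaction: output is 1 only if x1 AND x2 are present
v = lambda S: 1.0 if {"x1", "x2"} <= S else 0.0
phi = mc_shapley(["x1", "x2", "x3"], v)
# phi["x1"] and phi["x2"] each come out near 0.5; phi["x3"] is 0
```

The number of model evaluations now scales with the sample count rather than with 2^N, which is the trade-off the paragraph above describes.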
Thus, data scientists can reap the benefits of Shapley values&#8217; fairness and local accuracy <strong>without </strong>waiting millennia for computations to finish.</p><p>Because calculating Shapley values (or their approximations) involves repeated model evaluations for many feature combinations or permutations, the problem is <strong>embarrassingly parallel</strong>. This means it can benefit hugely from parallel hardware like GPUs. GPUs can execute thousands of operations simultaneously, making them well-suited to speeding up the large number of independent calculations required for Shapley estimates. In recent years, GPU-accelerated implementations of SHAP and TreeSHAP have shown <strong>order-of-magnitude speedups</strong>. For example, one study reported achieving up to <strong>19&#215; faster</strong> computation of standard Shapley values and up to <strong>340&#215; faster</strong> computation of Shapley <em>interaction</em> values by offloading the work to an NVIDIA V100 GPU, compared to a multi-core CPU baseline. Libraries such as <strong>GPU TreeShap </strong>(integrated in frameworks like XGBoost) allow Shapley values for tree models to be calculated on GPUs, often turning hours of computation into minutes. The SHAP Python library can also utilize a GPU backend (via CuPy or GPU-enabled tree libraries) to accelerate KernelSHAP sampling and TreeSHAP calculations. For data scientists, this means that computing Shapley explanations on large datasets or complex models is no longer prohibitive &#8211; with the right hardware and libraries, you can get near real-time explanations. 
</p><p><strong>Shapley </strong><em><strong>Interaction</strong></em><strong> values &#8211; an extension of Shapley values &#8211; are useful for uncovering feature interactions and informing feature engineering.</strong> So far we&#8217;ve discussed allocating importance to individual features, but Shapley analysis can go a step further and quantify the contribution of <strong>feature combinations</strong>. <em>Shapley interaction values</em> allocate credit among <strong>pairs of features</strong> (and by extension, can be generalized to higher-order groups)&#8203;. In essence, they decompose the prediction not just into single-feature effects but into pairwise interaction effects as well. For two features <em>i </em>and <em>j</em>, the Shapley interaction value attempts to measure how much those two features <em>together </em>contribute to the prediction beyond the sum of their individual contributions. Concretely, SHAP interaction values produce a matrix of attributions: the diagonal elements are the usual Shapley values for each feature (solo effect), and each off-diagonal element is the attribution for the <strong>interaction between feature </strong><em><strong>i </strong></em><strong>and feature </strong><em><strong>j</strong></em>. This tells us whether the model has learned any synergy or redundancy between features. For example, if features <em>X1 </em>and <em>X2 </em>only matter when combined (perhaps they represent coordinates that together pinpoint a location), Shapley interaction values for the pair (<em>X1</em>,<em>X2</em>) will be high, indicating a strong <strong>synergistic interaction</strong>. Conversely, if two features provide overlapping information (one might make the other redundant), the interaction value for that pair will be near zero or even negative, indicating that knowing both doesn&#8217;t add much beyond one alone. 
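For a single pair of features, the interaction index can be enumerated directly on a toy game. The sketch below uses the same illustrative AND game as before (prediction is 1 only when x1 and x2 are both present); note that SHAP's interaction matrix splits this symmetric quantity across the (i, j) and (j, i) off-diagonal entries.

```python
from itertools import combinations
from math import factorial

def interaction_value(i, j, features, value):
    """Shapley interaction index for the pair (i, j), by direct enumeration:
    the weighted average of v(S+{i,j}) - v(S+{i}) - v(S+{j}) + v(S)."""
    rest = [f for f in features if f not in (i, j)]
    n = len(features)
    total = 0.0
    for k in range(len(rest) + 1):
        for S in combinations(rest, k):
            S = set(S)
            w = factorial(k) * factorial(n - k - 2) / factorial(n - 1)
            delta = (value(S | {i, j}) - value(S | {i})
                     - value(S | {j}) + value(S))
            total += w * delta
    return total

# pure AND interaction between x1 and x2; x3 is irrelevant
v = lambda S: 1.0 if {"x1", "x2"} <= S else 0.0
feats = ["x1", "x2", "x3"]
# (x1, x2) shows a strong synergistic interaction; (x1, x3) shows none
```

A positive value flags synergy, a near-zero or negative value flags redundancy, matching the feature-engineering reading given above.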
Identifying such relationships is extremely valuable for feature engineering: <strong>redundant features </strong>can potentially be pruned to simplify the model without sacrificing performance, while <strong>synergistic features </strong>might inspire creation of a new combined feature or at least inform us to keep both features together&#8203;. Shapley interaction values thus enable data scientists to dissect not just individual feature effects but also the <strong>pairwise dependencies </strong>learned by the model. This insight can guide feature selection (e.g. avoid dropping a feature that only has importance when partnered with another) and feature creation (e.g. adding interaction terms or aggregated features). Overall, by extending the fair credit allocation principle to feature pairs, Shapley interaction values provide a systematic way to <strong>quantify feature interactions</strong>, shining a light on complex relationships that simpler importance metrics would miss.</p><h1>Shapley values in XGBoost</h1><p>Starting with XGBoost <strong>v1.3</strong> (released in late 2020), the library introduced <strong>GPU acceleration</strong> for Shapley value computations&#8203;. One convenient feature of XGBoost is that its standard Python API can compute Shapley values directly. By simply calling the model&#8217;s <code>predict</code> method with <code>pred_contribs=True</code>, XGBoost will return the contribution of each feature to the prediction (the Shapley values) without any external libraries&#8203;. This built-in functionality means you don&#8217;t need third-party packages to derive feature importance via Shapley values &#8211; the logic is integrated and optimized within XGBoost itself. In practice, this tight integration makes computing feature attributions more efficient. 
For example, one user reported that using the standalone SHAP library took <em>days</em> on a subsample of a dataset, whereas XGBoost&#8217;s native method produced the same Shapley values in just <em>minutes</em>&#8203;. Eliminating the extra overhead of external explainers thus streamlines the workflow and speeds up the calculations.</p><p>In addition to individual feature contributions, XGBoost&#8217;s API also supports computing <strong>Shapley interaction values </strong>directly. By setting <code>pred_interactions=True</code> in the <code>predict</code> call, you can obtain second-order Shapley values that attribute contributions to pairs of features&#8203;. This allows data scientists to analyze feature interactions (how two features jointly influence a prediction) without needing any external implementation. The ability to get interaction effects natively is particularly useful for understanding dependencies in tree-based models &#8211; for instance, it can reveal whether two features together have a synergistic or compensatory effect on the model&#8217;s output. All of this is done within XGBoost&#8217;s own engine, again avoiding the need to hand data off to another library for these deeper insights.</p><p>GPU acceleration has been a <strong>game-changer</strong> for applying Shapley values to larger datasets. Prior to GPU support, analysts often had to sample down their data or limit Shapley calculations to a handful of instances due to the heavy runtime. Now, with orders-of-magnitude faster computation, it&#8217;s feasible to explain predictions on extensive datasets or very complex models. The ability to harness a GPU (or multiple GPUs) effectively removes the previous scalability barrier&#8203;. Data scientists can obtain comprehensive feature attribution on datasets that were once too large to handle, without waiting days for results. 
In short, the introduction of GPU power has made Shapley-based interpretation practical for big data scenarios that were previously off-limits.</p><p>In practice, the speedups from GPU acceleration are often on the order of hundreds-fold, and in some cases even <strong>1000X</strong> or more faster compared to CPU-only methods. The exact gain depends on factors like the size of the dataset, number of trees, and hardware specifics, but it is consistently very large. These massive speed improvements mean that tasks which previously might have timed out or taken unbearably long can now be completed quickly. Even with a modest GPU, you can expect a substantial reduction in computation time for Shapley value analysis, often turning a formerly overnight job into a matter of seconds or minutes.</p><p>Because of this dramatic acceleration, XGBoost users can incorporate Shapley value explanations into <strong>real-time or iterative workflows</strong>. For instance, you could generate explanation values for new predictions on the fly in a production setting, or interactively explore feature effects in a Jupyter notebook without long delays. Multi-GPU setups push this even further &#8211; one benchmark achieved about <strong>1.2 million rows per second</strong> throughput when computing SHAP values using eight GPUs in parallel&#8203;. This level of performance was unimaginable with earlier CPU-only methods. It enables high-volume inference monitoring or dynamic dashboards that explain model decisions almost instantly. </p><h1>SHAP package</h1><p></p><p>The <strong>SHAP</strong> library provides a highly intuitive Python interface for computing Shapley values, making model interpretation accessible to data scientists. Its API is well-documented with plenty of examples, so you can quickly learn how to explain your models&#8217; predictions. SHAP integrates smoothly with popular machine learning frameworks and supports a wide range of model types. 
In practice, you can plug in models from scikit-learn, XGBoost, LightGBM, CatBoost, or even deep learning frameworks and obtain Shapley value explanations with minimal code. Many common model objects are directly supported &#8211; for example, tree models from XGBoost, LightGBM, CatBoost, PySpark, and scikit-learn work out-of-the-box with SHAP&#8217;s explainer classes&#8203;. This broad compatibility and user-friendly design have made SHAP a go-to tool for interpretable machine learning.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lPcW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0166cc54-460a-4fe7-bcbb-d217d6ae56cc_1198x668.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lPcW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0166cc54-460a-4fe7-bcbb-d217d6ae56cc_1198x668.png 424w, https://substackcdn.com/image/fetch/$s_!lPcW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0166cc54-460a-4fe7-bcbb-d217d6ae56cc_1198x668.png 848w, https://substackcdn.com/image/fetch/$s_!lPcW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0166cc54-460a-4fe7-bcbb-d217d6ae56cc_1198x668.png 1272w, https://substackcdn.com/image/fetch/$s_!lPcW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0166cc54-460a-4fe7-bcbb-d217d6ae56cc_1198x668.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!lPcW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0166cc54-460a-4fe7-bcbb-d217d6ae56cc_1198x668.png" width="1198" height="668" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0166cc54-460a-4fe7-bcbb-d217d6ae56cc_1198x668.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:668,&quot;width&quot;:1198,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:85112,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.xgblog.ai/i/157609534?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0166cc54-460a-4fe7-bcbb-d217d6ae56cc_1198x668.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lPcW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0166cc54-460a-4fe7-bcbb-d217d6ae56cc_1198x668.png 424w, https://substackcdn.com/image/fetch/$s_!lPcW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0166cc54-460a-4fe7-bcbb-d217d6ae56cc_1198x668.png 848w, https://substackcdn.com/image/fetch/$s_!lPcW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0166cc54-460a-4fe7-bcbb-d217d6ae56cc_1198x668.png 1272w, https://substackcdn.com/image/fetch/$s_!lPcW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0166cc54-460a-4fe7-bcbb-d217d6ae56cc_1198x668.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>SHAP doesn&#8217;t just stop at computing values &#8211; it also includes excellent built-in visualization tools to help make sense of the results. These plots are designed to be <strong>interactive in Jupyter notebooks</strong>, leveraging JavaScript behind the scenes for rich displays&#8203; (the library provides an <code>initjs()</code>function to load the necessary JS components when needed). 
Two key visualization types are the <strong>summary plot </strong>and the <strong>force plot</strong>.</p><ul><li><p><strong>Summary Plot:</strong> This chart provides an overview of feature importance across the entire dataset. It typically appears as a beeswarm or layered scatter plot where each point is a Shapley value for a feature in one sample. In a single view, you see which features are most influential (ranked by overall impact on the model output) and how their values affect the prediction (color indicating feature value). In essence, <em>the summary plot gives an overview of the most important features across the entire dataset</em>.</p></li><li><p><strong>Force Plot:</strong> For understanding individual predictions, SHAP offers the force plot, which visualizes how each feature pushes an individual prediction higher or lower relative to a baseline. A force plot is a dynamic, bar-like visualization (often shown as an interactive red and blue force diagram in Jupyter) for a single observation. It displays the contribution of each feature value as a force that either increases the prediction (positive Shapley value pushing to the right) or decreases it (negative value pushing to the left) from the model&#8217;s base value. Essentially, the force plot <em>&#8220;shows how the features of a data point contribute to the model&#8217;s prediction&#8221;</em>.</p></li></ul><h2>Shapley values with Porto Seguro and XGBoost</h2><p>We are now going to illustrate the above features &#8211; fast, easy Shapley value calculations with XGBoost and easy visualization with the SHAP package &#8211; using the Porto Seguro dataset that we mentioned in the previous post. We assume that we have already found the optimal hyperparameters for the dataset, and have trained a model with them. We will now calculate the Shapley values and the Shapley interaction values for the validation dataset, and we&#8217;ll use an H100 GPU to do so. 
Thanks to its speed and relatively large vRAM size (80 GB), both of these calculations can be done pretty fast and the results fit easily in memory, something that is not always the case, especially for the Shapley interaction values. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JAj_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32391ea5-6059-4584-9823-1292bc971e25_1581x527.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JAj_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32391ea5-6059-4584-9823-1292bc971e25_1581x527.png 424w, https://substackcdn.com/image/fetch/$s_!JAj_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32391ea5-6059-4584-9823-1292bc971e25_1581x527.png 848w, https://substackcdn.com/image/fetch/$s_!JAj_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32391ea5-6059-4584-9823-1292bc971e25_1581x527.png 1272w, https://substackcdn.com/image/fetch/$s_!JAj_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32391ea5-6059-4584-9823-1292bc971e25_1581x527.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JAj_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32391ea5-6059-4584-9823-1292bc971e25_1581x527.png" width="1456" height="485" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/32391ea5-6059-4584-9823-1292bc971e25_1581x527.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:485,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:144495,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.xgblog.ai/i/157609534?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32391ea5-6059-4584-9823-1292bc971e25_1581x527.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JAj_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32391ea5-6059-4584-9823-1292bc971e25_1581x527.png 424w, https://substackcdn.com/image/fetch/$s_!JAj_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32391ea5-6059-4584-9823-1292bc971e25_1581x527.png 848w, https://substackcdn.com/image/fetch/$s_!JAj_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32391ea5-6059-4584-9823-1292bc971e25_1581x527.png 1272w, https://substackcdn.com/image/fetch/$s_!JAj_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32391ea5-6059-4584-9823-1292bc971e25_1581x527.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Once we obtain these values, we use SHAP&#8217;s excellent visualization tools to display them, both average absolute value for each feature, as well as their overall per-datapoint distributions. 
The average absolute value of each feature looks like the following:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KJ8E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f743f53-fb42-4d5d-a645-ce0765b36854_1289x948.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KJ8E!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f743f53-fb42-4d5d-a645-ce0765b36854_1289x948.png 424w, https://substackcdn.com/image/fetch/$s_!KJ8E!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f743f53-fb42-4d5d-a645-ce0765b36854_1289x948.png 848w, https://substackcdn.com/image/fetch/$s_!KJ8E!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f743f53-fb42-4d5d-a645-ce0765b36854_1289x948.png 1272w, https://substackcdn.com/image/fetch/$s_!KJ8E!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f743f53-fb42-4d5d-a645-ce0765b36854_1289x948.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KJ8E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f743f53-fb42-4d5d-a645-ce0765b36854_1289x948.png" width="1289" height="948" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f743f53-fb42-4d5d-a645-ce0765b36854_1289x948.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:948,&quot;width&quot;:1289,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:138858,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.xgblog.ai/i/157609534?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f743f53-fb42-4d5d-a645-ce0765b36854_1289x948.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KJ8E!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f743f53-fb42-4d5d-a645-ce0765b36854_1289x948.png 424w, https://substackcdn.com/image/fetch/$s_!KJ8E!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f743f53-fb42-4d5d-a645-ce0765b36854_1289x948.png 848w, https://substackcdn.com/image/fetch/$s_!KJ8E!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f743f53-fb42-4d5d-a645-ce0765b36854_1289x948.png 1272w, https://substackcdn.com/image/fetch/$s_!KJ8E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f743f53-fb42-4d5d-a645-ce0765b36854_1289x948.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This plot tells us which features are the most important. Unfortunately, in this dataset all features are anonymized, so we cannot really try to get any deeper insights about them, or use any kind of domain-specific knowledge to engineer those features further in order to extract more predictive information out of them. However, with non anonymized plots such insights could prove to be invaluable. 
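The computation behind this plot is compact. Here is a minimal sketch, assuming a `shap_values` matrix such as the one returned by `shap.TreeExplainer(model).shap_values(X_valid)` (the function and variable names here are illustrative):

```python
import numpy as np

def mean_abs_shap(shap_values, feature_names):
    """Rank features by mean absolute SHAP value across all samples.

    shap_values: (n_samples, n_features) matrix of per-sample Shapley
    values, with any bias column already stripped.
    """
    importance = np.abs(shap_values).mean(axis=0)
    order = np.argsort(importance)[::-1]  # largest first
    return [(feature_names[i], float(importance[i])) for i in order]
```

The returned list feeds directly into a horizontal bar chart like the one above.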
</p><p>Next, let&#8217;s take a look at the feature importance <strong>distributions</strong>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cbKx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30610013-b13c-416d-9096-9d90357d74f1_1289x948.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cbKx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30610013-b13c-416d-9096-9d90357d74f1_1289x948.png 424w, https://substackcdn.com/image/fetch/$s_!cbKx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30610013-b13c-416d-9096-9d90357d74f1_1289x948.png 848w, https://substackcdn.com/image/fetch/$s_!cbKx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30610013-b13c-416d-9096-9d90357d74f1_1289x948.png 1272w, https://substackcdn.com/image/fetch/$s_!cbKx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30610013-b13c-416d-9096-9d90357d74f1_1289x948.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cbKx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30610013-b13c-416d-9096-9d90357d74f1_1289x948.png" width="1289" height="948" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/30610013-b13c-416d-9096-9d90357d74f1_1289x948.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:948,&quot;width&quot;:1289,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:280882,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.xgblog.ai/i/157609534?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30610013-b13c-416d-9096-9d90357d74f1_1289x948.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cbKx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30610013-b13c-416d-9096-9d90357d74f1_1289x948.png 424w, https://substackcdn.com/image/fetch/$s_!cbKx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30610013-b13c-416d-9096-9d90357d74f1_1289x948.png 848w, https://substackcdn.com/image/fetch/$s_!cbKx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30610013-b13c-416d-9096-9d90357d74f1_1289x948.png 1272w, https://substackcdn.com/image/fetch/$s_!cbKx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30610013-b13c-416d-9096-9d90357d74f1_1289x948.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This SHAP summary plot shows each feature&#8217;s influence on the model&#8217;s output across all validation samples. The features are listed in descending order of overall importance (i.e., how much each feature contributes, on average, to the model&#8217;s predictions). For each feature, the horizontal spread of colored points captures the range of possible Shapley values, which can be interpreted as how strongly (and in which direction) that feature affects individual predictions.</p><p>A positive SHAP value (points to the right of the vertical line) indicates that the feature value pushes the model output higher, while a negative value (points to the left) means it pushes the output lower. 
The color scale (blue to pink/red) shows the feature&#8217;s actual value: blue typically represents lower feature values, and pink/red represents higher values. Taken together, you can see patterns such as &#8220;higher feature values lead to higher (or lower) predictions&#8221; if most of the pink points are clustered on one side. For instance, a feature where pink points mostly lie on the right side of the vertical line suggests that high values of that feature increase the predicted outcome. Conversely, if blue points concentrate on the right side, then lower values of that feature tend to raise the model&#8217;s prediction.</p><p>Looking at the topmost features, you can see they have wide spreads of SHAP values, indicating they strongly influence the model&#8217;s output. Lower-ranked features have a narrower range, so they contribute less on average. Overall, this chart helps identify which features matter most (the ones at the top) and how different feature values shift each prediction up or down.</p><p>Next, let&#8217;s take a look to see if there are any features that don&#8217;t make <strong>any</strong> contribution to the overall absolute Shapley values. These features should almost always be dropped. 
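The check itself is a one-liner on the same `shap_values` matrix; a sketch (the tolerance is illustrative):

```python
import numpy as np

def zero_importance_features(shap_values, feature_names, tol=1e-12):
    """Names of features whose mean absolute SHAP value is numerically zero."""
    importance = np.abs(shap_values).mean(axis=0)
    return [name for name, imp in zip(feature_names, importance) if imp < tol]
```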
The only exception is if their <strong>interaction</strong> with other features is nonzero, but even there, unless those interactions are pretty significant, it&#8217;s most likely that it would be fine to just drop them anyways.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wYDF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a1ca68c-7337-4de2-8220-6d443dc4f6ac_1206x701.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wYDF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a1ca68c-7337-4de2-8220-6d443dc4f6ac_1206x701.png 424w, https://substackcdn.com/image/fetch/$s_!wYDF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a1ca68c-7337-4de2-8220-6d443dc4f6ac_1206x701.png 848w, https://substackcdn.com/image/fetch/$s_!wYDF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a1ca68c-7337-4de2-8220-6d443dc4f6ac_1206x701.png 1272w, https://substackcdn.com/image/fetch/$s_!wYDF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a1ca68c-7337-4de2-8220-6d443dc4f6ac_1206x701.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wYDF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a1ca68c-7337-4de2-8220-6d443dc4f6ac_1206x701.png" width="1206" height="701" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0a1ca68c-7337-4de2-8220-6d443dc4f6ac_1206x701.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:701,&quot;width&quot;:1206,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:231689,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.xgblog.ai/i/157609534?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a1ca68c-7337-4de2-8220-6d443dc4f6ac_1206x701.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wYDF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a1ca68c-7337-4de2-8220-6d443dc4f6ac_1206x701.png 424w, https://substackcdn.com/image/fetch/$s_!wYDF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a1ca68c-7337-4de2-8220-6d443dc4f6ac_1206x701.png 848w, https://substackcdn.com/image/fetch/$s_!wYDF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a1ca68c-7337-4de2-8220-6d443dc4f6ac_1206x701.png 1272w, https://substackcdn.com/image/fetch/$s_!wYDF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a1ca68c-7337-4de2-8220-6d443dc4f6ac_1206x701.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We&#8217;ve identified six &#8220;faulty&#8221; features that need to be dropped. After retraining the model on the dataset with those features removed, the score improves slightly from 0.2851 to 0.2859. This is to be expected, and in fact probably underestimates the impact of those &#8220;faulty&#8221; features on the model. To get a more accurate assessment, we should re-do the hyperparameter optimization. However, in most practical situations this extra step might be overkill. </p><h1>Shapley interactions and Feature Engineering</h1><p>The above example is an instance of <strong>feature selection</strong>, where we decide which features to keep and which ones to discard based on how much they contribute to the overall score in terms of their absolute Shapley values. 
However, without actually knowing the meaning of the features, it&#8217;s not really feasible to use feature importance as the basis for <strong>feature engineering</strong>: the process of constructing new features from transformations of individual features or combinations of them. This is where Shapley interaction values come in handy. We can relatively easily build new features based on our knowledge of which feature interactions are the strongest. Let&#8217;s see how to do it. </p><p>We write a new function, <code>plot_top_k_interactions</code>, which computes mean absolute SHAP interaction contributions for each pair of features and then plots the strongest ones as a bar chart. It first slices out the bias term from <code>shap_interactions</code> and takes the average across samples, resulting in a two-dimensional matrix of mean interaction strengths. It then iterates through that matrix to build a list of <code>(feature_pair, interaction_strength)</code> tuples, skipping duplicates by taking only lower-triangle indices (<code>j &lt; i</code>) and multiplying each average by 2 to account for the symmetric entries. The list is sorted in descending order by interaction strength. Finally, the function extracts the top <code>k</code> pairs, plots them as a basic bar chart (rotating the x-axis labels to avoid overlap), and returns the full sorted list of interactions. 
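In code, that logic looks roughly like the following. This is a sketch of the computational core only (the bar chart is then a straightforward `plt.bar` call); it assumes a `shap_interactions` array such as the one returned by `Booster.predict(dmatrix, pred_interactions=True)`:

```python
import numpy as np

def top_k_interactions(shap_interactions, feature_names):
    """Rank feature pairs by mean absolute SHAP interaction strength.

    shap_interactions: (n_samples, n_features + 1, n_features + 1) array;
    the last row/column holds the bias term, which is sliced away below.
    """
    mean_abs = np.abs(shap_interactions[:, :-1, :-1]).mean(axis=0)
    pairs = []
    for i in range(mean_abs.shape[0]):
        for j in range(i):  # lower triangle only, so each pair appears once
            # each off-diagonal interaction appears at (i, j) and (j, i),
            # so double the single entry we keep
            pairs.append(((feature_names[i], feature_names[j]),
                          2.0 * float(mean_abs[i, j])))
    pairs.sort(key=lambda p: p[1], reverse=True)
    return pairs
```

Slicing `pairs[:k]` then gives the labels and heights for the bar chart, with the x-tick labels rotated to stay readable.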
This allows us to quickly see which feature interactions have the largest influence on model predictions, aiding in feature engineering and interpretability.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1wHG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3905be7e-541a-4067-8956-ae68a680d183_1236x803.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1wHG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3905be7e-541a-4067-8956-ae68a680d183_1236x803.png 424w, https://substackcdn.com/image/fetch/$s_!1wHG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3905be7e-541a-4067-8956-ae68a680d183_1236x803.png 848w, https://substackcdn.com/image/fetch/$s_!1wHG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3905be7e-541a-4067-8956-ae68a680d183_1236x803.png 1272w, https://substackcdn.com/image/fetch/$s_!1wHG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3905be7e-541a-4067-8956-ae68a680d183_1236x803.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1wHG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3905be7e-541a-4067-8956-ae68a680d183_1236x803.png" width="1236" height="803" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3905be7e-541a-4067-8956-ae68a680d183_1236x803.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:803,&quot;width&quot;:1236,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:233664,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.xgblog.ai/i/157609534?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3905be7e-541a-4067-8956-ae68a680d183_1236x803.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1wHG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3905be7e-541a-4067-8956-ae68a680d183_1236x803.png 424w, https://substackcdn.com/image/fetch/$s_!1wHG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3905be7e-541a-4067-8956-ae68a680d183_1236x803.png 848w, https://substackcdn.com/image/fetch/$s_!1wHG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3905be7e-541a-4067-8956-ae68a680d183_1236x803.png 1272w, https://substackcdn.com/image/fetch/$s_!1wHG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3905be7e-541a-4067-8956-ae68a680d183_1236x803.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Next, we take the top two feature interactions, and create two more simple features that are just products of the individual features:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g9s8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d401b55-2d60-4391-b5b2-e6144aa574a0_1282x125.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g9s8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d401b55-2d60-4391-b5b2-e6144aa574a0_1282x125.png 424w, 
https://substackcdn.com/image/fetch/$s_!g9s8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d401b55-2d60-4391-b5b2-e6144aa574a0_1282x125.png 848w, https://substackcdn.com/image/fetch/$s_!g9s8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d401b55-2d60-4391-b5b2-e6144aa574a0_1282x125.png 1272w, https://substackcdn.com/image/fetch/$s_!g9s8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d401b55-2d60-4391-b5b2-e6144aa574a0_1282x125.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g9s8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d401b55-2d60-4391-b5b2-e6144aa574a0_1282x125.png" width="1282" height="125" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6d401b55-2d60-4391-b5b2-e6144aa574a0_1282x125.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:125,&quot;width&quot;:1282,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:35612,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.xgblog.ai/i/157609534?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d401b55-2d60-4391-b5b2-e6144aa574a0_1282x125.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!g9s8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d401b55-2d60-4391-b5b2-e6144aa574a0_1282x125.png 
424w, https://substackcdn.com/image/fetch/$s_!g9s8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d401b55-2d60-4391-b5b2-e6144aa574a0_1282x125.png 848w, https://substackcdn.com/image/fetch/$s_!g9s8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d401b55-2d60-4391-b5b2-e6144aa574a0_1282x125.png 1272w, https://substackcdn.com/image/fetch/$s_!g9s8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d401b55-2d60-4391-b5b2-e6144aa574a0_1282x125.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KT-d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9a32757-e3f2-4138-b0cb-8d9d781350bf_1236x62.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KT-d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9a32757-e3f2-4138-b0cb-8d9d781350bf_1236x62.png 424w, https://substackcdn.com/image/fetch/$s_!KT-d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9a32757-e3f2-4138-b0cb-8d9d781350bf_1236x62.png 848w, https://substackcdn.com/image/fetch/$s_!KT-d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9a32757-e3f2-4138-b0cb-8d9d781350bf_1236x62.png 1272w, 
https://substackcdn.com/image/fetch/$s_!KT-d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9a32757-e3f2-4138-b0cb-8d9d781350bf_1236x62.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KT-d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9a32757-e3f2-4138-b0cb-8d9d781350bf_1236x62.png" width="1236" height="62" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f9a32757-e3f2-4138-b0cb-8d9d781350bf_1236x62.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:62,&quot;width&quot;:1236,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:30237,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.xgblog.ai/i/157609534?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9a32757-e3f2-4138-b0cb-8d9d781350bf_1236x62.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KT-d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9a32757-e3f2-4138-b0cb-8d9d781350bf_1236x62.png 424w, https://substackcdn.com/image/fetch/$s_!KT-d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9a32757-e3f2-4138-b0cb-8d9d781350bf_1236x62.png 848w, https://substackcdn.com/image/fetch/$s_!KT-d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9a32757-e3f2-4138-b0cb-8d9d781350bf_1236x62.png 1272w, 
https://substackcdn.com/image/fetch/$s_!KT-d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9a32757-e3f2-4138-b0cb-8d9d781350bf_1236x62.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>And then, as in the previous section, we retrain the XGBoost model, <em>with the original hyperparameters</em>, and our validation score improves further, from 0.2859 to 0.2871. The improvement seems more significant than the improvement that we got from feature selection. This might be just a fluke, but in my experience feature engineering is generally a far more productive step than feature selection. However, a simple multiplication may not always work. Oftentimes it is necessary to use a different way of combining the features, especially when at least one of the two features is categorical. </p>]]></content:encoded></item><item><title><![CDATA[XGBoost is All You Need - Part 5]]></title><description><![CDATA[Nontrivial use case 1: multi-GPU and multi-machine training]]></description><link>https://www.xgblog.ai/p/xgboost-is-all-you-need-part-5</link><guid isPermaLink="false">https://www.xgblog.ai/p/xgboost-is-all-you-need-part-5</guid><dc:creator><![CDATA[Bojan Tunguz]]></dc:creator><pubDate>Thu, 20 Feb 2025 13:24:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!0i-Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F799146f1-93ba-4e51-8e83-c630294e9004_1224x718.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is the fifth installment in the series of posts about XGBoost based on my <a href="https://www.nvidia.com/en-us/on-demand/session/gtc24-s62960/">2024 GTC Presentation</a>. 
You can find the previous posts here: <a href="https://www.xgblog.ai/p/xgboost-is-all-you-need">Part 1</a>, <a href="https://www.xgblog.ai/p/xgboost-is-all-you-need-0b4">Part 2</a>, <a href="https://www.xgblog.ai/p/xgboost-is-all-you-need-part-3-gradient">Part 3</a>, and <a href="https://www.xgblog.ai/p/xgboost-is-all-you-need-part-4">Part 4</a>.</p><p>In the previous posts I took a somewhat high-level approach, and talked mostly in general terms about what XGBoost is and how it fits within the whole ecosystem of Data Science and Machine Learning for Tabular Data. In this and the following two posts we are <strong>finally</strong> going to get our hands a bit dirty. Warning: there will be code! I wanted to showcase a few nontrivial use cases for XGBoost, based on my own work. Many of the lessons from these use cases are widely applicable and can be extended to different algorithms, but some are highly specific to XGBoost (the library in particular) and might be clunky to implement with other approaches. </p><h2>XGBoost distributed computing and GPU Support</h2><p>XGBoost has evolved significantly in how it supports GPUs and distributed computing. In 2017, version 0.7 introduced GPU acceleration by leveraging NVIDIA&#8217;s CUDA libraries. 
This offloaded certain operations, such as gradient computation and tree construction, to GPUs and offered a significant speed boost over CPU-based training, especially for large datasets and many boosting iterations.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0i-Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F799146f1-93ba-4e51-8e83-c630294e9004_1224x718.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0i-Q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F799146f1-93ba-4e51-8e83-c630294e9004_1224x718.png 424w, https://substackcdn.com/image/fetch/$s_!0i-Q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F799146f1-93ba-4e51-8e83-c630294e9004_1224x718.png 848w, https://substackcdn.com/image/fetch/$s_!0i-Q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F799146f1-93ba-4e51-8e83-c630294e9004_1224x718.png 1272w, https://substackcdn.com/image/fetch/$s_!0i-Q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F799146f1-93ba-4e51-8e83-c630294e9004_1224x718.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0i-Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F799146f1-93ba-4e51-8e83-c630294e9004_1224x718.png" width="399" height="234.05392156862746" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/799146f1-93ba-4e51-8e83-c630294e9004_1224x718.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:718,&quot;width&quot;:1224,&quot;resizeWidth&quot;:399,&quot;bytes&quot;:595415,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0i-Q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F799146f1-93ba-4e51-8e83-c630294e9004_1224x718.png 424w, https://substackcdn.com/image/fetch/$s_!0i-Q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F799146f1-93ba-4e51-8e83-c630294e9004_1224x718.png 848w, https://substackcdn.com/image/fetch/$s_!0i-Q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F799146f1-93ba-4e51-8e83-c630294e9004_1224x718.png 1272w, https://substackcdn.com/image/fetch/$s_!0i-Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F799146f1-93ba-4e51-8e83-c630294e9004_1224x718.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>By 2019, it became clear that a single GPU could still be a bottleneck for increasingly large training tasks. Version 0.9 addressed this by introducing multi-GPU support, allowing XGBoost to distribute the training workload across multiple GPUs on a single machine. 
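In the current API, GPU training is switched on through the training parameters. A minimal sketch (since XGBoost 2.0 the accelerator is selected via the <code>device</code> parameter, replacing the older <code>tree_method="gpu_hist"</code>; the other values here are illustrative, not tuned):

```python
# GPU-accelerated training setup for XGBoost >= 2.0.
params = {
    "objective": "reg:squarederror",
    "tree_method": "hist",   # the histogram algorithm runs on CPU or GPU
    "device": "cuda",        # "cuda:0", "cuda:1", ... to pin a specific GPU
    "max_depth": 6,          # illustrative, not tuned
    "learning_rate": 0.05,
}
# booster = xgboost.train(params, dtrain, num_boost_round=500)
```

Multi-GPU training on one machine is nowadays typically driven through the `xgboost.dask` interface rather than a single-process parameter.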
This added parallelism accelerated model training further and made it feasible to handle even bigger datasets more efficiently.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kKx2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0880fe3c-d363-46fd-93f8-be71e2e88dc9_1512x642.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kKx2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0880fe3c-d363-46fd-93f8-be71e2e88dc9_1512x642.png 424w, https://substackcdn.com/image/fetch/$s_!kKx2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0880fe3c-d363-46fd-93f8-be71e2e88dc9_1512x642.png 848w, https://substackcdn.com/image/fetch/$s_!kKx2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0880fe3c-d363-46fd-93f8-be71e2e88dc9_1512x642.png 1272w, https://substackcdn.com/image/fetch/$s_!kKx2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0880fe3c-d363-46fd-93f8-be71e2e88dc9_1512x642.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kKx2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0880fe3c-d363-46fd-93f8-be71e2e88dc9_1512x642.png" width="1456" height="618" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0880fe3c-d363-46fd-93f8-be71e2e88dc9_1512x642.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:618,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:310206,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kKx2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0880fe3c-d363-46fd-93f8-be71e2e88dc9_1512x642.png 424w, https://substackcdn.com/image/fetch/$s_!kKx2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0880fe3c-d363-46fd-93f8-be71e2e88dc9_1512x642.png 848w, https://substackcdn.com/image/fetch/$s_!kKx2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0880fe3c-d363-46fd-93f8-be71e2e88dc9_1512x642.png 1272w, https://substackcdn.com/image/fetch/$s_!kKx2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0880fe3c-d363-46fd-93f8-be71e2e88dc9_1512x642.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><p>Organizations that needed to scale beyond a single machine required a solution for distributing training across clusters. Version 1.0 of XGBoost delivered this capability by integrating with Dask, a Python library designed for parallel and distributed computing. This integration enabled both preprocessing and training tasks to run on multiple machines in a cluster. Larger datasets could be split across these machines, with each node handling part of the workload, leading to faster overall training times.</p><p>Finally, in version 1.4, all multi-GPU and multi-machine capabilities were consolidated under Dask&#8217;s framework. Instead of relying on separate features for single-node versus multi-node setups, users could simply configure Dask for their cluster. 
Dask would then manage resource allocation for both CPU and GPU workloads, streamlining the distributed training process and reducing the complexity involved in scaling XGBoost across different hardware configurations.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SQEv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25579b6d-79f9-4aef-abbf-dcb94d83be31_1400x700.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SQEv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25579b6d-79f9-4aef-abbf-dcb94d83be31_1400x700.png 424w, https://substackcdn.com/image/fetch/$s_!SQEv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25579b6d-79f9-4aef-abbf-dcb94d83be31_1400x700.png 848w, https://substackcdn.com/image/fetch/$s_!SQEv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25579b6d-79f9-4aef-abbf-dcb94d83be31_1400x700.png 1272w, https://substackcdn.com/image/fetch/$s_!SQEv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25579b6d-79f9-4aef-abbf-dcb94d83be31_1400x700.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SQEv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25579b6d-79f9-4aef-abbf-dcb94d83be31_1400x700.png" width="1400" height="700" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/25579b6d-79f9-4aef-abbf-dcb94d83be31_1400x700.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:700,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1017474,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SQEv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25579b6d-79f9-4aef-abbf-dcb94d83be31_1400x700.png 424w, https://substackcdn.com/image/fetch/$s_!SQEv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25579b6d-79f9-4aef-abbf-dcb94d83be31_1400x700.png 848w, https://substackcdn.com/image/fetch/$s_!SQEv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25579b6d-79f9-4aef-abbf-dcb94d83be31_1400x700.png 1272w, https://substackcdn.com/image/fetch/$s_!SQEv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25579b6d-79f9-4aef-abbf-dcb94d83be31_1400x700.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Before we go any further, though, it would be useful to explain what Dask and Optuna are, especially if you have never come across them or have never used them.</p><h3>What is Dask?</h3><p>Dask is a flexible parallel computing library for Python that makes it easy to scale computations from a single laptop to a large cluster. It provides parallel collections like arrays, dataframes, and bags that mimic their in-memory equivalents but can operate on larger-than-memory datasets by breaking them into smaller chunks and distributing work across multiple cores or machines. Dask integrates seamlessly with the broader PyData ecosystem, allowing users to work with familiar libraries like NumPy, pandas, and scikit-learn, but with the option to speed up or scale out whenever needed. 
Its scheduler dynamically constructs and executes task graphs under the hood, handling optimizations and load balancing so users can focus on writing clear, efficient, and parallel code.</p><h3>What is Optuna?</h3><p>Optuna is an open-source Python library designed to automate hyperparameter optimization for machine learning models. Hyperparameter optimization is the process of systematically searching for the best combination of settings (e.g., learning rate, number of layers, regularization parameters) that yields the highest performance for a model. Optuna streamlines this process through an easy-to-use &#8220;define-by-run&#8221; approach, allowing users to dynamically define the hyperparameter search space. It employs sophisticated search algorithms - such as Bayesian optimization with TPE - and includes features like pruning to terminate underperforming trials early, thereby saving computational resources. Optuna is a framework- and library-agnostic system: it can be used with &#8220;classical&#8221; ML algorithms as well as with all of the most popular neural network frameworks. It integrates seamlessly with popular frameworks like PyTorch and TensorFlow, making it both efficient and convenient to discover optimal hyperparameters, ultimately enhancing the accuracy and reliability of machine learning models.</p><h2>XGBoost and Dask for hyperparameter optimization - an example with the Porto Seguro dataset</h2><p>We&#8217;ll show how to combine Dask, XGBoost, and Optuna for hyperparameter optimization. We&#8217;ll use the dataset from the Porto Seguro Kaggle competition, and we&#8217;ll do the training on a DGX H100, over 8 (!!!) H100 GPUs. This is, granted, a bit of an overkill in terms of compute, but it does make training run really, really fast, which comes in really handy when you are trying hundreds, or even thousands, of different hyperparameter combinations. 
</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sYXk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcc6a3d-7c76-4d22-a992-5ea5579afa68_5280x806.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sYXk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcc6a3d-7c76-4d22-a992-5ea5579afa68_5280x806.png 424w, https://substackcdn.com/image/fetch/$s_!sYXk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcc6a3d-7c76-4d22-a992-5ea5579afa68_5280x806.png 848w, https://substackcdn.com/image/fetch/$s_!sYXk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcc6a3d-7c76-4d22-a992-5ea5579afa68_5280x806.png 1272w, https://substackcdn.com/image/fetch/$s_!sYXk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcc6a3d-7c76-4d22-a992-5ea5579afa68_5280x806.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sYXk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcc6a3d-7c76-4d22-a992-5ea5579afa68_5280x806.png" width="1456" height="222" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3fcc6a3d-7c76-4d22-a992-5ea5579afa68_5280x806.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:222,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1125711,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sYXk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcc6a3d-7c76-4d22-a992-5ea5579afa68_5280x806.png 424w, https://substackcdn.com/image/fetch/$s_!sYXk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcc6a3d-7c76-4d22-a992-5ea5579afa68_5280x806.png 848w, https://substackcdn.com/image/fetch/$s_!sYXk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcc6a3d-7c76-4d22-a992-5ea5579afa68_5280x806.png 1272w, https://substackcdn.com/image/fetch/$s_!sYXk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcc6a3d-7c76-4d22-a992-5ea5579afa68_5280x806.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>For the purposes of this example, the most relevant thing to know about the Porto Seguro dataset and task is that it was a classification competition with anonymized features. 
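</p><p>One detail worth knowing: the competition was scored with the Normalized Gini coefficient, which for a binary target is equivalent to <code>2 * AUC - 1</code>, so optimizing AUC (as we do below with <code>roc_auc_score</code>) optimizes the leaderboard metric as well. A quick sketch:</p>

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def normalized_gini(y_true, y_pred):
    # For binary labels, the Normalized Gini coefficient is 2 * AUC - 1.
    return 2.0 * roc_auc_score(y_true, y_pred) - 1.0

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0.1, 0.4, 0.35, 0.8])
print(normalized_gini(y_true, y_pred))  # AUC is 0.75 here, so the Gini is 0.5
```

<p>That conversion is handy when comparing AUC numbers against the scores on the competition leaderboard.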
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PFZ3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae5cb89d-8292-4b13-851c-a96889dc26c4_2790x1124.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PFZ3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae5cb89d-8292-4b13-851c-a96889dc26c4_2790x1124.png 424w, https://substackcdn.com/image/fetch/$s_!PFZ3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae5cb89d-8292-4b13-851c-a96889dc26c4_2790x1124.png 848w, https://substackcdn.com/image/fetch/$s_!PFZ3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae5cb89d-8292-4b13-851c-a96889dc26c4_2790x1124.png 1272w, https://substackcdn.com/image/fetch/$s_!PFZ3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae5cb89d-8292-4b13-851c-a96889dc26c4_2790x1124.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PFZ3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae5cb89d-8292-4b13-851c-a96889dc26c4_2790x1124.png" width="1456" height="587" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ae5cb89d-8292-4b13-851c-a96889dc26c4_2790x1124.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:587,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:256138,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PFZ3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae5cb89d-8292-4b13-851c-a96889dc26c4_2790x1124.png 424w, https://substackcdn.com/image/fetch/$s_!PFZ3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae5cb89d-8292-4b13-851c-a96889dc26c4_2790x1124.png 848w, https://substackcdn.com/image/fetch/$s_!PFZ3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae5cb89d-8292-4b13-851c-a96889dc26c4_2790x1124.png 1272w, https://substackcdn.com/image/fetch/$s_!PFZ3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae5cb89d-8292-4b13-851c-a96889dc26c4_2790x1124.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>All the code for the competition can be found in the following <a href="https://github.com/tunguz/TabularBenchmarks/blob/main/datasets/Porto_Seguro/scripts/Five_Fold_Optuna_XGB.ipynb">notebook on GitHub</a>. The full repo, with all the other scripts and output artifacts, can be found <a href="https://github.com/tunguz/TabularBenchmarks/tree/main/datasets/Porto_Seguro">here</a>.</p><p>First, we import all the essential libraries for distributed computing, data manipulation, machine learning, and hyperparameter optimization. <code>dask.distributed</code> and <code>dask_cuda</code> enable parallel/distributed computing across multiple GPUs, while <code>pandas</code> and <code>numpy</code> handle data structures and numerical operations. <code>xgboost</code> provides gradient boosting methods, <code>sklearn</code> offers model-selection and evaluation utilities (like <code>KFold</code> and <code>roc_auc_score</code>), and <code>optuna</code> is a framework for hyperparameter tuning. 
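</p><p>In code, the import cell looks roughly like this (a sketch reconstructed from the description above; the exact cell is in the linked notebook):</p>

```python
# Reconstructed import cell (sketch; see the linked notebook for the original).
import gc
import logging

import numpy as np
import pandas as pd

import dask.dataframe as dd
from dask import delayed
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

import xgboost as xgb
import optuna

from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
```

<p>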
The <code>gc</code> (garbage collector) and <code>logging</code> modules help with memory management and logging, respectively.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T_h1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8643688-fe7f-4d90-9580-c05f04b08a73_1548x2312.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T_h1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8643688-fe7f-4d90-9580-c05f04b08a73_1548x2312.png 424w, https://substackcdn.com/image/fetch/$s_!T_h1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8643688-fe7f-4d90-9580-c05f04b08a73_1548x2312.png 848w, https://substackcdn.com/image/fetch/$s_!T_h1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8643688-fe7f-4d90-9580-c05f04b08a73_1548x2312.png 1272w, https://substackcdn.com/image/fetch/$s_!T_h1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8643688-fe7f-4d90-9580-c05f04b08a73_1548x2312.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T_h1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8643688-fe7f-4d90-9580-c05f04b08a73_1548x2312.png" width="576" height="860.4395604395604" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e8643688-fe7f-4d90-9580-c05f04b08a73_1548x2312.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2175,&quot;width&quot;:1456,&quot;resizeWidth&quot;:576,&quot;bytes&quot;:793417,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!T_h1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8643688-fe7f-4d90-9580-c05f04b08a73_1548x2312.png 424w, https://substackcdn.com/image/fetch/$s_!T_h1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8643688-fe7f-4d90-9580-c05f04b08a73_1548x2312.png 848w, https://substackcdn.com/image/fetch/$s_!T_h1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8643688-fe7f-4d90-9580-c05f04b08a73_1548x2312.png 1272w, https://substackcdn.com/image/fetch/$s_!T_h1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8643688-fe7f-4d90-9580-c05f04b08a73_1548x2312.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Then we create a local GPU&#8208;aware Dask cluster with eight GPU workers (via <code>LocalCUDACluster(n_workers=8)</code>) and initialize a Dask client that connects to it, allowing us to distribute and manage computational tasks across those GPU workers.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rLcn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57d7f73-6f6b-41e8-a188-5f9167080534_1980x370.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rLcn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57d7f73-6f6b-41e8-a188-5f9167080534_1980x370.png 424w, 
https://substackcdn.com/image/fetch/$s_!rLcn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57d7f73-6f6b-41e8-a188-5f9167080534_1980x370.png 848w, https://substackcdn.com/image/fetch/$s_!rLcn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57d7f73-6f6b-41e8-a188-5f9167080534_1980x370.png 1272w, https://substackcdn.com/image/fetch/$s_!rLcn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57d7f73-6f6b-41e8-a188-5f9167080534_1980x370.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rLcn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57d7f73-6f6b-41e8-a188-5f9167080534_1980x370.png" width="1456" height="272" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b57d7f73-6f6b-41e8-a188-5f9167080534_1980x370.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:272,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:138114,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rLcn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57d7f73-6f6b-41e8-a188-5f9167080534_1980x370.png 424w, 
https://substackcdn.com/image/fetch/$s_!rLcn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57d7f73-6f6b-41e8-a188-5f9167080534_1980x370.png 848w, https://substackcdn.com/image/fetch/$s_!rLcn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57d7f73-6f6b-41e8-a188-5f9167080534_1980x370.png 1272w, https://substackcdn.com/image/fetch/$s_!rLcn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57d7f73-6f6b-41e8-a188-5f9167080534_1980x370.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>We then loop over five training/validation folds, lazily read each fold from CSV with <code>delayed</code> and <code>dd.from_delayed</code>, split out the &#8220;target&#8221; column, and then stash both the feature matrix and the target vector in lists. In other words, we collect each fold&#8217;s data and labels without actually loading them into memory until needed, using Dask&#8217;s delayed execution model. I prefer to prepare folds and save them as separate files, because Dask can be a bit finicky about slicing them, especially if they are loaded using the lazy read. We are also using a full 5-fold validation scheme for Optuna. Normally this would be a huge computational and time overkill, but hey, when you have lots of computational resources at your disposal, why not. 
:D </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!44H0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff08de45b-cc8c-4da2-a99f-7cb13664e03d_2566x1456.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!44H0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff08de45b-cc8c-4da2-a99f-7cb13664e03d_2566x1456.png 424w, https://substackcdn.com/image/fetch/$s_!44H0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff08de45b-cc8c-4da2-a99f-7cb13664e03d_2566x1456.png 848w, https://substackcdn.com/image/fetch/$s_!44H0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff08de45b-cc8c-4da2-a99f-7cb13664e03d_2566x1456.png 1272w, https://substackcdn.com/image/fetch/$s_!44H0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff08de45b-cc8c-4da2-a99f-7cb13664e03d_2566x1456.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!44H0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff08de45b-cc8c-4da2-a99f-7cb13664e03d_2566x1456.png" width="1456" height="826" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f08de45b-cc8c-4da2-a99f-7cb13664e03d_2566x1456.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:826,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:684969,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!44H0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff08de45b-cc8c-4da2-a99f-7cb13664e03d_2566x1456.png 424w, https://substackcdn.com/image/fetch/$s_!44H0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff08de45b-cc8c-4da2-a99f-7cb13664e03d_2566x1456.png 848w, https://substackcdn.com/image/fetch/$s_!44H0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff08de45b-cc8c-4da2-a99f-7cb13664e03d_2566x1456.png 1272w, https://substackcdn.com/image/fetch/$s_!44H0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff08de45b-cc8c-4da2-a99f-7cb13664e03d_2566x1456.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Optuna objective function</h2><p>The code below defines a function, which serves as the Optuna objective for tuning XGBoost hyperparameters in a cross&#8208;validated manner using Dask. Inside the <code>objective</code> function, a dictionary of candidate hyperparameters (<code>params</code>) is defined by calling various <code>trial.suggest_*</code> methods (e.g., to choose values for <code>lambda</code>, <code>alpha</code>, <code>colsample_bytree</code>, etc.). A five&#8208;fold cross&#8208;validation (<code>KFold</code>) then splits the training data, and in each fold the code creates Dask&#8208;based <code>DMatrix</code> objects for training and validation. Next, it trains an XGBoost model on each fold with the current candidate parameters and retrieves fold&#8208;specific predictions, storing them in <code>train_oof</code>. 
After all folds, the function calculates the Gini metric on these out&#8208;of&#8208;fold predictions and returns that metric, which Optuna uses to guide the hyperparameter search.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Kfcz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feaf900a2-7371-4cd2-9e96-cf4592d91418_1554x1095.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Kfcz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feaf900a2-7371-4cd2-9e96-cf4592d91418_1554x1095.png 424w, https://substackcdn.com/image/fetch/$s_!Kfcz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feaf900a2-7371-4cd2-9e96-cf4592d91418_1554x1095.png 848w, https://substackcdn.com/image/fetch/$s_!Kfcz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feaf900a2-7371-4cd2-9e96-cf4592d91418_1554x1095.png 1272w, https://substackcdn.com/image/fetch/$s_!Kfcz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feaf900a2-7371-4cd2-9e96-cf4592d91418_1554x1095.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Kfcz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feaf900a2-7371-4cd2-9e96-cf4592d91418_1554x1095.png" width="1456" height="1026" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eaf900a2-7371-4cd2-9e96-cf4592d91418_1554x1095.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1026,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:548391,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.xgblog.ai/i/157423462?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feaf900a2-7371-4cd2-9e96-cf4592d91418_1554x1095.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Kfcz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feaf900a2-7371-4cd2-9e96-cf4592d91418_1554x1095.png 424w, https://substackcdn.com/image/fetch/$s_!Kfcz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feaf900a2-7371-4cd2-9e96-cf4592d91418_1554x1095.png 848w, https://substackcdn.com/image/fetch/$s_!Kfcz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feaf900a2-7371-4cd2-9e96-cf4592d91418_1554x1095.png 1272w, https://substackcdn.com/image/fetch/$s_!Kfcz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feaf900a2-7371-4cd2-9e96-cf4592d91418_1554x1095.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Logging and optimizing</h2><p>This snippet configures the Python logging system so that Optuna&#8217;s messages go to a file instead of standard error, then creates an Optuna &#8220;study&#8221; (the container that orchestrates hyperparameter trials) and finally kicks off the optimization process. First, a logger is obtained and its logging level is set to <code>INFO</code>. A <code>FileHandler</code> is added so that all messages at <code>INFO</code> level or above are written to the file <code>optuna_xgb_output_0.log</code>. The lines with <code>optuna.logging.enable_propagation()</code> and <code>optuna.logging.disable_default_handler()</code> ensure that logs are forwarded to the root logger (and hence into the file), while preventing duplicate outputs to standard error. 
The <code>create_study()</code> call sets up an Optuna study named <code>five_fold_optuna_xgb_0</code> with a specified SQLite database to record the experiment results. Finally, the code invokes <code>study.optimize(...)</code> with <code>n_trials=3</code>, instructing Optuna to run three hyperparameter&#8208;search trials using the provided objective function. This optimization process runs really, really fast on a DGX H100 - between 30 and 60 seconds per trial!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7o3m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40926d2-1719-43d1-af4e-261dea999643_2509x632.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7o3m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40926d2-1719-43d1-af4e-261dea999643_2509x632.png 424w, https://substackcdn.com/image/fetch/$s_!7o3m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40926d2-1719-43d1-af4e-261dea999643_2509x632.png 848w, https://substackcdn.com/image/fetch/$s_!7o3m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40926d2-1719-43d1-af4e-261dea999643_2509x632.png 1272w, https://substackcdn.com/image/fetch/$s_!7o3m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40926d2-1719-43d1-af4e-261dea999643_2509x632.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!7o3m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40926d2-1719-43d1-af4e-261dea999643_2509x632.png" width="1456" height="367" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a40926d2-1719-43d1-af4e-261dea999643_2509x632.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:367,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:364196,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.xgblog.ai/i/157423462?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40926d2-1719-43d1-af4e-261dea999643_2509x632.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7o3m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40926d2-1719-43d1-af4e-261dea999643_2509x632.png 424w, https://substackcdn.com/image/fetch/$s_!7o3m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40926d2-1719-43d1-af4e-261dea999643_2509x632.png 848w, https://substackcdn.com/image/fetch/$s_!7o3m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40926d2-1719-43d1-af4e-261dea999643_2509x632.png 1272w, https://substackcdn.com/image/fetch/$s_!7o3m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40926d2-1719-43d1-af4e-261dea999643_2509x632.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Dask Training on a Cluster</h2><p>Dask training can be done on a single CPU machine, on a multi-GPU machine, or on a cluster of CPUs and GPUs. To set up a Dask cluster, you first need to launch a central <strong>scheduler</strong> process, which acts as the traffic controller for all your workers. From the command line, you can simply run <code>dask scheduler</code> to start it. The scheduler will report a TCP address where it is listening (for example, <code>tcp://127.0.0.1:8786</code>). This address is how clients and workers find and communicate with the scheduler. 
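</p><p>Collected as copy-pasteable commands (the address is whatever your scheduler reports; the GPU worker command assumes the <code>dask-cuda</code> package is installed):</p>

```shell
# 1. Start the scheduler; note the tcp:// address it prints.
dask scheduler

# 2. On each worker machine, point a worker at that address.
dask cuda worker 127.0.0.1:8786   # GPU workers (requires dask-cuda)
# dask worker 127.0.0.1:8786      # or plain CPU workers

# 3. In Python, connect a client to the same address:
#    from dask.distributed import Client
#    client = Client("127.0.0.1:8786")
```

<p>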
In a simple setup on a single machine, this is often all you need, but in more complex, multi-node environments you would run the scheduler on a network-accessible interface and start workers on remote machines.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!X_e9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21dbbb33-c6a1-4a6f-ad4e-3ac425474253_1905x221.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!X_e9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21dbbb33-c6a1-4a6f-ad4e-3ac425474253_1905x221.png 424w, https://substackcdn.com/image/fetch/$s_!X_e9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21dbbb33-c6a1-4a6f-ad4e-3ac425474253_1905x221.png 848w, https://substackcdn.com/image/fetch/$s_!X_e9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21dbbb33-c6a1-4a6f-ad4e-3ac425474253_1905x221.png 1272w, https://substackcdn.com/image/fetch/$s_!X_e9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21dbbb33-c6a1-4a6f-ad4e-3ac425474253_1905x221.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!X_e9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21dbbb33-c6a1-4a6f-ad4e-3ac425474253_1905x221.png" width="1456" height="169" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/21dbbb33-c6a1-4a6f-ad4e-3ac425474253_1905x221.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:169,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:78825,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.xgblog.ai/i/157423462?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21dbbb33-c6a1-4a6f-ad4e-3ac425474253_1905x221.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!X_e9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21dbbb33-c6a1-4a6f-ad4e-3ac425474253_1905x221.png 424w, https://substackcdn.com/image/fetch/$s_!X_e9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21dbbb33-c6a1-4a6f-ad4e-3ac425474253_1905x221.png 848w, https://substackcdn.com/image/fetch/$s_!X_e9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21dbbb33-c6a1-4a6f-ad4e-3ac425474253_1905x221.png 1272w, https://substackcdn.com/image/fetch/$s_!X_e9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21dbbb33-c6a1-4a6f-ad4e-3ac425474253_1905x221.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>After the scheduler is up and running, you can add <strong>workers</strong> to the cluster. 
In our example, we would use <code>dask cuda worker 127.0.0.1:8786</code> (if we want GPU-enabled workers) to connect a worker process to the scheduler. Each worker will register itself with the scheduler, making its CPU or GPU resources available for distributed tasks. Finally, in our Python session, connect a Dask client to this cluster by importing <code>Client</code> from <code>dask.distributed</code> and creating an instance pointing to the same scheduler address, for example <code>client = Client("127.0.0.1:8786")</code>. This tells our Python scripts or notebooks to submit computations through the Dask scheduler and distribute tasks across the available workers.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7S16!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c384864-89fe-4eea-8d51-7892bfee8128_1948x185.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7S16!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c384864-89fe-4eea-8d51-7892bfee8128_1948x185.png 424w, https://substackcdn.com/image/fetch/$s_!7S16!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c384864-89fe-4eea-8d51-7892bfee8128_1948x185.png 848w, https://substackcdn.com/image/fetch/$s_!7S16!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c384864-89fe-4eea-8d51-7892bfee8128_1948x185.png 1272w, https://substackcdn.com/image/fetch/$s_!7S16!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c384864-89fe-4eea-8d51-7892bfee8128_1948x185.png 
1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7S16!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c384864-89fe-4eea-8d51-7892bfee8128_1948x185.png" width="1456" height="138" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1c384864-89fe-4eea-8d51-7892bfee8128_1948x185.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:138,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:54952,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.xgblog.ai/i/157423462?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c384864-89fe-4eea-8d51-7892bfee8128_1948x185.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7S16!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c384864-89fe-4eea-8d51-7892bfee8128_1948x185.png 424w, https://substackcdn.com/image/fetch/$s_!7S16!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c384864-89fe-4eea-8d51-7892bfee8128_1948x185.png 848w, https://substackcdn.com/image/fetch/$s_!7S16!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c384864-89fe-4eea-8d51-7892bfee8128_1948x185.png 1272w, 
https://substackcdn.com/image/fetch/$s_!7S16!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c384864-89fe-4eea-8d51-7892bfee8128_1948x185.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h2>XGBoost: train anywhere, deploy anywhere</h2><p>One of the most wonderful things about XGBoost is its <strong>extremely</strong> widespread adoption and compatibility. It has been ported to almost any compute environment and device you can think of. For instance, I was able to export the above model, trained on a DGX H100, as a JSON file. I then loaded that model onto my tiny Raspberry Pi Zero and ran inference with it!</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Czwd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5f73f0f-361e-4891-81ac-6718f7578deb_2125x378.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Czwd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5f73f0f-361e-4891-81ac-6718f7578deb_2125x378.png 424w, https://substackcdn.com/image/fetch/$s_!Czwd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5f73f0f-361e-4891-81ac-6718f7578deb_2125x378.png 848w, https://substackcdn.com/image/fetch/$s_!Czwd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5f73f0f-361e-4891-81ac-6718f7578deb_2125x378.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Czwd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5f73f0f-361e-4891-81ac-6718f7578deb_2125x378.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Czwd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5f73f0f-361e-4891-81ac-6718f7578deb_2125x378.png" width="1456" height="259" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c5f73f0f-361e-4891-81ac-6718f7578deb_2125x378.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:259,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1106270,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.xgblog.ai/i/157423462?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5f73f0f-361e-4891-81ac-6718f7578deb_2125x378.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Czwd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5f73f0f-361e-4891-81ac-6718f7578deb_2125x378.png 424w, https://substackcdn.com/image/fetch/$s_!Czwd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5f73f0f-361e-4891-81ac-6718f7578deb_2125x378.png 848w, https://substackcdn.com/image/fetch/$s_!Czwd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5f73f0f-361e-4891-81ac-6718f7578deb_2125x378.png 
1272w, https://substackcdn.com/image/fetch/$s_!Czwd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5f73f0f-361e-4891-81ac-6718f7578deb_2125x378.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p></p>]]></content:encoded></item><item><title><![CDATA[Book Review - Effective Visualization: Exploiting Matplotlib & Pandas]]></title><description><![CDATA[Great introductory book for plotting and visualization in Python]]></description><link>https://www.xgblog.ai/p/book-review-effective-visualization</link><guid isPermaLink="false">https://www.xgblog.ai/p/book-review-effective-visualization</guid><dc:creator><![CDATA[Bojan Tunguz]]></dc:creator><pubDate>Thu, 13 Feb 2025 13:23:59 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Wtbj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7a7a74-e9e5-44bf-8c90-d911f4b424b4_1328x2044.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://store.metasnake.com/effective-viz" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Wtbj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7a7a74-e9e5-44bf-8c90-d911f4b424b4_1328x2044.png 424w, https://substackcdn.com/image/fetch/$s_!Wtbj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7a7a74-e9e5-44bf-8c90-d911f4b424b4_1328x2044.png 848w, 
https://substackcdn.com/image/fetch/$s_!Wtbj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7a7a74-e9e5-44bf-8c90-d911f4b424b4_1328x2044.png 1272w, https://substackcdn.com/image/fetch/$s_!Wtbj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7a7a74-e9e5-44bf-8c90-d911f4b424b4_1328x2044.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Wtbj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7a7a74-e9e5-44bf-8c90-d911f4b424b4_1328x2044.png" width="1328" height="2044" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eb7a7a74-e9e5-44bf-8c90-d911f4b424b4_1328x2044.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2044,&quot;width&quot;:1328,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:211688,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://store.metasnake.com/effective-viz&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Wtbj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7a7a74-e9e5-44bf-8c90-d911f4b424b4_1328x2044.png 424w, https://substackcdn.com/image/fetch/$s_!Wtbj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7a7a74-e9e5-44bf-8c90-d911f4b424b4_1328x2044.png 848w, 
https://substackcdn.com/image/fetch/$s_!Wtbj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7a7a74-e9e5-44bf-8c90-d911f4b424b4_1328x2044.png 1272w, https://substackcdn.com/image/fetch/$s_!Wtbj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7a7a74-e9e5-44bf-8c90-d911f4b424b4_1328x2044.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>Matt Harrison is a prolific author of books on Python and various topics in Data Science and Machine Learning. 
I&#8217;ve known Matt for years, both through his books and personally, and have long been a fan of his work and his approach to the field. I have had the privilege of getting early access to his latest book, <em><a href="https://store.metasnake.com/effective-viz">Effective Visualization: Exploiting Matplotlib &amp; Pandas</a></em>. In this post I&#8217;ll summarize my general impressions of the book, as well as what I believe Data Science practitioners should focus on. </p><p>I&#8217;ll be completely honest: my own visualization skills leave a lot to be desired. For me, visualization has been something of an afterthought, a nice-to-have, but far from a core part of my ML/DS workflow. So a book like this one is actually a very useful instruction guide for me as well. <br><br><em>Effective Visualization: Exploiting Matplotlib &amp; Pandas</em> is a must-read for any machine learning practitioner who works with tabular data. The book manages to bridge the often wide gap between theoretical visualization principles and the practical coding techniques needed to bring data to life. In a field where model outputs and feature relationships are buried in rows and columns, Harrison&#8217;s work offers a clear path from raw data to compelling, informative visuals that enhance both exploration and communication.</p><p>At its core, the book does a remarkable job of combining conceptual guidance with hands-on examples. Many resources out there tend to focus solely on theory or, conversely, on syntax-heavy code without explaining why a particular visual is effective. Harrison strikes the perfect balance: he explains the &#8220;why&#8221; behind good visual design while showing you exactly how to implement these ideas using Matplotlib and Pandas. This dual approach is especially beneficial when you need to quickly turn a rough exploratory plot into something presentation-ready. 
Throughout the text, you see concepts introduced and immediately applied to real-world datasets, making it easy to understand how to extend these techniques to your own machine learning projects.</p><p>One of the book&#8217;s standout features is its focus on the kind of tabular data common in machine learning workflows. Whether you&#8217;re analyzing feature distributions, examining correlations, or comparing model predictions, you&#8217;ll find that the book covers the essential plot types - histograms, scatter plots, bar charts, and line plots - in a manner that is both accessible and deeply practical. Harrison&#8217;s use of Pandas for data manipulation combined with Matplotlib&#8217;s robust plotting capabilities means that you&#8217;re not just learning to create a chart, but you&#8217;re also learning how to integrate these visuals directly into your analysis pipeline. The examples are carefully chosen to demonstrate common pitfalls and best practices, such as how to avoid clutter in a scatter plot or how to use annotations to highlight key insights.</p><p>For machine learning practitioners, the ability to quickly iterate on visualizations is invaluable. Harrison shows you how to harness the power of Pandas&#8217; built-in plotting functions to get a fast look at your data, and then how to transition into Matplotlib&#8217;s more detailed API when you need finer control over the aesthetics. This layered approach means that you can use high-level functions to prototype a chart and then &#8220;drop down&#8221; to Matplotlib to polish it up. The book emphasizes method chaining and clean coding practices, so you learn to write visualization code that is both efficient and easy to maintain - a quality that pays dividends when you&#8217;re debugging or iterating on a complex machine learning model.</p><p>Another significant benefit of <em>Effective Visualization</em> is how it transforms the way you communicate your findings. 
In the machine learning field, it&#8217;s not enough to simply build a good model; you need to explain its behavior and validate its performance to both technical and non-technical audiences. Harrison&#8217;s text instills the importance of telling a clear story through your visuals. The book introduces a framework for creating what he calls &#8220;CLEAR&#8221; visualizations - charts that are clear, limited in design, explanatory, audience-focused, and well-referenced. While the book isn&#8217;t solely about machine learning, the techniques it teaches are perfectly suited to illustrating model performance, feature importances, and the often subtle nuances of your data. For instance, you might use a histogram to reveal the distribution of a skewed feature or a scatter plot to compare actual versus predicted values, and then enhance these visuals with annotations that call attention to specific outliers or trends.</p><p>The book&#8217;s emphasis on customization and iterative refinement means that you gain more than just a set of plotting recipes. You learn the principles of effective visual communication that help you decide what information to display and how to display it. In one memorable example, Harrison demonstrates how to add context to a scatter plot with detailed annotations and customized color palettes that draw attention to the most important parts of the data. This attention to detail is critical for anyone working with machine learning models, where the stakes of misinterpretation can be high. Instead of producing generic, off-the-shelf plots, you&#8217;re empowered to create visuals that are tailored to the specific insights you want to convey.</p><p>Moreover, the book also delves into more advanced topics such as multi-panel layouts, grid specifications, and even the integration of textual annotations with visual elements. 
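</p><p>As a small illustration of the multi-panel layouts just mentioned, here is a sketch using Matplotlib&#8217;s <code>subplot_mosaic</code>. This is my own example, not code from the book; <code>subplot_mosaic</code> is a real Matplotlib API, but the panel names and toy data are mine:</p>

```python
# Minimal multi-panel layout with subplot_mosaic: a wide "overview"
# panel on top, with a histogram and a scatter plot below it.
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

fig, axes = plt.subplot_mosaic(
    [["overview", "overview"],
     ["hist", "scatter"]],
    figsize=(8, 5),
)

xs = list(range(50))
ys = [0.5 * x + (x % 7) for x in xs]  # toy feature/target data

axes["overview"].plot(xs, ys)
axes["overview"].set_title("Trend")
axes["hist"].hist(ys, bins=10)
axes["hist"].set_title("Distribution")
axes["scatter"].scatter(xs, ys, s=10)
axes["scatter"].set_title("Feature vs. target")

fig.tight_layout()
fig.savefig("mosaic.png")
```

<p>The returned <code>axes</code> is a dictionary keyed by the mosaic labels, which keeps multi-panel code readable compared to indexing into an array of subplots.</p><p>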
These skills are directly applicable when you&#8217;re trying to communicate the results of a complex model or compare multiple segments of your data side by side. For example, when analyzing feature interactions, you might need to create a grid of plots where each subplot represents a different subset of the data. Harrison explains how to do this using Matplotlib&#8217;s <code>GridSpec</code> and <code>subplot_mosaic</code> functions, giving you the tools to produce polished and professional graphics that are ready for publication or presentation.</p><p>Beyond its technical content, <em>Effective Visualization</em> also inspires a shift in mindset. It encourages readers to view every chart not just as a means to display data, but as an opportunity to tell a story. This storytelling aspect is crucial in machine learning, where the narrative around your model&#8217;s performance or a feature&#8217;s influence can significantly impact decision-making. The book consistently drives home the point that a well-designed visualization can turn complex data into an accessible and persuasive argument.</p><p>While the book is deeply technical, its writing remains engaging and approachable. Harrison avoids overwhelming readers with overly dense code or abstract theory; instead, he opts for clear explanations and a conversational tone that makes the material feel both professional and inviting. The free-flowing nature of the narrative means that you&#8217;re not bogged down by rigid sections or bullet-pointed lists. Instead, you&#8217;re taken on a journey that gradually builds your skills and confidence in both Python visualization and data storytelling.</p><p>Ultimately, <em>Effective Visualization</em> proves itself as an indispensable resource for machine learning practitioners. 
Its focus on practical, actionable techniques means that you can immediately apply what you learn to your own projects - whether you&#8217;re diving into exploratory data analysis, fine-tuning model outputs, or preparing visuals for an important presentation. At the same time, its broader lessons in design and communication are valuable for any data scientist seeking to improve how they share insights.</p><p>In a field where the clarity of visual communication can make or break the impact of your analysis, Matt Harrison&#8217;s book stands out as a guide that is both comprehensive and highly relevant. By melding practical code examples with sound design principles, <em>Effective Visualization</em> equips you with a toolkit that is indispensable in today&#8217;s data-driven world. Whether you&#8217;re refining a model, exploring new features, or communicating your results to stakeholders, the lessons in this book ensure that your visuals are not just pretty, but truly effective at conveying the insights hidden within your data.</p>]]></content:encoded></item><item><title><![CDATA[When to use which approach/technique with a given dataset]]></title><description><![CDATA[These are my rules of thumb, and caveats could fill an entire book]]></description><link>https://www.xgblog.ai/p/when-to-use-which-approachtechnique</link><guid isPermaLink="false">https://www.xgblog.ai/p/when-to-use-which-approachtechnique</guid><dc:creator><![CDATA[Bojan Tunguz]]></dc:creator><pubDate>Mon, 10 Feb 2025 13:08:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!d1xc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70278192-c3c0-490f-8188-f602eb49d56b_7169x4785.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!d1xc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70278192-c3c0-490f-8188-f602eb49d56b_7169x4785.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!d1xc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70278192-c3c0-490f-8188-f602eb49d56b_7169x4785.jpeg 424w, https://substackcdn.com/image/fetch/$s_!d1xc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70278192-c3c0-490f-8188-f602eb49d56b_7169x4785.jpeg 848w, https://substackcdn.com/image/fetch/$s_!d1xc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70278192-c3c0-490f-8188-f602eb49d56b_7169x4785.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!d1xc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70278192-c3c0-490f-8188-f602eb49d56b_7169x4785.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!d1xc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70278192-c3c0-490f-8188-f602eb49d56b_7169x4785.jpeg" width="1456" height="972" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/70278192-c3c0-490f-8188-f602eb49d56b_7169x4785.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:972,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:10414230,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!d1xc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70278192-c3c0-490f-8188-f602eb49d56b_7169x4785.jpeg 424w, https://substackcdn.com/image/fetch/$s_!d1xc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70278192-c3c0-490f-8188-f602eb49d56b_7169x4785.jpeg 848w, https://substackcdn.com/image/fetch/$s_!d1xc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70278192-c3c0-490f-8188-f602eb49d56b_7169x4785.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!d1xc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70278192-c3c0-490f-8188-f602eb49d56b_7169x4785.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As I mentioned in my very first post on this blog, I am not dogmatic about any particular machine learning algorithm in my own work, and even less so when it comes to what others should use. If you have been using a certain algo most of your professional life, and it works for you, hey, more power to you. I am eminently practical and pragmatic when it comes to data science and ML. I am also insatiably curious and love to tinker, so when I find out about another new technique, ML library, or approach, I love to take it for a spin. I love ML modeling, I find it intrinsically fun, and do it for its own sake. Nonetheless, if I have to do something professionally, and want to get it done in a way that is both foundationally sound and practically robust, the following are my first-order recommendations:</p><ol><li><p>Up to a few hundred datapoints, use Stats. Yes, yes, I know, <strong>all</strong> of machine learning is just a subset of Statistics. 
But practically and methodologically, the way that statistics is used in science and research is very different from the way that we approach Machine Learning in Data Science. I am talking here about calculating correlations between variables, t-tests, and all of that. Descriptive Statistics alone can get you pretty far. Coming up with predictive rules of thumb based on those will oftentimes be sufficient, especially if, in addition to a paucity of data points, you are dealing with just a handful of variables. </p></li><li><p>For a few hundred to a few thousand datapoints, use linear/logistic regression. Linear algorithms are simple to implement and even simpler to interpret. Furthermore, they tend not to overfit, so if you need to come up with an algorithm that will be robust on future data, they will probably do just fine. </p></li><li><p>Between a few thousand and about 10,000, it's anyone's guess. Gradient boosters generally do well here, with other "classical" algorithms (SVMs, for instance) sometimes shining. In my experience, datasets of this size are the ones where nonlinear effects start to really matter, and you <strong>potentially</strong> have enough data that you will not shoot yourself in the foot by overfitting. Nonetheless, not all data is made the same, especially in the tabular data world, and you need to exercise lots of intuition and judgement when deciding which technique or algo to use. </p></li><li><p>Many thousands to about a billion datapoints is where gradient boosted trees rule. If you need just one algorithm, go with this. You'll rarely go wrong. This is the domain where nonlinearities in your data start to become significant, and decision trees are a great algorithm for squeezing them out. </p></li><li><p>If you have several billion datapoints, or many times that amount, check out neural nets. They have the capacity to absorb those kinds of datasets easily. 
As mentioned in one of my earlier posts, some research on the application of neural networks to tabular data has shown that the reason they struggle with this kind of data is that the decision boundaries between different regions in the tabular data space are really, really jagged. However, when you have <strong>lots</strong> of data, the region boundaries start looking increasingly smooth, which gives neural networks lots of material to work with. </p></li></ol>]]></content:encoded></item><item><title><![CDATA[Book Review - Machine Learning for Tabular Data]]></title><description><![CDATA[XGBoost, Deep Learning, and AI]]></description><link>https://www.xgblog.ai/p/book-review-machine-learning-for</link><guid isPermaLink="false">https://www.xgblog.ai/p/book-review-machine-learning-for</guid><dc:creator><![CDATA[Bojan Tunguz]]></dc:creator><pubDate>Thu, 06 Feb 2025 15:18:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ubyF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F109c58d2-3b25-4dda-8db9-4745487449db_788x958.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://www.manning.com/books/machine-learning-for-tabular-data" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ubyF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F109c58d2-3b25-4dda-8db9-4745487449db_788x958.png 424w, https://substackcdn.com/image/fetch/$s_!ubyF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F109c58d2-3b25-4dda-8db9-4745487449db_788x958.png 848w, 
https://substackcdn.com/image/fetch/$s_!ubyF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F109c58d2-3b25-4dda-8db9-4745487449db_788x958.png 1272w, https://substackcdn.com/image/fetch/$s_!ubyF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F109c58d2-3b25-4dda-8db9-4745487449db_788x958.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ubyF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F109c58d2-3b25-4dda-8db9-4745487449db_788x958.png" width="788" height="958" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/109c58d2-3b25-4dda-8db9-4745487449db_788x958.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:958,&quot;width&quot;:788,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:448294,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://www.manning.com/books/machine-learning-for-tabular-data&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ubyF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F109c58d2-3b25-4dda-8db9-4745487449db_788x958.png 424w, https://substackcdn.com/image/fetch/$s_!ubyF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F109c58d2-3b25-4dda-8db9-4745487449db_788x958.png 848w, 
https://substackcdn.com/image/fetch/$s_!ubyF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F109c58d2-3b25-4dda-8db9-4745487449db_788x958.png 1272w, https://substackcdn.com/image/fetch/$s_!ubyF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F109c58d2-3b25-4dda-8db9-4745487449db_788x958.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><a href="https://www.manning.com/books/machine-learning-for-tabular-data">Machine Learning for Tabular Data: A Comprehensive Guide 
for Practitioners</a> by Mark Ryan and Luca Massaron is a breath of fresh air for anyone who feels inundated by the hype around deep learning. Luca is someone I&#8217;ve known for a really long time, and I&#8217;ve always admired his Kaggle contributions and all the work he has done there. I&#8217;ve also been fortunate enough to collaborate with him on a few projects, and it was a pleasure to write a short quote that appears in this book. I&#8217;ve had access to early versions of this book for a few months, and I&#8217;d like to offer my honest take on its content, and why it&#8217;s so valuable to practicing data scientists who deal with tabular data in their daily workflow. </p><p>While neural networks and unstructured data often steal the show in academic research, big tech investments, and media coverage, the reality for many of us working in the field is that structured (tabular) data remains the backbone of real-world business applications. That&#8217;s exactly where this book shines.</p><p>Right from the start, the authors ground the reader with a clear definition of tabular data and explain why it&#8217;s so critical across industries like finance, healthcare, and e-commerce. Many books give tabular data only a passing mention before moving on to sexier topics, but here, it&#8217;s the main event. The discussion kicks off with an overview of common challenges - things like handling missing data, transforming categorical features, and dealing with imbalanced datasets - and sets the stage for a deeper exploration of the techniques that matter most in practice.</p><p>One of the strongest aspects of this guide is its emphasis on data preparation, feature selection, and feature engineering. Far too often, we get hung up on which model to use - XGBoost, LightGBM, random forests, or a neural network - without giving feature engineering the attention it deserves. 
This book provides a thorough rundown of how to properly encode categorical variables, impute missing values, find leaky features, and create new features that capture hidden relationships. Anyone who has spent time wrangling messy data will appreciate the real-world wisdom in these chapters.</p><p>Despite the comprehensive scope, the text strikes a perfect balance between theory and application. You don&#8217;t need a PhD in statistics to keep up, but there&#8217;s enough depth to satisfy those who like to see some math behind the methods. Core models like gradient boosting, bagging, and ensembling are unpacked in a way that&#8217;s accessible, yet doesn&#8217;t shy away from the details that matter. The authors also dedicate space to deep learning for tabular data, carefully examining scenarios where it can be beneficial - and where it might fall short compared to more traditional approaches.</p><p>One section I found especially helpful is the discussion around when (and when not) to use deep learning. Since tabular datasets often aren&#8217;t as large as image or text corpora, and because feature engineering is so important, neural networks can sometimes be overkill. The book does a great job of explaining that while deep learning can excel with complex, high-dimensional data (and certain specialized use cases), it&#8217;s not always the magic bullet it&#8217;s cracked up to be.</p><p>Who will benefit most from this book? If you&#8217;re a beginner to intermediate data scientist, you&#8217;ll get a rock-solid foundation in everything from data preprocessing to model evaluation. Business analysts and machine learning engineers will also appreciate the hands-on examples using real datasets - there&#8217;s nothing abstract about this approach. 
And if you&#8217;re a researcher or academic, you&#8217;ll find a well-reasoned comparison of traditional ML methods versus deep learning strategies.</p><p>Overall, Machine Learning for Tabular Data is a fantastic resource that fills a gap in the literature. It doesn&#8217;t fall into the trap of downplaying classic methods just because deep learning is in vogue. Instead, it offers a practical, well-rounded, and refreshingly clear roadmap for tackling the kinds of problems most data practitioners deal with every day. If tabular data is your bread and butter, this book is worth adding to your reference shelf.</p>]]></content:encoded></item><item><title><![CDATA[XGBoost is All You Need, Part 4]]></title><description><![CDATA[A Brief Intro to XGBoost]]></description><link>https://www.xgblog.ai/p/xgboost-is-all-you-need-part-4</link><guid isPermaLink="false">https://www.xgblog.ai/p/xgboost-is-all-you-need-part-4</guid><dc:creator><![CDATA[Bojan Tunguz]]></dc:creator><pubDate>Mon, 03 Feb 2025 13:43:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!PbTR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0934a9-f918-475c-a3af-2c960f5e0ceb_688x368.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PbTR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0934a9-f918-475c-a3af-2c960f5e0ceb_688x368.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PbTR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0934a9-f918-475c-a3af-2c960f5e0ceb_688x368.png 424w, 
https://substackcdn.com/image/fetch/$s_!PbTR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0934a9-f918-475c-a3af-2c960f5e0ceb_688x368.png 848w, https://substackcdn.com/image/fetch/$s_!PbTR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0934a9-f918-475c-a3af-2c960f5e0ceb_688x368.png 1272w, https://substackcdn.com/image/fetch/$s_!PbTR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0934a9-f918-475c-a3af-2c960f5e0ceb_688x368.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PbTR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0934a9-f918-475c-a3af-2c960f5e0ceb_688x368.png" width="688" height="368" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd0934a9-f918-475c-a3af-2c960f5e0ceb_688x368.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:368,&quot;width&quot;:688,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58787,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PbTR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0934a9-f918-475c-a3af-2c960f5e0ceb_688x368.png 424w, 
https://substackcdn.com/image/fetch/$s_!PbTR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0934a9-f918-475c-a3af-2c960f5e0ceb_688x368.png 848w, https://substackcdn.com/image/fetch/$s_!PbTR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0934a9-f918-475c-a3af-2c960f5e0ceb_688x368.png 1272w, https://substackcdn.com/image/fetch/$s_!PbTR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0934a9-f918-475c-a3af-2c960f5e0ceb_688x368.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>After three long &#8220;introductory&#8221; posts, we <strong>finally</strong> get to talk about XGBoost, and what&#8217;s so special about it! The previous posts can be found here: Part 1, <a href="https://www.xgblog.ai/p/xgboost-is-all-you-need-0b4">Part 2</a>, <a href="https://www.xgblog.ai/p/xgboost-is-all-you-need-part-3-gradient">Part 3</a>.</p><p>So what exactly is XGBoost, who invented it, how does it work, and why is it so popular and enduring? Let&#8217;s dig in.</p><h1>A Brief Intro to XGBoost.</h1><p>XGBoost is both an algorithm and a software library used in machine learning. As an algorithm, it refers to a method known as regularizing gradient boosting trees, which helps improve prediction accuracy by reducing overfitting. As a library, it provides an efficient, scalable implementation of this algorithm that many practitioners rely on today.</p><p>The library was first released on March 27, 2014&#8212;over 10 years ago. It started as a research project by Tianqi Chen within the Distributed (Deep) Machine Learning Community. 
His work focused on building a fast and reliable implementation of gradient boosting, and the project quickly evolved into a widely adopted tool in the field.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!coXK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c899653-ca15-434d-8577-5fb5db1f821d_576x649.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!coXK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c899653-ca15-434d-8577-5fb5db1f821d_576x649.png 424w, https://substackcdn.com/image/fetch/$s_!coXK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c899653-ca15-434d-8577-5fb5db1f821d_576x649.png 848w, https://substackcdn.com/image/fetch/$s_!coXK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c899653-ca15-434d-8577-5fb5db1f821d_576x649.png 1272w, https://substackcdn.com/image/fetch/$s_!coXK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c899653-ca15-434d-8577-5fb5db1f821d_576x649.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!coXK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c899653-ca15-434d-8577-5fb5db1f821d_576x649.png" width="576" height="649" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7c899653-ca15-434d-8577-5fb5db1f821d_576x649.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:649,&quot;width&quot;:576,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:639927,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!coXK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c899653-ca15-434d-8577-5fb5db1f821d_576x649.png 424w, https://substackcdn.com/image/fetch/$s_!coXK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c899653-ca15-434d-8577-5fb5db1f821d_576x649.png 848w, https://substackcdn.com/image/fetch/$s_!coXK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c899653-ca15-434d-8577-5fb5db1f821d_576x649.png 1272w, https://substackcdn.com/image/fetch/$s_!coXK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c899653-ca15-434d-8577-5fb5db1f821d_576x649.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>One of the key moments in XGBoost&#8217;s rise to popularity came when it was used as the winning solution in the Higgs Boson Machine Learning Kaggle Challenge. This achievement showcased its performance and efficiency, and it drew significant attention from the machine learning community. Since then, XGBoost has become the most installed and used gradient boosting trees library. Its core is written in C++, which contributes to its speed and efficiency, and it is available through packages in many popular programming languages. 
As a result, its install base is on par with some of the most popular deep learning libraries, making it a go-to tool for both researchers and practitioners.</p><h1>What are decision trees?</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!V-7p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fcfd018-2270-4822-b754-beb007a703c0_863x847.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!V-7p!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fcfd018-2270-4822-b754-beb007a703c0_863x847.png 424w, https://substackcdn.com/image/fetch/$s_!V-7p!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fcfd018-2270-4822-b754-beb007a703c0_863x847.png 848w, https://substackcdn.com/image/fetch/$s_!V-7p!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fcfd018-2270-4822-b754-beb007a703c0_863x847.png 1272w, https://substackcdn.com/image/fetch/$s_!V-7p!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fcfd018-2270-4822-b754-beb007a703c0_863x847.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!V-7p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fcfd018-2270-4822-b754-beb007a703c0_863x847.png" width="863" height="847" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2fcfd018-2270-4822-b754-beb007a703c0_863x847.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:847,&quot;width&quot;:863,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:905955,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!V-7p!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fcfd018-2270-4822-b754-beb007a703c0_863x847.png 424w, https://substackcdn.com/image/fetch/$s_!V-7p!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fcfd018-2270-4822-b754-beb007a703c0_863x847.png 848w, https://substackcdn.com/image/fetch/$s_!V-7p!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fcfd018-2270-4822-b754-beb007a703c0_863x847.png 1272w, https://substackcdn.com/image/fetch/$s_!V-7p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fcfd018-2270-4822-b754-beb007a703c0_863x847.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Decision trees are a type of supervised machine learning algorithm that can be used for both classification and regression tasks. They model decisions and their possible outcomes as a tree-like structure, making them easy to understand and interpret.</p><p>At the core of a decision tree is a series of questions about the data. Each internal node of the tree asks a question about one of the attributes. For example, a node might check if a customer is older than 50 years. Each branch that emerges from a node represents one possible answer to that question. The process continues, with subsequent nodes asking further questions based on the answers given. Eventually, the tree reaches the leaf nodes. In classification tasks, each leaf node assigns a class label, and in regression tasks, it provides a continuous value.</p><p>The path from the root of the tree to any leaf node represents a series of decisions or rules that lead to a final prediction. 
This clear path of decisions makes the model&#8217;s reasoning easy to follow and understand.</p><p>To build a decision tree, the algorithm starts with the entire dataset and recursively splits it into smaller subsets. At each step, it selects the feature that best separates the data according to a certain criterion. Common criteria include the highest information gain or the lowest impurity. Impurity measures such as Gini impurity or entropy help in evaluating how well a split separates the classes or predicts a continuous outcome. The goal is to create the most homogeneous subsets possible, where the data points within each subset share similar characteristics regarding the target variable.</p><p>Decision trees are popular because of their simplicity, ease of interpretation, and ability to handle both numerical and categorical data. Their structure allows users to see exactly how decisions are made, making them a practical tool in many real-world applications.</p><h1>What is boosting?</h1><p>Boosting is an ensemble technique that creates a strong classifier by combining several weak classifiers. The process starts with a basic model built from the training data. This initial model may not predict all instances correctly, but it provides a starting point.</p><p>Once the first model is in place, a second model is created specifically to address the errors made by the first. This second model focuses on the data points where the initial model struggled, aiming to correct those mistakes. The idea is that by targeting the errors, the overall performance of the combined model improves.</p><p>This process of adding new models continues. Each subsequent model is trained to fix the errors of the model or combination of models that came before it. With every new model added, the ensemble becomes better at handling difficult cases. 
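This sequential error-correcting loop is easiest to see in a toy implementation. The sketch below uses one-split "decision stumps" and AdaBoost-style example re-weighting; it is a pure-NumPy illustration of the general idea, not the algorithm of any particular library:

```python
import numpy as np

def train_stump(X, y, w):
    """Return the (error, feature, threshold, sign) stump with the
    smallest weighted error; it predicts `sign` where X[:, j] > t."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] > t, sign, -sign)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best

def boost(X, y, rounds=20):
    """y must be in {-1, +1}; weights start uniform and grow on mistakes."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(rounds):
        err, j, t, sign = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # this stump's vote strength
        pred = np.where(X[:, j] > t, sign, -sign)
        w *= np.exp(-alpha * y * pred)          # up-weight misclassified points
        w /= w.sum()
        ensemble.append((alpha, j, t, sign))
    return ensemble

def predict(ensemble, X):
    # Final prediction is the sign of the weighted vote of all stumps.
    score = sum(a * np.where(X[:, j] > t, s, -s) for a, j, t, s in ensemble)
    return np.where(score >= 0, 1, -1)
```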
The process stops when either the training set is predicted perfectly or when a predetermined maximum number of models has been added.</p><p>Key points of boosting include:<br><br><strong>1. Sequential Training:</strong></p><p>In boosting, models are trained one after the other. Each new model is built to correct the errors of the ones that came before. Instead of training all models at once, boosting trains them in a sequence. The idea is that by addressing the mistakes made by previous models, the overall performance improves gradually.</p><p><strong>2. Focus on Misclassifications:</strong></p><p>A core concept in boosting is that not all training examples are treated equally. When a model makes a mistake on a particular data point, the boosting algorithm increases the weight or importance of that example. This means that the next model in the sequence will pay more attention to the examples that were previously misclassified. By focusing on these hard-to-classify cases, boosting helps to reduce errors in the final combined model.</p><p><strong>3. Weak Learners:</strong></p><p>The models used in boosting are often called weak learners. A weak learner is a model that performs only slightly better than random guessing. Even though each weak learner is not very accurate on its own, boosting combines many weak learners to form a strong overall model. The strength of boosting comes from the collective effort of many weak models working together, each improving on the errors of its predecessors.</p><p><strong>4. Reduction in Bias and Variance:</strong></p><p>Boosting helps in reducing both bias and variance in a model.</p><p>&#8226; <em>Bias</em> is the error that arises from overly simplistic assumptions in the learning algorithm. 
By sequentially adding models that correct previous mistakes, boosting can capture more of the underlying patterns in the data, thus reducing bias.</p><p>&#8226; <em>Variance</em> refers to the error caused by fluctuations in the training data. Since boosting combines several models, the final prediction becomes more stable and less sensitive to the noise in any single model, thereby reducing variance.</p><p></p><h1>What is Gradient Boosting?</h1><p>Gradient boosting is a technique that builds on the idea of boosting. What sets gradient boosting apart from regular boosting is its focus on minimizing an arbitrary differentiable loss function using gradient descent.</p><p>In regular boosting, the models are combined in a way that generally improves performance by focusing on the mistakes made by previous models. Gradient boosting takes this a step further by explicitly optimizing a loss function that measures the error between the model&#8217;s predictions and the actual values. This loss function can be any differentiable function, which means the method is very flexible and can be adapted to different kinds of prediction problems.</p><p>The process is carried out in a stage-wise manner. At each stage, a new model is added to the ensemble with the goal of reducing the overall error. To determine how to improve the predictions, gradient boosting uses the gradient descent algorithm. In simple terms, at each step, the algorithm calculates the gradient (or the slope) of the loss function with respect to the current model&#8217;s predictions. This gradient tells us in which direction and how much we should adjust the predictions to reduce the error.</p><p>More specifically, the new model built at each step is trained to predict the negative gradient of the loss function from the previous model. The negative gradient acts as a corrective signal - it points out the direction in which the current model is making errors and needs improvement. 
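For squared-error loss this is especially concrete: the negative gradient is just the residual y minus the current prediction, so each stage simply fits the residuals left by the stages before it. A pure-NumPy sketch with single-split regression stumps (synthetic data, illustration only, not any library's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

def fit_stump(x, r):
    """Single split minimizing squared error against the residuals r."""
    best = None
    for t in np.quantile(x, np.linspace(0.05, 0.95, 19)):
        left, right = r[x <= t].mean(), r[x > t].mean()
        sse = ((r - np.where(x <= t, left, right)) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left, right)
    return best[1:]

pred = np.zeros_like(y)
learning_rate = 0.1
for _ in range(100):
    residual = y - pred  # negative gradient of 0.5 * (y - pred)^2
    t, left, right = fit_stump(x, residual)
    pred += learning_rate * np.where(x <= t, left, right)

mse = ((y - pred) ** 2).mean()
```

After a hundred such stages the ensemble's training error is far below that of predicting the mean, even though every individual stump is a very weak model.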
By adding a new model that approximates this negative gradient, the overall prediction is nudged in the right direction, thereby reducing the loss.</p><p>This combination of stage-wise modeling and the use of gradient descent for optimization is what makes gradient boosting distinct from regular boosting methods. It ensures that each new model directly contributes to minimizing the specific loss function, leading to a highly accurate and efficient predictive model.</p><h1>What is eXtreme Gradient Boosting?</h1><p>Extreme gradient boosting is an enhanced version of the traditional gradient boosting algorithm. It builds on the basic idea of gradient boosting - combining multiple weak models, typically decision trees, to form a strong model - but introduces several key improvements that set it apart.</p><p>eXtreme Gradient Boosting is designed with performance in mind. It implements a range of algorithmic and system optimizations that allow it to run faster and use less memory than regular gradient boosting methods. For example, it can efficiently build decision trees using advanced techniques such as approximate tree learning, which speeds up the training process without a significant loss in accuracy.</p><p>One of the main advantages of eXtreme Gradient Boosting is its ability to quickly process large datasets. It leverages both parallel and distributed processing, meaning that it can use multiple cores on a single machine or distribute work across several machines. This makes it especially useful for large-scale problems where traditional gradient boosting methods might struggle with speed and computational efficiency.</p><p>Overfitting is a common challenge in machine learning, where a model learns the training data too well and performs poorly on unseen data. eXtreme Gradient Boosting addresses this by incorporating regularization techniques (both L1 and L2). 
These built-in mechanisms help control model complexity, ensuring that the model generalizes better to new data without the need for extensive manual tuning.</p><p>In many real-world datasets, missing values are inevitable. XGBoost has a robust way of dealing with missing data. Instead of requiring you to impute or remove missing values before training, it automatically learns the best direction to take in the decision tree when it encounters a missing value. This feature simplifies data preprocessing and can lead to better model performance.</p><p>Sparse data - data with a lot of zero or missing entries - is common in fields like text processing or recommender systems. XGBoost is designed to efficiently manage sparse datasets. It uses a sparsity-aware algorithm that takes advantage of the data structure, reducing both computation time and memory usage.</p><p>A significant difference between XGBoost and standard gradient boosting methods is its support for parallel and distributed computing. While traditional methods often build trees sequentially, XGBoost can build parts of the model in parallel. 
This not only speeds up the training process but also makes it scalable, allowing it to handle very large datasets by distributing the workload across multiple machines or processing units.</p>]]></content:encoded></item><item><title><![CDATA[XGBoost is All You Need, Part 3 - Gradient Boosted Trees]]></title><description><![CDATA[This is the third part in the series of blog posts about XGBoost, based on my 2024 GTC presentation. You can find Part 1 here, and Part 2 here.]]></description><link>https://www.xgblog.ai/p/xgboost-is-all-you-need-part-3-gradient</link><guid isPermaLink="false">https://www.xgblog.ai/p/xgboost-is-all-you-need-part-3-gradient</guid><dc:creator><![CDATA[Bojan Tunguz]]></dc:creator><pubDate>Thu, 30 Jan 2025 13:08:01 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ItXU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafcbcd18-7f54-49fc-a8ec-c93485ea23d5_1906x1075.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>This is the third part in the series of blog posts about XGBoost, based on my 2024 GTC presentation. You can find Part 1 <a href="https://www.xgblog.ai/p/xgboost-is-all-you-need">here</a>, and Part 2 <a href="https://www.xgblog.ai/p/xgboost-is-all-you-need-0b4">here</a>.</p><p>Today we want to talk about gradient boosted trees. Even though XGBoost has an option for purely linear boosting, it&#8217;s the non-linear version - based on the decision tree algorithms - that gives this library and algorithm its predictive power. </p><p>Decision trees are perhaps the simplest predictive algorithm to describe. Say you want to know if it&#8217;s raining outside. The only &#8220;feature&#8221; that you have is whether it&#8217;s cloudy or not. You build an algorithm that predicts it will not rain when there are no clouds in the sky, and it will rain if it&#8217;s overcast. 
Your prediction for the clear sky will be completely accurate, with the predictions for the overcast being fairly accurate. You can potentially add other features - such as the density of the cloud coverage, windy conditions, temperature, etc., and build an algorithm based on all of those features that gives you a simple yes or no answer. That&#8217;s essentially what a decision tree is. Decision trees are very easy to understand and implement for simple binary choice classifications, but with a bit of work and ingenuity they can also work for regression problems. </p><h1>The Benefits of Gradient Boosted Trees</h1><p>Gradient boosted trees offer several advantages when dealing with real-world datasets, especially those that exhibit irregular or complex structures. Unlike some other methods, tree-based models are well-suited for capturing patterns in tabular data without requiring sophisticated preprocessing or extensive feature engineering. This adaptability makes them a practical and accessible choice for many tasks where data may not follow a consistent format.</p><p>Another key benefit is the minimal data preprocessing that trees require. Models like linear regression or neural networks often need normalized inputs, one-hot encoding, or other transformations to perform well. In contrast, gradient boosted trees can typically process raw data with little or no adjustment. This not only saves time but also reduces the risk of introducing errors during data preparation. Furthermore, many popular gradient boosting frameworks can handle missing values automatically, either by assigning them to a separate branch during training or learning optimal splits for null values. As a result, you often do not need to impute or remove missing records before modeling.</p><p>Tree-based methods are also robust in the presence of outliers because they split the data at different threshold values rather than fitting a single global function. 
Large or anomalous data points do not disproportionately shift the model&#8217;s behavior, which helps maintain stable performance. In addition, tuning a gradient boosted tree model is relatively straightforward. You typically focus on hyperparameters such as the number of trees, tree depth, and learning rate. There is no need to design elaborate network architectures or layer configurations, as is required with many deep learning approaches.</p><p>An important byproduct of using gradient boosted trees is the insight they provide into feature importance. Because trees split data based on certain criteria, it is straightforward to see how often and how effectively particular features reduce model error. This information helps with both understanding the data and directing efforts toward feature selection and engineering. When processing large datasets, gradient boosting libraries can also take advantage of modern hardware, including GPUs, to train models faster. The parallel computations reduce training times significantly, making it possible to iterate quickly and refine results.</p><p>Finally, many gradient boosting libraries, such as XGBoost, LightGBM, and CatBoost, are easy to install, use, and maintain. They have extensive documentation, active user communities, and regular updates. This combination of simplicity, performance, and maintainability has made gradient boosted trees a popular and efficient choice in machine learning applications involving tabular data.</p><p>Comparing decision boundaries. 
Image taken from arXiv:2207.08815v1.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ItXU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafcbcd18-7f54-49fc-a8ec-c93485ea23d5_1906x1075.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ItXU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafcbcd18-7f54-49fc-a8ec-c93485ea23d5_1906x1075.png 424w, https://substackcdn.com/image/fetch/$s_!ItXU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafcbcd18-7f54-49fc-a8ec-c93485ea23d5_1906x1075.png 848w, https://substackcdn.com/image/fetch/$s_!ItXU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafcbcd18-7f54-49fc-a8ec-c93485ea23d5_1906x1075.png 1272w, https://substackcdn.com/image/fetch/$s_!ItXU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafcbcd18-7f54-49fc-a8ec-c93485ea23d5_1906x1075.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ItXU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafcbcd18-7f54-49fc-a8ec-c93485ea23d5_1906x1075.png" width="1456" height="821" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/afcbcd18-7f54-49fc-a8ec-c93485ea23d5_1906x1075.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:821,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1605891,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ItXU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafcbcd18-7f54-49fc-a8ec-c93485ea23d5_1906x1075.png 424w, https://substackcdn.com/image/fetch/$s_!ItXU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafcbcd18-7f54-49fc-a8ec-c93485ea23d5_1906x1075.png 848w, https://substackcdn.com/image/fetch/$s_!ItXU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafcbcd18-7f54-49fc-a8ec-c93485ea23d5_1906x1075.png 1272w, https://substackcdn.com/image/fetch/$s_!ItXU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafcbcd18-7f54-49fc-a8ec-c93485ea23d5_1906x1075.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>The Shortcomings of Gradient Boosted Trees</h1><p>Gradient boosted trees rely on piecewise constant approximations and are therefore not inherently sensitive to linearities in the data. When a clear linear relationship exists, these models may require additional feature engineering&#8212;such as explicit interaction terms or polynomial expansions&#8212;to capture that structure effectively.</p><p>Another limitation is their poor extrapolation ability. Gradient boosted trees learn from patterns observed within the training range, so predictions made on data that falls well outside this range can be unreliable. The models cannot extend learned trends in a way that accurately reflects behavior far beyond what was seen during training.</p><p>By design, boosting is a &#8220;shallow&#8221; method. Each iteration creates a small decision tree, which contributes weak predictions. 
The strength of the final model comes from combining many of these weak learners, but the approach remains limited to one layer of depth at a time. This makes capturing complex, hierarchical patterns more difficult compared to multi-layer methods.</p><p>Gradient boosted trees do not provide automatic feature engineering or selection mechanisms. They rely entirely on the given features and do not embed or transform them in the way some neural network architectures can. As a result, domain knowledge and thoughtful preprocessing are often critical for good performance.</p><p>Model sizes for gradient boosted trees can also be quite large. In some cases, thousands of individual trees must be stored to achieve the desired accuracy, which can lead to large files and increased memory requirements. This can pose challenges in environments where storage or memory is constrained.</p><p>Finally, deploying gradient boosted tree models in production can be tricky. Integrating these models with existing systems may require specialized libraries for efficient inference. Performance considerations such as prediction latency and scaling to high volumes of requests can further complicate production deployments.</p><h1>We Need More GBT Research</h1><p>Even though GBTs are extremely capable and successful, there has been relatively little research on them. One main reason for the slow pace of research is that the basic gradient boosting algorithm has remained largely unchanged for many years. The core idea&#8212;building an ensemble of trees where each new tree corrects errors made by the previous ones&#8212;has proven very effective. Because the algorithm works well, many researchers have focused on applying it to real-world problems rather than rethinking the core method. Despite its success, there is still room for innovation, especially as new challenges in data and computational resources emerge. 
Small improvements or adjustments in the algorithm might lead to even better performance or more efficient training processes.</p><p>Another factor is that most popular GBT libraries are implemented in languages such as C, C++, or CUDA. These languages are chosen for their speed and efficiency, which are critical for handling large datasets. However, they are not as accessible as Python, which has become the language of choice for many researchers due to its ease of use and readability. When the primary implementations of GBTs are in lower-level languages, it can create a barrier for researchers who want to experiment with or extend the algorithms. This limits the community of researchers who can easily contribute new ideas and small improvements.</p><p>The third issue is that many GBT algorithms are not designed in a modular way. A modular codebase allows individual components of an algorithm to be modified or replaced without having to rewrite the entire system. In contrast, the typical GBT implementations are often monolithic. This means that making even small changes to the algorithm can require significant effort. A more modular design would enable researchers to experiment with different components, such as alternative loss functions, tree-building strategies, or regularization methods, without having to dive into a large and complex codebase. This could lead to more iterative improvements and faster innovation.</p><p>These challenges&#8212;limited evolution of the core algorithm, implementation barriers due to low-level languages, and a lack of modularity&#8212;contribute to why there has been relatively little recent research on gradient boosted trees. Addressing these issues could open the door to incremental improvements that build on the strong foundation of the current algorithms. 
More accessible and modular implementations would allow a wider range of researchers to experiment, share ideas, and develop new techniques that could make gradient boosted trees even more effective. In the long run, this kind of research can help push the boundaries of what these algorithms can achieve in various machine learning tasks.</p>]]></content:encoded></item><item><title><![CDATA[XGBoost is All You Need]]></title><description><![CDATA[Part 2 - What is Tabular Data?]]></description><link>https://www.xgblog.ai/p/xgboost-is-all-you-need-0b4</link><guid isPermaLink="false">https://www.xgblog.ai/p/xgboost-is-all-you-need-0b4</guid><dc:creator><![CDATA[Bojan Tunguz]]></dc:creator><pubDate>Mon, 27 Jan 2025 16:49:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Q4Qq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf31a119-1c25-42e1-880c-2f7cc5676608_2454x1242.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q4Qq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf31a119-1c25-42e1-880c-2f7cc5676608_2454x1242.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q4Qq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf31a119-1c25-42e1-880c-2f7cc5676608_2454x1242.png 424w, https://substackcdn.com/image/fetch/$s_!Q4Qq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf31a119-1c25-42e1-880c-2f7cc5676608_2454x1242.png 848w, 
https://substackcdn.com/image/fetch/$s_!Q4Qq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf31a119-1c25-42e1-880c-2f7cc5676608_2454x1242.png 1272w, https://substackcdn.com/image/fetch/$s_!Q4Qq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf31a119-1c25-42e1-880c-2f7cc5676608_2454x1242.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q4Qq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf31a119-1c25-42e1-880c-2f7cc5676608_2454x1242.png" width="1456" height="737" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cf31a119-1c25-42e1-880c-2f7cc5676608_2454x1242.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:737,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:331009,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Q4Qq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf31a119-1c25-42e1-880c-2f7cc5676608_2454x1242.png 424w, https://substackcdn.com/image/fetch/$s_!Q4Qq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf31a119-1c25-42e1-880c-2f7cc5676608_2454x1242.png 848w, 
https://substackcdn.com/image/fetch/$s_!Q4Qq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf31a119-1c25-42e1-880c-2f7cc5676608_2454x1242.png 1272w, https://substackcdn.com/image/fetch/$s_!Q4Qq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf31a119-1c25-42e1-880c-2f7cc5676608_2454x1242.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>A few days ago I published the <a href="https://www.xgblog.ai/p/xgboost-is-all-you-need">first part</a> in the series 
of posts in which I aim to explain what I mean by &#8220;XGBoost is All You Need&#8221;. In the first post I just tried to give a general overview, explain my own background and how my point of view has been forged, and set the record straight that yes, I am very serious about my appreciation for XGBoost, but no, I am not nearly as dogmatic as you may expect. </p><p>Today I&#8217;d like to take a deeper dive into tabular data. This is the kind of data that XGBoost is primarily designed to handle. It also happens to be the most common form of data used by Data Science practitioners, analysts, and pretty much anyone who is dealing with any kind of business data. If you&#8217;ve ever used Excel, or even created an itemized shopping list, then you have used tabular data. So let&#8217;s dig in to see what this data is all about. </p><h1>What is Tabular Data?</h1><p>In the simplest terms, historically, tabular data is just transactional data. It is the kind of data that we use to record some kind of transaction - a sale, an inventory of goods, or any other kind of transfer of goods or services. It is very likely that transactional data has existed since before the dawn of writing, and has been the main driving force for the <a href="https://www.bbc.com/news/business-39870485">development of the written language</a>. Tabular data has also been a major driving force behind the development of modern digital computing. The B in IBM, for instance, stands for Business, and the information that businesses really needed to know and understand was stored in tabular datasets. 
In the latter part of the 20th century this also led to the growth of the business database industry, and has yielded us tech giants such as Oracle.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VW10!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f903b80-26dc-4bb7-8230-6cc29fa4bb00_2590x3244.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VW10!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f903b80-26dc-4bb7-8230-6cc29fa4bb00_2590x3244.png 424w, https://substackcdn.com/image/fetch/$s_!VW10!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f903b80-26dc-4bb7-8230-6cc29fa4bb00_2590x3244.png 848w, https://substackcdn.com/image/fetch/$s_!VW10!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f903b80-26dc-4bb7-8230-6cc29fa4bb00_2590x3244.png 1272w, https://substackcdn.com/image/fetch/$s_!VW10!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f903b80-26dc-4bb7-8230-6cc29fa4bb00_2590x3244.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VW10!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f903b80-26dc-4bb7-8230-6cc29fa4bb00_2590x3244.png" width="1456" height="1824" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9f903b80-26dc-4bb7-8230-6cc29fa4bb00_2590x3244.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1824,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:9915100,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VW10!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f903b80-26dc-4bb7-8230-6cc29fa4bb00_2590x3244.png 424w, https://substackcdn.com/image/fetch/$s_!VW10!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f903b80-26dc-4bb7-8230-6cc29fa4bb00_2590x3244.png 848w, https://substackcdn.com/image/fetch/$s_!VW10!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f903b80-26dc-4bb7-8230-6cc29fa4bb00_2590x3244.png 1272w, https://substackcdn.com/image/fetch/$s_!VW10!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f903b80-26dc-4bb7-8230-6cc29fa4bb00_2590x3244.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1> Why do we care about tabular data?</h1><p>When it comes to real-world machine learning applications, tabular data often steals the spotlight. Although there&#8217;s no definitive measure, many estimates suggest that well over 50%&#8212;and in some circles even north of 90%&#8212;of data scientists rely on tables as their primary data structure. This isn&#8217;t surprising given the steady growth of relational databases in business and academia, along with the clear financial incentive: structured data powers everything from transactional systems to analytics dashboards, sometimes representing billions of dollars in transactions.</p><p>However, tabular data can be tricky. It&#8217;s arguably the most heterogeneous type of data out there&#8212;each column can represent a completely different domain, with varied distributions, scales, and data types. 
Such diversity can stymie many off-the-shelf machine learning models that assume cleaner or more uniform input.</p><p>From a purely intellectual standpoint, tabular data offers a playground of challenges that demand creative solutions. Each column can represent a distinct distribution, from numeric to categorical, and may include missing values, outliers, or complex interactions. Traditional machine learning models often struggle to adapt to this variability, prompting researchers to devise advanced techniques and algorithms that can handle mixed data types and subtle patterns. This is precisely where methods like XGBoost shine, as they effectively accommodate the wide-ranging nature of tabular data while remaining fast and accurate in practice.</p><h1>Yeah, but what is <strong>really</strong> tabular data???</h1><p>One main misconception about tabular data is that it&#8217;s just the format in which the data is stored, with nothing intrinsically different about it compared to other kinds of data. However, tabular data is qualitatively and structurally different from text, sound, image, and video data. </p><p>One of the main ways tabular data differs from non-tabular data is the lack of intrinsic local structure. In audio, for instance, neighboring samples in time matter; in images, neighboring pixels in space matter. These forms of data have a &#8220;locality&#8221; or &#8220;ordering&#8221; that influences how we process and interpret them. With tabular data, by contrast, you can shuffle the columns any way you want, and it typically won&#8217;t affect the basic meaning of the data. In many cases, the same is true when you shuffle the rows.</p><p>Time-series data sits somewhere in between purely tabular and non-tabular formats. Time-series data certainly has an inherent order&#8212;points in time&#8212;and this sequential aspect can be crucial for analysis. 
However, in practice, many analysts will transform time-series data into a tabular format through feature engineering (e.g., computing moving averages, lags, or other aggregate features). Once that transformation happens, the result often behaves more like traditional tabular data, allowing analysts to apply many of the same techniques they would use on any typical row-and-column dataset.</p><h1>Practical considerations with tabular data</h1><p>Tabular data underpins many of the most critical functions in business. It&#8217;s the direct result of transactional events&#8212;everything from sales orders and customer registrations to shipping updates and returns. Because these datasets are generated in the course of real-world processes, acquiring them can be costly and time-consuming: you need an actual business running its operations just to gather meaningful numbers. Naturally, the bigger the business, the larger (and often richer) these tabular datasets become.</p><p>The scale and significance of such data also mean that these datasets are closely guarded. In many companies, they are considered crown jewels, offering insights that can shape strategy and give competitive advantages. This exclusivity poses challenges for those looking to build expertise. Tabular data modeling lies at the heart of Data Science as a discipline, yet access to large, high-quality datasets can be scarce, especially for students or researchers who don&#8217;t have direct industry ties.</p><p>Even for those with access, mastering tabular data techniques isn&#8217;t straightforward. It often involves hands-on practice with bespoke datasets, a clear understanding of the underlying business processes, and mentorship from experienced practitioners. Kaggle&#8217;s earlier competitions were a valuable resource in this regard, exposing participants to a variety of tabular problems. 
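The time-series-to-tabular transformation described above (lags, moving averages, and similar aggregates) can be sketched with pandas; the column names and values here are purely illustrative:

```python
import pandas as pd

# An illustrative daily sales series.
ts = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=8, freq="D"),
    "sales": [10, 12, 11, 15, 14, 18, 17, 20],
})

# Feature engineering turns the sequence into ordinary rows and columns:
ts["lag_1"] = ts["sales"].shift(1)                 # yesterday's value
ts["roll_mean_3"] = ts["sales"].rolling(3).mean()  # 3-day moving average

# Drop the warm-up rows; each remaining row is a self-contained feature vector.
tabular = ts.dropna()
```

Once in this form, the rows can be fed to any standard tabular model, gradient boosted trees included.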
But in industries like healthcare, finance, or insurance&#8212;which generate some of the richest tabular data&#8212;strict regulations and a conservative stance on data sharing can limit opportunities for open experimentation.</p><p>Another key factor is that, in many situations, simply acquiring more and better-quality data can outweigh the benefits of building more advanced models. The tight interweaving of data with business processes makes the data itself immensely valuable. Taken together, these considerations show that tabular data is a complex topic, with its nuances spanning everything from business strategy and regulations to machine learning and mentorship&#8212;a real challenge for those looking to become experts in this fundamental pillar of Data Science.</p><h1>Tabular data ML has resisted the DL revolution</h1><p>Neural networks have become a mainstay in fields like computer vision and natural language processing, but their effectiveness on tabular data is far less established. Despite some recent progress, using deep learning to model tabular data remains, to a large extent, an &#8220;unsolved&#8221; issue. Many researchers and practitioners still rely heavily on gradient-boosted decision trees for tasks like classification and regression on structured datasets. In fact, tools such as XGBoost, LightGBM, and CatBoost often deliver impressive results right out of the box, with minimal tuning required.</p><p>That&#8217;s not to say people haven&#8217;t tried to develop neural network frameworks that can challenge tree-based models on tabular data. Several new libraries and methodologies have emerged, promising robust performance and innovative architectures designed for structured data. However, for the most part, these neural network solutions either excel only on a few handpicked datasets or require extremely long training times to match the performance of faster and more straightforward tree-based methods. 
A critical factor at play is the inherent &#8220;locality inductive bias&#8221; in neural networks, which can limit their ability to capture patterns that aren&#8217;t naturally localized in a tabular structure.</p><p>Ultimately, while it&#8217;s exciting to see ongoing research efforts in this area, the reality for many real-world applications is that gradient-boosted decision trees remain the go-to option. They offer strong performance with fewer computational demands and often simpler tuning processes. Until neural networks overcome these limitations&#8212;particularly around training time and generalization beyond carefully selected datasets&#8212;the advice that &#8220;XGBoost is all you need&#8221; is likely to remain true for the majority of tabular problems.</p>]]></content:encoded></item><item><title><![CDATA[XGBoost is All You Need]]></title><description><![CDATA[What do I really mean by my (in)famous tagline, Part 1]]></description><link>https://www.xgblog.ai/p/xgboost-is-all-you-need</link><guid isPermaLink="false">https://www.xgblog.ai/p/xgboost-is-all-you-need</guid><dc:creator><![CDATA[Bojan Tunguz]]></dc:creator><pubDate>Mon, 20 Jan 2025 13:01:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!eN4O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03ed6c27-13c2-4389-8c3f-5f172965f9ac_938x316.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eN4O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03ed6c27-13c2-4389-8c3f-5f172965f9ac_938x316.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!eN4O!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03ed6c27-13c2-4389-8c3f-5f172965f9ac_938x316.heic 424w, https://substackcdn.com/image/fetch/$s_!eN4O!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03ed6c27-13c2-4389-8c3f-5f172965f9ac_938x316.heic 848w, https://substackcdn.com/image/fetch/$s_!eN4O!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03ed6c27-13c2-4389-8c3f-5f172965f9ac_938x316.heic 1272w, https://substackcdn.com/image/fetch/$s_!eN4O!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03ed6c27-13c2-4389-8c3f-5f172965f9ac_938x316.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eN4O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03ed6c27-13c2-4389-8c3f-5f172965f9ac_938x316.heic" width="938" height="316" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/03ed6c27-13c2-4389-8c3f-5f172965f9ac_938x316.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:316,&quot;width&quot;:938,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:13305,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!eN4O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03ed6c27-13c2-4389-8c3f-5f172965f9ac_938x316.heic 424w, https://substackcdn.com/image/fetch/$s_!eN4O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03ed6c27-13c2-4389-8c3f-5f172965f9ac_938x316.heic 848w, https://substackcdn.com/image/fetch/$s_!eN4O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03ed6c27-13c2-4389-8c3f-5f172965f9ac_938x316.heic 1272w, https://substackcdn.com/image/fetch/$s_!eN4O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03ed6c27-13c2-4389-8c3f-5f172965f9ac_938x316.heic 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>I&#8217;ve been meaning to write this post for a really, really long time. Yes, I&#8217;ve promised that this blog will not be <strong>just</strong> about XGBoost, or primarily about it. But both the blog name and my own approach to ML for tabular data have been heavily influenced by XGBoost, and I feel that I owe it to both myself and many others to explain what I had in mind, and why the above title is not nearly as crazy as you may think. </p><p>First things first: I am <strong>NOT</strong> the creator of XGBoost, nor have I ever directly worked on its development. XGBoost was developed by <a href="https://tqchen.com">Tianqi Chen</a>, and was first released in 2014. I will cover that history a bit more at some point in the future. You can get a quick intro to those early days in <a href="https://web.archive.org/web/20160807033610/http://homes.cs.washington.edu/~tqchen/2016/03/10/story-and-lessons-behind-the-evolution-of-xgboost.html">this article</a>.</p><p>This and the subsequent few posts will be loosely based on the presentation that I gave at <a href="https://www.nvidia.com/en-us/on-demand/session/gtc24-s62960/">GTC 2024</a>. I&#8217;ve been meaning to put it all into a blog post (or a series of blog posts), but the timing was never quite right. I am finally ready to do it. </p><p></p><h1>My own background</h1><p>In order to fully appreciate where <strong>I</strong> am coming from in my assessment of XGBoost (and Data Science in general), I believe it&#8217;s best to share a bit of my own background. </p><p>I am a Theoretical Physicist by training. 
I&#8217;ve obtained undergraduate and master&#8217;s degrees from Stanford, and a PhD from the University of Illinois at Urbana-Champaign. Physics has always been my first intellectual love, but Physics doesn't pay the bills. The job marketplace for someone with my particular set of skills is virtually nonexistent. So I had to switch careers. I stumbled upon Data Science and Machine Learning thanks to some cool MOOCs. I unsuccessfully tried to leverage the skills I learned in those courses for an opening in the professional DS/ML world. And then I discovered Kaggle. In my estimation, Kaggle to this day remains the best online learning <strong>and</strong> credentialing resource. I&#8217;ve used it to pick up many new DS, ML, and AI skills, and was fortunate enough to use those skills and credentials to transition into the tech industry. I&#8217;ve worked at several startups, and eventually found myself at NVIDIA.</p><p>All of my DS, ML, and AI skills are almost completely self-taught. Almost all of them have also been honed in very practically-minded environments. This is where my own biases come from. I am not that interested in some new flashy tool or algorithm that is marginally - if at all! - better than something that has stood the test of time and is a reliable workhorse for a data practitioner&#8217;s daily professional applications. </p><p></p><h1>XGBoost is all you need</h1><p>&#8220;XGBoost is all you need&#8221;&#8212;it&#8217;s a phrase that started off as a bit of a tongue-in-cheek remark but has since taken on a life of its own. To be clear, the point isn&#8217;t that XGBoost is the single solution to every machine learning problem you might encounter. Instead, it&#8217;s a nod to how robust, flexible, and widely adopted XGBoost has become in the landscape of Gradient Boosted Trees (GBTs). 
The phrase is also a nod to the famous transformers paper &#8220;Attention is All You Need.&#8221;</p><p>GBTs are known to excel on tabular data, making libraries like XGBoost, LightGBM, CatBoost, and even scikit-learn&#8217;s HistGradientBoosting top picks for many real-world machine learning tasks. In my own practice, I cycle through all of these, because each has its strengths and weaknesses. No single GBT library is universally better than the others; which one is best in a given situation can depend on factors like data size, hardware, and the complexity of the modeling task.</p><p>Beyond raw predictive performance and efficiency, the maturity and robustness of each library matter a great deal. XGBoost&#8217;s wide community support, thorough documentation, and active development make it a reliable choice. Practical considerations&#8212;such as ease of installation, compatibility with your system&#8217;s hardware, and straightforward maintenance&#8212;are also crucial. This extends to GPU support, which can provide huge speed-ups when training on large datasets, and XGBoost has been a front-runner in this arena.</p><p>Under the hood, NVIDIA plays a major role in maintaining and advancing XGBoost, with contributors like Rory Mitchell, Hyunsu Cho, and Jiaming Yuan (alongside many other dedicated developers) working to improve scalability, efficiency, and GPU integration. Their efforts ensure that XGBoost remains a powerful, flexible, and future-proof option for anyone working with tabular data and gradient-boosted models. NVIDIA&#8217;s support for XGBoost was one of the main factors behind my (sometimes heavy-handed) promotion of this library. It was my support of the &#8220;home team&#8221;, as well as my own ability to easily reach the maintainers, that made it particularly appealing to me. Being able to log into Slack and chat with them whenever I had any serious issues or questions was priceless. 
</p><p></p><h1>What I don&#8217;t mean by my tagline</h1><p>When I say that XGBoost is &#8220;all you need,&#8221; I&#8217;m not suggesting that it is universally superior to every other machine learning algorithm. There are plenty of cases where linear models, random forests, or even specialized methods like support vector machines can be a better fit. It all comes down to understanding your data, the problem constraints, and the performance metrics that matter most for your application. XGBoost excels in many scenarios, but it&#8217;s not an automatic trump card in every contest.</p><p>I&#8217;m also not implying that XGBoost outperforms every other gradient boosting framework or library in existence. Different GBT implementations can shine in specific hardware or data conditions. LightGBM, CatBoost, and others each come with their own advantages, such as handling categorical variables differently or offering certain performance enhancements. The &#8220;all you need&#8221; phrase shouldn&#8217;t be interpreted as a universal decree that invalidates these alternatives.</p><p>It&#8217;s equally important to note that neural networks still have their place and are the go-to method for tasks involving text, images, or highly unstructured data. Just because XGBoost is a powerful general-purpose tool doesn&#8217;t mean deep learning methods are obsolete, even for tabular data. Likewise, good old-fashioned feature engineering is far from dead. While tree-based methods can reduce the need for manual feature construction compared to linear models, taking time to craft meaningful features can still provide significant performance gains.</p><p>Finally, the phrase &#8220;all you need&#8221; certainly doesn&#8217;t dismiss advanced techniques like ensembling. Sometimes stacking XGBoost with other models&#8212;or even multiple XGBoost configurations&#8212;can yield stronger results. 
And while XGBoost is powerful, you typically wouldn&#8217;t frame it as a standalone engine for building a complete ML system. It&#8217;s a high-performing machine learning library, not a plug-and-play, all-encompassing ML solution.</p><h1>Summary</h1><p>In this post I&#8217;ve tried to provide some context and background for why I am such an unapologetic XGBoost enjoyer. I&#8217;ve also tried to make the case that I am not delusional, and that I am, in fact, appreciative of many other ML tools and algorithms for tabular data, and use them on a regular basis in my own work. In the next few posts I&#8217;ll try to elaborate on some of the above points, and provide you with some additional details and context. </p><p></p><p></p>]]></content:encoded></item><item><title><![CDATA[eXtreme Gradient Blogging]]></title><description><![CDATA[Blogging in the era of eXtreme technological change]]></description><link>https://www.xgblog.ai/p/extreme-gradient-blogging</link><guid isPermaLink="false">https://www.xgblog.ai/p/extreme-gradient-blogging</guid><dc:creator><![CDATA[Bojan Tunguz]]></dc:creator><pubDate>Mon, 13 Jan 2025 12:13:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Fj9Y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff883c3ae-af54-4279-aa26-d8ebc766bef2_1268x845.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Fj9Y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff883c3ae-af54-4279-aa26-d8ebc766bef2_1268x845.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!Fj9Y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff883c3ae-af54-4279-aa26-d8ebc766bef2_1268x845.heic 424w, https://substackcdn.com/image/fetch/$s_!Fj9Y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff883c3ae-af54-4279-aa26-d8ebc766bef2_1268x845.heic 848w, https://substackcdn.com/image/fetch/$s_!Fj9Y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff883c3ae-af54-4279-aa26-d8ebc766bef2_1268x845.heic 1272w, https://substackcdn.com/image/fetch/$s_!Fj9Y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff883c3ae-af54-4279-aa26-d8ebc766bef2_1268x845.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Fj9Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff883c3ae-af54-4279-aa26-d8ebc766bef2_1268x845.heic" width="1268" height="845" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f883c3ae-af54-4279-aa26-d8ebc766bef2_1268x845.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:845,&quot;width&quot;:1268,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:39617,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!Fj9Y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff883c3ae-af54-4279-aa26-d8ebc766bef2_1268x845.heic 424w, https://substackcdn.com/image/fetch/$s_!Fj9Y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff883c3ae-af54-4279-aa26-d8ebc766bef2_1268x845.heic 848w, https://substackcdn.com/image/fetch/$s_!Fj9Y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff883c3ae-af54-4279-aa26-d8ebc766bef2_1268x845.heic 1272w, https://substackcdn.com/image/fetch/$s_!Fj9Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff883c3ae-af54-4279-aa26-d8ebc766bef2_1268x845.heic 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p></p><p>I&#8217;ve been meaning to start this blog for a really long time. Before you ask, and I know this is the first thing everyone is thinking, no, this is <strong>NOT</strong> a blog about XGBoost. Oh, XGBoost will be featured here, and probably very prominently, but it will not be the sole focus of this blog. </p><p>So what is this blog about?</p><p>In simple terms, this is a technical blog about the more advanced aspects of Data Science and Machine Learning, with a primary focus on tabular data. Most of the topics covered here would be considered &#8220;trad&#8221; ML. There will also be a <strong>very</strong> heavy emphasis on <strong>applied</strong> topics and research, highly relevant to practitioners in the field. </p><p>Over the past 10-15 years, neural networks and deep learning have become the backbone of all of the most exciting advancements in Machine Learning: from &#8220;simple&#8221; image recognition, text classification, and voice recognition to modern Generative AI and LLMs, these algorithms have gone from strength to strength. They richly deserve all the recognition and attention that has been heaped upon them. Unfortunately, this overwhelming concentration of attention, effort, and resources has come at the expense of all the other Machine Learning techniques and algorithms. Research and development in those other branches of ML has largely stagnated. It is one of the aims of this blog to rekindle, however modestly, research efforts in those other fields. 
</p><p>Around the time Deep Learning was taking off, Data Science became a very trendy moniker for a new professional occupation, unironically dubbed &#8220;the sexiest job of the 21st century&#8221;. This was also around the time I was changing careers, and Data Science, on the surface of it, seemed like <strong>the</strong> perfect new profession for me. It seemed to combine analytical and technical skills in just the right proportion, commensurate with my own capabilities and interests. I took its name at full face value. For me, to this day, Data Science is about two things: 1. Data, and 2. Science. Fast forward a decade or so, and several jobs across the technology landscape, and I&#8217;ve learned the hard way that I was sold a bill of goods. At every single workplace I&#8217;ve been at, Data Scientists were expected to be just another kind of Software Engineer. The unique insights they could bring to the table were, in the best case, unappreciated. At worst, colleagues and management were actively hostile to them. </p><p>There is a question, in these days of ever-approaching AGI, of whether Data Science has a future. It is a good question, a version of which can be applied to almost any white-collar profession. If it does have a future, though, then we owe it to the field to make it live up to its true promise and potential. It is my hope that this blog will also help in that regard, making Data Science what it really should be about: inquiry, research, insights, and understanding. Providing practitioners with the best possible tools and insights so that they can do their jobs as well as possible. And maybe, just maybe, earning the field some respect within the hallowed halls of the tech industry. 
</p>]]></content:encoded></item><item><title><![CDATA[Coming soon]]></title><description><![CDATA[This is XGBlog.]]></description><link>https://www.xgblog.ai/p/coming-soon</link><guid isPermaLink="false">https://www.xgblog.ai/p/coming-soon</guid><dc:creator><![CDATA[Bojan Tunguz]]></dc:creator><pubDate>Sun, 12 Jan 2025 23:06:01 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5cGh!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F784750bf-c764-473e-bd10-4e2d7f1353cc_400x400.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is XGBlog.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.xgblog.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.xgblog.ai/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item></channel></rss>