This is the fifth installment in the series of posts about XGBoost based on my 2024 GTC Presentation. You can find the previous posts here: Part 1, Part 2, Part 3, and Part 4.
In the previous posts I took a somewhat high-level approach, and talked mostly in general terms about what XGBoost is and how it fits within the whole ecosystem of Data Science and Machine Learning for Tabular Data. In this and the following two posts we are finally going to get our hands a bit dirty. Warning: there will be code! I wanted to showcase a few nontrivial use cases for XGBoost, based on my own work. Many of the lessons from these use cases are widely applicable and can be extended to different algorithms, but some are highly specific to XGBoost (the library in particular) and might be clunky to implement with other approaches.
XGBoost distributed computing and GPU Support
XGBoost has evolved significantly in how it supports GPUs and distributed computing. In 2017, version 0.7 introduced GPU acceleration by leveraging NVIDIA’s CUDA libraries. This offloaded certain operations, such as gradient computation and tree construction, to GPUs and offered a significant speed boost over CPU-based training, especially for large datasets and many boosting iterations.
By 2019, it became clear that a single GPU could still be a bottleneck for increasingly large training tasks. Version 0.9 addressed this by introducing multi-GPU support, allowing XGBoost to distribute the training workload across multiple GPUs on a single machine. This added parallelism accelerated model training further and made it feasible to handle even bigger datasets more efficiently.
Organizations that needed to scale beyond a single machine required a solution for distributing training across clusters. Version 1.0 of XGBoost delivered this capability by integrating with Dask, a Python library designed for parallel and distributed computing. This integration enabled both preprocessing and training tasks to run on multiple machines in a cluster. Larger datasets could be split across these machines, with each node handling part of the workload, leading to faster overall training times.
Finally, in version 1.4, all multi-GPU and multi-machine capabilities were consolidated under Dask’s framework. Instead of relying on separate features for single-node versus multi-node setups, users could simply configure Dask for their cluster. Dask would then manage resource allocation for both CPU and GPU workloads, streamlining the distributed training process and reducing the complexity involved in scaling XGBoost across different hardware configurations.
Before we go any further, though, it would be useful to explain what Dask and Optuna are, especially if you have never come across them or have never used them.
What is Dask?
Dask is a flexible parallel computing library for Python that makes it easy to scale computations from a single laptop to a large cluster. It provides parallel collections like arrays, dataframes, and lists that mimic their in-memory equivalents but can operate on larger-than-memory datasets by breaking them into smaller chunks and distributing work across multiple cores or machines. Dask integrates seamlessly with the broader PyData ecosystem, allowing users to work with familiar libraries like NumPy, pandas, and scikit-learn, but with the option to speed up or scale out whenever needed. Its scheduler dynamically constructs and executes task graphs under the hood, handling optimizations and load balancing so users can focus on writing clear, efficient, and parallel code.
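To make this concrete, here is a minimal sketch of the Dask workflow, using an illustrative file name and column names that are not from the Porto Seguro dataset:

```python
# A minimal sketch of scaling a pandas-style workflow with Dask.
# "data.csv", "group_col", and "value_col" are illustrative placeholders.
import dask.dataframe as dd

# Lazily read the CSV in chunks (partitions) instead of loading it all at once
df = dd.read_csv("data.csv")

# Operations only build a task graph; nothing is computed yet
result = df.groupby("group_col")["value_col"].mean()

# .compute() triggers execution across all available cores (or a cluster)
print(result.compute())
```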
What is Optuna?
Optuna is an open-source Python library designed to automate hyperparameter optimization for machine learning models. Hyperparameter optimization is the process of systematically searching for the best combination of settings (e.g., learning rate, number of layers, regularization parameters) that yields the highest performance for a model. Optuna streamlines this process through an easy-to-use “define-by-run” approach, allowing users to dynamically define the hyperparameter search space. It employs sophisticated search algorithms - such as Bayesian optimization with TPE - and includes features like pruning to terminate underperforming trials early, thereby saving computational resources. Optuna is framework- and library-agnostic, and can be used both with “classical” ML algorithms and with all the most popular neural network frameworks. It integrates seamlessly with popular frameworks like PyTorch and TensorFlow, making it both efficient and convenient to discover optimal hyperparameters, ultimately enhancing the accuracy and reliability of machine learning models.
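A tiny, self-contained sketch of the define-by-run idea (the objective here is a toy quadratic, not the XGBoost objective we will use later):

```python
import optuna

def objective(trial):
    # The search space is defined dynamically, inside the objective itself
    x = trial.suggest_float("x", -10.0, 10.0)
    return (x - 2.0) ** 2  # Optuna minimizes this value by default

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```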
XGBoost and Dask for hyperparameter optimization - an example with the Porto Seguro dataset
We’ll show how to combine Dask, XGBoost, and Optuna for hyperparameter optimization. We’ll use the dataset from the Porto Seguro Kaggle competition, and we’ll do the training on a DGX H100, over 8 (!!!) H100 GPUs. This is, granted, a bit of an overkill in terms of compute, but it does make training run really, really fast, which comes in really handy when you are trying hundreds, or even thousands, of different hyperparameter combinations.
For the purposes of this example, the most relevant thing to know about the Porto Seguro dataset and task is that it was a binary classification competition with anonymized features.
All the code for the competition can be found in the following notebook on GitHub. The full repo, with all the other scripts and output artifacts, can be found here.
First, we import all the essential libraries for distributed computing, data manipulation, machine learning, and hyperparameter optimization. dask.distributed and dask_cuda enable parallel/distributed computing across multiple GPUs, while pandas and numpy handle data structures and numerical operations. xgboost provides gradient boosting methods, sklearn offers model evaluation utilities (like KFold and roc_auc_score), and optuna is a framework for hyperparameter tuning. The gc (garbage collector) and logging modules help with memory management and logging, respectively.
Then we create a local GPU-aware Dask cluster with eight GPU workers (via LocalCUDACluster(n_workers=8)) and initialize a Dask client that connects to it, allowing us to distribute and manage computational tasks across those GPU workers.
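A minimal sketch of that setup, using the imports above:

```python
# Spin up a local multi-GPU Dask cluster; n_workers=8 matches the eight
# H100 GPUs of the DGX used here.
cluster = LocalCUDACluster(n_workers=8)
client = Client(cluster)
print(client)  # shows the dashboard link and the workers that joined
```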
We then loop over five training/validation folds, lazily read each fold from CSV with delayed and dd.from_delayed, split out the “target” column, and then stash both the feature matrix and the target vector in lists. In other words, we collect each fold’s data and labels without actually loading them into memory until needed, using Dask’s delayed execution model. I prefer to prepare folds and save them as separate files, because Dask can be a bit finicky about slicing them, especially if they are loaded using the lazy read. We are also using a full 5-fold validation scheme for Optuna. Normally this would be a huge computational and time overkill, but hey, when you have lots of computational resources at your disposal, why not. :D
Optuna objective function
The code below defines a function which serves as the Optuna objective for tuning XGBoost hyperparameters in a cross-validated manner using Dask. Inside the objective function, a dictionary of candidate hyperparameters (params) is defined by calling various trial.suggest_* methods (e.g., to choose values for lambda, alpha, colsample_bytree, etc.). A five-fold cross-validation (KFold) then splits the training data, and in each fold the code creates Dask-based DMatrix objects for training and validation. Next, it trains an XGBoost model on each fold with the current candidate parameters and retrieves fold-specific predictions, storing them in train_oof. After all folds, the function calculates the Gini metric on these out-of-fold predictions and returns that metric, which Optuna uses to guide the hyperparameter search.
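Here is a sketch of such an objective function. It reuses the pre-saved folds and the Dask client from the earlier snippets rather than splitting with KFold inside the function, and the search ranges and booster parameters are illustrative; the exact values in the original notebook may differ:

```python
def objective(trial):
    params = {
        "objective": "binary:logistic",
        "tree_method": "hist",
        "device": "cuda",  # XGBoost >= 2.0; older versions use tree_method="gpu_hist"
        "lambda": trial.suggest_float("lambda", 1e-3, 10.0, log=True),
        "alpha": trial.suggest_float("alpha", 1e-3, 10.0, log=True),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.3, 1.0),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 100),
    }

    oof_preds, oof_targets = [], []
    for fold in range(5):
        # Dask-based DMatrix objects for this fold's training and validation data
        dtrain = xgb.dask.DaskDMatrix(client, train_features[fold], train_targets[fold])
        dvalid = xgb.dask.DaskDMatrix(client, valid_features[fold], valid_targets[fold])

        output = xgb.dask.train(client, params, dtrain, num_boost_round=500)
        preds = xgb.dask.predict(client, output["booster"], dvalid)

        oof_preds.append(preds.compute())
        oof_targets.append(valid_targets[fold].compute().values)

    # Normalized Gini on the out-of-fold predictions (Gini = 2 * AUC - 1)
    y_true = np.concatenate(oof_targets)
    y_pred = np.concatenate(oof_preds)
    return 2.0 * roc_auc_score(y_true, y_pred) - 1.0
```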
Logging and optimizing
This snippet configures the Python logging system so that Optuna’s messages go to a file instead of standard error, then creates an Optuna “study” (the container that orchestrates hyperparameter trials), and finally kicks off the optimization process. First, a logger is obtained and its logging level is set to INFO. A FileHandler is added so that all messages at INFO level or above are written to the file optuna_xgb_output_0.log. The lines with optuna.logging.enable_propagation() and optuna.logging.disable_default_handler() ensure that logs are forwarded to the root logger (and hence into the file), while preventing duplicate outputs to standard error. The create_study() call sets up an Optuna study named five_fold_optuna_xgb_0 with a specified SQLite database to record the experiment results. Finally, the code invokes study.optimize(...) with n_trials=3, instructing Optuna to run three hyperparameter-search trials using the provided objective function. This optimization process runs really, really fast on a DGX H100 - between 30 and 60 seconds per trial!
Dask Training on a Cluster
Dask training can be done on a single CPU machine, on a single multi-GPU machine, or on a cluster of CPUs and GPUs. To set up a Dask cluster, you first need to launch a central scheduler process, which acts as the traffic controller for all your workers. From the command line, you can simply run dask scheduler to start up this scheduler. The scheduler will report a TCP address where it is listening (for example, tcp://127.0.0.1:8786). This address is how clients and workers find and communicate with the scheduler. In a simple setup on a single machine, this is often all you need, but in more complex, multi-node environments you would run the scheduler on a network-accessible interface and start workers on remote machines.
After the scheduler is up and running, you can add workers to the cluster. In our example, we would use dask cuda worker 127.0.0.1:8786 (if we want GPU-enabled workers) to connect a worker process to the scheduler. Each worker will register itself with the scheduler, making its CPU or GPU resources available for distributed tasks. Finally, in our Python session, we connect a Dask client to this cluster by importing Client from dask.distributed and creating an instance pointing to the same scheduler address, for example client = Client("127.0.0.1:8786"). This tells our Python scripts or notebooks to submit computations through the Dask scheduler and distribute tasks across the available workers.
XGBoost: train anywhere, deploy anywhere
One of the most wonderful things about XGBoost is its extremely widespread adoption and compatibility. It has been ported to almost any compute environment and device you can think of. For instance, I was able to export the above model, trained on a DGX H100, as a JSON file. I then loaded that model onto my tiny Raspberry Pi Zero and was able to run inference with it!
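A sketch of that export/import round trip, where booster is a trained model like the one from the previous snippet and the file name is illustrative:

```python
import numpy as np
import xgboost as xgb

# On the training machine: save the trained booster to a portable JSON file
booster.save_model("porto_seguro_xgb.json")

# ... later, on the target device (even a Raspberry Pi Zero) ...
loaded = xgb.Booster()
loaded.load_model("porto_seguro_xgb.json")

# Score a few rows; the column count must match what the model was trained on
preds = loaded.predict(xgb.DMatrix(np.random.random((5, 10))))
```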