Book Review - Machine Learning for Tabular Data

XGBoost, Deep Learning, and AI

Feb 06, 2025

Machine Learning for Tabular Data: A Comprehensive Guide for Practitioners by Mark Ryan and Luca Massaron is a breath of fresh air for anyone who feels inundated by the hype around deep learning. Luca is someone I’ve known for a long time, and I’ve always admired his Kaggle contributions and all the work he has done there. I’ve also been fortunate enough to collaborate with him on a few projects, and it was a pleasure to write a short quote that appears in this book. I’ve had access to early versions of the book for a few months, and I’d like to offer my honest take on its content and on why it’s so valuable to practicing data scientists who deal with tabular data in their daily workflow.

While neural networks and unstructured data often steal the show in academic research, big tech investments, and media coverage, the reality for many of us working in the field is that structured (tabular) data remains the backbone of real-world business applications. That’s exactly where this book shines.

Right from the start, the authors ground the reader with a clear definition of tabular data and explain why it’s so critical across industries like finance, healthcare, and e-commerce. Many books give tabular data only a passing mention before moving on to sexier topics, but here, it’s the main event. The discussion kicks off with an overview of common challenges - things like handling missing data, transforming categorical features, and dealing with imbalanced datasets - and sets the stage for a deeper exploration of the techniques that matter most in practice.

One of the strongest aspects of this guide is its emphasis on data preparation, feature selection, and feature engineering. Far too often, we get hung up on which model to use - XGBoost, LightGBM, random forests, or a neural network - without giving feature engineering the attention it deserves. This book provides a thorough rundown of how to properly encode categorical variables, impute missing values, find leaky features, and create new features that capture hidden relationships. Anyone who has spent time wrangling messy data will appreciate the real-world wisdom in these chapters.
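
To make two of those preprocessing steps concrete, here is a minimal sketch of median imputation and one-hot encoding. It is my own illustration, not code from the book: the toy rows, the column names, and the helper functions `impute_median` and `one_hot` are all hypothetical, and it sticks to the Python standard library so it runs anywhere. In real work a library such as pandas or scikit-learn would typically do this job.

```python
from statistics import median

# Hypothetical toy rows (not from the book); None marks a missing value.
rows = [
    {"age": 34, "plan": "basic"},
    {"age": None, "plan": "premium"},
    {"age": 52, "plan": "basic"},
]


def impute_median(rows, column):
    """Return copies of the rows with missing values in `column`
    replaced by the median of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [
        {**r, column: fill if r[column] is None else r[column]}
        for r in rows
    ]


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        # Keep every other column unchanged.
        new_row = {k: v for k, v in r.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = int(r[column] == cat)
        encoded.append(new_row)
    return encoded


prepared = one_hot(impute_median(rows, "age"), "plan")
for row in prepared:
    print(row)
```

One caveat that ties back to leakage: in a real project the fill value and the category list should be computed on the training split only and then reused on validation and test data. The sketch skips that split for brevity.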

Despite the comprehensive scope, the text strikes a good balance between theory and application. You don’t need a PhD in statistics to keep up, but there’s enough depth to satisfy those who like to see some math behind the methods. Core techniques like gradient boosting, bagging, and ensembling are unpacked in a way that’s accessible, yet doesn’t shy away from the details that matter. The authors also dedicate space to deep learning for tabular data, carefully examining scenarios where it can be beneficial - and where it might fall short compared to more traditional approaches.

One section I found especially helpful is the discussion around when (and when not) to use deep learning. Since tabular datasets often aren’t as large as image or text corpora, and because feature engineering is so important, neural networks can sometimes be overkill. The book does a great job of explaining that while deep learning can excel with complex, high-dimensional data (and certain specialized use cases), it’s not always the magic bullet it’s cracked up to be.

Who will benefit most from this book? If you’re a beginner-to-intermediate data scientist, you’ll get a rock-solid foundation in everything from data preprocessing to model evaluation. Business analysts and machine learning engineers will also appreciate the hands-on examples using real datasets - there’s nothing abstract about this approach. And if you’re a researcher or academic, you’ll find a well-reasoned comparison of traditional ML methods versus deep learning strategies.

Overall, Machine Learning for Tabular Data is a fantastic resource that fills a gap in the literature. It doesn’t fall into the trap of downplaying classic methods just because deep learning is in vogue. Instead, it offers a practical, well-rounded, and refreshingly clear roadmap for tackling the kinds of problems most data practitioners deal with every day. If tabular data is your bread and butter, this book is worth adding to your reference shelf.

XGBlog
