A few days ago I published the first post in a series in which I aim to explain what I mean by “XGBoost is All You Need”. In that post I gave a general overview, explained my own background and how my point of view was forged, and set the record straight: yes, I am very serious about my appreciation for XGBoost, but no, I am not nearly as dogmatic as you might expect.
Today I’d like to take a deeper dive into tabular data. This is the kind of data that XGBoost is primarily designed to handle. It also happens to be the most common form of data used by Data Science practitioners, analysts, and pretty much anyone dealing with any kind of business data. If you’ve ever used Excel, or even created an itemized shopping list, then you have used tabular data. So let’s dig in to see what this data is all about.
What is Tabular Data?
In the simplest terms, tabular data is historically just transactional data. It is the kind of data we use to record some kind of transaction: a sale, an inventory of goods, or any other transfer of goods or services. It is very likely that transactional data existed before the dawn of writing, and was a main driving force behind the development of written language. Tabular data has also been a main driving force behind the development of modern digital computing. The B in IBM, for instance, stands for Business, and the information that businesses really needed to know and understand was stored in tabular datasets. In the latter part of the 20th century this also led to the growth of the business database industry, yielding tech giants such as Oracle.
Why do we care about tabular data?
When it comes to real-world machine learning applications, tabular data often steals the spotlight. Although there’s no definitive measure, many estimates suggest that well over 50%—and in some circles even north of 90%—of data scientists rely on tables as their primary data structure. This isn’t surprising given the steady growth of relational databases in business and academia, along with the clear financial incentive: structured data powers everything from transactional systems to analytics dashboards, sometimes representing billions of dollars in transactions.
However, tabular data can be tricky. It’s arguably the most heterogeneous type of data out there—each column can represent a completely different domain, with varied distributions, scales, and data types. Such diversity can stymie many off-the-shelf machine learning models that assume cleaner or more uniform input.
From a purely intellectual standpoint, tabular data offers a playground of challenges that demand creative solutions. Each column can represent a distinct distribution, from numeric to categorical, and may include missing values, outliers, or complex interactions. Traditional machine learning models often struggle to adapt to this variability, prompting researchers to devise advanced techniques and algorithms that can handle mixed data types and subtle patterns. This is precisely where methods like XGBoost shine, as they effectively accommodate the wide-ranging nature of tabular data while remaining fast and accurate in practice.
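To make this concrete, here is a minimal sketch (the tiny arrays are invented purely for illustration) of XGBoost fitting data that contains a missing value and wildly different column scales, with no imputation or normalization step:

```python
import numpy as np
import xgboost as xgb

# Toy data with a missing value; the two columns differ wildly in scale on purpose.
X = np.array([
    [25.0,  40_000.0],
    [47.0, 120_000.0],
    [31.0,    np.nan],   # XGBoost learns a default branch for missing values
    [55.0,  90_000.0],
])
y = np.array([0, 1, 0, 1])

# No imputation or feature scaling required: tree splits are invariant to
# monotone rescaling, and NaNs are routed down a learned default direction.
model = xgb.XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)
```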
Yeah, but what really is tabular data???
One main misconception about tabular data is that it’s merely a storage format, with nothing intrinsically different about the data itself. However, tabular data is qualitatively and structurally different from text, sound, image, and video data.
One of the main ways tabular data differs from non-tabular data is the lack of intrinsic local structure. In audio, for instance, neighboring samples in time matter; in images, neighboring pixels in space matter. These forms of data have a “locality” or “ordering” that influences how we process and interpret them. With tabular data, by contrast, you can shuffle the columns any way you want, and it typically won’t affect the basic meaning of the data. In many cases the same is true when you shuffle the rows.
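A minimal illustration of that permutation invariance, using pandas with a made-up toy table:

```python
import pandas as pd

# A small tabular dataset: each column is an independent attribute.
df = pd.DataFrame({
    "age": [34, 52, 29],
    "income": [72_000, 105_000, 48_000],
    "churned": [0, 1, 0],
})

# Shuffling the column order changes nothing about what the data means...
shuffled = df[["churned", "age", "income"]]

# ...and shuffling the rows is equally harmless for independent records.
resampled = df.sample(frac=1, random_state=42)

# Contrast with an image: permuting pixel positions destroys the picture,
# because images carry their meaning in the spatial layout itself.
```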
Time-series data sits somewhere in between purely tabular and non-tabular formats. It certainly has an inherent order—points in time—and this sequential aspect can be crucial for analysis. However, in practice, many analysts will transform time-series data into a tabular format through feature engineering (e.g., computing moving averages, lags, or other aggregate features). Once that transformation happens, the result often behaves more like traditional tabular data, allowing analysts to apply many of the same techniques they would use on any typical row-and-column dataset.
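Here is a rough sketch of what that tabularization can look like; the daily sales series and column names are invented for illustration:

```python
import pandas as pd

# A made-up daily sales series, used purely for illustration.
sales = pd.DataFrame(
    {"sales": [120, 135, 128, 150, 160, 155, 170]},
    index=pd.date_range("2024-01-01", periods=7, freq="D"),
)

# Feature engineering flattens the sequence into ordinary rows and columns:
sales["lag_1"] = sales["sales"].shift(1)                    # yesterday's value
sales["rolling_mean_3"] = sales["sales"].rolling(3).mean()  # 3-day moving average

# After dropping the warm-up rows, each row is a self-contained feature
# vector; any tabular model can consume it without ever knowing the
# source was a time series.
tabular = sales.dropna()
```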
Practical considerations with tabular data
Tabular data underpins many of the most critical functions in business. It’s the direct result of transactional events—everything from sales orders and customer registrations to shipping updates and returns. Because these datasets are generated in the course of real-world processes, acquiring them can be costly and time-consuming: you need an actual business running its operations just to gather meaningful numbers. Naturally, the bigger the business, the larger (and often richer) these tabular datasets become.
The scale and significance of such data also mean that these datasets are closely guarded. In many companies, they are considered crown jewels, offering insights that can shape strategy and give competitive advantages. This exclusivity poses challenges for those looking to build expertise. Tabular data modeling lies at the heart of Data Science as a discipline, yet access to large, high-quality datasets can be scarce, especially for students or researchers who don’t have direct industry ties.
Even for those with access, mastering tabular data techniques isn’t straightforward. It often involves hands-on practice with bespoke datasets, a clear understanding of the underlying business processes, and mentorship from experienced practitioners. Kaggle’s earlier competitions were a valuable resource in this regard, exposing participants to a variety of tabular problems. But in industries like healthcare, finance, or insurance—which generate some of the richest tabular data—strict regulations and a conservative stance on data sharing can limit opportunities for open experimentation.
Another key factor is that, in many situations, simply acquiring more and better-quality data can outperform the benefits of building more advanced models. The tight interweaving of data with business processes makes the data itself immensely valuable. Taken together, these considerations show that tabular data is a complex topic, with its nuances spanning everything from business strategy and regulations to machine learning and mentorship—a real challenge for those looking to become experts in this fundamental pillar of Data Science.
Tabular data ML has resisted the DL revolution
Neural networks have become a mainstay in fields like computer vision and natural language processing, but their effectiveness on tabular data is far less established. Despite some recent progress, using deep learning to model tabular data remains, to a large extent, an “unsolved” issue. Many researchers and practitioners still rely heavily on gradient-boosted decision trees for tasks like classification and regression on structured datasets. In fact, tools such as XGBoost, LightGBM, and CatBoost often deliver impressive results right out of the box, with minimal tuning required.
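As an illustration of what “right out of the box” means in practice, here is a minimal sketch using scikit-learn’s California housing dataset as a stand-in for real business data, with default XGBoost hyperparameters and no preprocessing at all:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import xgboost as xgb

# An out-of-the-box baseline: default hyperparameters, no preprocessing.
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = xgb.XGBRegressor()   # defaults only; no tuning
model.fit(X_train, y_train)

print("R^2 on held-out data:", r2_score(y_test, model.predict(X_test)))
```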
That’s not to say people haven’t tried to develop neural network frameworks that can challenge tree-based models on tabular data. Several new libraries and methodologies have emerged, promising robust performance and innovative architectures designed for structured data. However, for the most part, these neural network solutions either excel only on a few handpicked datasets or require extremely long training times to match the performance of faster and more straightforward tree-based methods. A critical factor at play is the inherent “locality inductive bias” in neural networks, which can limit their ability to capture patterns that aren’t naturally localized in a tabular structure.
Ultimately, while it’s exciting to see ongoing research efforts in this area, the reality for many real-world applications is that gradient-boosted decision trees remain the go-to option. They offer strong performance with fewer computational demands and often simpler tuning processes. Until neural networks overcome these limitations—particularly around training time and generalization beyond carefully selected datasets—the advice that “XGBoost is all you need” is likely to remain true for the majority of tabular problems.
Great post!
Love the post and the series! I hope you come up with something similar in the future.
I also have a question. When you wrote:
"A critical factor at play is the inherent 'locality inductive bias' in neural networks, which can limit their ability to capture patterns that aren't naturally localized in a tabular structure."
Isn’t the locality inductive bias only present in CNN or RNN architectures, and not in fully connected networks and transformers?