When to use which approach/technique with a given dataset
These are my rules of thumb, and caveats could fill an entire book
As I mentioned in my initial post on this blog, I am not dogmatic about any particular machine learning algorithm in my own work, and even less so when it comes to what others should use. If you have been using a certain algo for most of your professional life and it works for you, hey, more power to you. I am eminently practical and pragmatic when it comes to data science and ML. I am also insatiably curious and love to tinker, so when I learn about a new technique, ML library, or approach, I love to take it for a spin. I love ML modeling; I find it intrinsically fun and do it for its own sake. Nonetheless, when I need to get something done professionally, in a way that is both foundationally sound and practically robust, the following are my first-order recommendations:
Up to a few hundred datapoints, use Stats. Yes, yes, I know, all of machine learning is just a subset of Statistics. But practically and methodologically, the way statistics is used in science and research is very different from the way we approach Machine Learning in Data Science. I am talking here about calculating correlations between variables, t-tests, and all of that. Descriptive Statistics alone can get you pretty far. Predictive rules of thumb based on them will often be sufficient, especially if, in addition to the paucity of datapoints, you are dealing with just a handful of variables.
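For concreteness, here is a minimal sketch of that small-data workflow using pandas and scipy. The dataset, column names, and distributions are all made up purely for illustration:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical small dataset: ~200 observations, a handful of variables
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "group": rng.choice(["A", "B"], size=200),
    "spend": rng.gamma(shape=2.0, scale=50.0, size=200),
    "visits": rng.poisson(lam=3.0, size=200),
})

# Descriptive statistics alone get you surprisingly far
print(df.describe())

# Correlation between the numeric variables
print(df[["spend", "visits"]].corr())

# Two-sample t-test: does spend differ between groups A and B?
a = df.loc[df["group"] == "A", "spend"]
b = df.loc[df["group"] == "B", "spend"]
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```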
For a few hundred to a few thousand datapoints, use linear/logistic regression. Linear algorithms are simple to implement and even simpler to interpret. Furthermore, they tend not to overfit, so if you need an algorithm that will remain robust on future data, they will probably do just fine.
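A minimal sketch of the logistic regression route with scikit-learn; the synthetic dataset here is just a stand-in for whatever few-thousand-row table you actually have:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in dataset of ~2,000 rows; substitute your own features and target
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
# The coefficients are directly interpretable: one sign and magnitude per feature
print("coefficients:", model.coef_.round(3))
```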
Between a few thousand and about 10,000 datapoints, it's anyone's guess. Gradient boosters generally do well here, with other "classical" algorithms (SVMs, for instance) sometimes shining. In my experience, datasets of this size are the ones where nonlinear effects start to really matter, and you potentially have enough data that you will not shoot yourself in the foot by overfitting. Nonetheless, not all data is made the same, especially in the tabular data world, and you need to exercise a lot of intuition and judgement when deciding which technique or algo to use.
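When it genuinely is anyone's guess, cross-validation is the honest arbiter. A quick sketch, again on a synthetic stand-in dataset, comparing a gradient booster against an RBF-kernel SVM:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in dataset in the "awkward middle" size range
X, y = make_classification(n_samples=5000, n_features=20, n_informative=8, random_state=0)

candidates = {
    "gradient boosting": HistGradientBoostingClassifier(random_state=0),
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC()),
}

# Let 5-fold cross-validation settle the argument for your particular data
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```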
Many thousands to about a billion datapoints is where gradient boosted trees rule. If you need just one algorithm, go with this. You'll never go wrong. This is the domain where nonlinearities in your data become significant, and decision trees are a great algorithm for squeezing them out.
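If you want to see what "just one algorithm" looks like in code, here is a bare-bones sketch using XGBoost (any gradient boosting library, such as LightGBM or scikit-learn's HistGradientBoostingClassifier, would do; the hyperparameters below are placeholders, not tuned recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in for a large tabular dataset
X, y = make_classification(n_samples=200_000, n_features=30, n_informative=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Placeholder hyperparameters; in practice, tune them with cross-validation
model = XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    tree_method="hist",  # histogram-based splits scale well to large data
)
model.fit(X_train, y_train)

pred = model.predict_proba(X_test)[:, 1]
print("test AUC:", roc_auc_score(y_test, pred))
```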
If you have several billion datapoints, or many times that amount, check out neural nets. They can absorb datasets of that size easily. As mentioned in one of my earlier posts, some research on applying neural networks to tabular data suggests that the reason they struggle with this kind of data is that the decision boundaries between different regions of the tabular data space are really, really jagged. However, when you have lots of data, those region boundaries start to look increasingly smooth, which gives a neural network lots of material to work with.
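For illustration, here is a stripped-down PyTorch MLP for tabular data. The architecture, sizes, and training loop are a generic sketch under made-up assumptions, not a recommendation of any particular tabular deep learning method; at true billion-row scale you would stream batches from disk rather than hold everything in memory:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for a very large tabular dataset (here only 100k rows in memory)
X = torch.randn(100_000, 32)
y = (X[:, 0] * X[:, 1] + X[:, 2] > 0).float()
loader = DataLoader(TensorDataset(X, y), batch_size=1024, shuffle=True)

# A plain MLP; with enough data the jagged decision boundaries of tabular
# problems start to look smooth enough for a network like this to fit
model = nn.Sequential(
    nn.Linear(32, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(3):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb).squeeze(1), yb)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```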