Nuit Blanche: Are Random Forests Truly the Best Classifiers?

Friday, August 12, 2016

Are Random Forests Truly the Best Classifiers?

In Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? it looked like Random Forests were a good universal tool across a large variety of benchmarks. This seemed consistent with findings from Kaggle competitions, where Random Forests were always a good first tool for classification. (Nowadays, it looks like XGBoost is a good go-to technique.)

XGBoost: a scalable tree boosting system. 60% of @kaggle 2015 winning solutions used XGBoost https://t.co/tnC67VGL2M pic.twitter.com/qoSpbhHp08
— Ben Hamner (@benhamner) March 10, 2016

It looks like the good scores in that paper might have had several issues, as pointed out in this new paper:

This procedure is incorrect because it does not use a held-out test set. ... While hyperparameter tuning and testing do technically use different sets of examples, since one is a subset of the other, the two sets must be disjoint to avoid bias.
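The disjointness requirement in the quoted passage can be sketched as follows. This is only a minimal illustration, not the procedure used in either paper: the toy data, the 1-D k-nearest-neighbour classifier, the candidate values of k, and all function names (`three_way_split`, `knn_predict`, `accuracy`) are assumptions made up for this example. The point is simply that the hyperparameter is chosen on a validation set and the final score is reported on a test set that shares no examples with it.

```python
import random

def three_way_split(n, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle indices 0..n-1 and cut them into disjoint train/validation/test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training points (1-D features, labels 0/1)."""
    nearest = sorted(range(len(train_X)), key=lambda i: abs(train_X[i] - x))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > k else 0

def accuracy(train_X, train_y, X, y, k):
    """Fraction of points in (X, y) that k-NN fitted on the training set gets right."""
    hits = sum(knn_predict(train_X, train_y, x, k) == t for x, t in zip(X, y))
    return hits / len(X)

# Hypothetical toy data: a noisy threshold at 0.5 (for illustration only).
rng = random.Random(42)
X = [rng.random() for _ in range(300)]
y = [int(x + rng.gauss(0, 0.15) > 0.5) for x in X]

train, val, test = three_way_split(len(X))

def pick(indices, seq):
    return [seq[i] for i in indices]

Xtr, ytr = pick(train, X), pick(train, y)
Xva, yva = pick(val, X), pick(val, y)
Xte, yte = pick(test, X), pick(test, y)

# Tune the hyperparameter k on the validation set only.
candidates = [1, 3, 5, 9, 15]
best_k = max(candidates, key=lambda k: accuracy(Xtr, ytr, Xva, yva, k))

# Report accuracy on the test set, which played no role in choosing k.
test_acc = accuracy(Xtr, ytr, Xte, yte, best_k)
print("chosen k:", best_k, "held-out test accuracy:", test_acc)
```

If the score were instead reported on examples that also took part in choosing k, the chosen k would be partly fitted to those examples and the reported accuracy would tend to be optimistic.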

Also of interest is the good score of the ELM kernel approach. Here is the new paper: Are Random Forests Truly the Best Classifiers? by Michael Wainberg, Babak Alipanahi, and Brendan J. Frey

Abstract: The JMLR study Do we need hundreds of classifiers to solve real world classification problems? benchmarks 179 classifiers in 17 families on 121 data sets from the UCI repository and claims that “the random forest is clearly the best family of classifier”. In this response, we show that the study's results are biased by the lack of a held-out test set and the exclusion of trials with errors. Further, the study's own statistical tests indicate that random forests do not have significantly higher percent accuracy than support vector machines and neural networks, calling into question the conclusion that random forests are the best classifiers.

The implementation should eventually be here (I have asked JMLR).

In effect, using the holdout test set requires a new way of performing data analysis, see:
