
L Song, P Langfelder and S Horvath
BACKGROUND: Ensemble predictors such as the random forest are known to have superior accuracy, but their black-box predictions are difficult to interpret. In contrast, a generalized linear model (GLM) is very interpretable, especially when forward feature selection is used to construct the model. However, forward feature selection tends to overfit the data and leads to low predictive accuracy. Therefore, it remains an important research goal to combine the advantages of ensemble predictors (high accuracy) with the advantages of forward regression modeling (interpretability). To address this goal, several articles have explored GLM-based ensemble predictors. Since limited evaluations suggested that these ensemble predictors were less accurate than alternative predictors, they have received little attention in the literature.

RESULTS: Comprehensive evaluations involving hundreds of genomic data sets, the UCI machine learning benchmark data, and simulations are used to give GLM-based ensemble predictors a new and careful look. A novel bootstrap aggregated (bagged) GLM predictor that incorporates several elements of randomness and instability (random subspace method, optional interaction terms, forward variable selection) often outperforms a host of alternative prediction methods, including random forests and penalized regression models (ridge regression, elastic net, lasso). This random generalized linear model (RGLM) predictor provides variable importance measures that can be used to define a “thinned” ensemble predictor (involving few features) that retains excellent predictive accuracy.

CONCLUSION: RGLM is a state-of-the-art predictor that shares the advantages of a random forest (excellent predictive accuracy, feature importance measures, out-of-bag estimates of accuracy) with those of a forward selected generalized linear model (interpretability). These methods are implemented in the freely available R software package randomGLM.
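The ingredients named in the abstract (bootstrap aggregation, the random subspace method, and forward variable selection) can be illustrated with a small sketch. This is not the randomGLM package and does not reproduce the authors' algorithm in detail: it assumes a Gaussian GLM (ordinary least squares), selects terms by AIC, omits interaction terms, and counts how often a feature is selected as a stand-in importance measure. All function names (`solve`, `fit_ols`, `forward_select`, `fit_bagged_glm`, `predict`, `importance`) and parameters are hypothetical, chosen only for illustration.

```python
import math
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ols(X, y, cols):
    """Fit a Gaussian GLM (least squares) on `cols` plus an intercept; return (beta, AIC)."""
    rows = [[1.0] + [x[c] for c in cols] for x in X]
    p = len(cols) + 1
    # Normal equations; the tiny ridge term is an assumption added for numerical stability.
    xtx = [[sum(r[i] * r[j] for r in rows) + (1e-8 if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    rss = sum((t - sum(b * v for b, v in zip(beta, r))) ** 2 for r, t in zip(rows, y))
    aic = len(y) * math.log(rss / len(y) + 1e-12) + 2 * p
    return beta, aic


def forward_select(X, y, candidates, max_terms=5):
    """Greedily add the candidate that lowers AIC most; stop when AIC no longer improves."""
    chosen = []
    beta, best_aic = fit_ols(X, y, chosen)
    while len(chosen) < max_terms:
        trials = [(fit_ols(X, y, chosen + [c]), c) for c in candidates if c not in chosen]
        if not trials:
            break
        (new_beta, new_aic), c = min(trials, key=lambda t: t[0][1])
        if new_aic >= best_aic:
            break
        chosen.append(c)
        beta, best_aic = new_beta, new_aic
    return chosen, beta


def fit_bagged_glm(X, y, n_bags=20, n_candidates=3, seed=0):
    """Fit one forward-selected model per bootstrap sample and random feature subset."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    models = []
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]                # bootstrap sample of rows
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        candidates = rng.sample(range(p), min(n_candidates, p))   # random subspace
        models.append(forward_select(Xb, yb, candidates))
    return models


def predict(models, x):
    """Average the predictions of all bagged models for one observation x."""
    preds = [beta[0] + sum(b * x[c] for b, c in zip(beta[1:], cols))
             for cols, beta in models]
    return sum(preds) / len(preds)


def importance(models, p):
    """Count how many bagged models selected each of the p features."""
    counts = [0] * p
    for cols, _ in models:
        for c in cols:
            counts[c] += 1
    return counts
```

Calling `fit_bagged_glm(X, y)` on a list of feature rows `X` and outcomes `y` returns one small model per bag, and `importance(models, p)` then suggests which features a "thinned" predictor might keep. A binary outcome would require replacing `fit_ols` with a logistic fit, which this sketch leaves out.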
Statistics, Predictor, Statistical classification, Generalized linear model, Prediction, Random multinomial logit, Machine learning, Regression analysis