binomialRF: interpretable combinatoric efficiency of random forests to identify biomarker interactions
Author
Zaim, Samir Rachid
Kenost, Colleen
Berghout, Joanne
Chiu, Wesley
Wilson, Liam
Zhang, Hao Helen
Lussier, Yves A.
Affiliation
Univ Arizona Hlth Sci, Ctr Biomed Informat & Biostat
Univ Arizona, Grad Interdisciplinary Program Stat
Univ Arizona, Coll Sci, Dept Math
Univ Arizona, Canc Ctr
Univ Arizona, BIO5 Inst
Issue Date
2020-08
Publisher
BMC
Citation
Zaim, S. R., Kenost, C., Berghout, J., Chiu, W., Wilson, L., Zhang, H. H., & Lussier, Y. A. (2020). binomialRF: interpretable combinatoric efficiency of random forests to identify biomarker interactions. BMC Bioinformatics, 21(1), 1-22.
Journal
BMC BIOINFORMATICS
Rights
© The Author(s) 2020. This article is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Collection Information
This item from the UA Faculty Publications collection is made available by the University of Arizona with support from the University of Arizona Libraries. If you have questions, please contact us at repository@u.library.arizona.edu.
Abstract
Background: In this era of data science-driven bioinformatics, machine learning research has focused on feature selection, as users want more interpretation and post-hoc analyses for biomarker detection. However, studies with more features (i.e., transcripts) than samples (i.e., mice or human samples) pose major statistical challenges for biomarker detection, as traditional statistical techniques are underpowered in high dimensions. Second- and third-order interactions of these features pose a substantial combinatoric dimensional challenge. In computational biology, random forest (RF) classifiers are widely used due to their flexibility, powerful performance, ability to rank features, and robustness to the "P >> N" high-dimensional limitation that many matrix regression algorithms face. We propose binomialRF, a feature selection technique in RFs that provides an alternative interpretation for features using a correlated binomial distribution and scales efficiently to analyze multiway interactions.
Results: In both simulations and validation studies using datasets from the TCGA and UCI repositories, binomialRF showed computational gains (5 to 300 times faster) while maintaining competitive variable precision and recall in identifying biomarkers' main effects and interactions. In two clinical studies, the binomialRF algorithm prioritizes previously published relevant pathological molecular mechanisms (features) with high classification precision and recall using features alone, as well as with their statistical interactions alone.
Conclusion: binomialRF extends previous methods for identifying interpretable features in RFs and brings them together under a correlated binomial distribution to create an efficient hypothesis testing algorithm that identifies biomarkers' main effects and interactions. Preliminary results in simulations demonstrate computational gains while retaining competitive model selection and classification accuracies. Future work will extend this framework to incorporate ontologies that provide pathway-level feature selection from gene expression input data.
Note
Open access journal
ISSN
1471-2105
PubMed ID
32859146
Version
Final published version
DOI
10.1186/s12859-020-03718-9
Related articles
- Correction to: binomialRF: interpretable combinatoric efficiency of random forests to identify biomarker interactions.
- Authors: Zaim SR, Kenost C, Berghout J, Chiu W, Wilson L, Zhang HH, Lussier YA
- Issue date: 2020 Nov 2
- Differential privacy-based evaporative cooling feature selection and classification with relief-F and random forests.
- Authors: Le TT, Simmons WK, Misaki M, Bodurka J, White BC, Savitz J, McKinney BA
- Issue date: 2017 Sep 15
- Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.
- Authors: Crider K, Williams J, Qi YP, Gutman J, Yeung L, Mai C, Finkelstain J, Mehta S, Pons-Duran C, Menéndez C, Moraleda C, Rogers L, Daniels K, Green P
- Issue date: 2022 Feb 1
- Machine learning-based feature selection to search stable microbial biomarkers: application to inflammatory bowel disease.
- Authors: Lee Y, Cappellato M, Di Camillo B
- Issue date: 2022 Dec 28
- Using a machine learning approach to identify key prognostic molecules for esophageal squamous cell carcinoma.
- Authors: Li MX, Sun XM, Cheng WG, Ruan HJ, Liu K, Chen P, Xu HJ, Gao SG, Feng XS, Qi YJ
- Issue date: 2021 Aug 9

