A machine learning-enabled open biodata resource inventory from the scientific literature
Name:
journal.pone.0294812.pdf
Size:
3.068Mb
Format:
PDF
Description:
Final Published Version
Affiliation
Department of Biosystems Engineering, University of Arizona
Issue Date
2023-11-28
Publisher
The Public Library of Science (PLOS)
Citation
Imker HJ, Schackart KE III, Istrate A-M, Cook CE (2023) A machine learning-enabled open biodata resource inventory from the scientific literature. PLoS ONE 18(11): e0294812. https://doi.org/10.1371/journal.pone.0294812
Journal
PLoS ONE
Rights
© 2023 Imker et al. This is an open access article distributed under the terms of the Creative Commons Attribution License.
Collection Information
This item from the UA Faculty Publications collection is made available by the University of Arizona with support from the University of Arizona Libraries. If you have questions, please contact us at repository@u.library.arizona.edu.
Abstract
Modern biological research depends on data resources. These resources archive difficult-to-reproduce data and provide added-value aggregation, curation, and analyses. Collectively, they constitute a global infrastructure of biodata resources. While the organic proliferation of biodata resources has enabled incredible research, sustained support for the individual resources that make up this distributed infrastructure is a challenge. The Global Biodata Coalition (GBC) was established by research funders in part to aid in developing sustainable funding strategies for biodata resources. An important component of this work is understanding the scope of the resource infrastructure: how many biodata resources there are, where they are, and how they are supported. Existing registries require self-registration and/or extensive curation, and we sought to develop a method for assembling a global inventory of biodata resources that could be periodically updated with minimal human intervention. The approach we developed identifies biodata resources using open data from the scientific literature. Specifically, we used a machine learning-enabled natural language processing approach to identify biodata resources from titles and abstracts of life sciences publications contained in Europe PMC. Pretrained BERT (Bidirectional Encoder Representations from Transformers) models were fine-tuned to classify publications as describing a biodata resource or not and to predict the resource name using named entity recognition. To improve the quality of the resulting inventory, low-confidence predictions and potential duplicates were manually reviewed. Further information about the resources, such as funder and geolocation information, was then obtained using article metadata. These efforts yielded an inventory of 3112 unique biodata resources based on articles published from 2011-2021. The code was developed to facilitate reuse and includes automated pipelines.
All products of this effort are released under permissive licensing, including the biodata resource inventory itself (CC0) and all associated code (BSD/MIT).
Copyright: © 2023 Imker et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Note
Open access journal
ISSN
1932-6203
PubMed ID
38015968
Version
Final Published Version
DOI
10.1371/journal.pone.0294812
Related articles
- A Fine-Tuned Bidirectional Encoder Representations From Transformers Model for Food Named-Entity Recognition: Algorithm Development and Validation.
  - Authors: Stojanov R, Popovski G, Cenikj G, Koroušić Seljak B, Eftimov T
  - Issue date: 2021 Aug 9
- Using text-mining to measure the scientific impact and legacy of ELIXIR, a distributed research infrastructure for life science data.
  - Authors: De Leo F, Balsyte E, Petryszak R, D'Ambrosio M, Bruno C, Cook M, Mičetić I, Martin CS
  - Issue date: 2024
- Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.
  - Authors: Crider K, Williams J, Qi YP, Gutman J, Yeung L, Mai C, Finkelstain J, Mehta S, Pons-Duran C, Menéndez C, Moraleda C, Rogers L, Daniels K, Green P
  - Issue date: 2022 Feb 1
- Improving the FAIRness and Sustainability of the NHGRI Resources Ecosystem.
  - Authors: Babb L, Bult C, Carey VJ, Carroll RJ, Hitz BC, Mungall CJ, Rehm HL, Schatz MC, Wagner A, NHGRI Resource Workshop Community
  - Issue date: 2025 Aug 19
- Evaluation of a prototype machine learning tool to semi-automate data extraction for systematic literature reviews.
  - Authors: Panayi A, Ward K, Benhadji-Schaff A, Ibanez-Lopez AS, Xia A, Barzilay R
  - Issue date: 2023 Oct 6

