NCBI's Virus Discovery Hackathon: Engaging Research Communities to Identify Cloud Infrastructure Requirements
Author
Connor, Ryan
Brister, Rodney
Buchmann, Jan P
Deboutte, Ward
Edwards, Rob
Martí-Carreras, Joan
Tisza, Mike
Zalunin, Vadim
Andrade-Martínez, Juan
Cantu, Adrian
D'Amour, Michael
Efremov, Alexandre
Fleischmann, Lydia
Forero-Junco, Laura
Garmaeva, Sanzhima
Giluso, Melissa
Glickman, Cody
Henderson, Margaret
Kellman, Benjamin
Kristensen, David
Leubsdorf, Carl
Levi, Kyle
Levi, Shane
Pakala, Suman
Peddu, Vikas
Ponsero, Alise
Ribeiro, Eldred
Roy, Farrah
Rutter, Lindsay
Saha, Surya
Shakya, Migun
Shean, Ryan
Miller, Matthew
Tully, Benjamin
Turkington, Christopher
Youens-Clark, Ken
Vanmechelen, Bert
Busby, Ben
Affiliation
Univ Arizona, Dept Biosyst Engn
Issue Date
2019-09
Publisher
MDPI
Citation
Connor, R.; Brister, R.; Buchmann, J.P.; Deboutte, W.; Edwards, R.; Martí-Carreras, J.; Tisza, M.; Zalunin, V.; Andrade-Martínez, J.; Cantu, A.; D’Amour, M.; Efremov, A.; Fleischmann, L.; Forero-Junco, L.; Garmaeva, S.; Giluso, M.; Glickman, C.; Henderson, M.; Kellman, B.; Kristensen, D.; Leubsdorf, C.; Levi, K.; Levi, S.; Pakala, S.; Peddu, V.; Ponsero, A.; Ribeiro, E.; Roy, F.; Rutter, L.; Saha, S.; Shakya, M.; Shean, R.; Miller, M.; Tully, B.; Turkington, C.; Youens-Clark, K.; Vanmechelen, B.; Busby, B. NCBI’s Virus Discovery Hackathon: Engaging Research Communities to Identify Cloud Infrastructure Requirements. Genes 2019, 10, 714.
Journal
GENES
Rights
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Collection Information
This item from the UA Faculty Publications collection is made available by the University of Arizona with support from the University of Arizona Libraries. If you have questions, please contact us at repository@u.library.arizona.edu.
Abstract
A wealth of viral data sits untapped in publicly available metagenomic data sets when it might be extracted to create a usable index for the virological research community. We hypothesized that work of this complexity and scale could be done in a hackathon setting. Ten teams comprising over 40 participants from six countries assembled to create a crowd-sourced set of analysis and processing pipelines for a complex biological data set in a three-day event on the San Diego State University campus starting 9 January 2019. Prior to the hackathon, 141,676 metagenomic data sets from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) were pre-assembled into contiguous assemblies (contigs) by NCBI staff. During the hackathon, a subset consisting of 2953 SRA data sets (approximately 55 million contigs) was selected and further filtered for a minimal length of 1 kb. This resulted in 4.2 million (Mio) contigs, which were aligned using BLAST against all known virus genomes, phylogenetically clustered, and assigned metadata. Out of the 4.2 Mio contigs, 360,000 contigs were labeled with domains and an additional subset containing 4400 contigs was screened for virus or virus-like genes. The work yielded valuable insights into both SRA data and the cloud infrastructure required to support such efforts, revealing analysis bottlenecks and possible workarounds. Mainly: (i) conservative assemblies of SRA data improve initial analysis steps; (ii) existing bioinformatic software with weak multithreading/multicore support can be elevated by wrapper scripts to use all cores within a computing node; (iii) redesigning existing bioinformatic algorithms for a cloud infrastructure facilitates its use by a wider audience; and (iv) a cloud infrastructure allows a diverse group of researchers to collaborate effectively. The scientific findings will be extended during a follow-up event.
Here, we present the applied workflows, initial results, and lessons learned from the hackathon.
Note
Open access journal
ISSN
2073-4425
PubMed ID
31527408
Version
Final published version
Sponsors
Intramural Research Program of the National Library of Medicine, United States Department of Health & Human Services, National Institutes of Health (NIH) - USA, NIH National Library of Medicine (NLM); HONOURs Marie-Sklodowska-Curie training network [721367]; National Cancer Institute, United States Department of Health & Human Services, National Institutes of Health (NIH) - USA, NIH National Cancer Institute (NCI) [1R35CA220523-01A1]; Graduate School of Medical Sciences, University of Groningen
DOI
10.3390/genes10090714
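Lesson (ii) in the abstract, using wrapper scripts so that software with weak multicore support can occupy every core of a computing node, might look roughly like the following. This is a minimal sketch and not the wrapper used at the hackathon: the function name `run_on_all_cores`, its parameters, and the stand-in "tool" (the Python interpreter printing the length of its argument) are all assumptions made for illustration.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_on_all_cores(command, inputs, workers=None):
    """Run `command + [item]` once per input, with up to `workers` processes at a time.

    Hypothetical helper: returns (input, return code, stdout) tuples in input order.
    """
    workers = workers or os.cpu_count() or 1

    def run_one(item):
        # Each call starts one independent process of the single-threaded tool.
        done = subprocess.run(command + [item], capture_output=True, text=True)
        return item, done.returncode, done.stdout

    # Threads are enough here: they only wait on child processes, which the
    # operating system schedules across the available cores.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, inputs))


if __name__ == "__main__":
    # Stand-in for a single-threaded tool (an assumption for this sketch);
    # a real wrapper would name the tool's executable and pass input files.
    tool = [sys.executable, "-c", "import sys; print(len(sys.argv[1]))"]
    for item, code, out in run_on_all_cores(tool, ["ACGT", "ACGTACGT"]):
        print(item, code, out.strip())
```

In practice the inputs would be chunks of a larger data set, split so that each chunk can be processed independently and the per-chunk outputs merged afterwards.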
Related articles
- VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data.
- Authors: Ren J, Ahlgren NA, Lu YY, Fuhrman JA, Sun F
- Issue date: 2017 Jul 6
- ViraPipe: scalable parallel pipeline for viral metagenome analysis from next generation sequencing reads.
- Authors: Maarala AI, Bzhalava Z, Dillner J, Heljanko K, Bzhalava D
- Issue date: 2018 Mar 15
- Comparison of different assembly and annotation tools on analysis of simulated viral metagenomic communities in the gut.
- Authors: Vázquez-Castellanos JF, García-López R, Pérez-Brocal V, Pignatelli M, Moya A
- Issue date: 2014 Jan 18
- PARTIE: a partition engine to separate metagenomic and amplicon projects in the Sequence Read Archive.
- Authors: Torres PJ, Edwards RA, McNair KA
- Issue date: 2017 Aug 1
- IDseq-An open source cloud-based pipeline and analysis service for metagenomic pathogen detection and monitoring.
- Authors: Kalantar KL, Carvalho T, de Bourcy CFA, Dimitrov B, Dingle G, Egger R, Han J, Holmes OB, Juan YF, King R, Kislyuk A, Lin MF, Mariano M, Morse T, Reynoso LV, Cruz DR, Sheu J, Tang J, Wang J, Zhang MA, Zhong E, Ahyong V, Lay S, Chea S, Bohl JA, Manning JE, Tato CM, DeRisi JL
- Issue date: 2020 Oct 15