WDRASS: A Web-scale Dataset for Document Retrieval and Answer Sentence Selection
Publisher
Association for Computing Machinery
Citation
Zhang, Z., Vu, T., Gandhi, S., Chadha, A., & Moschitti, A. (2022). WDRASS: A Web-scale Dataset for Document Retrieval and Answer Sentence Selection. International Conference on Information and Knowledge Management, Proceedings, 4707–4711.
Rights
Copyright © 2022 Copyright held by the owner/author(s). This work is licensed under a Creative Commons Attribution International 4.0 License.
Collection Information
This item from the UA Faculty Publications collection is made available by the University of Arizona with support from the University of Arizona Libraries. If you have questions, please contact us at repository@u.library.arizona.edu.
Abstract
Open-Domain Question Answering (ODQA) systems generate answers from relevant text returned by search engines, e.g., lexical features-based such as BM25, or embeddings-based such as dense passage retrieval (DPR). Few datasets are available for this task: they mainly focus on QA systems based on machine reading (MR) approach, and show problematic evaluation, mostly based on uncontextualized short answer matching. In this paper, we present WDRASS, a dataset for ODQA based on answer sentence selection (AS2) models, which consider sentences as candidate answers for QA systems. WDRASS consists of ~64k questions and 800k+ labeled passages and sentences extracted from 30M documents. We evaluate the dataset by training models on it and comparing with the same models trained on Google NQ. Our experiments show that WDRASS significantly improves the performance of retrieval and reranking models, thus boosting the accuracy of downstream QA tasks. We believe our dataset can produce significant impact in advancing IR research. © 2022 Owner/Author.
Note
Open access article
ISBN
9781450392365
Version
Final published version
DOI
10.1145/3511808.3557678