WDRASS: A Web-scale Dataset for Document Retrieval and Answer Sentence Selection
Affiliation: University of Arizona
Publisher: Association for Computing Machinery
Citation: Zhang, Z., Vu, T., Gandhi, S., Chadha, A., & Moschitti, A. (2022). WDRASS: A Web-scale Dataset for Document Retrieval and Answer Sentence Selection. International Conference on Information and Knowledge Management, Proceedings, 4707–4711.
Rights: Copyright © 2022 held by the owner/author(s). This work is licensed under a Creative Commons Attribution 4.0 International License.
Collection Information: This item from the UA Faculty Publications collection is made available by the University of Arizona with support from the University of Arizona Libraries. If you have questions, please contact us at firstname.lastname@example.org.
Abstract: Open-Domain Question Answering (ODQA) systems generate answers from relevant text returned by search engines, e.g., lexical-feature-based methods such as BM25, or embedding-based methods such as dense passage retrieval (DPR). Few datasets are available for this task: they mainly focus on QA systems based on the machine reading (MR) approach, and exhibit problematic evaluation, mostly based on uncontextualized short-answer matching. In this paper, we present WDRASS, a dataset for ODQA based on answer sentence selection (AS2) models, which consider sentences as candidate answers for QA systems. WDRASS consists of ∼64k questions and 800k+ labeled passages and sentences extracted from 30M documents. We evaluate the dataset by training models on it and comparing with the same models trained on Google NQ. Our experiments show that WDRASS significantly improves the performance of retrieval and reranking models, thus boosting the accuracy of downstream QA tasks. We believe our dataset can produce significant impact in advancing IR research.
Note: Open access article
Version: Final published version