dc.contributor.author: Jansen, P.A.
dc.contributor.author: Smith, K.
dc.contributor.author: Moreno, D.
dc.contributor.author: Ortiz, H.
dc.date.accessioned: 2022-05-19T23:19:49Z
dc.date.available: 2022-05-19T23:19:49Z
dc.date.issued: 2021
dc.identifier.citation: Jansen, P., Smith, K., Moreno, D., & Ortiz, H. (2021). On the Challenges of Evaluating Compositional Explanations in Multi-Hop Inference: Relevance, Completeness, and Expert Ratings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7529–7542, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
dc.identifier.isbn: 9781955917094
dc.identifier.doi: 10.18653/v1/2021.emnlp-main.596
dc.identifier.uri: http://hdl.handle.net/10150/664431
dc.description.abstract: Building compositional explanations requires models to combine two or more facts that, together, describe why the answer to a question is correct. Typically, these “multi-hop” explanations are evaluated relative to one (or a small number of) gold explanations. In this work, we show these evaluations substantially underestimate model performance, both in terms of the relevance of included facts, as well as the completeness of model-generated explanations, because models regularly discover and produce valid explanations that are different than gold explanations. To address this, we construct a large corpus of 126k domain-expert (science teacher) relevance ratings that augment a corpus of explanations to standardized science exam questions, discovering 80k additional relevant facts not rated as gold. We build three strong models based on different methodologies (generation, ranking, and schemas), and empirically show that while expert-augmented ratings provide better estimates of explanation quality, both original (gold) and expert-augmented automatic evaluations still substantially underestimate performance by up to 36% when compared with full manual expert judgements, with different models being disproportionately affected. This poses a significant methodological challenge to accurately evaluating explanations produced by compositional reasoning models. © 2021 Association for Computational Linguistics
dc.language.iso: en
dc.publisher: Association for Computational Linguistics (ACL)
dc.rights: Copyright © 2021 Association for Computational Linguistics, licensed on a Creative Commons Attribution 4.0 International License.
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.title: On the Challenges of Evaluating Compositional Explanations in Multi-Hop Inference: Relevance, Completeness, and Expert Ratings
dc.type: Proceedings
dc.type: text
dc.contributor.department: University of Arizona
dc.identifier.journal: EMNLP 2021 - 2021 Conference on Empirical Methods in Natural Language Processing, Proceedings
dc.description.note: Open access journal
dc.description.collectioninformation: This item from the UA Faculty Publications collection is made available by the University of Arizona with support from the University of Arizona Libraries. If you have questions, please contact us at repository@u.library.arizona.edu.
dc.eprint.version: Final published version
dc.source.journaltitle: EMNLP 2021 - 2021 Conference on Empirical Methods in Natural Language Processing, Proceedings
refterms.dateFOA: 2022-05-19T23:19:49Z


Files in this item

Name: 2021.emnlp-main.596.pdf
Size: 7.831Mb
Format: PDF
Description: Final Published Version

Except where otherwise noted, this item's license is described as Copyright © 2021 Association for Computational Linguistics, licensed on a Creative Commons Attribution 4.0 International License.