ByteSized32: A Corpus and Challenge Task for Generating Task-Specific World Models Expressed as Text Games
Name:
2023.emnlp-main.830.pdf
Size:
456.8Kb
Format:
PDF
Description:
Final Published Version
Citation
Ruoyao Wang, Graham Todd, Xingdi Yuan, Ziang Xiao, Marc-Alexandre Côté, and Peter Jansen. 2023. ByteSized32: A Corpus and Challenge Task for Generating Task-Specific World Models Expressed as Text Games. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13455–13471, Singapore. Association for Computational Linguistics.
Journal
EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings
Rights
© 2023 Association for Computational Linguistics. ACL materials are Copyright © 1963–2024 ACL; other materials are copyrighted by their respective copyright holders. Materials prior to 2016 here are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 International License.
Collection Information
This item from the UA Faculty Publications collection is made available by the University of Arizona with support from the University of Arizona Libraries. If you have questions, please contact us at repository@u.library.arizona.edu.
Abstract
In this work we investigate the capacity of language models to generate explicit, interpretable, and interactive world models of scientific and common-sense reasoning tasks. We operationalize this as a task of generating text games, expressed as hundreds of lines of Python code. To facilitate this task, we introduce ByteSized32, a corpus of 32 reasoning-focused text games totalling 20k lines of Python code. We empirically demonstrate that GPT-4 can use these games as templates for single-shot in-context learning, successfully producing runnable games on unseen topics in 28% of cases. When allowed to self-reflect on program errors, game runnability substantially increases to 57%. While evaluating simulation fidelity is labor intensive, we introduce a suite of automated metrics to assess game fidelity, technical validity, adherence to task specifications, and winnability, showing a high degree of agreement with expert human ratings. We pose this as a challenge task to spur further development at the juncture of world modeling and code generation. © 2023 Association for Computational Linguistics.
Note
Open access journal
ISSN
xxxx-xxxx
ISBN
979-889176060-8
Version
Final Published Version
10.18653/v1/2023.emnlp-main.830

