Show simple item record

dc.contributor.authorSchulhoff, S.
dc.contributor.authorPinto, J.
dc.contributor.authorKhan, A.
dc.contributor.authorBouchard, L.-F.
dc.contributor.authorSi, C.
dc.contributor.authorAnati, S.
dc.contributor.authorTagliabue, V.
dc.contributor.authorKost, A.L.
dc.contributor.authorCarnahan, C.
dc.contributor.authorBoyd-Graber, J.
dc.date.accessioned2024-08-03T03:55:17Z
dc.date.available2024-08-03T03:55:17Z
dc.date.issued2023-12-06
dc.identifier.citationSander Schulhoff, Jeremy Pinto, Anaum Khan, Louis-François Bouchard, Chenglei Si, Svetlina Anati, Valen Tagliabue, Anson Kost, Christopher Carnahan, and Jordan Boyd-Graber. 2023. Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4945–4977, Singapore. Association for Computational Linguistics.
dc.identifier.isbn979-889176060-8
dc.identifier.issnxxxx-xxxx
dc.identifier.doi10.18653/v1/2023.emnlp-main.302
dc.identifier.urihttp://hdl.handle.net/10150/673142
dc.description.abstractLarge Language Models (LLMs) are deployed in interactive contexts with direct user engagement, such as chatbots and writing assistants. These deployments are vulnerable to prompt injection and jailbreaking (collectively, prompt hacking), in which models are manipulated to ignore their original instructions and follow potentially malicious ones. Although widely acknowledged as a significant security threat, there is a dearth of large-scale resources and quantitative studies on prompt hacking. To address this lacuna, we launch a global prompt hacking competition, which allows for free-form human input attacks. We elicit 600K+ adversarial prompts against three state-of-the-art LLMs. We describe the dataset, which empirically verifies that current LLMs can indeed be manipulated via prompt hacking. We also present a comprehensive taxonomical ontology of the types of adversarial prompts. © 2023 Association for Computational Linguistics.
dc.language.isoen
dc.publisherAssociation for Computational Linguistics (ACL)
dc.rights© 2023 Association for Computational Linguistics. ACL materials are Copyright © 1963–2024 ACL; other materials are copyrighted by their respective copyright holders. Materials prior to 2016 here are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 International License.
dc.rights.urihttps://creativecommons.org/licenses/by-nc-sa/3.0/deed.en
dc.titleIgnore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition
dc.typeProceedingts
dc.typetext
dc.contributor.departmentUniversity of Arizona
dc.identifier.journalEMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings
dc.description.noteOpen access journal
dc.description.collectioninformationThis item from the UA Faculty Publications collection is made available by the University of Arizona with support from the University of Arizona Libraries. If you have questions, please contact us at repository@u.library.arizona.edu.
dc.eprint.versionFinal Published Version
dc.source.journaltitleEMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings
refterms.dateFOA2024-08-03T03:55:17Z


Files in this item

Thumbnail
Name:
2023.emnlp-main.302.pdf
Size:
4.368Mb
Format:
PDF
Description:
Final Published Version

This item appears in the following Collection(s)

Show simple item record

© 2023 Association for Computational Linguistics. ACL materials are Copyright © 1963–2024 ACL; other materials are copyrighted by their respective copyright holders. Materials prior to 2016 here are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 International License.
Except where otherwise noted, this item's license is described as © 2023 Association for Computational Linguistics. ACL materials are Copyright © 1963–2024 ACL; other materials are copyrighted by their respective copyright holders. Materials prior to 2016 here are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 International License.