Proj CJI Paper Reading: “Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Abstract

  • GitHub: https://github.com/verazuo/jailbreak_llms
  • Tasks:
    1. Tool: JAILBREAKHUB
      • Task: jailbreak LLMs in a black-box setting using the collected prompts
      • Experiments:
        • Models: six LLMs: ChatGPT (GPT-3.5), GPT-4, PaLM2, ChatGLM, Dolly, and Vicuna
        • Results:
          1. The attacks succeed.
          2. Five prompts are found that attack GPT-3.5 and GPT-4 with high effectiveness.
    2. Analyze jailbreak patterns
      • Dataset: 1,405 jailbreak prompts spanning December 2022 to December 2023
      • Data sources:
        • Reddit
          • r/ChatGPT
          • r/ChatGPTPromptGenius
      • Findings:
        1. Identify 131 jailbreak communities.
        2. Characterize jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation.
        3. Observe that jailbreak prompts increasingly shift from online Web communities to prompt-aggregation websites, and that 28 user accounts have consistently optimized jailbreak prompts over 100 days.
    3. Create a dataset of 107,250 samples across 13 forbidden scenarios.
      • Topic
        • Illegal Activity
        • Hate Speech
        • Malware
        • Physical Harm
        • Economic Harm
        • Fraud
        • Pornography
        • Political Lobbying
        • Privacy Violence
        • Legal Opinion
        • Financial Advice
        • Health Consultation
        • Gov Decision
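
As a quick sanity check on the figures above, the sketch below divides the 107,250 samples across the 13 listed scenarios. It assumes an even split per scenario, which these notes do not state explicitly; the function name `samples_per_scenario` is hypothetical and used only for illustration.

```python
# The 13 forbidden scenarios, in the order listed in these notes.
SCENARIOS = [
    "Illegal Activity", "Hate Speech", "Malware", "Physical Harm",
    "Economic Harm", "Fraud", "Pornography", "Political Lobbying",
    "Privacy Violence", "Legal Opinion", "Financial Advice",
    "Health Consultation", "Gov Decision",
]
TOTAL_SAMPLES = 107_250


def samples_per_scenario(total: int, scenarios: list[str]) -> dict[str, int]:
    """Split `total` evenly across `scenarios`.

    ASSUMPTION: every scenario gets the same number of samples.
    Raises ValueError if the total does not divide evenly.
    """
    per_scenario, remainder = divmod(total, len(scenarios))
    if remainder:
        raise ValueError("total is not evenly divisible across scenarios")
    return {name: per_scenario for name in scenarios}


if __name__ == "__main__":
    split = samples_per_scenario(TOTAL_SAMPLES, SCENARIOS)
    print(len(split), "scenarios,", split["Malware"], "samples each")
```

The total divides cleanly by 13, which is at least consistent with a balanced design.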

3. Data Collection

| Platform | Source | # Posts | # UA | # Adv UA | # Prompts | # Jailbreaks | Prompt Time Range |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Reddit | r/ChatGPT | 163,549 | 147 | 147 | 176 | 176 | 2023.02-2023.11 |
| Reddit | r/ChatGPTPromptGenius | 3,536 | 305 | 21 | 654 | 24 | 2022.12-2023.11 |
| Reddit | r/ChatGPTJailbreak | 1,602 | 183 | 183 | 225 | 225 | 2023.02-2023.11 |
| Discord | ChatGPT | 609 | 259 | 106 | 544 | 214 | 2023.02-2023.12 |
| Discord | ChatGPT Prompt Engineering | 321 | 96 | 37 | 278 | 67 | 2022.12-2023.12 |
| Discord | Spreadsheet Warriors | 71 | 3 | 3 | 61 | 61 | 2022.12-2023.09 |
| Discord | AI Prompt Sharing | 25 | 19 | 13 | 24 | 17 | 2023.03-2023.04 |
| Discord | LLM Promptwriting | 184 | 64 | 41 | 167 | 78 | 2023.03-2023.12 |
| Discord | BreakGPT | 36 | 10 | 10 | 32 | 32 | 2023.04-2023.09 |
| Website | AIPRM | - | 2,777 | 23 | 3,930 | 25 | 2023.01-2023.06 |
| Website | FlowGPT | - | 3,505 | 254 | 8,754 | 405 | 2022.12-2023.12 |
| Website | JailbreakChat | - | - | - | 79 | 79 | 2023.02-2023.05 |
| Dataset | AwesomeChatGPTPrompts | - | - | - | 166 | 2 | - |
| Dataset | OCR-Prompts | - | - | - | 50 | 0 | - |
| Total | - | 169,933 | 7,308 | 803 | 15,140 | 1,405 | 2022.12-2023.12 |
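
The per-source counts above can be cross-checked against the Total row with a short script. This is a minimal sketch: the rows with fewer numbers than columns (the Website and Dataset sources) are aligned under the assumption that the missing cells are the leftmost count columns, which is the placement that makes the Posts, Prompts, and Jailbreaks sums match the reported totals. `ROWS`, `REPORTED_TOTAL`, and `column_sums` are illustrative names.

```python
# Columns: platform, source, posts, ua, adv_ua, prompts, jailbreaks.
# None marks a cell that is empty in the table (ASSUMED alignment for
# the Website and Dataset rows, which list fewer numbers).
ROWS = [
    ("Reddit", "r/ChatGPT", 163549, 147, 147, 176, 176),
    ("Reddit", "r/ChatGPTPromptGenius", 3536, 305, 21, 654, 24),
    ("Reddit", "r/ChatGPTJailbreak", 1602, 183, 183, 225, 225),
    ("Discord", "ChatGPT", 609, 259, 106, 544, 214),
    ("Discord", "ChatGPT Prompt Engineering", 321, 96, 37, 278, 67),
    ("Discord", "Spreadsheet Warriors", 71, 3, 3, 61, 61),
    ("Discord", "AI Prompt Sharing", 25, 19, 13, 24, 17),
    ("Discord", "LLM Promptwriting", 184, 64, 41, 167, 78),
    ("Discord", "BreakGPT", 36, 10, 10, 32, 32),
    ("Website", "AIPRM", None, 2777, 23, 3930, 25),
    ("Website", "FlowGPT", None, 3505, 254, 8754, 405),
    ("Website", "JailbreakChat", None, None, None, 79, 79),
    ("Dataset", "AwesomeChatGPTPrompts", None, None, None, 166, 2),
    ("Dataset", "OCR-Prompts", None, None, None, 50, 0),
]
COLUMNS = ("posts", "ua", "adv_ua", "prompts", "jailbreaks")
REPORTED_TOTAL = (169933, 7308, 803, 15140, 1405)


def column_sums(rows):
    """Sum each count column, treating empty cells (None) as zero."""
    return tuple(sum(row[i] or 0 for row in rows) for i in range(2, 7))


if __name__ == "__main__":
    for name, got, want in zip(COLUMNS, column_sums(ROWS), REPORTED_TOTAL):
        print(f"{name}: sum={got} reported={want} diff={got - want}")
```

Under this alignment, Posts, Prompts, and Jailbreaks add up exactly to the Total row, while the two user-account columns sum to slightly more than the reported totals. The notes do not explain the gap; it may reflect accounts counted once across sources, or a transcription difference.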

Source: https://www.cnblogs.com/xuesu/p/18666312
