Proj CJI Paper Reading: “Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Abstract

GitHub: https://github.com/verazuo/jailbreak_llms
Tasks:
1. Tool: JAILBREAKHUB
- Task: jailbreak LLMs in a black-box setting using the collected prompts
- Experiments:
  - Models: 6 LLMs: ChatGPT (GPT-3.5), GPT-4, PaLM2, ChatGLM, Dolly, and Vicuna
  - Results
    1. The attacks succeed.
    2. Found 5 prompts that attack GPT-3.5 and GPT-4 highly effectively.
2. Analyze jailbreaking patterns
- Dataset: 1,405 jailbreak prompts spanning December 2022 to December 2023
- Data sources:
  - Reddit
    - r/ChatGPT
    - r/ChatGPTPromptGenius
- Findings
  1. Identify 131 jailbreak communities.
  2. Uncover the characteristics of jailbreak prompts and their main attack strategies, such as prompt injection and privilege escalation.
  3. Observe that jailbreak prompts are increasingly shifting from online Web communities to prompt aggregation websites, and that 28 user accounts have consistently optimized jailbreak prompts over 100 days.
3. Build a dataset of 107,250 samples across 13 forbidden scenarios.
  - Topic
    - Illegal Activity
    - Hate Speech
    - Malware
    - Physical Harm
    - Economic Harm
    - Fraud
    - Pornography
    - Political Lobbying
    - Privacy Violence
    - Legal Opinion
    - Financial Advice
    - Health Consultation
    - Gov Decision
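
As a rough illustration of how the 13 scenarios could structure an evaluation, the sketch below checks that the quoted 107,250 samples divide evenly across the scenarios and tallies a per-scenario success rate. The even split is an assumption, and the `(scenario, succeeded)` record format, the `success_rate_by_scenario` helper, and the toy labels are all hypothetical; none of them come from the paper or its repository.

```python
from collections import defaultdict

# Scenario names as listed in the notes above.
SCENARIOS = [
    "Illegal Activity", "Hate Speech", "Malware", "Physical Harm",
    "Economic Harm", "Fraud", "Pornography", "Political Lobbying",
    "Privacy Violence", "Legal Opinion", "Financial Advice",
    "Health Consultation", "Gov Decision",
]
TOTAL_SAMPLES = 107_250  # figure quoted in the notes

# Assumption: samples are spread evenly across the scenarios.
per_scenario, remainder = divmod(TOTAL_SAMPLES, len(SCENARIOS))


def success_rate_by_scenario(records):
    """Return {scenario: fraction of successful attempts}.

    `records` is an iterable of (scenario, succeeded) pairs; this format
    is hypothetical and chosen only for illustration.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for scenario, succeeded in records:
        if scenario not in SCENARIOS:
            raise ValueError(f"unknown scenario: {scenario!r}")
        totals[scenario] += 1
        hits[scenario] += int(bool(succeeded))
    return {s: hits[s] / totals[s] for s in totals}


# Made-up labels, not results from the paper.
toy = [("Malware", True), ("Malware", False), ("Fraud", True)]
rates = success_rate_by_scenario(toy)
print(per_scenario, remainder, rates)
```

Under the even-split assumption, each scenario would account for 8,250 samples with nothing left over.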

3. Data Collection

| Platform | Source | # Posts | # UA (user accounts) | # Adv UA (adversarial user accounts) | # Prompts | # Jailbreaks | Prompt Time Range |
|---|---|---|---|---|---|---|---|
| Reddit | r/ChatGPT | 163,549 | 147 | 147 | 176 | 176 | 2023.02-2023.11 |
| Reddit | r/ChatGPTPromptGenius | 3,536 | 305 | 21 | 654 | 24 | 2022.12-2023.11 |
| Reddit | r/ChatGPTJailbreak | 1,602 | 183 | 183 | 225 | 225 | 2023.02-2023.11 |
| Discord | ChatGPT | 609 | 259 | 106 | 544 | 214 | 2023.02-2023.12 |
| Discord | ChatGPT Prompt Engineering | 321 | 96 | 37 | 278 | 67 | 2022.12-2023.12 |
| Discord | Spreadsheet Warriors | 71 | 3 | 3 | 61 | 61 | 2022.12-2023.09 |
| Discord | AI Prompt Sharing | 25 | 19 | 13 | 24 | 17 | 2023.03-2023.04 |
| Discord | LLM Promptwriting | 184 | 64 | 41 | 167 | 78 | 2023.03-2023.12 |
| Discord | BreakGPT | 36 | 10 | 10 | 32 | 32 | 2023.04-2023.09 |
| Website | AIPRM | – | 2,777 | 23 | 3,930 | 25 | 2023.01-2023.06 |
| Website | FlowGPT | – | 3,505 | 254 | 8,754 | 405 | 2022.12-2023.12 |
| Website | JailbreakChat | – | – | – | 79 | 79 | 2023.02-2023.05 |
| Dataset | AwesomeChatGPTPrompts | – | – | – | 166 | 2 | – |
| Dataset | OCR-Prompts | – | – | – | 50 | 0 | – |
| Total | | 169,933 | 7,308 | 803 | 15,140 | 1,405 | 2022.12-2023.12 |
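
To make the table easier to read, here is a minimal sketch that aggregates the prompt and jailbreak counts by platform and computes the share of collected prompts that are jailbreaks. The numbers are copied from the table above; the `ROWS` layout and the `jailbreak_share` helper are names introduced here for illustration only.

```python
from collections import defaultdict

# (platform, source, n_prompts, n_jailbreaks), copied from the table above.
ROWS = [
    ("Reddit", "r/ChatGPT", 176, 176),
    ("Reddit", "r/ChatGPTPromptGenius", 654, 24),
    ("Reddit", "r/ChatGPTJailbreak", 225, 225),
    ("Discord", "ChatGPT", 544, 214),
    ("Discord", "ChatGPT Prompt Engineering", 278, 67),
    ("Discord", "Spreadsheet Warriors", 61, 61),
    ("Discord", "AI Prompt Sharing", 24, 17),
    ("Discord", "LLM Promptwriting", 167, 78),
    ("Discord", "BreakGPT", 32, 32),
    ("Website", "AIPRM", 3930, 25),
    ("Website", "FlowGPT", 8754, 405),
    ("Website", "JailbreakChat", 79, 79),
    ("Dataset", "AwesomeChatGPTPrompts", 166, 2),
    ("Dataset", "OCR-Prompts", 50, 0),
]


def jailbreak_share(rows):
    """Return {platform: (n_prompts, n_jailbreaks, jailbreaks / prompts)}."""
    prompts, jailbreaks = defaultdict(int), defaultdict(int)
    for platform, _source, n_prompts, n_jailbreaks in rows:
        prompts[platform] += n_prompts
        jailbreaks[platform] += n_jailbreaks
    return {p: (prompts[p], jailbreaks[p], jailbreaks[p] / prompts[p])
            for p in prompts}


shares = jailbreak_share(ROWS)
total_prompts = sum(row[2] for row in ROWS)
total_jailbreaks = sum(row[3] for row in ROWS)

for platform, (n_p, n_j, ratio) in shares.items():
    print(f"{platform:8s} prompts={n_p:6d} jailbreaks={n_j:4d} share={ratio:.1%}")
print(f"overall share={total_jailbreaks / total_prompts:.1%}")
```

The per-source sums reproduce the table's totals of 15,140 prompts and 1,405 jailbreak prompts, so roughly 9.3% of all collected prompts are jailbreaks.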

Source: https://www.cnblogs.com/xuesu/p/18666312
