UniSa - IRIS Institutional Research Information System

In recent years, significant advancements in Generative Artificial Intelligence (GenAI) have resulted in an expansion of its applications and an increased susceptibility to cyber-attacks. These attacks can bypass ethical guidelines and integrated protections, posing a significant threat to cybersecurity and information integrity, such as prompt injections aiming to produce and spread misinformation. To develop robust cybersecurity solutions, it is essential to understand the vulnerabilities of GenAI and analyze the characteristics of potential attacks. This manuscript proposes a mechanism for identifying jailbreaking prompts for manipulating Large Language Models (LLMs). The designed process involves the interaction between an LLM Attacker and an LLM Victim. LLM Attacker generates potential jailbreaking prompts to induce the LLM Victim to generate unethical content. The prompts and their corresponding persuasion success are collected during their interaction. In this way, a new synthetic dataset of 3000 prompts has been constructed. Such a dataset is exploited to train a new model for detecting hidden persuasion in prompts that can induce an LLM to produce deviating content. This new model, assisted by algorithms for eXaplainable Artificial Intelligence (xAI), works as an anti-persuasion f ilter interposed between the input prompt and the victim model. It identifies attempts to mislead LLM and tries to neutralize them by modifying words recognized as crucial by xAI algorithms like SHAP and LIME. Experimentation reveals that adopting SHAP and removing the first ten most important words in the original prompt allows for neutralizing.80% of persuasive prompts.

Detecting Jailbreaking Prompts: an Anti-Persuasion Filter Framework

Giuseppe Fenza;Mariacristina Gallo;Vincenzo Loia;Alessandro Nicolosi;Claudio Stanzione

2024

Abstract

In recent years, significant advancements in Generative Artificial Intelligence (GenAI) have resulted in an expansion of its applications and an increased susceptibility to cyber-attacks. These attacks can bypass ethical guidelines and integrated protections, posing a significant threat to cybersecurity and information integrity, such as prompt injections aiming to produce and spread misinformation. To develop robust cybersecurity solutions, it is essential to understand the vulnerabilities of GenAI and analyze the characteristics of potential attacks. This manuscript proposes a mechanism for identifying jailbreaking prompts for manipulating Large Language Models (LLMs). The designed process involves the interaction between an LLM Attacker and an LLM Victim. LLM Attacker generates potential jailbreaking prompts to induce the LLM Victim to generate unethical content. The prompts and their corresponding persuasion success are collected during their interaction. In this way, a new synthetic dataset of 3000 prompts has been constructed. Such a dataset is exploited to train a new model for detecting hidden persuasion in prompts that can induce an LLM to produce deviating content. This new model, assisted by algorithms for eXaplainable Artificial Intelligence (xAI), works as an anti-persuasion f ilter interposed between the input prompt and the victim model. It identifies attempts to mislead LLM and tries to neutralize them by modifying words recognized as crucial by xAI algorithms like SHAP and LIME. Experimentation reveals that adopting SHAP and removing the first ten most important words in the original prompt allows for neutralizing.80% of persuasive prompts.

Scheda breve

Scheda completa

Scheda completa (DC)

Anno

2024

Appare nelle tipologie:

4.1 Contributi in Atti di convegno

File in questo prodotto:

Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11386/4920490

Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni

ND

ND

ND

social impact