Enhancing Jailbreak Attacks via Compliance-Refusal-Based Initialization

Amit LeVi*1, Rom Himelstein*2, Yaniv Nemcovsky*1, Avi Mendelson1, Chaim Baskin3

1Department of Computer Science, Technion - Israel Institute of Technology
2Department of Data and Decision Science, Technion - Israel Institute of Technology
3School of Electrical and Computer Engineering, Ben-Gurion University of the Negev

*Indicates Equal Contribution

Abstract

Jailbreak attacks aim to exploit large language models (LLMs) and pose a significant threat to their proper conduct; they seek to bypass models’ safeguards and often provoke transgressive behaviors. However, existing automatic jailbreak attacks require extensive computational resources and are prone to converge on suboptimal solutions. In this work, we propose Compliance-Refusal Initialization (CRI), a novel, attack-agnostic framework that efficiently initializes the optimization in the proximity of the compliance subspace of harmful prompts. By narrowing the initial gap to the adversarial objective, CRI substantially improves the adversarial success rate (ASR) and drastically reduces computational overhead, often requiring just a single optimization step. We evaluate CRI on the widely used AdvBench dataset over the standard jailbreak attacks GCG and AutoDAN. Results show that CRI boosts ASR and decreases the median steps to success by up to ×60.
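
To make the initialization idea concrete, the following is a minimal LaTeX sketch of the standard adversarial-suffix objective that attacks such as GCG optimize; the notation (x, s, y+) is ours and is assumed for illustration rather than taken from the paper.

% Standard adversarial-suffix objective (notation is an assumption): given a
% harmful prompt x and an affirmative target y+ (e.g., "Sure, here is ..."),
% the attack searches for a suffix s minimizing the negative log-likelihood of y+.
\[
  s^{\star} \;=\; \arg\min_{s} \, \mathcal{L}(s)
            \;=\; \arg\min_{s} \, -\log p_{\theta}\!\left(y^{+} \mid x \oplus s\right)
\]
% CRI swaps the default initialization s_0 (typically a string of "!" tokens)
% for a suffix pre-optimized on other harmful prompts, so the starting loss
% L(s_0) already lies near the compliance subspace.

Because the starting loss is already small, the subsequent attack often needs only a handful of updates, which is consistent with the single-step successes mentioned above.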

We also tested the robustness of one of the DeepSeek models by running the GCG jailbreak attack with CRI initialization on the HarmBench benchmark, demonstrating faster convergence and a high adversarial success rate.

Compliance & Refusal

Figure: Visualization highlighting the “Compliance-Refusal” direction and its overlapping jailbreak embeddings.

Introduction

LLMs have recently emerged with extraordinary capabilities and have rapidly become integral to numerous fields, transforming everyday tasks such as text generation, image generation, and complex decision-making. Despite their advantages, the widespread deployment of LLMs has unveiled critical security vulnerabilities, making them susceptible to malicious misuse. A common strategy to enhance the robustness of LLMs is alignment training, which effectively separates the input space into Compliance and Refusal subspaces. However, this segmentation has inadvertently fueled adversarial jailbreak attacks that transform inputs to coerce models into producing unwanted compliance outputs.

In our work, we leverage the distinction between refusal and compliance behaviors. By initializing jailbreak prompts in the proximity of the compliance subspace, we show that CRI can significantly enhance attack success rates while reducing computational overhead. Our experimental results indicate that the alignment process itself may induce vulnerable directions in the latent representation of LLMs, which attackers can exploit. Understanding and exposing these vulnerabilities is a crucial step toward developing more secure and robust language models.
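
To illustrate the latent-space intuition above, and the kind of picture shown in the Compliance & Refusal visualization, the following Python sketch estimates a compliance-refusal direction as the difference between the mean hidden states of compliance-style and refusal-style texts. It is a minimal sketch under our own assumptions: the model name, layer choice, and toy text lists are illustrative placeholders, not the paper's setup.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model choice is an assumption for illustration; any aligned chat LLM works.
MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Toy compliance- and refusal-style texts (placeholders, not the paper's data).
compliance_texts = [
    "Sure, here is a step-by-step explanation:",
    "Of course, here is the information you asked for:",
]
refusal_texts = [
    "I'm sorry, but I can't help with that request.",
    "I cannot assist with that.",
]

@torch.no_grad()
def mean_hidden_state(texts, layer=-1):
    """Average the last-token hidden state at a chosen layer over a list of texts."""
    states = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        outputs = model(**inputs, output_hidden_states=True)
        states.append(outputs.hidden_states[layer][0, -1, :])  # last token
    return torch.stack(states).mean(dim=0)

# The "compliance-refusal" direction: where compliance activations sit
# relative to refusal activations in the model's hidden space.
direction = mean_hidden_state(compliance_texts) - mean_hidden_state(refusal_texts)
direction = direction / direction.norm()

Projecting prompt representations onto such a direction is one way to obtain a two-subspace view like the visualization above; a jailbreak succeeds when the optimization moves a harmful prompt's representation toward the compliance side.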

The remainder of this document details our CRI approach, experimental findings, and a discussion of broader impacts and ethical considerations.

Framework

Figure: A high-level overview of our Compliance-Refusal-Based Initialization (CRI) process.

Overall Results

We summarize our findings across AdvBench, demonstrating that CRI consistently reduces the number of optimization steps required to achieve a successful jailbreak while raising adversarial success rates. These results reinforce the practical impact of CRI and highlight potential vulnerabilities in alignment-trained LLMs.

Figure: Aggregated performance metrics showcasing CRI’s efficacy across various scenarios.

Figure: ASR vs. optimization steps.

Figure: Loss vs. optimization steps.

Fast LLM Jailbreak Robustness Test with GCG + CRI (HarmBench in minutes)

To further demonstrate the effectiveness of CRI, we conducted a fast LLM jailbreak robustness test on one of the DeepSeek models using the GCG + CRI jailbreak attack on the HarmBench benchmark. Below is an overview of how we set up and ran our experiments:

  1. Applying CRI: Before launching the jailbreak attack, we generated a set of CRI-based initial transformations. These transformations were pre-trained once, using a few samples from HarmBench, to place the prompts closer to the compliance subspace.
  2. Running the GCG Attack: We then ran GCG from the CRI-initialized prompts. GCG iteratively refines the prompt by leveraging gradient signals, and the more informative starting point provided by CRI let it converge more quickly toward successful jailbreaks, reducing the total number of optimization steps.
  3. Measuring Robustness: We measured the adversarial success rate (ASR), the number of steps to success, and the overall computational overhead. On the DeepSeek model, prompts initialized with CRI (ours) converged to harmful completions faster than those with standard or random initializations, highlighting CRI’s efficiency. A code sketch of this pipeline follows the list.
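
The sketch below shows one way the three steps above could be wired together in Python. It is a hedged illustration rather than our release code: run_gcg stands in for any concrete GCG implementation, and the behavior format, step budget, and success criterion are placeholder assumptions.

from typing import List, Tuple

def run_gcg(model, tokenizer, prompt: str, target: str,
            init_suffix: str, max_steps: int) -> Tuple[str, int, bool]:
    """Placeholder for a real GCG optimizer; expected to return
    (best_suffix, steps_used, success_flag)."""
    raise NotImplementedError("plug in a concrete GCG implementation here")

def build_cri_pool(model, tokenizer, train_behaviors: List[dict],
                   max_steps: int = 500) -> List[str]:
    """Step 1: pre-train a small pool of CRI initializations once,
    from a few held-out HarmBench behaviors."""
    pool = []
    for b in train_behaviors:
        suffix, _, ok = run_gcg(model, tokenizer, b["behavior"], b["target"],
                                init_suffix="! " * 20,  # the usual default init
                                max_steps=max_steps)
        if ok:
            pool.append(suffix)
    return pool

def evaluate_with_cri(model, tokenizer, test_behaviors: List[dict],
                      cri_pool: List[str], max_steps: int = 500):
    """Steps 2 and 3: attack each test behavior starting from the CRI
    initializations and record ASR and steps-to-success."""
    records = []
    for b in test_behaviors:
        best = None
        for init in cri_pool:  # each pre-trained suffix is a candidate start
            suffix, steps, ok = run_gcg(model, tokenizer, b["behavior"], b["target"],
                                        init_suffix=init, max_steps=max_steps)
            if ok and (best is None or steps < best["steps"]):
                best = {"suffix": suffix, "steps": steps, "success": True}
        records.append(best or {"success": False, "steps": max_steps})
    asr = sum(r["success"] for r in records) / len(records)
    median_steps = sorted(r["steps"] for r in records)[len(records) // 2]
    return asr, median_steps, records

In practice a single strong CRI initialization is often enough; iterating over the whole pool is an illustrative choice here, not a requirement of the method.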

This underscores CRI’s potential for fast, systematic robustness evaluation in future LLM security research.

AutoDAN + CRI + Prompt Injection (target)

Figure: Example of a jailbreak output produced with AutoDAN + CRI and a prompt-injection target.

BibTeX

@misc{levi2025enhancingjailbreakattackscompliancerefusalbased,
      title={Enhancing Jailbreak Attacks via Compliance-Refusal-Based Initialization}, 
      author={Amit Levi and Rom Himelstein and Yaniv Nemcovsky and Avi Mendelson and Chaim Baskin},
      year={2025},
      eprint={2502.09755},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2502.09755}, 
}