Jailbreak attacks aim to exploit large language models (LLMs) and pose a significant threat to their proper conduct; they seek to bypass models' safeguards and often provoke transgressive behaviors. However, existing automatic jailbreak attacks require extensive computational resources and are prone to converging on suboptimal solutions. In this work, we propose Compliance Refusal Initialization (CRI), a novel, attack-agnostic framework that efficiently initializes the optimization in the proximity of the compliance subspace of harmful prompts. By narrowing the initial gap to the adversarial objective, CRI substantially improves the adversarial success rate (ASR) and drastically reduces computational overhead, often requiring just a single optimization step. We evaluate CRI on the widely used AdvBench dataset with the standard jailbreak attacks GCG and AutoDAN. Results show that CRI boosts ASR and decreases the median number of steps to success by up to ×60.
We also tested the robustness of one of the DeepSeek models by running the GCG jailbreak attack with CRI on the HarmBench benchmark, demonstrating faster convergence and a high adversarial success rate.
Visualization highlighting the “Compliance-Refusal” direction and the overlapping jailbreak embeddings.
LLMs have recently emerged with extraordinary capabilities and have rapidly become integral to numerous fields, transforming everyday tasks such as text generation, image generation, and complex decision-making. Despite these advantages, the widespread deployment of LLMs has unveiled critical security vulnerabilities, making them susceptible to malicious misuse. A common strategy to enhance the robustness of LLMs is alignment training, which effectively separates the input space into Compliance and Refusal subspaces. However, this segmentation has inadvertently fueled adversarial jailbreak attacks, which transform inputs to coerce models into producing unwanted compliance outputs.
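One simple way to probe this separation empirically is to estimate a compliance-refusal direction as the difference between the mean last-token hidden states of prompts the model complies with and prompts it refuses. The minimal sketch below only illustrates this idea and is not the paper's procedure; the model name and the two prompt lists are illustrative placeholders.

```python
# Minimal sketch (not the paper's code): estimate a "compliance-refusal" direction
# in an LLM's latent space from the difference of mean last-token hidden states.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder target model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def last_token_state(prompt: str) -> torch.Tensor:
    """Last-layer hidden state of the final prompt token."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1].float()

compliance_prompts = ["Write a short poem about the sea."]                 # complied with
refusal_prompts = ["Give step-by-step instructions for a harmful activity."]  # refused

mu_comp = torch.stack([last_token_state(p) for p in compliance_prompts]).mean(0)
mu_ref = torch.stack([last_token_state(p) for p in refusal_prompts]).mean(0)

# Unit vector pointing from the refusal cluster toward the compliance cluster.
cr_direction = mu_comp - mu_ref
cr_direction = cr_direction / cr_direction.norm()
```

Projecting new prompts or jailbreak suffixes onto such a direction can give a rough sense of how close they sit to the compliance region, in the spirit of the visualization above.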
In our work, we leverage the distinction between refusal and compliance behaviors. By initializing jailbreak prompts in the proximity of the compliance subspace, we show that CRI can significantly enhance attack success rates while reducing computational overhead. Our experimental results indicate that the alignment process itself may induce vulnerable directions in the latent representation of LLMs, which attackers can exploit. Understanding and exposing these vulnerabilities is a crucial step toward developing more secure and robust language models.
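To make the initialization idea concrete, the following minimal sketch (our illustration, not the authors' implementation) pre-optimizes a single adversarial suffix toward a compliance-style target over a set of auxiliary harmful prompts, and then reuses that suffix as the starting point of each per-prompt attack. The attack_step and loss_fn callables are assumptions standing in for a concrete attack such as one GCG update and its adversarial loss.

```python
# Sketch of compliance-refusal-based initialization against a generic attack step.
from typing import Callable, Sequence

AttackStep = Callable[[str, str, str], str]   # (prompt, suffix, target) -> updated suffix
LossFn = Callable[[str, str, str], float]     # (prompt, suffix, target) -> adversarial loss

def cri_initialize(aux_prompts: Sequence[str], target: str, suffix: str,
                   attack_step: AttackStep, steps: int = 50) -> str:
    """Pre-optimize one suffix over auxiliary harmful prompts so that the model's
    output is pushed toward a compliance prefix such as "Sure, here is"."""
    for _ in range(steps):
        for prompt in aux_prompts:
            suffix = attack_step(prompt, suffix, target)
    return suffix

def attack_from_cri(prompt: str, target: str, cri_suffix: str,
                    attack_step: AttackStep, loss_fn: LossFn,
                    max_steps: int = 500, tol: float = 0.05) -> str:
    """Per-prompt attack started from the CRI suffix instead of a default or
    random one, so the optimization begins close to the compliance subspace."""
    suffix = cri_suffix
    for _ in range(max_steps):
        if loss_fn(prompt, suffix, target) < tol:   # illustrative stopping rule
            break
        suffix = attack_step(prompt, suffix, target)
    return suffix
```

Because the CRI suffix is shared across prompts, the pre-optimization cost is amortized over the whole benchmark, while each per-prompt attack starts much closer to the adversarial objective.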
The remainder of this document details our CRI approach, experimental findings, and discussions on broader impacts and ethical considerations.
A high-level overview of our Compliance-Refusal-Based Initialization (CRI) process.
We summarize our findings on AdvBench, demonstrating that CRI consistently reduces the number of optimization steps required to achieve a successful jailbreak while increasing the adversarial success rate. These results reinforce the practical impact of CRI and highlight potential vulnerabilities in alignment-trained LLMs.
Aggregated performance metrics showcasing CRI’s efficacy across various scenarios.
ASR vs. optimization steps.
Loss vs. optimization steps.
To further demonstrate the effectiveness of CRI, we conducted a fast jailbreak-robustness test on one of the DeepSeek models by running the GCG + CRI attack on the HarmBench benchmark. Below is an overview of how we set up and ran our experiments:
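At a high level, such an evaluation attacks every benchmark behavior starting from a pre-trained CRI suffix, records whether a judge flags the response as a jailbreak, and tracks how many steps this takes. The sketch below illustrates that loop under stated assumptions; the behavior list, the attack_step, generate, and is_jailbroken callables, and the pre-trained cri_suffix are hypothetical placeholders rather than the HarmBench or GCG reference APIs.

```python
# Sketch of a CRI-initialized robustness evaluation loop over benchmark behaviors.
from statistics import median
from typing import Callable, Sequence

def evaluate_with_cri(behaviors: Sequence[str],
                      cri_suffix: str,
                      attack_step: Callable[[str, str, str], str],
                      generate: Callable[[str, str], str],
                      is_jailbroken: Callable[[str, str], bool],
                      target: str = "Sure, here is",
                      max_steps: int = 500):
    """Attack each behavior starting from the CRI suffix and report the attack
    success rate (ASR) together with the median steps to first success."""
    successes, steps_to_success = 0, []
    for behavior in behaviors:
        suffix = cri_suffix
        for step in range(1, max_steps + 1):
            suffix = attack_step(behavior, suffix, target)   # e.g. one GCG update
            response = generate(behavior, suffix)            # query the target model
            if is_jailbroken(behavior, response):            # harmfulness judge
                successes += 1
                steps_to_success.append(step)
                break
    asr = successes / len(behaviors)
    return asr, (median(steps_to_success) if steps_to_success else None)
```

Running this loop with and without the CRI suffix makes the difference in convergence speed directly visible as a shift in the median steps to success.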
This underscores CRI’s potential for fast, systematic robustness evaluation in future LLM security research.
Example of a jailbreak output.
@misc{levi2025enhancingjailbreakattackscompliancerefusalbased,
title={Enhancing Jailbreak Attacks via Compliance-Refusal-Based Initialization},
author={Amit Levi and Rom Himelstein and Yaniv Nemcovsky and Avi Mendelson and Chaim Baskin},
year={2025},
eprint={2502.09755},
archivePrefix={arXiv},
primaryClass={cs.CR},
url={https://arxiv.org/abs/2502.09755},
}