Robustness for AI Safety

Date
Feb 21, 2025, 10:00 am – 11:30 am
Location
Sherrerd Hall, Room 306

Event Description

The advancement of large language models (LLMs) has accelerated progress in artificial intelligence (AI) at an unprecedented pace. At the same time, enhanced AI capabilities raise the stakes of misuse, and the expanded applicability of AI widens the range of possible harm. AI safety has therefore become a top priority.

Currently, AI safety practices are predominantly built upon the alignment-based methodology, which trains models to reject harmful requests. While this has become common practice, a primary focus of this thesis is to demonstrate its lack of robustness.

First, we present an evasion attack that exploits adversarial examples to universally jailbreak safety-aligned LLMs at inference time, causing them to respond to harmful queries and disregard safety constraints. This establishes a fundamental connection between the long-studied vulnerability of neural networks to adversarial examples and emerging challenges in AI safety. Given that adversarial examples remain an unresolved problem, the fact that they can be used to bypass safety alignment suggests that achieving robust AI safety may pose similarly intractable challenges.
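For intuition only, the sketch below illustrates the general shape of a universal adversarial-suffix search against a safety-aligned model. The suffix vocabulary, the random-search loop, and the placeholder scoring function are all illustrative assumptions, not the attack presented in the thesis; a real attack would score candidates with the target model itself.

```python
# Minimal sketch of a universal adversarial-suffix search (illustrative only).
# A single suffix is optimized to work across many harmful prompts ("universal").
import random

VOCAB = ["!", "describing", "similarly", "Now", "write", "oppositely", "]", "(", "Sure", "here"]

def affirmative_logprob(prompt: str, suffix: str) -> float:
    """Placeholder objective. A real attack would query the target model for
    log p("Sure, here is ..." | prompt + adversarial suffix)."""
    return -(hash(prompt + " " + suffix) % 1000) / 100.0  # stand-in score

def random_search_suffix(harmful_prompts, suffix_len=8, iters=200):
    """Greedy random search: repeatedly mutate one suffix token and keep the
    mutation if it raises the summed objective across all prompts."""
    suffix = [random.choice(VOCAB) for _ in range(suffix_len)]
    best = sum(affirmative_logprob(p, " ".join(suffix)) for p in harmful_prompts)
    for _ in range(iters):
        candidate = suffix.copy()
        candidate[random.randrange(suffix_len)] = random.choice(VOCAB)
        score = sum(affirmative_logprob(p, " ".join(candidate)) for p in harmful_prompts)
        if score > best:
            suffix, best = candidate, score
    return " ".join(suffix)
```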

Second, we show how custom fine-tuning of safety-aligned LLMs can degrade or erase their safety alignment. In the adversarial setting, we show that as few as ten malicious fine-tuning examples can strip an LLM's safety guardrails at minimal cost. We also show that even legitimate fine-tuning of an aligned LLM on benign downstream datasets risks compromising the safety alignment. This vulnerability suggests a tension between LLM safety alignment and model customizability.
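As a rough illustration of the fine-tuning setting described above, the sketch below fine-tunes a causal language model on roughly ten (prompt, response) pairs using the Hugging Face Trainer API. The model identifier and the training pairs are hypothetical placeholders, and the hyperparameters are arbitrary; this is not the thesis's experimental setup.

```python
# Illustrative sketch: few-example fine-tuning of a (hypothetical) aligned chat model.
import torch
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

MODEL_NAME = "some-safety-aligned-chat-model"  # hypothetical identifier

class TinyFinetuneSet(Dataset):
    """About ten (prompt, response) pairs; contents are placeholders."""
    def __init__(self, tokenizer, pairs):
        self.examples = [
            tokenizer(p + r, truncation=True, max_length=512, return_tensors="pt")
            for p, r in pairs
        ]
    def __len__(self):
        return len(self.examples)
    def __getitem__(self, i):
        ids = self.examples[i]["input_ids"].squeeze(0)
        # A real setup would typically mask prompt tokens out of the labels.
        return {"input_ids": ids, "labels": ids.clone()}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
pairs = [("<placeholder instruction>", "<placeholder compliant answer>")] * 10

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=5,
                           per_device_train_batch_size=2, learning_rate=2e-5),
    train_dataset=TinyFinetuneSet(tokenizer, pairs),
)
trainer.train()  # a few epochs on a handful of examples can already shift refusal behavior
```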

Then, we trace these robustness vulnerabilities to an underlying issue we call "shallow safety alignment," wherein the safety alignment adapts a model's generative distribution primarily over its first few output tokens. We investigate its counterfactual, "deep safety alignment," which extends the alignment's effect more than just a few tokens deep, and show that this improves the robustness of the safety alignment.
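One way to probe alignment depth, sketched below under stated assumptions, is to compare an aligned model against its pre-alignment base position by position along a response: if the per-token KL divergence between the two models' next-token distributions is large only for the first few response tokens and then falls toward zero, the alignment is shallow. The checkpoint identifiers are hypothetical, the two models are assumed to share a tokenizer, and this diagnostic is an illustration rather than the thesis's methodology.

```python
# Illustrative diagnostic: per-position KL divergence between an aligned model
# and its pre-alignment base along a fixed response.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

ALIGNED, BASE = "aligned-model-id", "base-model-id"  # hypothetical identifiers

def per_position_kl(prompt: str, response: str):
    tok = AutoTokenizer.from_pretrained(ALIGNED)  # assumes shared vocabulary
    aligned = AutoModelForCausalLM.from_pretrained(ALIGNED).eval()
    base = AutoModelForCausalLM.from_pretrained(BASE).eval()

    ids = tok(prompt + response, return_tensors="pt").input_ids
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]

    with torch.no_grad():
        log_pa = F.log_softmax(aligned(ids).logits, dim=-1)  # aligned next-token log-probs
        log_pb = F.log_softmax(base(ids).logits, dim=-1)     # base next-token log-probs

    # KL(aligned || base) at each position that predicts a response token.
    kls = []
    for t in range(prompt_len - 1, ids.shape[1] - 1):
        kls.append(F.kl_div(log_pb[0, t], log_pa[0, t],
                            log_target=True, reduction="sum").item())
    return kls  # shallow alignment: values spike early, then fall toward zero
```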

Finally, we discuss the security principles that should guide work on robustness for AI safety. To exemplify these principles, we present a case study evaluating recent defenses against fine-tuning attacks, highlighting the importance of following established security principles when evaluating and designing robust defenses.

Advisers: Prateek Mittal and Peter Henderson