Details
As frontier Large Language Models (LLMs) continue to scale, unforeseen capabilities are emerging from these models at an unprecedented pace, often surprising even AI experts. These increasingly advanced capabilities have elevated AI safety, once a niche concern, to one of the highest priorities of our time. There is growing concern that advanced AI systems, whether available today or in the near future, may be repurposed for malicious uses such as influence operations, cyber-attacks, and even bioweapons development.
Currently, the safety of LLMs hinges heavily on AI alignment approaches, which train models to refuse harmful requests and thereby safeguard against misuse. In the first part of this talk, I will present my research on red-teaming these alignment approaches, illustrating their susceptibility to adversarial attacks. In particular, I will discuss how the ongoing trends of open-sourcing, customization, and multimodality are escalating these adversarial challenges. The second part of the talk will delve into why safety alignment falters and discuss promising directions for improving its robustness. Finally, the talk will wrap up by tying these adversarial challenges to their broader policy implications, setting the stage for a future-oriented discussion on LLM safety.
Advisers: Prateek Mittal and Peter Henderson