Details
Adapting models to private data is necessary to align them with human preferences. We want to prevent a model from learning undesirable behavior while maintaining its utility on the bulk of the data distribution, which has proven to be a challenging problem across many areas of AI privacy and security. We observe that controlling the behavior of a large model in all settings is difficult because its behavior can be unexpectedly shifted on a small set of out-of-distribution inputs. Our underlying insight is that some parameters in an overparameterized model are naturally sparsely updated: those parameters are used only to fit a small amount of out-of-distribution data. This lets attackers craft data or gradient updates that target those sparsely updated parameters, changing the model's behavior on a small set of inputs. Building on the same insight, we propose sparsity-based defenses that limit the update surface of a model to the parameters most important for the bulk of training. A foundational concept in computer security is limiting the attack surface, and we implement it in machine learning via sparse training. We apply this methodology of sparsity in attacks and defenses across areas of AI security and privacy.
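To make the "limit the update surface" idea concrete, the following is a minimal PyTorch sketch of one possible sparsity-based defense: parameter importance is first estimated from gradients on the bulk (in-distribution) data, and subsequent adaptation updates are then masked to only the most important coordinates. The toy model, synthetic data, `keep_frac`, and the squared-gradient importance score are illustrative assumptions, not details from the talk.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Toy stand-ins for the setting described above: `bulk_data` plays the role
# of trusted in-distribution data, `adapt_data` the (possibly manipulated)
# adaptation data. Both are synthetic placeholders.
bulk_data = TensorDataset(torch.randn(512, 20), torch.randint(0, 2, (512,)))
adapt_data = TensorDataset(torch.randn(128, 20), torch.randint(0, 2, (128,)))
bulk_loader = DataLoader(bulk_data, batch_size=64)
adapt_loader = DataLoader(adapt_data, batch_size=32)

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# 1) Score parameter importance on the bulk distribution by accumulating
#    squared gradients over one clean pass.
importance = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
for x, y in bulk_loader:
    model.zero_grad()
    loss_fn(model(x), y).backward()
    for n, p in model.named_parameters():
        importance[n] += p.grad.detach() ** 2

# 2) Keep only the top `keep_frac` fraction of coordinates per tensor;
#    everything else is frozen, shrinking the update surface.
keep_frac = 0.2  # assumed hyperparameter, not from the talk
masks = {}
for n, imp in importance.items():
    k = max(1, int(keep_frac * imp.numel()))
    thresh = imp.flatten().kthvalue(imp.numel() - k + 1).values
    masks[n] = (imp >= thresh).float()

# 3) Sparse adaptation: zero out gradients outside the mask before each
#    optimizer step, so rarely-used parameters cannot be repurposed.
for x, y in adapt_loader:
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for n, p in model.named_parameters():
            p.grad.mul_(masks[n])
    opt.step()
```

The mask mirrors the attack-surface principle: parameters that only matter for rare, out-of-distribution inputs never receive updates during adaptation, so neither poisoned data nor crafted gradients can repurpose them, while the parameters that carry the bulk of the distribution remain trainable.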