Virginia Smith
Leonardo Associate Professor of Machine Learning at Carnegie Mellon University
Banatao Auditorium | 310 Sutardja Dai Hall
Monday, November 18, 2024 at 4 PM
Machine learning applications are increasingly reliant on black-box pre-trained models. To ensure the safe use of these models, techniques such as unlearning, guardrails, and watermarking have been proposed to curb model behavior and audit usage. Unfortunately, while these post-hoc approaches give positive safety ‘vibes’ when evaluated in isolation, our work shows that existing techniques are quite brittle when deployed as part of larger systems. In a series of recent works, we show that: (a) small amounts of auxiliary data can be used to ‘jog’ the memory of unlearned models; (b) current unlearning benchmarks obscure deficiencies in both finetuning- and guardrail-based approaches; and (c) simple, scalable attacks erode existing LLM watermarking systems and reveal fundamental trade-offs in watermark design. Together, these results highlight major deficiencies in the practical use of post-hoc ML safety methods. We end by discussing promising alternative approaches to ML safety, which aim to ensure safety by design during the development of ML systems.
Speaker Bio
Virginia Smith is the Leonardo Associate Professor of Machine Learning at Carnegie Mellon University. Her current work addresses challenges related to safety and efficiency in large-scale machine learning systems. Virginia’s work has been recognized by several awards, including a Sloan Research Fellowship, an NSF CAREER Award, an MIT TR35 Innovator Award, an Intel Rising Star Award, a Samsung AI Researcher of the Year Award, and faculty awards from Google, Apple, and Meta. Prior to CMU, Virginia was a postdoc at Stanford University and received her Ph.D. in Computer Science from UC Berkeley.