Safety

Coding Agents and Operational Safety

Overview

Autonomous coding agents built on large language models are wired directly into development workflows: they edit files, run commands, configure environments, and fix bugs with growing autonomy. Most safety evaluations of these tools focus on explicitly malicious prompts, but we argue this misses the larger and more common danger: agents that fail during ordinary, goal-directed work through destructive operations, constraint violations, authorization bypasses, and silent errors that surface only after damage is done.

Trustworthy LLMs and VLMs

Overview

Large language and vision-language models are deployed in settings where biased, inconsistent, or manipulated behavior can affect users, yet their internals are often unavailable or hard to inspect. We develop methods that expose and characterize such hidden failures, treating trustworthiness as a property that must be tested for rather than assumed, and we connect each testing method to a concrete path for mitigation or defense.

A recurring theme in our work is that trustworthiness must account for a model’s reasoning process, not only its final answer. Attacks and guardrails that operate on outputs alone tend to leave reasoning traces that are inconsistent or easy to flag. As models increasingly expose their chain-of-thought, the reasoning itself becomes both a new attack surface and a new opportunity for defense. We study how bias and backdoor threats propagate through model behavior, how to characterize them with principled signals, and how to build safeguards that hold up against adaptive adversaries.

Long-Term Fairness and ML Safety

Overview

Many ML-enabled systems operate in dynamic environments: the system’s decisions change the environment, and those changes feed back into its future inputs. Certain self-reinforcing loops can amplify errors, entrench bias, and cause fairness violations in the long term even when immediate outcomes are fair. In predictive policing, for example, a model that flags a neighborhood as high-crime sends more patrols there, producing more recorded arrests, which the model reads as even higher crime. The same pattern appears in loan approvals that affect credit scores and in medical risk scoring that influences treatment access.
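The predictive policing loop can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not a model from our work: two neighborhoods share the same true crime rate, recorded arrests are proportional to patrol presence, and patrols are reallocated in proportion to recorded arrests raised to a hypothetical `emphasis` exponent. The function name and all parameters are illustrative.

```python
def patrol_share_over_time(initial_share, emphasis=2.0, rounds=8):
    """Return neighborhood A's patrol share at each round of a toy feedback loop.

    Assumptions (all hypothetical):
      - two neighborhoods, A and B, with identical true crime rates;
      - recorded arrests are proportional to the patrol share an area receives;
      - next round's patrols are split in proportion to
        recorded_arrests ** emphasis, so emphasis > 1 models a policy that
        concentrates patrols on the area flagged as higher-crime.
    """
    if not 0.0 <= initial_share <= 1.0:
        raise ValueError("initial_share must be between 0 and 1")

    share = initial_share
    shares = [share]
    for _ in range(rounds):
        # Equal true crime rates: recorded arrests only reflect where patrols went.
        recorded_a = share
        recorded_b = 1.0 - share

        # The model reads recorded arrests as crime and reallocates patrols.
        weight_a = recorded_a ** emphasis
        weight_b = recorded_b ** emphasis
        share = weight_a / (weight_a + weight_b)
        shares.append(share)
    return shares


if __name__ == "__main__":
    for round_number, share in enumerate(patrol_share_over_time(0.55)):
        print(f"round {round_number}: neighborhood A receives {share:.1%} of patrols")
```

Under these assumptions, a small initial imbalance grows every round even though the two neighborhoods are identical, while setting `emphasis` to 1.0 leaves the initial split unchanged. The amplification in this sketch comes from the allocation rule, not from any real difference in crime.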

Safety Assurance of ML-Based Systems

Overview

ML-based software makes predictions in settings where failures carry real safety consequences. Our motivating case study was the DHS passenger screening challenge, hosted on Kaggle with the largest prize pool in its history ($1.5 million). The stakes are concrete: TSA screens more than two million passengers daily, high false alarm rates create checkpoint bottlenecks, and false negatives pose severe safety risks. We built abstractions of such ML-enabled systems and inferred preconditions that provide probable guarantees on the safety of their predictions.