<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Safety on reSAID Lab</title><link>https://resaid-lab.github.io/categories/safety/</link><description>Recent content in Safety on reSAID Lab</description><generator>Hugo</generator><language>en-US</language><lastBuildDate>Mon, 16 Nov 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://resaid-lab.github.io/categories/safety/index.xml" rel="self" type="application/rss+xml"/><item><title>Coding Agents and Operational Safety</title><link>https://resaid-lab.github.io/projects/agentic-code-safety/</link><pubDate>Mon, 16 Nov 2026 00:00:00 +0000</pubDate><guid>https://resaid-lab.github.io/projects/agentic-code-safety/</guid><description>&lt;h2 id="overview"&gt;Overview&lt;/h2&gt;
&lt;p&gt;Autonomous coding agents built on large language models are wired directly into development workflows: they edit files, run commands, configure environments, and fix bugs with growing autonomy. Most safety evaluations of these tools focus on explicitly malicious prompts, but we argue this misses the larger and more common danger: agents that fail during ordinary, goal-directed work through destructive operations, constraint violations, authorization bypasses, and silent errors that surface only after damage is done.&lt;/p&gt;</description></item><item><title>Trustworthy LLMs and VLMs</title><link>https://resaid-lab.github.io/projects/llm-bias-testing/</link><pubDate>Tue, 08 Sep 2026 00:00:00 +0000</pubDate><guid>https://resaid-lab.github.io/projects/llm-bias-testing/</guid><description>&lt;h2 id="overview"&gt;Overview&lt;/h2&gt;
&lt;p&gt;Large language and vision-language models are deployed in settings where biased,
inconsistent, or manipulated behavior can affect users, yet their internals are
often unavailable or hard to inspect. We develop methods that expose and
characterize such hidden failures, treating trustworthiness as a property that
must be tested for rather than assumed, and connecting each testing method to a
concrete path for mitigation or defense.&lt;/p&gt;
&lt;p&gt;A recurring theme in our work is that trustworthiness must account for a model&amp;rsquo;s
reasoning process, not only its final answer. Attacks and guardrails that operate
on outputs alone tend to leave reasoning traces that are inconsistent or easy to
flag. As models increasingly expose their chain-of-thought, the reasoning
itself becomes both a new attack surface and a new opportunity for defense. We
study how bias and backdoor threats propagate through model behavior, how to
characterize them with principled signals, and how to build safeguards that hold
up against adaptive adversaries.&lt;/p&gt;</description></item><item><title>Long-Term Fairness and ML Safety</title><link>https://resaid-lab.github.io/projects/fairsense/</link><pubDate>Thu, 01 May 2025 00:00:00 +0000</pubDate><guid>https://resaid-lab.github.io/projects/fairsense/</guid><description>&lt;h2 id="overview"&gt;Overview&lt;/h2&gt;
&lt;p&gt;Many ML-enabled systems operate in dynamic environments: the system&amp;rsquo;s decisions
change the environment, and those changes feed back into its future inputs. Certain
self-reinforcing loops can amplify errors, entrench bias, and cause fairness
violations in the long term even when immediate outcomes are fair. In predictive
policing, for example, a model that flags a neighborhood as high-crime sends more
patrols there, producing more recorded arrests, which the model reads as even higher
crime. The same pattern appears in loan approvals that affect credit scores and in
medical risk scoring that influences treatment access.&lt;/p&gt;</description></item><item><title>Safety Assurance of ML-Based Systems</title><link>https://resaid-lab.github.io/projects/safety-assurance-ml/</link><pubDate>Wed, 01 Nov 2023 00:00:00 +0000</pubDate><guid>https://resaid-lab.github.io/projects/safety-assurance-ml/</guid><description>&lt;h2 id="overview"&gt;Overview&lt;/h2&gt;
&lt;p&gt;ML-based software makes predictions in settings where failures carry real safety consequences. Our motivating case study was the DHS passenger screening challenge, hosted on Kaggle with the largest prize pool in its history ($1.5 million): TSA screens more than two million passengers daily, high false alarm rates create checkpoint bottlenecks, and false negatives pose severe safety risks. We built abstractions of such ML-enabled systems and inferred preconditions that provide probable guarantees on the safety of their predictions.&lt;/p&gt;</description></item></channel></rss>