LLMs
Trustworthy LLMs and VLMs
Overview
Large language and vision-language models are deployed in settings where biased, inconsistent, or manipulated behavior can affect users, yet their internals are often unavailable or hard to inspect. We develop methods that expose and characterize such hidden failures, treating trustworthiness as a property that must be tested for rather than assumed, and we connect each testing method to a concrete path for mitigation or defense.
A recurring theme in our work is that trustworthiness must account for a model’s reasoning process, not only its final answer. Attacks and guardrails that operate on outputs alone tend to leave reasoning traces that are inconsistent or easy to flag. As models increasingly expose their chain of thought, however, the reasoning itself becomes both a new attack surface and a new opportunity for defense. We study how bias and backdoor threats propagate through model behavior, how to characterize them with principled signals, and how to build safeguards that hold up against adaptive adversaries.
LLM Reasoning and Planning
Overview
Large language models can appear to reason, yet generation is autoregressive: each token is chosen given only the tokens that precede it, one step at a time. This local view is powerful, but it also explains familiar failure modes: reasoning that drifts, contradicts itself, takes redundant detours, or commits early to a path that later proves wrong. We study how to make model reasoning globally coherent, efficient, and trustworthy by helping a model decide where it is going before it takes the next step.
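The "commits early" failure mode can be illustrated with a toy sketch. It is not one of our methods, only a minimal example under stated assumptions: the table `NEXT`, its probabilities, and the helper names `greedy_decode` and `exhaustive_decode` are all hypothetical. Greedy decoding picks the locally most likely token at each step, while an exhaustive search scores whole sequences.

```python
# Hypothetical toy next-token distributions, keyed by the prefix generated so far.
# The numbers are invented for illustration; no real model is involved.
NEXT = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}


def greedy_decode(next_probs, steps):
    """Pick the most likely next token at every step (purely local choice)."""
    prefix, prob = (), 1.0
    for _ in range(steps):
        dist = next_probs[prefix]
        token = max(dist, key=dist.get)  # local argmax given the current prefix
        prob *= dist[token]
        prefix += (token,)
    return prefix, prob


def exhaustive_decode(next_probs, steps):
    """Score every full sequence and return the most likely one (global choice)."""
    best_prefix, best_prob = (), 0.0
    stack = [((), 1.0)]
    while stack:
        prefix, prob = stack.pop()
        if len(prefix) == steps:
            if prob > best_prob:
                best_prefix, best_prob = prefix, prob
            continue
        for token, p in next_probs[prefix].items():
            stack.append((prefix + (token,), prob * p))
    return best_prefix, best_prob


greedy_seq, greedy_prob = greedy_decode(NEXT, steps=2)
global_seq, global_prob = exhaustive_decode(NEXT, steps=2)
print("greedy:", greedy_seq, round(greedy_prob, 2))
print("global:", global_seq, round(global_prob, 2))
```

In this toy table the greedy decoder commits to "A" because it is the likelier first token, yet the most probable full sequence starts with "B". Exhaustive search is infeasible at realistic vocabulary sizes and sequence lengths, which is why some form of lookahead or planning is needed instead.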