Sophia Xiao Pu

2113 Henley Hall

UC Santa Barbara

Santa Barbara, CA 93106

Welcome! I am a second-year CS PhD student at the University of California, Santa Barbara, advised by Prof. William Wang.

I obtained my bachelor’s degree from Peking University, where I was advised by Prof. Xiaojun Wan. I also worked with Prof. Tianxing He at Tsinghua University and Prof. Yulia Tsvetkov at the University of Washington.

My research interests lie broadly in Language and Vision, particularly in:

Trustworthy AI: detecting machine-generated text (EMNLP 2023), attacking LLM watermarks (NAACL 2025), and evaluating oversensitivity in LLMs (EMNLP 2025).
Efficiency: prompt compression (EMNLP 2024) and overthinking in reasoning models (COLM 2025).
Other: extrinsic evaluation for text summaries (COLING 2024).

(Only [co-]lead-author papers are listed.)

I am actively looking for motivated undergraduate or master’s students to collaborate on exciting topics such as multimodal evaluation, reasoning, and more.

news

Oct 08, 2025	Presenting ThoughtTerminator at COLM, Montreal.
Aug 20, 2025	Our work on LLM oversensitivity will appear in EMNLP Findings!
Aug 08, 2025	New on arXiv: LLM roleplay (yes, anime characters 😉) 🎭✨
Jul 07, 2025	Our paper on overthinking in reasoning models got accepted to COLM!
Jun 16, 2025	I start my internship at AWS, Santa Clara.

selected preprints & publications

* denotes equal contribution.

EMNLP

Dynamic Evaluation for Oversensitivity in LLMs

Sophia Xiao Pu, Sitao Cheng, Xin Eric Wang, and William Yang Wang

Findings of EMNLP, 2025

Oversensitivity—where language models defensively reject benign prompts—not only disrupts user interactions but also obscures the boundaries between harmful and harmless content. Existing benchmarks rely on static datasets that degrade over time as models evolve, leading to data contamination and diminished evaluative power. To address this, we develop a framework that dynamically generates model-specific challenging datasets, capturing emerging defensive patterns and aligning with each model’s unique behavior. Building on this approach, we construct OverBench, a benchmark that aggregates these datasets across diverse LLM families, encompassing 450,000 samples from 26 models. OverBench provides a dynamic and evolving perspective on oversensitivity, allowing for continuous monitoring of defensive triggers as models advance, highlighting vulnerabilities that static datasets overlook.
COLM

THOUGHT TERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models

Sophia Xiao Pu*, Michael Saxon*, Wenyue Hua, and William Yang Wang

COLM, 2025

Reasoning models have demonstrated impressive performance on difficult tasks that traditional language models struggle with. However, many are plagued by overthinking: generating large amounts of unnecessary tokens that don’t improve accuracy on a question. We introduce approximate measures of problem-level difficulty, demonstrate that a clear relationship exists between problem difficulty and optimal token spend, and evaluate how well calibrated a variety of reasoning models are in terms of efficiently allocating the optimal token count. We find that, in general, reasoning models are poorly calibrated, particularly on easy problems. To evaluate calibration on easy questions, we introduce DUMB500, a dataset of extremely easy math, reasoning, code, and task problems, and jointly evaluate reasoning models on these simple examples and extremely difficult examples from existing frontier benchmarks on the same task domain. Finally, we introduce THOUGHTTERMINATOR, a training-free black-box decoding technique that significantly improves reasoning model calibration.
NAACL

B^4: A Black-Box Scrubbing Attack on LLM Watermarks

Baizhou Huang*, Sophia Xiao Pu*, and Xiaojun Wan

NAACL (Oral), 2025

Watermarking has emerged as a prominent technique for LLM-generated content detection by embedding imperceptible patterns. Despite its strong performance, its robustness against adversarial attacks remains underexplored. Previous work typically considers a grey-box attack setting, where the specific type of watermark is already known. Some work even requires knowledge of the hyperparameters of the watermarking method. Such prerequisites are unattainable in real-world scenarios. Targeting a more realistic black-box threat model with fewer assumptions, we propose B^4, a black-box scrubbing attack on watermarks. Specifically, we formulate the watermark scrubbing attack as a constrained optimization problem by capturing its objectives with two distributions, a Watermark Distribution and a Fidelity Distribution. This optimization problem can be approximately solved using two proxy distributions. Experimental results across 12 different settings demonstrate the superior performance of B^4 compared with other baselines.
EMNLP

Style-Compress: An LLM-Based Prompt Compression Framework Considering Task-Specific Styles

Sophia Xiao Pu, Tianxing He, and Xiaojun Wan

Findings of EMNLP, 2024

Prompt compression condenses contexts while maintaining their informativeness for different usage scenarios. It not only shortens inference time and reduces computational costs when using large language models, but also lowers expenses when using closed-source models. In a preliminary study, we discover that when instructing language models to compress prompts, different compression styles (e.g., extractive or abstractive) affect the performance of compressed prompts on downstream tasks. Building on this insight, we propose Style-Compress, a lightweight framework that adapts a smaller language model to compress prompts for a larger model on a new task without additional training. Our approach iteratively generates and selects effective compressed prompts as task-specific demonstrations through style variation and in-context learning, enabling smaller models to act as efficient compressors with task-specific examples. Style-Compress outperforms two baseline compression models in four tasks: original prompt reconstruction, text summarization, multi-hop QA, and CoT reasoning. In addition, with only 10 samples and 100 queries for adaptation, prompts compressed by Style-Compress achieve performance on par with or better than original prompts at a compression ratio of 0.25 or 0.5.
COLING

Is Summary Useful or Not? An Extrinsic Human Evaluation of Text Summaries on Downstream Tasks

Sophia Xiao Pu, Mingqi Gao, and Xiaojun Wan

LREC-COLING (Oral), 2024

Research on automated text summarization typically uses human and automatic evaluation methods. While most recent studies focus on intrinsic evaluation, which assesses the general quality of summaries, e.g. coherence and informativeness, we concentrate on task-based extrinsic evaluation to determine the usefulness of summaries. We incorporate three downstream tasks, namely question answering, text classification, and text similarity assessment, and measure the usefulness of summaries for these tasks by several metrics. Our findings reveal that summaries are generally useful in tasks that require a comprehensive grasp of the text but are less useful in tasks requiring a more specific understanding of the text. We also analyze the usefulness and inherent properties of summaries from different models, and find that fine-tuned models consistently produce more useful summaries across all three tasks. In contrast, zero-shot models tend to lean towards text classification and similarity assessment, providing more general and less detailed summaries. Additionally, we assess the correlation between 14 intrinsic automatic metrics and human judgments. Intrinsic metrics perform well in evaluating summaries for question answering but are less effective in the other two tasks. This highlights the limitations of relying solely on intrinsic metrics for assessing summary performance and usefulness.
EMNLP

On the Zero-Shot Generalization of Machine-Generated Text Detectors

Sophia Xiao Pu, Jingyu Zhang, Xiaochuang Han, Yulia Tsvetkov, and Tianxing He

Findings of EMNLP, NeurIPS-ENLSP, 2023

The rampant proliferation of large language models, fluent enough to generate text indistinguishable from human-written language, gives unprecedented importance to the detection of machine-generated text. This work is motivated by an important research question: How will detectors of machine-generated text perform on outputs of a new generator that the detectors were not trained on? We begin by collecting generation data from a wide range of LLMs, then train neural detectors on data from each generator and test their performance on held-out generators. While none of the detectors can generalize to all generators, we observe a consistent and interesting pattern: detectors trained on data from a medium-size LLM can zero-shot generalize to the larger version. As a concrete application, we demonstrate that robust detectors can be built on an ensemble of training data from medium-sized models.
Summarization is (almost) dead

Sophia Xiao Pu*, Mingqi Gao*, and Xiaojun Wan

arXiv preprint arXiv:2309.09558, 2023

How well can large language models (LLMs) generate summaries? We develop new datasets and conduct human evaluation experiments to evaluate the zero-shot generation capability of LLMs across five distinct summarization tasks. Our findings indicate a clear preference among human evaluators for LLM-generated summaries over human-written summaries and summaries generated by fine-tuned models. Specifically, LLM-generated summaries exhibit better factual consistency and fewer instances of extrinsic hallucinations. Due to the satisfactory performance of LLMs in summarization tasks (even surpassing the benchmark of reference summaries), we believe that most conventional works in the field of text summarization are no longer necessary in the era of LLMs. However, we recognize that there are still some directions worth exploring, such as the creation of novel datasets with higher quality and more reliable evaluation methods.

Outside of research, I’m also an amateur pipa player, a Dream of the Red Chamber (红楼梦) enthusiast, and a curious language learner.