colored-dye's blog

baoyuntai [at] outlook [dot] com

I am Yuntai Bao, a third-year PhD candidate at the School of Software Technology, Zhejiang University, advised by Xuhong Zhang. I expect to graduate in 2028. My research interests include mechanistic interpretability (mech interp), AI safety, neural network learning dynamics, and general principles of ML systems. I have experience with steering vectors, model probes, and training data attribution.

Currently, I focus on pragmatic interpretability: using theoretical and empirical insights from mech interp to enable model control that is both effective and efficient in compute and data. Beyond interpretability, I also work on LLM post-training, including RL, knowledge distillation, and LLM-based agents. I also have experience in cryptography and software/OS security.

Please feel free to reach out~

selected publications

(* indicates equal contribution)

ICML
Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions

Yuntai Bao, Qinfeng Li, Xinyan Yu, Xuhong Zhang, Ge Su, Wenqi Zhang, Liu Yan, Haiqin Weng, and Jianwei Yin

In Forty-third International Conference on Machine Learning, 2026

Recently, steering vectors (SVs) have emerged as an effective and lightweight approach to steer behaviors of large language models (LLMs), among which fine-tuned SVs are more effective than optimization-free ones. However, current approaches to fine-tuned SVs suffer from two limitations. First, they require careful selection of steering factors on a per-SV basis to balance steering effectiveness and generation quality at inference time. Second, they operate as full-sequence SVs (FSSVs), which can sacrifice generation quality regardless of factor selection due to excessive intervention on the model generation process. To address the first limitation, we propose joint training of steering factors and directions, such that post-hoc factor selection is no longer required. Using neural network scaling theory, we find that moderately large initialization sizes and learning rates for steering factors are essential for stability and efficiency of joint training. To tackle the second limitation, we draw inspiration from representation fine-tuning and introduce Prompt-only SV (PrOSV), an SV that intervenes only on a few prompt tokens. Our empirical results show that PrOSV outperforms traditional FSSVs on AxBench when using our joint training scheme. We also find that PrOSV achieves a better tradeoff between general model utility and adversarial robustness than FSSV.
@inproceedings{bao2026towards, title = {Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions}, author = {Bao, Yuntai and Li, Qinfeng and Yu, Xinyan and Zhang, Xuhong and Su, Ge and Zhang, Wenqi and Yan, Liu and Weng, Haiqin and Yin, Jianwei}, booktitle = {Forty-third International Conference on Machine Learning}, year = {2026}, url = {https://openreview.net/forum?id=AaT3liS5PE}, }
ICML
PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts

Qinfeng Li^*, Yuntai Bao^*, Jianghui Hu^*, Wenqi Zhang, Jintao Chen, Huifeng Zhu, Yier Jin, and Xuhong Zhang

In Forty-third International Conference on Machine Learning, 2026

LLM agents rely on prompts to implement task-specific capabilities based on foundation LLMs, making agent prompts valuable intellectual property. However, in untrusted deployments, adversaries can copy and reuse these prompts with other proprietary LLMs, causing economic losses. To protect these prompts, we identify four key challenges: proactivity, runtime protection, usability, and non-portability that existing approaches fail to address. We present PragLocker, a prompt protection scheme that satisfies these requirements. PragLocker constructs function-preserving obfuscated prompts by anchoring semantics with code symbols and then using target-model feedback to inject noise, yielding prompts that only work on the target LLM. Experiments across multiple agent systems, datasets, and foundation LLMs show that PragLocker substantially reduces cross-LLM portability, maintains target performance, and remains robust against adaptive attackers.
@inproceedings{li2026praglocker, title = {PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts}, author = {Li, Qinfeng and Bao, Yuntai and Hu, Jianghui and Zhang, Wenqi and Chen, Jintao and Zhu, Huifeng and Jin, Yier and Zhang, Xuhong}, booktitle = {Forty-third International Conference on Machine Learning}, year = {2026}, url = {https://openreview.net/forum?id=PWhmZ04OTr}, }
ICLR
Faithful Bi-Directional Model Steering via Distribution Matching and Distributed Interchange Interventions

Yuntai Bao, Xuhong Zhang, Jintao Chen, Ge Su, Yuxiang Cai, Hao Peng, Bing Sun, Haiqin Weng, Liu Yan, and Jianwei Yin

In The Fourteenth International Conference on Learning Representations, 2026

Intervention-based model steering offers a lightweight and interpretable alternative to prompting and fine-tuning. However, by adapting strong optimization objectives from fine-tuning, current methods are susceptible to overfitting and often underperform, sometimes generating unnatural outputs. We hypothesize that this is because effective steering requires the faithful identification of internal model mechanisms, not the enforcement of external preferences. To this end, we build on the principles of distributed alignment search (DAS), the standard for causal variable localization, to propose a new steering method: Concept DAS (CDAS). While we adopt the core mechanism of DAS, distributed interchange intervention (DII), we introduce a novel distribution matching objective tailored for the steering task by aligning intervened output distributions with counterfactual distributions. CDAS differs from prior work in two main ways: first, it learns interventions via weak-supervised distribution matching rather than probability maximization; second, it uses DIIs that naturally enable bi-directional steering and allow steering factors to be derived from data, reducing the effort required for hyperparameter tuning and resulting in more faithful and stable control. On AxBench, a large-scale model steering benchmark, we show that CDAS does not always outperform preference-optimization methods but may benefit more from increased model scale. In two safety-related case studies, overriding refusal behaviors of safety-aligned models and neutralizing a chain-of-thought backdoor, CDAS achieves systematic steering while maintaining general model utility. These results indicate that CDAS is complementary to preference-optimization approaches and conditionally constitutes a robust approach to intervention-based model steering.
@inproceedings{bao2026faithful, title = {Faithful Bi-Directional Model Steering via Distribution Matching and Distributed Interchange Interventions}, author = {Bao, Yuntai and Zhang, Xuhong and Chen, Jintao and Su, Ge and Cai, Yuxiang and Peng, Hao and Sun, Bing and Weng, Haiqin and Yan, Liu and Yin, Jianwei}, booktitle = {The Fourteenth International Conference on Learning Representations}, year = {2026}, url = {https://openreview.net/forum?id=LoisXFZL3k}, }
IJCAI
Scalable Multi-Stage Influence Function for Large Language Models via Eigenvalue-Corrected Kronecker-Factored Parameterization

Yuntai Bao, Xuhong Zhang, Tianyu Du, Xinkui Zhao, Jiang Zong, Hao Peng, and Jianwei Yin

In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-25, Aug 2025

Main Track

Pre-trained large language models (LLMs) are commonly fine-tuned to adapt to downstream tasks. Since the majority of knowledge is acquired during pre-training, attributing the predictions of fine-tuned LLMs to their pre-training data may provide valuable insights. Influence functions have been proposed as a means to explain model predictions based on training data. However, existing approaches fail to compute “multi-stage” influence and lack scalability to billion-scale LLMs. In this paper, we propose the multi-stage influence function to attribute the downstream predictions of fine-tuned LLMs to pre-training data under the full-parameter fine-tuning paradigm. To enhance the efficiency and practicality of our multi-stage influence function, we leverage Eigenvalue-corrected Kronecker-Factored (EK-FAC) parameterization for efficient approximation. Empirical results validate the superior scalability of EK-FAC approximation and the effectiveness of our multi-stage influence function. Additionally, case studies on a real-world LLM, dolly-v2-3b, demonstrate its interpretive power, with exemplars illustrating insights provided by multi-stage influence estimates.
@inproceedings{bao2025scalable, title = {Scalable Multi-Stage Influence Function for Large Language Models via Eigenvalue-Corrected Kronecker-Factored Parameterization}, author = {Bao, Yuntai and Zhang, Xuhong and Du, Tianyu and Zhao, Xinkui and Zong, Jiang and Peng, Hao and Yin, Jianwei}, booktitle = {Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, {IJCAI-25}}, publisher = {International Joint Conferences on Artificial Intelligence Organization}, editor = {Kwok, James}, pages = {8022--8030}, year = {2025}, month = aug, note = {Main Track}, doi = {10.24963/ijcai.2025/892}, url = {https://doi.org/10.24963/ijcai.2025/892}, }
Findings of ACL
Probing the Geometry of Truth: Consistency and Generalization of Truth Directions in LLMs Across Logical Transformations and Question Answering Tasks

Yuntai Bao, Xuhong Zhang, Tianyu Du, Xinkui Zhao, Zhengwen Feng, Hao Peng, and Jianwei Yin

In Findings of the Association for Computational Linguistics: ACL 2025, Jul 2025

Large language models (LLMs) are trained on extensive datasets that encapsulate substantial world knowledge. However, their outputs often include confidently stated inaccuracies. Earlier works suggest that LLMs encode truthfulness as a distinct linear feature, termed the “truth direction”, which can classify truthfulness reliably. We address several open questions about the truth direction: (i) whether LLMs universally exhibit consistent truth directions; (ii) whether sophisticated probing techniques are necessary to identify truth directions; and (iii) how the truth direction generalizes across diverse contexts. Our findings reveal that not all LLMs exhibit consistent truth directions, with stronger representations observed in more capable models, particularly in the context of logical negation. Additionally, we demonstrate that truthfulness probes trained on declarative atomic statements can generalize effectively to logical transformations, question-answering tasks, in-context learning, and external knowledge sources. Finally, we explore the practical application of truthfulness probes in selective question-answering, illustrating their potential to improve user trust in LLM outputs. These results advance our understanding of truth directions and provide new insights into the internal representations of LLM beliefs.
@inproceedings{bao2025probing, title = {Probing the Geometry of Truth: Consistency and Generalization of Truth Directions in {LLM}s Across Logical Transformations and Question Answering Tasks}, author = {Bao, Yuntai and Zhang, Xuhong and Du, Tianyu and Zhao, Xinkui and Feng, Zhengwen and Peng, Hao and Yin, Jianwei}, editor = {Che, Wanxiang and Nabende, Joyce and Shutova, Ekaterina and Pilehvar, Mohammad Taher}, booktitle = {Findings of the Association for Computational Linguistics: ACL 2025}, month = jul, year = {2025}, address = {Vienna, Austria}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2025.findings-acl.38/}, doi = {10.18653/v1/2025.findings-acl.38}, pages = {682--700}, isbn = {979-8-89176-256-5}, }