CV
Contact Information
| Name | Yuntai Bao |
| Professional Title | PhD student |
| Email | baoyuntai@outlook.com |
| Location | School of Software Technology, Zhejiang University, Ningbo, Zhejiang Province 315000 |
Professional Summary
My research interests include mechanistic interpretability (mech interp), AI safety, neural network learning dynamics, and general principles of ML systems. Currently, I am focused on achieving effective and efficient model control via mechanistic interpretability.
Education
-
2023 - 2028 Zhejiang, China
Ph.D.
School of Software Technology, Zhejiang University
Artificial Intelligence
-
2019 - 2023 Zhejiang, China
B.Eng.
College of Computer Science and Technology, Zhejiang University
Information Security
Publications
-
2026 Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions
ICML 2026
We propose a principled framework for training steering vectors that jointly learns direction and strength, eliminating post-hoc tuning. We further introduce Prompt-Only SV, achieving stronger control of LLMs while better preserving performance and robustness.
-
2026 PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts
ICLR 2026
We propose PragLocker, a black-box method that protects LLM agent prompts by transforming them into obfuscated, model-specific forms that remain functional on the target model but fail on others. Experiments show it effectively prevents prompt reuse across LLMs while preserving task performance and robustness.
-
2026 Faithful Bi-Directional Model Steering via Distribution Matching and Distributed Interchange Interventions
ICLR 2026
This paper introduces Concept Distributed Alignment Search (CDAS), a steering method that employs a distribution matching objective and distributed interchange interventions to faithfully manipulate internal concept features without overfitting to external preferences. CDAS achieves stable bi-directional control—effectively overriding safety refusals and neutralizing backdoors—while preserving general model utility.
-
2025 Scalable Multi-Stage Influence Function for Large Language Models via Eigenvalue-Corrected Kronecker-Factored Parameterization
IJCAI 2025
This paper introduces a multi-stage influence function that attributes the predictions of fine-tuned LLMs back to their pretraining data and scales efficiently to billion-parameter models.
-
2025 Probing the Geometry of Truth: Consistency and Generalization of Truth Directions in LLMs Across Logical Transformations and Question Answering Tasks
Findings of ACL 2025
This paper investigates the internal representation of truth in LLMs, revealing that consistent “truth directions” emerge primarily in capable models and generalize effectively across logical transformations and diverse question-answering tasks. The truthfulness probes can be practically applied to selective question answering, improving task accuracy by filtering out incorrect model outputs.