CV | colored-dye's blog

Contact Information

Name	Yuntai Bao
Professional Title	PhD student
Email	baoyuntai@outlook.com
Location	School of Software Technology, Zhejiang University, Ningbo, Zhejiang Province 315000

Professional Summary

My research interests include mechanistic interpretability (mech interp), AI safety, neural network learning dynamics, and general principles of ML systems. I am currently focused on achieving effective and efficient model control via mechanistic interpretability.

Education

2023 - 2028

Zhejiang, China

Ph.D.

School of Software Technology, Zhejiang University

Artificial Intelligence

2019 - 2023

Zhejiang, China

B.Eng.

College of Computer Science and Technology, Zhejiang University

Information Security

Publications

2026

Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions

ICML 2026

We propose a principled framework for training steering vectors that jointly learns direction and strength, eliminating post-hoc tuning. We further introduce Prompt-Only SV, achieving stronger control of LLMs while better preserving performance and robustness.

2026

PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts

ICLR 2026

We propose PragLocker, a black-box method that protects LLM agent prompts by transforming them into obfuscated, model-specific forms that remain functional on the target model but fail on others. Experiments show it effectively prevents prompt reuse across LLMs while preserving task performance and robustness.

2026

Faithful Bi-Directional Model Steering via Distribution Matching and Distributed Interchange Interventions

ICLR 2026

This paper introduces Concept Distributed Alignment Search (CDAS), a steering method that employs a distribution matching objective and distributed interchange interventions to faithfully manipulate internal concept features without overfitting to external preferences. CDAS achieves stable bi-directional control, effectively overriding safety refusals and neutralizing backdoors, while preserving general model utility.

2025

Scalable Multi-Stage Influence Function for Large Language Models via Eigenvalue-Corrected Kronecker-Factored Parameterization

IJCAI 2025

This paper introduces a scalable multi-stage influence function that attributes the predictions of fine-tuned LLMs back to their pretraining data and scales efficiently to billion-parameter models.

2025

Probing the Geometry of Truth: Consistency and Generalization of Truth Directions in LLMs Across Logical Transformations and Question Answering Tasks

Findings of ACL 2025

This paper investigates the internal representation of truth in LLMs, revealing that consistent “truth directions” emerge primarily in capable models and generalize effectively across logical transformations and diverse question-answering tasks. The truthfulness probes can be practically applied to selective question answering, improving task accuracy by filtering out incorrect model outputs.

Skills

Programming languages: Python, C/C++

Languages

Chinese: Native speaker

English: Fluent

Interests

Mechanistic interpretability: causal variable localization, circuit analysis

Representation steering: steering vectors
