MyPhoneBench

Do Phone-Use Agents Respect Your Privacy?
The first end-to-end verifiable privacy evaluation framework for phone-use agents.

📄 Paper 💻 Code 📝 Blog 📊 Trajectories

A phone-use agent crossing privacy boundaries during a benign KFC ordering task

A phone agent ordering a hamburger from KFC — but along the way it grabs the phone number without permission, enters the SMS code, enables persistent login, and types the number into a marketing pop-up.

Leaderboard

5 frontier models evaluated across 10 mock Android apps and 300 tasks. Sorted by normalized overall score.

#	Model	Task SR (%)	Privacy (%)	PQSR@0.7 (%)	Later-Session (%)	Overall

PQSR (Privacy-Qualified Success Rate) is the primary comparison metric — it jointly requires task completion and privacy compliance (≥ 0.7). Overall = 50% PQSR + 20% Task SR + 10% Privacy + 20% Later-Session.
Click any column header to re-sort.

Key Findings

Three takeaways from evaluating 5 frontier phone-use agents on privacy behavior.

Capability ≠ Restraint

Given the same tasks and the same privacy protocol, different models show vastly different "boundary awareness." Some skip unnecessary fields and save useful preferences; others over-request data and fill marketing traps.

No Model Wins All Three Axes

Task success, privacy-compliant completion, and later-session preference transfer are three distinct capabilities. The leader changes every time you switch the evaluation dimension.

The Biggest Risk: Over-Helpfulness

The most pervasive failure isn't adversarial attacks or permission confusion — it's agents filling optional personal fields "just because they can." Completion-oriented bias is the hardest privacy challenge.

Evaluation Framework

Three components that make phone-agent privacy measurable, verifiable, and reproducible.

iMy Privacy Contract

A minimal executable privacy protocol that gives agents clear boundaries and gives users full control.

LOW/HIGH data access tiers with permission gating
4 privacy tools: request_permission, read_profile, save_profile, ask_user
User-controlled memory: always visible, editable, deletable

10 Instrumented Mock Apps

Covering healthcare, dining, travel, insurance, government and more — with SQLite backends and form_drafts logging for full auditability.

3 Privacy Probes

Over-Permissioning (OP) — requesting data the task doesn't need
Trap Resistance (TR) — avoiding plausible but unnecessary re-disclosure widgets
Form Minimization (FM) — refraining from filling optional personal fields

Citation

If you use MyPhoneBench in your research, please cite:

@article{tang2026myphonebench, title={Do Phone-Use Agents Respect Your Privacy?}, author={Zhengyang Tang and Ke Ji and Xidong Wang and Zihan Ye and Xinyuan Wang and Yiduo Guo and Ziniu Li and Chenxin Li and Jingyuan Hu and Shunian Chen and Tongxu Luo and Jiaxi Bi and Zeyu Qin and Shaobo Wang and Xin Lai and Pengyuan Lyu and Junyi Li and Can Xu and Chengquan Zhang and Han Hu and Ming Yan and Benyou Wang}, journal={arXiv preprint arXiv:2604.00986}, year={2026}, }