Arthur Chen

About Me

Hello! My name is Arthur Chen (陳皓楠). I’m a master’s student at the R2L Lab of the University of Waterloo, advised by Victor Zhong. I am also a researcher at the Vector Institute. During my undergraduate studies, I built Pyserini (infrastructure tools for retrieval) with Jimmy Lin and UniIR (universal multimodal retrievers) with Wenhu Chen.

Interests

I work on adapting ML models (especially agents) to new environments (e.g., new tools, codebases, or workflows) at test-time. Why at test-time? Because the crucial context – environment-specific rules, dynamics, custom I/O formats, and so on – is often unavailable during training and only privately revealed in the deployed setting. Models must therefore learn from that context on the fly to succeed.

My interests are in:

Test-Time Adaptation: adapting models (e.g. agents) to new environments after deployment – via interaction or synthetic data (without costly re-labeling or private user data).
Automated Agent Verification: evaluating long-horizon agents with per-instance programmatic checks, so success and failure can be measured without relying only on human judgment or unstable LLM judges.
Self-Improving Data Synthesis: closing the loop on synthetic data generation by automatically iterating on pipelines so they improve dataset quality with less hand-tuned engineering.

News

I’m on the job market for research and engineering roles! Please reach out if you think we could work together.

[Feb. 2026]: Agentic AI lightning talk on “Adapting Agents to Unseen Environments” – Remarkable 2026 at Vector Institute.
[Dec. 2025]: Invited talk on “Test-Time Adaptation via Data Synthesis” – Bloomberg CTO Office.
[May 2025]: Started my internship at Salesforce AI Research to work on agents!
[Apr. 2024]: Placed 3rd at the 2024 Citadel & Citadel Securities Invitational Datathon among ~100 finalists selected from thousands of applicants.

Selected Papers

For an up-to-date list of papers, please refer to Google Scholar.

Test-Time Adaptation for LLM Agents via Environment Interaction
Arthur Chen, Zuxin Liu, Jianguo Zhang, Akshara Prabhakar, Zhiwei Liu, Shelby Heinecke, Silvio Savarese, Victor Zhong, Caiming Xiong
Introduces efficient strategies for adapting LLM agents to new environments at test-time via interaction.
International Conference on Learning Representations (ICLR), 2026.
Links: paper • code • project page

SynQuE: Estimating Synthetic Dataset Quality Without Annotations
Arthur Chen, Victor Zhong
SynQuE is a framework and benchmark for ranking synthetic datasets by their expected real-world performance without requiring any labeled real data.
Transactions on Machine Learning Research (TMLR), 2026.
Presented at International Conference on Learning Representations (ICLR), DATA-FM Workshop, 2026.
Links: paper • code • project page

UniIR: Training and Benchmarking Universal Multimodal Information Retrievers
Cong Wei, Yang Chen, Arthur Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, Wenhu Chen
UniIR is an instruction-guided multimodal retriever for eight retrieval tasks, evaluated on the standardized M-BEIR benchmark.
European Conference on Computer Vision (ECCV), 2024. Oral Presentation.
Links: paper • code • project page

VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation
Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Arthur Chen, et al.
VideoScore is an automatic metric for AI-generated videos that simulates detailed human feedback to predict quality scores.
Empirical Methods in Natural Language Processing (EMNLP), 2024.
Links: paper • code • project page