Ran Xu

2000 N Shoreline Blvd

Mountain View, CA 94043

My name is Ran Xu. I’m a research scientist at Google DeepMind.

My research centers on LLM agents and post-training. In particular, I am interested in enabling language models to effectively use external tools, including search [COLM ’25; NeurIPS ’25] and code [EMNLP ’24a; ICLR ’26a; ICLR ’26b]. More broadly, I also work on general LLM post-training across different stages (SFT, DPO, RL) [NAACL ’25; arXiv ’26a; arXiv ’26b]. Overall, my ultimate goal is to build more capable yet secure language models that can better reason, act, and interact with the world.

Before joining Google, I obtained my PhD from the Department of Computer Science at Emory University in 2026, co-advised by Prof. Carl Yang and Prof. Joyce C. Ho. Prior to that, I obtained my bachelor’s degree (with Highest Honors), also from the Department of Computer Science at Emory University, in 2021.

Education

Emory University (2021 - 2026): Ph.D. in Computational Science and Informatics; GPA: 3.98/4.00; Research Focus: Large Language Models, Retrieval-augmented Generation, Agents, and Data Synthesis, with applications in healthcare; Advisors: Prof. Carl Yang and Prof. Joyce C. Ho
Emory University (2017 - 2021): B.S. in Computer Science, Double Major in Applied Mathematics; GPA: 3.97/4.00; Research Focus: Natural Language Processing; Advisor: Prof. Jinho Choi

Industrial Experience

Google DeepMind (March 2026 - Present): Research Scientist

Search Intelligence, Google DeepMind (Jun 2025 - Nov 2025): Research Intern; Topic: Agentic Judge Training via Tool-Augmented RL [ICLR 2026]; Mentors: Jingjing Chen, Jiayu Ye, Yu Wu; Manager: Hongkun Yu

AI Lab, Tencent America (Feb 2025 - May 2025): Artificial General Intelligence Research Intern; Topic: Retrieval-augmented GUI Agents with Skill Generation [EMNLP 2025 Main Conference]; Mentors: Kaixin Ma, Wenhao Yu, Hongming Zhang; Manager: Dong Yu

Query Understanding Team, Amazon (May 2024 - Oct 2024): Applied Scientist Intern; Topic: LLM Self-training for Retrieval-augmented Generation [NAACL 2025 Main Conference]; Mentor: Hui Liu; Manager: Qi He

Meta Platforms, Inc. (May 2020 - Aug 2020): Enterprise Engineer Intern; Mentor: Zexi Zhang

News

Mar 9, 2026	I completed my PhD and joined Google DeepMind as a research scientist.
Jan 26, 2026	Two papers on LLM Agents are accepted to ICLR 2026, one of them as an Oral (top 1.1%).
Sep 18, 2025	Our paper AceSearcher: Bootstrapping Reasoning and Search for LLMs via Reinforced Self-Play is accepted to NeurIPS 2025 as Spotlight (top 3.2%). See you in San Diego!
Aug 20, 2025	Our paper on improving GUI Agents with tutorials is accepted to EMNLP 2025 Main Conference.
Sep 20, 2024	Three papers, on LLMs for Text Retrieval, LLM Agents for Complex Tabular Reasoning, and LLM Test-time Adaptation, are accepted to EMNLP 2024.

Selected Publications

MedAgentGym: A Scalable Agentic Training Environment for Code-Centric Reasoning in Biomedical Data Science

Ran Xu*, Yuchen Zhuang*, Yishan Zhong, Yue Yu, Zifeng Wang, Xiangru Tang, Hang Wu, May Dongmei Wang, Peifeng Ruan, Donghan Yang, Tao Wang, Guanghua Xiao, Xin Liu, Carl Yang, Yang Xie, and Wenqi Shi

Proceedings of ICLR, 2026. (Oral)

We introduce MedAgentGym, a scalable and interactive training environment designed to enhance coding-based biomedical reasoning capabilities in large language model (LLM) agents. MedAgentGym comprises 72,413 task instances across 129 categories derived from 12 authentic real-world biomedical scenarios. Tasks are encapsulated within executable sandbox environments, each featuring detailed task specifications, interactive feedback mechanisms, verifiable ground truth annotations, and scalable training trajectory generation. Extensive benchmarking of 29 LLMs reveals substantial performance disparities in biomedical data science between commercial and open-source LLMs. Leveraging efficient multi-threaded and multi-turn trajectory sampling in MedAgentGym, Med-Copilot achieves performance gains of +43.02% and +45.28% from offline and online reinforcement learning, respectively, demonstrating MedAgentGym as an effective training ground while establishing itself as a cost-effective, privacy-preserving alternative competitive with proprietary LLMs (gpt-4o). By offering a unified execution environment with a comprehensive benchmark and accessible, extensible training resources, MedAgentGym delivers an integrated platform to develop LLM-based coding assistants for advanced biomedical data science.
Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning

Ran Xu, Jingjing Chen, Jiayu Ye, Yu Wu, Jun Yan, Carl Yang, and Hongkun Yu

Proceedings of ICLR, 2026.

Large Language Models (LLMs) are widely used as judges to evaluate response quality, providing a scalable alternative to human evaluation. However, most LLM judges operate solely on intrinsic text-based reasoning, limiting their ability to verify complex constraints or perform accurate computation. Motivated by the success of tool-integrated reasoning (TIR) in numerous tasks, we propose TIR-Judge, an end-to-end RL framework for training LLM judges that integrates a Python executor for precise evaluation. TIR-Judge is built on three principles: (i) diverse training across verifiable and non-verifiable domains, (ii) flexible judgment formats (pointwise, pairwise, listwise), and (iii) iterative RL that enables bootstrapping directly from a base model without distillation. On six public benchmarks, TIR-Judge surpasses strong reasoning-based judges by up to 6.4% (pointwise) and 7.7% (pairwise), and achieves listwise performance comparable to Claude-Opus-4 despite having only 8B parameters. Remarkably, TIR-Judge-Zero—trained entirely without distillation—matches the performance of the distilled variants, showing that tool-augmented judges can self-improve through iterative reinforcement learning.
AceSearcher: Bootstrapping Reasoning and Search for LLMs via Reinforced Self-Play

Ran Xu, Yuchen Zhuang, Zihan Dong, Jonathan Wang, Yue Yu, Joyce C. Ho, Linjun Zhang, Haoyu Wang, Wenqi Shi, and Carl Yang

Proceedings of NeurIPS, 2025. (Spotlight)

Search-augmented LLMs often struggle with complex reasoning tasks due to ineffective multi-hop retrieval and limited reasoning ability. We propose AceSearcher, a cooperative self-play framework that trains a single large language model (LLM) to alternate between two roles: a decomposer that breaks down complex queries and a solver that integrates retrieved contexts for answer generation. AceSearcher couples supervised fine-tuning on a diverse mixture of search, reasoning, and decomposition tasks with reinforcement fine-tuning optimized for final answer accuracy, eliminating the need for intermediate annotations. Extensive experiments on three reasoning-intensive tasks across 10 datasets show that AceSearcher outperforms state-of-the-art baselines, achieving an average exact match improvement of 7.6%. Remarkably, on document-level finance reasoning tasks, AceSearcher-32B matches the performance of the DeepSeek-V3 model using less than 5% of its parameters. Even at smaller scales (1.5B and 8B), AceSearcher often surpasses existing search-augmented LLMs with up to 9x more parameters, highlighting its exceptional efficiency and effectiveness in tackling complex reasoning tasks.
SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains

Ran Xu, Hui Liu, Sreyashi Nag, Zhenwei Dai, Yaochen Xie, Xianfeng Tang, Chen Luo, Yang Li, Joyce C. Ho, Carl Yang, and Qi He

Proceedings of NAACL, 2025.

Retrieval-augmented generation (RAG) enhances the question answering (QA) abilities of large language models (LLMs) by integrating external knowledge. However, adapting general-purpose RAG systems to specialized fields such as science and medicine poses unique challenges due to distribution shifts and limited access to domain-specific data. To tackle this, we propose SimRAG, a self-training approach that equips LLMs with joint capabilities of question answering and question generation for domain adaptation. Our method first fine-tunes LLMs on instruction-following, question-answering, and search-related data. Then, it prompts LLMs to generate diverse domain-relevant questions from unlabeled corpora, with an additional filtering strategy to retain high-quality synthetic examples. By leveraging these synthetic examples, the LLMs can improve their performance on domain-specific RAG tasks. Experiments on 11 datasets across three different domains verify the efficacy of SimRAG over baselines by 1.2%–8.6%.
Counterfactual and Factual Reasoning over Hypergraphs for Interpretable Clinical Predictions on EHR

Ran Xu, Yue Yu, Chao Zhang, Mohammed K Ali, Joyce C Ho, and Carl Yang

Proceedings of ML4H, 2022. (Best Paper Award)

Electronic Health Record modeling is crucial for digital medicine. However, existing models ignore higher-order interactions among medical codes and their causal relations towards downstream clinical predictions. To address such limitations, we propose a novel framework CACHE, to provide effective and insightful clinical predictions based on hypergraph representation learning and counterfactual and factual reasoning techniques. Experiments on two real EHR datasets show the superior performance of CACHE. Case studies with a domain expert illustrate a preferred capability of CACHE in generating clinically meaningful interpretations towards the correct predictions.