Junyu Xie

I am a final-year DPhil (PhD) student at the Visual Geometry Group (VGG), University of Oxford, advised by Prof. Andrew Zisserman and Prof. Weidi Xie.

During Summer 2025, I interned at Google DeepMind London as a student researcher, working on dense video perception and 4D reconstruction.

Before that, I completed my undergraduate studies at the University of Cambridge, receiving MSc and BA degrees in Natural Sciences (Physics). During those years, I did summer internships in machine learning and physics at Caltech, Fudan University, and the University of Cambridge.

Research

My research focuses on understanding motion and interactions in video, including multi-modal video understanding, dense motion understanding, and object-centric learning.

Publications

Grounding Moving Object Segmentation in 3D Space and Time

Junyu Xie, Tengda Han, Weidi Xie, Andrew Zisserman

arXiv preprint, 2026

@article{xie2026gmos,
  title={Grounding Moving Object Segmentation in 3D Space and Time},
  author={Xie, Junyu and Han, Tengda and Xie, Weidi and Zisserman, Andrew},
  journal={arXiv preprint arXiv:2605.30352},
  year={2026}
}

We present GMOS, a framework that grounds moving object segmentation in 3D space and time, predicting moving objects per-frame from RGB video and achieving state-of-the-art results. The approach is supported by our new GMOS-2K dataset, comprising 2,210 real-world videos with per-object temporal motion annotations, and the temporally fine-grained MOS-I evaluation protocol.

Efficiently Reconstructing Dynamic Scenes One 🎯 D4RT at a Time

Chuhan Zhang, Guillaume Le Moing, Skanda Koppula, Ignacio Rocco, Liliane Momeni, Junyu Xie, Shuyang Sun, Rahul Sukthankar, Joëlle K. Barral, Raia Hadsell, Zoubin Ghahramani, Andrew Zisserman, Junlin Zhang, Mehdi S. M. Sajjadi

CVPR, 2026 🏆 Best Paper Award

@InProceedings{zhang2026d4rt,
  title={Efficiently Reconstructing Dynamic Scenes One D4RT at a Time},
  author={Zhang, Chuhan and Le Moing, Guillaume and Koppula, Skanda and Rocco, Ignacio and Momeni, Liliane and Xie, Junyu and Sun, Shuyang and Sukthankar, Rahul and Barral, Jo{\"e}lle K. and Hadsell, Raia and Ghahramani, Zoubin and Zisserman, Andrew and Zhang, Junlin and Sajjadi, Mehdi S. M.},
  booktitle={CVPR},
  year={2026}
}

D4RT is a feedforward model that utilizes a unified transformer architecture and a novel querying mechanism to jointly infer depth, spatio-temporal correspondence, and camera parameters from a single video, achieving state-of-the-art performance in 4D reconstruction and tracking tasks.

Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation

Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Eshika Khandelwal, Gül Varol, Weidi Xie, Andrew Zisserman

ICCV, 2025

@InProceedings{xie2025shotbyshot,
  title={Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation},
  author={Junyu Xie and Tengda Han and Max Bain and Arsha Nagrani and Eshika Khandelwal and G{\"u}l Varol and Weidi Xie and Andrew Zisserman},
  booktitle={ICCV},
  year={2025}
}

We introduce an enhanced two-stage training-free framework for Audio Description (AD) generation. We consider the "shot" as the fundamental unit in movies and TV series, incorporating shot-based temporal context and film grammar information into VideoLLM perception. Additionally, we formulate a new metric (Action Score) that assesses whether the predicted ADs capture the correct action information.

Character-Centric Understanding of Animated Movies

Zhongrui Gui, Junyu Xie, Tengda Han, Weidi Xie, Andrew Zisserman

ACM MM, 2025

@InProceedings{gui2025character,
  title={Character-Centric Understanding of Animated Movies},
  author={Zhongrui Gui and Junyu Xie and Tengda Han and Weidi Xie and Andrew Zisserman},
  booktitle={ACM MM},
  year={2025}
}

To address the challenge of recognising highly variable animated characters, this work introduces a novel audio-visual pipeline and the CMD-AM dataset, utilising a multi-modal character bank to significantly improve accessibility through generated audio descriptions and character-aware subtitles.

AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description

Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman

ACCV, 2024

@InProceedings{xie2024autoad0,
  title={AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description},
  author={Junyu Xie and Tengda Han and Max Bain and Arsha Nagrani and G{\"u}l Varol and Weidi Xie and Andrew Zisserman},
  booktitle={ACCV},
  year={2024}
}

We propose AutoAD-Zero, a training-free framework for zero-shot Audio Description (AD) generation for movies and TV series. The framework has two stages (dense description followed by AD summary), with character information injected through visual-textual prompting.

Moving Object Segmentation: All You Need Is SAM (and Flow)

Junyu Xie, Charig Yang, Weidi Xie, Andrew Zisserman

ACCV, 2024 Oral

@InProceedings{xie2024flowsam,
  title={Moving Object Segmentation: All You Need Is SAM (and Flow)},
  author={Junyu Xie and Charig Yang and Weidi Xie and Andrew Zisserman},
  booktitle={ACCV},
  year={2024}
}

This paper tackles motion segmentation by incorporating optical flow into the Segment Anything Model (SAM), using flow either as a direct input (FlowI-SAM) or as a prompt (FlowP-SAM).

Appearance-Based Refinement for Object-Centric Motion Segmentation

Junyu Xie, Weidi Xie, Andrew Zisserman

ECCV, 2024

@InProceedings{xie2024appearrefine,
  title={Appearance-Based Refinement for Object-Centric Motion Segmentation},
  author={Junyu Xie and Weidi Xie and Andrew Zisserman},
  booktitle={ECCV},
  year={2024}
}

This paper aims to improve flow-only motion segmentation (e.g. OCLR predictions) by leveraging appearance information across video frames. We develop a selection-correction pipeline, along with a test-time model adaptation scheme that further alleviates the Sim2Real disparity.

SHAP-EDITOR: Instruction-Guided Latent 3D Editing in Seconds

Minghao Chen, Junyu Xie, Iro Laina, Andrea Vedaldi

CVPR, 2024

@InProceedings{chen2024shap,
  title={SHAP-EDITOR: Instruction-guided Latent 3D Editing in Seconds},
  author={Chen, Minghao and Xie, Junyu and Laina, Iro and Vedaldi, Andrea},
  booktitle={CVPR},
  year={2024}
}

This paper presents SHAP-EDITOR, a method for fast 3D editing (within one second). To achieve this, we learn a universal editing function that can be applied to different objects in a feed-forward manner.

Segmenting Moving Objects via an Object-Centric Layered Representation

Junyu Xie, Weidi Xie, Andrew Zisserman

NeurIPS, 2022

@InProceedings{xie2022segmenting,
  title     = {Segmenting Moving Objects via an Object-Centric Layered Representation},
  author    = {Junyu Xie and Weidi Xie and Andrew Zisserman},
  booktitle = {NeurIPS},
  year      = {2022}
}

We propose the OCLR model for discovering, tracking and segmenting multiple moving objects in a video without relying on human annotations. This object-centric segmentation model utilises depth-ordered layered representations and is trained following a Sim2Real procedure.