Junyu Xie

Grounding Moving Object Segmentation in 3D Space and Time

Junyu Xie, Tengda Han, Weidi Xie, Andrew Zisserman

arXiv preprint, 2026

@article{xie2026gmos,
  title={Grounding Moving Object Segmentation in 3D Space and Time},
  author={Xie, Junyu and Han, Tengda and Xie, Weidi and Zisserman, Andrew},
  journal={arXiv preprint arXiv:2605.30352},
  year={2026}
}

We present GMOS, a framework that grounds moving object segmentation in 3D space and time, predicting moving objects per frame from RGB video and achieving state-of-the-art results. The approach is supported by our new GMOS-2K dataset, comprising 2,210 real-world videos with per-object temporal motion annotations, and the temporally fine-grained MOS-I evaluation protocol.

Efficiently Reconstructing Dynamic Scenes One 🎯 D4RT at a Time

Chuhan Zhang, Guillaume Le Moing, Skanda Koppula, Ignacio Rocco, Liliane Momeni, Junyu Xie, Shuyang Sun, Rahul Sukthankar, Joëlle K. Barral, Raia Hadsell, Zoubin Ghahramani, Andrew Zisserman, Junlin Zhang, Mehdi S. M. Sajjadi

CVPR, 2026 🏆 Best Paper Award

arXiv/ BibTeX/ Project Page

@InProceedings{zhang2026d4rt,
  title={Efficiently Reconstructing Dynamic Scenes One D4RT at a Time},
  author={Zhang, Chuhan and Le Moing, Guillaume and Koppula, Skanda and Rocco, Ignacio and Momeni, Liliane and Xie, Junyu and Sun, Shuyang and Sukthankar, Rahul and Barral, Jo{\"e}lle K. and Hadsell, Raia and Ghahramani, Zoubin and Zisserman, Andrew and Zhang, Junlin and Sajjadi, Mehdi S. M.},
  booktitle={CVPR},
  year={2026}
}

D4RT is a feedforward model that utilizes a unified transformer architecture and a novel querying mechanism to jointly infer depth, spatio-temporal correspondence, and camera parameters from a single video, achieving state-of-the-art performance in 4D reconstruction and tracking tasks.

Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation

Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Eshika Khandelwal, Gül Varol, Weidi Xie, Andrew Zisserman

ICCV, 2025

arXiv/ BibTeX/ Project Page/ Code/ Metric (Action Score)

@InProceedings{xie2025shotbyshot,
  title={Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation},
  author={Junyu Xie and Tengda Han and Max Bain and Arsha Nagrani and Eshika Khandelwal and G{\"u}l Varol and Weidi Xie and Andrew Zisserman},
  booktitle={ICCV},
  year={2025}
}

We introduce an enhanced two-stage training-free framework for Audio Description (AD) generation. We consider the "shot" as the fundamental unit in movies and TV series, incorporating shot-based temporal context and film grammar information into VideoLLM perception. Additionally, we formulate a new metric (Action Score) that assesses whether the predicted ADs capture the correct action information.

Character-Centric Understanding of Animated Movies

Zhongrui Gui, Junyu Xie, Tengda Han, Weidi Xie, Andrew Zisserman

ACM MM, 2025

arXiv/ BibTeX/ Project Page/ Code

@InProceedings{gui2025character,
  title={Character-Centric Understanding of Animated Movies},
  author={Zhongrui Gui and Junyu Xie and Tengda Han and Weidi Xie and Andrew Zisserman},
  booktitle={ACM MM},
  year={2025}
}

To address the challenge of recognising highly variable animated characters, this work introduces a novel audio-visual pipeline and the CMD-AM dataset, utilising a multi-modal character bank to significantly improve accessibility through generated audio descriptions and character-aware subtitles.

AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description

Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman

ACCV, 2024

arXiv/ BibTeX/ Project Page/ Code/ Dataset (TV-AD)

@InProceedings{xie2024autoad0,
  title={AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description},
  author={Junyu Xie and Tengda Han and Max Bain and Arsha Nagrani and G{\"u}l Varol and Weidi Xie and Andrew Zisserman},
  booktitle={ACCV},
  year={2024}
}

We propose AutoAD-Zero, a training-free framework for zero-shot Audio Description (AD) generation for movies and TV series. The framework has two stages (dense description + AD summary), with character information injected by visual-textual prompting.

Moving Object Segmentation: All You Need Is SAM (and Flow)

Junyu Xie, Charig Yang, Weidi Xie, Andrew Zisserman

ACCV, 2024 Oral

arXiv/ BibTeX/ Project Page/ Code

@InProceedings{xie2024flowsam,
  title={Moving Object Segmentation: All You Need Is SAM (and Flow)},
  author={Junyu Xie and Charig Yang and Weidi Xie and Andrew Zisserman},
  booktitle={ACCV},
  year={2024}
}

This paper focuses on motion segmentation by incorporating optical flow into the Segment Anything Model (SAM), applying flow information as direct inputs (FlowISAM) or prompts (FlowPSAM).

Appearance-Based Refinement for Object-Centric Motion Segmentation

Junyu Xie, Weidi Xie, Andrew Zisserman

ECCV, 2024

arXiv/ BibTeX/ Project Page

@InProceedings{xie2024appearrefine,
  title={Appearance-Based Refinement for Object-Centric Motion Segmentation},
  author={Junyu Xie and Weidi Xie and Andrew Zisserman},
  booktitle={ECCV},
  year={2024}
}

This paper aims to improve flow-only motion segmentation (e.g. OCLR predictions) by leveraging appearance information across video frames. A selection-correction pipeline is developed, along with a test-time model adaptation scheme that further alleviates the Sim2Real disparity.

SHAP-EDITOR: Instruction-Guided Latent 3D Editing in Seconds

Minghao Chen, Junyu Xie, Iro Laina, Andrea Vedaldi

CVPR, 2024

arXiv/ BibTeX/ Project Page/ Code/ Demo

@InProceedings{chen2024shap,
  title={SHAP-EDITOR: Instruction-guided Latent 3D Editing in Seconds},
  author={Chen, Minghao and Xie, Junyu and Laina, Iro and Vedaldi, Andrea},
  booktitle={CVPR},
  year={2024}
}

This paper presents SHAP-EDITOR, a method for fast 3D editing (within one second). To achieve this, we propose to learn a universal editing function that can be applied to different objects in a feed-forward manner.

Segmenting Moving Objects via an Object-Centric Layered Representation

Junyu Xie, Weidi Xie, Andrew Zisserman

NeurIPS, 2022

arXiv/ BibTeX/ Project Page/ Code

@InProceedings{xie2022segmenting,
  title     = {Segmenting Moving Objects via an Object-Centric Layered Representation},
  author    = {Junyu Xie and Weidi Xie and Andrew Zisserman},
  booktitle = {NeurIPS},
  year      = {2022}
}

We propose the OCLR model for discovering, tracking and segmenting multiple moving objects in a video without relying on human annotations. This object-centric segmentation model utilises depth-ordered layered representations and is trained following a Sim2Real procedure.
