Research

My research focuses on long-form video understanding, object-centric learning, and motion segmentation. I am also interested in representation learning, image and video generation, and multimodal language models.
            
Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation
Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Eshika Khandelwal, Gül Varol, Weidi Xie, Andrew Zisserman
In ICCV, 2025 (new)
ArXiv / Bibtex / Project page / Code / Metric (Action Score)
@InProceedings{xie2025shotbyshot,
  title={Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation},
  author={Junyu Xie and Tengda Han and Max Bain and Arsha Nagrani and Eshika Khandelwal and G\"ul Varol and Weidi Xie and Andrew Zisserman},
  booktitle={ICCV},
  year={2025}
}
In this work, we introduce an enhanced two-stage training-free framework for Audio Description (AD) generation. We treat the "shot" as the fundamental unit of movies and TV series, incorporating shot-based temporal context and film grammar information into VideoLLM perception. Additionally, we formulate a new metric (Action Score) that assesses whether the predicted ADs capture the correct action information.
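
As a rough illustration of the two-stage, shot-based design, here is a minimal Python sketch; `detect_shots`, `videollm`, and `llm` are hypothetical stand-ins, not the released implementation:

# Minimal sketch of a shot-based, two-stage AD pipeline (illustrative only).
def generate_ad(clip, detect_shots, videollm, llm, context=1):
    """Stage 1: describe each shot with temporal context; Stage 2: summarise."""
    shots = detect_shots(clip)  # shot boundaries: the fundamental units
    descriptions = []
    for i, shot in enumerate(shots):
        # Neighbouring shots supply shot-based temporal context; in the paper,
        # film-grammar cues (e.g. shot type) additionally condition the VideoLLM.
        neighbours = shots[max(0, i - context): i + context + 1]
        descriptions.append(videollm(shot=shot, context=neighbours))
    # Stage 2: an LLM condenses the per-shot descriptions into one AD.
    return llm("Summarise into a single audio description: "
               + " ".join(descriptions))

# Toy usage with dummy components:
ad = generate_ad(
    clip="movie_clip",
    detect_shots=lambda clip: ["shot0", "shot1"],
    videollm=lambda shot, context: f"description of {shot}",
    llm=lambda prompt: prompt.upper(),
)
print(ad)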

AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description
Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
In ACCV, 2024
ArXiv / Bibtex / Project page / Code / Dataset (TV-AD)
@InProceedings{xie2024autoad0,
  title={AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description},
  author={Junyu Xie and Tengda Han and Max Bain and Arsha Nagrani and G\"ul Varol and Weidi Xie and Andrew Zisserman},
  booktitle={ACCV},
  year={2024}
}
In this paper, we propose AutoAD-Zero, a training-free framework for zero-shot Audio Description (AD) generation for movies and TV series. The overall framework features two stages (dense description + AD summary), with character information injected via visual-textual prompting.
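
The visual-textual prompting idea can be sketched as follows (illustrative Python; the colour scheme and the `add_character_prompts` helper are assumptions, not the released code):

# Illustrative sketch of visual-textual character prompting: characters are
# highlighted with coloured circles in the frame, and the text prompt names
# each colour so the VideoLLM can refer to characters by name.
from PIL import Image, ImageDraw

COLOURS = ["red", "green", "blue"]

def add_character_prompts(frame, characters):
    """`characters` is a list of (name, (x0, y0, x1, y1)) face boxes; both
    would come from an upstream character-recognition step."""
    draw = ImageDraw.Draw(frame)
    parts = []
    for (name, box), colour in zip(characters, COLOURS):
        draw.ellipse(box, outline=colour, width=4)  # visual prompt
        parts.append(f"{name} is highlighted by the {colour} circle")
    text_prompt = "Possible characters: " + "; ".join(parts) + "."  # textual prompt
    return frame, text_prompt

# Example usage with a blank frame and a dummy face box:
frame = Image.new("RGB", (320, 180))
frame, prompt = add_character_prompts(frame, [("Alice", (40, 30, 90, 90))])
print(prompt)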

Moving Object Segmentation: All You Need Is SAM (and Flow)
Junyu Xie, Charig Yang, Weidi Xie, Andrew Zisserman
In ACCV, 2024 (Oral)
ArXiv / Bibtex / Project page / Code
@InProceedings{xie2024flowsam,
  title={Moving Object Segmentation: All You Need Is SAM (and Flow)},
  author={Junyu Xie and Charig Yang and Weidi Xie and Andrew Zisserman},
  booktitle={ACCV},
  year={2024}
}
This paper focuses on motion segmentation by incorporating optical flow into the Segment Anything Model (SAM), applying flow information either as direct inputs (FlowI-SAM) or as prompts (FlowP-SAM).
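
Schematically, the two variants differ in where flow enters SAM. A minimal sketch, where `sam` and `flow_prompt_encoder` are hypothetical callables rather than the released models:

import numpy as np

def flow_to_rgb(flow):
    """Map a (H, W, 2) optical-flow field to a crude 3-channel visualisation."""
    mag = np.linalg.norm(flow, axis=-1, keepdims=True)
    rgb = np.concatenate([flow, mag], axis=-1)
    rgb -= rgb.min()
    rgb /= max(rgb.max(), 1e-6)
    return (255 * rgb).astype(np.uint8)

def flow_i_sam(sam, flow):
    # FlowI-SAM: flow itself (rendered as an RGB image) is what SAM segments,
    # so the predicted masks are driven purely by motion.
    return sam(image=flow_to_rgb(flow))

def flow_p_sam(sam, flow_prompt_encoder, frame, flow):
    # FlowP-SAM: SAM sees the RGB frame, while flow features generate the
    # prompts that select which (moving) object to segment.
    prompts = flow_prompt_encoder(flow)
    return sam(image=frame, prompts=prompts)

print(flow_to_rgb(np.random.randn(4, 4, 2)).shape)  # (4, 4, 3)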

Appearance-Based Refinement for Object-Centric Motion Segmentation
Junyu Xie, Weidi Xie, Andrew Zisserman
In ECCV, 2024
ArXiv / Bibtex / Project page
@InProceedings{xie2024appearrefine,
  title={Appearance-Based Refinement for Object-Centric Motion Segmentation},
  author={Junyu Xie and Weidi Xie and Andrew Zisserman},
  booktitle={ECCV},
  year={2024}
}
This paper aims to improve flow-only motion segmentation (e.g., OCLR predictions) by leveraging appearance information across video frames. A selection-correction pipeline is developed, along with a test-time model adaptation scheme that further alleviates the Sim2Real disparity.
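
In pseudocode-level Python, the selection-correction loop might look as follows; the `appearance_model` interface here is a hypothetical stand-in, not the paper's actual components:

# Flow-predicted masks are kept where they agree with an appearance model
# (selection), and re-predicted from appearance where they do not (correction).
def refine_masks(flow_masks, frames, appearance_model, threshold=0.8):
    refined = []
    for frame, mask in zip(frames, flow_masks):
        score = appearance_model.consistency(frame, mask)  # selection step
        if score >= threshold:
            refined.append(mask)  # reliable: keep the flow prediction
        else:
            # Correction step: re-segment from appearance, using the already
            # selected reliable masks in other frames as reference.
            refined.append(appearance_model.correct(frame, mask, refined))
    return refined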

SHAP-EDITOR: Instruction-guided Latent 3D Editing in Seconds
Minghao Chen, Junyu Xie, Iro Laina, Andrea Vedaldi
In CVPR, 2024
ArXiv / Bibtex / Project page / Code / Demo
@InProceedings{chen2024shap,
  title={SHAP-EDITOR: Instruction-guided Latent 3D Editing in Seconds},
  author={Chen, Minghao and Xie, Junyu and Laina, Iro and Vedaldi, Andrea},
  booktitle={CVPR},
  year={2024}
}
This paper presents a method, named SHAP-EDITOR, for fast 3D editing (within one second). To achieve this, we propose learning a universal editing function that can be applied to different objects in a feed-forward manner.
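
The feed-forward character of such an editor can be sketched in PyTorch; the dimensions and the simple MLP below are placeholders, not the actual SHAP-EDITOR architecture:

import torch
import torch.nn as nn

# A single network maps (3D latent, instruction embedding) -> edited latent,
# so each edit is one forward pass with no per-object optimisation.
class LatentEditor(nn.Module):
    def __init__(self, latent_dim=1024, text_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + text_dim, 2048),
            nn.ReLU(),
            nn.Linear(2048, latent_dim),
        )

    def forward(self, latent, instruction_emb):
        # Concatenate the object's latent with the encoded instruction and
        # predict the edited latent directly.
        return self.net(torch.cat([latent, instruction_emb], dim=-1))

editor = LatentEditor()
edited = editor(torch.randn(1, 1024), torch.randn(1, 512))
print(edited.shape)  # torch.Size([1, 1024])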

Segmenting Moving Objects via an Object-Centric Layered Representation
Junyu Xie, Weidi Xie, Andrew Zisserman
In NeurIPS, 2022
ArXiv / Bibtex / Project page / Code
@InProceedings{xie2022segmenting,
  title={Segmenting Moving Objects via an Object-Centric Layered Representation},
  author={Junyu Xie and Weidi Xie and Andrew Zisserman},
  booktitle={NeurIPS},
  year={2022}
}
In this paper, we propose the OCLR model for discovering, tracking, and segmenting multiple moving objects in a video without relying on human annotations. This object-centric segmentation model utilises depth-ordered layered representations and is trained following a Sim2Real procedure.
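
The depth-ordered layering can be illustrated with a simple compositing function; this is a NumPy sketch of the representation itself, not the OCLR model:

import numpy as np

def composite_layers(layer_masks):
    """`layer_masks`: (N, H, W) amodal object masks, index 0 = frontmost layer.
    Returns visible (modal) masks where nearer layers occlude farther ones."""
    occupied = np.zeros(layer_masks.shape[1:], dtype=bool)
    visible = []
    for mask in layer_masks:  # iterate front to back in depth order
        vis = mask.astype(bool) & ~occupied  # only unoccluded pixels survive
        visible.append(vis)
        occupied |= vis
    return np.stack(visible)

# Two overlapping squares; the front one occludes the back one.
a = np.zeros((8, 8)); a[1:5, 1:5] = 1
b = np.zeros((8, 8)); b[3:7, 3:7] = 1
print(composite_layers(np.stack([a, b]))[1].sum())  # 12: back layer loses the overlap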
            
This website template is originally designed by Jon Barron.