Research
My research focuses on long-form video understanding, object-centric learning, and motion segmentation. I am also interested in representation learning, image and video generation, and multimodal language models.
AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description
Junyu Xie,
Tengda Han,
Max Bain,
Arsha Nagrani,
Gül Varol,
Weidi Xie,
Andrew Zisserman
In ACCV, 2024  
ArXiv /
Bibtex /
Project page /
Code /
Dataset (TV-AD)
@article{xie2024autoad0,
title={AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description},
author={Junyu Xie and Tengda Han and Max Bain and Arsha Nagrani and G\"ul Varol and Weidi Xie and Andrew Zisserman},
journal={arXiv preprint arXiv:2407.15850},
year={2024}
}
In this paper, we propose AutoAD-Zero, a training-free framework for zero-shot Audio Description (AD) generation for movies and TV series. The framework features two stages (dense description + AD summary), with character information injected via visual-textual prompting.
Moving Object Segmentation: All You Need Is SAM (and Flow)
Junyu Xie,
Charig Yang,
Weidi Xie,
Andrew Zisserman
In ACCV, 2024   (Oral)
ArXiv /
Bibtex /
Project page /
Code
@article{xie2024flowsam,
title={Moving Object Segmentation: All You Need Is SAM (and Flow)},
author={Junyu Xie and Charig Yang and Weidi Xie and Andrew Zisserman},
journal={arXiv preprint arXiv:2404.12389},
year={2024}
}
This paper focuses on motion segmentation by incorporating optical flow into the Segment Anything Model (SAM), using flow either as direct input (FlowISAM) or as prompts (FlowPSAM).
Appearance-Based Refinement for Object-Centric Motion Segmentation
Junyu Xie,
Weidi Xie,
Andrew Zisserman
In ECCV, 2024  
ArXiv /
Bibtex /
Project page
@InProceedings{xie2024appearrefine,
title={Appearance-Based Refinement for Object-Centric Motion Segmentation},
author={Junyu Xie and Weidi Xie and Andrew Zisserman},
booktitle={ECCV},
year={2024}
}
This paper aims to improve flow-only motion segmentation (e.g. OCLR predictions) by leveraging appearance information across video frames. A selection-correction pipeline is developed, along with a test-time model adaptation scheme that further alleviates the Sim2Real disparity.
SHAP-EDITOR: Instruction-guided Latent 3D Editing in Seconds
Minghao Chen,
Junyu Xie,
Iro Laina,
Andrea Vedaldi
In CVPR, 2024
ArXiv /
Bibtex /
Project page /
Code /
Demo
@InProceedings{chen2024shap,
title={SHAP-EDITOR: Instruction-guided Latent 3D Editing in Seconds},
author={Chen, Minghao and Xie, Junyu and Laina, Iro and Vedaldi, Andrea},
booktitle={CVPR},
year={2024}
}
This paper presents a method, named SHAP-EDITOR, for fast 3D editing (within one second). To achieve this, we propose to learn a universal editing function that can be applied to different objects in a feed-forward manner.
Segmenting Moving Objects via an Object-Centric Layered Representation
Junyu Xie,
Weidi Xie,
Andrew Zisserman
In NeurIPS, 2022  
ArXiv /
Bibtex /
Project page /
Code
@InProceedings{xie2022segmenting,
title = {Segmenting Moving Objects via an Object-Centric Layered Representation},
author = {Junyu Xie and Weidi Xie and Andrew Zisserman},
booktitle = {NeurIPS},
year = {2022}
}
In this paper, we propose the OCLR model for discovering, tracking and segmenting multiple moving objects in a video without relying on human annotations. This object-centric segmentation model utilises depth-ordered layered representations and is trained following a Sim2Real procedure.
This website template is originally designed by Jon Barron.