Research

My research focuses on long-form video understanding, object-centric learning, and motion segmentation. I am also interested in representation learning, image and video generation, and multimodal language models.
            
Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation
Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Eshika Khandelwal, Gül Varol, Weidi Xie, Andrew Zisserman
In ICCV, 2025 (new)
ArXiv / Bibtex / Project page / Code / Metric (Action Score)
@InProceedings{xie2025shotbyshot,
  title={Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation},
  author={Junyu Xie and Tengda Han and Max Bain and Arsha Nagrani and Eshika Khandelwal and G\"ul Varol and Weidi Xie and Andrew Zisserman},
  booktitle={ICCV},
  year={2025}
}
In this work, we introduce an enhanced two-stage training-free framework for Audio Description (AD) generation. We treat the "shot" as the fundamental unit of movies and TV series, incorporating shot-based temporal context and film grammar information into VideoLLM perception. Additionally, we formulate a new metric (Action Score) that assesses whether the predicted ADs capture the correct action information.
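
As a rough illustration of the two-stage, shot-based design, here is a minimal Python sketch; `detect_shots`, `videollm`, and `llm` are hypothetical stand-ins, not the released implementation:

# Minimal sketch of a shot-based, two-stage AD pipeline (illustrative only).
def generate_ad(clip, detect_shots, videollm, llm, context=1):
    """Stage 1: describe each shot with temporal context; Stage 2: summarise."""
    shots = detect_shots(clip)  # shot boundaries: the fundamental units
    descriptions = []
    for i, shot in enumerate(shots):
        # Neighbouring shots supply shot-based temporal context; in the paper,
        # film-grammar cues (e.g. shot type) additionally condition the VideoLLM.
        neighbours = shots[max(0, i - context): i + context + 1]
        descriptions.append(videollm(shot=shot, context=neighbours))
    # Stage 2: an LLM condenses the per-shot descriptions into one AD.
    return llm("Summarise into a single audio description: "
               + " ".join(descriptions))

# Toy usage with dummy components:
ad = generate_ad(
    clip="movie_clip",
    detect_shots=lambda clip: ["shot0", "shot1"],
    videollm=lambda shot, context: f"description of {shot}",
    llm=lambda prompt: prompt.upper(),
)
print(ad)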

AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description
Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
In ACCV, 2024
ArXiv / Bibtex / Project page / Code / Dataset (TV-AD)
@InProceedings{xie2024autoad0,
  title={AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description},
  author={Junyu Xie and Tengda Han and Max Bain and Arsha Nagrani and G\"ul Varol and Weidi Xie and Andrew Zisserman},
  booktitle={ACCV},
  year={2024}
}
In this paper, we propose AutoAD-Zero, a training-free framework for zero-shot Audio Description (AD) generation for movies and TV series. The overall framework features two stages (dense description + AD summary), with character information injected via visual-textual prompting.
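
The visual-textual prompting idea can be sketched as follows (illustrative Python; the colour scheme and the `add_character_prompts` helper are assumptions, not the released code):

# Illustrative sketch of visual-textual character prompting: characters are
# highlighted with coloured circles in the frame, and the text prompt names
# each colour so the VideoLLM can refer to characters by name.
from PIL import Image, ImageDraw

COLOURS = ["red", "green", "blue"]

def add_character_prompts(frame, characters):
    """`characters` is a list of (name, (x0, y0, x1, y1)) face boxes; both
    would come from an upstream character-recognition step."""
    draw = ImageDraw.Draw(frame)
    parts = []
    for (name, box), colour in zip(characters, COLOURS):
        draw.ellipse(box, outline=colour, width=4)  # visual prompt
        parts.append(f"{name} is highlighted by the {colour} circle")
    text_prompt = "Possible characters: " + "; ".join(parts) + "."  # textual prompt
    return frame, text_prompt

# Example usage with a blank frame and a dummy face box:
frame = Image.new("RGB", (320, 180))
frame, prompt = add_character_prompts(frame, [("Alice", (40, 30, 90, 90))])
print(prompt)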

Moving Object Segmentation: All You Need Is SAM (and Flow)
Junyu Xie, Charig Yang, Weidi Xie, Andrew Zisserman
In ACCV, 2024 (Oral)
ArXiv / Bibtex / Project page / Code
@InProceedings{xie2024flowsam,
  title={Moving Object Segmentation: All You Need Is SAM (and Flow)},
  author={Junyu Xie and Charig Yang and Weidi Xie and Andrew Zisserman},
  booktitle={ACCV},
  year={2024}
}
This paper focuses on motion segmentation by incorporating optical flow into the Segment Anything Model (SAM), applying flow information either as direct inputs (FlowI-SAM) or as prompts (FlowP-SAM).
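
Schematically, the two variants differ in where flow enters SAM. A minimal sketch, where `sam` and `flow_prompt_encoder` are hypothetical callables rather than the released models:

import numpy as np

def flow_to_rgb(flow):
    """Map a (H, W, 2) optical-flow field to a crude 3-channel visualisation."""
    mag = np.linalg.norm(flow, axis=-1, keepdims=True)
    rgb = np.concatenate([flow, mag], axis=-1)
    rgb -= rgb.min()
    rgb /= max(rgb.max(), 1e-6)
    return (255 * rgb).astype(np.uint8)

def flow_i_sam(sam, flow):
    # FlowI-SAM: flow itself (rendered as an RGB image) is what SAM segments,
    # so the predicted masks are driven purely by motion.
    return sam(image=flow_to_rgb(flow))

def flow_p_sam(sam, flow_prompt_encoder, frame, flow):
    # FlowP-SAM: SAM sees the RGB frame, while flow features generate the
    # prompts that select which (moving) object to segment.
    prompts = flow_prompt_encoder(flow)
    return sam(image=frame, prompts=prompts)

print(flow_to_rgb(np.random.randn(4, 4, 2)).shape)  # (4, 4, 3)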

Appearance-Based Refinement for Object-Centric Motion Segmentation
Junyu Xie, Weidi Xie, Andrew Zisserman
In ECCV, 2024
ArXiv / Bibtex / Project page
@InProceedings{xie2024appearrefine,
  title={Appearance-Based Refinement for Object-Centric Motion Segmentation},
  author={Junyu Xie and Weidi Xie and Andrew Zisserman},
  booktitle={ECCV},
  year={2024}
}
This paper aims to improve flow-only motion segmentation (e.g., OCLR predictions) by leveraging appearance information across video frames. A selection-correction pipeline is developed, along with a test-time model adaptation scheme that further alleviates the Sim2Real disparity.
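
In pseudocode-level Python, the selection-correction loop might look as follows; the `appearance_model` interface here is a hypothetical stand-in, not the paper's actual components:

# Flow-predicted masks are kept where they agree with an appearance model
# (selection), and re-predicted from appearance where they do not (correction).
def refine_masks(flow_masks, frames, appearance_model, threshold=0.8):
    refined = []
    for frame, mask in zip(frames, flow_masks):
        score = appearance_model.consistency(frame, mask)  # selection step
        if score >= threshold:
            refined.append(mask)  # reliable: keep the flow prediction
        else:
            # Correction step: re-segment from appearance, using the already
            # selected reliable masks in other frames as reference.
            refined.append(appearance_model.correct(frame, mask, refined))
    return refined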

SHAP-EDITOR: Instruction-guided Latent 3D Editing in Seconds
Minghao Chen, Junyu Xie, Iro Laina, Andrea Vedaldi
In CVPR, 2024
ArXiv / Bibtex / Project page / Code / Demo
@InProceedings{chen2024shap,
  title={SHAP-EDITOR: Instruction-guided Latent 3D Editing in Seconds},
  author={Chen, Minghao and Xie, Junyu and Laina, Iro and Vedaldi, Andrea},
  booktitle={CVPR},
  year={2024}
}
This paper presents a method, named SHAP-EDITOR, for fast 3D editing (within one second). To achieve this, we propose learning a universal editing function that can be applied to different objects in a feed-forward manner.
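
The feed-forward character of such an editor can be sketched in PyTorch; the dimensions and the simple MLP below are placeholders, not the actual SHAP-EDITOR architecture:

import torch
import torch.nn as nn

# A single network maps (3D latent, instruction embedding) -> edited latent,
# so each edit is one forward pass with no per-object optimisation.
class LatentEditor(nn.Module):
    def __init__(self, latent_dim=1024, text_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + text_dim, 2048),
            nn.ReLU(),
            nn.Linear(2048, latent_dim),
        )

    def forward(self, latent, instruction_emb):
        # Concatenate the object's latent with the encoded instruction and
        # predict the edited latent directly.
        return self.net(torch.cat([latent, instruction_emb], dim=-1))

editor = LatentEditor()
edited = editor(torch.randn(1, 1024), torch.randn(1, 512))
print(edited.shape)  # torch.Size([1, 1024])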

Segmenting Moving Objects via an Object-Centric Layered Representation
Junyu Xie, Weidi Xie, Andrew Zisserman
In NeurIPS, 2022
ArXiv / Bibtex / Project page / Code
@InProceedings{xie2022segmenting,
  title={Segmenting Moving Objects via an Object-Centric Layered Representation},
  author={Junyu Xie and Weidi Xie and Andrew Zisserman},
  booktitle={NeurIPS},
  year={2022}
}
In this paper, we propose the OCLR model for discovering, tracking, and segmenting multiple moving objects in a video without relying on human annotations. This object-centric segmentation model utilises depth-ordered layered representations and is trained following a Sim2Real procedure.
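
The depth-ordered layering can be illustrated with a simple compositing function; this is a NumPy sketch of the representation itself, not the OCLR model:

import numpy as np

def composite_layers(layer_masks):
    """`layer_masks`: (N, H, W) amodal object masks, index 0 = frontmost layer.
    Returns visible (modal) masks where nearer layers occlude farther ones."""
    occupied = np.zeros(layer_masks.shape[1:], dtype=bool)
    visible = []
    for mask in layer_masks:  # iterate front to back in depth order
        vis = mask.astype(bool) & ~occupied  # only unoccluded pixels survive
        visible.append(vis)
        occupied |= vis
    return np.stack(visible)

# Two overlapping squares; the front one occludes the back one.
a = np.zeros((8, 8)); a[1:5, 1:5] = 1
b = np.zeros((8, 8)); b[3:7, 3:7] = 1
print(composite_layers(np.stack([a, b]))[1].sum())  # 12: back layer loses the overlap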
            
This website template is originally designed by Jon Barron.