DiffVSR

Revealing an Effective Recipe for Taming Robust Video Super-Resolution Against Complex Degradations

Xiaohui Li1,2*,   Yihao Liu2*†,   Shuo Cao4,2,   Ziyan Chen3,   Shaobin Zhuang1,  
Xiangyu Chen1,   Yinan He2,   Yi Wang2,   Yu Qiao2,3
1Shanghai Jiao Tong University,   2Shanghai Artificial Intelligence Laboratory,  
3Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences,
4University of Science and Technology of China  
†Corresponding authors

Real-World Videos (upscale ×4)

Videos with Complex Degradations (upscale ×4)

Comparisons

[Interactive comparison viewer: four scenes, each with STAR vs. DiffVSR and SeedVR vs. DiffVSR sliders showing the Input, SeedVR, STAR, and DiffVSR results.]
  1. All SeedVR test videos were obtained from its project page as uncompressed originals.
  2. For a fair comparison, the videos from both methods underwent identical compression when concatenated.
  3. SeedVR focuses on generative super-resolution, whereas our DiffVSR emphasizes fidelity to the original content.

Abstract

Diffusion models have demonstrated exceptional capabilities in image restoration, yet their application to video super-resolution (VSR) faces significant challenges in balancing fidelity with temporal consistency. Our evaluation reveals a critical gap: existing approaches consistently fail on severely degraded videos, precisely where diffusion models' generative capabilities are most needed. These methods struggle primarily because they face an overwhelming learning burden: simultaneously modeling complex degradation distributions, content representations, and temporal relationships with limited high-quality training data. To address this fundamental challenge, we present DiffVSR, featuring a Progressive Learning Strategy (PLS) that systematically decomposes this learning burden through staged training, enabling superior performance on complex degradations. Our framework additionally incorporates an Interweaved Latent Transition (ILT) technique that maintains competitive temporal consistency without additional training overhead. Experiments demonstrate that our approach excels in scenarios where competing methods struggle, particularly on severely degraded videos. Our work reveals that improving the learning strategy, rather than focusing solely on architectural complexity, is the critical path toward robust real-world video super-resolution with diffusion models.

Method

Overview of our proposed DiffVSR framework. (a) Model architecture with enhanced UNet and VAE. (b) Architectural improvements for feature extraction and reconstruction. (c) Progressive Learning Strategy (PLS), our core innovation for handling complex degradations. (d) Multi-Scale Temporal Attention (MSTA) for capturing temporal dependencies at different scales.
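To make the staged decomposition concrete, the sketch below outlines what a PLS-style schedule could look like in PyTorch-flavored Python. It is a minimal, hypothetical illustration: the stage names, parameter-group prefixes, datasets, and step counts are our own assumptions, not DiffVSR's released configuration. The point is the structure: each stage unfreezes only the modules relevant to one sub-problem, so the model never has to learn degradations, content, and temporal structure all at once.

```python
# Hypothetical sketch of a Progressive Learning Strategy (PLS) schedule.
# Stage names, parameter-group prefixes, datasets, and step counts are
# illustrative assumptions, not DiffVSR's actual training configuration.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    trainable_prefixes: tuple   # parameter groups to unfreeze in this stage
    dataset: str                # data source for this stage
    steps: int

PLS_SCHEDULE = [
    # 1) Fit spatial layers to degraded content first, so degradation
    #    modeling is not entangled with temporal learning.
    Stage("spatial_adaptation", ("unet.spatial",), "degraded_images", 50_000),
    # 2) Freeze spatial layers; fit temporal attention on video clips.
    Stage("temporal_learning", ("unet.temporal",), "degraded_clips", 30_000),
    # 3) Short joint fine-tune on limited high-quality videos.
    Stage("joint_finetune", ("unet.",), "high_quality_videos", 10_000),
]

def run_pls(model, build_loader, train):
    """Run stages in order, unfreezing only the listed parameter groups."""
    for stage in PLS_SCHEDULE:
        for name, param in model.named_parameters():
            param.requires_grad = name.startswith(stage.trainable_prefixes)
        train(model, build_loader(stage.dataset), steps=stage.steps)
```

Decomposing training this way is what the abstract describes as reducing the learning burden: each stage sees a distribution simple enough to be learned from limited high-quality data.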

Illustration of the Interweaved Latent Transition (ILT) approach. By combining strategic noise rescheduling across overlapping regions with position-based latent interpolation between adjacent subsequences, this lightweight technique ensures temporal consistency without requiring additional training or extra computation.
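The position-based interpolation that ILT performs over overlapping subsequences can be sketched in a few lines. The snippet below is a minimal illustration under our own assumptions: the chunk length, overlap size, and linear weights are placeholders, and the noise rescheduling component of ILT is omitted here.

```python
import torch

def interweave_latents(chunks, overlap):
    """Blend adjacent latent subsequences over their shared frames.

    chunks: list of latent tensors shaped (T, C, H, W), where consecutive
    chunks share `overlap` frames. In each overlap, frame i is a linear
    interpolation between the previous chunk's tail and the next chunk's
    head, weighted by its position in the transition window.
    """
    out = [chunks[0]]
    for nxt in chunks[1:]:
        prev_tail = out[-1][-overlap:]
        # Position-based weights: 0 -> keep previous chunk, 1 -> keep next.
        w = torch.linspace(0.0, 1.0, overlap).view(-1, 1, 1, 1)
        blended = (1 - w) * prev_tail + w * nxt[:overlap]
        out[-1] = torch.cat([out[-1][:-overlap], blended], dim=0)
        out.append(nxt[overlap:])
    return torch.cat(out, dim=0)

# Example: four 16-frame chunks with a 4-frame overlap -> 52 total frames.
chunks = [torch.randn(16, 4, 32, 32) for _ in range(4)]
video_latents = interweave_latents(chunks, overlap=4)
print(video_latents.shape)  # torch.Size([52, 4, 32, 32])
```

Because every overlapping frame is weighted by its position in the transition window, adjacent subsequences agree on their shared frames and no visible seam appears at chunk boundaries.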

DiffVSR Demo