DiffVSR

Enhancing Real-World Video Super-Resolution with Diffusion Models for Advanced Visual Quality and Temporal Consistency

Xiaohui Li1,2*,   Yihao Liu2*†,   Shuo Cao4,2,   Ziyan Chen3,   Shaobin Zhuang1,  
Xiangyu Chen1,   Yinan He2,   Yi Wang2,   Yu Qiao2,3
1Shanghai Jiao Tong University,   2Shanghai Artificial Intelligence Laboratory,  
3Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences,
4University of Science and Technology of China  
†Corresponding authors

Real-World Videos (upscale ×4)

Complex Degraded Videos (upscale ×4)

Comparisons

Interactive comparison sliders for four scenes, each showing the Input alongside SeedVR, STAR, and DiffVSR results (STAR vs. DiffVSR and SeedVR vs. DiffVSR).
  1. All SeedVR test videos are obtained as uncompressed originals from the SeedVR project page.
  2. For a fair comparison, both methods' videos underwent identical compression during concatenation.
  3. While SeedVR focuses on generative super-resolution, our DiffVSR emphasizes fidelity to the original content.

Abstract

We present DiffVSR, a diffusion-based framework for real-world video super-resolution that addresses the challenge of maintaining both high fidelity and temporal consistency. Diffusion models have demonstrated exceptional capabilities in image generation and restoration, yet applying them to video super-resolution remains difficult due to complex motion dynamics and the need for temporal coherence. To address these issues, our approach introduces several key innovations. For intra-sequence coherence, we develop a multi-scale temporal attention module and a temporal-enhanced VAE decoder that capture fine-grained motion details while ensuring spatial accuracy. For inter-sequence stability, we propose a noise rescheduling mechanism combined with an interweaved latent transition approach, which enhances temporal consistency across frames without introducing additional training overhead. To train DiffVSR effectively, we design a progressive learning strategy that transitions from simple to complex degradations, enabling robust optimization even with limited high-quality video data. Benefiting from these designs, DiffVSR achieves stable training and handles real-world degradation scenarios robustly. Extensive experiments show that DiffVSR surpasses existing state-of-the-art video super-resolution methods in both visual quality and temporal consistency, setting a new benchmark for real-world video super-resolution and paving the way for high-quality, temporally consistent video restoration in practical applications.
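
To make the progressive learning strategy more concrete, the sketch below shows one way a simple-to-complex degradation schedule could be organized in Python. The stage boundaries, degradation types, and parameter ranges are illustrative assumptions for this example, not the settings actually used to train DiffVSR.

import random

# Hypothetical staged degradation schedule for progressive learning.
# Stage boundaries and parameter ranges are illustrative, not the
# settings actually used to train DiffVSR.
STAGES = [
    # (blur_sigma, noise_level, downscale_factors, jpeg_quality)
    ((0.2, 1.0), (0, 5),  [2],       (80, 95)),  # stage 0: mild synthetic degradations
    ((0.2, 2.0), (0, 15), [2, 3],    (60, 95)),  # stage 1: moderate degradations
    ((0.2, 3.0), (0, 25), [2, 3, 4], (30, 95)),  # stage 2: complex, real-world-like
]

def sample_degradation(step, steps_per_stage=10_000):
    """Pick degradation hyper-parameters for the current training step,
    moving from simple to complex degradations as training progresses."""
    stage = min(step // steps_per_stage, len(STAGES) - 1)
    blur, noise, scales, jpeg = STAGES[stage]
    return {
        "blur_sigma": random.uniform(*blur),
        "noise_level": random.uniform(*noise),
        "downscale": random.choice(scales),
        "jpeg_quality": random.randint(*jpeg),
    }

# Early in training the sampler returns mild degradations ...
print(sample_degradation(step=1_000))
# ... while late in training it draws from the harder final stage.
print(sample_degradation(step=50_000))

In practice, the sampled parameters would drive the synthetic degradation pipeline that turns high-quality training clips into low-quality inputs, so early optimization sees easy restoration targets and later stages see degradations closer to real-world videos.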

Method

Overview of our proposed DiffVSR framework. (a) The overall architecture integrates an enhanced UNet and VAE decoder for high-quality frame restoration. (b) Detailed designs of our modified UNet and VAE decoder blocks for better feature extraction and reconstruction. (c) The Progressive Learning Strategy enables stable training and robust performance across various degradation levels. (d) The Multi-Scale Temporal Attention (MSTA) mechanism captures temporal dependencies at different scales. Notably, each spatial layer includes a ResBlock2D and Spatial Attention, while each temporal layer contains a ResBlock3D, Temporal Attention, and the MSTA module.
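
As a rough illustration of the temporal layer described above, the following PyTorch-style sketch shows one way multi-scale temporal attention could be realized: self-attention along the frame axis at the original frame rate and over temporally pooled tokens, with the per-scale outputs fused and added back residually. The class name, the choice of scales, and the pooling-based formulation are assumptions made for this example; the actual MSTA module in DiffVSR may differ.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleTemporalAttention(nn.Module):
    """Illustrative sketch of multi-scale temporal attention (not the exact
    DiffVSR implementation): queries attend over frames at the original rate
    and over temporally average-pooled frames for longer-range motion cues."""

    def __init__(self, channels, num_heads=8, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(channels, num_heads, batch_first=True)
            for _ in scales
        )
        self.fuse = nn.Linear(channels * len(scales), channels)

    def forward(self, x):
        # x: (B, T, C, H, W) video feature map
        b, t, c, h, w = x.shape
        # fold spatial positions into the batch so attention runs along time
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        outs = []
        for scale, attn in zip(self.scales, self.attn):
            if scale > 1:
                # temporally pooled keys/values summarize coarser motion
                kv = F.avg_pool1d(tokens.transpose(1, 2), kernel_size=scale)
                kv = kv.transpose(1, 2)
            else:
                kv = tokens
            out, _ = attn(tokens, kv, kv)
            outs.append(out)
        # fuse per-scale outputs and add a residual connection
        fused = self.fuse(torch.cat(outs, dim=-1))
        return fused.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2) + x

# Example: 8 frames of 64x64 features with 128 channels
feats = torch.randn(1, 8, 128, 64, 64)
msta = MultiScaleTemporalAttention(channels=128)
print(msta(feats).shape)  # torch.Size([1, 8, 128, 64, 64])

In the layer composition above, a block like this would complement the ResBlock3D and plain Temporal Attention, operating independently at each spatial location so that attention is computed only along the temporal axis.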

DiffVSR Demo