We present DiffVSR, a diffusion-based framework for real-world video super-resolution that maintains both high fidelity and temporal consistency. Diffusion models have demonstrated exceptional capabilities in image generation and restoration, yet extending them to video super-resolution remains difficult: complex motion dynamics must be handled while temporal coherence is preserved. Our approach introduces several key innovations to this end. For intra-sequence coherence, we develop a multi-scale temporal attention module and a temporal-enhanced VAE decoder that capture fine-grained motion details and ensure spatial accuracy. For inter-sequence stability, we propose a noise rescheduling mechanism combined with an interweaved latent transition approach, which enhances temporal consistency across frames without additional training overhead. To train DiffVSR effectively, we design a progressive learning strategy that transitions from simple to complex degradations, enabling robust optimization even with limited high-quality video data. Benefiting from these designs, DiffVSR trains stably and handles real-world video degradations effectively. Extensive experiments show that DiffVSR surpasses existing state-of-the-art video super-resolution methods in both visual quality and temporal consistency, setting a new benchmark for real-world video super-resolution and paving the way for high-quality, temporally consistent video restoration in practical applications.
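To make the progressive learning idea concrete, the sketch below shows one way a simple-to-complex degradation curriculum could be scheduled during training. The stage boundaries and the operator names ("downsample", "blur", "noise", "compress") are illustrative assumptions, not the exact training pipeline described in the paper.

```python
# A minimal sketch of a simple-to-complex degradation curriculum for a
# progressive learning strategy. Stage boundaries and operator names are
# assumptions for illustration, not DiffVSR's exact schedule.
STAGES = [
    (10_000, ["downsample"]),                               # stage 1: simple
    (30_000, ["downsample", "blur"]),                       # stage 2: moderate
    (None,   ["downsample", "blur", "noise", "compress"]),  # stage 3: complex
]

def degradations_for_step(step: int) -> list[str]:
    """Return the degradation operators active at a given training step."""
    for until_step, ops in STAGES:
        if until_step is None or step < until_step:
            return ops
    return STAGES[-1][1]

# Example: the set of active degradations widens as training progresses.
for step in (0, 15_000, 50_000):
    print(step, degradations_for_step(step))
```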
Overview of our proposed DiffVSR framework. (a) The overall model architecture integrates an enhanced UNet and VAE decoder for high-quality frame restoration. (b) Detailed designs of our modified UNet and VAE decoder blocks for better feature extraction and reconstruction. (c) The Progressive Learning Strategy, which enables stable training and robust performance across varying degradation levels. (d) The Multi-Scale Temporal Attention (MSTA) mechanism, designed to capture temporal dependencies at different scales. Notably, the spatial layer includes ResBlock2D and Spatial Attention, while the temporal layer contains ResBlock3D, Temporal Attention, and the MSTA module.
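As a rough illustration of the MSTA idea, the following PyTorch sketch applies self-attention along the temporal axis at several pooled temporal scales and fuses the per-scale results. The scale factors, average pooling, and mean fusion are our assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleTemporalAttention(nn.Module):
    """Hypothetical sketch of an MSTA block: temporal self-attention at
    several pooled scales, fused by averaging (assumed, not the paper's
    exact design). Input/output shape: (B, C, T, H, W)."""

    def __init__(self, channels: int, num_heads: int = 4, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Treat each spatial location as a length-T token sequence.
        b, c, t, h, w = x.shape
        seq = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, c)  # (B*H*W, T, C)
        out = 0
        for s in self.scales:
            # Pool the time axis by factor s, attend, then upsample back to T.
            pooled = F.avg_pool1d(seq.transpose(1, 2), kernel_size=s,
                                  stride=s, ceil_mode=True).transpose(1, 2)
            q = self.norm(pooled)
            attended, _ = self.attn(q, q, q)
            up = F.interpolate(attended.transpose(1, 2), size=t,
                               mode="linear", align_corners=False).transpose(1, 2)
            out = out + up
        out = out / len(self.scales)
        out = out.reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)  # (B, C, T, H, W)
        return x + out  # residual connection around the attention branch

# Example: an 8-frame clip of 64-channel features at 32x32 resolution.
msta = MultiScaleTemporalAttention(channels=64)
feats = torch.randn(1, 64, 8, 32, 32)
print(msta(feats).shape)  # torch.Size([1, 64, 8, 32, 32])
```

Attending at coarser temporal scales lets each position aggregate context over longer motion trajectories, which is one plausible way to capture dependencies "at different scales" as the caption describes.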