Faster Image2Video Generation: A Closer Look at CLIP Image Embedding's Impact on Spatio-Temporal Cross-Attentions
Journal article   Open access   Peer reviewed

Ashkan Taghipour, Morteza Ghahremani, Mohammed Bennamoun, Aref Miri Rekavandi, Zinuo Li, Hamid Laga and Farid Boussaid
IEEE Access, Vol. 13, pp. 141313-141327
2025
CC BY V4.0 Open Access

Abstract

Keywords: Australia; CLIP image encoding; Computational modeling; Computer architecture; Diffusion models; image-to-video generation; Noise reduction; spatial cross-attention; temporal-cross-attention; Text to video; Three-dimensional displays; Training; Video generation; Videos; Visualization
This paper examines how effective Contrastive Language-Image Pre-Training (CLIP) embeddings are in video generation, specifically in improving video consistency, and what computational burden they impose. The investigation is conducted using the Stable Video Diffusion (SVD) framework, a state-of-the-art method for generating high-quality videos from image inputs. The diffusion process in SVD generates videos by iteratively denoising noisy inputs over multiple steps. Our analysis reveals that employing CLIP in the cross-attention mechanism at every step of this denoising process has limited impact on maintaining subject and background consistency while imposing a significant computational burden on the video generation network. To address this, we propose Video Computation Cut (VCUT), a novel, training-free optimization method that significantly reduces computational demands without compromising output quality. VCUT replaces the computationally intensive temporal cross-attention with a linear layer whose output is computed once, cached, and reused across inference steps. This saves up to 322T multiply-accumulate operations (MACs) per 25-frame video, decreases model parameters by 50M, and cuts latency by 20% compared to baseline methods. By streamlining the SVD architecture, our approach makes high-quality video generation more accessible, cost-effective, and eco-friendly, paving the way for real-time applications in telemedicine, remote learning, and automated content creation.
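
The abstract does not spell out why a cached linear layer can stand in for cross-attention. One plausible reading, assumed here and not confirmed by the abstract, is that the CLIP image embedding acts as a single conditioning token. With only one key, the softmax weight is exactly 1 for every query, so the attention output no longer depends on the noisy latents and reduces to a linear map of the embedding. The NumPy sketch below illustrates that identity with a single attention head; all names, shapes, and weights are hypothetical and are not taken from the SVD or VCUT code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(x, context, w_q, w_k, w_v, w_o):
    # Standard single-head cross-attention (hypothetical layer, not SVD's).
    q = x @ w_q                                # (n_queries, d_head)
    k = context @ w_k                          # (n_context, d_head)
    v = context @ w_v                          # (n_context, d_head)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (n_queries, n_context)
    return softmax(scores, axis=-1) @ v @ w_o  # (n_queries, d_model)

def cached_linear(context, w_v, w_o):
    # Assumption: context holds exactly one token, so attention weights are 1
    # and the output is just the value and output projections of that token.
    return context @ w_v @ w_o                 # (1, d_model)

rng = np.random.default_rng(0)
d_model, d_ctx, d_head = 8, 16, 8              # illustrative sizes only
w_q = rng.normal(size=(d_model, d_head))
w_k = rng.normal(size=(d_ctx, d_head))
w_v = rng.normal(size=(d_ctx, d_head))
w_o = rng.normal(size=(d_head, d_model))

context = rng.normal(size=(1, d_ctx))          # one stand-in "image embedding" token
cache = cached_linear(context, w_v, w_o)       # computed once, before denoising

max_err = 0.0
for step in range(3):                          # stand-in for denoising steps
    x = rng.normal(size=(5, d_model))          # latents change at every step
    full = cross_attention(x, context, w_q, w_k, w_v, w_o)
    max_err = max(max_err, float(np.abs(full - cache).max()))

print("largest deviation from cached output:", max_err)
```

In this toy setting the query and key projections never influence the result, which is consistent with the abstract's reported drop in parameter count; whether VCUT removes exactly those weights is not stated in the abstract.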
