Faster Image2Video Generation: A Closer Look at CLIP Image Embedding's Impact on Spatio-Temporal Cross-Attentions

Ashkan Taghipour; Morteza Ghahremani; Mohammed Bennamoun; Aref Miri Rekavandi; Zinuo Li; Hamid Laga; Farid Boussaid

doi:10.1109/ACCESS.2025.3595822

Journal article

Open access

Peer reviewed

IEEE Access, Vol. 13, pp. 141313-141327

2025

CC BY V4.0, Open Access

Keywords: Australia; CLIP image encoding; Computational modeling; Computer architecture; Diffusion models; image-to-video generation; Noise reduction; spatial cross-attention; temporal cross-attention; Text to video; Three-dimensional displays; Training; Video generation; Videos; Visualization

Abstract

This paper examines both the effectiveness, specifically in improving video consistency, and the computational burden of Contrastive Language-Image Pre-Training (CLIP) embeddings in video generation. The investigation is conducted using the Stable Video Diffusion (SVD) framework, a state-of-the-art method for generating high-quality videos from image inputs. The diffusion process in SVD generates videos by iteratively denoising noisy inputs over multiple steps. Our analysis reveals that employing CLIP in the cross-attention mechanism at every step of this denoising process has limited impact on maintaining subject and background consistency while imposing a significant computational burden on the video generation network. To address this, we propose Video Computation Cut (VCUT), a novel, training-free optimization method that significantly reduces computational demands without compromising output quality. VCUT replaces the computationally intensive temporal cross-attention with a linear layer that is computed once, cached, and reused across inference steps. This change reduces computation by up to 322T MACs per 25-frame video, decreases model parameters by 50M, and cuts latency by 20% compared to baseline methods. By streamlining the SVD architecture, our approach makes high-quality video generation more accessible, cost-effective, and eco-friendly, paving the way for real-time applications in telemedicine, remote learning, and automated content creation.
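The caching idea described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation. It assumes (an assumption, not stated in the abstract) that the conditioning is a single CLIP image-embedding token; in that case the softmax over one key is exactly 1, so the cross-attention output no longer depends on the queries and reduces to a linear map of that token, which can be computed once and reused at every denoising step. All names (`cross_attention`, `cached_linear`, the weight matrices) and the tiny dimensions are hypothetical.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, context, Wq, Wk, Wv, Wo):
    """Plain single-head cross-attention: each query attends over context tokens."""
    keys = [matvec(Wk, c) for c in context]
    values = [matvec(Wv, c) for c in context]
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for x in queries:
        q = matvec(Wq, x)
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        mixed = [sum(w * v[i] for w, v in zip(weights, values))
                 for i in range(len(values[0]))]
        outputs.append(matvec(Wo, mixed))
    return outputs

def cached_linear(context_token, Wv, Wo):
    """Assumed shortcut: with one context token, attention equals Wo @ (Wv @ token)."""
    return matvec(Wo, matvec(Wv, context_token))

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
dim = 4  # toy width; real models are far larger
Wq, Wk, Wv, Wo = (rand_matrix(rng, dim, dim) for _ in range(4))
image_embedding = [rng.uniform(-1, 1) for _ in range(dim)]  # stand-in for the CLIP token

# Computed once, before the denoising loop.
cached = cached_linear(image_embedding, Wv, Wo)

# Compare against full cross-attention on fresh "latents" at several steps.
max_error = 0.0
for step in range(3):
    latents = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]
    full = cross_attention(latents, [image_embedding], Wq, Wk, Wv, Wo)
    for row in full:
        max_error = max(max_error, max(abs(a - b) for a, b in zip(row, cached)))

print("max deviation from cached output:", max_error)
```

In this toy setting the query and key projections never influence the result, which is why skipping them on every step saves work; whether and how this carries over to the actual SVD layers is described in the paper itself.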

Details

Title: Faster Image2Video Generation: A Closer Look at CLIP Image Embedding's Impact on Spatio-Temporal Cross-Attentions
Authors/Creators: Ashkan Taghipour - The University of Western Australia
Morteza Ghahremani - Technical University of Munich
Mohammed Bennamoun - The University of Western Australia
Aref Miri Rekavandi - The University of Melbourne
Zinuo Li - The University of Western Australia
Hamid Laga - Murdoch University, Centre for Biosecurity and One Health
Farid Boussaid - The University of Western Australia
Publication Details: IEEE Access, Vol. 13, pp. 141313-141327
Publisher: IEEE
Number of pages: 15
Grant note: DP210101682; DP220102197 / Australian Government through the Australian Research Council
Identifiers: 991005807946807891
Murdoch Affiliation: Centre for Biosecurity and One Health; School of Information Technology; Centre for Healthy Ageing
Language: English
Resource Type: Journal article
