AT-HSTNet: An Efficient Hierarchical Action-Transformer Framework for Deepfake Video Detection

Sameena Javaid; Marwa Chendeb El Rai; Abeer Elkhouly; Obada Al-Khatib; Aicha Beya Far; May El Barachi

doi:10.3390/app16073450

Journal article

Open access

Peer reviewed

Applied sciences, Vol.16(7), 3450

2026

DOI: https://doi.org/10.3390/app16073450

Open Access CC BY V4.0

Subjects: Chemistry; Chemistry, Multidisciplinary; Engineering; Engineering, Multidisciplinary; Materials Science; Materials Science, Multidisciplinary; Physical Sciences; Physics; Physics, Applied; Science & Technology; Technology

Abstract

The rapid advancement of deepfake generation technologies presents significant challenges to the verification of digital video authenticity. Manipulated videos often exhibit time-dependent artifacts that are difficult to detect using conventional frame-based approaches. This paper introduces AT-HSTNet, an Action-Transformer-based Hierarchical Spatiotemporal Network designed for robust and computationally efficient deepfake video detection. The proposed framework adopts a multi-stage hierarchical architecture in which frame-level visual features are extracted using an EfficientNet-B0 backbone, short- and medium-range temporal patterns are modeled through Bidirectional Long Short-Term Memory (BiLSTM) networks, and long-range temporal dependencies are captured using an action-aware Transformer operating on temporally aggregated representations. Unlike conventional video transformers, which apply self-attention directly to raw frame-level features, the proposed action-aware attention mechanism reduces redundant computation and improves stability in temporal reasoning. Extensive experiments on the balanced FFIW-10K dataset demonstrate that AT-HSTNet achieves an accuracy of 98.7%, with 98.0% precision, 96.0% recall, and a 96.9% F1-score, outperforming representative CNN-BiLSTM and CNN-Transformer baseline architectures. In addition, AT-HSTNet is highly efficient, requiring only 0.45 GFLOPs and achieving an inference speed of approximately 30 FPS on consumer-grade GPU hardware. These results indicate that hierarchical temporal modeling is more effective for deepfake video detection when combined with action-aware attention.
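
As a rough illustration of why attending over temporally aggregated representations can be cheaper than attending over raw frame-level features, the sketch below pools consecutive frame features into segments and compares the size of the resulting self-attention score map. It is not the authors' implementation: the abstract does not say how AT-HSTNet aggregates frames, so the mean pooling, the window of 8 frames, the 32-frame clip, and the helper names `aggregate_segments` and `attention_pairs` are all assumptions made for illustration.

```python
from typing import List


def aggregate_segments(frame_features: List[List[float]], window: int) -> List[List[float]]:
    """Mean-pool consecutive frame feature vectors into one vector per window.

    Assumption: mean pooling stands in for whatever aggregation the paper uses.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    segments = []
    for start in range(0, len(frame_features), window):
        chunk = frame_features[start:start + window]
        dim = len(chunk[0])
        # Average each feature dimension across the frames in this window.
        segments.append([sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)])
    return segments


def attention_pairs(num_tokens: int) -> int:
    """Number of query-key score entries in one full self-attention map."""
    return num_tokens * num_tokens


# Hypothetical clip: 32 frames, each with a toy 4-dimensional feature vector.
frames = [[float(t)] * 4 for t in range(32)]
segments = aggregate_segments(frames, window=8)

print("frame-level tokens:", len(frames), "score entries:", attention_pairs(len(frames)))
print("segment-level tokens:", len(segments), "score entries:", attention_pairs(len(segments)))
```

Because a full self-attention map grows with the square of the number of tokens, pooling by a factor of 8 in this toy setting shrinks the map by a factor of 64. The real saving in AT-HSTNet depends on its actual aggregation scheme and sequence lengths, which the abstract does not specify.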

Details

Title: AT-HSTNet: An Efficient Hierarchical Action-Transformer Framework for Deepfake Video Detection
Authors/Creators: Sameena Javaid - University of Dubai
Marwa Chendeb El Rai - American University in Dubai
Abeer Elkhouly - University of Wollongong in Dubai
Obada Al-Khatib - University of Wollongong in Dubai
Aicha Beya Far - American University in Dubai
May El Barachi - University of Wollongong in Dubai
Publication Details: Applied sciences, Vol.16(7), 3450
Publisher: MDPI
Number of pages: 18
Identifiers: 991005879653207891
Murdoch Affiliation: School of Information Technology
Language: English
Resource Type: Journal article
