AT-HSTNet: An Efficient Hierarchical Action-Transformer Framework for Deepfake Video Detection
Journal article   Open access   Peer reviewed

Sameena Javaid, Marwa Chendeb El Rai, Abeer Elkhouly, Obada Al-Khatib, Aicha Beya Far and May El Barachi
Applied sciences, Vol.16(7), 3450
2026
Open Access CC BY V4.0

Abstract

Subjects: Chemistry; Chemistry, Multidisciplinary; Engineering; Engineering, Multidisciplinary; Materials Science; Materials Science, Multidisciplinary; Physical Sciences; Physics; Physics, Applied; Science & Technology; Technology
The rapid advancement of deepfake generation technologies presents significant challenges to the verification of digital video authenticity. Manipulated videos often contain time-dependent artifacts that are difficult to detect using conventional frame-based detection approaches. This paper introduces AT-HSTNet, an Action-Transformer-based Hierarchical Spatiotemporal Network designed for robust and computationally efficient deepfake video detection. The proposed framework adopts a multi-stage hierarchical architecture: frame-level visual features are extracted using an EfficientNet-B0 backbone, short- and medium-range temporal patterns are modeled through Bidirectional Long Short-Term Memory (BiLSTM) networks, and long-range temporal dependencies are captured using an action-aware Transformer operating on temporally aggregated representations. Unlike conventional video transformers that apply self-attention directly to raw frame-level features, the proposed action-aware attention mechanism reduces redundant computation and improves stability in temporal reasoning. Extensive experiments on the balanced FFIW-10K dataset demonstrate that AT-HSTNet achieves an accuracy of 98.7%, with 98.0% precision, 96.0% recall, and a 96.9% F1-score, outperforming representative CNN-BiLSTM and CNN-Transformer baseline architectures. In addition, AT-HSTNet is highly efficient, requiring only 0.45 GFLOPs and achieving an inference speed of approximately 30 FPS on consumer-grade GPU hardware. These results indicate that hierarchical temporal modeling is more effective for deepfake video detection when combined with action-aware attention.
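
The abstract contrasts self-attention over raw frame-level features with attention over temporally aggregated representations. The following is a minimal sketch of why aggregation reduces attention work, and it is not the authors' implementation: the mean-pooling rule, the segment length of 8, the 32-frame clip, and the function names `mean_pool_segments` and `attention_pairs` are all assumptions chosen for illustration. It relies only on the standard fact that full self-attention over n tokens computes n × n query-key scores.

```python
def mean_pool_segments(frame_features, segment_len):
    """Average consecutive frame vectors into one vector per segment.

    Mean pooling is an assumed stand-in for whatever aggregation
    the paper actually uses.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    segments = []
    for start in range(0, len(frame_features), segment_len):
        chunk = frame_features[start:start + segment_len]
        dim = len(chunk[0])
        # Element-wise mean over the frames in this chunk.
        segments.append(
            [sum(vec[i] for vec in chunk) / len(chunk) for i in range(dim)]
        )
    return segments


def attention_pairs(num_tokens):
    """Number of query-key score entries in full self-attention."""
    return num_tokens * num_tokens


# Hypothetical clip: 32 frames with 4-dimensional features per frame.
frames = [[float(t)] * 4 for t in range(32)]
segments = mean_pool_segments(frames, segment_len=8)

frame_level_pairs = attention_pairs(len(frames))      # attention over raw frames
segment_level_pairs = attention_pairs(len(segments))  # attention over segments

print(len(segments), frame_level_pairs, segment_level_pairs)
```

Under these assumed settings, pooling by a factor of 8 shrinks the attention score matrix by a factor of 64, since the cost is quadratic in the number of tokens. The actual savings in AT-HSTNet depend on the aggregation scheme described in the full paper.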
