ICCV 2025

Beyond Spatial Frequency: Pixel-wise Temporal Frequency-based Deepfake Video Detection

Taehoon Kim, Jongwook Choi, Yonghyun Jeong, Haeun Noh, Jaejun Yoo, Seungryul Baek, Jongwon Choi
Chung-Ang University, NAVER Cloud, UNIST (Korea)
GitHub · arXiv · ICCV Paper · Hugging Face Demo

Abstract

We introduce a novel method for deepfake video detection that utilizes pixel-wise temporal frequency spectra. Unlike previous approaches that stack 2D frame-wise spatial frequency spectra, we extract pixel-wise temporal frequency by performing a 1D Fourier transform on the time axis per pixel, effectively identifying temporal artifacts. We also propose an Attention Proposal Module (APM) to extract regions of interest for detecting these artifacts. Our method demonstrates outstanding generalizability and robustness in various challenging deepfake video detection scenarios.

Method & Architecture

Frequency Extraction
Temporal artifacts and frequency extraction method
Our method captures subtle temporal artifacts in deepfake videos by applying a 1D Fourier transform to each pixel over time, unlike previous methods that rely on spatial frequency stacking.
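The per-pixel transform described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the clip shape, grayscale input, and lack of normalization are our assumptions.

```python
# Sketch of pixel-wise temporal frequency extraction: a 1D Fourier
# transform along the time axis, applied to every pixel independently.
# Shapes and preprocessing here are illustrative assumptions.
import numpy as np

def pixelwise_temporal_spectrum(video: np.ndarray) -> np.ndarray:
    """video: (T, H, W) grayscale frames -> (T//2 + 1, H, W) magnitude spectra."""
    spectrum = np.fft.rfft(video, axis=0)  # 1D real FFT over the time axis
    return np.abs(spectrum)

# Usage: a 16-frame clip at 64x64 yields 9 temporal frequency bins per pixel.
clip = np.random.rand(16, 64, 64).astype(np.float32)
spec = pixelwise_temporal_spectrum(clip)
print(spec.shape)
```

Note the contrast with frame-wise spatial spectra: a 2D FFT per frame describes texture within each frame, while this transform describes how each individual pixel flickers over time.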
Proposed Architecture
Frequency Feature Extractor and Joint Transformer Module
The pipeline consists of a Frequency Feature Extractor (with pixel-wise temporal Fourier transform and Attention Proposal Module) and a Joint Transformer Module for robust deepfake detection.

Experiments & Results

Attention Proposal Module (APM) Visualization
Visualization of APM proposed regions over time
The APM automatically focuses on regions (e.g., eyes, mouth) where temporal incoherence is most likely, enabling more precise detection of deepfake artifacts.
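The actual APM is learned end-to-end, but a crude hand-crafted stand-in conveys the intuition: regions with the most high-frequency temporal energy (flicker) are the natural candidates for temporal incoherence. The patch size, frequency cutoff, and top-k below are illustrative assumptions, not the paper's design.

```python
# Hand-crafted proxy for the Attention Proposal Module (APM): rank image
# patches by upper-band pixel-wise temporal frequency energy. The real
# APM is learned; this sketch only illustrates the selection intuition.
import numpy as np

def propose_regions(video: np.ndarray, patch: int = 8, top_k: int = 4):
    """Return (row, col) grid indices of the top_k patches with the most
    high-frequency temporal energy in a (T, H, W) clip."""
    spec = np.abs(np.fft.rfft(video, axis=0))   # (T//2 + 1, H, W)
    hi = spec[spec.shape[0] // 2:].sum(axis=0)  # upper-half-band energy map
    H, W = video.shape[1:]
    scores = hi.reshape(H // patch, patch, W // patch, patch).sum(axis=(1, 3))
    flat = np.argsort(scores.ravel())[::-1][:top_k]
    return [(int(i) // scores.shape[1], int(i) % scores.shape[1]) for i in flat]

rng = np.random.default_rng(1)
clip = rng.random((16, 64, 64))
# Inject frame-to-frame flicker into one patch to mimic a temporal artifact.
clip[::2, 8:16, 24:32] += 1.0
print(propose_regions(clip)[0])  # the patch containing the injected flicker
```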
Performance Comparison
Video-level AUC and method comparison
Our method achieves state-of-the-art video-level AUC across multiple datasets, demonstrating superior generalization and robustness compared to previous approaches.

Key Contributions

  • Pixel-wise temporal frequency: a 1D Fourier transform along the time axis per pixel exposes temporal artifacts that stacked frame-wise spatial spectra miss.
  • Attention Proposal Module (APM): automatically proposes regions of interest (e.g., eyes, mouth) where temporal incoherence is most likely.
  • Strong generalization and robustness across challenging deepfake video detection scenarios, with state-of-the-art video-level AUC on multiple datasets.

Limitations & Future Work

Limitations:
  • Heavy compression (H.264, JPEG, WebP) merges neighboring pixels and diminishes pixel-level motion, inducing domain shifts in the temporal frequency spectrum.
  • Compressed videos closely match the raw signal at low frequencies but diverge significantly in the high-frequency range.
Limitation Illustration
We measured the average pixel-wise temporal frequency under various compression schemes (H.264, JPEG, WebP).
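This compression effect can be reproduced in miniature. In the sketch below, spatial box-blurring stands in for compression (which similarly merges neighboring pixels); this is our simplification, not the paper's measurement pipeline.

```python
# Illustrative sketch of the limitation: compare the average pixel-wise
# temporal spectrum of a raw clip against a spatially smoothed one.
# Box-blurring is a stand-in for H.264/JPEG/WebP pixel merging.
import numpy as np

def avg_temporal_spectrum(video: np.ndarray) -> np.ndarray:
    """Average magnitude of the per-pixel temporal FFT, one value per bin."""
    spec = np.abs(np.fft.rfft(video, axis=0))  # (T//2 + 1, H, W)
    return spec.mean(axis=(1, 2))

def box_blur(video: np.ndarray, k: int = 3) -> np.ndarray:
    """Crude spatial smoothing: average each pixel with its k*k neighborhood."""
    pad = k // 2
    padded = np.pad(video, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    out = np.zeros_like(video)
    for dy in range(k):
        for dx in range(k):
            out += padded[:, dy:dy + video.shape[1], dx:dx + video.shape[2]]
    return out / (k * k)

rng = np.random.default_rng(0)
raw = rng.random((32, 32, 32))
raw_spec = avg_temporal_spectrum(raw)
blur_spec = avg_temporal_spectrum(box_blur(raw))

# Merging pixels attenuates pixel-level motion: high-frequency temporal
# energy drops sharply, while the low-frequency bins stay close to raw.
print(blur_spec[-1] < raw_spec[-1])
```

This mirrors the observation above: the low-frequency portion of the spectrum survives compression, while the high-frequency portion, where the detector's discriminative signal lives, is the part that shifts.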
Future Work:

This sensitivity to heavy compression remains a limitation. In future work, we will investigate temporal-frequency regularization techniques to mitigate performance degradation.

📚 Citation

Cite our paper using the following BibTeX entry:

@misc{kim2025spatialfrequencypixelwisetemporal,
  title={Beyond Spatial Frequency: Pixel-wise Temporal Frequency-based Deepfake Video Detection},
  author={Taehoon Kim and Jongwook Choi and Yonghyun Jeong and Haeun Noh and Jaejun Yoo and Seungryul Baek and Jongwon Choi},
  year={2025},
  eprint={2507.02398},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.02398},
}

🙏 Acknowledgement

This work was partly supported by Institute of Information & Communication Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT):
  • No. RS-2025-02263841: Development of a Real-time Multimodal Framework for Comprehensive Deepfake Detection Incorporating Common Sense Error Analysis
  • No. RS-2021-II211341: Artificial Intelligence Graduate School Program (Chung-Ang University)
  • No. RS-2020-II201336: AIGS program (UNIST)