Authors :
Sadineni Havesa; Golla Susanth Paul; Dr. S. Jagadeesan
Volume/Issue :
Volume 11 - 2026, Issue 3 - March
Google Scholar :
https://tinyurl.com/27ava5fw
Scribd :
https://tinyurl.com/388487w6
DOI :
https://doi.org/10.38124/ijisrt/26mar2080
Abstract :
Recent advancements in neural speech synthesis have enabled the creation of highly realistic deepfakes that are nearly indistinguishable from genuine human voices. While these technologies have legitimate applications, they also pose serious risks of voice impersonation, misinformation, and financial fraud, making the detection of fake speech a pressing concern in speech forensics and cybersecurity. This paper proposes a dual-branch deep learning framework for effective deepfake audio detection that combines self-supervised speech representations with handcrafted acoustic stability features. The first branch extracts semantic speech embeddings with the pre-trained WavLM model, capturing contextual and phonetic information from the speech signal. The second branch extracts Mel-Frequency Cepstral Coefficient (MFCC) stability features and models their temporal dynamics with a Temporal Convolutional Network (TCN). The outputs of the two branches are fused by a multilayer perceptron classifier that decides whether an audio sample is real or fake.
Experiments on the Fake-or-Real (FoR) dataset show that the proposed fusion approach improves detection performance over models that use a single feature type. The results suggest that merging deep contextual embeddings with handcrafted stability features offers greater resilience to modern deepfake audio generation methods.
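To make the dual-branch design concrete, the following is a minimal PyTorch sketch, assuming torchaudio's pre-trained WavLM bundle and librosa for MFCC extraction. The layer sizes, the use of MFCC deltas as the "stability" features, and the two-layer dilated convolution stand-in for a full TCN are illustrative assumptions, not the authors' exact configuration.

import numpy as np
import librosa
import torch
import torch.nn as nn
import torchaudio

# Branch 1: semantic utterance embedding from pre-trained WavLM.
bundle = torchaudio.pipelines.WAVLM_BASE  # assumed checkpoint choice
wavlm = bundle.get_model().eval()

def wavlm_embedding(waveform: torch.Tensor) -> torch.Tensor:
    """Mean-pool last-layer WavLM frame features into one vector per utterance."""
    with torch.no_grad():
        features, _ = wavlm.extract_features(waveform)  # list of (B, T, 768)
    return features[-1].mean(dim=1)                     # (B, 768)

# Branch 2: MFCC "stability" features (here: MFCCs plus their deltas,
# which track frame-to-frame spectral variation) fed to a small TCN.
def mfcc_stability(audio: np.ndarray, sr: int = 16000) -> torch.Tensor:
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20)
    delta = librosa.feature.delta(mfcc)
    feats = np.concatenate([mfcc, delta], axis=0)        # (40, frames)
    return torch.from_numpy(feats).float().unsqueeze(0)  # (1, 40, frames)

class TCNBranch(nn.Module):
    """Two dilated 1-D convolutions: a minimal stand-in for a full TCN."""
    def __init__(self, in_ch: int = 40, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, hidden, kernel_size=3, padding=1, dilation=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).mean(dim=-1)  # (B, 64)

class FusionClassifier(nn.Module):
    """MLP over the concatenated branch outputs -> real/fake logit."""
    def __init__(self, dim: int = 768 + 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, emb: torch.Tensor, tcn_out: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([emb, tcn_out], dim=-1))

In use, a 16 kHz waveform would pass through both branches, and the fused logit would be trained with binary cross-entropy against real/fake labels from a corpus such as FoR.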
Keywords :
Deepfake Audio Detection, Self-Supervised Learning, WavLM, MFCC Stability, Speech Forensics, Temporal Convolutional Networks.