Enhanced Deepfake Speech Classification With Ensemble Learning
Journal article
MNiSW points: 100 (2024 list)
| Status: | |
| Authors: | Wójcicki Piotr |
| Disciplines: | |
| Year of publication: | 2026 |
| Document version: | Print | Electronic |
| Language: | English |
| Volume: | 14 |
| Pages: | 55535-55548 |
| Impact Factor: | 3.6 |
| Statutory research output: | NO |
| Conference material: | NO |
| OA publication: | YES |
| License: | |
| Access method: | Publisher's website |
| Text version: | Final published version |
| Time of publication: | At the time of publication |
| OA publication date: | 9 April 2026 |
| Abstracts: | English |
The advancement of speech synthesis technology has made detecting synthesized recordings a critical challenge in security and audio analysis. This study addresses the core challenge of cross-generator generalization under varying data availability by investigating the detection of synthesized speech using several machine learning architectures: Convolutional Neural Networks, Recurrent Neural Networks, and Transformers. To evaluate model robustness, we introduce a tiered dataset framework comprising three variants that simulate different levels of data scarcity and generator diversity. To strengthen detection, Bayesian Model Averaging (BMA) was applied as the primary ensemble aggregation method, allowing for uncertainty-aware integration of the individual model outputs. Our experimental results, evaluated using Equal Error Rate (EER) and ROC-AUC, demonstrate that the BMA ensemble significantly outperforms individual architectures, particularly in data-constrained and temporally varied scenarios. The models were intentionally trained on impoverished and imbalanced datasets to evaluate their resilience under adverse conditions, with the reduced variant being the most demanding due to its severe data constraints. In this challenging variant, the BMA approach achieved an EER of 1.02% and an AUC of 0.99, effectively mitigating the performance drop-off seen in standalone models, which reached EERs as high as 5.82%. Furthermore, on the complete dataset, the proposed system attained an exceptional EER of 0.22% and an accuracy of 99.82%. These findings offer valuable insights for researchers in cybersecurity and multimedia forensics, highlighting that model uncertainty-aware aggregation can compensate for limited training data and provide reliable detection of synthesized audio even across diverse temporal variations and augmentation-heavy environments.
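The two core ingredients of the abstract, BMA-style aggregation of per-model scores and the EER metric, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and the choice of validation log-likelihoods (with a uniform model prior) as the source of BMA weights is an assumption.

```python
import numpy as np

def bma_weights(log_likelihoods):
    """Posterior model weights from per-model validation log-likelihoods,
    assuming a uniform prior over models: w_k ∝ exp(log L_k)."""
    ll = np.asarray(log_likelihoods, dtype=float)
    w = np.exp(ll - ll.max())  # subtract max for numerical stability
    return w / w.sum()

def bma_aggregate(probs, weights):
    """Weighted average of per-model fake-speech probabilities.
    probs: (n_models, n_samples); weights: (n_models,)."""
    return np.average(np.asarray(probs, dtype=float), axis=0, weights=weights)

def equal_error_rate(scores, labels):
    """EER: operating point where the false-alarm rate equals the miss rate.
    labels: 1 = synthesized, 0 = genuine; higher score = more likely fake."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores)            # sweep thresholds high -> low
    labels = labels[order]
    tp = np.cumsum(labels)                 # fakes correctly accepted so far
    fp = np.cumsum(1 - labels)             # genuines wrongly flagged so far
    fnr = 1 - tp / labels.sum()            # miss rate at each threshold
    fpr = fp / (1 - labels).sum()          # false-alarm rate at each threshold
    idx = np.argmin(np.abs(fnr - fpr))     # closest crossing point
    return (fnr[idx] + fpr[idx]) / 2
```

In this reading, models that fit the validation data better receive exponentially larger weight in the ensemble average, which is how BMA lets more reliable models dominate when individual architectures degrade under data scarcity.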
