Enhanced Deepfake Speech Classification With Ensemble Learning
Journal article
MNiSW points: 100 (2024 list)
| Status: | |
| Authors: | Wójcicki Piotr |
| Disciplines: | |
| Year of publication: | 2026 |
| Document version: | Print | Electronic |
| Language: | English |
| Volume: | 14 |
| Pages: | 55535-55548 |
| Impact Factor: | 3.6 |
| Statutory research output: | NO |
| Conference material: | NO |
| OA publication: | YES |
| License: | |
| Access method: | Publisher's website |
| Text version: | Final published version |
| Time of publication: | At the time of publication |
| OA publication date: | 9 April 2026 |
| Abstracts: | English |
The advancement of speech synthesis technology has made detecting synthesized recordings a critical challenge in security and audio analysis. This study addresses the core challenge of cross-generator generalization under varying data availability by investigating the detection of synthesized speech using several machine learning architectures: Convolutional Neural Networks, Recurrent Neural Networks, and Transformers. To evaluate model robustness, we introduce a tiered dataset framework comprising three variants that simulate different levels of data scarcity and generator diversity. To strengthen detection, Bayesian Model Averaging (BMA) was applied as the primary ensemble aggregation method, allowing for uncertainty-aware integration of the individual model outputs. Our experimental results, evaluated using Equal Error Rate (EER) and ROC-AUC, demonstrate that the BMA ensemble significantly outperforms individual architectures, particularly in data-constrained and temporally varied scenarios. The models were intentionally trained on impoverished and imbalanced datasets to evaluate their resilience under adverse conditions, with the reduced variant being the most demanding due to its severe data constraints. In this challenging variant, the BMA approach achieved an EER of 1.02% and an AUC of 0.99, effectively mitigating the performance drop-off seen in standalone models, which reached EERs as high as 5.82%. Furthermore, on the complete dataset, the proposed system attained an exceptional EER of 0.22% and an accuracy of 99.82%. These findings offer valuable insights for researchers in cybersecurity and multimedia forensics, highlighting that model uncertainty-aware aggregation can compensate for limited training data and provide reliable detection of synthesized audio even across diverse temporal variations and augmentation-heavy environments.
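The two core ingredients of the abstract, BMA-style aggregation of per-model scores and the EER metric, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and the choice of validation log-likelihoods (with a uniform model prior) as the source of BMA weights is an assumption.

```python
import numpy as np

def bma_weights(log_likelihoods):
    """Posterior model weights from per-model validation log-likelihoods,
    assuming a uniform prior over models: w_k ∝ exp(log L_k)."""
    ll = np.asarray(log_likelihoods, dtype=float)
    w = np.exp(ll - ll.max())  # subtract max for numerical stability
    return w / w.sum()

def bma_aggregate(probs, weights):
    """Weighted average of per-model fake-speech probabilities.
    probs: (n_models, n_samples); weights: (n_models,)."""
    return np.average(np.asarray(probs, dtype=float), axis=0, weights=weights)

def equal_error_rate(scores, labels):
    """EER: operating point where the false-alarm rate equals the miss rate.
    labels: 1 = synthesized, 0 = genuine; higher score = more likely fake."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores)            # sweep thresholds high -> low
    labels = labels[order]
    tp = np.cumsum(labels)                 # fakes correctly accepted so far
    fp = np.cumsum(1 - labels)             # genuines wrongly flagged so far
    fnr = 1 - tp / labels.sum()            # miss rate at each threshold
    fpr = fp / (1 - labels).sum()          # false-alarm rate at each threshold
    idx = np.argmin(np.abs(fnr - fpr))     # closest crossing point
    return (fnr[idx] + fpr[idx]) / 2
```

In this reading, models that fit the validation data better receive exponentially larger weight in the ensemble average, which is how BMA lets more reliable models dominate when individual architectures degrade under data scarcity.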
