K4F-Net: Lightweight multi-view speech emotion recognition with Kronecker convolution and cross-language robustness
Journal article
MNiSW
70
2024 list
| Status: | |
| Authors: | Powroźnik Paweł, Skublewska-Paszkowska Maria |
| Disciplines: | |
| Year of publication: | 2025 |
| Document version: | Print | Electronic |
| Language: | English |
| Issue number: | 4 |
| Volume: | 21 |
| Pages: | 110 - 126 |
| Scopus® citations: | 0 |
| Databases: | Scopus | BazTech | Central & Eastern European Academic Source (CEEAS) | CNKI Scholar (China National Knowledge Infrastructure) | DOAJ (Directory of Open Access Journals) | EBSCO | ERIH PLUS | Index Copernicus | J-Gate |
| Result of statutory research | NO |
| Conference material: | NO |
| OA publication: | YES |
| License: | |
| Access mode: | Open journal |
| Text version: | Final published version |
| Time of release: | At the time of publication |
| OA publication date: | 31 December 2025 |
| Abstracts: | English |
| Speech emotion recognition has been gaining importance for years, but most of the existing models are based on a single signal representation or conventional convolutional layers with a large number of parameters. In this study, we propose a compact multi-representation architecture that combines four images of the speech signal: spectrogram, MFCC features, wavelet scalogram, and fuzzy transform maps. Furthermore, the application of Kronecker convolution for efficient feature extraction with an extended receptive field is shown. Another novelty is cross-fusion, a mechanism that models interactions between branches without significantly increasing complexity. The core of the network is complemented by a transformer-based block and language-independent adversarial learning. The model is evaluated in a scenario of quadruple cross-lingual tests covering four data corpora for four languages: English, German, Polish and Danish. It is trained on three languages and tested on the fourth, achieving a weighted accuracy of 96.3%. In addition, the influence of selected activation functions on the classification quality is investigated. Ablation analysis shows that removing the Kronecker convolution reduces the efficiency by 5.6%, and removing the fuzzy transform representation by 4.7%. The obtained results indicate that the combination of Kronecker convolution, multi-channel fusion, and adversarial learning is a promising direction for building universal, language-independent emotion recognition systems. |
