Explainable Multimodal Hybrid Vision Transformers for Emotional Speech Recognition
Book chapter (conference proceedings)
MNiSW points: 140 (conference)
| Status: | |
| Authors: | Powroźnik Paweł, Skublewska-Paszkowska Maria, Dziedzic Krzysztof, Barszcz Marcin, Chwaleba Kinga, Wach Weronika, Nunavath Vimala |
| Disciplines: | |
| Document version: | Printed; Electronic |
| Language: | English |
| Pages: | 330-338 |
| Statutory research output: | NO |
| Conference material: | YES |
| Conference name: | 28th European Conference on Artificial Intelligence; Including 14th Conference on Prestigious Applications of Intelligent Systems |
| Conference short name: | 28th ECAI 2025; 14th PAIS 2025 |
| Conference series URL: | LINK |
| Conference dates: | 25 October 2025 to 30 October 2025 |
| Conference city: | Bologna |
| Conference country: | ITALY |
| OA publication: | YES |
| License: | |
| Access method: | Publisher's website |
| Text version: | Final published version |
| Time of publication: | At the time of publication |
| Date of OA publication: | 25 October 2025 |
| Abstracts: | English |
| Human speech conveys information in a great variety of ways. Identifying the transmitted emotions is crucial for effective communication, social interaction, and human-computer interaction. Developing an efficient model is a demanding task due to subtle differences between emotions, subjective assessment, and language-specific sound characteristics. A wide range of deep learning methods has been developed for this challenging task. We present Explainable Multimodal Hybrid Vision Transformers (EM-H-ViT), a unified framework that fuses four orthogonal feature spaces: Fuzzy-Transform energy maps, discrete Wavelet coefficients, complex Fourier spectrograms, and Mel spectrum coefficients, within a lightweight CNN–ViT backbone. Each modality is first projected to an image-like tensor. Modality-specific convolutional branches capture local patterns, while a shared Vision Transformer aggregates long-range speech context. A cross-modal attention gate learns data-driven fusion weights and simultaneously produces pixel-level saliency maps, enabling post-hoc interpretation. We evaluate EM-H-ViT on four benchmark corpora of emotional speech in Polish, English, German, and Danish, using speaker-independent splits. The proposed model reaches 95.2%, 98.7%, 97.6%, and 95.1% accuracy, respectively. Ablation studies show that removing any single transform degrades performance by 3.4%-6.8%, confirming the transforms' complementarity. The obtained results demonstrate that the model delivers both superior accuracy and transparent reasoning, independently of language. |
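The abstract describes the architecture only at a high level. As a rough illustration of how such a multimodal CNN–ViT pipeline with a cross-modal attention gate could be organized, a minimal PyTorch sketch is given below. All module names, tensor shapes, and hyperparameters (four 64x64 input maps, embedding size, number of heads, per-modality fusion weights standing in for the paper's pixel-level saliency maps) are assumptions made for illustration and are not taken from the paper.

```python
# Hypothetical sketch of a multimodal CNN + Vision Transformer fusion model with a
# cross-modal attention gate, loosely following the architecture described in the
# abstract. Module names, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class ModalityBranch(nn.Module):
    """Small CNN that turns one image-like feature map into patch embeddings."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),         # 64x64 -> 32x32
            nn.Conv2d(64, embed_dim, 3, stride=2, padding=1), nn.ReLU(),  # 32x32 -> 16x16
        )

    def forward(self, x):                        # x: (B, 1, 64, 64)
        feats = self.conv(x)                     # (B, D, 16, 16)
        return feats.flatten(2).transpose(1, 2)  # (B, 256, D) patch tokens


class EMHViTSketch(nn.Module):
    """Four modality branches, a shared Transformer encoder, and gated fusion."""
    def __init__(self, num_modalities=4, embed_dim=128, num_classes=7):
        super().__init__()
        self.branches = nn.ModuleList(
            [ModalityBranch(embed_dim) for _ in range(num_modalities)])
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=4, dim_feedforward=256, batch_first=True)
        self.shared_vit = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # Attention gate: scores each modality's pooled token, yielding data-driven
        # fusion weights that can also be inspected as a coarse saliency signal.
        self.gate = nn.Linear(embed_dim, 1)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, inputs):                   # list of 4 tensors, each (B, 1, 64, 64)
        pooled = []
        for branch, x in zip(self.branches, inputs):
            tokens = branch(x)                   # local patterns from the CNN branch
            tokens = self.shared_vit(tokens)     # shared long-range context
            pooled.append(tokens.mean(dim=1))    # (B, D) per-modality summary
        stacked = torch.stack(pooled, dim=1)     # (B, M, D)
        weights = torch.softmax(self.gate(stacked), dim=1)  # (B, M, 1) fusion weights
        fused = (weights * stacked).sum(dim=1)               # (B, D)
        return self.classifier(fused), weights.squeeze(-1)


# Usage with dummy inputs standing in for the four transform-based feature maps.
model = EMHViTSketch()
dummy = [torch.randn(2, 1, 64, 64) for _ in range(4)]
logits, fusion_weights = model(dummy)
print(logits.shape, fusion_weights.shape)        # torch.Size([2, 7]) torch.Size([2, 4])
```

The returned fusion weights only expose modality-level importance; recovering the pixel-level saliency maps mentioned in the abstract would additionally require propagating the gate's attention back through the convolutional branches.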
