Fusion techniques can yield more promising results in music recognition when combined with prominent pre-trained classifiers such as ResNet50 and VGG16. Moreover, a larger collection of data samples can further improve the outcomes. Hence, an innovative model is proposed based on the feature fusion of four distinct input representations: Mel-spectrograms, spectrograms, scalograms, and Mel-Frequency Cepstral Coefficients (MFCCs). Classifiers pre-trained on ImageNet were applied as its core. The dataset used in this study was derived from Polish national dance music and comprises five national dances: the Kujawiak, the Polonez, the Mazur, the Oberek, and the Krakowiak. In addition, two separate datasets, with 3-second and 10-second audio samples, were created and compared. An adaptive attention module is proposed to reweight the most important features. The results were evaluated with the most popular classification metrics: testing accuracy, testing loss, precision, recall, and F1-score. Shapley Additive exPlanations (SHAP) were employed to assess which parts of the input feature maps are the most essential to the model. The proposed approach demonstrates outstanding results, exceeding 94% accuracy. Therefore, the study not only sets a new standard for recognizing Polish national dances but also highlights the broader promise of multi-representation fusion as a blueprint for next-generation audio classification.
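To illustrate the multi-representation fusion idea summarized above, the following minimal sketch fuses four per-representation embeddings (e.g. from the Mel-spectrogram, spectrogram, scalogram, and MFCC branches) with softmax attention weights. The abstract does not specify the exact architecture, so all function names, dimensions, and weight values here are hypothetical:

```python
import numpy as np

def adaptive_attention_fusion(feature_maps, scores):
    """Fuse per-representation embeddings with softmax attention.

    feature_maps: list of 1-D arrays (one per representation), equal length d
    scores: learnable score vector, shape (n_representations,)
    Returns the attention-weighted sum, shape (d,).
    """
    X = np.stack(feature_maps)              # (n, d)
    a = np.exp(scores - scores.max())       # numerically stable softmax
    a = a / a.sum()                         # attention weights, sum to 1
    return (a[:, None] * X).sum(axis=0)     # weighted fusion

# Toy example: four hypothetical branch embeddings, each 5-dimensional
rng = np.random.default_rng(0)
branch_embeddings = [rng.standard_normal(5) for _ in range(4)]
fused = adaptive_attention_fusion(branch_embeddings,
                                  np.array([1.0, 0.5, 0.5, 0.2]))
print(fused.shape)  # (5,)
```

In a full model, the score vector would be learned jointly with the backbone so that the attention adaptively emphasizes whichever representation is most informative for a given input.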