automatic speech emotion recognition; explainability; interpretability; speech emotion features
Speech Emotion Recognition (SER) is the task of identifying emotional states from the human voice. Automatic SER (ASER) is a research domain in which Machine Learning (ML) is used to extract and analyze speech features to predict emotional states. Using ML in a sensitive area like SER requires models that are transparent and reliable. For instance, understanding the underlying decision-making of an ASER model is crucial in real-world applications such as mental health monitoring systems. Researchers have therefore focused on advancing the interpretability and explainability of ASER models. Interpretability maximizes human understanding of complex processes by providing meaningful insights. Explainability presents those interpretable insights in a clear and human-understandable manner. Standard interpretability methods include feature importance, feature selection, and attention models. Explainability methods include SHapley Additive exPlanations (SHAP), visualizations such as embedding plots and saliency maps, and feature importance analysis. This systematic review explores the interpretability and explainability methods applied to speech emotion features. It aims to identify progress in the area, highlight potential research gaps, and motivate future research.
Details
Title
A Systematic Review of Interpretability and Explainability for Speech Emotion Features in Automatic Speech Emotion Recognition
Authors/Creators
Hiruni Maleesa Jayasinghe - Murdoch University, School of Information Technology, 90 South St, Murdoch, 6150, Western Australia, Australia
Kok Wai Wong - Murdoch University
Anupiya Nugaliyadde - Murdoch University, School of Information Technology