Supervised machine learning for audio emotion recognition: enhancing film sound design using audio features, regression models and artificial neural networks

Cunningham, Stuart, Ridley, Harrison, Weinel, Jonathan ORCID: https://orcid.org/0000-0001-5347-3897 and Picking, Richard (2021) Supervised machine learning for audio emotion recognition: enhancing film sound design using audio features, regression models and artificial neural networks. Personal and Ubiquitous Computing, 25. pp. 637-650. ISSN 1617-4909 (Print), 1617-4917 (Online) (doi:10.1007/s00779-020-01389-0)

PDF (Author's published manuscript)
34055_WEINEL_Supervised_machine_learning.pdf - Published Version
Available under License Creative Commons Attribution.

Official URL: https://doi.org/10.1007/s00779-020-01389-0

Abstract

The field of Music Emotion Recognition has become an established research sub-domain of Music Information Retrieval. Less attention has been directed towards the counterpart domain of Audio Emotion Recognition, which focuses upon detection of emotional stimuli resulting from non-musical sound. By better understanding how sounds provoke emotional responses in an audience, it may be possible to enhance the work of sound designers. The work in this paper uses the International Affective Digital Sounds set. A total of 76 features are extracted from the sounds, spanning the time and frequency domains. The features are first analysed to determine what level of similarity exists between pairs of features, measured using Pearson's r correlation coefficient. They are then used as inputs to a multiple regression model to determine their weighting and relative importance. Finally, the features are used as the input to two machine learning approaches, regression modelling and artificial neural networks, in order to determine their ability to predict the emotional dimensions of arousal and valence. It was found that a small number of strong correlations exist between the features and that a greater number of features contribute significantly to the prediction of emotional valence than to the prediction of arousal. Shallow neural networks perform significantly better than a range of regression models, and the best performing networks were able to account for 64.4% of the variance in prediction of arousal and 65.4% in the case of valence. These findings are a major improvement over those encountered in the literature. Several extensions of this research are discussed, including work related to improving data sets as well as the modelling processes.
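
The first two analysis steps named in the abstract, pairwise Pearson correlation between features and a multiple regression onto an emotional dimension, can be sketched as follows. This is an illustrative sketch only, not the authors' code: the data are synthetic placeholders rather than the International Affective Digital Sounds set, and the function names, the 0.8 correlation threshold, the matrix sizes, and the use of plain ordinary least squares are all assumptions made for the example.

```python
import numpy as np


def strong_feature_pairs(X, threshold=0.8):
    """Return (i, j, r) for feature pairs whose |Pearson r| >= threshold.

    X has shape (n_sounds, n_features). The threshold is an assumed value.
    """
    r = np.corrcoef(X, rowvar=False)  # columns are treated as variables
    n = r.shape[0]
    return [
        (i, j, float(r[i, j]))
        for i in range(n)
        for j in range(i + 1, n)
        if abs(r[i, j]) >= threshold
    ]


def fit_multiple_regression(X, y):
    """Ordinary least squares with an intercept.

    Returns the coefficients (intercept first) and the in-sample R^2,
    i.e. the proportion of variance in y accounted for by the fit.
    """
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    pred = A @ coef
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return coef, 1.0 - ss_res / ss_tot


# Synthetic placeholder data: 100 hypothetical sounds, 5 hypothetical features.
rng = np.random.default_rng(0)
n_sounds, n_features = 100, 5
X = rng.normal(size=(n_sounds, n_features))
# Make feature 4 a near-duplicate of feature 0 so one strong correlation exists.
X[:, 4] = X[:, 0] + 0.01 * rng.normal(size=n_sounds)
# Hypothetical "valence" target driven by features 1 and 2 plus a little noise.
y = 2.0 * X[:, 1] - 1.0 * X[:, 2] + 0.1 * rng.normal(size=n_sounds)

pairs = strong_feature_pairs(X)
coef, r2 = fit_multiple_regression(X, y)
print("strongly correlated feature pairs:", pairs)
print("in-sample R^2:", round(r2, 3))
```

The R^2 computed here is measured on the same data used for fitting, so it is not comparable to the variance figures quoted in the abstract, whose evaluation protocol is described in the paper itself.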

Item Type:	Article
Uncontrolled Keywords:	audio, emotion, machine learning, affective computing
Subjects:	M Music and Books on Music > M Music; Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Faculty / School / Research Centre / Research Group:	Faculty of Engineering & Science > School of Computing & Mathematical Sciences (CMS); Faculty of Liberal Arts & Sciences > Sound-Image Research Group; Faculty of Engineering & Science
Related URLs:	Publisher
Last Modified:	04 Mar 2022 13:06
URI:	http://gala.gre.ac.uk/id/eprint/34055