Recognizing Emotion from Speech Using Machine Learning and Deep Learning

By Abhay Anupam Gupta, Computer Engineering, VESIT

Introduction

Human communication through spoken language is the basis of information exchange and has been a central aspect of society since the first human settlements.

In the same way, emotions go back to a primordial instinct that predates the spoken language we know today, and they can be considered the first natural communication strategy.

This project is about detecting the emotions expressed by a speaker while talking. For example, speech produced in a state of fear, anger, or joy becomes loud and fast, with a higher and wider pitch range, whereas emotions such as sadness or tiredness generate slow, low-pitched speech.

Detecting human emotions through voice- and speech-pattern analysis has many applications, such as improving human-machine interaction.

In particular, we present a classification model for emotions elicited by speech, based on a deep neural network (CNN), an SVM, and an MLP classifier, using acoustic features such as Mel-Frequency Cepstral Coefficients (MFCCs). The model has been trained to classify eight different emotions (neutral, calm, happy, sad, angry, fearful, disgust, surprised).

SER system design flowchart

Dataset

The dataset is built using 5252 samples from:

1. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset.

2. The Toronto Emotional Speech Set (TESS) dataset.

Methodology

The emotion recognition model proposed here is based on a deep learning strategy using a convolutional neural network (CNN), alongside a Support Vector Machine (SVM) classifier and an MLP classifier. The key idea is to use the MFCCs, commonly referred to as the "spectrum of a spectrum", as the only features for training the model.

MFCCs are the coefficients that together make up the Mel-frequency cepstrum (MFC), and they have been shown to be the state of the art for sound representation in automatic speech recognition tasks. MFC coefficients are widely used mainly because of their ability to represent the amplitude spectrum of a sound wave in a compact vectorial form.

The audio file is divided into frames, usually using a fixed window size, to obtain statistically stationary waves. The amplitude spectrum is then normalized and mapped onto the Mel frequency scale. This operation emphasizes the frequencies that matter most for a meaningful reconstruction of the wave, as the human auditory system perceives them.

For each audio file, 40 features have been extracted. The features were generated by converting each audio file to a floating-point time series and then computing an MFCC sequence from that time series.
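As a rough illustration, this extraction step could look like the following sketch using the librosa library; the loading parameters and helper names here are assumptions, not the project's exact code.

```python
# Hypothetical sketch of the MFCC extraction step with librosa.
# The n_mfcc value of 40 follows the description above; other
# parameters are assumptions.
import numpy as np
import librosa

def extract_mfcc_features(path, n_mfcc=40):
    """Load an audio file as a floating-point time series and return
    the mean of its MFCC sequence as a fixed-length feature vector."""
    # librosa.load returns a float32 time series and its sample rate
    signal, sample_rate = librosa.load(path, res_type="kaiser_fast")
    # Compute the MFCC sequence (n_mfcc coefficients per frame)
    mfcc = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=n_mfcc)
    # Average over time to obtain one 40-dimensional vector per file
    return np.mean(mfcc, axis=1)

# Example: build the feature matrix for a list of file paths
# X = np.array([extract_mfcc_features(p) for p in audio_paths])
```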

CNN

The deep neural network (CNN) designed for the classification task is shown in Fig. 1. The network works on a vector of 40 features for each audio file provided as input; the 40 values are a compact numerical representation of an audio frame of 2 s length. Consequently, we provide as input a tensor of size <number of training files> x 40 x 1, on which we perform one round of a 1D convolution with a ReLU activation function, a dropout of 20%, and a 2 x 2 max-pooling function.

The rectified linear unit (ReLU) can be formalized as g(z) = max{0, z}; it passes large activations through unchanged, which makes it a good choice for representing hidden units. Pooling helps the model focus only on the principal characteristics of every portion of the data, making them invariant to their position. We run the process described above once more with a different kernel size. We then apply another dropout and flatten the output to make it compatible with the next layers.

Finally, we apply one Dense layer (fully connected layer) with a softmax activation function, reducing the output size from 640 elements to 8 and estimating the probability distribution over the classes, encoded as 0 = Neutral, 1 = Calm, 2 = Happy, 3 = Sad, 4 = Angry, 5 = Fearful, 6 = Disgust, 7 = Surprised.
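A minimal sketch of this architecture in Keras might look as follows; the filter counts and kernel sizes are illustrative assumptions, while the 40 x 1 input shape, the 20% dropout, and the 8-way softmax follow the description above.

```python
# Hedged sketch of the 1D CNN described above, using Keras.
# Filter counts and kernel sizes are illustrative assumptions.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Dropout, Flatten, Dense

model = Sequential([
    # First round: 1D convolution + ReLU, 20% dropout, max pooling
    Conv1D(64, kernel_size=5, activation="relu", input_shape=(40, 1)),
    Dropout(0.2),
    MaxPooling1D(pool_size=2),
    # Second round with a different kernel size, as described in the text
    Conv1D(64, kernel_size=3, activation="relu"),
    Dropout(0.2),
    # Flatten so the output is compatible with the dense layer
    Flatten(),
    # Dense softmax layer mapping to the 8 emotion classes
    Dense(8, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```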

CNN Layer Description

MLP

A multilayer perceptron (MLP) is a class of feedforward artificial neural network (ANN). An MLP is trained with a supervised learning technique called backpropagation.

Its multiple layers and non-linear activations distinguish an MLP from a linear perceptron: it can separate data that is not linearly separable.
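A minimal MLP baseline of this kind could be set up with scikit-learn as sketched below; the hidden-layer size and training settings are assumptions rather than the project's exact configuration.

```python
# Minimal sketch of an MLP baseline with scikit-learn; the hidden
# layer size and training settings here are assumptions.
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(300,),   # assumed architecture
                    activation="relu",           # non-linear activation
                    solver="adam",               # backpropagation-based training
                    max_iter=500)

# X_train: MFCC feature vectors (n_samples x 40), y_train: emotion labels
# mlp.fit(X_train, y_train)
# y_pred = mlp.predict(X_test)
```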

SVM

“Support Vector Machine” (SVM) is a supervised machine learning algorithm that can be used for both classification and regression problems, though it is mostly used for classification.

In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate.

Data can be scaled before it is passed to an SVM classifier so that attributes with larger numeric ranges do not dominate those with smaller ranges. Scaling also helps avoid numerical difficulties during the calculation.
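A hedged sketch of such a pipeline with scikit-learn, scaling the 40 MFCC features before the SVM, is shown below; the kernel and regularization values are illustrative assumptions.

```python
# Sketch of the SVM baseline: features are scaled first, as discussed
# above, then fed to an SVC. Kernel and C value are assumptions.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

svm_clf = make_pipeline(
    StandardScaler(),          # bring all 40 MFCC features to a comparable range
    SVC(kernel="rbf", C=1.0),  # assumed kernel/regularization settings
)

# svm_clf.fit(X_train, y_train)
# y_pred = svm_clf.predict(X_test)
```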

Result

F1-score for each class compared to the baselines (SVM, MLP) and state of art.
The trend of the cost function of our deep learning model over 200 epochs
The trend of the accuracy of our deep learning model over 200 epochs

Results of the CNN model on the test set per each class

MLP Classifier achieved an F1 score of 0.83 over the 8 classes.
SVM classifier achieved an F1 score of 0.82.
The final choice was the deep learning model, which obtained an F1 score of 0.85.
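For reference, per-class and averaged F1 scores of this kind can be computed with scikit-learn as in the sketch below; the variable names are placeholders for the held-out labels and the model's predictions.

```python
# Sketch of how per-class and overall F1 scores could be computed.
from sklearn.metrics import f1_score, classification_report

def report_scores(y_test, y_pred):
    """Print per-class F1 scores and an averaged F1 over the 8 emotions."""
    print(classification_report(y_test, y_pred))
    print("Per-class F1:", f1_score(y_test, y_pred, average=None))
    print("Weighted F1: ", f1_score(y_test, y_pred, average="weighted"))

# Example: report_scores(y_test, cnn_predictions)
```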

Conclusion

In this work, we presented an architecture based on deep neural networks for the classification of emotions using audio recordings from the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Toronto Emotional Speech Set (TESS). The model has been trained to classify eight different emotions (neutral, calm, happy, sad, angry, fearful, disgust, surprised) and obtained an overall F1 score of 0.85, with the best performance on the Happy class (0.90) and the worst on the Calm class (0.77). To obtain this result, we extracted the MFCC features (spectrum of a spectrum) from the audio files used for training.

Links

Project: Github

Video: Youtube

Blog: Medium

Team Members

Aditya Karmokar — karmoadity@gmail.com

Abhay Gupta — abhay8463@gmail.com

Khadija Mohamad Haneefa — diju19599@gmail.com

Chennaboina Hemantha Lakshmi — hemanthalakshmi.channaboina@gmail.com
