Convolutional Neural Networks for Urban Sound Classification

By Neeraj Ochani, Computer Engineering, VESIT


Sound plays an important role in every aspect of human life, from personal security to critical surveillance, and is a key element in developing automated systems for these fields. A few such systems are already on the market, but their efficiency remains a concern for deployment in real-life scenarios. The learning capabilities of deep architectures can therefore be used to build sound classification systems that overcome the efficiency issues of traditional approaches. The ability of convolutional neural networks (CNNs) to learn discriminative spectro-temporal patterns makes them well suited to sound classification. The solution proposed in this paper uses deep networks to classify environmental sounds based on spectrograms generated from those sounds. We trained a CNN on spectrogram images of individual environmental sounds so that it predicts the class of a sound given as input. We used one dataset for our experiment: UrbanSound8k. The proposed network achieved an accuracy of 92%, compared to 89% and 80% for two pretrained models widely used in image classification, VGGNet and AlexNet respectively. From this experiment we conclude that the proposed approach, classifying sounds through their spectrogram images, can be used efficiently to build sound classification and recognition systems.


Deep Learning, Convolutional Neural Network, Spectrograms, UrbanSound8k, Environmental Sounds


The analysis and understanding of sound in the urban environment is increasingly important as cities and their populations grow. Studies have shown that noise pollution has a direct impact on health and child development. As populations continue to grow around the world, including in the cities we live in, urban noise will create an ever larger global health problem. It is our responsibility as researchers and scientists to understand these problems so that solutions can be created. Any proposed solution will need to maximize success and minimize economic impact in order to be practically implemented. A first step toward such solutions is automatically predicting the class label of an acoustic event.

A common challenge in environmental sound classification is identifying sound sources in the presence of real background noise, which causes confusion with noise-like continuous sounds and sensitivity to background interference. Data collection is a large undertaking that requires a good amount of technical knowledge, as each sound undergoes signal compression in a certain way. Moreover, because multiple categories contain similar sounds, classification is challenging: if the parameters are not controlled in an optimal manner, the results will be inaccurate and unsatisfactory.

To date, a variety of signal processing and machine learning techniques have been applied to the problem, including matrix factorization, dictionary learning, wavelet filter banks, and most recently deep neural networks. In particular, deep convolutional neural networks (CNNs) are in principle very well suited to environmental sound classification: applied to spectrogram-like inputs, they are capable of capturing energy modulation patterns across time and frequency, and of identifying spectro-temporal patterns representative of different sound classes even when part of the sound is masked (in time or frequency) by other sources of noise.

The drawbacks of these previous approaches include difficulty in detecting audio events that are only temporarily present in an acoustic scene; examples of such events include vehicles, car horns, and footsteps, which require precise temporal detection. These approaches are unable to reduce stationary noise or enhance transient sound events in the recordings. Multitask learning is also a limitation of the above solutions, as are data privacy concerns that require classification to be performed directly on mobile sensor devices. Finally, they incur high latency and high computational costs, preventing fast model prediction.

In this paper, we propose a convolutional neural network that extracts features on the basis of Mel-frequency cepstral coefficients, which collectively make up an MFC. The Mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear Mel scale of frequency. This representation provides better accuracy than other pretrained models, thus overcoming the drawbacks mentioned above for environmental sound classification.

Related Work

There have been several attempts to overcome the problem of environmental sound classification, and research in this field is ongoing toward an efficient, low-latency solution. Methods such as support vector machines (SVM), Gaussian mixture models (GMM), and K-nearest neighbors (KNN) impose a prohibitive processing cost at the embedded level, while fully connected neural networks tend to achieve the best performance with the lowest number of operations, though at the cost of a longer conversion (encoding) time. Our approach provides low latency, fewer computational instructions, and a relatively smaller space-time trade-off.

The SVM approach considers the hyperplane that leaves the maximum margin; however, multiple hyperplanes may separate the training data correctly, which increases computational cost, and the soft-margin parameter that controls this trade-off is difficult to tune. KNN predicts the most frequent class among the k nearest neighbors of the test point by a minimum-distance criterion; it needs a significant amount of memory to run, since it requires the whole training set.

Our proposed solution takes into account shared weights and Mel-frequency cepstral coefficients (MFCCs), together with sub-sampling, which is used to reduce the resolution of the feature map. This addresses the problem of distortion and shifts in the final output, and such networks are highly accurate in detecting patterns in spectro-temporal data. The result is better accuracy than other pretrained models, overcoming the drawbacks of the above approaches: the model takes less time to train and generalizes well, predicting class labels with high precision.


Data Exploratory Analysis

The sound excerpts in the dataset are digital audio files in .wav format. Sound waves are digitized by sampling them at discrete intervals known as the sampling rate (typically 44.1 kHz for CD-quality audio, meaning samples are taken 44,100 times per second). Each sample is the amplitude of the wave at a specific instant, and the bit depth determines how finely the sample is resolved, also referred to as the dynamic range of the signal (typically 16 bits, which means a sample can take one of 65,536 amplitude values). For audio analysis we use the following libraries: IPython.display.Audio, which allows us to play audio directly in the Jupyter notebook, and Librosa, a Python package for music and audio processing by Brian McFee, which allows us to load audio into the notebook as a NumPy array for analysis and manipulation.
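The sampling and bit-depth figures above can be illustrated with a small sketch (names and the 440 Hz test tone are illustrative, not part of the paper's code): a sine wave sampled at 44.1 kHz and quantized to 16-bit integers.

```python
import numpy as np

# Sample a 440 Hz sine wave at the CD-quality rate of 44.1 kHz.
sample_rate = 44100          # samples per second
duration = 0.01              # seconds of audio
t = np.arange(int(sample_rate * duration)) / sample_rate
wave = np.sin(2 * np.pi * 440 * t)          # amplitudes in [-1, 1]

# A 16-bit depth gives 2**16 = 65,536 possible amplitude values;
# scale to the signed int16 range and round to quantize.
bit_depth = 16
quantized = np.round(wave * (2 ** (bit_depth - 1) - 1)).astype(np.int16)

print(len(t))               # 441 samples in 10 ms at 44.1 kHz
print(quantized.dtype)      # int16
```

Ten milliseconds of audio already yields 441 samples, which is why raw waveforms are rarely fed to a classifier directly.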

Fig. 1. The image shows the count for each class label in the dataset
Fig. 2. Each Class Label Waveform

From the above images we can conclude that it is hard to visualize the difference between some of the classes. In particular, the waveforms of the repetitive sounds of air conditioner, drilling, engine idling, and jackhammer are similar in shape.

Fig. 3. Waveform Audio File Structure

Each audio file is saved in .wav (Waveform Audio File) format, from which multiple audio properties can be extracted; the three major ones are: Audio channels, where most of the samples have two channels (stereo) and a few have just one (mono); Sample rate, where a wide range of rates has been used across the samples (from 8 kHz to 96 kHz), which is a concern; and Bit depth, which also varies widely (from 4-bit to 32-bit). Because these properties vary so much across the class labels in the dataset, data preprocessing is a necessity.

Data pre-processing

The analysis above found that the three properties Audio Channels, Sample Rate, and Bit Depth vary greatly across the sound excerpts, which makes this step all the more important; it is carried out with the Librosa Python package. Preprocessing is done with Librosa's load() function, which by default resamples the audio to 22.05 kHz, normalizes the data so the amplitude values range between -1 and 1, and flattens the audio channels into mono.
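The mono-flattening and normalization that librosa.load() performs can be sketched in plain NumPy (an illustrative re-implementation under the assumption of 16-bit stereo input, not librosa's actual code):

```python
import numpy as np

# Synthetic 16-bit stereo signal of shape (samples, channels),
# standing in for data read from a .wav file.
stereo_int16 = np.array([[1000, -2000],
                         [32767, -32768],
                         [0, 500]], dtype=np.int16)

# Flatten to mono by averaging the two channels.
mono = stereo_int16.astype(np.float32).mean(axis=1)

# Normalize by the 16-bit full-scale value so amplitudes lie in [-1, 1].
normalized = mono / 32768.0

print(normalized.shape)   # (3,) — a single mono channel
```

This removes both the channel-count and bit-depth variation noted above, since every file ends up as a mono float array in [-1, 1].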

Fig. 4. Dual Channel Audio (Stereo)
Fig. 5. Single Channel Audio (Mono)
Feature Extraction

The output of Librosa is compared with the default output of SciPy's wavfile module on a particular file from the dataset: the original sample rate is 44,100 Hz, while the Librosa sample rate is 22,050 Hz. Librosa also normalizes the data so its values range between -1 and 1, which removes the complication of the dataset having a wide range of bit depths: the original audio file's min-max range is -23628 to 27507, while the Librosa audio file's range is -0.50266445 to 0.74983937. Finally, it converts the signal to mono, so the number of channels is always 1.

The next step is to extract the features we will need to train our model. To do this, we create a visual representation of each of the audio samples that allows us to identify features for classification, using the same techniques used to classify images with high accuracy.

Spectrograms are a useful technique for visualizing the spectrum of frequencies of a sound and the way it varies over a very short period of time. We will be using a similar technique known as Mel-frequency cepstral coefficients (MFCCs). The main difference is that a spectrogram uses a linearly spaced frequency scale (so each frequency bin is spaced an equal number of Hertz apart), whereas an MFCC uses a quasi-logarithmic frequency scale, which is more similar to how the human auditory system processes sound.
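The quasi-logarithmic spacing can be made concrete with the common HTK-style Mel formula, m = 2595·log10(1 + f/700) (one of several Mel-scale variants; the helper names below are illustrative): bins equally spaced in Mel are progressively wider apart in Hertz, giving finer resolution at low frequencies.

```python
import math

def hz_to_mel(f):
    # HTK-style Mel scale: m = 2595 * log10(1 + f / 700)
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse of the formula above.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Five points equally spaced on the Mel scale between 0 and 8 kHz...
low, high = hz_to_mel(0.0), hz_to_mel(8000.0)
mel_points = [low + i * (high - low) / 4 for i in range(5)]
hz_points = [mel_to_hz(m) for m in mel_points]

# ...are increasingly far apart in Hertz: low frequencies get
# finer resolution, mimicking human hearing.
widths = [hz_points[i + 1] - hz_points[i] for i in range(4)]
print(widths)
```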

Fig. 6. Visual Representation of a Sound Wave in Time-domain, Spectrogram, MFCC

In sound processing, the Mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear Mel scale of frequency. Mel-frequency cepstral coefficients (MFCCs) are the coefficients that collectively make up an MFC. They are derived from a type of cepstral representation of the audio clip (a nonlinear “spectrum-of-a-spectrum”).

Fig. 7. MFCC Spectrogram Generated from Time Series Audio Data for feature extraction

The difference between the cepstrum and the Mel-frequency cepstrum is that in the MFC the frequency bands are equally spaced on the Mel scale, which approximates the human auditory system’s response more closely than the linearly spaced frequency bands of the normal cepstrum. This frequency warping can allow for a better representation of sound, for instance in audio compression.

For each audio file in the dataset, we extract an MFCC (giving us an image-like representation of every audio sample) and store it in a Pandas DataFrame along with its classification label. For this we use Librosa’s mfcc() function, which generates an MFCC from time-series audio data.
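This extraction step might be sketched as follows, assuming 40 MFCCs zero-padded to a fixed 174 frames so every sample yields the same input size (the function names and the padding helper are illustrative, not the paper's code):

```python
import numpy as np

def pad_mfcc(mfcc, max_len=174):
    """Zero-pad (or truncate) an (n_mfcc, frames) array to max_len frames,
    so every excerpt becomes a fixed-size (40, 174) input for the CNN."""
    if mfcc.shape[1] < max_len:
        pad = max_len - mfcc.shape[1]
        return np.pad(mfcc, ((0, 0), (0, pad)), mode='constant')
    return mfcc[:, :max_len]

def extract_features(path, n_mfcc=40):
    # librosa.load resamples to 22.05 kHz mono by default;
    # librosa.feature.mfcc returns an (n_mfcc, frames) matrix.
    import librosa
    audio, sr = librosa.load(path)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return pad_mfcc(mfcc)
```

Padding is needed because the UrbanSound8k excerpts differ in duration, so the raw MFCC matrices differ in frame count.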

Model Development

The next phase of the proposed solution is the development of the convolutional neural network. Before building the model, it is necessary to convert (encode) the categorical text labels into model-understandable numeric data, which is done with the sklearn.preprocessing.LabelEncoder class from scikit-learn, an open-source machine learning library designed to interoperate with the numerical and scientific libraries NumPy and SciPy. The dataset is split into a training set and a testing set: the testing set comprises 20% of the dataset (fold 9 and fold 10), while the remaining 80% (fold 1 to fold 8) forms the training set.
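The encoding step could look like the sketch below, with a few illustrative class names standing in for the ten UrbanSound8k labels (the one-hot step is shown with NumPy; keras.utils.to_categorical does the same job):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Illustrative labels standing in for the 10 UrbanSound8k classes.
labels = ['dog_bark', 'siren', 'dog_bark', 'car_horn']

# LabelEncoder maps each distinct text label to an integer
# (classes are sorted, so car_horn=0, dog_bark=1, siren=2).
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)        # [1, 2, 1, 0]

# One-hot encode the integers for the categorical cross-entropy loss.
one_hot = np.eye(len(encoder.classes_))[encoded]
print(one_hot.shape)  # (4, 3)
```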

The model is a convolutional neural network built layer by layer using the Sequential model API, with a simple architecture implemented in Keras on a TensorFlow backend. The architecture consists of four Conv2D convolution layers, with the final output layer being a dense (fully connected) layer. The convolution layers are designed for feature detection [20]: each works by sliding a filter window (with 16, 32, 64, and 128 filters in the successive layers) over the input features, performing matrix multiplication and storing the result in a feature map; this operation is known as convolution.

Fig. 8. Categorical Cross-Entropy Loss

The first layer receives input of shape (40, 174, 1), where 40 is the number of MFCCs, 174 is the number of frames (taking padding into account), and 1 indicates the audio is mono. The activation function used in the convolutional layers is ReLU (Rectified Linear Unit). A dropout of 20% is used so that the model does not suffer from overfitting (dropout randomly drops some nodes regardless of their weights). Each convolutional layer is followed by a pooling layer of MaxPooling2D type, with the final convolutional layer followed by GlobalAveragePooling2D. The pooling layers are inserted to reduce the number of parameters and the consequent computational instructions, which reduces the dimensionality of the model, shortens training time, and also reduces overfitting. The loss function used is categorical cross-entropy, with the Adam optimizer and accuracy as the metric.
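A Keras sketch of the architecture described above follows; the filter counts, pooling scheme, dropout rate, loss, and optimizer come from the text, while the kernel size of 2 is an assumption not stated in the paper.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, MaxPooling2D, Dropout,
                                     GlobalAveragePooling2D, Dense)

# Four Conv2D layers (16/32/64/128 filters) with ReLU, MaxPooling2D
# after the first three, GlobalAveragePooling2D after the last,
# 20% dropout throughout, and a 10-way softmax output.
model = Sequential([
    Conv2D(16, kernel_size=2, activation='relu', input_shape=(40, 174, 1)),
    MaxPooling2D(pool_size=2),
    Dropout(0.2),
    Conv2D(32, kernel_size=2, activation='relu'),
    MaxPooling2D(pool_size=2),
    Dropout(0.2),
    Conv2D(64, kernel_size=2, activation='relu'),
    MaxPooling2D(pool_size=2),
    Dropout(0.2),
    Conv2D(128, kernel_size=2, activation='relu'),
    GlobalAveragePooling2D(),
    Dropout(0.2),
    Dense(10, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
```

GlobalAveragePooling2D collapses the final feature maps to a single 128-dimensional vector, which keeps the classifier head small regardless of the spatial size reached by the convolutions.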

Fig. 9. Model Architecture with each layer having Relu activation function and last layer with SoftMax activation function

The last (output) layer has 10 nodes (numerical labels), matching the number of possible classifications for an audio file. The activation function used in the final layer is softmax, which makes the outputs sum to 1 so they can be interpreted as class probabilities.
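The softmax behavior is easy to verify directly (a minimal NumPy sketch; the example logits are arbitrary):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Ten raw output-layer scores (logits), one per class label.
logits = np.array([2.0, 1.0, 0.1, 0.0, -1.0, 0.5, 0.3, 1.5, -0.5, 0.2])
probs = softmax(logits)

print(np.isclose(probs.sum(), 1.0))  # True: outputs sum to 1
print(int(np.argmax(probs)))         # 0: the predicted class index
```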


The model is evaluated on the UrbanSound8k dataset, which contains 10 low-level classes from the urban sound taxonomy: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, and street music. All excerpts are taken from field recordings uploaded to www.freesound.org. The files are pre-sorted into ten folds (folders named fold1 to fold10) to aid the reproduction of, and comparison with, automatic classification results. In total there are 8732 audio files of urban sounds with the same properties as the original recordings.

Fig. 10. Loss Function (Cross Entropy) vs No of Epochs (left) and Accuracy Curve vs No of Epochs (right)

The loss decreases with the number of epochs, while the training and testing accuracy increase with the number of epochs, stabilizing after 100 epochs and reaching an accuracy of 92%.

Fig. 11.The above image shows precision, recall, f1-score(left) and confusion matrix(right)

The two images above show that the model recognizes the audio files of each class label with high precision, recall, and f1-scores; the confusion matrix also suggests that accuracy is somewhat lower for the street music class.

Conclusions and Future Directions

We have presented a deep learning approach for environmental sound classification. Our method uses MFCCs to extract features from the audio, represented as spectrogram-like images of the sound, with a convolutional neural network performing further feature extraction and a dense (fully connected) layer serving as the classifier (output layer). We implemented our approach on inexpensive hardware and obtained a respectable overall accuracy of 92% on environmental sounds (such as car horns and dog barking). The model was trained for 100 epochs with filter counts of 16, 32, 64, and 128; the loss function was categorical cross-entropy with softmax as the activation of the dense layer; MaxPooling2D was used after the first three Conv2D layers and GlobalAveragePooling2D after the last, with a dropout of 20%. The dataset used contains 8732 sound excerpts spanning 10 different class labels.

This reflects that the model predicts environmental sounds accurately, does not suffer from overfitting, and generalizes to audio events not included in the dataset. However, there are limitations with respect to mobile development: the model currently cannot be deployed on a compact, low-latency device for ease of use.

Future work will investigate deploying the model on mobile devices, decreasing its computational cost, and exploring hybrid neural networks to increase accuracy.


  1. Organization, W.H., Childhood hearing loss: Strategies for prevention and care. 2016.
  2. A. Mesaros, T. Heittola, O. Dikmen, and T. Virtanen, “Sound event detection in real life recordings using coupled matrix factorization of spectral representations and class activity annotations,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, Apr. 2015, pp. 151–155.
  3. Kuncheva, L.I.; Whitaker, C.J. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine learning 2003, 51, 181–207.
  4. Huzaifah, M. Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks. arXiv preprint arXiv:1706.07156 2017.
  5. J. Salamon and J. P. Bello, “Unsupervised feature learning for urban sound classification,” in IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, Apr. 2015, pp. 171–175.
  6. ——, “Feature learning with deep scattering for urban sound analysis,” in 2015 European Signal Processing Conference, Nice, France, Aug. 2015.
  7. K. J. Piczak, ‘‘Environmental sound classification with convolutional neural networks,’’ in Proc. IEEE 25th Int. Workshop Mach. Learn. Signal Process., Sep. 2015, pp. 1–6.
  8. E. Cakir, T. Heittola, H. Huttunen, and T. Virtanen, “Polyphonic sound event detection using multi label deep neural networks,” in 2015 International Joint Conference on Neural Networks (IJCNN), July 2015, pp. 1–7.
  9. Giannakopoulos, T.; Pikrakis, A. Introduction to Audio Analysis: A MATLAB Approach; Academic Press, 2014.
  10. K. J. Piczak, “ESC: Dataset for environmental sound classification,” in 23rd ACM International Conference on Multimedia, Brisbane, Australia, Oct. 2015, pp. 1015–1018.
  11. Ye, J.; Kobayashi, T.; Murakawa, M. Urban sound event classification based on local and global features aggregation. Applied Acoustics 2017, 117, 246–256.
  12. Opitz, D. and R. Maclin, Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 1999. 11: p. 169-198.
  13. Bradbury, J., Linear predictive coding. Mc G. Hill, 2000
  14. Liaw, A. and M. Wiener, Classification and regression by randomForest. R news, 2002. 2(3): p. 18-22.
  15. J. Salamon, C. Jacoby, and J. P. Bello, “A dataset and taxonomy for urban sound research,” in Proc. 24th ACM Intern. Conf. Multimedia, 2014, pp. 1041– 1044.
  16. Lesica, N.A., Why Do Hearing Aids Fail to Restore Normal Auditory Perception? Trends in neurosciences, 2018.
  17. Brian McFee, Eric Humphrey, and Julian Urbano,” A plan for sustainable mir evaluation,” In Proceedings of the International Society for Music Information Retrieval Conference, pages 285–291, 2016.
  18. Theodoridis, S.; Koutroumbas, K. Pattern Recognition, Fourth Edition; Academic Press, 2008.
  19. D. Gupta and A. Khanna, ‘‘Software usability datasets,’’ Int. J. Pure Appl. Math., vol. 117, no. 15, pp. 1001–1014, 2017.
  20. Hyoung-Gook, K.; Nicolas, M.; Sikora, T. MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval; John Wiley & Sons, 2005.
