By Advadit Naik, Computer Engineering, VESIT
The goal of the project is to classify real-life videos into Violence and Non-Violence categories. Combined with surveillance cameras, these predictions can be used to alert the police so that they can intervene. Combined with facial recognition, they can also help identify the people involved.
Violent individuals can be a threat to society, and ensuring the safety of every individual in public places is of vital importance. Various computer vision techniques have been developed for human action recognition; here we employ deep learning methods to detect violence. The project is broadly implemented in two steps. First, we check whether violence is present in the video frames. If it is, we pass these frames through a face recognition system to identify the individual.
The dataset we used was the Real-Life Violence Situations Dataset, which contains 1000 violent and 1000 non-violent videos of varying resolution. We collected these videos from various sources such as YouTube, recordings of real street fights, fights in sports tournaments, political riots, etc. We made sure to capture both indoor and outdoor environments so that the model can learn from a wide range of features. A sample of the dataset is shown below:
We leveraged transfer learning by using a pre-trained neural network called MobileNet. MobileNet was trained on the ImageNet dataset, and we fine-tuned it on our dataset. Input video files were first preprocessed by splitting each video into a sequence of 20 contiguous frames, and then annotated with one-hot labels, where [1,0] denotes a violent video and [0,1] denotes a non-violent video. This compacts the data and makes it easier to process.
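The frame selection and labelling step can be sketched in a few lines. This is a minimal illustration rather than the project's code: the helper names `contiguous_window` and `one_hot_label` are hypothetical, and centring the 20-frame window in the video is an assumption, since the post does not say where the window starts.

```python
FRAMES_PER_VIDEO = 20  # sequence length stated in the post


def contiguous_window(total_frames, length=FRAMES_PER_VIDEO):
    """Return the indices of `length` contiguous frames, centred in the video.

    Centring is an assumption; the original pipeline may start elsewhere.
    """
    if total_frames < length:
        raise ValueError("video is shorter than the requested window")
    start = (total_frames - length) // 2
    return list(range(start, start + length))


def one_hot_label(is_violent):
    """Encode the class as in the post: [1, 0] violent, [0, 1] non-violent."""
    return [1, 0] if is_violent else [0, 1]


# Example: a hypothetical 150-frame clip labelled as violent.
indices = contiguous_window(total_frames=150)
print(indices[0], indices[-1], len(indices))
print(one_hot_label(True), one_hot_label(False))
```

In a real pipeline the selected indices would be used to read the corresponding frames from the video file before resizing them to the network's input size.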
The MobileNet architecture processes the 20 frames of each video, and the transfer values obtained from it are given as input to an LSTM neural network. The LSTM network is trained with supervised learning on the two classes of videos (violent, non-violent). The LSTM layer uses the hyperbolic tangent function for activation, and the final layer is a sigmoid classifier. If violence is detected in any frame of a video, the video is classified as violent. The model architecture of MobileNet+LSTM is shown below:
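To make the sequence step concrete, here is a minimal pure-Python sketch of a standard LSTM forward pass over per-frame feature vectors, followed by a sigmoid output. It is illustrative only: the function names, the tiny feature and hidden sizes, and the all-zero weights are assumptions chosen so the example runs without a deep learning library. The real model learns its weights and takes MobileNet transfer values as input. A single sigmoid output is used for brevity, whereas the one-hot labels in the post suggest two output units.

```python
import math

GATES = ("input", "forget", "output", "candidate")


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(row, vec):
    return sum(w * v for w, v in zip(row, vec))


def lstm_step(x, h, c, params):
    """One LSTM time step: gates use sigmoid, the candidate state and the
    output squashing use tanh, as in the standard formulation."""
    z = x + h  # concatenate input features and previous hidden state
    acts = {}
    for name in GATES:
        weights, bias = params[name]
        squash = math.tanh if name == "candidate" else sigmoid
        acts[name] = [squash(dot(row, z) + b) for row, b in zip(weights, bias)]
    c_new = [
        f * c_prev + i * g
        for f, c_prev, i, g in zip(acts["forget"], c, acts["input"], acts["candidate"])
    ]
    h_new = [o * math.tanh(cn) for o, cn in zip(acts["output"], c_new)]
    return h_new, c_new


def violence_score(frame_features, params, out_weights, out_bias):
    """Run the LSTM over all frames, then apply a sigmoid to the last hidden state."""
    hidden = len(out_weights)
    h, c = [0.0] * hidden, [0.0] * hidden
    for x in frame_features:
        h, c = lstm_step(x, h, c, params)
    return sigmoid(dot(out_weights, h) + out_bias)


def zero_params(features, hidden):
    """Placeholder (untrained) weights: one weight matrix and bias per gate."""
    size = features + hidden
    return {
        name: ([[0.0] * size for _ in range(hidden)], [0.0] * hidden)
        for name in GATES
    }


FEATURES, HIDDEN, FRAMES = 4, 3, 20  # toy sizes, not the real model's
frames = [[0.1 * (t % 5)] * FEATURES for t in range(FRAMES)]
score = violence_score(frames, zero_params(FEATURES, HIDDEN), [0.0] * HIDDEN, 0.0)
print(score)
```

With untrained zero weights the score carries no information; after training, a score near 1 would indicate one class and a score near 0 the other, depending on how the labels are encoded.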
After classification, a video containing violent activity is passed through a Haar Cascade classifier, which detects the faces of the individuals in it. Each detected face is then compared against the face dataset to identify the individual.
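The post does not say how a detected face is compared with the face dataset. One common approach, shown here purely as an assumption, is to represent each face as a numeric feature vector and pick the nearest known face, rejecting matches beyond a distance threshold. The names `identify` and `known_faces`, the threshold value, and the short example vectors are all hypothetical stand-ins for real face features.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(face_vector, known_faces, max_distance=0.6):
    """Return the name of the closest known face, or None if none is close enough."""
    best_name, best_dist = None, float("inf")
    for name, reference in known_faces.items():
        dist = euclidean(face_vector, reference)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= max_distance else None


# Toy face dataset: in practice these vectors would come from a face feature extractor.
known_faces = {
    "person_a": [0.1, 0.9, 0.3],
    "person_b": [0.8, 0.2, 0.5],
}
print(identify([0.12, 0.88, 0.31], known_faces))  # close to person_a
print(identify([5.0, 5.0, 5.0], known_faces))     # far from everyone
```

The threshold matters in a surveillance setting: without it, every detected face would be assigned to someone in the dataset, even a person who is not in it.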
Experimental Results and Analysis
The graphs shown above indicate that the MobileNet+LSTM model performs well overall: accuracy rises as the number of epochs increases. This suggests that the features extracted by the MobileNet architecture are responsible for the high accuracy.
At the end of our work, we provide a simple graphical user interface that highlights violent and non-violent activities with different colour borders as shown below:
In the future, we can try to build a dedicated database keyed on the timestamps of violent behaviour in a video, with the corresponding screenshots stored automatically for future reference and learning. This would help increase the diversity of the dataset and make the model easier to deploy in real-time violence recognition systems.
The full code for this project is available on GitHub:
Check out these papers to read more about violence detection using transfer learning and pre-trained neural networks!
1. Aqib Mumtaz, Allah Bux Sargano, & Zulfiqar Habib. (2018). Violence Detection in Surveillance Videos with Deep Network using Transfer Learning. IEEE.
2. Shakil Ahmed Sumon, Raihan Goni, Niyaz Bin Hashem, Tanzil Shahria, & Rashedur M. Rahman. (2019). Violence Detection by Pretrained Modules with Different Deep Learning Approaches. Vietnam Journal of Computer Science.