DengAI: Disease Spread Prediction

By Padmaja Borwankar, Computer Engineering, VESIT

Introduction

Disease outbreaks endanger lives all over the world. The recent COVID-19 pandemic was a bitter experience for most of us, and it raised the general level of health awareness.

Dengue fever is a mosquito-borne disease that occurs in tropical and subtropical parts of the world. In mild cases, symptoms are similar to the flu: fever, rash, and muscle and joint pain. In severe cases, dengue fever can cause severe bleeding, low blood pressure, and even death. Because it is carried by mosquitoes, the transmission dynamics of dengue are related to climate variables such as temperature and precipitation. Although the relationship to climate is complex, a growing number of scientists argue that climate change is likely to produce distributional shifts with significant public health implications worldwide. Historically, the disease has been most prevalent in Southeast Asia and the Pacific islands, but in recent years it has been spreading.

DengAI is an online intermediate-level practice competition hosted by drivendata.org. The task is to predict the number of dengue cases each week (in each location) based on environmental variables describing changes in temperature, precipitation, vegetation, and more. Submissions are evaluated by mean absolute error (MAE), so a lower score is better. An understanding of the relationship between climate and dengue dynamics can improve research initiatives and resource allocation to help fight life-threatening pandemics.
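
As a small illustration of the metric, mean absolute error averages the absolute differences between the predicted and the observed weekly counts. The sketch below uses made-up numbers rather than competition data, and the function name is only illustrative.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all weeks."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up weekly case counts, for illustration only.
actual_cases = [4, 5, 4, 3, 6]
predicted_cases = [5, 5, 2, 4, 9]

# Absolute errors are 1, 0, 2, 1, 3, so the mean is 7 / 5.
print(mean_absolute_error(actual_cases, predicted_cases))  # 1.4
```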

Methodology

As part of an internship conducted by LeadingIndia.ai, our team of four was assigned the DengAI project. We tried several statistical time series models and one machine learning model, random forest. The dataset provided consisted of 1456 entries of training data and 416 entries of testing data for two cities: San Juan and Iquitos. Climate variables such as maximum and minimum temperature, humidity, and precipitation were used as predictors.

Statistical time series models like ARIMA, ARIMAX, SARIMA and SARIMAX were used. The acronym ARIMA stands for Auto-Regressive Integrated Moving Average. Lags of the stationarized series in the forecasting equation are called “autoregressive” terms, lags of the forecast errors are called “moving average” terms, and a time series which needs to be differenced to be made stationary is said to be an “integrated” version of a stationary series. A non-seasonal ARIMA model is classified as an “ARIMA(p,d,q)” model, where:

  • p is the number of autoregressive terms
  • d is the number of nonseasonal differences needed for stationarity
  • q is the number of lagged forecast errors in the prediction equation
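
To make the roles of p and d concrete, here is a minimal pure-Python sketch of an ARIMA(1,1,0)-style forecast: difference the series once, estimate a single autoregressive coefficient by least squares, then undo the differencing. It is a simplification for illustration (no intercept, no moving average terms) and not the fitting procedure used in the project; the helper names and the numbers are hypothetical.

```python
def difference(series, d=1):
    """Apply d rounds of first differencing: y[t] - y[t-1]."""
    out = list(series)
    for _ in range(d):
        out = [curr - prev for prev, curr in zip(out, out[1:])]
    return out


def fit_ar1(series):
    """Least-squares estimate of phi in y[t] = phi * y[t-1] + error (no intercept)."""
    lagged, current = series[:-1], series[1:]
    return sum(x * y for x, y in zip(lagged, current)) / sum(x * x for x in lagged)


def forecast_arima_110(series):
    """One-step forecast: last value plus phi times the last observed change."""
    diffs = difference(series, d=1)
    phi = fit_ar1(diffs)
    return series[-1] + phi * diffs[-1]


# Made-up series whose week-to-week changes halve each step: 8, 4, 2, 1.
toy_series = [0, 8, 12, 14, 15]
print(difference(toy_series))           # [8, 4, 2, 1]
print(fit_ar1(difference(toy_series)))  # 0.5
print(forecast_arima_110(toy_series))   # 15.5
```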

An ARIMAX model is very similar to an ARIMA model, except that it also includes relevant independent (exogenous) variables. Including exogenous variables adds complexity to the model-building process, but it lets the model capture the influence of external factors such as climate.
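
The idea of adding an exogenous variable can be sketched with the simplest relative of ARIMAX, an autoregression of order 1 with one external regressor (sometimes called ARX). The code below fits y[t] = phi * y[t-1] + beta * x[t] by ordinary least squares in pure Python. The function name, the variable names and the data are assumptions made for this example, and a real ARIMAX fit would also handle differencing and moving average terms.

```python
def fit_ar1_with_exog(y, x):
    """Least-squares fit of y[t] = phi * y[t-1] + beta * x[t] (no intercept)."""
    lag, exo, target = y[:-1], x[1:], y[1:]
    # Entries of the 2x2 normal equations.
    s_ll = sum(a * a for a in lag)
    s_ee = sum(b * b for b in exo)
    s_le = sum(a * b for a, b in zip(lag, exo))
    s_lt = sum(a * t for a, t in zip(lag, target))
    s_et = sum(b * t for b, t in zip(exo, target))
    # Solve the 2x2 system with Cramer's rule.
    det = s_ll * s_ee - s_le * s_le
    phi = (s_lt * s_ee - s_le * s_et) / det
    beta = (s_ll * s_et - s_le * s_lt) / det
    return phi, beta


# Made-up data generated exactly from phi = 0.5 and beta = 2.0,
# so the fit should recover values very close to those.
exog = [0, 1, 0, 2, 1, 3]
cases = [4, 4, 2, 5, 4.5, 8.25]

phi, beta = fit_ar1_with_exog(cases, exog)
print(round(phi, 6), round(beta, 6))
```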

In the SARIMA model (seasonal ARIMA), seasonality refers to periodic fluctuations. The seasonal part of an ARIMA model has the same structure as the non-seasonal part: it may have an AR factor, an MA factor, and/or an order of differencing. In the seasonal part of the model, all of these factors operate across multiples of lag s, where s is the number of periods in a season. A seasonal ARIMA model is classified as an ARIMA(p,d,q)x(P,D,Q) model, where P is the number of seasonal autoregressive (SAR) terms, D is the number of seasonal differences, and Q is the number of seasonal moving average (SMA) terms. SARIMAX adds exogenous variables to SARIMA in the same way that ARIMAX adds them to ARIMA.
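
A seasonal difference (the D in the notation above) subtracts the value observed one full season earlier. The sketch below is a minimal illustration with a made-up series and a season length of 4; for weekly data a season length of 52 would be a natural choice, but that value is an assumption here, not something taken from the project.

```python
def seasonal_difference(series, season_length):
    """Subtract the value one full season earlier: y[t] - y[t - season_length]."""
    return [
        series[t] - series[t - season_length]
        for t in range(season_length, len(series))
    ]


# Made-up series: a repeating pattern of length 4 that rises by 1 each season.
toy_series = [1, 5, 9, 3, 2, 6, 10, 4]

# The repeating pattern cancels out and only the season-to-season rise remains.
print(seasonal_difference(toy_series, season_length=4))  # [1, 1, 1, 1]
```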

The machine learning algorithm that we used is random forest. It is an ensemble approach, which means that the outputs of many weak learners are combined into a single final output. In a random forest the weak learners are decision trees built by bagging rather than boosting: each tree is trained on a random sample of the data set rows, drawn with replacement, and works with a random subset of the features, so the trees differ from one another. Since ours is a regression problem, we used the random forest regressor, which takes the unweighted mean of the outputs of all decision trees as the final output. The hyperparameter we tuned is the number of estimators, that is, the number of decision trees.
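
The mechanics can be sketched in pure Python with very shallow trees. This is a toy version for illustration, not the model used in the project: each "tree" is a single-split stump, the feature subset is drawn once per tree (library implementations typically draw it at every split), and the data, function names and parameter names other than the number of estimators are made up.

```python
import random


def fit_stump(rows, targets, feature_indices):
    """Find the single split (feature, threshold) that minimises squared error."""
    best = None
    for f in feature_indices:
        for threshold in sorted({row[f] for row in rows}):
            left = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[f] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = sum((t - left_mean) ** 2 for t in left)
            sse += sum((t - right_mean) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, threshold, left_mean, right_mean)
    if best is None:  # no valid split: always predict the sample mean
        mean = sum(targets) / len(targets)
        return (feature_indices[0], float("inf"), mean, mean)
    return best[1:]


def fit_forest(rows, targets, n_estimators=25, max_features=1, seed=0):
    """Train n_estimators stumps, each on a bootstrap sample and feature subset."""
    rng = random.Random(seed)
    n_rows, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_estimators):
        # Bootstrap sample: draw row indices with replacement.
        idx = [rng.randrange(n_rows) for _ in range(n_rows)]
        # Random feature subset for this tree.
        features = rng.sample(range(n_features), max_features)
        forest.append(
            fit_stump([rows[i] for i in idx], [targets[i] for i in idx], features)
        )
    return forest


def predict(forest, row):
    """Regression output: the unweighted mean of all tree predictions."""
    preds = [left if row[f] <= thr else right for f, thr, left, right in forest]
    return sum(preds) / len(preds)


# Made-up rows of (temperature, precipitation) and weekly case counts.
rows = [(24.0, 10.0), (25.0, 30.0), (26.0, 20.0),
        (29.0, 80.0), (30.0, 60.0), (31.0, 90.0)]
targets = [3, 4, 5, 20, 22, 25]

forest = fit_forest(rows, targets, n_estimators=25, max_features=1, seed=0)
coolest, hottest = (24.0, 10.0), (31.0, 90.0)
print(predict(forest, coolest), predict(forest, hottest))
```

Because the final prediction is a plain average over trees, adding more estimators mainly reduces the variance of that average, which is why the number of estimators is the natural first hyperparameter to tune.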

Fig 1. Comparison of predicted and actual values on validation data of San Juan city

Our models achieved fairly low mean absolute error values. The scores are given below; our best submission reached rank 813 on drivendata.org:

Fig 2. Comparison of MAE of various models
Fig 3. Highest rank of 813 achieved on submission of the solution

ARIMA and SARIMAX performed reasonably well, but random forest outperformed all the other models. The project has two main limitations: the dataset is small (1872 rows across training and testing data), and the data is specific to two cities, so the trained models cannot be applied directly to other geographical locations. In the future, deep learning models such as LSTM could be tried, and the project could be extended to other regions and other diseases.
