Time series analysis of rubella incidence in Chongqing, China using SARIMA and BPNN mathematical models

Introduction: Chongqing is among the areas with the highest rubella incidence rates in China. This study aimed to analyze the temporal distribution characteristics of rubella and establish a forecasting model in Chongqing, which could provide a tool for decision-making in the early warning system for the health sector. Methodology: The rubella monthly incidence data from 2004 to 2019 were obtained from the Chongqing Center of Disease and Control. The incidence from 2004 to June 2019 was fitted using the seasonal autoregressive integrated moving average (SARIMA) model and the backpropagation neural network (BPNN) model, and the data from July to December 2019 was used for validation. Results: A total of 30,083 rubella cases were reported in this study, with a significantly higher average annual incidence before the nationwide introduction of rubella-containing vaccine (RCV). The peak of rubella notification was from April to June annually. Both SARIMA and BPNN models were capable of predicting the expected incidence of rubella. However, the linear SARIMA model fits and predicts better than the nonlinear BPNN model. Conclusions: Based on the results, rubella incidence in Chongqing has an obvious seasonal trend, and SARIMA (2,1,1) × (1,1,1) 12 model can predict the incidence of rubella well. The SARIMA model is a feasible tool for producing reliable rubella forecasts in Chongqing.


Introduction
Rubella is an eruptive, highly contagious, and mild viral infection transmitted through direct or droplet contact from nasopharyngeal secretions [1,2]. Humans are the only known host [3]. It typically begins with low-grade fever and lymphadenopathy, followed by characteristic brief appearance of a generalized erythematous, maculopapular rash [4]. Maternal rubella infection, especially during the first trimester of gestation, can cause miscarriage, stillbirths, or congenital rubella syndrome (CRS), a constellation of congenital disabilities that often includes the classic triad of cataracts, sensorineural deafness, and congenital heart defects [1,4,5]. Rubella occurs worldwide [6][7][8], and China is among the countries that have experienced frequent and widespread rubella outbreaks [9].
China, the most populous country, has reported 88% of rubella cases in the Western Pacific Region since 2004, when it began reporting rubella cases [10]. The rubella-containing vaccine (RCV) was licensed and made available in China in 1993 and was introduced nationwide by the government's Expanded Program on Immunization (EPI) in 2008 [11]. The overall incidence of rubella in China has declined in recent years, with a historically low level in 2017, during which the incidence was 1.16 per million total population, the accumulation of susceptible individuals led to a rebound in the epidemic, and almost 30,000 cases were reported in 2019 [9], according to the data from the National Notifiable Disease Reporting System (NNDRS). In fact, the number of rubella cases worldwide had more than doubled in 2019, compared to the year before [12]. Rubella remains a public health concern worldwide [3]. Thus, understanding the epidemiological patterns and establishing the accurate morbidity prediction model of rubella are critical for risk analysis and resource allocation in the health sector.
Many mathematical models have been developed to successfully forecast infectious diseases [13][14][15][16]. The autoregressive integrated moving average (ARIMA) model, which is the most popular linear modeling technique for forecasting time series [17], integrates changing trends, periodic changes, and random disturbances in time series and can reveal the quantitative relationship between the research object and other objects with the development and transformation of time [18]. The artificial neural network (ANN) model, on the other hand, has usually been preferred for nonlinear time series [19]. The backpropagation neural network (BPNN) model is one of the most widely used ANNs; it can approximate any complex nonlinear mapping and is especially suitable for solving complicated internal mechanisms [20].
Chongqing is one of the areas with the highest rubella incidence rates in China. A total of 187 rubella public health emergencies were reported nationwide in 2019, including 47 in Chongqing [21]. Although there are some epidemiological rubella studies, no previous research has been conducted to predict the incidence of rubella in Chongqing. Consequently, we developed a seasonal autoregressive integrated moving average (SARIMA) model, an extension of the ARIMA model with seasonal effect, and a BPNN model to forecast the incidence of rubella. The results of this study may help predict the epidemic trends of rubella and provide reference information for rubella public health intervention in Chongqing.

Data Collection
In this study, the incidence data from 2004 to 2019 were primarily gained from the Chongqing Center of Disease and Control (CDC), while population data were collected from the Chongqing Statistics Bureau. All rubella cases were initially diagnosed by clinical symptoms and confirmed by laboratory examination.

SARIMA model construction
The SARIMA model integrates the seasonal effects, long-term trend effects, cyclical variations, and disturbances of stochastic perturbations of the series. The general structure of the SARIMA model could be expressed as SARIMA (p, d, q) × (P, D, Q)s, in which p, d, and q are non-negative integers that indicate orders of non-seasonal autoregressive (AR) terms, nonseasonal differencing and non-seasonal moving average (MA), respectively; P, D, and Q are also non-negative integers that indicate orders of seasonal AR terms, seasonal differencing and seasonal MA terms, respectively; and S suggests the length of seasonal period [18]. The values of the above six parameters will be determined based on the sequence autocorrelation function (ACF) and the partial autocorrelation function (PACF). The complete structure of the model fitted to the sequence of observations is as follows [18]: In the above equation, represents the backward shift operator, denotes the residual at time , the mean of is zero and the variance is constant, is the observed value at time . Akaike information criterion (AIC) and Schwarz Bayesian criterion (BIC) were employed for model optimization. The tested model with the lowest AIC and BIC values was regarded as the relatively optimal one.

BPNN model construction
The BPNN is a typical multilayer feed-forward neural network consisting of an input layer, at least one hidden layer, and an output layer [22]. The basic idea of the BP algorithm is that the learning process consists of two processes: forward propagation of the signal and backward propagation of the error. In the forward propagation process, the input samples are passed in from the input layer, and after nonlinear processing at each hidden layer, they are passed to the output layer. The actual output value is compared with the expected value, if the given output requirement is not satisfied, and the error is propagated back. Error backpropagation is to back-propagate the output error in some form to the hidden and input layers, gradually reducing the error by adjusting the weight of each neuron in the hidden layer until the error between the actual and the expected output meets the accuracy requirement or reaches the maximum learning count [23].
A single hidden layer feed-forward network is the most widely used for time series modeling and forecasting [22]. The relationship between the output ( ) and the inputs ( ) is based on the following mathematical representation: is the weight of the k th neuron in the input layer to the j th neuron in the hidden layer; is the weight of the j th neuron in the hidden layer corresponding to the output layer neuron; is the threshold of the j th neuron in the hidden layer; is the threshold of the output layer neuron; n is the number of input nodes; is the number of hidden nodes. Both and are activation functions.

Performance statistic index
The mean absolute error (MAE), the root mean square error (RMSE) and the mean absolute scaled error (MASE) were used to determine the fitting and forecasting effect [24]. The calculation formulae are as follows: , Where and denote the actual incidence value and the estimated incidence value, respectively, at time , m is the seasonal period. Generally, lower MAE, RMSE and MASE values suggest better fitting and predicting performance, and vice versa.

Statistical analysis
At first, we used Excel 2016 software to count the number of reported rubella incidences/fatalities in Chongqing for each year from 2004 to 2019, organize the data into two periods: before (2004-2008) and after (2009-2019) the introduction of RCVs into China's nationwide EPI system, and conduct a descriptive analysis. Next, we analyzed the time series analysis of the rubella incidence sequence. We used the R4.0.5 software, to develop a SARIMA model and a BPNN model and predict the incidence of rubella. In this study, the incidence of rubella from 2004 to June 2019 was used as a training dataset to fit the model, predict the incidence of rubella from July to December in 2019, and verify the predicted effect. Statistical significance was determined as p < 0.05.

Ethics approval and consent to participate
This study protocol was approved by the Ethics Committee of Chongqing Medical University. Since all data collected in this study are was in accordance to the Law of the People's Republic of China on the Prevention and Treatment of Infectious Diseases, and we did not include any patients' personal information, written informed consent is not needed for this study.

Descriptive analyses
During this study (2004-2019), a total of 30083 rubella cases were reported in Chongqing, with an average annual incidence of 6.47 per 100000, and no deaths. Table 1 shows the average annual incidence rate after EPI introduction decreased from 10.76 per 100000 before implementation to 4.51 per 100000, with  statistically significant differences (χ 2 =6062.98, p<0.001). Figure 1 presents the temporal distribution of rubella incidence in Chongqing from 2004 to 2019, mainly showing a unimodal feature, with a peak incidence from April to June and a slight increase again in December, with obvious seasonal characteristics.
The sequence diagram also shows that the incidence of rubella from 2004 to 2019 was unstable (Figure 2). Between 2004 and 2008, regional rubella incidence ranged from 2.08 per 100000 in 2005 to a peak of 21.09 per 100,000 in 2007. Since 2013, the incidence of rubella has shown a downward trend, and the 2017 incidence was at a historic low level, during which only 52 rubella cases were reported (incidence rate of 0.17 per 100000). However, his rate increased again in 2019, with an incidence rate of 17.52 per 100000.
( , , ) × ( , , ) model Figure 3 shows the decomposition diagram of the rubella incidence sequence, which exhibits a fluctuating downward trend and prominent seasonal characteristics. Given the periodicity and seasonality of rubella, a first difference (d=1) and a seasonal difference (D=1) with a period of 12 were performed to eliminate non-stationarity. The sequence diagram after the difference appeared to be stationary (Figure 4), and the ADF test results remained significant ( < 0.05). Figure 5 shows that ACF and PACF of the stationary sequence were both trailing. Based on the    ACF, we determined the possible values of ( = 1, 2 3) and ( = 0 1) of SARIMA (p, d, q) × (P, D, Q)s, and based on PACF, we determined the possible values of ( = 1, 2 3) and ( = 0 1). All primary models were used to simulate and model the monthly rubella incidence. Several models passed the test and the model parameter test. The AIC and BIC values of the six candidate models are listed in Table 2, and we finally confirmed SARIMA (2, 1, 1) × (1, 1, 1) 12 as the optimal model, which had resulted in the minimum AIC (358.85) and BIC (377.77) values. Table 3 shows the parameter estimate results of the SARIMA model, and all the parameter estimates were significant ( ). The Ljung-Box test also confirmed that the residuals of the model were a white noise sequence ( ). The model equation is given as

BPNN model
After normalizing rubella incidence and converting all values to intervals [0, 1], we used the data of the past 12 months as the input data and the data of the thirteenth month as the output data. For example, the observed value from January to December 2004 was selected to predict that in January 2005, the practical value from February 2004 to January 2005 was chosen to indicate that in February 2005, and so on. Thus, the number of nodes in the input and output layers could be determined, with n = 12, m = 1. The number of neurons in the hidden layer could be calculated by the empirical formula: = √ + + , ranging from 5 to 14, where is the regulation constant with values between 1 and 10 [23]. We set the target error of the training of BPNN as 0.001, the training steps as 2000, the transfer function of the hidden layer as "tansig", the transfer function of the output layer as "purelin". When the difference between the target estimate and the actual value is less than 0.001, error back-propagation is not necessary; otherwise, back-propagation is required to further adjust the weights. During the entire modeling, when the difference between the two residual squares is less than 0.001, it can be considered converged, and the iteration can be skipped to end the algorithm [23]. Predictions are made based on the fitted weight matrix and threshold values for the validation set.
We constructed ten different BPNN models in terms of the number of neurons in the hidden layer.

Comparative analysis
The SARIMA (2, 1, 1) × (1, 1, 1) 12 model and the 12-8-1 BPNN model were employed for forecasting the incidence of rubella from July to December 2019. Table  4 shows the value of the prediction, showing that the predicted values obtained by the SARIMA model were closer to the actual values than those obtained by BPNN. We compared the observed rubella notification rate with the fitted and predicted ones in a point-to-point manner to further compare the validity of the two models ( Table 5). The SARIMA model was superior to the BPNN model for either fitting or forecasting performance. Figure 6 shows the actual incidence and fitted incidence of the two models; the SARIMA model had a fit value closer to the actual value than the BPNN model. Both Figure 6 and Table 4 show that the tendency and epidemics from predicted incidence are close to the real value of incidence and epidemic circumstance of rubella. These two prediction methods can be considered to predict the incidence of rubella in Chongqing. In terms of the fitting effect of the model and the accuracy of the prediction results, the SARIMA model would be much more suitable.

Discussion
In this study, we found that the incidence of rubella in Chongqing from 2004 to 2019 was not stable, but based on the annual trend of the national rubella incidence [9,25], with a peak incidence every 5-8 years [26]. In 2007 and 2008, the incidence of rubella in Chongqing was as high as 21.09 per 100,000 and 18.76 per 100000, respectively. A further peak of 10.81 per 100000 in incidence occurred in 2011. While RCV was introduced nationwide in 2008, its full implementation required four years due to sporadic vaccine supply constraints. Nationwide implementation was achieved in 2012, with routine immunization coverage exceeding 95% [25]. Since 2013, the incidence of rubella has exhibited a downward trend, and the 2017 incidence was at a historically low level. But it broke out again in 2019, with an incidence rate of 17.52 per 100,000, which may be caused by the accumulation of susceptible populations and the spread of imported strains [10,27]. Despite fluctuations, the introduction of rubella into China's EPI system in 2008 has indeed contributed to the control and elimination of rubella and the prevention of CRS in Chongqing, with a significant difference in the average annual incidence rate before and after the nationwide introduction of RCVs. Largescale routine vaccination with RCVs should be continued to close known immunity gaps and immunize health workers and populations at risk for rubella transmission [28]. The health department should continuously carry out virological surveillance work, timely detection of epidemic viruses, and genotype identification to scientifically prevent and control the spread of imported virus strains and further reduce the incidence of rubella.
We analyzed the rubella incidence rates and observed a fluctuating downward trend and seasonal characteristics in this area, with a single peak from April to June, and troughs from August to February, similar to the studies of other areas [9,25]. This may be related to the climate and human behaviors [29]. As temperatures rise, interpersonal activity and contact tend to increase during the spring and summer months. Thus, the virus is more likely to spread throughout the population, leading to widespread epidemics. Therefore, sanitary inspection should be done in the peak period to control the source of infection, cut off the transmission route, and establish a complete emergency plan.
Time series analysis extracts valuable information from historical data, identifies recursive mechanisms, and expresses the current observations as a function of their historical observations to make predictions about future trends. Rubella is a globally important public health issue. Accurate prediction of rubella is helpful for policy-makers to develop effective intervention plans and efficient allocation of public health resources. We demonstrated that both SARIMA and BPNN models can be used to predict the monthly incidence of rubella. The linear SARIMA model fits and predicts better than the nonlinear BPNN model, indicating that the rubella incidence data in Chongqing has good linear characteristics [19]. Since most epidemiological data are seasonal and cyclical [15][16][17]30], the SARIMA model considers seasonal effects and is suitable for analyzing sequences with significant seasonality and periodicity [31], which appears more appropriate for the prediction of rubella incidence. In addition, the establishment of SARIMA model is simple. To the best of our knowledge, this is the first study that assesses both linear and nonlinear models to predict rubella report rates in Chongqing. The results indicate that model SARIMA (2, 1, 1) × (1, 1, 1) 12 is the relatively more optimal prediction model in this study; the fitted values were in good agreement with the observed. The SARIMA model in this study can reasonably contribute to rubella surveillance in Chongqing and has a good short-term predictive effect. It provides a basis for predicting future rubella epidemics and for health authorities to strengthen public health measures to prevent and control the disease. It is worth noting that SARIMA is a short-term forecasting model with poor long-term forecasting ability, and the model needs to be updated or replaced by continuously incorporating new monitoring data during subsequent use. Therefore, model updating and the selection of new models remain essential elements of future research work.
Several limitations also exist in the present study. Firstly, both models were based on the time-series data for preliminary modeling prediction, without considering other factors affecting the rubella incidence, such as meteorological [32] and sociogeographical [33]. Incorporating more potential influencing factors may allow for more accurate mathematical modeling. Secondly, neither model can capture both linear and nonlinear patterns of the data equally well. Thus, in both cases, one of these parts is not considered, and can lead to deceptive results. Therefore, various hybrid approaches have been suggested [19]. In future research, more models with good time-series prediction results can be tested. A deep optimization combination of different models can make the advantages complement each other and achieve further improvement in the prediction accuracy rate.

Conclusions
The results of this study suggest that it is feasible to apply both SARIMA and BPNN models to predict the incidence of rubella in Chongqing; however SARIMA (2, 1, 1) × (1, 1, 1)12 showed better performance. The short-term prediction is effective, and it is helpful for the understanding of rubella epidemiology and resources allocation in Chongqing. Meanwhile, timely and effective countermeasures can be taken for possible epidemic peaks.