Variational-LSTM autoencoder to forecast the spread of coronavirus across the globe

Modelling the spread of coronavirus globally while learning trends at global and country levels remains crucial for tackling the pandemic. We introduce a novel variational-LSTM Autoencoder model to predict the spread of coronavirus for each country across the globe. This deep Spatio-temporal model does not only rely on historical data of the virus spread but also includes factors related to urban characteristics represented in locational and demographic data (such as population density, urban population, and fertility rate), an index that represents the governmental measures and response amid toward mitigating the outbreak (includes 13 measures such as: 1) school closing, 2) workplace closing, 3) cancelling public events, 4) close public transport, 5) public information campaigns, 6) restrictions on internal movements, 7) international travel controls, 8) fiscal measures, 9) monetary measures, 10) emergency investment in health care, 11) investment in vaccines, 12) virus testing framework, and 13) contact tracing). In addition, the introduced method learns to generate a graph to adjust the spatial dependences among different countries while forecasting the spread. We trained two models for short and long-term forecasts. The first one is trained to output one step in future with three previous timestamps of all features across the globe, whereas the second model is trained to output 10 steps in future. Overall, the trained models show high validation for forecasting the spread for each country for short and long-term forecasts, which makes the introduce method a useful tool to assist decision and policymaking for the different corners of the globe.


Introduction
As a novel contagious disease, COVID-19 has reached more than eight millions confirmed cases and more than 400,000 death globally by 14 th of June 2020 [1]. Although there are a number of the statistical and epidemic models to analyse COVID-19 outbreak, the models are suffering from many assumptions to evaluate the impact of intervention plans which create a low accuracy as well as unsure prediction [2]. Therefore, there is a vital need to develop new frameworks/methods to curb/control the spread of Coronavirus immediately [2,3].

PLOS ONE
PLOS ONE | https://doi.org/10.1371/journal.pone.0246120 January 28, 2021 1 / 22 a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 The epidemic outbreak of COVID-19 in literature is investigated using mathematical compartmental model named Susceptible-Infected-Recovered (SIR) [4]. The SIR model represents a population under three categories: 1) Susceptible (the number of people presently not infected), 2) the number of people currently infected, and 3) the number of people either recovered or died. The model describes as differential equations. The model is completely determined by transmission rate, the recovery rate, and the initial condition, which can be estimated using least square error, Kalman filtering or BMC. The model is sometimes renamed based on the new parameters such as Susceptible-Infectious-Quarantined-Recovered (SIQR) or Susceptible-Exposed-Infected-Recovered (SEIR). The main idea in the version of all SIRs models are four-fold; first, identification and better understanding current epidemic [5], second, simulation the behaviour of the system [6], third, forecasting of the future behaviour [7], and last, how we control the current situation [8]. However, the results of the models including accuracy only valid based on their assumptions in a slice of available data/moment and have their scopes to assist healthcare strategies for the decision-making process.
On the other hand, agent-based modelling is utilised to explore and estimate the number of contagions of COVID-19, specifically for certain countries [9,10]. Also, statistical methods [11], simple time series modelling [12], and logistic map [13] are utilised for similar objectives, whereas [3], focused on modelling the spread of coronavirus based on the parameters of basic SIR in a (3-dimensional) iterative maps to provide a wider picture of the globe. Petropoulos and Makridakis [14] forecasted the total global spread relying on exponential smoothing model based only on historical data. Put all together, the drawbacks of their models are not flexible to fit for each country or region due to the lack of necessary measures, government responses, and spatial factors related to each specific location.
There are few examples of predictive modelling of the coronavirus spread based on machine learning approaches, whether through shallow or deep models. While it is can be explained due to the limitation of data since the early stage of the outbreak, it remains an essential tool. According to Pham and Luengo-oroz [15], machine learning approaches certainly could assist in forecasting by with improved quality for prediction. One of the few studies is presented by [2]. They have applied real-time short-term forecasting using the compiled data from 11 th Jan to 27 th Feb 2020 collected by the World Health Organization (WHO) for the 31 provinces of China. The data is trained on a deep learning model for real-time forecasting of new cases for the provinces. Their model has the flexibility to be trained at the city, provincial, or national level. Besides, the latent variable of the trained model is used to extract necessary features for each region and fed into a K-means to cluster similar features of the infected or recovered features of patients. Bearing this in mind, there is still a knowledge gap for machine learning models to predict coronavirus cases at a global as well as regional scales [15].
While SIR models with their different types, in addition to the aforementioned ones, are essential, the challenges remain in forecasting different regions and countries across the globe with a single model without any assumptions or scenario-based rules, but only with the current situations, features related to countries, and measures amid to reduce the impact of the outbreak. Accordingly, in this paper, we introduce a new method of learning and encoding information related to the historical data of coronavirus per country, features of countries, spatial dependencies among the different countries, and last, the time and location-dependent measures taken by each country amid towards reducing the impact of Coronavirus. Relying on deep learning, we introduce a novel variational Long-Short Term Memory (LSTM) autoencoder model to forecast the spread of coronavirus per country across the globe. This single deep model aimed to provide robust assistance to policymakers to understand the future of the pandemic at both a global level and country level, for a short-term forecast and long-term one.
The main advantages of the proposed method are: 1) It can structure and learns from different data sources, either that belongs to spatial adjacency, urban and population factors, or various historical related data, 2) the model is flexible to apply to different scales, in which currently, it can provide prediction at global and country scales, however, it can be also applied to city level. And last 3) the model is capable of learning global trends for countries that have either similar measures, spread patterns, or urban and population features.
After the introduction, the article is structured in five sections. Section 2 introduces the method and materials used. In section 3, we show model evaluations and the experimental results at country and global levels. In section 4 we discuss our results, compare our model to any existing base models and highlights limitations. Last, in section 5 we conclude and present our recommendation for future works.

Hypothesis and assumptions
The model algorithms are constructed based on four assumptions that we assume the model needs to learn to predict the next day spread: First, the model needs to extract features regarding the historical data of coronavirus spread for a given country bearing in mind the historical values of the virus spread in the other countries simultaneously before it outputs a prediction for a given country. Second, before the model gives a predicted value for each country, it should consider the predicted values of all other countries instantaneously, similar to the first point. Third, the spatial relationship between different countries is multidimensional; it can vary based on geographical location, adjacency, accessibility, or even policies for banning accessibility. The model needs to deal with variations of time and location of the different inputted scenarios while sampling outcomes. Last, apart from the virus features, for each country, there are unique demographic and geographical features that show association to the spread of the virus that may show association with the virus, in which the learning process of the model needs to consider each time before it gives a predicted value.
The structure of the input data is key for any model to learn. Fig 1 shows the concept of the overall structure of the proposed graph of multi-dimensional data sets for forecasting the spread. It illustrates how different types of data can be linked and clustered for the model to learn the spread of a virus. This data can be seen as dynamic features related to both virus and the location with long temporal scales (i.e. the population data) or short ones (t i ). It shows how local and global trend for a virus can be forecasted for a given country (n z ), with urban features that include both spatial and demographic factors (x m ), that share a spatial weight (g j ) with other countries in the graph, whereas government mitigated measures (r q ) are applied. Put all together, the model needs to differentiate between factors that characterise countries or regions, and those which characterise the virus spread to understand the patterns of spread at global and country levels.

Translation to the machine
To meet these hypotheses and assumptions during the learning process, the architecture of the proposed model is based on the combinations of three main components: 1) LSTM, 2) Selfattention, and 3) Variational autoencoder graph.
2.2.1 LSTM cells. LSTM represents the main component of the proposed model. It has been shown it is the ability to learn long-term dependencies easier than a simple recurrent architecture [16,17]. Unlike traditional recurrent units, it has an internal recurrence or a selfloop, in which it allows the timestamps to create paths, in which the gradient of the model can flow for a long duration without facing the vanishes issues presented in a normal recurrent unit. Even for an LSTM with a fixed parameter, the integrated time scale can change based on the input sequence, simply because the constants of time are outputted by the model itself. These self-loops are controlled by a forget gate unit (f i (t) ) for a given time (t) and a cell (i), in which it fits this weight to a scaled value between 0,1 with a sigmoid unit (σ). It can be explained as: Where x (t) is a vector for the current input, h (t) is a vector for the current hidden layer that contains the outputs of all the LSTM cells, b f are the biases for the forget gates, U f is the input weights, W f is the recurrent weights for the forget gates. The internal state of the LSTM is updated with a conditioned self-loop weight (f i ) as: Where b represents biases, U represents input weights, W represents the current weights into the LSTM cell, and g i (t) represents the external input gate unit. It is computed similar to the forget gate but with it is own parameters as: Last, the LSTM cell output h i (t) can also be controlled and shut off with an output gate q i (t) , similar to the aforementioned gate by using a sigmoid unit. The output h i (t) is computed as: Where b o represents biases, U o represents input weights, W o represents the current, and φ .represents the activation function such as tanh function. Put all together, this controls of the time scale and the forgetting behaviour of different units allow the model to learn long-and short-term dependencies for a given vector. Not only the model learns from the previously defined timestamps for each country, but also the model could extract features from the other countries at each given timestamp in which the dimension of the input vector, and cell states, includes the dimensions of the different countries. It is worth mentioning that the input for the LSTM cells is can be seen as a three-dimensional tensor, representing the sample size for both training and testing, the defined timestamps for the model to look back, and the timestamps of the other countries as a global feature extractor.

Self-attention mechanism.
While the LSTM cells learn from their input sequence to output the predicted sequences through the long and short dependencies of the time constants and their additional features for each country, the relations between its inputs remains missing. A self-attention mechanism allows the LSTM units to understand the representation of its inputs by relating the positioning of each sequence [16,18]. This mechanism in the case of the proposed model is crucial to assist the model to which piece of information to consider and what to forget when making a prediction.

Variational autoencoder graph.
We initialise the first graph based on the spatial weight of the geographical locations of all infected countries (more details will follow in subsection 3.1.4), however, despite the attempts of trying to create a sophisticated adjacency matrix among the infected countries (based on flight routes, spatial network, migration network, etc.), the output may misleading for any learning method over time or for a given location. The spatial weight since the outbreak of the model may look completely different from the initial day to the latest day. These due to different policies and measures that are taken by countries. However, due to its high uncertainty and variation. Inputting the model with a static graph or even a dynamic one based on limited data may exacerbate the learning process. Accordingly. the third vital components in our model represent the variational autoencoder (VAE) component that allows the model to generate information from a given input. It can be defined as a generative directed method that makes use of the learned approximate inference [16,19]. The model is based on the idea of passing latent variables z to the coded distribution p model (z) over samples x using a differentiable generator network g(z). Subsequently, x is sampled from the distribution of p model (x; g(z)) which is equal to the distribution of p model (x|z). The model is trained by maximising the lower bound of the variation L(q) that belongs to x as: Eq (6) describes the joint log-likelihood of the visible and hidden variables under the approximate posterior over the latent variables log p model (z,x), and the entropy of the approximate posterior H(q(z|x), in which q is chosen to be a Gaussian distribution with a noise that is added to the predicted mean value. In traditional VAE, the reconstruction log-likelihood tries to equalise the approximate posterior distribution q(z|x) and the model prior p model (x|z). However, in the case of our model the encoded q(z|x) is conditioned and penalized based on the output of a predicted value of the next forecast of the spread, instead of the log-likelihood of the similarity with p model (x|z), which will be explained further in the proposed framework.

Proposed model framework
We propose a sequence-to-sequence architecture relying on a mixture of VAE and LSTM. The model comprises two branches trained in parallel in an end-to-end fashion. Fig 2 shows the overall proposed framework. The first branch is a self-attention LSTM model that feeds by Spatio-temporal data of coronavirus spread per day and per country, the government policies per day and per country, and the urban features per country, in which the vector is repeated to cover the duration of training (The urban features used are three features: population density, urban population percentage and fertility rate, which will be covered in detail in the upcoming section). Each input is reshaped as a 3D tensor of shape (samples, timestamps, number of features X number of countries). The three-input data are concatenated at the last axis of the data (the dimension of the feature) and passed to the first branch of the model through two parts: 1) the self-attention LSTM sequence encoder, and 2) the LSTM sequence decoder.
The first sequence encodes the input data and extracts features for the second part of the LSTM sequence to output the prediction of the spread for the next day (in case of the shortterm forecast) per country.
The first part consists of three LSTM layers, each consists of 50 LSTM units. The first two layers activated by a Rectified Linear Unit (ReLU) and a separated by a Dropout layer of size (0.2) to minimise over-fitness Moreover, a self-attention mechanism is applied after the second LSTM layer. The final LSTM unit is activated by a linear function as the first output for the LSTM sequence encoder part.
In parallel to the self-attention encoder sequence, the second branch of the model is an encoder of VAE. It is feed by a spatial matrix of dimensions (number of countries X number of countries) and repeated for the entire duration of training and timestamps (In the next section, more details will follow on how it is selected and computed). This encoder part is mainly a convolution structure, which consists of three 1D convolution layers of filters 32, 64, and 128 respectively, in which they are all of a kernel size of 1 and activated by a ReLU function and followed by a Dropout layer of size 0.2. After the dropout, two LSTM layers are followed, in which they contain 100, 494 LSTM cells respectively. The first one is activated by a ReLU function, whereas the second one by a linear function. A fully connected layer of neurons equivalent to the number of countries is applied. Last the latent space is defined with a dimension of 10, in which the z-values are generated from sampling over the Gaussian distribution of the previous inputted layer (As explained in section 2.2.3). To visualise the generated graph for representation purposes, It is worth mentioning that the encoder of the second branch of the model can be decoded to output the generated samples for each predicted sequence by passing it into a decoder VAE, where the 1D convolutions layers are transposed to a final output shape equal to the inputted dimension. As for future work, this could be an interesting approach to understanding the variation of the graph for each predicted day for all countries.
Both outputs of the self-attention LSTM encoder and the encoder of the VAE are concatenated over the feature dimension and passed to the LSTM decoder sequence, which contains a single LSTM layer of cell numbers equal to the total number of countries. It is followed by two fully connected layers of shape size (1 X number of countries) for predicting the value of the next day, in case of the short-term forecast, or can be shaped to (numbers of future steps X number of countries) for any number of future steps that model needs to output per each country.
Data sets are split to training and testing on the first dimension of data shape (the total duration of the temporal data), in a way that the model can be tested on the last 6 days. We trained two different models, one as a single-step model for the short-term forecast (one day), whereas the other is trained as a multi-step model (10 days forecast). There are two crucial differences between these two models; The output layer, and the dimension of the y-train, and ytest of the first one is shaped as (1 x n), whereas in the other model is output layer is shaped as (10 X n), despite the number of samples. is the structure of the y-train and y-test. The second issue is the trained and tested sample is not only reduced by the number of timestamps-at the beginning of each sequence-as in the case of the first model but also reduced by the number of future steps -at the end of the sequence-in the case of the second model. Last, based on trial and error, we structured the data based on 3 timestamps for both models to look back for all the input features for each country, in which we found optimal results.
The weights of the model are initialised by random weights. The model is compiled based on the backpropagation of error of the stochastic gradient descents, relying on 'adam' optimiser [20], with a learning rate of 0.001 and momentum 0.9. The model is trained for 500 training cycles (epochs).

Evaluation metrics
The performance of the proposed method is evaluated based on three different scales; 1) a global loss-based evaluation, 2) country-based evaluation and last, 3) step-based evaluation. The short-term forecast model (single-step model) relies only on the first two evaluation metrics, whereas the multi-step model includes the three levels of evaluations.
The first loss function evaluates the overall performance of the model at a global level, in which it influenced the adjustment of the model weights during training for both trained models. It is evaluated based on the Mean Squared of Error (MSE) which is calculated as: Where m is the total sample,ŷ test ð Þ is the predicted values of the test set, and y (test) is the observed values of the test set.
Furthermore, we computed a Logarithmic version of Mean squared error or so-called Mean squared logarithmic error (MSLE) to understand the ration between the true and predicted values. This function is accountable for the relative difference between the true and predicted values, whereas large errors are not significantly penalised than small ones. MSLE makes it easier for understanding and comparing the model performance in different countries despite how small or large their number of cases. MSLE is defined as: We also computed Kullback-Leibler divergence (D KL ) or so-called 'relative entropy 'which measures the difference between the probability distribution of two sequences. It is a common approach for assessing the VAE, nevertheless, it could be a good indicator to evaluate the predicted sequences globally. It is calculated as: Where p(x) and q(x) represent the two probability distributions of the two random discrete sequences of x. In the case of the model p(x) and q(x) represents the true distribution of data and the predicted one (y (test) andŷ test ð Þ ). It is worth mentioning pðxÞjjq x ð Þ The second loss evaluates the performance of the model at a local level of each country or region. Strictly,ŷ ðtestÞ and y (test) ideally fit a statistically significant linear model where the strength of the model with r-squared value can be computed for further interpretation, in addition to the computed MSE or its root, for each county for the entire duration. Similar to the second loss, the performance of the second model (the multi-step model) includes a calculated loss (based on the root of the MSE) for each predicted step.
Last, comparing our results to other models remains a challenge due to the absence of a unified model similar to what we have achieved that forecast each country globally, or also due to the absence of general benchmark data with a common evaluation metrics. However, we try our best to compare and discuss the performance of our method to any existing models such simple or deep time-series model for specific countries or at any specific time.

Input data
To forecast the spread of the Coronavirus the next day, we synchronised different types of data to allow the model to learn. This wide range of data comprises the historical data of the coronavirus spread by each country, dynamic policies and government responses that amid to mitigate Coronavirus by each timestamp and by each country, static urban features that characterise each country and shows significant correlations with the virus spread, and last, the spatial weight among the different countries. These different data types are integrated and synchronised by countries and -time steps in case of dynamic data-to be feed to the introduced framework.
3.1.1 COVID-19 confirmed cases data. We used the historical data for Coronavirus spread published by John Hopkins University [21,22]. After integrating this data with following data sources, the version we used, contains timestamps from 22/01/2020 till 14/06/2020 (144 days) for 282 regions or countries across the globe as shown in Fig 3 for the confirmed cases for the start and end day of the examined duration.

Urban features data.
We used demographic and locational data that represent the population for each region or country from the aforementioned data set [23]. There is a wide range of factors, however, we only selected three factors; 1) Population density, 2) fertility rate and 3) Urban population. Fig 4 shows the spatial dynamics of these three factors. The two reasons for selecting these features are: First, the selection is based on enhancing the model prediction after several trial and errors with and without several features. Second and most importantly, the selected features show a statistically significant association with the spread of coronavirus over time for all countries across the globe. We examined other variables that represent each country or region such as the absolute value of population in 2020, the yearly change in the population, the world share of the population, and the value of the land area for a given country, in which we found them insignificant with  https://doi.org/10.1371/journal.pone.0246120.g004 the spread of coronavirus over time. Fig 5 shows the outputs of the spearman correlation for the three selected factors. In Fig 5A, the population density was significant with decaying positive correlation coefficients (rho) for the first 70 days from the starting date. This means at the early stage of the virus spread, the higher the population density, the more likely a higher coronavirus spread. In Fig 5B, the fertility rates across the globe show a significant association over the entire test duration except for May. The significant results are with negative rho values, which means countries with higher fertility rates are less likely to have a higher spread of coronavirus, except for the spread of the virus in May. This could explain the less spread of the virus in Africa (as shown in Fig 3), however, this feature may be a time-dependant or due to reporting inaccuracy or the low percentage of virus testing in Africa. Last, in Fig 5C, the percentage of the urban population started to show a significant association with the spread of the virus with positive rho values only for the period between the end of February till the first week of May. During this period, this means the higher the countries with a higher percentage of the urban population, are more likely to have higher coronavirus spread.

Government Response Stringency Index.
Different countries took and continuously take different measures and responses amid towards coronavirus outbreak. These time and location dependant measures include 13 indicators, which they are: 1) school closing, 2) workplace closing, 3) cancelling public events, 4) close public transport, 5) public information campaigns, 6) restrictions on internal movements, 7) international travel controls, 8) fiscal measures, 9) monetary measures, 10) emergency investment in health care, 11) investment in vaccines, 12) virus testing framework and 13) contact tracing. Put all together, Oxford COVID-19 Government Response Tracker [24] aimed to measure the variation of the government responses weighted by these indicators in a scaled index, so-called Stringency Index. It is worth mentioning that the data is continuously updated, whereas new indicators are introduced to improve the quality of the Stringency Index. We used this index to weight the different countries based on the government responses, after integrating and matching the time and location of the previously mentioned data sets (See Fig 6).

Spatial weight.
We computed a spatially weighted adjacency matrix based on the geolocation of each region or country, relying on the geodesic distance between each region or country. We used the haversine formula to compute the distance on the sphere. It calculated as: ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ð1 À aÞ Where φ 1 , φ 2 represent the origin and destination latitudes in radian respectively, Δλ represents the change between the origin and destination longitudes in radian, and R is the earth's radius.
The adjacency matrix is conditioned based primary on eliminating long-distance connections, which can represent the connection between the US and Europe, the US and China, and direct connection between China and the rest of the world. This hypothetical assumption came from the early international measured took by the US to ban flight to Europe and China for Non-American citizens. Given, this spatial weight may vary or have a higher degree of uncertainty, the model only self-learns from its representation while it generates various samples with the VAE encoder as discussed earlier, instead of using these data as a fixed and constant factor during training and testing. to be in business-as-usual. However, these are only a few easily interpretable examples, the challenges for the model is to self-learn the representation of the graph to adjust the different weights and generate a graph that could in forecasting the spread globally. In Fig 7, we show how we initialised the adjusted spatially weighted matrix for all countries. It attempts to show three main elements for computing the graph: first, it shows how a complete graph between the origin and destination countries is computed. Second, how the relative distance is computed and conditioned. And last, it shows how the array is scaled and reshaped. Fig 8 shows examples of the variation that could be more significant and realistic for predicting a given day for a given country. For instance, the first graph in Fig 8, can represent countries with strict measures towards international travel, the second one which could be the more likely to be the case during the period of banning travel from the US to Europe or China, for instance, the last two shows how the world more likely.

Model evaluation globally
After 500 epochs, the training and testing curves of the model show a steady output with no sign of over fitness, nevertheless, the MSE losses for both curves are at a minimum, with values less than 0.01, whereas the KL loss for the test set is less than 0.37 for both trained model. In Fig 9, we show the distribution of the confirmed and predicted cases globally with the singlestep model. The total predicted cases per day is a close number to the actual data, with a slightly higher confirmed in Africa than what has been confirmed.
In Fig 10, we show the sum of the accumulated predicted cases-predicted at a country level -across the globe for each day regarding the actual data. The results are highly accurate at a global level, with a fraction difference between the actual and predicted ones on the last examined day 14/06/2020. Specifically, Fig 10A show the accumulated prediction and the actual cases globally in a linear scale of cases counts. Fig 10B shows the true and predicted of confirmed cases at a logarithmic scale. It compares the logarithmic prediction and true values. After the initial days-where the initial cases were mainly zero in many countries globally and there was no enough data for the model to learn-the model shows a high validation in learning the overall pattern, in addition to predicting the actual numbers.
The prediction of the model is nonlinear, however, its output at a given sample when compared to its ground truth is linear. Therefore, fitting a linear regression model between the predicted result and the observed one and providing an r-squared value could be a good indicator for understanding the model strength. Fig 10C shows the relation between the confirmed and predicted cases after fitting it to a linear model. It also shows the r-squared value, the root of the MSE metrics (RMSE) and the MSLE for a linear regression fitted model on the predicted and actual values of our single-step model. The computed metrics show a high linear association among them.
What makes this method more reliable than any simple time-series model is that the predicted global curve to the actual one is outputted without the model learning any explicit rules extracted at the global level to mimic the global spread curve of the virus. The model learns the patterns at country levels, whereas error is minimised at both local and global levels. What makes this a very crucial point to discuss is that changes across the globe are more likely to happen at a country level, whereas the global level is rather an impact of the different countries. Table 1

Evaluation of selected countries
Not only does the model shows strong performance globally but also at the country level. Fig 11 shows the performance of the single-step model in different countries (for linear and logarithmic scales). After having enough data and after passing the initial days with zero values (the first 40-50 days), the model shows high performance in learning the spread pattern. It is worth mentioning that the distortion in ground truth curves reflects data uncertainty, which accordingly, it impacts the variance of the predicted values. Overall, the model shows higher performance in countries with higher spread whereas the performance of the model decreases with countries with fewer cases over a short period. However, the model shows overall reliable results at a country level for estimating the actual results and their overall patterns.

Evaluation of single-step and multi-step models
In Table 2, we extend on the evaluation of the single-step model. We show a further variation of prediction in selected countries in different continents. While the model performance varies from a country to country, overall, it shows a reliable result for at a country level.
In Table 3, we show the performance of the 10-step model for a group of selected countries. This model is evaluated per country and step. While the model performance reduces with the increase of the number of steps, compared to the single-step model, the result to a higher degree remains consistent at a country level when we reach the 10-step.

Discussion
In this article, we introduce a method for predicting the spread of coronavirus for each country across the globe for both short and long-term forecast. It has three main advantages, first, the model learns not only from the historical data but also the applied governmental measures for each country, urban factors, and the spatial graph that represent the dependencies among the different countries. The second advantage of the model is its ability to be applied at various scales. Currently, it can forecast the spread at a global and country, and region level (i.e. the case of China, UK), however, it can also be applied at the city level. Last, the model can forecast short and long term forecast which could be a reliable tool for decision-making.

Base model evaluations
There are different attempts for relying on a simple time-series model whether it is relying on machine learning or a simple mathematical rule for a single country or the total cases globally. However, the drawback in such methods is: First, by fitting an exponential smoothing function to a model with no controlled point would mean the virus will continue to spread, regardless of the number of a population, the action is taken. Second, if a simple rule for a given country works for the last days, till when this logic will continue works? What happens when values remain constant, decrease, or even increase at a different rate? There are different possible scenarios that such an approach could not answer. Third, despite the first two arguments, how many rules are needed to fit each country globally at a longer period? Subjectively, a simple time-series model without considering the factors that characterise countries or policies taken to find "general rules and features" would mean finding simple rules for each country at a given time. In most simple ways, when the curve is only increasing at the initial spread time.
Last, even if these previous issues are solved, the world is connected, the spatial weights may vary from country to country or day to day based on the restrictions and measures are taken. If there are simple rules that ultimately can fit the entire countries, the challenges would remain in how to weight the changes around the world. Most importantly, one single case in one region could influence the spread elsewhere.

Deployment and online inference
The model is trained on data from different countries, with different measures applied at a different pace. It has also seen data before, during and after lockdown measures of different regions. Moreover, the model has seen data before and after travel restrictions (from January to June) for certain countries. Accordingly, even if measures and restrictions change in the future, this could allow the model to predict future cases by understanding and inferring the change in the measures and their effects in a given country. However, as for the future improvement of the model, more data types, when they are available, could be fed into the model. Besides the policies and restriction that would vary from country to country, real-time mobility data or mobility indices could help the model to forecast new cases in the post-lockdown periods.
Building on how the model could be used for future inference, deploying the model in an online platform is a possible application of this research. This application could assist policymakers to have a better overview picture of the status of the spread of the virus across the globe to help implement or eliminate a given policy. In Table 4, we show the run-time required for the different tasks on a single GPU. We show that updating the data, computing adjacency, fusing the data, and re-training the model would require less 5 min, whereas utilising the model for inference will require less a second to predict a future step (multi-steps) for all countries.

Limitations and future work
The generative graph of the model along with the other factors has generated good predictions for each country globally (based on trial and errors). However, it remains a challenge that countries with spread over a longer period are more likely to be predicted more accurately than countries with no prior cases, despite how large or small the numbers of cases are. Based on our experiments, the model still understands the pattern in countries that are perceived as outliers, but with lower accuracy. For example in case of Sri Lanka, the strength of the model performance, in term of r-squared, decreases by 23% or the logarithmic error increases by 2.5 folds in comparison to the model performance in predicting spread cases in the United States (see Table 1).
Re-training the model with more data in the future would yield better results at both global and country levels. Besides data improvement, there are three main ways in which the model algorithms can be advanced in future work. First, finding more significant spatial or demographic factors that show significant associations with the spread may enhance the forecasts of the model. Second, applying the same concept and goals of the model to other subjects of coronavirus could lead to a better understanding of its future. This may include estimating deaths or recovery, bearing in mind the health system capability and capacity, in addition to the governmental responses. Currently, the model is capable of forecasting 10 steps in the future with acceptable accuracy, in which it is validated. With more data on more factors, the introduced method could also lead to better long-term forecast for each country based on the lesson learned from the global and country-level trends. Last, the method introduced can be used for polices evaluation by changing a given policy for a given country and date to show how the prediction in future is affected. Accordingly, as for future research, exploring the effect of the different urban factors and governmental measures at both global and country levels can be tackled to assist policy-makers to reach optimal measures amid toward reducing the spread of coronavirus. for reaching optimal measures.

Remarks and lessons learned
In this article, we introduced a novel variational-LSTM autoencoder to predict the spread of coronavirus for different regions/countries across the globe. The introduced learning process and the structure of the data are keys. The model learned from various types of dynamic and static data, including the historical spread data for each country, urban and demographic features such as urban population, population density, and fertility rate, and government responses for each country amid towards mitigating coronavirus outbreak. Also, the model learned to sample different conditions and adjustments of a spatially weighted adjacency matrix among the different infected countries. Overall, the model shows high validation for forecasting the spread at global and country levels, which makes it a useful tool to assist decision and policymaking for the different corners of the globe.
There are several lessons learned while conducting this research. First, concerning urban features, we found several associations of several factors with the spread of coronavirus globally for a specific period of the tested duration. Most significantly, countries with a higher density of population in one km 2 and larger portion of the population living in urban areas are associated with higher coronavirus spread with different coefficients, and levels of statistical significance during the examined duration, whereas countries with higher fertility rates are associated with fewer spread cases at the given studied period (22/01/2020-14/06/2020). However, we also found an association with other factors that not used in this research such as migration nets. We found that countries with higher migration flows are associated with higher spread which could also be explained with their likelihood of having a higher influx of job opportunities. Second, concerning the computed adjacency matrix graph, we found that at very short distances among the different infected countries with coronavirus spread, Western European countries (such as Germany, Italy, Spain) are fully or partially connected relative to other countries globally that are same distance they are completely isolated. This can be reflected on the relatively shorter distance-as a physical attribute-as among these countries when it compares to other countries, or the non-physical accessibility of the European market which could lead to a higher influx of migration and accordingly higher spread cases.