Nowcasting Unemployment Rates with Google Searches: Evidence from the Visegrad Group Countries

The online activity of Internet users has repeatedly been shown to provide a rich information set for various research fields. We focus on job-related searches on Google and their possible usefulness in the region of the Visegrad Group - the Czech Republic, Hungary, Poland and Slovakia. Even for rather small economies, the online searches of inhabitants can be successfully utilized for macroeconomic predictions. Specifically, we study unemployment rates and their interconnection with job-related searches. We show that Google searches enhance nowcasting models of unemployment rates for the Czech Republic and Hungary whereas for Poland and Slovakia, the results are mixed.


Introduction
Online activity has become an inherent part of modern society and of the way its members live. The Internet provides a vast amount of information to its users as well as aid and assistance in times of need. During the current financial crisis and the economic and production crises that followed, most developed as well as developing economies have been hit by an economic downturn, which is usually tightly connected with growing unemployment. Job loss can be a very traumatizing experience with a long-lasting impact on one's life. Seeking a new job then becomes an integral part of everyday life. In the current digitalized era, job seeking is not restricted to job offices; seekers (as well as potential employers) more frequently turn to the Internet as a source of information and new possibilities. As such, job seekers leave a digital track of their activity.
Analysis of various patterns of online activity has become a fruitful branch of research in recent years, with some exciting applications such as elections [1], investment allocation [2,3], private consumption [4] and consumers' behavior [5], future orientation [6], earnings announcements [7], the spreading of diseases [8][9][10][11][12], and economics and finance [13][14][15][16][17][18][19]. Turning back to unemployment and its possible examination using the online activity of Internet users, some research has been done in this area as well, focusing primarily on Google search queries. The first study of the possible connection between Google searching activity and unemployment rates, which examines the series in Germany, shows the usefulness of adding search query data to the models [20]. Subsequent research [21][22][23] analyzes the connection between the queries and claims for unemployment benefits in the USA, and the unemployment rate itself has been studied as well [24,25]. Even a job search activity index based on Google search data has been developed [26]. Most of these studies focus on the US economy and its modeling, while other economies are studied rather marginally [27,28].
Here, we focus on the possible connection between job-related search queries on the Google search engine and the unemployment rate in the countries of the so-called Visegrad Group (the Czech Republic, Hungary, Poland and Slovakia). Our contributions lie in the following.
First, we focus on a set of countries which would normally be treated as marginal and thus not much studied. However, if the utility of online search activity (and specifically of Google searching) is to be claimed, its efficiency should be shown not only for developed and well-covered countries but also for smaller ones, and the results might prove useful to policy makers in such regions. Second, we provide a careful step-by-step procedure for unemployment modeling, focusing not only on simple correlations but also on nowcasting, forecasting and causality. And third, we deliver a cross-country comparison, which is rather unique as comparable studies focus primarily on one specific country.

Results
The unemployment rates have undergone quite heterogeneous evolution in the analyzed countries (Fig. 1). In the Czech Republic, the rate ranged between 4% and 9% between 2004 and 2013. Initially, there was a significant downward trend from 2004 to 2008, when the rate dropped from 9% to 4%. As the recession hit the Czech Republic in 2008, the rate started to rise and reached a new maximum of 8.5% in 2010. Since then, the unemployment rate has fluctuated between 7% and 8.5%. The Hungarian unemployment rate rose steadily from 2004 to 2010, when it reached a new maximum of nearly 12%. After that, the rate fluctuated between 10.5% and 12% for almost three years and started declining in 2013. Unemployment in Poland experienced a steady decline from the very high rate of nearly 22% in 2004 to 6% in 2009. However, as the recession hit Poland, the unemployment rate began rising again. With some minor fluctuations, it smoothly increased to the current level of approximately 10%. In Slovakia, the unemployment rate seems to follow a pattern similar to that of the Czech Republic, although on a different scale. In 2004, Slovakia had an unemployment rate of almost 20%. This rate decreased linearly to 8% in 2009. With the onset of the recession, the unemployment rate quickly escalated to 16%, around which it has been fluctuating until today.
The evolution of the Google searches is illustrated in Fig. 2. There are evident seasonal patterns in all four series. Hungary is characterized by a quite regularly increasing trend in the Google searches, whereas Slovakia shows the opposite, and the remaining two series remain quite stable in time. Even though some connection between the Google searches and the unemployment rates for the Czech Republic and Hungary seems visible to the naked eye, we can hardly claim any relationship without a proper analysis.

Basic relationship
As the initial step, we present the results of the stationarity tests, which tell us whether we should analyze the original series or some of their transformations. In Tab. 1, we show the results of the ADF and KPSS tests (see the Methods section for more details) for the original as well as the logarithmic series and their first differences. The outcome is quite straightforward as we do not reject unit roots for any of the original series (or for the logarithmic transformation of the Google searches; we do not examine the logarithmic transformation of the unemployment time series as these are already in percentage representation). Further testing, which is not reported here, shows no cointegration relationship between the unemployment and the search query series, so we need to proceed with the first differences of the series. For most of the cases, the tests support stationarity of the first differences. In the analysis, we further proceed with the first differences of the unemployment rate and the first logarithmic differences of the Google searches. We opt for this combination as the pair of percentage representation and logarithmic transformation allows for a straightforward interpretation as an elasticity, i.e. as a proportional relationship.
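The transformation described above can be sketched in a few lines of Python. This is only a minimal illustration: the function names and the sample values are assumptions made for the example and are neither the authors' code nor their data.

```python
import math


def first_difference(series):
    """Return x_t - x_{t-1} for every t after the first observation."""
    return [b - a for a, b in zip(series, series[1:])]


def log_difference(series):
    """Return log(x_t) - log(x_{t-1}); all values must be strictly positive."""
    return first_difference([math.log(x) for x in series])


# Hypothetical monthly values, for illustration only.
unemployment_rate = [7.0, 7.2, 7.1, 7.5]  # already expressed in percent
google_index = [50.0, 55.0, 52.0, 60.0]   # search-intensity index

d_ur = first_difference(unemployment_rate)  # change in percentage points
dlog_gi = log_difference(google_index)      # approximate proportional change
```

Each differenced series is one observation shorter than its input, so the two outputs stay aligned as long as the inputs cover the same months.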
For the very basic relationship between the unemployment rate and the intensity of the job-related searches on Google, we study the following equation
$$\Delta UR_t = \alpha + \beta \, \Delta \log(GI)_t + \varepsilon_t \qquad (1)$$
where $\Delta UR_t$ and $\Delta \log(GI)_t$ stand for the first difference of an unemployment rate at time $t$ and the first logarithmic difference of the Google searches at time $t$, respectively, for a given country, $\alpha$ is an intercept, $\beta$ is the slope coefficient of interest, and $\varepsilon_t$ is an error term.
The elasticity between the Google searches and the unemployment rate from Eq. 1 is estimated at 0.5538 (with a p-value of 0.0533), 0.2056 (0.0726), 0.3317 (0.2163) and 0.4630 (0.0062) for the Czech Republic, Hungary, Poland and Slovakia, respectively, with heteroskedasticity and autocorrelation consistent (HAC) standard errors. The proportional relationship thus varies across the analyzed countries, but it remains positive for all four and statistically significant for three out of four (at least at the 10% significance level). Specifically, the estimated coefficient is largest for the Czech Republic and Slovakia, with values around 0.5. This shows that changes in the unemployment rate are well projected into the online search queries for vacancies and job-related terms. Studying the connection between these two variables thus seems promising and worth further utilization and investigation.
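A slope of this kind can be obtained with ordinary least squares. The sketch below is a minimal illustration under stated assumptions: it regresses the differenced unemployment rate on the differenced logarithmic search index, uses made-up numbers, and returns only the point estimates. It does not compute the HAC standard errors reported above, and `ols_simple` is a hypothetical helper name.

```python
def ols_simple(x, y):
    """Ordinary least squares for y = a + b * x + e; returns (a, b)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x has no variation")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical differenced series, for illustration only.
dlog_gi = [0.10, -0.05, 0.02, 0.08, -0.03]  # first log-differences of searches
d_ur = [0.06, -0.01, 0.02, 0.05, 0.00]      # first differences of the rate

intercept, slope = ols_simple(dlog_gi, d_ur)
```

With real data, inference on the slope would additionally require standard errors that are robust to heteroskedasticity and autocorrelation, which this sketch leaves out.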

Nowcasting
Macroeconomic time series, such as the unemployment rates, have a special property which is not present for financial series or other series in natural sciences: they are available with a pronounced lag. This is due to data collection and processing, which usually take several months, and even after such a period, there are sometimes corrections to the reported values. This characteristic makes a series which is available immediately, without any lag, and which is strongly correlated with the variable of interest very useful for forecasting the present value of the variable without waiting for several months. Such forecasting of the present is usually referred to as "nowcasting".
In the previous section, we have shown that the Google searches for job-related terms are significantly correlated with the unemployment rate, which makes the search queries potentially useful for nowcasting unemployment. As a nowcasting model, we consider a regression of the differenced unemployment rate on its own lags and on the differenced logarithmic Google series, where the unemployment rate is assumed to be available with a three-month lag. We again consider the differenced series due to the stationarity issues discussed above. Both series are kept to the lag of 12 months, which controls for the seasonal pattern in both series.
The results of the nowcasting models are summarized in Tab. 2. There we show the adjusted $R^2$ ($\bar{R}^2$) as a measure of the models' quality controlling for the number of explanatory variables. We observe that for all countries, the inclusion of the Google series enhances the model strongly. The $\bar{R}^2$ increases by approximately a third for all countries but Poland, for which it increases slightly less. Nonetheless, inclusion of the search queries improves the model significantly for all countries, as is reported by the F-statistics for the null hypothesis that the search terms are jointly insignificant. The search series are jointly significant even at the 1% level.
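The two quantities used in this comparison follow standard textbook formulas, sketched below in Python. The function names, argument names and the numbers in the usage lines are assumptions chosen for illustration; they do not reproduce the values in Tab. 2.

```python
def adjusted_r2(r2, n_obs, n_regressors):
    """Adjusted R^2 for a model with an intercept and n_regressors slopes."""
    dof = n_obs - n_regressors - 1
    if dof <= 0:
        raise ValueError("not enough observations for this many regressors")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / dof


def nested_f_statistic(rss_restricted, rss_unrestricted,
                       n_restrictions, n_obs, n_params_unrestricted):
    """F statistic for H0: the n_restrictions extra coefficients are all zero.

    rss_restricted:        residual sum of squares without the extra terms
    rss_unrestricted:      residual sum of squares with the extra terms
    n_params_unrestricted: number of coefficients (including the intercept)
                           in the larger model
    """
    dof = n_obs - n_params_unrestricted
    if dof <= 0 or n_restrictions <= 0:
        raise ValueError("invalid degrees of freedom")
    numerator = (rss_restricted - rss_unrestricted) / n_restrictions
    return numerator / (rss_unrestricted / dof)


# Hypothetical inputs, for illustration only.
example_adj_r2 = adjusted_r2(r2=0.5, n_obs=101, n_regressors=10)
example_f = nested_f_statistic(120.0, 100.0, n_restrictions=2,
                               n_obs=53, n_params_unrestricted=3)
```

In the nowcasting setting, the restricted model would contain only the lagged unemployment terms and the unrestricted model would add the search terms, so a large F statistic speaks against the joint insignificance of the searches.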

Forecasting & Causality
The nowcasting results are very promising and they illustrate the usefulness of the Google search series. However, we are also interested in whether such usefulness is mainly due to the unavailability of the unemployment data or whether the search query data provide additional informative value as well. To find out, we also undertake a standard forecasting exercise in which we hypothesize what would happen if the unemployment data were available straightaway. If the Google series improve even such a hypothetical model, we conclude that the search query data bring additional information to the model beyond being strongly correlated with the changes in the unemployment rate itself.
For the forecasting exercise, we utilize the standard vector autoregressive model (VAR, see the Methods section for more details). The specific model is a bivariate VAR of the differenced unemployment rate and the differenced logarithmic Google searches, and it is compared to a simple autoregressive model of unemployment. For comparison purposes, we use two measures of forecasting quality: root mean squared error and mean absolute error (RMSE and MAE, respectively, see the Methods section for more details). These measures are very straightforward: the lower they are, the better the model performs. In addition, we utilize the Diebold-Mariano test [29], which compares the forecasting performance of two models under the null hypothesis that the models perform the same (see the Methods section for more details).
The model is estimated on the series between January 2004 and December 2012 and the forecasting period is set between January and December 2013.
The summary of the forecasting performances is given in Tab. 3. There we can see that for all countries, the forecasting performance of the models improves strongly with the addition of the Google searches. This is further supported by the results of the Diebold-Mariano test, which gives significant results, i.e. the models using the Google data outperform the ones without them, for all countries at least at the 5% significance level. The online search data thus evidently provide additional informative value for unemployment modeling.
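The Diebold-Mariano comparison used here (and defined in the Methods section) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the default squared-error loss, the truncation parameter `max_lag` and the simple unweighted long-run variance estimate are all assumptions of the example.

```python
import math


def diebold_mariano(errors_1, errors_2, loss=lambda e: e * e, max_lag=0):
    """Diebold-Mariano statistic and two-sided asymptotic p-value.

    errors_1, errors_2: forecast errors of two models over the same periods
    loss:               loss applied to each error (squared error by default)
    max_lag:            autocovariances included in the long-run variance
                        (0 is a common choice for one-step-ahead forecasts)
    """
    if len(errors_1) != len(errors_2) or len(errors_1) < 2:
        raise ValueError("need two error sequences of equal length >= 2")
    d = [loss(e1) - loss(e2) for e1, e2 in zip(errors_1, errors_2)]
    n = len(d)
    d_bar = sum(d) / n

    def autocov(k):
        return sum((d[t] - d_bar) * (d[t - k] - d_bar) for t in range(k, n)) / n

    lrv = autocov(0) + 2.0 * sum(autocov(k) for k in range(1, max_lag + 1))
    if lrv <= 0:
        raise ValueError("long-run variance estimate is not positive")
    stat = d_bar / math.sqrt(lrv / n)
    # Two-sided p-value under the standard normal limit: 2 * (1 - Phi(|stat|)).
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return stat, p_value


# Hypothetical forecast errors, for illustration only.
errors_with_google = [0.1, 0.2, 0.1, 0.2]
errors_without_google = [0.5, 0.6, 0.7, 0.8]
dm_stat, dm_p = diebold_mariano(errors_with_google, errors_without_google)
```

With this sign convention, a negative statistic means that the first model has the lower average loss. With only twelve forecast months, as in the exercise above, the normal approximation should be read with some caution.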
As the last step of the analysis, we provide a causality examination. We are thus interested in the specific relationship between the two analyzed series. Concretely, we examine whether increasing unemployment causes people to look up job-related terms more, or whether increased online activity signals potential tensions on the job market, or both, or neither. To do so, we utilize the Granger causality framework (see the Methods section for more details), which is built on the VAR analysis. The results are summarized in Tab. 3. Note that the null hypothesis of the Granger causality test is "no Granger causality". Therefore, if the null hypothesis is rejected, causality is claimed to be found. The findings are quite homogeneous. For three out of four countries (Hungary being the exception), we report causality in both directions. The influence thus goes in both directions and the series strongly influence each other.

Methods
[...] and harmonized household survey, which is carried out in each member state in accordance with EU legislation. The monthly data from Eurostat are estimates based on the results of the EU LFS. Since there are no legal obligations for the EU countries to deliver monthly data, these data are often interpolated/extrapolated using national survey or registered unemployment data.
According to Eurostat, an unemployed person is defined as someone aged between 15 and 74 without work during the reference week who is available to start working within two weeks and who has actively sought employment at some time during the last four weeks. In our analysis, we use the general (both sexes, 15-74 years old) raw (not seasonally adjusted) unemployment rate. We do this since we do not know the method used for the seasonal adjustment and the Google data are not seasonally adjusted either.
The Google search query data have been downloaded from the Google Trends webpage. As the languages of the studied countries differ, we have looked for various terms. As Czech, Polish and Slovak are all Slavonic languages, the searched words are very similar or even the same. For Czech, we searched for "práce", for Polish "praca" and for Slovak "práce" as well. For Hungarian, we used the term "állás". For the Slavonic languages, the terms are equivalent to "job" or "work", and for Hungarian, the term is close to "job" or "work" but rather in the sense of looking for it. The term "állás" provides better results than the more straightforward "munka", which would be closer to the more standard meanings of the terms "job" or "work".
The weekly series obtained from the Google Trends site have been transformed to monthly series on the basis of the number of days in each month. All series, both of the unemployment rate and the Google searches, are studied between January 2004 and December 2013.
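The text does not spell out the aggregation rule, so the following Python sketch shows only one possible reading of it: each weekly value is assumed to hold for the seven days starting at the week's first day, and the monthly value is the average over the days falling in that month. The function name, the input layout and the sample values are assumptions of the example.

```python
from collections import defaultdict
from datetime import date, timedelta


def weekly_to_monthly(weekly):
    """Convert weekly values to day-weighted monthly averages.

    weekly: list of (week_start_date, value) pairs; each value is assumed
            to apply to the 7 days starting at week_start_date.
    Returns a dict {(year, month): average of the daily values in that month}.
    """
    totals = defaultdict(float)
    day_counts = defaultdict(int)
    for start, value in weekly:
        for offset in range(7):
            day = start + timedelta(days=offset)
            key = (day.year, day.month)
            totals[key] += value
            day_counts[key] += 1
    return {key: totals[key] / day_counts[key] for key in sorted(totals)}


# Hypothetical weekly index values, for illustration only.
weekly = [
    (date(2004, 1, 22), 40.0),  # 22-28 January
    (date(2004, 1, 29), 70.0),  # 29 January - 4 February (spans two months)
    (date(2004, 2, 5), 50.0),   # 5-11 February
]
monthly = weekly_to_monthly(weekly)
```

A week that straddles two months contributes to each of them in proportion to the number of its days in that month. Months that are only partly covered by the input are averaged over the covered days only.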

Stationarity
A stochastic process $\{z_t\}$ is stationary if for every collection of time indices $1 \le t_1 < t_2 < \dots < t_m$, the joint probability distribution of $(z_{t_1}, z_{t_2}, \dots, z_{t_m})$ is the same as the joint probability distribution of $(z_{t_1+h}, z_{t_2+h}, \dots, z_{t_m+h})$ for all integers $h \ge 1$ [30]. To test for stationarity, we utilize the Augmented Dickey-Fuller (ADF) test [31] and the KPSS test [32]. The tests have opposite null hypotheses so that they provide a complementary pair which is commonly used for stationarity testing.
In the ADF procedure [31], the OLS regression is run on
$$\Delta z_t = \alpha_0 + \gamma t + \theta z_{t-1} + \sum_{i=1}^{p} \delta_i \, \Delta z_{t-i} + \varepsilon_t$$
in order to perform the test, where $\alpha_0$ and $\gamma t$ are an intercept and a time trend, respectively, and $p$ represents the lag order. The null hypothesis under which the series contains a unit root is $H_0: \theta = 0$ against the alternative $H_1: \theta < 0$. The ADF test statistic is then computed as the usual t-statistic of $\hat{\theta}$, which, however, follows a more complicated distribution under the null hypothesis. Due to the relatively short time series, we set the number of lags arbitrarily to three.
The null hypothesis of the KPSS test [32] is opposite to the one of the ADF test, i.e. the KPSS test has the null hypothesis of stationarity. The test is based on the OLS regression of the series $\{z_t\}$
$$z_t = \alpha_0 + \gamma t + \kappa \sum_{i=1}^{t} \xi_i + \varepsilon_t$$
where $\alpha_0$ and $\gamma t$ again represent an intercept and a time trend, respectively, and $\xi_i$ are independent and identically distributed random variables with a zero mean and a unit variance. The null hypothesis of stationarity is $H_0: \kappa = 0$ against the alternative $H_1: \kappa \neq 0$. The KPSS test statistic is defined as
$$KPSS = \frac{1}{T^2} \sum_{t=1}^{T} \frac{S_t^2}{\hat{\omega}_T^2}$$
where $S_t = \sum_{i=1}^{t} \hat{\varepsilon}_i$ is the partial sum of the residuals $\hat{\varepsilon}_i$ and $\hat{\omega}_T^2$ is an estimator of the spectral density at frequency zero.

Vector autoregression
Vector autoregression (VAR) is simply a system of temporally dependent series. More precisely, denote the number of variables $k$ and the length of the series $T$; then a VAR of order $p$ is generally represented by the equation
$$y_t = \alpha + A_1 y_{t-1} + A_2 y_{t-2} + \dots + A_p y_{t-p} + \varepsilon_t$$
where $y_t$ and $\varepsilon_t$ are $k \times 1$ vectors of the studied series and residuals at time $t$, respectively, $\alpha$ represents a vector of constants and $A_i$ are time-invariant $k \times k$ matrices replacing the traditional $\beta_i$ coefficients. The selection of an appropriate lag order $p$ is usually based on a specific information criterion.
In the VAR framework, the Granger causality concept is usually used as well. The causality testing simply consists in testing the joint significance of the lags of one variable in the equation for some other variable. The testing procedure is thus an F-test for the joint significance of a specific variable. It needs to be noted that such causality is strictly statistical and it should always be treated with caution.
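The F-test described above can be sketched in plain Python as a comparison of two OLS fits: one equation with only the target's own lags and one that adds the lags of the candidate variable. The sketch is an illustration under stated assumptions and not the authors' implementation: the helper names are hypothetical, a lag order of one is used for brevity, and the data are simulated so that `x` leads `y` by construction.

```python
import random


def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols_rss(rows, y):
    """Residual sum of squares of an OLS fit of y on the regressors in rows."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve_linear(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def granger_f(target, candidate, lags=1):
    """F statistic for H0: lags of `candidate` do not help predict `target`."""
    n = len(target)
    y = target[lags:]
    restricted = [[1.0] + [target[t - i] for i in range(1, lags + 1)]
                  for t in range(lags, n)]
    unrestricted = [row + [candidate[t - i] for i in range(1, lags + 1)]
                    for row, t in zip(restricted, range(lags, n))]
    rss_r = ols_rss(restricted, y)
    rss_u = ols_rss(unrestricted, y)
    dof = len(y) - (2 * lags + 1)
    return ((rss_r - rss_u) / lags) / (rss_u / dof)


# Simulated data, for illustration only: y depends on the previous value of x.
random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(200)]
y = [0.0] + [0.8 * x[t - 1] + random.gauss(0.0, 0.1) for t in range(1, 200)]

f_x_to_y = granger_f(y, x)  # expected to be large
f_y_to_x = granger_f(x, y)  # expected to be small
```

The statistic would be compared with critical values of the F distribution with `lags` and `dof` degrees of freedom. As the text stresses, a rejection indicates predictive content only, not causation in a structural sense.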

Forecasting
To compare the forecasting accuracy of the proposed models, we utilize three measures: mean absolute error (MAE), root mean squared error (RMSE) and the Diebold-Mariano test [29].
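MAE measures the average value of absolute losses. In other words, it gives the average deviation of the forecast from the realized value in absolute terms. It is given by the equation
$$MAE = \frac{1}{n} \sum_{i=1}^{n} |f_i - y_i| = \frac{1}{n} \sum_{i=1}^{n} a_i \qquad (8)$$
where $f_i$ stands for the predicted value, $y_i$ is the actual value, $n$ is the number of forecasts and $a_i = |f_i - y_i|$.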
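Both error measures used in the comparison, MAE as given here and RMSE as defined in the text that follows, can be sketched in Python. The function names and sample numbers are assumptions made for illustration.

```python
import math


def mae(forecasts, actuals):
    """Mean absolute error: average of |f_i - y_i|."""
    if len(forecasts) != len(actuals) or not forecasts:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(f - y) for f, y in zip(forecasts, actuals)) / len(forecasts)


def rmse(forecasts, actuals):
    """Root mean squared error: square root of the average of (f_i - y_i)^2."""
    if len(forecasts) != len(actuals) or not forecasts:
        raise ValueError("need two non-empty sequences of equal length")
    mse = sum((f - y) ** 2 for f, y in zip(forecasts, actuals)) / len(forecasts)
    return math.sqrt(mse)


# Hypothetical forecasts and realized values, for illustration only.
forecasts = [1.0, 2.0, 3.0]
actuals = [1.0, 4.0, 0.0]
example_mae = mae(forecasts, actuals)
example_rmse = rmse(forecasts, actuals)
```

Because RMSE squares the errors before averaging, it penalizes large individual misses more heavily than MAE, which is why the two measures can rank competing models differently.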
RMSE is quite similar to the mean absolute error as it is simply a square root of the mean squared error, and it is defined as
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (f_i - y_i)^2} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} s_i} \qquad (9)$$
where $f_i$ stands for the predicted value, $y_i$ is the actual value, $n$ is the number of forecasts and $s_i = (f_i - y_i)^2$.
Diebold & Mariano [29] propose a test to compare the predictive accuracy of two competing forecasts. Let $\{\varepsilon^1_t\}_{t_0}^{T}$ and $\{\varepsilon^2_t\}_{t_0}^{T}$ be the sequences of forecast errors of two competing forecasts, with losses measured by a particular loss function $L(\cdot)$ (e.g. absolute error loss as $a_i$ in Eq. 8 or squared error loss as $s_i$ in Eq. 9). The null and alternative hypotheses are then stated as
$$H_0: E[L(\varepsilon^1_t)] = E[L(\varepsilon^2_t)], \qquad H_1: E[L(\varepsilon^1_t)] \neq E[L(\varepsilon^2_t)].$$
The Diebold-Mariano test assesses the accuracy based on the loss differential
$$d_t = L(\varepsilon^1_t) - L(\varepsilon^2_t)$$
and the underlying null hypothesis $H_0: E[d_t] = 0$. The Diebold-Mariano statistic is then
$$DM = \frac{\bar{d}}{\sqrt{\widehat{LRV}_{\bar{d}} / T}}$$
where $\bar{d}$ is the mean loss differential and $\widehat{LRV}_{\bar{d}}$ is a consistent estimate of the asymptotic (long-run) variance of $\sqrt{T}\bar{d}$. Under the null hypothesis, the test statistic converges to a standard normal distribution, so that $DM \sim N(0,1)$ asymptotically.

Figure 1. Unemployment rate in the Visegrad countries. The group of countries is evidently quite heterogeneous in the unemployment rates. The Hungarian rate starts at the lowest level but increases steadily during the whole period. The Czech rate begins at quite low levels and decreases up to the outbreak of the financial crisis, when the rate surges up until 2010, after which it remains quite stable. The Polish and Slovak rates commence at very high levels of unemployment, which go down up until the outbreak of the crisis, after which they change trend similarly to the Czech rate.

Figure 2. Google search queries for the job-related terms in the Visegrad countries. The patterns are again quite heterogeneous, and a connection between the Google searches and the unemployment rates can be observed for the Czech Republic and Hungary. For the other two countries, the connection is not visible to the naked eye. Detailed treatment of the interconnections is given in the Results section of the text.

Table 3. Forecasting and causality summary