Computational Models of Consumer Confidence from Large-Scale Online Attention Data: Crowd-Sourcing Econometrics

Economies are instances of complex socio-technical systems that are shaped by the interactions of large numbers of individuals. The individual behavior and decision-making of consumer agents is determined by complex psychological dynamics that include their own assessment of present and future economic conditions as well as those of others, potentially leading to feedback loops that affect the macroscopic state of the economic system. We propose that the large-scale interactions of a nation's citizens with its online resources can reveal the complex dynamics of their collective psychology, including their assessment of future system states. Here we introduce a behavioral index of Chinese Consumer Confidence (C3I) that computationally relates large-scale online search behavior recorded by Google Trends data to the macroscopic variable of consumer confidence. Our results indicate that such computational indices may reveal the components and complex dynamics of consumer psychology as a collective socio-economic phenomenon, potentially leading to improved and more refined economic forecasting.


Introduction
The growth of most modern economies is driven by consumer spending [8]. Consumer confidence levels can therefore have significant effects on economic growth. Consumer Confidence Indices (CCI) are designed to measure the degree of confidence that consumers have in the state of the economic system. The basis for many CCIs lies in behavioral science, where evidence has accumulated that individual consumer behavior is influenced by a number of emotional and social factors [18,4] that interact with the consumer agents' socio-economic context. In other words, the emotional state of consumers, as well as their assessment of that of other consumers, will shape their subsequent individual consumption patterns [12,25]. In the aggregate, as consumers collectively lose or gain confidence in the state of the economy, this is assumed to affect their collective consumption patterns and thus economic growth, yielding a complex interaction between consumer confidence and economic conditions. This interplay between the complex behavior of individual agents and the emergent properties of their collective behavior is analogous to that seen in many other large-scale socio-technical systems [27].
Accurate, valid, and timely measures of consumer confidence are thus of pivotal importance to policymakers and econometric forecasting. However, as a social and abstract construct, "consumer confidence" is difficult to measure. Researchers have turned to social science methods such as surveys and questionnaires, which are expensive and time-consuming to conduct and are possibly subject to a number of personal, cultural, and social biases, e.g. social conformity bias [20], which will confound measures of consumer confidence with cultural and linguistic propensities to divulge or withhold accurate information concerning one's level of confidence. Such biases also make consumer confidence difficult to compare across different linguistic and cultural regions.
Here we investigate a computational approach that leverages large-scale search engine query volumes to gauge consumer confidence. We start from the assumption that search engine volumes reflect the issues that a population is contemporaneously preoccupied with [24], congruent with recent work in the area of market modeling [7,23]. Hence, consumer confidence may be manifested in the volume of certain web searches such as "taxes", "investment", and "stocks", but not others, e.g. "cloud" and "cat". We focus on China since it provides an interesting case for consumer confidence studies given its unique linguistic and cultural background, and the important role that the consumption patterns of its burgeoning middle class are now playing in the global economy [28].
We obtain Google query volume time series for a number of Chinese characters that are likely to express various facets of Chinese Consumer Confidence given their use in existing surveys of consumer confidence in China. Using a principal component analysis, we isolate the queries that are the main indicators of Chinese consumer confidence [22], and define a Chinese Consumer Confidence Index (C3I) from a linear combination of the respective search volume data. We cross-validate the C3I against existing gauges of consumer confidence, demonstrating its ability to offer an accurate, timely, and informative view on consumer confidence in a region that has been historically underserved with regard to econometric indices. Our results indicate that the C3I yields new information on the nature of Chinese Consumer Confidence. Our work may thus contribute to the science of modeling the social construct of consumer confidence and its socio-economic correlates that shape the emergent properties of economies as large-scale socio-technical systems [27].

Materials and methods
In our investigation we rely on the following data sources:
1. The Chinese CCI and ECQ surveys of consumer confidence for the period under consideration.
2. Google Trends data for a specific number of search queries corresponding to the same time period.
Given the different nature of the two surveys, we use the first (CCI) as an official indicator of Chinese Consumer Confidence and the second (ECQ) to extract consumer confidence topics that can be translated into Google queries.

CCI consumer confidence survey
Consumer Confidence in China is mainly gauged by 2 surveys: the Consumer Confidence Index (CCI) and the Economist's Confidence Questionnaire (ECQ).
The Chinese Consumer Confidence Index (CCI) is reported by the National Bureau of Statistics of China (NBSC) on a monthly basis. Its methodology consists of asking 3,500 individuals (after November 2009) about their confidence in present and future conditions. The survey is a questionnaire of about 5 simple questions, each pertaining to what is assumed to be a specific component of consumer confidence, e.g. "How do you see your current employment conditions?". Subjects' responses are recorded on a 5-point scale. We obtained historical monthly Chinese CCI data from the National Bureau of Statistics of China for the period of January 2006 to June 2013, i.e. 90 months, as shown in Fig. 1. It must be noted that the CCI numbers reported by the NBSC may be affected by some considerations with respect to data normalization and adjustments [5].

ECQ consumer confidence topic extraction
The CCI is designed to be succinct and fast to administer. Hence it consists of short questions designed to be answered in directly evaluative terms, such as "positive" or "negative"; an example question is "How do you see your current employment conditions?". However, we are looking to model the notion of Chinese Consumer Confidence as exhaustively as possible so that we can determine its correlates in online indicators.
The Economist's Confidence Questionnaire (ECQ) contains 31 open questions such as "What do you presently consider the greatest threat to the Chinese economy?", each with a set of possible responses that can range from a few items to more than 15. Given the more open and exhaustive nature of the ECQ, we manually extract the core topics of the ECQ's questions and answers, and the corresponding Chinese characters, to define an initial set of terms that can be reliably transformed into specific Google search queries. The volume of the latter is then taken to indicate the level of online attention with respect to that particular topic.
For example, ECQ Question 13 is "How do you think the dollar value may change in the next 6 months?". We manually extract the Chinese character for "dollar trend", and add it to our initial set of topics that we deem to be indicative of consumer confidence. We then retrieve Google Trends data for each individual topic.
As shown in Tables 12 and 13 (Appendix), we extracted a total of 44 topics from the ECQ questions, ranging from large macro-economic concepts such as "inflation" to more personal notions such as "food price". However, only 34 topics had sufficient Google query volumes to be retained, and these were used as variables in our later analysis.

Google Trends data
The Google Trends service (www.google.com/trends/) is offered by the Google search engine; it allows researchers to retrieve weekly or monthly normalized search volume data for any user-provided search query, provided the query has non-zero search volume. For example, a user can enter the query "good" and Google Trends will return a weekly time series whose values represent the volume of searches for that query recorded by Google in each week. An example of the Google Trends data for the Chinese character "Hao" (en: "good") is shown in Fig. 2. We obtain Google Trends data for the 34 topics that produce non-zero search volumes from January 2006 to June 2013, thereby matching the date range of our CCI data. Since Google Trends data can be weekly and CCI data is released monthly, we convert all weekly Google Trends time series to monthly time series by means of a 4-week moving average. Since some months are longer than 4 weeks, where necessary, we move data points at the end of the month's last week to the next month.
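The weekly-to-monthly conversion can be illustrated with a minimal sketch. This is not the authors' code: the function names, the sample values, and the rule of taking the smoothed value of the last week that starts in each calendar month are assumptions made for illustration, since the exact handling of weeks that straddle two months is only described informally above.

```python
from datetime import date, timedelta

def moving_average(values, window=4):
    """Trailing moving average; early points use the data available so far."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def weekly_to_monthly(week_starts, values, window=4):
    """Map (year, month) to the smoothed value of the last week starting in that month.

    Assumption: each week is assigned to the month of its start date.
    """
    smoothed = moving_average(values, window)
    monthly = {}
    for day, val in zip(week_starts, smoothed):
        monthly[(day.year, day.month)] = val  # later weeks overwrite earlier ones
    return monthly

# Hypothetical weekly search volumes for nine consecutive weeks from 1 January 2006
starts = [date(2006, 1, 1) + timedelta(weeks=k) for k in range(9)]
volumes = [10, 12, 14, 16, 18, 20, 22, 24, 26]
monthly = weekly_to_monthly(starts, volumes)
print(monthly)
```

A plain monthly mean of the weekly values would be a reasonable alternative; the moving-average variant is shown because it is the smoothing named in the text.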
In Fig. 3 we show an overview of our multi-phased methodology which is further explained in subsequent sections.

[Fig. 3: Overview of the multi-phased methodology. Recoverable panel labels: Google Trends; Principal Components.]

Principal Component Analysis of ECQ topic covariances
Each of the 34 Google Trends time series (corresponding to the ECQ questionnaire topics) can be taken as an independent variable representing a certain facet of consumer confidence. However, we need to determine the degree of multicollinearity to investigate whether each variable independently represents consumer confidence, and to ensure the validity of later regression models used to fit a potential C3I based on these 34 independent variables to the CCI. Therefore we perform a principal components analysis (PCA) [17] to study the components that underlie the covariances of our 34 Google Trends time series and to reduce dimensionality. This will also ensure the orthogonality of our components and thus avoid the issue of multicollinearity in future regression models.
We list the 10 highest ranked components with their loadings in Table 1. A KMO test [6] and squared multiple correlation (SMC) test [1] show that the PCA was indeed a suitable procedure. Judging from the scree plot, we retain the first 9 PCA components since they represent the majority of information on the original topic covariances (about 85%), thus retaining most of the information relevant for accurate modeling. The exact cutoff is somewhat arbitrary: not all 9 components need to be included in our Transitional model (Fig. 3) since each carries increasingly less information, and whether we choose 8, 9 or 10 components would be of little significance to that model. Indeed, components 7, 8 and 9 are not included in some of our models below.
We project our topic variables onto the selected 9 components only, i.e. $(C_1, C_2, \ldots, C_9)$: we define $C_i = c_i^{T} X$ for $i = 1, \ldots, 9$, where $X = (x_1, x_2, \ldots, x_{34})^{T}$ refers to our 34 topic time series and $c_i$ refers to the entries of the 9 component vectors as listed in Table 2.
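The projection onto the leading components can be sketched as follows. This is an illustrative example on synthetic data, not the paper's data or code: standardizing the series before the decomposition, the 85% variance target used to pick the number of components, and all names (`pca_project`, `var_target`) are assumptions.

```python
import numpy as np

def pca_project(X, var_target=0.85):
    """X is a (T, p) array: T monthly observations of p topic series.

    Returns the component scores, the loading vectors, and the cumulative
    share of variance explained by the retained components.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize each series
    corr = np.cov(Z, rowvar=False)                      # correlation matrix of X
    eigvals, eigvecs = np.linalg.eigh(corr)             # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]                   # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(explained, var_target) + 1) # smallest k reaching the target
    loadings = eigvecs[:, :k]                           # columns play the role of c_i
    scores = Z @ loadings                               # columns play the role of C_i
    return scores, loadings, explained[:k]

# Synthetic stand-in for 90 months of 34 topic series driven by 5 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(90, 5))
mixing = rng.normal(size=(5, 34))
X = latent @ mixing + 0.3 * rng.normal(size=(90, 34))

scores, loadings, explained = pca_project(X)
print(scores.shape, explained[-1])
```

Because the loading vectors are orthonormal eigenvectors, the resulting score columns are mutually uncorrelated, which is the property the text relies on to avoid multicollinearity in later regressions.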
To avoid spurious regression results [30,15], we must determine whether our time series have cointegrated relationships. By a co-integrated relationship we refer to the possibility of a long-run equilibrium relationship between two trending processes that become stationary after being differenced the same number of times, e.g. both are I(1). Here, I(0) denotes that the time series is stationary, whereas I(d) denotes that the time series becomes stationary after differencing d times. After we extract the first 9 components, we conduct an ADF test [9,13] to check the variables for unit roots, the results of which are shown in Table 3. The results in Table 3 indicate that $C_5$, $C_7$ and $C_8$ are stationary at a 1% significance level. We therefore define transformed variables $A_j$ ($j = 5, 7, 8$) from these three components. Subsequently all 9 variables are integrated of order one (I(1)), ensuring they are stationary after differencing once.
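The idea behind a unit-root check can be illustrated with a simplified sketch. The code below computes the plain (non-augmented) Dickey-Fuller t-statistic with a constant and no lagged differences, which is a reduced form of the ADF test named in the text; the function name and the synthetic series are assumptions. In practice a library routine such as `statsmodels.tsa.stattools.adfuller` would normally be used, and the statistic must be compared against Dickey-Fuller critical values, which depend on sample size and specification.

```python
import numpy as np

def df_tstat(y):
    """t-statistic of rho in the regression  dy_t = a + rho * y_{t-1} + e_t."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])     # constant and lagged level
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - X.shape[1])     # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)               # OLS covariance of beta
    return beta[1] / np.sqrt(cov[1, 1])

# Synthetic I(1) series: a random walk, which is stationary after one difference
rng = np.random.default_rng(42)
walk = np.cumsum(rng.normal(size=300))

stat_walk = df_tstat(walk)            # level series: unit root expected
stat_diff = df_tstat(np.diff(walk))   # differenced series: strongly negative
print(stat_walk, stat_diff)
```

A strongly negative statistic for the differenced series, together with a much weaker one for the level series, is the pattern expected of a series that is integrated of order one.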

Modeling and Computing
After determining the principal components of our Google Trends time series data, i.e. the components that best describe consumer confidence as indicated by Google query volume with respect to our 34 survey topics, we perform a Vector Auto-regression (VAR) [26] to determine the degree of auto-correlation in our CCI data. As shown in Table 4, we find a considerable degree of auto-correlation, indicating the necessity to include CCI at lag 1 as an independent variable in future analysis. This finding is intuitive, since consumers factor previous confidence into their assessment of future conditions alongside other present information.
We conduct a Granger causality test [14] between our independent variables, $C_i$ ($i = 1, 2, 3, 4, 6, 9$) and $A_j$ ($j = 5, 7, 8$), and one dependent variable, namely $CCI_t$. The results indicate that the independent variables $C_1$, $C_3$ and $A_5$ Granger-cause the CCI. Since results in behavioral science [2] indicate that people tend to discount older information in favor of newer information, we choose variables that were lagged by one unit.
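A lag-1 Granger causality check can be sketched as an F-test comparing two nested regressions. The example below is a minimal illustration on synthetic data under stated assumptions: a single lag, an intercept in both regressions, and hypothetical names (`granger_f_lag1`, `x`, `y`); it does not reproduce the paper's test settings and reports only the F-statistic, not a p-value.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def granger_f_lag1(y, x):
    """F-statistic for H0: x_{t-1} adds nothing to a regression of y_t on y_{t-1}."""
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    target, y_lag, x_lag = y[1:], y[:-1], x[:-1]
    ones = np.ones(len(target))
    restricted = np.column_stack([ones, y_lag])            # own lag only
    unrestricted = np.column_stack([ones, y_lag, x_lag])   # own lag plus lagged x
    rss_r, rss_u = rss(restricted, target), rss(unrestricted, target)
    dof = len(target) - unrestricted.shape[1]
    return (rss_r - rss_u) / (rss_u / dof)                 # one restriction tested

# Synthetic data in which lagged x drives y, but y does not drive x
rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.3 * rng.normal()

f_x_to_y = granger_f_lag1(y, x)   # expected to be large
f_y_to_x = granger_f_lag1(x, y)   # expected to be small
print(f_x_to_y, f_y_to_x)
```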
The normalization of CCI data in reference to 1996 data [5] ended in November 2009, leading to an apparent discontinuity in the CCI data in 2009-2010 as shown in Fig. 1.
To ensure our CCI data is not biased by structural changes, but merely a difference in normalization, we conduct a Structural Change test [21]. The results are summarized in Table 6; the null hypothesis that no structural change occurred must be rejected. In other words, the results indicate that a structural change is likely to have occurred in November 2009. However, it is unlikely that this change was the result of systemic changes in how consumers evaluate and express their confidence. It is more likely that the discontinuity results from changes in time series normalization. Hence, we do not add a dummy variable in our models to the CCI data itself but account for it elsewhere in our model. As indicated in Table 6, all three tests imply there is a structural change in the time series, which may have resulted from the NBSC standardization in November 2009. We therefore add a dummy variable $D$ to all the independent variables of our model, with the exception of $CCI_{t-1}$, where the first time period comprises 47 months and the second time period comprises 42 months.
Then, our transitional model (i.e. Model 2 in Fig. 3) can be written as follows: where t 0 = 47.
We then proceed with a Stepwise Regression [11] as follows:

Repeat step 3 and step 4, until all variables pass the test.
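Because the individual steps of the procedure are not reproduced here, the following is only a generic sketch of one common stepwise scheme, backward elimination, and should not be read as the authors' exact procedure. The threshold `t_min = 2.0`, the function names, and the synthetic data are all assumptions for illustration.

```python
import numpy as np

def ols_tstats(X, y):
    """OLS coefficients and their t-statistics."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, beta / se

def backward_eliminate(X, y, names, t_min=2.0):
    """Drop the regressor with the smallest |t| until every remaining |t| >= t_min.

    The intercept is always kept and is not subject to elimination.
    """
    keep = list(range(X.shape[1]))
    while keep:
        design = np.column_stack([np.ones(len(y)), X[:, keep]])
        _, t = ols_tstats(design, y)
        t_vars = np.abs(t[1:])              # skip the intercept
        worst = int(np.argmin(t_vars))
        if t_vars[worst] >= t_min:
            break                           # all remaining variables pass
        keep.pop(worst)                     # remove the weakest variable and refit
    return [names[i] for i in keep]

# Synthetic example: only x0 and x3 actually drive y
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 5))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.5 * rng.normal(size=200)
names = ["x0", "x1", "x2", "x3", "x4"]

selected = backward_eliminate(X, y, names)
print(selected)
```

Irrelevant variables can occasionally survive such a procedure by chance, which is one reason stepwise selection is usually combined with diagnostic tests such as those reported next.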
As shown in Table 7, the model exhibits a good fit with a large $R^2$ and a small square root of residuals (SS). We conduct White and Bartlett tests [3,29] to determine whether the regression exhibits heteroscedasticity and auto-correlation. As indicated by the results shown in Table 8 and Fig. 4, this is not the case.
We conduct an ADF test on the residual error [10] to determine whether the regression is co-integrating. Table 9 indicates that the residual series has no unit root, which implies that there is a co-integrated relationship in the regression, supporting the validity of our regression analysis.
Using the regression results we can model C3I as shown in Eq. 3.
Table 9 (ADF test on the regression residual): Z(t) = −8.321; 1% critical value = −3.527; p = 0.0000.
This fitted equation preserves the major components of the PCA ($C_1$ to $C_4$) to avoid significant information loss. We can formulate our final fitted model using the original indices as shown in Eq. 4:
$$\mathrm{C3I}_t = \begin{cases} 46.555 + 0.498\, CCI_{t-1} + X_t A, & t \le 47 \\ 57.067 + 0.498\, CCI_{t-1} + X_t B + X_{t-1} C, & t > 47 \end{cases} \tag{4}$$
where $X^{T} = (x_1, x_2, \ldots, x_{34})$, and the entries of $A^{T}$, $B^{T}$, and $C^{T}$ are provided in subsequent tables and the appendix.
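To make the piecewise structure of Eq. 4 concrete, the sketch below evaluates it for a given month. The intercepts (46.555 and 57.067), the lag coefficient (0.498), and the break point (47) are taken from the text; the coefficient vectors `A`, `B`, and `C` are placeholders filled with zeros, since their actual entries are given in the paper's tables and are not reproduced here. The function name and argument layout are assumptions.

```python
import numpy as np

T0 = 47          # last month of the first regime (from the text)
DELTA = 0.498    # coefficient on lagged CCI (from the text)

def c3i(t, cci_prev, x_t, x_prev, A, B, C):
    """Evaluate the piecewise fitted model at month index t (1-based).

    x_t and x_prev are the 34 topic values at months t and t-1.
    """
    if t <= T0:
        return 46.555 + DELTA * cci_prev + x_t @ A
    return 57.067 + DELTA * cci_prev + x_t @ B + x_prev @ C

# Placeholder coefficient vectors (hypothetical; the real entries are in Tables 10 and 11)
A = np.zeros(34)
B = np.zeros(34)
C = np.zeros(34)
x_now = np.ones(34)
x_before = np.ones(34)

before_break = c3i(10, 100.0, x_now, x_before, A, B, C)
after_break = c3i(50, 100.0, x_now, x_before, A, B, C)
print(before_break, after_break)
```

With all topic coefficients set to zero, the two regimes differ only by the intercept shift, which isolates the level change attributed to the November 2009 discontinuity.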
The $A^{T}$, $B^{T}$, and $C^{T}$ matrices reveal significant changes in the structure of the C3I over time. In Tables 10 and 11, we show the topics that contribute positively and negatively to our estimation of the C3I. In particular, we see that before December 2009 positive topics include "stocks", "CPI", and topics related to "trade". Negative topics notably include "prices", e.g. "housing", "fuel", "food", "over capacity", and concerns about "economic transition". Examining Table 11, we find that these negative topics are now positive influences on C3I. In fact, the top ranked positively contributing topics are now "over capacity", "real estate", and "housing prices". We also note that the negatively contributing topics continue to include "exchange rates" and "foreign currency".
We compare C3I values predicted by our model to the actual CCI values in Fig. 5.

Conclusions
We model Chinese Consumer Confidence by analyzing the relationship between Chinese CCI data and Google Trends time series for queries derived from the ECQ questionnaire. We subjected our Google Trends data to a PCA to reduce its dimensionality and avoid variable multicollinearity. We show that the model exhibits no significant auto-correlation or heteroscedasticity using Bartlett and White tests [3,29]. Our model, which includes Google Trends data as well as lagged CCI data, approximates the official CCI values well (R-squared of 0.9203), and furthermore produces a good prediction of new C3I values obtained after the end of the original data period.
The resulting model allows us to draw a number of noteworthy conclusions.
First, our finding indicates that the results of expensive and time-consuming Consumer Confidence surveys might be approximated and potentially extended by more economical and time-efficient methods that leverage online behavioral indicators. This, however, requires careful consideration of which online indicators are most relevant to the assessment of consumer confidence. Here, we focused on Google Trends data obtained for a narrow set of query terms (carefully derived from the official Economist's Confidence Questionnaire) to ensure validity and to avoid the introduction of noise or spurious correlations. Rather than merely approximating official CCI data, the use of Google Trends data might in fact enhance the assessment of consumer confidence by avoiding structural measurement changes such as those that may have caused the discontinuity observed in the official CCI data in November 2009, leading to elevated CCI values thereafter. We were forced to introduce a dummy variable to account for this discontinuity.
Second, we observe that the C3I data is shaped by a number of inherent factors that may be fundamentally related to how people assess future economic conditions. As shown in Eq. 3, we found $\alpha = 46.555$ for $t \le 47$ and $\alpha + D_t \gamma_0 = 57.067$ for $t > 47$, as well as a $\delta$ value of 0.498 with respect to the lag-1 CCI data, denoted $CCI_{t-1}$. This result indicates that the C3I is partially shaped by its own previous values. We speculate that people may extrapolate their present confidence to an assessment of future economic confidence, as well as relying on other relevant information.
Third, as shown in Fig. 6, our Google Trends data indicates a consistent downtrend in consumer confidence from 2007 to the present which is not mirrored by official CCI data. However, Google Trends data presumably provides only a partial indicator of the factors that shape consumer confidence. We therefore cannot conclude that our Google Trends model indicates an actual downtrend in consumer confidence. It does point to an interesting divergence between two different but related measures of consumer confidence. We also note that after the observed discontinuity, CCI does exhibit a slight downward trend.
Fourth, examining the topics that contribute positively or negatively to our estimation of C3I reveals a number of interesting changes over time. The first part of Eq. 4, i.e. $t \le 47$, corresponds to the period before December 2009. Matrix A, shown in Table 10, can be split into 2 categories of topics, namely those that contribute positively to C3I and those that contribute negatively, according to their coefficients. Note that the topics themselves do not contribute to C3I. Rather, the attention they receive in the population, measured by Google Trends volume, is used as an indicator of the population's preoccupation with the topic in relation to the C3I. The topics in Table 10 thus reveal the internal topical structure of this particular behavioral measurement of consumer confidence, and which topics contribute negatively or positively to our estimation of C3I. As shown in Tables 10 and 11, a number of topics contributing positively to our estimation of C3I change polarity after November 2009. This change may indicate that the population changed its assessment of these topics, leading to a different contribution to their consumer confidence, or potentially a change in how the CCI is measured.
For example, when a large number of individuals search for "over capacity" this might occur because of the perception of over capacity as a negative issue, while some years later, people might search for the same topic from the position that over capacity is improving, hence making a positive contribution to their consumer confidence.
Fifth, consumer confidence can generally be shaped by 4 distinct considerations, namely whether one is confident or unconfident with respect to present or future conditions: one can be confident about the present and future, confident about the present but unconfident about the future, unconfident about the present but confident about the future, and, finally, unconfident about both present and future. The topic information in Table 11 can express these 4 types of conditions. The table is split vertically into 4 parts, with the polarity of topics in each part alternating between positive and negative for t and t + 1. The top part of the table lists topics that have positive polarity for both t and t + 1, i.e. these topics correspond to consumer confidence that is positive with respect to the present and future (period t and t + 1). Next we list topics that are positive with respect to time t but negative with respect to t + 1, etc. An examination of Table 11 may thus reveal the degree to which certain topics contribute positively or negatively to confidence, or lack thereof, with respect to present and future conditions. We furthermore observe that the magnitude of coefficients in the t column of Table 11 is generally significantly higher than that of the t + 1 column, indicating that our topics are most suited to gauging consumer confidence about present conditions, rather than future conditions. Since the CCI is designed to measure people's confidence about current economic situations, respondents will most likely express their positive or negative feelings with respect to the present. Since the polarity of topics might change over time, this might reduce their influence as indicators of future conditions.
In spite of the promise of this novel approach of measuring Chinese consumer confidence from search query volume, we must note a number of shortcomings that should be addressed in future work. First, our terms and therefore topics are manually derived from the official Chinese ECQ survey. However, our choice of terms and topics might not fully capture the essence of how the CCI as a survey measures consumer confidence, since the latter consists of a different set of questions that respondents are required to answer in their entirety, not as a bag of words. Future work may focus on a more complete, principled, and thorough translation of the construction of the CCI to a set of search query terms, possibly by the use of n-grams to capture more of the underlying semantics of consumer confidence.
Second, we assumed that the CCI serves as a ground truth for measuring consumer confidence, which is reasonable given that it was deliberately designed to do so by eliciting explicit responses. However, as a result any biases or deficiencies of the CCI will impact the validity of our own model. Furthermore, our model was optimized to match the outcomes of the CCI, which may or may not in all cases accurately reflect true consumer confidence. In future work, we may include alternative gauges of consumer confidence to arrive at a more reliable and comprehensive assessment of consumer confidence against which to condition our model.
Third, we relied on a limited, pre-defined set of topics, which may or may not provide exhaustive coverage of the notion of consumer confidence. More work should be conducted on careful selection of indices and consequently model building.
Fourth, our reliance on Google Trends data introduces a number of issues. In particular, if a particular aspect of consumer confidence cannot be gauged from search engine volume, our method will not capture it. As suggested by [19], the validity and accuracy of our model could be improved by the inclusion of other related indicators of consumer behavior such as social media feeds, blog volume, newspaper data, etc., which would allow us to reduce overfitting and noise through "triangulation" or "cross-validation". A related source of concern is that Google renormalizes its Trends data continuously and may change the algorithms by which they are produced, leading to difficulties in assessing differences and significance of absolute values over time. Although variations seem to be minimal, this may render our results more difficult to replicate or reproduce.
Lastly, the accuracy of our model depends on the variables that we have chosen. We attempted to make reasonable choices with high face validity and to avoid extraneous variables not related to consumer confidence, but we cannot make any definitive claims with respect to their appropriateness or completeness. Furthermore, our use of Principal Component Analysis may lead to the inclusion of topic variables that have no bearing on CCI but can still exert a strong influence over the construction of our model. In future research, we intend to define more efficient methods for the automated selection of variables. We caution again that similar methods have been shown to be subject to systematic challenges [19], which we have painstakingly sought to avoid.
In spite of the deficiencies of our present approach, we have demonstrated the feasibility of modeling large-scale socio-economic phenomena such as consumer confidence from behavioral online data, i.e. Google search queries, opening new possibilities for more exhaustive, accurate, and finer-grained models of complex dynamic socio-technical systems such as a nation's economy, which is shaped by the interactions of a large number of autonomous agents that respond to their individual conditions as well as those of others, including global systemic information such as financial news, economic growth forecasts, GDP numbers, and inflation numbers.

Appendix

Topic variables extracted from the ECQ (variable, topic):
Over Capacity
x_24 Real Estate Sales
x_3 Trade Balance
x_25 Housing Price
x_4 Economic Performance
x_26 Deposit Reserve Rate
x_5 Private Investment
x_27 Foreign Exchange
x_6 Income Gap
Employment Situation
x_29 Crude Oil Price
x_8 Employment
x_30 Fixed Investment
x_9 Real Economy
x_31 PPI
x_10 Population Ageing
x_32 GDP Growth Rate
x_11 Small and Medium-sized Enterprise Management
International Trade (Import and Export)
x_36 Exchange Rate of Japanese Yen Against US Dollar
x_15 Interest Rate for Loan
x_37 Real Estate Adjust
x_16 Stocks
x_38 Real Estate Development
x_17 US Economy
x_39 Debt Risk
x_18 Dollar Trend
x_40 Macro-economy
x_19 Economy Transition
x_41 Foreign Investment
x_20 Food Price
x_42 Administrative Expenditure
x_21 Tax
x_43 Investment Scale
x_22 Exchange Rate
x_44 Labor Force

ECQ questions:
What is your judgment on the following aspects of China's economic operation? A) Macro-economy B) Demand C) Consumption
Q2 What do you think the current situation of China's economy?
Q3 What do you think the next six months of imports and exports growth will become?
Q4 What do you feel the next six months, China's foreign trade balance will be?
Q5 You expect 2013 annual GDP growth rate will be:
Q6 What do you consider the next six months of CPI will be?
Q7 What do you think the next six months of PPI will be?
Q8 What do you think of the international crude oil and food prices over the next six months will be?
Q9 What the current liquidity situation of the real economy is in your eyes?
Q10 What do you think the next six months deposit reserve rate should be?
Q11 What do you feel the next six months interest rate for loan should become?