Individual investment decision behaviors based on demographic characteristics: Case from China

Predicting and analyzing the behaviors of investors is of great value to financial institutions. This paper uses survey data from about 9,000 individual investors across China to explore the predictability of decision behaviors by studying demographic characteristics that are relatively easy to obtain. After applying Pearson’s chi-squared test, the Spearman rank correlation test, and several data mining methods, we verified that demographic characteristics are closely linked to decision behaviors, and that using them to build initial behavioral prediction models would be an economical and feasible solution for financial organizations, especially when investors’ behavioral data are insufficient.


Introduction
In the early 1990s, China's stock exchanges (one in Shanghai and another in Shenzhen) were officially established as an experiment in market economy reform. After more than 20 years of rapid development, China's financial market system is becoming increasingly mature. In the meantime, the vitality of financial investors has been enhanced greatly [1]. Financial institutions at home and abroad are aware that China will be a vast market and that competition will be vigorous. Facing these challenges, many companies are eager to gain insights into the Chinese financial market and its investors. Unlike most developed financial markets, where institutional investors are the majority, individuals account for nearly 80% of all investors in China [2]. By 2017, data from the China Clearance Center showed that the number of individual investors in the A-share stock market amounted to over 133 million, and one out of every ten people in China invests in the stock market.
Precision marketing and personalized service have been hot topics and key strategies for firms seeking a competitive advantage in the era of Big Data. In response to such a huge and important group of investors, an interesting question is, "what kind of investment behavior preferences do individual investors have in China's financial market?" The answer to this question can directly influence the strategies of service providers. For example, based on clients' preference for and tolerance of risk, a brokerage firm can target them with a risky product. Accordingly, gaining insights into investors' behavior preferences becomes a necessary measure for companies to develop new customers, reduce management costs, provide personalized services, and obtain a competitive advantage [3]. However, investors' behavior preferences are hard to observe and measure directly due to their dynamics, ambiguity, heterogeneity and uncertainty [4]. It is difficult even for institutions to obtain such information from individual investors due to legal restrictions. By contrast, some personal characteristics of investors, such as demographic characteristics, are much easier to access and measure legally than behavioral information. Using such easy-to-obtain personal characteristics as predictors in models that evaluate investors' behavior preferences may be a solution to this problem.
PLOS ONE | https://doi.org/10.1371/journal.pone.0201916 August 9, 2018
This paper aims to analyze how well investment behaviors can be predicted from some demographic characteristics, and to verify the feasibility and effectiveness of building behavior prediction models based on these characteristics. To this end, we collected survey data from more than 20,000 Chinese individual investors and used several methods to analyze predictability. The paper proceeds as follows: first, it reviews the literature relevant to financial behavior preferences and personal characteristics; second, the variables and data are described; third, we present the methodology used to analyze predictive capability and build a prediction model, followed by the results; fourth, to illustrate the validity of the model, an assumed application case is provided; finally, conclusions are drawn.

Literature reviews
Traditional finance theories such as the "Efficient Market Theory" and "Modern Portfolio Theory" hold that investors' behaviors are rational and logical, and that all activities are reflections of economic information [5]. Scholars like Kahneman and Tversky [6] established and developed behavioral finance theories. According to Pompian [7], these theories can be divided into two types, Behavioral Finance Macro and Behavioral Finance Micro. The former usually studies institutional investors, and the latter mainly concerns individual investors. The original intention of Behavioral Finance theories was to explain stock market anomalies, market bubbles and crashes in order to increase the efficiency of financial markets by applying psychology and other social science theories [8]. In the course of this research, scholars observed personal characteristics and behaviors from the perspective of investment effects. A wide range of personal characteristics of investors were involved, such as personality, genetic characteristics, education, social position, economic capability, experience, emotion and cognition. For example, Pompian and Longo [9] investigated 100 investors using the Myers-Briggs personality test and a questionnaire, and found striking differences among individual investors with different preferences, including preferences for investment types, choices of information channels and trading behaviors. Wen [10] built a D-GARCH-M model to examine the relation between investors' risk preference and stock market returns, and found that investors become risk averse when they gain and risk seeking when they lose, and that the extent of risk aversion in gains differs from that of risk seeking in losses.
Clark-Murphy and Soutar [11] applied cluster analysis and discriminant analysis in their study, divided the samples into four categories according to the different attitudes and decision-making behaviors of individual investors, and found that individual investors in each category have different investment preferences and target selections. Hira and Loibl [12] paid special attention to differences in investment behavior caused by gender; through telephone interviews with a national random sample, they found that gender had an impact on the sources of investment information and on the level of risk-taking. Barnea et al. [13] used twin investors' investment records (which were very difficult to obtain) to examine the linkages among individual investors' characteristics, market participation habits and capital investment distribution behaviors with Pearson's correlation test. The authors found that one third of investment behavior differences can be explained by individual genetic characteristics. Using factor analysis and regression analysis, Kabra et al. [14] found that people of different ages and genders have varying risk tolerance levels in decision-making processes. Frydman et al. [15] conducted a study using functional magnetic resonance imaging to test a "realization utility" explanation for investors' behaviors.
As this paper concerns the Chinese financial market, some related literature on that market is reviewed as follows. At an early stage of the stock market, Xinghui and Xiaohong [16] conducted an investigation in Shanghai and found that personal characteristics, abilities, and social and economic environments could all influence stock investment performance. Bojin [17] conducted a questionnaire survey of the individual and institutional investors of 126 sales departments in Jiangsu Province, aimed at understanding the factors that affect their investment behaviors, including their composition, psychological qualities and investment techniques, as well as politics, the economy, policies, information, etc. Through interviews and questionnaires, Lei [18] found that individual investors who are able to master market information effectively and who have an advantage over others in investment knowledge are more likely to profit. Some scholars studied excessive trading and concluded that such behavior is common among individual investors [19,20]. Others undertook research on the personal characteristics of individual investors, arguing that Chinese individual investors not only exhibit cognitive behavioral deviations in the general sense, but also have localized deviations [21].
With the establishment of behavioral finance theories, studies on investors' personal characteristics and behavior preferences have drawn the attention of scholars from both developed and emerging economies. The literature mentioned above provides a solid foundation for this research. However, there are few studies from the viewpoint of big data applications such as precision marketing and personalized service. The following conclusions can be drawn from the existing literature: (1) most of the studies focused on the effects of investment and examined the influence of investors' personal and behavior characteristics as explanatory variables of models; (2) many personal characteristics were analyzed in terms of psychological cognition and character traits, which requires professional and complex psychological tests and tracking surveys, so such studies are constrained to relatively small sample sizes; (3) most studies merely applied classical linear regression models, seldom using data mining models, which can potentially reveal nonlinear, discontinuous and probabilistic relationships between variables. As a result, we inferred that it is necessary to use a larger sample and more accurate methods to study the predictability of investor decision-making behaviors from the perspective of data applications.

Demographic characteristics variables
As mentioned in the literature review, the personal characteristics and behaviors of investors are extensive. However, many of them are hard to obtain, which prevents them from being used as predictor variables in practical business intelligence projects. For example, many business intelligence projects face the "cold start" problem at their starting stage; this is a well-known problem in recommender systems, where there is not enough historical data to analyze users' preferences at the beginning. In the same way, there is insufficient data to build precise models. In this paper, a solution is provided by focusing on some demographic characteristics which are comparatively easy to acquire. Following the research in [22][23][24][25][26], this paper focuses on the following demographic characteristics: gender, age, occupation, years of education, financial knowledge level, investment experience and income. Here we call them DC (Demographic Characteristic) variables. These characteristics are not only accessible in daily life, but can also be measured and described easily. Moreover, they are stable within a certain period of time. As input variables of models, these features are of great importance for practical applications.

Investment behavior variables
Although investors may not deliberately pay attention to and structure their own behaviors, according to decision-making theories they naturally or half unconsciously follow a process composed of four stages: preparation, decision making, execution and feedback. The main tasks in the preparation stage include evaluating one's own ability, determining investment goals and searching for information; in the decision-making stage, the most important tasks are choosing investment directions and products as well as determining investment scale and allocation proportions; the execution stage includes determining trading time and specific trading operations; and the feedback stage is to evaluate and rethink the previous decisions. Based on this viewpoint and referring to other available studies [27][28][29][30][31][32][33], this paper investigates the following specific investment decision behaviors: investment scales, investment instruments, transaction frequencies, decision-making styles and investment information channels. All of these are major behaviors of investors in different decision stages and have potential value for financial marketing and service. Here we call them IB (Investment Behavior) variables.

Survey sample overview
A questionnaire was designed in accordance with the DC and IB variables. Some questions for measuring the validity and consistency of the questionnaire were also included. The questionnaire could be completed within about 15 minutes. According to the Statistical Report of Development Status of China Internet Network released by the China Internet Network Information Centre (CNNIC), in December 2013, the number of Chinese cyber citizens had reached 618 million. Nowadays, the vast majority of investment transactions are handled online, hence most financial investors are cyber citizens too. Therefore, we hired a professional online survey company (https://www.wenjuan.com/) to issue the questionnaires. This process consisted of two stages. The first stage started in December 2013 and ended in February 2014. The second stage lasted from November 2014 to December 2014. Altogether, around 22,000 questionnaires were collected. During data pretreatment, we removed questionnaires that came from duplicate IP addresses or that lacked validity or consistency. Questionnaires with abnormal answer times or unanswered key questions were also rejected as invalid. Finally, we used 8,489 questionnaires as experimental data. The survey data were collected anonymously and can be found at the following URL: https://github.com/WennieX2017/IID-Behavior-Prediction. Table 1 describes the geographical distribution of the survey samples. The table shows that the samples come mainly from six developed provinces or municipalities, namely Guangdong, Shanghai, Beijing, Shandong, Jiangsu and Zhejiang, which together account for 55.92%. To some extent, the distribution also reflects the current economic geography of China and supports the geographical representativeness of the samples. A comparison with other information sources indicates that the survey data are consistent in several respects with information released by China's official agencies. Consequently, the data can be regarded as a representative sample of individual investors in China. In addition, some data preprocessing steps such as binning and reclassifying were carried out. Table 2 displays the values of each variable used in subsequent analyses.

Methodology
In order to ensure the accuracy of the results and considering the nonlinear, discontinuous and uncertain relationship among variables, Pearson's chi-squared test, Spearman rank correlation test and several data mining methods are applied in subsequent analyses.
1. Pearson's chi-squared test ($\chi^2$) is a statistical test suitable for unpaired data from large samples [34]. Its null hypothesis states that the frequency distribution of certain events observed in a sample is consistent with a particular theoretical distribution. The value of the test statistic is:

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

where $\chi^2$ = Pearson's cumulative test statistic, which asymptotically approaches a $\chi^2$ distribution; $O_i$ = the number of observations of type $i$; $N$ = total number of observations; $E_i = N \times p_i$ = the expected (theoretical) frequency of type $i$, asserted by the null hypothesis that the fraction of type $i$ in the population is $p_i$; $n$ = number of categories. The chi-squared statistic can then be used to calculate a p-value by comparing the value of the statistic to a chi-squared distribution. If the test statistic exceeds the critical value, the null hypothesis is rejected.
2. Since variables such as age, education and transaction frequency are ordinal variables, the Spearman rank correlation method is applied; this is a nonparametric (distribution-free) rank statistical analysis tool proposed by Charles Spearman. It assesses how well an arbitrary monotonic function can describe a relationship between two variables, without making any assumptions about the frequency distribution of the variables. It does not require the assumption that the relationship between the variables is linear, nor does it require the variables to be measured on interval scales; it is very suitable for variables measured at the ordinal level. The Spearman correlation coefficient can be computed using the following formula:

$$\rho = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y}$$

where $\operatorname{cov}(x, y)$ is the covariance of the rank variables $x$ and $y$, and $\sigma_x$, $\sigma_y$ are the standard deviations of the rank variables. We analyze the correlation of each pair of ordinal DC and IB variables at the two stages of data collection. If the correlation coefficient $\rho$ is significantly different from 0, the DC variable carries information that helps predict the IB variable.
3. Pearson's chi-squared test and Spearman rank correlation analysis (described above) each investigate the correlation between only one DC variable and one IB variable at a time. However, investors' behavior is jointly affected by multiple factors (variables). In order to test the predictability of each IB variable based on combinations of different DC variables, data mining is applied. Generally, data mining is based on inductive statistics and is a data-driven method, which is especially suitable for finding hidden, complex nonlinear patterns in data [35]. Supervised classification is the most important data mining technology, whose framework is shown in the accompanying figure. For the principles of the data mining algorithms used, please see reference [36]. Here, the data collected in the first stage are used as a training set, and those from the second stage are used as a test (prediction) set. In this way we can not only assess the predictive effect, but also understand the stability of the models across different periods.
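The two tests described in items 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the study: the function names and the toy data are hypothetical, and the chi-squared statistic is computed in its contingency-table form (expected count = row total × column total / N), which is the usual way to test independence between two categorical variables such as one DC and one IB variable. Only the statistic and its degrees of freedom are returned; obtaining a p-value would additionally require a chi-squared distribution function, for example from a statistics library.

```python
from collections import Counter
from math import sqrt


def chi_squared_statistic(xs, ys):
    """Pearson chi-squared statistic for independence of two categorical variables."""
    n = len(xs)
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    observed = Counter(zip(xs, ys))
    stat = 0.0
    for row, row_n in row_totals.items():
        for col, col_n in col_totals.items():
            expected = row_n * col_n / n  # E = N * p under independence
            stat += (observed[(row, col)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman coefficient: covariance of the ranks over the product of their std devs."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    # The 1/n factors of the covariance and standard deviations cancel out.
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical toy data: coded gender vs. trading frequency, and age band vs. scale band.
gender = ["m", "m", "m", "f", "f", "f", "m", "f"]
frequency = ["high", "high", "low", "low", "low", "high", "high", "low"]
print(chi_squared_statistic(gender, frequency))

age_band = [1, 2, 2, 3, 4, 4, 5, 5]
scale_band = [1, 1, 2, 2, 3, 4, 3, 4]
print(spearman_rho(age_band, scale_band))
```

Ties are given average ranks because ordinal survey codes repeat heavily; with that choice the rank-covariance form above is the same formula as the one stated in item 2.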
For classification tasks, Precision, Recall, F-measure and Accuracy are widely used to evaluate the performance of models. To understand them, let us consider a two-class prediction problem, in which the outcomes are labeled either as positive (p) or negative (n). There are four possible outcomes from the classifier. If the outcome is p and the actual value is also p, then it is called a true positive (tp); however, if the actual value is n then it is a false positive (fp). Conversely, a true negative (tn) means both the prediction outcome and the actual value are n, and a false negative (fn) means the prediction outcome is n but the actual value is p (see Table 3).
Accuracy (Accu) is the ratio between the number of correctly predicted samples and the total number of samples, which is defined as:

$$\mathrm{Accu} = \frac{tp + tn}{tp + tn + fp + fn}$$

Precision (P) is the ratio between the number of true positives and the total number of positives predicted by the classifier, which is defined as:

$$P = \frac{tp}{tp + fp}$$

Recall (R) is the ratio between the number of detected positives and the total number of positives that occurred during a classification, defined as:

$$R = \frac{tp}{tp + fn}$$

The F-measure combines precision and recall and is defined as their harmonic mean:

$$F = \frac{2 \cdot P \cdot R}{P + R}$$
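As a small worked illustration of these four definitions, the following Python sketch computes them from the counts tp, fp, tn and fn. The function name, the zero-denominator convention and the example counts are assumptions made for illustration only; they are not taken from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall, f_measure) from confusion-matrix counts."""
    def ratio(num, den):
        # Convention assumed here: a ratio with an empty denominator is reported as 0.0.
        return num / den if den else 0.0

    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    # Harmonic mean of precision and recall.
    f_measure = ratio(2 * precision * recall, precision + recall)
    return accuracy, precision, recall, f_measure


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
accu, p, r, f = classification_metrics(tp=40, fp=10, tn=40, fn=10)
print(accu, p, r, f)
```

For a multi-class IB variable, the same function can be applied once per class by treating that class as positive and all other classes as negative, which gives per-class values of R, P and F.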
All of the above indicators take values between 0 and 1; the higher the value, the better the model's predictive performance.

Results of Pearson's chi-squared test
Table 4 presents the results of Pearson's chi-squared test for each pair of DC-IB variables in the two stages of data collection. According to the results, most IB variables are significantly associated with DC variables at the 0.01 level, which means that the DC variables contain information to predict the IB variables. For example, the results show that gender is significantly related to investment frequency, which is consistent with the previous research finding that "men were more likely than women to adjust their investments" [12]. Likewise, the significant association between gender and investment scale matches Barber and Odean's finding that "women hold slightly, but not dramatically smaller, common stock portfolios" [37]. Moreover, the chi-squared values of each DC-IB variable pair at the two stages are very similar, which indicates that the relationships between them are stable.

Results of rank correlation analysis
As shown in Table 5, most Spearman rank correlation coefficients are significant at the 1% level, which indicates that these ordinal DC variables are generally significantly associated with the IB variables. Specifically, age, education, knowledge, experience, and income are all positively correlated with investment scale, which is consistent with the general perception that as investors' experience, knowledge, age and income grow, they often have better investment capability and awareness. In addition, the significant positive relationship between transaction frequency and experience is consistent with the research conclusion that "experienced investors were generally over-confident, thus leading to frequent trading" [21]. Decision-making style and age are negatively correlated, reflecting that older investors are more inclined to have a cautious decision-making style, which is consistent with the intuition that older people are more conservative. Overall, across the two stages, the correlation coefficients of most pairs of ordinal variables are very close and have the same sign (positive or negative), which indicates that these relationships are stable.

Results of data mining models
We created our predictive models using IBM SPSS Modeler Version 18.0. SPSS Modeler is predictive analytics software that provides a range of advanced algorithms and techniques for data analysis, decision management and optimization. Table 6 reports the predictive performance of the models on the test (stage 2) data set. The accuracy of predicting investment behavior from demographic characteristics reaches a good level (well above 0.5) across the different classification methods. At the same time, most of the values of R, P and F are also over 0.5, which indicates that these models are strong predictors. Even where certain values of R, P and F are not high, the corresponding model still possesses utility and value: for instance, when the C&R model is applied to the prediction of investment style, the R, P and F values are 0.49, 0.36 and 0.42 when style = 1, whereas when style = 2 they are 0.70, 0.80 and 0.74. This means that although the model is not suitable for predicting style = 1, it is very good at predicting style = 2. Hence the model is still valuable for finding clients with style = 2. Comparing the results of the models across IB variables, the best-performing models are those for transaction frequency, where the predictive accuracy of all six classifiers reaches around 0.7, followed by investment instruments and investment scales. Therefore, demographic characteristics possess strong predictive capability for investors' behaviors.
Comparing the performance of the six classifiers, the C&R method appears to give the best results on almost every IB variable; thus, it can be recommended as a method for building a predictive model. Moreover, the C&R method has some unique advantages. For instance, it can output the degree of importance of predictor variables to the goal variable, and produce decision rules of the form "if . . . then . . .". Table 7 shows the importance of DC variables to each IB variable in the C&R model. The degree of importance represents each DC variable's relative contribution to the prediction of the goal variable (shown in braces).
Based on the results, it is evident that there are strong connections between all the DC variables and IB variables. In addition, the importance of variables provides more detailed information. For example, the investment scales of individual investors have the strongest link with their investment experience, income and occupation. Investment style is mainly influenced by knowledge, income, age and educational background. Investment instrument and trade frequency are dominated by experience and income, and financial knowledge and investment experience are the major factors for information channel. These results are in agreement with practical experience and observations. From these results, we can conclude that experience and income are the most important factors for nearly all behaviors, whereas the importance of knowledge, occupation and age differs significantly across behaviors. Table 8 illustrates several decision rules with high confidence values learned with C&R. From these rules, we can conclude that experienced investors with high income are more likely to take risks (scale > 40%) (No. 2); those aged 50 to 60 with limited investment knowledge are inclined to make decisive decisions (No. 7); and those who are experienced, with high income and sufficient investment knowledge, are usually high-frequency traders (No. 6). These rules are in accordance with investors' observed behaviors and are valuable for further applications.
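To make the notion of a decision rule with a confidence value concrete, the following Python sketch evaluates one "if . . . then . . ." rule on a handful of records. Everything in it is hypothetical: the records, the field names, and the level-code thresholds (experience and income coded 3 or higher) are illustrative assumptions, not values from the survey. Coverage is the share of records matching the "if" part, and confidence is the share of those matched records for which the "then" part holds.

```python
def rule_coverage_and_confidence(records, condition, conclusion):
    """Return (coverage, confidence) of the rule 'if condition then conclusion'."""
    matched = [rec for rec in records if condition(rec)]
    coverage = len(matched) / len(records)
    hits = sum(1 for rec in matched if conclusion(rec))
    confidence = hits / len(matched) if matched else 0.0
    return coverage, confidence


# Hypothetical client records; experience and income are ordinal level codes.
records = [
    {"experience": 4, "income": 3, "scale_pct": 50},
    {"experience": 3, "income": 4, "scale_pct": 45},
    {"experience": 3, "income": 3, "scale_pct": 20},
    {"experience": 1, "income": 5, "scale_pct": 60},
    {"experience": 2, "income": 2, "scale_pct": 10},
    {"experience": 5, "income": 1, "scale_pct": 30},
    {"experience": 1, "income": 1, "scale_pct": 5},
    {"experience": 4, "income": 5, "scale_pct": 70},
]

coverage, confidence = rule_coverage_and_confidence(
    records,
    condition=lambda rec: rec["experience"] >= 3 and rec["income"] >= 3,
    conclusion=lambda rec: rec["scale_pct"] > 40,  # risk-taking: scale above 40%
)
print(coverage, confidence)
```

A rule is useful for targeting when its confidence is clearly higher than the base rate of the behavior in the whole client list, even if its coverage is modest.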

Application case
In order to demonstrate the usefulness of the findings above, we assume that a financial institution is going to promote a particular financial service to clients with a certain behavioral preference, for example, risk-preference investors (scale more than 40%). This institution keeps a list of 100,000 clients with the above demographic characteristics, but without any risk-preference information. Let us assume that the promotion costs 10 RMB per person and that each customer who buys this service (potential responder) brings in 250 RMB of revenue. The response rate of risk-preference investors is 10%, while the rate for non-risk-preference investors is 1%. Here a decision must be made: should all clients be targeted or just some of them? Promoting to all clients would reach every risk-preference client, about 35,180 in number (risk-preference investors account for about 35.18%). Suppose instead that the institution releases questionnaires and collects data according to this research, and obtains a predictive rule like No. 2 in Table 8 from a data mining model. According to this rule, the institution would only target experienced (≥3) and high-income (≥3) investors, who make up around 25% of all investors, and would reach about 17,250 (25,000 × 69%) risk-preference investors. Table 9 compares the benefits of this promotion with and without the prediction model.
Based on these results, this financial service institution can earn a profit of 160,500 RMB from the precision promotion activity even after accounting for the modeling cost (40,000 RMB), which is better than the result of promoting to all clients without the model (41,500 RMB).
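The arithmetic behind these two profit figures can be retraced with a short Python sketch. It uses only the quantities stated in this section (100,000 clients, 10 RMB per contact, 250 RMB per sale, response rates of 10% and 1%, a 35.18% share of risk-preference clients, a rule covering 25% of clients with 69% confidence, and a 40,000 RMB modeling cost). The function and variable names are hypothetical, and one assumption is added: responders in each group are rounded down to whole people, which is what reproduces the reported totals.

```python
def campaign_profit(n_risk, n_other, modeling_cost=0,
                    cost_per_contact=10, revenue_per_sale=250):
    """Profit in RMB from contacting n_risk risk-preference and n_other other clients."""
    # Response rates from the text: 10% (risk-preference) and 1% (others).
    # Assumption: responders are counted per group and rounded down to whole people.
    responders = n_risk * 10 // 100 + n_other * 1 // 100
    revenue = responders * revenue_per_sale
    cost = (n_risk + n_other) * cost_per_contact + modeling_cost
    return revenue - cost


TOTAL_CLIENTS = 100_000

# Mass promotion: everyone is contacted; about 35.18% are risk-preference clients.
mass_risk = 35_180
mass_profit = campaign_profit(mass_risk, TOTAL_CLIENTS - mass_risk)

# Targeted promotion: the rule selects 25% of clients, 69% of whom are risk-preference.
targeted_total = TOTAL_CLIENTS * 25 // 100      # 25,000 clients
targeted_risk = targeted_total * 69 // 100      # 17,250 clients
targeted_profit = campaign_profit(targeted_risk, targeted_total - targeted_risk,
                                  modeling_cost=40_000)

print(mass_profit, targeted_profit)  # prints: 41500 160500
```

The targeted campaign reaches fewer risk-preference clients in absolute terms, but it avoids paying the contact cost for the 75,000 clients the rule excludes, which is where most of the gain comes from.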

Conclusions
Understanding the behaviors of investors is of great value to many financial institutions. However, investors' behavior information is ambiguous and implicit, which makes it difficult to observe, measure and obtain directly. This paper presents a new idea: analyze the capability to predict investment behaviors based on certain demographic characteristics, and verify the feasibility and effectiveness of building behavior prediction models based on these characteristics.

Decision rule (table excerpt, confidence 0.90): if ((occupation = 1) or (occupation = 6) or (occupation = 8)) and (experience = 1) and ((income ≤ 2) or (income ≥ 6)) and ((knowledge ≤ 2)) then instrument = 2

Our study makes the following contributions:
1. Unlike most studies, which focused on the effects of investment and examined the influence of investors' demographic and behavior characteristics as explanatory variables of models, we explored the potential relationship between investors' demographic characteristics and investment behaviors, and showed that investors' demographic characteristics can be used to predict their investment behaviors.
2. We apply certain easy-to-obtain personal characteristics as predictors to build models to evaluate investors' behavior preferences. In this way, the issue that investors' behavior preferences are hard to obtain and measure due to their dynamics, ambiguity, heterogeneity and uncertainty can be solved.
3. We use data mining which can reveal nonlinear, discontinuous and probabilistic relationships between variables to study the predictability of investor decision-making behaviors from the perspective of data applications.
In this paper, an in-depth study of the predictive power of several easy-to-obtain demographic characteristic variables on investors' behaviors has been conducted, and the following conclusions are drawn:
1. By applying Pearson's chi-squared test, Spearman rank correlation analysis, and six classic data mining techniques, we find that Chinese investors' decision behaviors are significantly and stably correlated with their demographic characteristics, which indicates that demographic characteristics can be used to predict investors' behaviors.
2. Among the demographic variables examined in this paper, experience and income are especially important predictors. The trade frequency of an investor is the most predictable behavior, followed by investment scale and investment instrument.
3. Because demographic characteristics are readily available, using them is an economical and feasible approach to predicting investors' behaviors. Even if investors' behaviors cannot be predicted exactly, the information hidden in demographic characteristics is still valuable for applications such as precision marketing and personalized service. Especially at the starting phase, data on demographic characteristics can be a useful supplement when behavioral data are insufficient, helping to address the "cold start" problem of business intelligence projects.