Analyzing Personal Happiness from Global Survey and Weather Data: A Geospatial Approach

Past studies have shown that personal subjective happiness is associated with various macro- and micro-level background factors, including macro-level conditions, such as weather and the economic situation, and personal health behaviors, such as smoking and exercise. We contribute to this literature by using a geospatial approach to examine both macro and micro links to personal happiness. Our geospatial approach incorporates two major global datasets: representative national survey data from the International Social Survey Program (ISSP) and corresponding world weather data from the National Oceanic and Atmospheric Administration (NOAA). After processing and filtering 55,081 records of ISSP 2011 survey data from 32 countries, we extracted 5,420 records from China and 25,441 records from 28 other countries. Sensitivity analyses of different intervals for average weather variables showed that macro-level conditions, including temperature, wind speed, elevation, and GDP, are positively correlated with happiness. To distinguish the effects of weather conditions on happiness in different seasons, we also adopted climate zone and seasonal variables. The micro-level analysis indicated that better health status and eating more vegetables or fruits are strongly associated with happiness, while never engaging in physical activity is associated with lower happiness. The findings suggest that weather conditions, economic situations, and personal health behaviors are all correlated with levels of happiness.


S1 File. The model diagnostic results
Model diagnostics are methods for determining whether a fitted model adequately represents the data. To evaluate the ordinal regression models in this study, we need to determine whether including the explanatory variables improves our ability to explain the outcome. We do this by comparing the ordinal regression model without any explanatory variables (the "Intercept Only" model) against the model with all the explanatory variables (the "Final" model), testing whether the latter significantly improves the fit to the data. S1 File Tables 1 and 2 show the model fitting information: the statistically significant chi-square statistics indicate that the Final models are a significant improvement over the Intercept Only models.
A standard statistical approach for testing model fit is to check the observed data against the fitted model for consistency. From the observed and expected frequencies, the usual Pearson and Deviance goodness-of-fit measures are computed in S1 File Tables 3 and 4. We start from the null hypothesis that the fit is good. If we do not reject this hypothesis (i.e., if the p-value is large), we conclude that the observed data and the model predictions are similar and that we have a good model.
The results suggest that all models in this study fit well (p > 0.01).
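The two goodness-of-fit statistics described above can be computed directly from the observed and model-expected cell counts; the counts below are hypothetical, chosen only to illustrate a well-fitting model:

```python
import numpy as np
from scipy import stats

# Hypothetical observed counts and model-expected counts across response cells
observed = np.array([52, 110, 418, 960, 1480, 730, 250], dtype=float)
expected = np.array([48, 118, 430, 940, 1500, 715, 249], dtype=float)

# Pearson goodness-of-fit statistic: sum of (O - E)^2 / E
pearson = np.sum((observed - expected) ** 2 / expected)
# Deviance goodness-of-fit statistic: 2 * sum of O * ln(O / E)
deviance = 2 * np.sum(observed * np.log(observed / expected))

df = len(observed) - 1  # illustrative degrees of freedom
p_pearson = stats.chi2.sf(pearson, df)
print(f"Pearson = {pearson:.2f}, Deviance = {deviance:.2f}, p = {p_pearson:.3f}")
```

Because the observed and expected counts are close, both statistics are small and the p-value is large, so the null hypothesis of a good fit is not rejected.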
Several pseudo-R² statistics can be used to measure the strength of association between the response variable and the explanatory variables. In this study, three commonly used pseudo-R² statistics are employed; the results are shown in S1 File Tables 5 and 6. Here, the Nagelkerke pseudo-R² values (16.80% for the 28 countries' data and 8.80% for the China data) indicate that the model explains a relatively small proportion of the variation between people in their happiness levels. This is just as we would expect, because numerous factors affect personal happiness.
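All three common pseudo-R² measures follow directly from the log-likelihoods of the Intercept Only and Final models; the values below are hypothetical, chosen only to show the calculation:

```python
import numpy as np

def pseudo_r2(ll_null, ll_full, n):
    """McFadden, Cox-Snell, and Nagelkerke pseudo-R2 from two log-likelihoods."""
    mcfadden = 1.0 - ll_full / ll_null
    cox_snell = 1.0 - np.exp(2.0 * (ll_null - ll_full) / n)
    # Nagelkerke rescales Cox-Snell so its maximum attainable value is 1
    nagelkerke = cox_snell / (1.0 - np.exp(2.0 * ll_null / n))
    return mcfadden, cox_snell, nagelkerke

# Hypothetical log-likelihoods for illustration only (not the fitted values)
mcf, cs, nag = pseudo_r2(ll_null=-42000.0, ll_full=-39500.0, n=25441)
print(f"McFadden = {mcf:.3f}, Cox-Snell = {cs:.3f}, Nagelkerke = {nag:.3f}")
```

Nagelkerke's version is always at least as large as Cox-Snell's, since it divides by the latter's upper bound.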
Multicollinearity occurs when a statistical model has two or more explanatory variables that are highly correlated with each other. This makes it difficult to determine which variable contributes to the explanation of the response variable and causes technical issues in fitting an ordinal regression, so checking for multicollinearity is an important step. Testing this assumption requires creating dummy variables for each level of the categorical explanatory variables in this study. We use the Variance Inflation Factor (VIF) of each variable as a check for multicollinearity and, as a rule of thumb, treat a VIF greater than 10 as a flag for variables that may merit further investigation. The results in S1 File Tables 7 and 8 indicate that the VIF values are all quite acceptable for the significant variables in this study. Two variables, "smoking cigarettes" and "past 12 months: have visited a doctor," have large VIF values; both, however, are non-significant in our models.
The key assumption in fitting an ordinal regression is that the effects of the explanatory variables are consistent, or proportional, across the different thresholds; hence this is usually termed the proportional odds assumption (SPSS calls it the assumption of parallel lines). We evaluate the appropriateness of this assumption through the "test of parallel lines" in SPSS, which compares the ordinal model, with one set of coefficients for all thresholds (labelled Null Hypothesis), to a model with a separate set of coefficients for each threshold (labelled General). The test results in S1 File Tables 9 and 10 show that the General model gives a significantly better fit to the two datasets than the ordinal model, leading us to reject the proportional odds assumption. This test has been described as anti-conservative, however, in that it nearly always rejects the proportional odds assumption [1], particularly when the number of explanatory variables is large [2], the sample size is large [3,4], or the model contains a continuous explanatory variable [3]. It is therefore important to examine the data using multinomial logistic regression to see explicitly how the odds ratios (ORs) for our explanatory variables vary at the different thresholds. We use the 2-day-period ordinal regression for the 28 countries as an illustration. Looking at the separate ORs of the continuous explanatory variables across the six splits in S1 File Table 11, the differences appear negligible (0.963 to 1.005 for Age), so a common OR for each of these continuous explanatory variables is a very plausible assumption. The proportional odds assumption is also upheld for most of the categorical variables.
The categorical variable most out of line with the proportional odds assumption is smoking cigarettes. Of 25,096 respondents to this question, 12,432 (49.50%) replied, "Do not smoke and never did"; 6,567 (26.25%) replied, "Do not smoke now but smoked in the past"; 1,544 (6.20%) replied, "Smoke 1-5 cigarettes per day"; 1,683 (6.70%) replied, "Smoke 6-10 cigarettes per day"; 2,296 (9.10%) replied, "Smoke 11-20 cigarettes per day"; 518 (2.10%) replied, "Smoke 21-40 cigarettes per day"; and 56 (0.20%) replied, "Smoke more than 40 cigarettes per day." The ORs for smoking cigarettes from the separate logistic regressions differ substantially. In this particular case, it might be reasonable to conclude that the ORs from the ordinal regression model underestimate the extent of the over-representation at the "Do not smoke and never did," "Do not smoke now but smoked in the past," "Smoke 6-10 cigarettes per day," and "Smoke 11-20 cigarettes per day" levels, and overestimate the extent of the under-representation at the "Smoke 1-5 cigarettes per day" and "Smoke 21-40 cigarettes per day" levels. Summarizing the relationship in a single cumulative OR for each level of smoking cigarettes obscures these differences. Thus, smoking cigarettes may well be the major factor underlying the overall rejection of parallel lines. The chi-square test that led to the rejection of the proportional odds assumption probably reflects the large sample size in our datasets; we consider the violation of the proportional odds assumption quite minor.