Multi-Country Evaluation of Affective Experience: Validation of an Abbreviated Version of the Day Reconstruction Method in Seven Countries

Background: The Day Reconstruction Method (DRM) was developed to assess affective states as measures of experienced well-being. The present study aimed to validate an abbreviated version of the DRM in a representative sample of the population in seven countries (China, Ghana, India, Mexico, Russia, South Africa, and Spain), and to examine whether there are country differences in affect and in the relationships among the activities based on the similarity of the affect associated with each of them.
Methods: Interviews were conducted with 47,222 non-institutionalized adults from seven countries, using an abbreviated version of the DRM. A cluster analysis was carried out to classify activities on the basis of the similarity of the associated affect. In each country, the factorial structure of the affect adjectives was tested through Confirmatory Factor Analysis. Internal consistency and construct validity were also assessed. Moreover, the differences in affect across countries and the diurnal cycles of affect were evaluated.
Results: The DRM showed adequate psychometric properties regarding reliability and construct validity in all countries. Respondents from Ghana and South Africa reported more positive net affect whereas Indian respondents reported less positive net affect. Most of the countries showed a similar diurnal variation of affect, which tended to improve throughout the day.
Conclusions: The results show that this abbreviated version of the DRM is a useful tool for multi-country evaluation of experienced well-being.


Introduction
The Day Reconstruction Method (DRM) [1] is a tool for studying well-being which assesses how people spend their time, with whom they spend time and how they describe the experiences associated with the activities and settings of their lives. The DRM may have multiple applications, from evaluating the impact of different illnesses or analyzing social and economic stressors to evaluating policies and measuring the well-being of society [1].
Kahneman et al. [1] administered the first version of the DRM to 909 working women in Texas in the United States of America.
Respondents were first asked to reconstruct the previous day by writing a diary consisting of a sequence of episodes covering the entire waking day. They then described each episode by answering questions about it and about the feelings they experienced during it.
Due to time and economic constraints, most national comparisons of well-being assess evaluative well-being based on global life satisfaction or general happiness questions [8]. However, cultural comparisons based on these self-reports are difficult to interpret, and have some important sources of bias, e.g. memory bias and positivity bias [9].
Experienced well-being is perhaps a more challenging construct to measure. In recent years, some attempts have been made to evaluate experienced well-being in larger samples. Krueger and Stone [10] designed a telephone survey based on the American Time Use Survey and the DRM: the Princeton Affect and Time Survey (PATS). This approach is a good approximation of the gold standard, the Experience Sampling Method. The Gallup World Poll [11] was able to obtain information about well-being from nationally representative samples, with questions about whether respondents experienced certain positive and negative feelings a lot during the previous day. Although this poll provides valuable insights, time use and the associated affective experience are not collected for each activity, which limits its usefulness in fully characterizing experienced well-being.
A brief measure of experienced well-being that could be used in large nationally representative survey samples could overcome the logistical barriers to generating comparable data on well-being across countries worldwide. One example of an abbreviated version of the DRM that can be used in large population surveys, irrespective of literacy levels and the ubiquity of telephones, has been recently validated in Jodhpur, India as part of the World Health Organization's Study on Global Ageing and Adult Health (SAGE) [12].
The present study aimed to validate this abbreviated version of the DRM in a representative sample of the population of seven countries, and to examine whether country differences exist in affect and in the activity-affect relationships.

Sample and Procedure
The data were obtained from SAGE and the Collaborative Research on Ageing in Europe (COURAGE in Europe), which are two multi-country projects compiling comprehensive data on the health and well-being of adult populations and the ageing process. Both studies collected data on respondents aged 18+ years, with an emphasis on populations aged 50+ years, from nationally representative samples. The data from Spain presented in the present paper come from COURAGE in Europe, whereas the data from the other six countries come from SAGE. The details of the design and methods for SAGE are published elsewhere [13].

Measures
Respondents were asked to provide demographic information (age, sex, education level, marital status, residential location, work status) at the beginning of the interview. The abbreviated version of the Day Reconstruction Method [1] (available at http://www.who.int/healthinfo/systems/sage/en/index.html), used to obtain information about participants' daily activities and their subjective well-being, was limited to a maximum of 15 minutes of interview time. Respondents were asked to reconstruct a portion (morning, afternoon, or evening) of their previous day's activities, and reported the positive and negative emotions associated with each activity. The data provided a picture of the participants' daily lives, including what they did, for how long, and who they were with, as well as a way of calculating how much of their time was spent feeling pleasant or unpleasant emotions.
Respondents were randomly assigned to complete one of the four different versions of the abbreviated DRM (sets A, B, C, and D). In sets A, B and C, respondents reconstructed only a portion of their previous day's activities (starting with morning, afternoon, or evening respectively) and responded to questions about each episode, including the nature of the activity (for example, working, shopping), any people who were present (for example, alone, with spouse), and the extent to which they experienced various feelings (worried, rushed, irritated or angry, depressed, tense or stressed, calm or relaxed, and enjoying) on a 3-point response scale (1 = not at all, 2 = a little, and 3 = very much). In the sample from Spain, the scale ranged from 0 (not at all) to 6 (very much) with the remaining points unlabeled. In set D, participants reported the activities, people present, and feelings for each part of the day (morning, afternoon and evening) as a whole, rather than activity by activity with the accompanying emotions. In sets A, B and C, the day was recorded in an event-by-event manner, and the participants reported the time at which the first registered activity started and the duration of each activity. This information was used to estimate the affective state at each hour of the day. Set D recorded broadly what was done in the morning, afternoon, and evening, and therefore the duration of each activity was not reported. Set D was not used in Spain.
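The hourly estimation described for sets A, B and C can be illustrated with a minimal Python sketch. It rests on assumptions that the text does not state: episodes are taken to follow one another without gaps, each clock hour receives the mean of the episode scores weighted by the minutes of overlap, and the function name `hourly_affect` and the example values are hypothetical.

```python
def hourly_affect(start_min, episodes):
    """Estimate an affect score for each clock hour.

    start_min: start of the first episode, in minutes after midnight.
    episodes:  list of (duration_min, score) tuples in chronological order,
               assumed to be contiguous (an assumption of this sketch).
    Returns {hour: overlap-weighted mean score}.
    """
    weighted = {}  # hour -> sum of score * minutes of overlap
    minutes = {}   # hour -> total minutes covered in that hour
    t = start_min
    for duration, score in episodes:
        end = t + duration
        while t < end:
            hour = int(t // 60)
            hour_end = (hour + 1) * 60
            overlap = min(end, hour_end) - t
            weighted[hour] = weighted.get(hour, 0.0) + score * overlap
            minutes[hour] = minutes.get(hour, 0.0) + overlap
            t += overlap
    return {h: weighted[h] / minutes[h] for h in weighted}


# Hypothetical respondent: first episode starts at 07:30; two episodes of
# 45 and 90 minutes with affect scores 1.0 and 2.0.
example = hourly_affect(450, [(45, 1.0), (90, 2.0)])
print(example)
```

In this example the 08:00 hour is covered by 15 minutes of the first episode and 45 minutes of the second, so its estimate lies between the two episode scores.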
Two measures were calculated based on the scores obtained with the seven affect items: net affect and U-index [14]. Net affect was defined as the average of the two positive emotions (calm/relaxed and enjoying) minus the average of the five negative ones (worried, rushed, irritated/angry, depressed, and tense/stressed), that is, positive affect minus negative affect. For sets A, B and C, scores were weighted by activity duration. For set D, a raw score was calculated because the affect items were not associated with single activities. For comparisons between the seven study countries, net affect scores were expressed on a percentile scale, in which 100 represents the best affective state. Positive and negative affect were also expressed on the percentile scale. The U-index was obtained by calculating, for each participant, the proportion of time in which the highest-rated feeling was a negative one. In set D, the U-index was not calculated because the duration of each activity was not collected.
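The two definitions above can be made concrete with a short Python sketch for sets A, B and C. The item keys, the `Episode` class and the example ratings are hypothetical, and the handling of ties in the U-index (an episode counts as unpleasant only when the strongest negative rating strictly exceeds the strongest positive rating) is an assumption of this sketch, since the text does not specify it. Rescaling to the 0 to 100 scale is not shown.

```python
from dataclasses import dataclass

POSITIVE = ("calm_relaxed", "enjoying")
NEGATIVE = ("worried", "rushed", "irritated_angry", "depressed", "tense_stressed")


@dataclass
class Episode:
    duration_min: float
    ratings: dict  # item name -> rating on the 1-3 response scale


def episode_net_affect(ep):
    """Mean of the positive items minus mean of the negative items."""
    pos = sum(ep.ratings[k] for k in POSITIVE) / len(POSITIVE)
    neg = sum(ep.ratings[k] for k in NEGATIVE) / len(NEGATIVE)
    return pos - neg


def weighted_net_affect(episodes):
    """Duration-weighted mean of the episode-level net affect."""
    total = sum(ep.duration_min for ep in episodes)
    return sum(episode_net_affect(ep) * ep.duration_min for ep in episodes) / total


def is_unpleasant(ep):
    """True when the highest-rated feeling is a negative one (ties excluded)."""
    return max(ep.ratings[k] for k in NEGATIVE) > max(ep.ratings[k] for k in POSITIVE)


def u_index(episodes):
    """Proportion of reported time spent in unpleasant episodes."""
    total = sum(ep.duration_min for ep in episodes)
    return sum(ep.duration_min for ep in episodes if is_unpleasant(ep)) / total


# Hypothetical respondent with two episodes.
day = [
    Episode(60, {"calm_relaxed": 3, "enjoying": 3, "worried": 1, "rushed": 1,
                 "irritated_angry": 1, "depressed": 1, "tense_stressed": 1}),
    Episode(30, {"calm_relaxed": 1, "enjoying": 1, "worried": 3, "rushed": 2,
                 "irritated_angry": 1, "depressed": 1, "tense_stressed": 3}),
]
print(weighted_net_affect(day), u_index(day))
```

Here the first episode has a net affect of 2 and the second of -1, so the duration-weighted net affect is 1.0, and one third of the reported time falls in an unpleasant episode.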
The questions were translated from English into the local languages, following the WHO translation guidelines for assessment instruments. This included a forward translation, a targeted back-translation, review by a bilingual expert group and the elaboration of a detailed translation report. Ethical approvals from the following institutions were obtained: Ethics

Statistical methods
The sociodemographic characteristics of respondents were recorded for each country. Differences in sociodemographic characteristics between the final and the excluded sample were assessed by means of unpaired t-tests (continuous variables) and χ² tests (categorical variables).
Descriptive analyses generated mean scores of positive and negative adjectives (worried, rushed, irritated/angry, depressed, tense/stressed, calm/relaxed, and enjoying) for the different activities. Based on these means, a cluster analysis was carried out in order to identify the relationships among the activities coded in the DRM in each country, according to the similarity of the activity-affect relationship. Variables were standardized, and the Euclidean distance was employed as a dissimilarity measure. A divisive hierarchical clustering method was used. The average linkage method was chosen because it maximizes the cophenetic correlation coefficient (CCC) [15], which is a measure of how faithfully a dendrogram maintains the original pairwise distances. A bootstrap version of cluster analysis that evaluated how consistently the same clusters appeared over 10,000 runs was performed with a sub-sampled dataset. For this purpose, the pvclust [16] package for the statistical software R was used. For each cluster in hierarchical clustering, p-values were calculated. The p-value of a cluster is a value between 0 and 1, which indicates how strongly the cluster is supported by the data. The pvclust package provides two types of p-values, the Approximately Unbiased (AU) p-value and the Bootstrap Probability (BP) value; both were reported before statistical adjustments. The AU p-value, which is computed by multiscale bootstrap resampling, is a better approximation to the unbiased p-value than the BP value computed by normal bootstrap resampling. Clusters strongly supported by the data (with AU higher than 95%) were highlighted by rectangles.
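The preprocessing step of the cluster analysis (standardizing the variables and computing Euclidean distances between activities) can be sketched in Python as follows. This is an illustration only: it does not reproduce the hierarchical clustering or the pvclust bootstrap, the activity names and mean ratings are hypothetical, only two affect items are shown instead of seven, and the use of the sample standard deviation for standardization is an assumption.

```python
import math
from statistics import mean, stdev

# Hypothetical mean ratings per activity (two affect items shown for brevity;
# the study used the means of all seven adjectives).
profiles = {
    "working":     [1.8, 2.0],
    "housework":   [1.7, 2.1],
    "watching_tv": [1.2, 2.8],
}


def standardize(profiles):
    """Z-score each affect item across activities.

    Assumes every item varies across activities (non-zero standard deviation).
    """
    names = list(profiles)
    columns = list(zip(*(profiles[n] for n in names)))
    mus = [mean(c) for c in columns]
    sds = [stdev(c) for c in columns]  # sample SD: an assumption of this sketch
    return {n: [(x - m) / s for x, m, s in zip(profiles[n], mus, sds)]
            for n in names}


def distance_matrix(z):
    """Pairwise Euclidean distances between standardized activity profiles."""
    names = list(z)
    return {(a, b): math.dist(z[a], z[b])
            for i, a in enumerate(names) for b in names[i + 1:]}


z = standardize(profiles)
d = distance_matrix(z)
for pair, value in d.items():
    print(pair, round(value, 3))
```

With these illustrative values, the two work-type activities lie much closer to each other than either does to the leisure activity, which is the kind of structure a hierarchical clustering of the distance matrix would then pick up.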
In each country, the factorial structure (negative versus positive items) of the seven adjectives was tested through Confirmatory Factor Analysis (CFA) for categorical outcomes, with the robust weighted least square estimator (WLSMV; which does not have assumptions of multivariate normality), and using polychoric correlations [17] for categorical variables. Goodness-of-fit of the model with two latent factors and seven observable indicators was assessed according to standard recommendations [18]. Several indices were used to assess fit according to the values proposed in the literature for Structural Equation Modeling (SEM) with categorical outcomes [19][20][21][22]: a) lack of significance of χ²; b) comparative fit index (CFI) > 0.95; c) Tucker-Lewis index (TLI) > 0.95; d) root mean square error of approximation (RMSEA) < 0.08; and e) weighted root mean square residual (WRMR) < 1.0. Since the χ² statistic is sensitive to sample size [23], the χ² values might be inflated (and statistically significant) due to the large size of the sample, which might erroneously imply a poor data-to-model fit [24]. Burnham and Anderson [25] noted that model goodness-of-fit based on statistical tests becomes irrelevant with large sample sizes. Moreover, WRMR is size-dependent and can be unreliable when the sample has more than 2000 cases. Due to these considerations and the large size of the sample used, only CFI, TLI and RMSEA are reported in this study. One of the main advantages of RMSEA is that it allows the calculation of a confidence interval around its value [26]. Ninety percent confidence intervals for RMSEA are also reported.
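As a small illustration of how the three reported indices relate to the cutoffs listed above, the following Python sketch checks a set of fit values against them. The function name `fit_checks` and the example values are hypothetical; the fit indices themselves would come from the SEM software, which this sketch does not replace.

```python
def fit_checks(cfi, tli, rmsea):
    """Compare fit indices with the cutoffs used in this study:
    CFI > 0.95, TLI > 0.95 and RMSEA < 0.08."""
    return {
        "CFI": cfi > 0.95,
        "TLI": tli > 0.95,
        "RMSEA": rmsea < 0.08,
    }


# Hypothetical fit values for one country.
result = fit_checks(cfi=0.99, tli=0.99, rmsea=0.05)
print(result, "adequate fit" if all(result.values()) else "inadequate fit")
```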
In sets A, B and C, the affective state associated with each activity was coded. Because the affective state was not reported for each activity in set D, cluster analysis and CFA were carried out only with the activities reported in sets A, B and C, pooling the data corresponding to these sets.
Reliability was assessed in terms of internal consistency using the Raykov & Marcoulides [27] method for reliability evaluation with categorical items, by means of the Mplus option for maximum likelihood estimation with robust standard errors (MLR). Composite reliability was estimated for each of the factors obtained after CFA. In set D, similar analyses were carried out to assess construct validity and reliability. In this case, the analyses were run by country, pooling the responses for the morning, afternoon and evening questions.
Diurnal variation of affect was assessed in each country, using the affective state reported by respondents in each hour of the day. This information was calculated for each participant assigned to sets A, B, and C based on the time at which the first activity reported began and the duration of each activity. Cross-country differences in net affect and U-index were tested by an ANOVA test. Bonferroni tests for post-hoc pairwise comparisons were used. A 95% confidence level was used in hypothesis tests. Since statistical significance of differences could be due to the large sample size, effect size measures (Cramer's V for contingency table chi-square tests, Hedges' g for unpaired t-tests and pairwise comparisons, and Cohen's f for ANOVA tests) are reported. Cohen's guidelines were used as a reference [28]: Hedges' g values of 0.20, 0.50, and 0.80 constitute small, medium, and large effect sizes, respectively; these values are, respectively, 0.10, 0.25, and 0.40 in the case of Cohen's f, and 0.10, 0.30, and 0.50 in the case of Cramer's V for chi-square tests on 2×2 contingency tables. Cluster analyses were carried out using R version 2.10.1 [29]. Mplus version 6 [30] was employed for factor analysis modeling. The rest of the analyses were performed using Stata SE version 11 [31].
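For readers less familiar with Hedges' g, a minimal Python sketch of its usual computation from summary statistics is given below, together with the thresholds quoted above. The small-sample correction factor 1 - 3/(4(n1 + n2) - 9) is the common approximation and is assumed here; the text does not state which variant was used. The function names, the label "below small" and the example values are hypothetical.

```python
import math


def hedges_g(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference with a small-sample bias correction."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    d = (mean1 - mean2) / math.sqrt(pooled_var)
    correction = 1 - 3 / (4 * (n1 + n2) - 9)  # common approximation (assumed)
    return d * correction


def cohen_label(g):
    """Label |g| using the 0.20 / 0.50 / 0.80 guideline values."""
    size = abs(g)
    if size >= 0.80:
        return "large"
    if size >= 0.50:
        return "medium"
    if size >= 0.20:
        return "small"
    return "below small"


# Hypothetical summary statistics for two countries.
g = hedges_g(mean1=70, sd1=10, n1=100, mean2=60, sd2=10, n2=100)
print(round(g, 3), cohen_label(g))
```

With samples as large as those in this study the correction factor is very close to 1, so g is practically identical to the uncorrected standardized difference.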

Results
A total of 47,222 people from China, Ghana, India, Mexico, the Russian Federation, South Africa, and Spain were interviewed. However, 1564 (3.3% of the initial cases) were removed from these analyses because they did not answer the subjective well-being section analyzed in the present article. The respondents excluded from the analyses did not differ (the differences were not significant or had a very small effect size) by sex (56.7% women in the final sample vs. 60.0% women in the sample removed, p = 0.001, Cramer's V = 0.02), mean age  Table 1 presents the main characteristics of the 45,658 respondents included in this study. The percentage of these participants who completed each of the four sets was approximately 25% (10,250 in set A, 10,355 in set B, 10,123 in set C, and 10,347 in set D) in SAGE. In the sample from Spain, 1536 participants completed set A, 1507 set B, and 1540 set C. Significant differences in sociodemographic characteristics were found between countries, with effect sizes ranging from moderate to large. The largest differences were found for mean age and residential setting, with India presenting the lowest mean age and the highest percentage of population living in a rural setting.

Pattern of activities
Cluster analyses of activities were conducted to classify activities according to their affect patterns. A total of 43,288 activities were considered for China, 17,094 for Ghana, 41,566 for India, 4844 for Mexico, 10,560 for Russia, 13,722 for South Africa, and 25,087 for Spain. Activities (and their corresponding affective states) were collected from participants who answered sets A, B, and C. The activity ''went to sleep for the night'' was excluded from the cluster analyses because the affective state was not coded for this activity.
The average linkage method presented a good value for CCC in each country. CCC values ranged from 0.73 in Russia to 0.98 in South Africa. The clusters (edges in Figure S1) with high AU values (higher than 95%) are strongly supported by the data. The dendrogram in Figure S1 shows the order in which the different clusters were created. Although the pattern of results differed from country to country (Figure S1), there was a tendency for leisure activities to group in one cluster, whereas household and work-related activities were in another. Some activities, such as intimate relations, were underreported, and did not show a stable pattern of association.

Reliability and construct validity in sets A, B, C, and D
Considering the activities reported in sets A, B and C, a CFA was carried out by country to examine the construct validity and to test whether the factorial structure of affect was comprised of two factors: negative items loading on one factor and positive items loading on the other. Adequate fit indices, indicating satisfactory model fit, were found for all the countries, pooling sets A, B and C in each of them. In all cases, CFI and TLI values were higher than 0.98. RMSEA values ranged from 0.026 to 0.074. Moreover, in each group and subgroup, the upper boundaries of 90% confidence intervals were always lower than 0.080. These findings suggest that a two-factor model (negative affect and positive affect) can be considered a plausible hypothesis for the initial seven-item instrument in each country.
The results of the model shown in Table 2 are the estimates of the loadings of each observed measure on each of the factors, followed by the standard errors (s.e.) and their associated p-values of the null hypothesis that in the studied population the pertinent factor loading was zero. Factor correlation was negative and significant in all cases, with values ranging from −0.374 in Mexico to −0.753 in South Africa. The values associated with composite reliability in the two factors suggested an adequate reliability for the DRM in all the countries considered. Composite reliability values for negative affect ranged from 0.770 in India to 0.911 in Spain, whereas values ranging from 0.700 in Russia to 0.893 in China were found for positive affect.
In set D, the affect was coded for each part of the day, not separated by activity, so CFA and reliability analyses were carried out separately for each country, pooling the responses in the morning, afternoon and evening. The results in Table 3 again show a good construct validity of the affect items, divided in two factors, with adequate goodness-of-fit indices. Adequate reliability was found in each country for this version, although in some cases composite reliability for positive affect was slightly lower than 0.70. These analyses were not carried out in Spain, because set D was not used in the Spanish sample.

Differences in affect across countries
Significant differences across countries were found for net affect, with moderately high effect size (Table 4). The highest mean net affect value was found in South Africa. All the pairwise comparisons (except Russia vs Mexico and Spain vs China) were significant at a 99% confidence level after the Bonferroni correction, although in some cases these differences could be due to the large sample size. According to the effect size associated with these pairwise comparisons, the largest differences in net affect were found between South Africa and India (Hedges' g = 1.00), Mexico (Hedges' g = 0.92), and Russia (Hedges' g = 0.91). Similar results were found for the U-index, with lower values for South Africa, Spain, China, and Ghana, and higher values for India, Mexico, and Russia (see results in Tables 4 and 5).
The average number of episodes reported was 4.7 (s.d. = 1.8), ranging from 2.5 in Mexico to 6.0 in Spain; average episode duration was 85.7 minutes (s.d. = 68.9), ranging from 60.5 in India to 146.4 in Mexico.

Diurnal variation of affect
As shown in Figure 1, net affect improved as the day passed, although in some countries (South Africa, Ghana, and Russia) it declined in the evening. Mexico showed a different pattern from the other six countries, with net affect reaching its peak upon awakening and declining throughout the day.
Positive affect showed a pattern similar to that of net affect (Figure 2). In some countries (South Africa, Spain, Russia, and Ghana) it declined early in the morning, and for all countries except Mexico it increased as the day passed, with a decline in the evening in all countries except China. In Mexico, positive affect decreased as the day passed, but increased in the afternoon and early in the evening.
Negative affect was most pronounced in the morning and tended to decrease as the day passed, although in most countries (except China) it rose in the evening. Mexico showed a different pattern, with negative affect increasing as the day passed.

Discussion
The present study's results show that this abbreviated version of the DRM is a useful tool for multi-country studies evaluating experienced well-being. This study confirms that the abbreviated version of the DRM tested showed adequate psychometric properties regarding reliability and construct validity in the seven countries. The clustering of activities varied across countries, although the results showed a tendency to form two groups of activities: one group comprising work and household activities, with a lower associated net affect, and another one mainly comprising leisure activities, with a higher associated net affect.
The results obtained from the fitted CFA model support the two-factor hypothesis regarding the seven affective state items of the DRM, with worried, rushed, irritated/angry, depressed, and tense/stressed loading on the negative affect factor, and calm/relaxed and enjoying loading on the positive affect factor. Lucas et al. [32] also found positive affect and negative affect to be discriminable from each other. Furthermore, both factors showed adequate internal consistency, although other studies using different adjectives have obtained higher rates [33].
Ghana and South Africa presented the highest net affect, whereas India showed the lowest net affect. Regarding the diurnal variation of affect, all countries showed a similar pattern, with the exception of Mexico. In general terms affect improved as the day passed (positive affect increased and negative affect declined), and in the evening it worsened (positive affect declined and negative affect increased). Previous studies in the United States with the Experience Sampling Method and the DRM have also found similar patterns regarding the diurnal rhythms of affect, with negative affect falling for most of the day [1,34] and positive affect improving as the day passed [34].
Regarding the feasibility of use, this is a much shorter version of the DRM, since it is designed to last a maximum of 15 minutes of interview time, whereas completion times for the self-administered original DRM instrument [1] ranged from 45 to 75 minutes [1,2]. Furthermore, the administration of the questionnaire through an interview has the advantage that it can be administered to people regardless of their level of education. The fact that more than 40% of the sample had received little or no formal education was not a barrier for the administration of the DRM. The average number of episodes reported was around five, which, as expected, is about a third of the episodes reported in the original DRM version developed by Kahneman et al. [1], although the average episode duration was 86 minutes, longer than in the original version, in which it was around one hour [1].
This instrument can therefore be used to measure everyday experiences and activities and can be useful in social and health research to analyze the associations between affective states in everyday life and behavior, such as time use, or underlying biological processes [35]. Furthermore, it can be helpful for economists and policymakers to measure the well-being of the society [14].
Table 5. Hedges' g associated with significant differences after Bonferroni correction in pairwise comparisons across countries in terms of net affect (lower diagonal) and U-index (upper diagonal). Rows and columns: China, Ghana, India, Mexico, Russia, South Africa, Spain.
One of the strengths of the present study is its large sample size and the fact that the recruited sample included geographically and socio-economically diverse participants who are representative of each country's population. Nevertheless, large sample sizes can mean that small differences are detected as statistically significant. For this reason, in the present study effect sizes were calculated to describe the magnitude of the differences.
The results of this study should be interpreted taking into account some limitations. Convergent validity and test-retest reliability were not assessed. However, the convergent validity of the DRM has been previously shown to be appropriate when compared with the Ecological Momentary Assessment [33,35], and previous analyses performed with this abbreviated version of the DRM have shown a moderate temporal stability [12]. Further studies are being planned as part of the SAGE study program to compare self-reported emotive states obtained through the DRM with ESM and with biomarkers of emotive states.