Body shape matters: Evidence from machine learning on body shape-income relationship

The association between physical appearance and income has long been of central interest in social science. However, most previous studies measured physical appearance using classical proxies drawn from subjective survey opinions. In this study, we use a novel dataset, called CAESAR, which contains three-dimensional (3D) whole-body scans to mitigate possible reporting and measurement errors. We demonstrate the existence of significant nonclassical reporting errors in reported heights and weights by comparing them with their measured counterparts, and show that these discrete measurements are too sparse to provide a complete description of body shape. Instead, we use a graphical autoencoder to obtain intrinsic features of human body shapes directly from the 3D scans and estimate the relationship between body shapes and family income. We also take into account the possible endogeneity of body shapes using proxy variables and control functions. The estimation results reveal a statistically significant relationship between physical appearance and family income, and these associations differ across genders. This supports the hypothesis of a physical attractiveness premium in labor market outcomes and its heterogeneity across genders.


Introduction
This paper studies the relationship between physical appearance and family income using unique three-dimensional (3D) whole-body scan data. Recent developments in machine learning are adapted to extract intrinsic body features from the scanned data. Our approach underscores the importance of reporting errors and measurement errors in conventional measurements of body shape such as height and body mass index (BMI).
In the literature on the association between physical attractiveness and labor market outcomes, sparse measurements such as facial attractiveness, height, and BMI are mainly used as measures of physical appearance. For instance, Hamermesh and Biddle (1994) studied the impact of facial attractiveness on wages and showed that there is a significant beauty premium. Persico et al. (2004) and Case and Paxson (2008) analyzed the effects of height on wages. They found an apparent height premium in labor market outcomes. Cawley (2004) estimated the effects of BMI on wages and reported that weight lowers the wages of white females. In most previous studies, physical appearance was measured by imperfect proxies from subjective survey opinions. This raises the possibility of attenuation bias from reporting errors on physical appearance when estimating the relationship between physical appearance and labor market outcomes. In addition, measurements such as height, weight, and BMI are too sparse to characterize detailed body shapes. As a result, measurement errors on body shapes make it difficult to estimate the true relationship correctly, so that many studies ended up with mixed results.
We use a novel dataset, the Civilian American European Surface Anthropometry Resource (CAESAR) dataset. The dataset contains detailed demographics of subjects and 40 anthropometric measurements, such as height and weight, obtained using tape measures and calipers. It also contains the height and weight reported by subjects. This allows us to calculate the reporting errors in height and weight for each subject and investigate their properties and impacts on the estimation results. We find that the reporting error in males' height is correlated with characteristics such as family income, age, race, and birth region, while the reporting error in females' height is correlated with their age. We also find that the reporting error in males' weight depends on their true weight, while the reporting error in females' weight is associated with their true weight and hours of exercise.
We further investigate the properties of the reporting errors using nonparametric conditional mean and nonlinear quantile functions. Nonparametric estimation of the conditional expectation of the reporting error in height given the true height shows that the reporting error for females' height is nonclassical, in the sense that the reporting error and the true height are dependent. The quantile regression shows that the conditional median of the reporting error is independent of the true height. Thus it would be more plausible to impose a restriction on the conditional quantile of the reporting error of height than on the conditional mean (see Bollinger (1998), Hu and Schennach (2008), and Song (2015)). On the other hand, the nonparametric conditional mean and nonlinear quantile regressions show that there are substantial nonclassical errors in both genders' reported weight.
The estimation results for the association between height (or BMI) and family income confirm that the reporting errors have substantial impacts on the estimated coefficients. Furthermore, such conventional measurements of body shape are too sparse to describe the whole body structure, so analyses with sparse measurements are very sensitive to variable selection. This implies that regressions with measured height and BMI might still suffer from measurement errors on body shape. A handful of papers address the issue by proposing statistical methods, such as bias-correction methods or instrumental-variables approaches, which deliver consistent estimators at the expense of strong assumptions.
The dataset also encloses digital 3D whole-body scans of subjects, which is a very unique feature. The scanned data on human body shapes mitigate the issue of possible measurement errors due to sparse measurements. Since the observed variable on body shapes in the dataset is three-dimensional, however, it is not straightforward to incorporate the data into a model of family income. Indeed, there are 45,534 covariates for each individual's body scan. To this end, we adopt methods based on machine learning to identify important features from the 3D body scan data. Autoencoders are a type of artificial neural network with an hourglass-shaped architecture. They are useful for extracting the intrinsic information from a high-dimensional input and for finding the most effective way of compressing such information into a lower-dimensional encoding. As shown in this paper, the graphical autoencoder can effectively extract body features and is not sensitive to random noise. To the best of our knowledge, no published studies in economics use three-dimensional graphical data of such high dimensionality.
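The hourglass idea can be sketched in a few lines of NumPy. The sketch below is an illustrative linear autoencoder trained on synthetic high-dimensional data, not the paper's graphical autoencoder (which operates on 3D mesh vertices); the dimensions, learning rate, and data-generating process are assumptions made purely for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(X, k, lr=0.01, epochs=2000):
    """Gradient descent on the mean squared reconstruction error.

    Encoder: d -> k (the "bottleneck" of the hourglass); decoder: k -> d.
    The k-dimensional codes play the role of the intrinsic features.
    """
    n, d = X.shape
    W_enc = rng.normal(scale=0.1, size=(d, k))
    W_dec = rng.normal(scale=0.1, size=(k, d))
    for _ in range(epochs):
        Z = X @ W_enc                 # low-dimensional codes
        G = 2 * (Z @ W_dec - X) / n   # gradient of reconstruction error
        W_enc -= lr * X.T @ (G @ W_dec.T)
        W_dec -= lr * Z.T @ G
    return W_enc, W_dec

# Synthetic "scans": 50 correlated measurements driven by 2 latent factors,
# mimicking how stature and obesity could generate many correlated measures.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(scale=0.3, size=(2, 50)) \
    + 0.01 * rng.normal(size=(200, 50))

W_enc, W_dec = train_autoencoder(X, k=2)
mse = np.mean((X - (X @ W_enc) @ W_dec) ** 2)   # small: 2 codes suffice
```

Because the synthetic data are essentially rank two, a two-dimensional bottleneck reconstructs them almost perfectly; the same logic motivates compressing a 45,534-dimensional scan into a handful of interpretable features.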
There has been increasing attention to geometric data, such as human body shapes, social networks, firm networks, product reviews, and geographical models, in economic studies. In this paper, we introduce a new methodology built on deep neural networks and show how it can be utilized to analyze an economic model when the available data have a geometric structure. When one attempts to incorporate geometric data into statistical analyses, there is no trivial grid-like representation for the data. As a result, encoding the features and characteristics of each data point into a numerical form is neither straightforward nor consistent. Most existing studies simplify the geometric features with some sparse characteristics. For instance, for human body data, many relevant studies quantify the geometric characteristics of a body shape with sparse measurements, such as height and weight. However, such methods do not always capture detailed geometric variations and often lead to incorrect statistical conclusions due to measurement errors. As a better alternative, we propose a graphical autoencoder that can interface with three-dimensional graphical data. The graphical autoencoder permits the incorporation of geometric manifold data into economic analyses. As we will discuss, direct incorporation of the graphical data can reduce measurement errors, because graphical data in general provide more comprehensive information on geometric objects than discrete geometric measurements.
Using the proposed graphical autoencoder, we successfully identify intrinsic features of body shape from the 3D body scan data. Interestingly, these intrinsic features of body type are important in explaining family income. We identify two intrinsic features forming males' body type and three intrinsic features for females' body type. In contrast to conventional principal component analysis, the graphical autoencoder allows us to interpret the extracted features. For both genders, the first feature captures how tall a person is (stature), while the second feature captures how obese the body type is (obesity). The third feature captures the hip-to-waist ratio of the body shape in the female sample.
As acknowledged in the literature, body types could be endogenous in that they can also be driven by unobserved determinants of income such as nutrition, personality, ability, and family background. In order to identify the causal impact of body types on family income, we correct for the possible endogeneity of body types using a proxy-variables approach and a control-functions approach. In particular, our identification strategy is to use variations in shoe size, jacket size for males (blouse size for females), and pants size as legitimate instrumental variables for stature in the control-functions approach. Testing the null of exogenous stature, we find that females' stature is endogenous but males' is not.
We summarize the main findings in Figure 1. In the estimation results from the deep-learned body parameters (right panel), we find that for males stature has a positive impact on family income and is statistically significant at the 5% significance level, while obesity is insignificant. We estimate that a one-centimeter increase in stature (converted to height) is associated with an approximately $998 increase in family income for a male who earns the median family income of $70,000. For females, the coefficient of obesity is negative and statistically significant at the 1% significance level. On the other hand, the coefficients of other features such as height and hip-to-waist ratio are statistically insignificant. A one-unit decrease in obesity (converted to BMI) is associated with an approximately $934 increase in family income for a female who earns $70,000 of family income. The results imply that a physical attractiveness premium, heterogeneous across genders, still exists in the relationship between body shapes and income, even after controlling for unobserved confounding factors. Education is statistically significant for both genders, but experience is significant only for the female sample. However, in the estimation results from the conventional body measures such as height and BMI (left panel), the magnitudes of the estimated coefficients are much smaller than those from the deep-learned body parameters, which supports the possibility of attenuation bias due to measurement errors. This suggests that height and BMI have limited power to describe body shapes, so any statistical analysis based on those simple measurements could lead to wrong economic inferences. Our findings also highlight the importance of correctly measuring body shapes to provide adequate public policies for healthcare.
The rest of the article is organized as follows. Section 2 presents the model of interest. Section 3 introduces and summarizes the CAESAR dataset. Section 4 discusses the impact of reporting errors in height and weight. Section 5 discusses estimation results for the impact of physical appearance on family income. Section 6 concludes. Empirical estimation results are contained in the Appendix.

Model
We consider the association between family income and body shapes as follows:

Family Income_i = Body Shapes_i β + X_i'α + ε_i,  (1)

where Family Income_i is log family income, Body Shapes_i is a measure of body types, X_i is a set of covariates, and ε_i represents unobserved causes of family income for individual i. We are particularly interested in the parameter β, but we also discuss the relationship between family income and other individual characteristics through the vector of parameters α.
We consider family income instead of individual income, since family income is available in the data. So we identify the combined effects of body shapes on income through the labor market and the marriage market. In fact, as documented in Chiappori et al. (2012), various studies found assortative matching on income, wages, education, and anthropometric characteristics such as weight or height in the marriage market. As a result, the total effects of body shapes in the labor and marriage markets are identifiable from family income, so it is worthwhile to investigate the impact of physical attractiveness on family income.
A large body of literature has analyzed the presence of earnings differentials based on physical appearance. One strand of the literature has focused on facial attractiveness. Hamermesh and Biddle (1994) analyzed the effect of physical appearance on earnings using interviewers' ratings of respondents' physical appearance. They found evidence of a positive impact of looks on earnings. Mobius and Rosenblat (2006) examined the sources of the beauty premium and decomposed the beauty premium that arises during the wage negotiation process between employer and employee in an experimental labor market. They identified transmission channels through which physical attractiveness raises an employer's estimate of a worker's ability. Scholz and Sicinski (2015) studied the impact of facial attractiveness on lifetime earnings. They found that the beauty premium exists even after controlling for other factors that enhance productivity in the labor market.
Other threads of the literature have analyzed the effects of height on labor market outcomes. Persico et al. (2004) found that an additional inch of height is associated with an increase in wages, a consistent finding in the literature in addition to racial and gender bias. They showed that height as a teenager is the source of the height wage premium, which implies that social factors associated with the accumulation of productive skills and attributes have positive effects on the development of human capital and the distribution of economic outcomes. Case and Paxson (2008) also found substantial returns to height in the labor market. However, they showed that the height premium is the result of a positive correlation between height and cognitive ability. Lundborg et al. (2014) found that the positive height-earnings association is explained by both cognitive and noncognitive skills observed in tall people. Deaton and Arora (2009) reported that taller people evaluate their lives more favorably, and the findings are explained by the positive association between height and both family income and education. Böckerman and Vainiomäki (2013) used twin data to control for unobserved ability and found a significant height premium in wages for women but not for men. Lindqvist (2012) studied the relationship between height and leadership and confirmed that tall men are significantly more likely to attain managerial positions. Cawley (2004) considered the effects of obesity on wages. He found that weight lowers the wages of white females and noted that one possible reason is that obesity has an adverse impact on the self-esteem of white females. Rooth (2012) used a field experimental approach to detect differential treatment against obese applicants in terms of the number of callbacks for a job interview in the Swedish labor market. He found that the callback rate for an interview was lower for both obese male and female applicants than for nonobese applicants.
Mathematically, human body shapes can be viewed as arbitrary 2-manifolds M embedded in the Euclidean 3-space R^3. In statistical analyses as in equation (1), quantifying the geometric characteristics of different manifold shapes and encoding them into a numerical form is not straightforward. Thus, these continuous manifolds are approximated by proxies in a tensor form. For this reason, many relevant works in the literature on physical appearance quantify the geometric characteristics of a human body shape with sparse measurements, such as height, weight, or BMI. As we will see in later sections, however, such quantification methods do not always capture detailed geometric variations and often lead to erroneous interpretations of statistical data. For instance, with height and BMI alone, one can hardly distinguish muscular individuals from individuals with round body shapes. The situation does not improve even if new variables, such as chest circumference, are added, since these variables are still not enough to codify all the subtle variations in body shapes. Moreover, such additional variables often merely add redundancy, without adding any substantial statistical description of the data, as the commonly used anthropometric parameters are highly correlated with each other. In addition, it is noteworthy that the manual selection of measurement variables can introduce one's bias into the model. In this paper, we compare several common ways of quantifying manifold-structured data with a newly proposed graphical autoencoder method.

Data
We use a unique dataset, called the Civilian American European Surface Anthropometry Resource (CAESAR) dataset.
It was collected from a survey of the civilian populations of three North Atlantic Treaty Organization (NATO) countries: the U.S., the Netherlands, and Italy. The survey was primarily conducted by the U.S. Air Force, and the U.S. sample is used in our study. The survey of the U.S. sample was conducted from 1998 to 2000 and carried out in 12 different locations, which were selected to obtain subjects in approximate proportion to the population in each of the four U.S. Census regions. The dataset contains 2,383 individuals whose ages vary from 18 to 65, with a demographically diverse population. It contains detailed demographics of subjects, anthropometric measurements done with a tape measure and caliper, and digital 3D whole-body scans of subjects. In contrast to traditional surveys, the data contain both reported and measured height and weight. This feature makes it possible to calculate reporting errors in the survey data and analyze their relations to the correctly measured height and weight as well as to individual characteristics. In addition, the 3D whole-body scan data make the CAESAR data serve as a good proxy for physical appearance, so that the potential issue of measurement errors can be mitigated.
Some of the 2,383 subjects in the database have missing demographic and anthropometric information; these have been deleted from our study. In addition, there are subjects who elected not to disclose and/or were not aware of their income, race, education, etc. These individuals have also been removed. In the analysis, we divide the sample by gender to take into account differential treatment across genders.
Tables 1-2 provide summary statistics of the variables in the database for males and females, respectively. The data have a single question about family income (grouped into ten classes). Average family income is $76,085 for males and $65,998 for females. The difference in average family income across genders may be due to the fact that the male sample includes more married people than the female sample. Median family income is slightly lower than mean family income, amounting to $70,000 for males and $52,500 for females. BMI has been commonly used as a screening tool for determining whether a person is overweight or obese. BMI is calculated as weight in kilograms divided by height in meters squared. We refer to the BMI based on reported height and weight as reported BMI, and to the BMI based on measured height and weight as measured BMI. In the tables, height, weight, and BMI are those measured by professional tailors at the survey sites. For both genders, reported BMI is slightly larger than measured BMI on average.
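The BMI definition used throughout can be written down directly. The figures below are illustrative, not drawn from the CAESAR sample.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2

# A BMI built from self-reported height/weight can differ from the measured one.
# E.g., for a hypothetical 1.75 m, 80 kg subject, under-reporting weight by 2 kg
# and over-reporting height by 2 cm lowers BMI noticeably:
measured = bmi(80, 1.75)
reported = bmi(78, 1.77)
```

This mechanical sensitivity is why small reporting errors in height and weight can translate into nontrivial errors in BMI.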
In addition to the biometric measurements, the data contain other variables for individual characteristics and socioeconomic backgrounds. Education, grouped into nine categories, averages 16.29 years for males and 15.75 years for females. Experience is calculated as potential experience = age − education − 6, and its mean is 17.54 years for males and 18.62 years for females. Fitness is defined as exercise hours per week. Its mean and median are 4.24 hours and 2.5 hours, respectively, for males. For females, its mean and median are 3.74 hours and 2.5 hours, respectively.
The data also include the number of children. Marital status is classified into three groups: single, married, and divorced/widowed. Occupation consists of white collar, management, blue collar, and service. Race has four groups: White, Hispanic, Black, and Asian. Birth region is grouped into five groups: Midwest, Northeast, South, West, and Foreign. The majority in the dataset are white-collar, married, White males and females born in the Midwest. As we will discuss later, the data also contain 40 body measures, including height and weight. The list of body measures is provided in Table 3.

Reporting Errors in Height and Weight
Several studies in the literature use survey data, so they assume either that there are no reporting errors in height and weight or that the reporting errors are classical in that they are not correlated with the true measures. Since our data contain both reported and measured height and weight, we can further investigate the properties of the reporting errors. We consider measured height and weight as the true height and weight, since they are measured by professional tailors. The reporting errors are calculated as Reporting Error_H = Reported Height − Height and Reporting Error_W = Reported Weight − Weight, respectively. The following equations estimate which personal backgrounds explain the reporting errors in height and weight:

Reporting Error_H,i = γ_0 + γ_1 Height_i + X_i'γ_2 + u_i,  (2)
Reporting Error_W,i = δ_0 + δ_1 Weight_i + X_i'δ_2 + v_i,  (3)

where X_i is a set of covariates including family income, age, age squared, occupation, education, marital status, fitness, race, and birth region, Height_i is the true height in millimeters, and Weight_i is the true weight in kilograms. We found dependence between the reporting errors and some covariates. Table 4 reports the estimation results. The standard errors are estimated by bootstrapping and are reported in parentheses. In equation (2), the coefficient of the true height is not statistically significant for either gender. We observe different results across genders. For males, family income is negatively correlated with the reporting error in height at the 1% significance level. Age squared is positively correlated with the reporting error. Hispanic males are more likely to under-report their height compared to White males. Males born in the Northeast are more likely to over-report their height relative to those from the Midwest. On the other hand, the coefficient of family income is not statistically significant for females. Older females are more likely to under-report their height. The estimation results are summarized in Figure 2.
In equation (3), the true weight is negatively correlated with the reporting error in weight (at the 1% significance level) for both genders: heavier people have a tendency to under-report their weight. For females, it is interesting that the coefficient of fitness is statistically significant at the 5% significance level and negatively correlated with the reporting error in weight. Thus females who spend more time on exercise have a tendency to under-report their weight. However, we find little evidence that other personal backgrounds are correlated with the reporting error in weight. The estimation results are summarized in Figure 3. It displays no significant relation between the reporting error and the true height for males. For females, we observe more reporting error at low height levels than at the average height level, which indicates that the reporting error for females' height is nonclassical in the sense that the reporting error and the true height are dependent. This finding is not captured by the linear mean regression in Table 4, where the reporting error in height is not related to the true height for either gender.
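A regression such as equation (2), with bootstrapped standard errors, can be sketched as follows. The data-generating process, sample size, and effect size below are invented for illustration and only mimic the qualitative finding for males (higher family income associated with under-reporting of height); none of it is the CAESAR data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical covariates and reporting rule (illustration only):
height = rng.normal(1750, 70, n)            # measured ("true") height, mm
income = rng.normal(70, 20, n)              # family income, $1,000s
reported = height - 0.2 * (income - 70) + rng.normal(0, 10, n)
err_h = reported - height                   # Reporting Error_H

def ols(y, X):
    """OLS coefficients with an intercept prepended."""
    X1 = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

X = np.column_stack([height, income])
beta = ols(err_h, X)                        # [const, height, income]

# Nonparametric bootstrap, as used for the standard errors in Table 4:
draws = np.array([ols(err_h[idx], X[idx])
                  for idx in (rng.integers(0, n, n) for _ in range(200))])
se = draws.std(axis=0)
```

Resampling subjects with replacement and re-running the regression gives standard errors without relying on homoskedasticity.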
The plots for the reported weight show more substantial nonclassical errors. Males at a low weight level (below approximately 75 kilograms) have a tendency to over-report their weight, but males above approximately 75 kilograms under-report their weight. Similarly, females at a low weight level (below approximately 50 kilograms) have a tendency to over-report their weight, but females above that level under-report theirs. Figures 6 and 7 plot the estimated conditional quantile functions of the reporting errors in height and weight for τ ∈ (0, 1) with their 95% confidence bands. We estimate the conditional quantiles using nonlinear polynomial regression. The figures display the median and the 10%, 25%, 75%, and 90% quantiles across genders. In Figure 6, the results show that there is heterogeneity in the conditional distribution of the reporting error in height for both genders. Over-reporting of height is more pronounced for males than for females: more than 75% of the male sample over-report their height. Interestingly, the median regression lines for both genders are approximately parallel to the horizontal line, which implies that the conditional median of the reporting error is independent of the true height. Bollinger (1998) also found that median regressions for earnings are more robust to reporting error than mean regressions. Thus, it would be more natural to impose a restriction on the conditional quantile of the reporting error of height than on the conditional mean, as in Hu and Schennach (2008). Figure 7 also displays apparent heterogeneity in the conditional distribution for both genders. It shows that among people heavier than average, under-reporting of weight is more pronounced for females than for males: within this group, almost 75% of the female sample under-report their weight. The median regressions for both genders depend on the level of the measured weight. This indicates that there are substantial nonclassical errors in the reported weight, so that a restriction on the conditional quantile may not be valid.

Estimation Results

To estimate the impact of height on family income, we consider the following two regression equations:

Family Income_i = β_1 Height_i + X_i'α + ε_i,  (4)
Family Income_i = β_1 Height_i + β_2 Weight_i + X_i'α + ε_i,  (5)

where X_i is a set of controls including experience, experience squared, race, occupation, education, marital status, and number of children. We can test the importance of controlling for weight by comparing the estimated coefficients from the two equations. In addition, as mentioned before, the data contain measurements of height and weight both reported by subjects and measured by on-site measurers. Therefore, by comparing estimates from the measured variables with their reported counterparts, we can see how much the reporting errors affect the estimation results. Table 5 reports estimation results from reported height and weight. Table 6 provides estimation results from measured height and weight.
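The quantile regressions discussed above minimize the "check" (pinball) loss. As a minimal sketch: for a constant model, the pinball-loss minimizer recovers the sample τ-quantile, which is the same principle the nonlinear polynomial quantile regressions apply conditionally. The data here are synthetic stand-ins, not reporting errors from the sample.

```python
import numpy as np

def pinball(u, tau):
    """Check loss: tau * u for nonnegative residuals, (tau - 1) * u otherwise."""
    return np.where(u >= 0, tau * u, (tau - 1) * u)

rng = np.random.default_rng(2)
y = rng.normal(0.0, 1.0, 2000)   # synthetic stand-in for reporting errors
tau = 0.75

# Minimize the average pinball loss over a grid of constant fits:
grid = np.linspace(-3, 3, 1201)
losses = [pinball(y - q, tau).mean() for q in grid]
q_hat = grid[int(np.argmin(losses))]
# q_hat approximates np.quantile(y, 0.75): the tau-quantile minimizes the
# expected check loss, so about 75% of the observations lie below it.
```

Replacing the constant with a polynomial in the true height or weight yields the conditional quantile curves of Figures 6-7.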
The hypothesis that the coefficient on height is zero is tested for each gender. Results for both genders are presented in each table. In equation (4) of Table 5, reported weight is not included. The column for males shows that education is statistically significant in the income equation. The coefficient of the reported height is positive and statistically significant at the 10% significance level. The column for females is somewhat different from that for males: the coefficients of experience, experience squared, and education are statistically significant. In addition, the coefficient on the reported height is positive and statistically significant at the 5% significance level. In equation (5), we add the reported weight to the set of regressors. The column for males shows that the coefficient of the reported height becomes statistically insignificant, and the coefficient of the reported weight is also insignificant. However, in the column for females the coefficient of the reported height is still positive and statistically significant, while the coefficient on the reported weight is insignificant.
In Table 6, we instead use the measured height and weight to estimate the income equation. Interestingly, the coefficients on height for males in both equations are positive and statistically significant at the 1% significance level.
Their magnitudes are larger than those from Table 5. When the measured weight is added, its coefficient is still insignificant for males. For females, the coefficients on the measured height are statistically significant in both equations, and their magnitudes are larger than those from Table 5. When the measured weight is added, its coefficient becomes significant at the 10% significance level, showing a negative association between family income and weight. Thus, we confirm that there are apparent reporting errors in height and weight. In particular, the impacts of the reporting errors on the estimation results are more severe for males than for females. These reporting errors bring attenuation bias to the estimates. Furthermore, the estimation results from the two equations (4)-(5) are different, which shows that using height alone as a proxy for body shapes might be too simple to describe the subtle features of physical appearance.
As in Cawley (2004), we consider BMI as the primary regressor in the regression equations to estimate the impact of obesity on income. Again, we add weight or height as an additional regressor to take into account a possible omitted-variables problem. So we consider the following income equations:

Family Income_i = β_1 BMI_i + X_i'α + ε_i,  (6)
Family Income_i = β_1 BMI_i + β_2 Weight_i + X_i'α + ε_i,  (7)
Family Income_i = β_1 BMI_i + β_2 Height_i + X_i'α + ε_i,  (8)

where BMI_i is the body mass index. We first estimate the equations using the reported variables and summarize the estimation results in Table 7. From the columns for males in the table, the coefficient of the reported BMI in equation (6) is statistically insignificant. Adding the reported weight or height does not change the result for the reported BMI, as in equations (7)-(8). Instead, the estimated coefficient of the reported height or weight is significant. For females, the coefficients of the reported BMI are insignificant in equations (6) and (8). However, in equation (7), the reported BMI has a negative impact on family income, and the relation is statistically significant at the 1% significance level. The coefficient of the reported height or weight is positive and significant at the 5% significance level.
We next estimate equations (6)-(8) using the measured BMI, height, and weight. For all equations in Table 8, the estimation results are somewhat different from those in Table 7. In equation (6) for males, the coefficient of BMI is still insignificant. When weight is included as in equation (7), its coefficient for males is positive and statistically significant at the 1% significance level. The coefficient of BMI becomes negative and statistically significant at the 5% significance level. When height is included as in equation (8), its coefficient for males is positive and statistically significant at the 1% significance level. However, the coefficient of BMI is statistically insignificant. For females, the results are different from those in Table 7. The coefficient of BMI is always negative and statistically significant. The coefficient of height or weight is positive and also statistically significant. The results for equation (8) are highlighted in Figure 8. The magnitudes of the coefficients for height are larger when measured height is used. For females, the impact of the measured BMI on family income is significant, in contrast to the insignificant impact of the reported BMI. Different signs on the effect of BMI across genders are also observed: a positive effect of male BMI as opposed to a negative effect of female BMI. Thus, the analysis confirms that the reporting errors in body measures have significant impacts on the estimated coefficients.
Interestingly, we also observe that the estimation results change significantly as different sets of body measures are included in the equations. One possible explanation is that even the measured height and BMI might not be perfect proxies for body types, although they are less prone to reporting errors. In fact, height, weight, and BMI are simple measures of body types, so they may miss useful information on the true body types (e.g., see Wada and Tekin (2010) for BMI). In order to further investigate the role of measurement errors on body types, we run the following regression equation:

log(FamilyIncome_i) = β_0 + Body_i'β + X_i'γ + ε_i,   (9)

where Body_i is a set of body measurements that includes 40 measurements on various parts of the body. Since these are more sophisticated than simple measurements of height and BMI, measurement errors on body type are less likely to be prevalent. The estimation results for the covariates are similar to those in Table 8 for both genders. Interestingly, we find eight statistically significant body measurements for males and four for females (see Figure 9). For instance, in the sample of males, Acromial Height (Sitting), Chest Circumference, and Waist Height (Preferred) have a positive association with family income, while Arm Length (Shoulder-to-Elbow), Buttock-Knee Length, Elbow Height (Sitting), Subscapular Skinfold, and Waist Circumference (Preferred) are negatively correlated with family income. For females, Shoulder Breadth is positively correlated with family income, whereas the coefficients on Face Length, Hand Length, and Neck Base Circumference are all negative.
The most distinctive result is that the coefficients on height and weight for both genders are statistically insignificant in this regression. This implies that there is useful information on body types embedded in the various body measures: body shapes or types cannot be fully captured by simple measures such as height or weight.
Moreover, it is possible that there are interactions between different body measurements, since they are closely related in constructing a body shape. We therefore consider the original regressors (X_i and Body_i in equation (9)) together with the interaction terms of Body_i as the set of regressors. This gives 797 covariates, which makes the OLS regression inconsistent. In order to mitigate the issue of high-dimensional data, we use the following Lasso regression, which is valid under a sparsity assumption:

β̂ = argmin_β (1/n) Σ_i (log(FamilyIncome_i) − Z_i'β)^2 + λ ||β||_1,

where Z is a vector of covariates including the interaction terms, with size p = 797, and λ ∈ [0, 1) is a regularization parameter. We construct the Lasso fit using 10-fold cross-validation. Figure 10 plots the mean squared error (MSE) over the sequence of the regularization parameter λ for each gender. For males, the minimum MSE is 0.231 at λ = 0.024, and the minimum MSE plus one standard error is 0.241 at λ = 0.047. For females, the minimum MSE is 0.216 at λ = 0.017, and the minimum MSE plus one standard error is 0.228 at λ = 0.058. The regression results show that many interaction terms are statistically significant. When equation (9) is re-estimated with these interaction terms (so-called "post-Lasso"), it obtains a higher adjusted R-squared (0.393 for males and 0.470 for females) than those in Table 9. Thus it is highly likely that these body measures are interrelated. However, constructing stylized body types based on these relevant body measures is not straightforward and is a nonstandard problem.
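A cross-validated Lasso over main effects and pairwise interactions can be sketched with scikit-learn. The data below are synthetic stand-ins with 10 rather than 40 body measurements, so the covariate count (55) differs from the paper's 797; the sparse data-generating process is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(1)
n, k = 300, 10
Body = rng.normal(size=(n, k))  # hypothetical body measurements
# Sparse truth: income loads on two measures and one interaction
y = (0.5 * Body[:, 0] - 0.3 * Body[:, 1]
     + 0.4 * Body[:, 2] * Body[:, 3] + rng.normal(0, 0.2, n))

# Main effects plus pairwise interactions, analogous to the paper's design
Z = PolynomialFeatures(degree=2, interaction_only=True,
                       include_bias=False).fit_transform(Body)
Z = StandardScaler().fit_transform(Z)

# Lasso with the regularization path chosen by 10-fold cross-validation
lasso = LassoCV(cv=10, random_state=0).fit(Z, y)
selected = np.flatnonzero(lasso.coef_)  # indices of covariates the Lasso retains
```

The nonzero entries of `lasso.coef_` play the role of the "relevant body measures" selected before the post-Lasso re-estimation.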

Physical Appearance and Graphical Autoencoder
Characterizing a geometric quantity such as the physical appearance of a human body using a sparse set of canonical features (e.g., height and weight) often causes unwanted bias and misinterpretation of data. For simple shapes like rectangles, canonical measures such as width and height already provide a complete description of the shape, so shape variation among rectangles can easily be described using the two canonical parameters without much issue. However, this seldom applies to more sophisticated shape variations, if at all. Hand-selected canonical shape descriptors may instead cause nonignorable error in capturing the genuine statistical distribution, by overlooking important geometric features or by redundantly measuring highly correlated variables, which can be thought of as a measurement error of some sort.
Unfortunately, extracting a complete and unbiased set of shape descriptors is not a trivial task. Furthermore, the task is highly problem-specific: for example, shape descriptors for car shapes would not be appropriate for describing human body shapes. To this end, we propose a novel data-driven framework for extracting complete, unbiased shape descriptors from a set of geometric data. The proposed framework utilizes an autoencoder neural network (Bourlard and Kamp, 1988) defined on a graphical model. In this section, we present an overview of the approach and demonstrate that the shape descriptors obtained through it can provide a better description of the data.

Graphical Autoencoder
Mathematically, human body shapes can be represented as curved surfaces, or more formally manifolds M^(i) embedded in R^3, where i is an index identifying each individual. A manifold is a topological space that locally looks like Euclidean space near each point. The statistical models that this paper concerns are, in generic form, a regression of an economic variable Y with respect to a manifold-structured regressor M and other covariates X:

Y = φ(M, X; θ) + ε,

where φ is a known function up to the unknown parameter θ and ε is an error term. Here, a problem arises regarding the regressor M: it is an abstract geometric object, not a usual vector variable as in typical economic and statistical models. In other words, no statistical model naturally accepts the manifold regressor M unless M is somehow converted into a vector form.
Due to this bottleneck, one may consider measuring a few geometric dimensions, such as lengths and girths, and using those measurements to encode body shapes. However, as shown in this paper, such simplistic measurements are not an accurate characterization of complex geometric objects such as human body shapes. Data-driven parameterizations such as those in Wang (2005), Baek and Lee (2012), and Pishchulin et al. (2017) provide a more comprehensive and reliable codification of body shapes, but many of these works assume that the human body shape distribution is linear, leading to inaccurate encoding of body shapes (Freifeld and Black, 2012; Baek, 2013).
In this paper, we employ a data-driven, nonlinear parameterization of body shapes achieved via a graphical autoencoder. An autoencoder is a type of artificial neural network with an hourglass-shaped architecture. It can be thought of as two multilayer perceptron (MLP) models cascaded sequentially: the first MLP codifies a high-dimensional input into a lower-dimensional embedding (encoder), and the second MLP reconstructs the original input back from the encoded embedding (decoder). Because of the dimensional bottleneck created in the middle, the neural network is driven to search for the most effective way of compressing the high-dimensional input into the lower-dimensional embedding.
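The hourglass idea can be illustrated with a plain (non-graphical) autoencoder. The sketch below uses scikit-learn's MLPRegressor trained to reproduce its own input on synthetic data lying on a two-dimensional manifold in R^10; the layer sizes and data are illustrative assumptions, not the mesh pipeline used in the paper.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
n = 400
# Synthetic stand-in for vertex data: points on a 2-D manifold embedded in R^10
z = rng.normal(size=(n, 2))
W = rng.normal(size=(2, 10))
X = np.tanh(z @ W)

# Hourglass MLP: encoder 16 -> bottleneck 2 -> decoder 16,
# trained to reproduce its own input (fit(X, X))
ae = MLPRegressor(hidden_layer_sizes=(16, 2, 16), activation="relu",
                  solver="adam", max_iter=3000, random_state=0)
ae.fit(X, X)

recon_mse = np.mean((ae.predict(X) - X) ** 2)
baseline_mse = np.mean((X - X.mean(axis=0)) ** 2)  # "predict the mean" baseline
```

Because the data truly have two intrinsic dimensions, the two-neuron bottleneck suffices and the reconstruction error falls well below the predict-the-mean baseline.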
The graphical autoencoder we propose here extends this notion of autoencoders to manifold-structured data. As in many geometric data analysis applications, we discretize a manifold M into a triangular mesh, achieving a piecewise-linear approximation of the original surface. A triangular mesh is a graph G = {V, E, F}, where V is the set of vertices, E the set of edges, and F the set of triangular faces. In this setting, the graphical autoencoder is defined as the composition of an encoder f = f_m ∘ ··· ∘ f_1 and a decoder g = g_m ∘ ··· ∘ g_1. Each of the layers f_1, ..., f_m and g_1, ..., g_m is modeled as a simple perceptron:

f_i(x) = σ(W_i x + b_i),

where W_i are neural weights and b_i are biases. σ is the activation function, which we empirically choose to be the rectified linear unit (ReLU) for the intermediate layers. We set linear activation for the terminal layers f_m and g_m (i.e., no rectification).
Finally, the graphical autoencoder is trained to minimize the mean squared error between the original mesh and the reconstructed mesh:

min_{θ_f, θ_g} || V − g(p; θ_g) ||^2,

where p = f(V; θ_f) by definition, θ_f and θ_g are the model parameters of f and g, respectively, and V is the list of vertex coordinates of a graphical model.
Figure 11 illustrates a schematic overview of the graphical autoencoder. As shown in the figure, the vertices of a topology-normalized graphical model act as input neurons in the autoencoder model. The input neurons are connected to the hidden neurons in the next layer, which are then connected in a chain through the "bottleneck" layer. The bottleneck layer has a significantly smaller number of neurons compared to the input layer, and hence the dimensionality compression occurs there. The latter half of the autoencoder is symmetric to the first half and finally reconstructs the bottleneck encoding into the original graphical model. The training process attempts to minimize the discrepancy between the reconstructed model and the original input by tuning the neural weights of the hidden layers.

Graphical Autoencoder on CAESAR Dataset
Each CAESAR scan includes 15,178 vertices, each with an (x, y, z) coordinate, giving 45,534 inputs per individual. In order to extract body shape parameters that encode the geometric characteristics of a person's appearance, we designed a graphical autoencoder consisting of seven hidden layers comprising 256, 64, 16, d, 16, 64, and 256 neurons, respectively, where d is the intrinsic data dimension, or the dimensionality of the embedding. The RMSprop optimizer was used for training. The dataset was randomly split into a training group and a validation group that was set aside during training, with an 80:20 ratio between the two. Training continued for 5,000 epochs with a batch size of 200 samples. As the criterion to evaluate the performance of the graphical autoencoder, we used the reconstruction error measured in mean squared error (MSE). As described above, the graphical autoencoder first embeds graphical data into a lower-dimensional embedding through the encoder part of the network, which is then reconstructed back into a graphical model through the decoder part; we compare how the reconstructed output differs from the original input.
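As a back-of-the-envelope check of this architecture's size, the snippet below counts parameters assuming fully connected (dense) layers between the 45,534 inputs and the hidden stack; the density assumption is ours, since the paper does not state the layer connectivity.

```python
# Dense-layer parameter count for the stack above, with d = 3
d = 3
layers = [45534, 256, 64, 16, d, 16, 64, 256, 45534]

# Each adjacent layer pair contributes a weight matrix (a*b) plus a bias vector (b)
params = sum(a * b + b for a, b in zip(layers[:-1], layers[1:]))
print(params)  # dominated by the two layers touching the 45,534-dim input/output
```

Almost all of the roughly 23 million parameters sit in the first and last layers, which is why the interior 256-64-16 taper is computationally cheap by comparison.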
The first experiment tested the ability of the graphical autoencoder to embed the geometric information underlying the data. To this end, we applied the aforementioned graphical autoencoder to the entire CAESAR dataset, varying the embedding dimension d from 1 to 20, as reported in Figure 12. The embedding accuracy was below 3 × 10^-4 m^2 in most cases. In particular, dimension d = 3 showed the lowest MSE in both training and validation losses, which provides a justification for estimating d = 3 as the intrinsic dimension.
As for the meaning of the three embedded parameters, the first component, P_1, was discerned to be related to the height of a person, and P_2 to body volume (obesity/leanness). Interestingly, as P_3 increases, the body shape becomes more feminine (namely, a more prominent chest and hip-to-waist ratio); conversely, as it decreases, the body shape becomes more masculine, with a less prominent chest and fewer curves.
Based on this observation, we conducted a similar experiment in which the graphical autoencoder was trained on each gender separately. Among the 2,383 subjects in the CAESAR dataset, there were 1,122 males and 1,261 females. The two groups were separated into two experimental sessions, in each of which they were further split into training and validation groups with the same 80:20 ratio.
The new experiment with separate genders demonstrated a trend similar to the first experiment in terms of how the intrinsic dimension affects the reconstruction error, as visualized in Figure 13. Interestingly, however, this time the reasonable intrinsic dimension d was observed to be 2 for male subjects. We interpret this result as follows: since the two genders are now separated, the role of P_3 (feminine/masculine) is less significant than before, and thus the gain in accuracy from including the third dimension becomes negligible for males. This interpretation did not hold for the female population, however, since the accuracy was in fact higher when P_3 was included. Our explanation is that female body shapes exhibit greater variation in body curves than male body shapes, and therefore the third component has greater significance for females. We therefore select d = 2 for males and d = 3 for females. Lastly, we note that convergence was slower when the two genders were separated, and a measurable gain in accuracy could be observed even after 1,000 epochs, which was not the case when the two genders were combined in training. This could be because the number of training samples is significantly smaller (about half) than in the previous case, reducing the representative power of the data.
Figure 14 illustrates the body shape space spanned by the parameters obtained from the graphical autoencoder for each gender. 3D body shape models are arranged in accordance with their body shape parameters, with increments of −3σ, −1.5σ, 0, 1.5σ, and 3σ with respect to the mean in each direction, where σ is the standard deviation of each parameter. Body shapes for males (left) and females (right) display similar patterns over changes in the two parameters. Overall, the first parameter P_1 reflects how tall a person is: a smaller value of P_1 indicates a shorter person relative to the rest of the population, and vice versa. P_2 reflects how lean a person is: a large value of P_2 corresponds to a more obese person, while a small value corresponds to a slimmer and more fit person. In order to better understand these parameters, we consider a linear fit of BMI, height, or weight on each parameter.
Figure 15 displays the relation between the body shape parameters and the conventional body measurements for males. P_1 is positively correlated with BMI, height, and weight; among these, height is the most highly correlated with P_1 (approximately R^2 = 0.70). P_2 is negatively correlated with height but positively correlated with BMI and weight; BMI has the highest correlation with P_2 (approximately R^2 = 0.69). Figure 16 displays the corresponding relations for females, and the patterns are close to those for males in Figure 15. As discussed before, the female sample produces an additional feature, P_3, which we visualize in Figure 17. As shown in the figure, P_3 captures the hip-to-waist ratio for females, which is unique to the female dataset. For simplicity, we will therefore interpret P_1, P_2, and P_3 as features associated with a person's stature, obesity, and hip-to-waist ratio, respectively.
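The R^2 of such a linear fit can be computed as below. The data are synthetic, with the correlation between the stature feature and height set near the reported R^2 ≈ 0.70 by assumption; nothing here comes from the CAESAR sample.

```python
import numpy as np

def r_squared(x, y):
    """R^2 of the simple linear regression of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return 1.0 - resid.var() / y.var()

rng = np.random.default_rng(5)
p1 = rng.normal(size=500)  # hypothetical stature feature
# Height built so that corr(P1, height) ~ 0.84, i.e. R^2 ~ 0.70 as in the text
height = 170 + 7 * (0.84 * p1 + rng.normal(0, np.sqrt(1 - 0.84 ** 2), 500))
r2 = r_squared(p1, height)
```

Running the same helper for BMI or weight against each parameter reproduces the kind of comparison plotted in Figures 15 and 16.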
It is worth noting that the extracted features P_1, P_2, and P_3 perform better in explaining nonlinear variations in body shapes than simple measures such as height, BMI, and hip-to-waist ratio. As shown in Figure 14, the body shape space spanned by P_1 and P_2 displays dynamic and nonlinear patterns as the parameters vary. In addition, we consider a linear prediction of female P_3 using various body measures and report the results in Table 10 (also in Figure 18). It shows that many body parts are highly associated with P_3. This confirms that P_3 captures complexity in female body shapes and reflects not only the hip-to-waist ratio but also variations in other body parts. Nevertheless, we call P_3 the hip-to-waist ratio for convenience.
Figure 14: Body Shape Parameters Derived from the Graphical Autoencoder. 3D body shape models for males (left) and females (right) are arranged in accordance with their body shape parameters, with increments of −3σ, −1.5σ, 0, 1.5σ, and 3σ with respect to the mean in each direction, where σ is the s.d. of each parameter.

Extracted Body Types and Family Income
We now use the measurements of body type extracted by the graphical autoencoder in the previous section. We estimate equation (1) with the extracted body types in place of the set of body measurements Body_i:

log(FamilyIncome_i) = β_0 + β_1 P_1i + β_2 P_2i + β_3 P_3i + X_i'γ + ε_i,   (17)

where P_1i, P_2i, and P_3i are the body types of individual i. Table 12 reports estimation results across genders with the same set of controls. In equation (17), we add all intrinsic features of the body shape to the income equation. For males, only the coefficient of P_1 is statistically significant; P_2 does not explain family income. A one standard deviation increase in males' P_1 is associated with a 0.052 increase in log family income. For females, on the other hand, only the coefficient of P_2 is statistically significant; P_1 and P_3 are not correlated with family income. When these insignificant variables are dropped, as in equations (15) and (16), the regression equations attain a higher adjusted R-squared. The results show that a one standard deviation decrease in females' P_2 is associated with a 0.056 increase in log family income.
For comparison, we replace the extracted body types with height, BMI, and hip-to-waist ratio and re-estimate the above equations. The estimation results are reported in Table 11. In both genders, height has a positive and statistically significant impact on log family income, while the estimated coefficients of BMI and hip-to-waist ratio are insignificant. In particular, BMI was significant at the 10% level in Table 8 when height and BMI were included; however, BMI is no longer significant when the hip-to-waist ratio is added. We observe no gender differential in the impact of body types. These results confirm that the estimation of the income equation with conventional body measurements is susceptible to variable selection and yields different conclusions from those obtained with our proposed method.

Proxy Variables Approach
If unobserved determinants of family income, such as individual personality and childhood nutrition, are correlated with physical appearance, the estimates in the previous section are inconsistent. To recover the causal relationship, we use the proxy variables approach, in which observed proxies for the unobserved determinants resolve the possible issue of endogeneity. The set of observed proxies includes fitness, car size, birth state, and survey site. We choose these variables as relevant proxies because fitness and car size could be related to individual preferences over body types, while birth state and survey site could reflect local nutrition environments. To control for the endogeneity, we thus assume the conditional independence of body types and the unobserved determinants of family income, conditional on the observed proxies.
We estimate equation (17) controlling for various subsets of the proxy variables. As a comparison, we use measured height, BMI, and hip-to-waist ratio in place of P_1, P_2, and P_3, respectively. The estimation results are reported in Table 13. The estimated coefficients of height are similar to those in the columns for equation (8) for both genders in Table 8; in particular, height still has positive effects on family income for both genders. On the other hand, the estimated coefficient of BMI for females becomes significant: BMI has negative effects in the female subsample.
Table 14 reports estimation results for equation (17) with P_1, P_2, and P_3. It is worth noting that fitness and car size are not statistically significant, but birth state (Northeast) and survey site (Dayton, OH for males; Marlton, NJ for females) are. When all proxy variables are included, the estimated coefficient of P_1 for males is 0.056 and statistically significant at the 1% level. Thus taller males tend to have higher family income, but we do not find a statistically meaningful relationship between males' obesity and family income. For females, P_2 is negatively associated with family income, and its coefficient is statistically significant at the 1% level. Thus females' obesity matters negatively for family income, while their stature and hip-to-waist ratio are not associated with it. The results are qualitatively similar to those in Table 12.

Control Functions Approach
Even though some unobserved determinants of family income have been controlled for, there may be other unobserved factors such as individual ability. As reported in Persico et al. (2004) and Case and Paxson (2008), among others, stature is likely to be highly correlated with individual cognitive and noncognitive abilities. Thus we conjecture that P_1, or stature, is a possibly endogenous regressor. To address this issue, we look for a set of valid instrumental variables (IV). Our identification strategy is to assume that the unobserved determinants of shoe size, pants size, and jacket size for males (or blouse size for females) are uncorrelated with individual cognitive and noncognitive abilities. Given this condition, we adopt these unobserved size determinants, or variations, as valid IVs and incorporate the instrumental variables or control functions (CF) approach into the economic model based on the graph convolution method.

(Definitions of the proxy variables: Fitness is measured as hours of exercise per week. Car size is classified into two groups: Sedan (compact, economy, intermediate, full size, luxury, sports car) and Non-sedan (SUV, minivan, station wagon, truck, van). Birth states are classified into five groups: Foreign, West, Midwest, South, and Northeast. Survey sites include LA (CA), Detroit (MI), Ames (IA), Dayton (OH), Greensboro (NC), Marlton (NJ), Ottawa (Ontario, CAN), Minneapolis (MN), Houston (TX), Portland (OR), San Francisco (CA), and Atlanta (GA).)
Figure 19 shows a graphical depiction of the causal diagram (see, e.g., Pearl (2000) and Chalak and White (2011)). Solid circles denote observed variables and dashed circles denote unobserved determinants. Arrows denote direct causal relations, and a line without arrows denotes dependence between two variables. The primary parameter of interest is the impact of body type on family income. Body type (stature) is endogenous due to the dependence between body type determinants and cognitive and noncognitive abilities. Shoe size, jacket (blouse) size, and pants size are determined by body type as well as by unobserved determinants such as personal size preference. We assume that such unobserved size determinants are uncorrelated with ability; given this condition, they can serve as legitimate IVs to identify the causal relation. Indeed, it is plausible to assume that these variables are legitimate IVs. First, they are unlikely to be correlated with the unobserved determinants of family income such as cognitive and noncognitive abilities: it is natural to assume that highly able people do not necessarily wear bigger shoes or pants given their foot length or waist circumference, and it is unlikely that these IVs directly cause family income. Second, as shown in the estimation results below, they are strongly correlated with body type.
Since the size determinants, or variations, are unobserved, we estimate them from the projection of the observed sizes on the most relevant body parts. In practice, we project the reported shoe size, pants size, and jacket (or blouse) size on the measured foot length, waist circumference, and chest circumference, respectively. The residuals from each projection are used as the estimated size determinants (analogous to the 'residuals as instruments' in Hausman and Taylor (1983)).
We consider the following first-step reduced-form equation:

P_1i = α_0 + α_1 ShoeSizeDeterminants_i + α_2 JacketSize/BlouseSizeDeterminants_i + α_3 PantsSizeDeterminants_i + X_i'δ + ν_i,   (18)

where ShoeSizeDeterminants_i is individual i's estimated variation (or determinants) in shoe size, JacketSize/BlouseSizeDeterminants_i is the estimated variation in jacket size for males (or blouse size for females), PantsSizeDeterminants_i is the estimated variation in pants size, X_i is a set of exogenous regressors, and ν_i is an idiosyncratic shock to P_1i. By construction, ν_i is the component that generates the endogeneity. From the reduced-form equation (18), we estimate the control function ν̂_i. In the second step, we estimate the income equation with the control function added:

log(FamilyIncome_i) = β_0 + β_1 P_1i + β_2 P_2i + β_3 P_3i + X_i'γ + π ν̂_i + ε_i.   (19)

Since ν̂ corrects for the sources of the endogeneity, we can consistently estimate the parameters associated with physical appearance. Another nice feature of the control functions approach is that we can test whether physical appearance is endogenous by checking whether π = 0.
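The two-step control function logic can be sketched on simulated data in which unobserved ability drives both stature and income. Here `shoe_var` stands in for the estimated shoe-size determinant (the residual from the projection described above), and all magnitudes are illustrative assumptions rather than the paper's estimates.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

def ols(y, X):
    """OLS with an intercept; returns the coefficient vector (intercept first)."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Hypothetical model: unobserved ability raises both stature (P1) and income,
# so a naive regression of income on P1 is biased upward
ability = rng.normal(size=n)
shoe_var = rng.normal(size=n)  # stand-in for the estimated shoe-size determinant (IV)
p1 = 0.8 * shoe_var + 0.6 * ability + rng.normal(0, 0.5, n)
log_income = 0.1 * p1 + 0.5 * ability + rng.normal(0, 0.2, n)

# Step 1 (eq. 18): reduced form of P1 on the instrument; residual = control function
a = ols(p1, shoe_var)
v_hat = p1 - (a[0] + a[1] * shoe_var)

# Step 2 (eq. 19): income on P1 plus the control function
b = ols(log_income, np.column_stack([p1, v_hat]))
# b[1] estimates the causal effect of P1; a nonzero b[2] signals endogeneity
```

In this simulation the naive slope of income on P1 alone exceeds the control-function estimate `b[1]`, which recovers the true causal coefficient of 0.1, and the significant coefficient on `v_hat` plays the role of the π = 0 endogeneity test.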
Table 16 reports estimation results for equations (18)-(19). In the columns for equation (18), all IVs are statistically significant and positively correlated with stature in both genders. Experience is also positively associated with an individual's stature, and the relation is nonlinear. People who were born in foreign countries or the Northeast are likely to be shorter than those born in the Midwest. For males, education is negatively correlated with stature, and Asian people are likely to be shorter than White people. For females, Hispanic and Asian people are likely to be shorter than White people.
In the columns for equation (19), the estimated coefficient of P_1 for males is 0.097, which is larger than that in Table 14, and it is statistically significant at the 5% level. Thus taller males tend to have higher family income. Interestingly, we do not find a statistically significant causal relationship between males' obesity and family income. We estimate that a one standard deviation increase in P_1 is associated with a 0.097 × $70,000 = $6,790 increase in family income for a male earning the median family income of $70,000. This is equivalent to a (0.097/6.8) × $70,000 = $998.5 increase in family income per centimeter. Note that for males, one standard deviation in P_1 is equivalent to 6.8 centimeters in height, and one standard deviation in P_2 is equivalent to 4.07 kg/m^2 in BMI.
The estimation results for the covariates resemble those in the previous tables. As shown in the literature on the returns to education, education has a positive impact on family income; its estimated coefficient is 0.046 and statistically significant at the 1% level. Males born in the Northeast tend to have higher family income than those born in the Midwest. Interestingly, the estimated coefficient of ν̂ is statistically insignificant, so we find no strong evidence that P_1 is an endogenous regressor for males.
For females, the estimated coefficient of P_1 is negative and that of P_3 is positive, but neither is statistically significant. P_2 is negatively associated with family income; its coefficient is −0.069 and statistically significant at the 1% level. Thus a female's obesity matters negatively for her family income, but we do not find strong causal impacts of her stature or hip-to-waist ratio. A one standard deviation decrease in P_2 is associated with a 0.069 × $70,000 = $4,830 increase in family income for a female with $70,000 of family income. This can be interpreted as a (0.069/5.17) × $70,000 = $934.2 increase per unit of BMI. Note that for females, one standard deviation in P_1 is equivalent to 6.8 centimeters in height, and one standard deviation in P_2 is equivalent to 5.17 kg/m^2 in BMI.
For females, experience is important for higher family income; as commonly reported in the literature on the wage equation, experience displays a quadratic functional form. Education has a positive impact on family income, and its coefficient is statistically significant at the 1% level, similar to the finding for males. Likewise, females born in the Northeast tend to have higher family income than those born in the Midwest. The estimated coefficient of ν̂ is statistically significant, so we find substantial evidence that stature is endogenous in the female income equation.
As a comparison, we apply the control functions approach to the income equations in which height, BMI, and hip-to-waist ratio are used in place of the extracted body features; the corresponding reduced-form and structural equations have the same form as equations (18)-(19). Table 15 reports the estimation results. For males, we estimate that a one centimeter increase in height is associated with a 0.01 × $70,000 = $700 increase in family income for a male earning the median family income of $70,000. This estimated effect is smaller than that obtained with the features extracted by deep learning. For females, pants size is not associated with height in the first-step regression. The estimated coefficient of ν̂ is significant in the female sample, which implies evidence of endogenous female height. Interestingly, for females, the coefficients of height, BMI, and hip-to-waist ratio are all insignificant, so we do not find strong evidence of body shape effects when height, BMI, and hip-to-waist ratio are used. As a result, the estimation results with the conventional measurements are volatile across different regression models (OLS, the proxy variables approach, and the control functions approach). In contrast, those with the deep-learned body parameters are very stable across models and capture the gender differential in the impact of body types on income.
Finally, we summarize the estimated coefficients in Figure 20 for males and Figure 21 for females, comparing the results from the conventional body measures to those from the deep-learned body parameters, with and without controlling for the endogeneity. The estimated effects from the measured height, BMI, and hip-to-waist ratio are substantially different from those from the deep-learned parameters. One can interpret this difference as a limitation of conventional body measures in describing appearance. In fact, it is widely known (CDC document, accessed 2019) that BMI is only a surrogate measure of body fatness: it neither distinguishes fat, muscle, and bone mass, nor describes the distribution of fat within the body. As such, differences in family income can be falsely attributed to stature even when the true underlying statistics suggest otherwise. To illustrate this problem, consider a tall, visually obese male and a short, muscular male who happen to have the same BMI. Since there is no difference in BMI, the difference in family income must then be explained by stature, which may lead to an inaccurate conclusion. This, however, was not the case for the deep-learned parameters.

Conclusion
This paper studies the relationship between physical appearance and family income. We show that there are significant reporting errors in reported height and weight, and that these discrete measurements are too sparse to provide a complete description of body shape. In fact, these reporting errors are shown to be correlated with individual backgrounds. We also find that the regression of family income on the self-reported measurements suffers from reporting errors and delivers biased estimates compared to the regression on the true measurements. These findings shed light on the importance of measuring body types for public policy, instead of simply relying on subjects' self-reports.
We introduce a new methodology built on a graphical autoencoder from deep machine learning. From the three-dimensional whole-body scan data, we identify two intrinsic features of human body shape for males and three for females. These body features are presumably less likely to suffer from measurement errors on physical appearance. We also take into account the possible issue of endogenous body shapes by utilizing proxy variables and control functions approaches. The empirical results document a positive impact of stature and a negative impact of obesity on family income for males. For females, obesity is the only significant feature, and it negatively affects family income. These findings support the hypotheses of a physical attractiveness premium and of differential treatment across genders in labor market outcomes. Finally, we believe the proposed method can be applied to many interesting research questions in economics that deal with geometric data, such as graphical, image, spatial, and social network data.

Figure 1 :
Figure 1: Summary of the Estimation Results for the Family Income Equation. Estimated coefficients and bootstrapped 90% confidence bands are reported. The left panel presents results from the conventional body measures and the right panel reports results from the deep-learned body parameters obtained through the graphical autoencoder.

Figure 2 :
Figure 2: Personal Background and Reporting Errors in Height. Estimated coefficients and bootstrapped 90% confidence bands are reported. Note that the unit for height is converted into centimeters (cm).

Figures 4 and 5
Figures 4 and 5 plot the estimated conditional expectations of the reporting errors in height and weight given the true measures, namely E[Reporting Error H | Height] and E[Reporting Error W | Weight], with their 95% confidence bands. We use nonparametric kernel estimation, where the kernel function is an Epanechnikov kernel and the bandwidth is chosen by Silverman's rule of thumb. The confidence bands are estimated by a nonparametric bootstrap. The solid line represents zero reporting error. The nonparametric plots for height show that the reported height is larger than the true height at almost all height levels for both genders, an over-reporting pattern. For males, there is no significant relation between the reporting error and the true height. For females, we observe larger reporting errors at low height levels than at the average height level, which indicates that the reporting error in females' height is nonclassical in the sense that the reporting error and the true height are dependent. This finding is not captured by the linear mean regression in Table 4, where the reporting error in height is unrelated to the true height for both genders.
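As a sketch of the estimator behind these plots, the conditional mean E[error | height] can be computed with a Nadaraya-Watson kernel regression using an Epanechnikov kernel and Silverman's rule-of-thumb bandwidth. The data below are simulated purely for illustration, not the CAESAR measurements:

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on |u| <= 1, zero elsewhere."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)

def silverman_bandwidth(x):
    """Silverman's rule of thumb: 1.06 * sigma * n^(-1/5)."""
    return 1.06 * np.std(x) * len(x) ** (-0.2)

def nw_regression(x, y, grid, h=None):
    """Nadaraya-Watson estimate of E[y | x] at each grid point."""
    h = silverman_bandwidth(x) if h is None else h
    w = epanechnikov((grid[:, None] - x[None, :]) / h)
    return (w * y).sum(axis=1) / w.sum(axis=1)

# Simulated example: a reporting error that grows with true height.
rng = np.random.default_rng(1)
height = rng.uniform(150, 190, 500)
error = 0.02 * (height - 150) + rng.normal(0, 0.5, 500)
grid = np.linspace(155, 185, 7)
est = nw_regression(height, error, grid)
print(est)
```

In practice the confidence bands around such an estimate would come from a nonparametric bootstrap, resampling (height, error) pairs and re-running the kernel regression on each resample.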

Figure 3 :
Figure 3: Personal Background and Reporting Errors in Weight. Estimated coefficients and bootstrapped 90% confidence bands are reported.

Figure 4 :
Figure 4: Conditional Mean of Reporting Errors in Height Conditional on True Height.

Figure 5 :
Figure 5: Conditional Mean of Reporting Errors in Weight Conditional on True Weight.

Figure 6 :
Figure 6: Conditional Quantile of Reporting Errors in Height Conditional on True Height.

Figure 7 :
Figure 7: Conditional Quantile of Reporting Errors in Weight Conditional on True Weight.

Figure 8 :
Figure 8: Comparison of Reported and Measured Body Measures. Estimated coefficients and bootstrapped 90% confidence bands are reported. We provide results from reported body measures (Reported) and measured body measures (Measured). Note that the unit for height is converted into centimeters (cm).

Figure 9 :
Figure 9: Various Measures and Family Income. Estimated coefficients and bootstrapped 90% confidence bands are reported for males (left) and females (right). Note that the units for all measurements except weight are converted into centimeters (cm).

Figure 11 :
Figure 11: Schematic Illustration of the Proposed Graph Autoencoder. A discrete-sampled scalar field acts as the input and output nodes of the autoencoder. The intermediate layers are similar to ordinary autoencoder layers.

Figure 12 :
Figure 12: Result of Training the Graphical Autoencoder on the Entire CAESAR Dataset. The abscissa is the number of training epochs and the ordinate is the model loss in terms of MSE. The left panel shows the loss on the training dataset (training loss) and the right panel shows the loss on the validation dataset (validation loss). The accuracy did not show any significant improvement after 1,000 epochs in any case and is thus removed from the figure for better visualization.

Figure 13 :
Figure 13: Result of Training the Graphical Autoencoder Separately on Each Gender. The abscissa is the number of training epochs and the ordinate is the model loss in terms of MSE. The left panel shows the loss on the training dataset (training loss) and the right panel shows the loss on the validation dataset (validation loss).

Figure 15 :
Figure 15: Relation between Body Shape Parameters and the Conventional Body Measurements for Males. The straight line displays the linear fit. The R-squared is reported in parentheses.

Figure 16 :
Figure 16: Relation between Body Shape Parameters and the Conventional Body Measurements for Females. The straight line displays the linear fit. The R-squared is reported in parentheses.

Figure 17 :
Figure 17: The Third Body Shape Parameter P3 for Females. The third parameter tends to capture the hip-to-waist ratio of the body shape in the female subsample.

Figure 18 :
Figure 18: Various Measures and P3 for Females. Estimated coefficients and bootstrapped 90% confidence bands are reported for females. Note that the units for all measurements except cup size are converted into centimeters (cm).

Figure 20 :
Figure 20: Comparison of Conventional Body Measures and Deep-learned Body Parameters (Male). Estimated coefficients and bootstrapped 90% confidence bands are reported. We provide results from conventional body measures (Conventional) and deep-learned body parameters (Deep) with and without the control function (CF) approach. Note that the unit for height is converted into centimeters (cm).

Figure 21 :
Figure 21: Comparison of Conventional Body Measures and Deep-learned Body Parameters (Female). Estimated coefficients and bootstrapped 90% confidence bands are reported. We provide results from conventional body measures (Conventional) and deep-learned body parameters (Deep) with and without the control function (CF) approach. Note that the unit for height is converted into centimeters (cm).
The males' average reported weight is 86.03 kilograms and their average measured weight is 86.76 kilograms. The medians of the two measurements are the same. For females, the average reported weight is 67.88 kilograms and the average measured weight is 68.81 kilograms. The median reported weight is 63.49 kilograms and the median measured weight is 64.85 kilograms. In both subsamples, the standard deviations of weight are large, approximately 17 kilograms.

Table 9
presents the estimation results. Except for height and weight, for brevity we only report measures of body parts that are statistically significant. Coefficients on age, race, occupation, education, and marital status are very similar.

Causal Diagram for the Impact of Body Type on Income. Shoe, pants, and jacket size determinants are independent of individual ability, so they can serve as valid instrumental variables.

Table 3 :
List of Various Body Measures

Table 4 :
The Association between Reporting Error in Height/Weight and Personal Background

Table 5 :
The Association between Reported Height/Weight and Family Income

Table 6 :
The Association between Height/Weight and Family Income

Table 7 :
The Association between Reported BMI and Family Income

Table 8 :
The Association between BMI and Family Income

Table 9 :
The Association between Various Body Measures and Family Income

Table 9 :
The Association between Various Body Measures and Family Income (Continued)

Table 10 :
The Association between Female P3 and Various Body Measures

Table 11 :
The Association between BMI/Height/Hip-to-waist-Ratio and Family Income

Table 12 :
The Association between Body-type Parameters and Family Income

Table 13 :
The Association between BMI/Height/Hip-to-waist-Ratio and Family Income -Proxy Variable Approach

Table 14 :
The Association between Body-type Parameters and Family Income -Proxy Variable Approach

Table 16 :
The Association between Body-type Parameters and Family Income -Control Function Approach