Depression is commonly comorbid with many other somatic diseases and symptoms. Identification of individuals in clusters with comorbid symptoms may reveal new pathophysiological mechanisms and treatment targets. The aim of this research was to combine machine-learning (ML) algorithms with traditional regression techniques by utilising self-reported medical symptoms to identify and describe clusters of individuals with increased rates of depression from a large cross-sectional community based population epidemiological study.
A multi-staged methodology utilising ML and traditional statistical techniques was performed using the community based population National Health and Nutrition Examination Survey (2009–2010) (N = 3,922). A Self-organised Mapping (SOM) ML algorithm, combined with hierarchical clustering, was performed to create participant clusters based on 68 medical symptoms. Binary logistic regression, controlling for sociodemographic confounders, was then used to identify the key clusters of participants with higher levels of depression (PHQ-9≥10, n = 377). Finally, a Multiple Additive Regression Tree boosted ML algorithm was run to identify the important medical symptoms for each key cluster within 17 broad categories: heart, liver, thyroid, respiratory, diabetes, arthritis, fractures and osteoporosis, skeletal pain, blood pressure, blood transfusion, cholesterol, vision, hearing, psoriasis, weight, bowels and urinary.
Five clusters of participants, based on medical symptoms, were identified to have significantly increased rates of depression compared to the cluster with the lowest rate: odds ratios ranged from 2.24 (95% CI 1.56, 3.24) to 6.33 (95% CI 1.67, 24.02). The ML boosted regression algorithm identified three key medical condition categories as being significantly more common in these clusters: bowel, pain and urinary symptoms. Bowel-related symptoms were found to dominate the relative importance of symptoms within the five key clusters.
Citation: Dipnall JF, Pasco JA, Berk M, Williams LJ, Dodd S, Jacka FN, et al. (2016) Into the Bowels of Depression: Unravelling Medical Symptoms Associated with Depression by Applying Machine-Learning Techniques to a Community Based Population Sample. PLoS ONE 11(12): e0167055. https://doi.org/10.1371/journal.pone.0167055
Editor: Igor Branchi, Istituto Superiore Di Sanita, ITALY
Received: June 6, 2016; Accepted: November 8, 2016; Published: December 9, 2016
Copyright: © 2016 Dipnall et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The original and cleaned data for the NHANES data used in this study is open access and located at the URL http://wwwn.cdc.gov/Nchs/Nhanes/Search/nhanes09_10.aspx.
Funding: MB is supported by a NHMRC Senior Principal Research Fellowship 1059660. LJW is supported by a NHMRC Career Development Fellowship 1064272. FNJ is supported by an NHMRC Career Development Fellowship 1108125. The author(s) received no specific funding for this work.
Competing interests: JFD has no conflicts of interest in relation to this manuscript. JAP has recently received grant/research support from the National Health and Medical Research Council (NHMRC), BUPA Foundation, Amgen,/GlaxoSmithKline/Osteoporosis Australia/Australian and New Zealand Bone and Mineral Society, Western Alliance, Barwon Health, Deakin University and the Geelong Community Foundation. MB has received Grant/Research Support from the NIH, Cooperative Research Centre, Simons Autism Foundation, Cancer Council of Victoria, Stanley Medical Research Foundation, MBF, NHMRC, Beyond Blue, Rotary Health, Geelong Medical Research Foundation, Bristol Myers Squibb, Eli Lilly, Glaxo SmithKline, Meat and Livestock Board, Organon, Novartis, Mayne Pharma, Servier and Woolworths, has been a speaker for Astra Zeneca, Bristol Myers Squibb, Eli Lilly, Glaxo SmithKline, Janssen Cilag, Lundbeck, Merck, Pfizer, Sanofi Synthelabo, Servier, Solvay and Wyeth, and served as a consultant to Astra Zeneca, Bioadvantex, Bristol Myers Squibb, Eli Lilly, Glaxo SmithKline, Janssen Cilag, Lundbeck Merck and Servier. Drs Copolov, MB and Bush are co-inventors of provisional patent 02799377.3-2107-AU02 “Modulation of physiological process and agents useful for same”. MB and Laupu are co-authors of provisional patent 2014900627 “Modulation of diseases of the central nervous system and related disorders”. MB is supported by a NHMRC Senior Principal Research Fellowship 1059660. LJW is supported by a NHMRC Career Development Fellowship 1064272. SD has received grants/research support from the Stanley Medical Research Institute, NHMRC, Beyond Blue, ARHRF, Simons Foundation, Geelong Medical Research Foundation, Fondation FondaMental, Eli Lilly, Glaxo SmithKline, Organon, Mayne Pharma and Servier, speaker’s fees from Eli Lilly, advisory board fees from Eli Lilly and Novartis, and conference travel support from Servier. 
FNJ has received Grant/Research support from the Brain and Behaviour Research Institute, the National Health and Medical Research Council (NHMRC), Australian Rotary Health, the Geelong Medical Research Foundation, the Ian Potter Foundation, Eli Lilly, the Meat and Livestock Board and The University of Melbourne and has received speakers honoraria from Sanofi-Synthelabo, Janssen Cilag, Servier, Pfizer, Health Ed, Network Nutrition, Angelini Farmaceutica, and Eli Lilly. She is supported by an NHMRC Career Development Fellowship (#1108125). DM has received grant/research support from the Australian research Council (ARC), Mental Illness Research Fund (MIRF), Victorian Department of Justice, Beyond Blue, Swinburne University of Technology, Federal University. This does not alter our adherence to PLOS ONE policies on sharing data and materials.
Abbreviations: DIPIT, Data Integration Protocol In Ten-Steps; ML, Machine-learning; MART, Multiple Additive Regression Trees; NCHS, National Center for Health Statistics; NHANES, National Health and Nutrition Examination Survey; PHQ-9, Patient Health Questionnaire-9; SOMs, Self-organizing maps
Depression is a debilitating illness that is estimated to affect 350 million people globally and is frequently associated with somatic symptoms and other medical conditions [1,2]. The nature and direction of these relationships are often complex, interrelated, and difficult to unravel. Depression classically presents with many and diverse somatic symptoms. The comorbidity of depression with a number of chronic medical conditions, such as Irritable Bowel Syndrome (IBS), ischemic heart disease, cancer, diabetes, osteoporosis, thyroid disease, and obesity, has also been well established. However, these conditions often have bidirectional relationships with depression such that this level of comorbidity and interrelatedness can complicate treatment and stymie efforts to identify causal factors in depression. Thus, the identification of individuals in clusters of comorbid symptoms in depression may reveal new pathophysiological mechanisms and treatment targets.
Due to the complexity and heterogeneity of medical data, previous studies have primarily investigated individual medical conditions linked to depression. “Big data” and machine-learning (ML) techniques and algorithms can handle heterogeneous data without strict constraints and have been demonstrated to unearth key patterns and interactions in health data [10,11]. The mapping of multidimensional data onto two-dimensional maps [12–14] with ML techniques allows the researcher to visualise and interpret the complexity of the data and generate new hypotheses regarding depression.
ML is a vast and expanding field of artificial learning where algorithms improve performance through experiential learning. In the health arena, ML algorithms that learn by training on subsets of data have been used to fit models using supervised ML (i.e. where the objective of the exercise is to establish the main inputs to predict known values), and to find patterns in data using unsupervised ML (i.e. where the objective is to uncover previously unknown patterns and clusters within the data set, without any a priori model defined). Blending of unsupervised and supervised ML techniques has been used to detect patterns and relationships within large numbers of complex lifestyle-environ variables. Notoriously complex in nature, medical symptom data are ideally suited to blended ML techniques. Utilising the learning properties of ML it is possible to detect, visualize and understand the composition of medical symptom clusters for those with psychiatric disorders such as depression [19,20].
ML techniques have been used across a variety of disciplines to explore and model very large quantities of data to discover patterns, unsuspected relationships and useful rules for a specific purpose. Often unsuspected and novel interpretations of the data (serendipity) are uncovered. Commercially, these techniques have been used successfully for businesses to learn from their transaction data about the behaviour of their customers, improving their business model by exploiting this knowledge. However, it has only been over the last 10 years that ML techniques have been used in medical research, primarily in neuroscience and biomedicine [22,23]. More recently ML techniques have been used in psychiatry, using predominantly very big data sets. Complex survey methodologies are often implemented with population-based data (e.g. oversampling in underrepresented groups, stratification, clustering) and traditional statistical techniques are capable of dealing with this complexity. However, big data techniques on their own do not adequately account for this type of sample. Thus, a blend of big data ML techniques with traditional statistical techniques has the potential to uncover hidden patterns while accounting for the complex sampling.
The aim of this research was to combine unsupervised and supervised ML algorithms with traditional regression techniques, utilising self-reported medical symptoms, to identify and describe clusters of individuals with increased rates of depression from a large cross-sectional community based population epidemiological study.
Study design and participants
The 2009–2010 National Health and Nutrition Examination Survey (NHANES) cross-sectional civilian noninstitutionalized population based data were utilised for this study. This study included 18 to 80 year old non-institutionalised US civilians (N≈10,000) and applied a complex four-stage sampling methodology: counties; segments within counties; households within segments; and individuals within households. Data were collected from 15 locations across 50 US states, with oversampling of subgroups of the population of particular public health interest, to increase the reliability and precision of population estimates. Questionnaire data relating to medical symptoms and demographics were downloaded from the NHANES website and integrated using the Data Integration Protocol In Ten Steps (DIPIT).
Variables were initially selected based on the criterion of relevance to medical symptoms. Analysis was performed to minimise the degree of missing data across the set of medical symptoms. The final set of 68 dichotomous medical symptom variables and an unweighted sample size of 3,922 was used for clustering in this research study. There were 377 participants identified with depression, being representative of the total depressed sample for NHANES during 2009–2010 (i.e. 8% after adjustment for the complex survey sample structure). The imbalanced nature of the data was addressed in this study by identifying clusters with high rates of depression (i.e. high risk clusters) rather than individual participants with depression. This meant that within each high risk cluster the imbalance was much reduced. This was the primary rationale for undertaking the Self-organised Mapping (SOM) and clustering of individuals, thereby allowing the identification of the key clusters significantly associated with depression using binary logistic regression. Finally, the most important medical symptoms for identifying depressed individuals were identified for each of these key clusters.
NHANES received approval from the National Center for Health Statistics (NCHS) research ethics review board and informed consent was obtained from all participants. Use of data from the NHANES 2009–2010 database is approved by the National Center for Health Statistics Research Ethics Review Board (Continuation of Protocol #2005–06).
A self-reported Patient Health Questionnaire-9 (PHQ-9) was used to assess depressive symptoms (‘depression’). This questionnaire consisted of nine items that were summed to form a total score. Those with a total score of 10 or more were considered moderately or severely depressed. The 68 medical symptom variables were classified into 17 broad medical categories: heart, liver, thyroid, respiratory, diabetes, arthritis, fractures and osteoporosis, pain (i.e. neck, back, hip pain), blood pressure, cholesterol, vision, hearing, psoriasis, weight, bowels, urine, and whether a blood transfusion was received. The self-report demographic and socio-economic variables from the NHANES demographic and questionnaire data components were also utilised.
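The scoring rule described above can be illustrated with a minimal Python sketch. It assumes the standard PHQ-9 item coding of 0 to 3 per item (consistent with a total score range of 0 to 27); the function and constant names are hypothetical and are not taken from the study's own code, which was written in R and Stata.

```python
from typing import Sequence

PHQ9_ITEMS = 9    # nine questionnaire items, each assumed to be scored 0-3
PHQ9_CUTOFF = 10  # a total of 10 or more is treated as depressed in this study


def phq9_total(items: Sequence[int]) -> int:
    """Sum the nine item scores after checking their count and range."""
    if len(items) != PHQ9_ITEMS:
        raise ValueError("PHQ-9 requires exactly nine item scores")
    if any(score not in (0, 1, 2, 3) for score in items):
        raise ValueError("each PHQ-9 item must be scored 0, 1, 2 or 3")
    return sum(items)


def is_depressed(items: Sequence[int]) -> bool:
    """Apply the total-score threshold used to define 'depression'."""
    return phq9_total(items) >= PHQ9_CUTOFF
```

Under these assumptions, a participant answering 1 on every item has a total of 9 and falls just below the threshold.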
This research implemented two ML algorithms: an unsupervised algorithm, combined with hierarchical clustering, to create the medical symptom clusters and a supervised algorithm to identify and describe the key clusters with a significant relationship with depression. Due to the complex sampling methodology of the NHANES data, traditional binary logistic regression was implemented to identify these key clusters while controlling for potential socio-demographic confounders.
A summary of the statistical methodology, testing regime and results is outlined in Fig 1.
Medical symptom cluster identification
Self-organizing maps (SOMs) were introduced by Kohonen in 1995  as a variant of artificial neural networking, inspired by biological neural networks, and have since been used in many diverse applications across a variety of fields including bioinformatics, engineering, financial analysis, experimental physics, and psychiatry [31,32]. SOMs provide a simple and effective unsupervised ML algorithm for clustering individual participants and visualising high dimensional data in a low dimensional map without any reliance on distributional assumptions.
The SOM identifies clusters by effectively packing the dataset onto a q-dimensional plane where data points “similar” to each other in the original multidimensional data space are then mapped onto nearby areas of the q-dimensional output space. SOMs combine competitive learning with dimensionality reduction by smoothing the clusters with respect to an a priori grid. The SOM is called a topology-preserving map because multi-dimensional input data are often represented by a two dimensional “map” of nodes where topological properties of the input space are maintained.
The SOM competitive ML algorithm initially assigns a random weight vector to each node (or position on the grid), then randomly chooses data points (participants) from the training data and presents them to the SOM. The “Best Matching Unit” (BMU) in the map is the node with a weight vector most similar to a data point, and nodes within the “neighbourhood” of each BMU are found. With each iteration, the size of this neighbourhood decreases. The weight vectors of nodes in the BMU neighbourhood are adjusted closer to their associated data points. The size of these adjustments decreases with each iteration, and the magnitude of each adjustment is proportional to the proximity of the node to the BMU. These steps are repeated for N iterations or until the weight vectors for all the nodes converge to their final values.
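The training steps above can be sketched in a few lines of Python. This is an illustrative simplification and not the Kohonen package implementation used in the study: it assumes a rectangular rather than hexagonal grid, a Gaussian neighbourhood function, and one presented data point per iteration. The function names and the default learning rates (declining linearly from 0.05 to 0.01) are choices made for this sketch.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position whose weight vector is closest to x."""
    return min(weights, key=lambda node: math.dist(weights[node], x))


def train_som(data, rows, cols, iterations, lr_start=0.05, lr_end=0.01, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Step 1: assign a random weight vector to every node on the grid.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    radius_start = max(rows, cols) / 2.0
    for t in range(iterations):
        frac = t / max(iterations - 1, 1)
        lr = lr_start + (lr_end - lr_start) * frac      # linearly declining rate
        radius = max(radius_start * (1.0 - frac), 0.5)  # shrinking neighbourhood
        # Step 2: present one randomly chosen participant to the map.
        x = rng.choice(data)
        # Step 3: find the best matching unit for that participant.
        bmu = best_matching_unit(weights, x)
        # Step 4: pull the BMU and its grid neighbours towards the data point,
        # with less influence the further a node sits from the BMU.
        for node, w in weights.items():
            grid_dist = math.dist(node, bmu)
            if grid_dist <= radius:
                influence = math.exp(-(grid_dist ** 2) / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += lr * influence * (x[i] - w[i])
    return weights
```

With dichotomous symptom vectors as input, each participant can afterwards be assigned to a node by calling `best_matching_unit` on the trained weights.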
For this study a hexagonal map topology was used, with five SOM grids tested (10x10, 15x15, 20x20, 25x25, 30x30) to establish a map with suitable nodes. The final solution utilised a 15x15 grid with a learning rate for weight adjustment declining linearly from 5% to 1% over 100 iterations. The unconstrained nature of the SOM technique meant that clusters of nodes form naturally from the medical symptom data on the grid without the influence of the participant’s depressive symptom status. Hierarchical clustering, using the complete linkage method , was then utilised to group SOM nodes with similar final weights, identifying the final clusters. Three to 12 cluster solutions were considered and the cluster solution with the most differentiation in terms of depression was chosen for further investigation. The clusters were numbered in order of their rates of depression (i.e. frequency and average total PHQ-9 score).
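The grouping of SOM nodes by complete linkage can be illustrated as follows. This is a minimal sketch assuming Euclidean distance between node weight vectors; the function name is hypothetical, and the simple search over all cluster pairs is written for readability rather than speed (it is adequate for a map of a few hundred nodes).

```python
import math


def complete_linkage(vectors, n_clusters):
    """Agglomerative clustering; returns clusters as lists of vector indices."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(math.dist(vectors[i], vectors[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Merge the pair of clusters with the smallest complete-linkage distance.
        _, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Calling the function with `n_clusters` set to each value from 3 to 12 would produce the range of candidate solutions, which could then be compared on depression rates.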
Identification of key clusters with higher depression rates
Quantitative and qualitative investigation, using exploratory statistics of the resultant clusters was used to establish variation with respect to depression rates and demographics.
Demographic factors were included in a binary logistic regression model to identify the key participant clusters with a significant positive relationship with depression, accounting for the complex survey design of NHANES. This model controlled for potential confounders and quantified the probability of depression within each cluster. The cluster with the lowest depression rate was chosen as the reference group. This stage of the analysis was used to identify participant clusters with significant rates of depression in order to identify the important medical symptoms from the ML boosted regression. Only these key clusters were used in the next stage of supervised ML boosted regression. No further investigation was performed on those clusters with non-significant odds ratios for depression.
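As a much simplified illustration of comparing a cluster with the reference cluster, an unadjusted odds ratio and Wald confidence interval can be computed from a 2x2 table of counts. This sketch deliberately ignores the survey weights, strata and socio-demographic covariates that the study's logistic regression accounted for, so it would not reproduce the reported estimates; the function name and any counts passed to it are hypothetical.

```python
import math


def odds_ratio_ci(dep_cluster, nondep_cluster, dep_ref, nondep_ref, z=1.96):
    """Unadjusted odds ratio with a Wald interval from four cell counts."""
    odds_ratio = (dep_cluster * nondep_ref) / (nondep_cluster * dep_ref)
    # The standard error of log(OR) is the root of the summed reciprocal counts.
    se_log_or = math.sqrt(1 / dep_cluster + 1 / nondep_cluster
                          + 1 / dep_ref + 1 / nondep_ref)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper
```

A cluster would be flagged as having significantly higher odds of depression than the reference when the lower bound of the interval exceeds 1.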
Medical symptoms most prominent within key clusters
Supervised ML boosted regression, translated to a binary logistic regression analysis, was implemented for each of the key clusters to identify the most prominent medical symptoms associated with depression within these clusters. This technique has been previously used to identify biomarkers associated with depression and to describe lifestyle clusters associated with depression using data from the NHANES study. Depression was treated as a binary outcome, and the model was run for each key cluster using Friedman’s Multiple Additive Regression Trees (MART) boosted algorithm [37,38]. Consistent with previous research using this ML algorithm on the 2009 to 2010 NHANES data, validation was performed using a random split of each data set into 60% training and 40% validation, a regularization shrinkage parameter of 0.001, with 50% of the residuals used to fit each successive tree (50% bagging). The maximum number of boosting interactions (i.e. number of terminal nodes plus 1) allowed was six, being marginally higher than the default (i.e. five) and within the recommended range. Whilst this technique has been used for predictive purposes, it can also be used as a variable selection method. Here it was used as a variable selection technique to identify the prominent medical symptoms associated with depression within the key clusters. A relative importance (or contribution) of each medical symptom variable for each of the key significant clusters was produced from the ML boosted regression. Higher values of relative importance for a medical symptom within a particular key cluster indicate a stronger relationship with depression in this cluster. This technique for variable reduction has been recognised as effective and previously used to delineate lifestyle clusters associated with depression.
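The idea of boosting with shrinkage, bagging and relative importance can be sketched in Python as follows. This is a simplified stand-in and not the Stata plugin used in the study: it assumes dichotomous (0/1) predictors, uses one-split trees instead of trees with up to six interactions, takes the mean residual as each leaf value, and defaults to a larger shrinkage than 0.001 so that a small demonstration converges quickly. All names are hypothetical.

```python
import math
import random


def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))


def boost_stumps(X, y, n_trees=200, shrinkage=0.1, bag_fraction=0.5, seed=0):
    """Boost one-split trees on 0/1 features under a logistic loss.

    Returns (predict, importance); importance is a percentage per feature.
    """
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    base_rate = sum(y) / n
    f0 = math.log(base_rate / (1.0 - base_rate))  # assumes 0 < base_rate < 1
    scores = [f0] * n
    stumps = []                  # (feature, value if 0, value if 1)
    gains = [0.0] * n_features
    for _ in range(n_trees):
        # Pseudo-residuals: observed outcome minus current fitted probability.
        resid = [yi - sigmoid(fi) for yi, fi in zip(y, scores)]
        # Bagging: each tree is fitted to a random fraction of the residuals.
        bag = rng.sample(range(n), max(2, int(bag_fraction * n)))
        best = None
        for j in range(n_features):
            ones = [resid[i] for i in bag if X[i][j] == 1]
            zeros = [resid[i] for i in bag if X[i][j] == 0]
            if not ones or not zeros:
                continue
            m1, m0 = sum(ones) / len(ones), sum(zeros) / len(zeros)
            # Reduction in squared error achieved by splitting on feature j.
            gain = len(ones) * len(zeros) / len(bag) * (m1 - m0) ** 2
            if best is None or gain > best[0]:
                best = (gain, j, m0, m1)
        if best is None:
            continue
        gain, j, m0, m1 = best
        gains[j] += gain
        stumps.append((j, m0, m1))
        for i in range(n):
            scores[i] += shrinkage * (m1 if X[i][j] == 1 else m0)
    total = sum(gains) or 1.0
    importance = [100.0 * g / total for g in gains]

    def predict(row):
        f = f0 + sum(shrinkage * (m1 if row[j] == 1 else m0)
                     for j, m0, m1 in stumps)
        return sigmoid(f)

    return predict, importance
```

In this sketch the relative importance of a symptom is the share of total squared-error reduction contributed by splits on that symptom, which mirrors the way relative importance is commonly defined for boosted trees. It does not indicate the direction of the relationship.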
Those medical symptoms explaining at least 80% of the total log likelihood variation across clusters were used to identify the most important medical symptoms for explaining differences across clusters. Resultant medical symptoms were then grouped into the 17 broad medical categories.
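The selection rule can be expressed as a short routine that ranks variables by relative importance and keeps them until the running total reaches the threshold. This is a sketch of one plausible reading of the rule; the function name is hypothetical, and any importance values supplied to it in an example would be illustrative rather than study results.

```python
def top_by_cumulative_importance(importance, threshold=80.0):
    """Return variable names, most important first, until the running
    total of relative importance (in percent) reaches the threshold."""
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    selected, running = [], 0.0
    for name, value in ranked:
        if running >= threshold:
            break
        selected.append(name)
        running += value
    return selected
```

The variables returned for each key cluster could then be mapped to their broad medical categories and the percentages summed within each category.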
The SOMs and hierarchical clustering were performed in R with the SOM using the Kohonen package . The boosted regression and binary logistic regression statistical procedures were performed using Stata V14 software (StataCorp., 2014), with a Stata plugin for the boosted regression component of the analysis .
A summary of the results from the testing is presented in Fig 1.
The distance from each node’s weights to the sample of people represented by that node was reduced to a minimum plateau as the SOM training iterations progressed, indicating that no more iterations were required (Fig 2). Taking into account the heterogeneous nature of the self-reported medical symptom data, the counts plot indicated a reasonable distribution of people numbers across the map. The neighbour distance plot indicated the distances between each node and its neighbours were mostly similar with only a few dissimilar nodes, later identified as outlying clusters (Fig 2).
Note: The “Training progress” graph indicates that, as the SOM training iterations progress, the distance from each node's weights to the samples represented by that node reduces and plateaus, indicating that no more iterations were required. The “Counts plot” indicates that reasonable numbers of samples were mapped to each node on the map. The “Neighbour distance plot” or U-Matrix indicates the distance between each node and its neighbours.
Three to 12 cluster solutions were considered (Fig 3) and the 10 cluster solution was selected for further investigation because of clear cluster differences in terms of depression rates. There were some isolated nodes in this cluster solution, later confirmed as outliers.
Note: The 3 to 12 cluster solutions mapped onto the SOM grid. Colours indicate different clusters. The final 10 cluster solution selected for further analysis has been highlighted with a red border.
The final 10 cluster solution contained two dominant clusters (Table 1). One cluster was dropped from further analysis due to very low frequency (n = 8), leaving 9 of the 10 clusters for further analysis.
Initial investigation into the relationship between the remaining nine participant clusters and the depression measures revealed that the clusters exhibited an order with respect to both the percentage of participants depressed within each cluster and the average depression score (Fig 4).
Note: “Mean Depression Score” is the average total PHQ-9 score which ranged from 0 to 27. “Percent Depressed” based on a total PHQ-9 ≥ 10.
An initial inspection of the socio-demographics for the nine clusters (Table 2) showed clear differences. Due to the small frequencies for many of the clusters, only a qualitative investigation of socio-demographic differences was performed. Cluster 1 (n = 3,108) exhibited socio-demographics closest to the total across all cluster participants. Cluster 2 (n = 34) consisted of mostly male, non-Hispanic white with a high family income poverty ratio [40,41] and who were less likely to have never married. Cluster 3 (n = 57) consisted mostly of male, non-Hispanic white, older, married / with a partner, a household size of around two people, and a low family income poverty ratio. Cluster 4 (n = 446) members were more likely to be female, non-Hispanic white, middle aged, with a low family income poverty ratio, and less likely to have never been married. Cluster 5 (n = 50) were more likely to be older, non-Hispanic black, with a low family income poverty ratio, and less likely to never have been married. Cluster 6 (n = 83) members were more likely to be male, older, less than three members in the household, non-Hispanic white, and of low to mid family income poverty ratio, and less likely to have never been married. Cluster 7 (n = 55) were more likely to be middle aged, Mexican / Hispanic, with a low family income poverty ratio and less likely to have never been married. Cluster 8 (n = 52) were more likely to be female, older, non-Hispanic white, around two members in the household, with low family income poverty ratio and less likely to have been married. Cluster 9 (n = 29) were more likely to be young Mexican / Hispanic, with a large household and low family income poverty ratio.
Identification of key clusters with higher depression rates
The final binary logistic regression with depression as the outcome took into account the complex survey design of NHANES, as well as non-linearity, interactions and potential confounders (Table 3). The test for goodness of fit was not significant for the model, indicating a good fit to the data (F(9,8) = 1.77, p = 0.216). Clusters 4 and 6 to 9 had significantly higher rates of depression than cluster 1 after controlling for the potential socio-demographic confounders. These five clusters were considered the key clusters for further analysis. Since the odds ratios for depression for clusters 2, 3 and 5 did not significantly differ from cluster 1, these clusters were excluded from further analysis. A significant interaction was found between the cluster with the highest rate of depression (cluster 9) and the family income poverty ratio (p = 0.036) (Fig 5). Thus, the relationship between the probability of depression and cluster 9 varied depending upon the level of the family income poverty ratio.
Medical symptoms most prominent within key clusters
ML boosted regression was used to establish which medical symptoms were associated with depression for each of the five key significant clusters. The top medical symptom variables explaining approximately 80% of the total log likelihood for each cluster were selected for categorisation and further investigation. Bowel symptoms (e.g. bowel movements per week, stool type) dominated the relative importance percentage across all five key clusters (Fig 6). Further investigation into the top 3 to 10 ranked medical categories from the ML boosted regression found that bowel, pain and urine symptoms consistently exhibited a relatively high importance percentage for each of the key clusters.
Note: Based on total boosted relative importance percentage across all clusters. Summed percentage from boosted regression across all five key significant clusters, thus total >100%.
The top 10 key medical symptom categories for the five key significant clusters indicated that each cluster exhibited different medical symptoms (Fig 7). However, bowel symptoms were consistently included in the highest ranked medical symptoms across all five significant depressive key clusters. In addition, the bowel symptoms dominated for cluster 7 and cluster 9, and had relatively high importance (i.e. >5%) for four of the five key clusters. Pain symptoms had the highest relative importance in cluster 4 and urine symptoms had relatively high importance (i.e. <10%) for two of the five key clusters. Whilst hearing symptoms were important in all five of the key clusters, they only dominated in cluster 8.
Note: Clusters presented in order of percent depressed. Note: Percentage sum does not take account of direction of relationship.
The individual clusters showed clear delineation with respect to medical conditions. The top three medical symptoms for cluster 4 related to the skeletal symptoms of pain, fractures and osteoporosis, and bowel symptoms. Cluster 6 was dominated by urinary medical symptoms. Cluster 7 was clearly dominated by bowel medical symptoms. Cluster 8 was a generally unwell cluster with the top five medical symptoms related to hearing, pain, bowels, respiratory and heart. Finally, the top two medical symptoms for cluster 9 related to bowels and urine.
Irrespective of country, research has consistently found a high level of comorbidity between specific (e.g. sleep, appetite) and nonspecific symptoms and depression [43,44] but it has been difficult to identify the key somatic symptoms most prominent in this condition. This study utilized two machine learning techniques, complemented by traditional binary logistic regression analyses, to detect complex interactions between large numbers of medical symptoms in order to identify those most strongly linked to depression in an atheoretical manner. ML techniques have been used in the area of big data informatics in mental health. For example, text analysis  and regression models  have been used to predict the risk of suicide from clinical notes, but these techniques have not previously been used to investigate the relationship between depression and medical symptoms using epidemiological community based population data. The visual simplification of complex medical symptom data into clusters, using SOM, allows the researcher to easily identify the strength of the similarities across the map. The ML SOM’s intention to mimic an artificial network that learns, without supervision, has proven effective in creating nodes, subsequently grouped into clusters identified by a standard hierarchical clustering. Nine clusters of participants based on medical symptoms were found using the unsupervised graphical SOM ML technique. Traditional binary logistic regression showed that five of the nine clusters were characterised by higher rates of depression after controlling for potential confounders and taking account of the complex survey methodology of the population data.
A boosted regression ML algorithm was used to provide a relative importance percentage for each medical symptom for each of the five key significant clusters, allowing the easy grouping of symptoms into medical categories. The ML boosted regression algorithm was able to untangle the array of medical symptoms and detect three key medical condition categories as being particularly related to depression: bowel, pain and urinary symptoms. Of these categories, bowel symptoms dominated, validating previous research regarding the high comorbidity between gut symptoms and IBS with common mental disorders, including depression [3,47].
Gut disorders in particular share links with depression. Irritable bowel syndrome (IBS)  has been found to be closely associated with mental health conditions. IBS is not only comorbid with psychiatric conditions, but also comorbid with non-gastrointestinal somatic disorders . Crohn’s disease  and gastro-oesophageal reflux disease (GORD)  are similarly associated with higher rates of mood disorders than would be expected by chance. All these interrelationships impact on the quality of life, treatment compliance, length of stay in hospitals, costs of health care, morbidity and possibly mortality of individuals affected.
Medical symptoms relating to stool type and frequency and constipation were included in the bowel categorisation for this study, and these indicators have all been related to mood . Recently, ML boosted regression has identified an association between the gastrointestinal biomarker of bilirubin with depression  and bilirubin has been linked to varying stool type based on the speed at which the intestinal contents travel through the bowel .
There is an increasing focus in medical research on the role of symbiotic gut microbiota in health and disease, including mental health. Indeed, the human gut microbiota, and what is termed the ‘gut-brain axis’, are now increasingly regarded as potentially critical drivers of mood and behaviour, with much of the biological dysregulation associated with depressive symptoms and the diagnosis of clinical depression influenced by the gut microbiota . Such microbiota-influenced dysregulation involves inflammatory, metabolic, oxidative stress, HPA axis, neurotransmitter/neuropeptide, brain plasticity and other systems . Moreover, the normal intestinal barrier function is compromised in depression . This ‘leaky gut’ allows intestinal-microbe-derived lipopolysaccharide (LPS), an endotoxin, to gain access to the periphery. Even very low levels of LPS can provoke much of the aforementioned biological dysregulation noted in depression.
Importantly, many of the lifestyle and environmental factors connected to depression have a detrimental influence on the composition of the normal human microbiota. As just one example, unhealthy dietary patterns that increase the risk for depression also diminish microbial diversity. Long-term, habitual diets are one of the strongest influences on gut microbial composition, determining an individual’s “enterotype”; however, dietary change can prompt change in gut microbiota composition within 24 hours. The consumption of complex carbohydrates, plant-based foods, fruits and vegetables [58,60] positively influences microbial composition, synthesis of anti-inflammatory short chain fatty acids, and host health. Conversely, high fat diets trigger microbial dysbiosis, intestinal permeability (‘leaky gut’) and inflammation. We have previously demonstrated that healthy dietary patterns are associated with a reduced likelihood of depressive symptoms in adults participating in the NHANES. This suggests that unhealthy dietary behaviors may be a key factor negatively influencing both gut health and depression, with bowel symptoms signifying poor gut health.
Strengths and Limitations
The strengths of this study lie in the use of both unsupervised and supervised ML techniques to identify patterns in data, using a large number of heterogeneous self-reported medical symptoms to form five clusters of individuals with relatively high rates of depression that would most likely have remained hidden using traditional statistical techniques. The largest cluster of participants (cluster 4, n = 446) comprised 7% moderately and 7% severely depressed participants; this compares to rates of 5% and 3% respectively in the general 2009 to 2010 US population in NHANES. The remaining key clusters (6 to 9) consisted of smaller groups of participants, with 15% moderately and 14% severely depressed participants overall. A main limitation of this study is the cross-sectional nature of the NHANES data, which restricts the ability to infer causality. However, the use of these community population based survey data has the advantage of being representative of the large US population sampled during 2009 to 2010. The large number of participants included in this study, with its rigorous complex survey sampling methodology, ensures the data provide a good description of the relative characteristics of the civilian noninstitutionalised US population. As compared to other methods of data gathering, surveys are able to extract data that closely mirror attributes of the larger population.
It is acknowledged that the PHQ-9 instrument relates to depressive symptoms, and does not represent a clinical diagnosis of depression. Thus, this self-report instrument may have missed less severe cases of depression [27,28], exaggerating the imbalance in the data. Furthermore, the depression symptoms picked up by the PHQ-9 instrument for this study, such as fatigue, psychomotor problems, or insomnia, are very common in medical conditions. Thus, it was not surprising that the results from this study confirmed prior research identifying that depressive symptoms are often elevated in people with medical symptoms. The relationship between medical symptoms and depression is complex and often bidirectional. However, the identification of the dominant medical symptoms, such as those of the bowel cluster in this study, may be used to improve screening tools for depression in medically ill patients and to shed light on possible pathogenic processes. It is acknowledged that individuals with depression are more likely to report somatic conditions, and IBS has been found to be a disorder with a psychosomatic aspect. However, the NHANES study is considered representative of the US noninstitutionalised civilian population and has been used to produce health statistics for the US and in many studies investigating depression (e.g. to examine the prevalence, treatment and control of depressive symptoms).
Only approximately 8% of the sample was classified with depression. We addressed this imbalance in the data by including only those clusters with high depression rates, thereby reducing its impact on our analysis.
There are potential limitations in using the proposed ML techniques. The SOM can become computationally expensive as the number of variables and the grid size increase, causing the number of distances the algorithm needs to compute to grow rapidly. In addition, the SOM requires a value for each variable for each participant in order to generate a map, so missing data pose issues for map generation. Alternative, less computer-intensive traditional statistical techniques, such as k-means clustering or latent class analysis, could have been used. However, the SOM algorithm has been found to provide better results than either of these methods in the case of large data sets [65–67] such as that used in this study.
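The cost and missing-data points above can be illustrated with a minimal sketch. It assumes a standard SOM training step, in which each participant's symptom vector is compared with every unit on the map to find the best-matching unit. The function names and the grid sizes are hypothetical and are not taken from the study; only the sample size of 3,922 comes from the paper.

```python
import math


def best_matching_unit(sample, codebook):
    """Return the index of the map unit whose weight vector is closest to `sample`.

    One squared Euclidean distance is computed per map unit, and each
    distance sums over every variable in the sample.
    """
    if any(value is None for value in sample):
        # A distance cannot be computed with a missing value, which is why
        # incomplete records are a problem for SOM map generation.
        raise ValueError("SOM input vectors must be complete")
    best_index, best_distance = -1, math.inf
    for index, unit in enumerate(codebook):
        distance = sum((s - u) ** 2 for s, u in zip(sample, unit))
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index


def distances_per_epoch(n_participants, grid_rows, grid_cols):
    """Count the distance evaluations in one pass over all participants."""
    return n_participants * grid_rows * grid_cols


if __name__ == "__main__":
    # Hypothetical grid sizes; 3922 is the study sample size.
    for rows, cols in [(5, 5), (10, 10), (20, 20)]:
        print(rows, cols, distances_per_epoch(3922, rows, cols))
```

Under these assumptions, doubling both grid dimensions quadruples the number of distances per pass, and each distance itself becomes more expensive as variables are added.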
ML boosted regression has the advantage of automatically incorporating interaction effects when evaluating variable importance, which is not possible with traditional statistical regression modelling. Also, variable selection processes such as stepwise or regularized regression struggle when predictors are highly correlated, as is the case with medical symptoms. Boosted regression overcomes this problem by reducing the number of selected variables at each iteration, and is thereby able to deal with highly correlated variables. However, ML boosted regression can fail to perform well with small data sets. In addition, the training process can be computationally and memory intensive because trees are built sequentially, requiring advanced computing capability such as parallel processing. Furthermore, the regularization implemented to reduce the effects of overfitting can mean that the optimal number of iterations for a suitable shrinkage parameter is considerably large.
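The trade-off between shrinkage and the number of iterations can be seen in a deliberately idealised toy model. This is not the boosted regression used in the study: it assumes that each new tree fits the current residual perfectly and that only a fraction of that fit (the shrinkage) is added to the model. The function name and the tolerance are hypothetical.

```python
def iterations_to_converge(shrinkage, tolerance=0.01, max_iterations=1_000_000):
    """Count boosting iterations until the residual falls below `tolerance`.

    Idealised assumption: every tree fits the current residual exactly, and
    only `shrinkage` times that fit is added to the model, so the residual
    shrinks by a factor of (1 - shrinkage) per iteration.
    """
    if not 0.0 < shrinkage <= 1.0:
        raise ValueError("shrinkage must be in (0, 1]")
    residual = 1.0  # start with the whole signal unexplained
    iterations = 0
    while residual >= tolerance and iterations < max_iterations:
        residual -= shrinkage * residual
        iterations += 1
    return iterations


if __name__ == "__main__":
    for shrinkage in (0.5, 0.1, 0.01, 0.001):
        print(shrinkage, iterations_to_converge(shrinkage))
```

Even in this simplified setting, each tenfold reduction in shrinkage requires roughly ten times as many sequentially built trees, which is consistent with the computational burden described above.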
Whilst this study performed validation using a random split of data into 60% training and 40% validation at the ML boosted regression stage, no validation of the methodology was performed on a separate data set using self-reported medical symptom data. However, this methodology has been successfully implemented to identify lifestyle clusters associated with depression.
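A random 60%/40% split of the kind described can be sketched as follows. The function name, the seed and the use of participant identifiers are assumptions for illustration; the software and seed actually used in the study are not stated here.

```python
import random


def split_train_validation(ids, train_fraction=0.6, seed=2016):
    """Randomly partition participant identifiers into training and validation sets.

    The seed is fixed (hypothetical value) so that the split is reproducible.
    """
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


if __name__ == "__main__":
    # 446 is the size of the largest key cluster reported in the paper.
    train_ids, validation_ids = split_train_validation(range(446))
    print(len(train_ids), len(validation_ids))
```

The two sets are disjoint and together cover every participant, so performance on the validation set is not influenced by records seen during training. It remains an internal check, and does not replace validation on a separate data set.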
This study implemented two ML algorithms and a standard binary logistic regression to identify and describe clusters of individuals with higher rates of depression based on self-reported medical symptoms in a large, cross-sectional epidemiological community based population study. Bowel symptoms, covering bowel frequency and stool type, were identified as the predominant concurrent symptom category for the key clusters with a significant positive relationship with depression across 17 varied medical symptom categories. This study encourages the future use of machine learning techniques to complement traditional statistical approaches in the analysis of epidemiological studies, to assist clinicians in detecting potential latent associations that can be further refined and clarified. This study also supports a research focus on the potential importance of bowel symptoms, the gut and its resident microbiota in mental health research.
MB is supported by an NHMRC Senior Principal Research Fellowship 1059660.
LJW is supported by an NHMRC Career Development Fellowship (GNT1064272).
FNJ is supported by an NHMRC Career Development Fellowship 1108125.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
The authors would like to thank the referees of this issue for their valuable comments and suggestions that have improved this paper.
- Conceptualization: JFD.
- Formal analysis: JFD.
- Methodology: JFD DM.
- Software: JFD.
- Visualization: JFD.
- Writing – original draft: JFD.
- Writing – review & editing: JFD JAP MB LJW SD FNJ DM.
- 1. Sanna L, Stuart AL, Pasco JA, Kotowicz MA, Berk M, et al. (2013) Physical comorbidities in men with mood and anxiety disorders: a population-based study. BMC Med 11: 1.
- 2. Sanna L, Stuart AL, Pasco JA, Jacka FN, Berk M, et al. (2014) Atopic disorders and depression: findings from a large, population-based study. J Affect Disord 155: 261–265. pmid:24308896
- 3. Fond G, Loundou A, Hamdani N, Boukouaci W, Dargel A, et al. (2014) Anxiety and depression comorbidities in irritable bowel syndrome (IBS): a systematic review and meta-analysis. Eur Arch Psychiatry Clin Neurosci 264: 651–660. pmid:24705634
- 4. Kronish IM, Carson AP, Davidson KW, Muntner P, Safford MM (2012) Depressive symptoms and cardiovascular health by the American Heart Association’s definition in the Reasons for Geographic and Racial Differences in Stroke (REGARDS) study. PLoS One 7: e52771. pmid:23300767
- 5. Massie MJ (2004) Prevalence of depression in patients with cancer. Monographs-National Cancer Institute 32: 57–71.
- 6. Mezuk B, Eaton WW, Albrecht S, Golden SH (2008) Depression and type 2 diabetes over the lifespan a meta-analysis. Diabetes Care 31: 2383–2390. pmid:19033418
- 7. Fernandes BS, Hodge JM, Pasco JA, Berk M, Williams LJ (2016) Effects of depression and serotonergic antidepressants on bone: mechanisms and implications for the treatment of depression. Drugs Aging 33: 21–25. pmid:26547857
- 8. Harris B, Othman S, Davies J, Weppner G, Richards C, et al. (1992) Association between postpartum thyroid dysfunction and thyroid antibodies and depression. Bmj 305: 152–156. pmid:1515829
- 9. Luppino FS, de Wit LM, Bouvy PF, Stijnen T, Cuijpers P, et al. (2010) Overweight, obesity, and depression: a systematic review and meta-analysis of longitudinal studies. Arch Gen Psychiatry 67: 220–229. pmid:20194822
- 10. Passos IC, Mwangi B, Kapczinski F (2016) Big data analytics and machine learning: 2015 and beyond. The Lancet Psychiatry 3: 13–15. pmid:26772057
- 11. Monteith S, Glenn T, Geddes J, Bauer M (2015) Big data are coming to psychiatry: a general introduction. International journal of bipolar disorders 3: 1–11.
- 12. Kohonen T (1997) Self-Organizing Maps, Vol. 30 of Lecture Notes in Information Sciences. Springer.
- 13. Wehrens R, Buydens LM (2007) Self-and super-organizing maps in R: the Kohonen package. J Stat Softw 21: 1–19.
- 14. Kohonen T (1982) Self-organized formation of topologically correct feature maps. Biol Cybern 43: 59–69.
- 15. Mitchell TM (1997) Machine learning. Burr Ridge, IL: McGraw Hill.
- 16. Chekroud AM, Zotti RJ, Shehzad Z, Gueorguieva R, Johnson MK, et al. (2016) Cross-trial prediction of treatment outcome in depression: a machine learning approach. The Lancet Psychiatry.
- 17. Arnrich B, Setz C, La Marca R, Tröster G, Ehlert U (2010) Self Organizing Maps for Affective State Detection. Machine Learning for Assistive Technologies: 45.
- 18. Dipnall JF, Pasco JA, Berk M, Williams LJ, Dodd S, et al. (2016) Why so GLUMM? Detecting depression clusters through Graphing Lifestyle-environs Using Machine-learning Methods (GLUMM). Eur Psychiatry.
- 19. Vesanto J, Alhoniemi E (2000) Clustering of the self-organizing map. Neural Networks, IEEE Transactions on 11: 586–600.
- 20. Van Hulle MM (2012) Self-organizing maps. Handbook of Natural Computing: Springer. pp. 585–622.
- 21. Linoff GS, Berry MJ (2011) Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management.
- 22. Chaovalitwongse W, Pardalos PM, Xanthopoulos P (2010) Computational Neuroscience: Springer.
- 23. Seref O, Kundakcioglu OE, Pardalos PM (2007) Data mining, systems analysis, and optimization in biomedicine: American Institute of Physics Inc.
- 24. Lumley T (2004) Analysis of complex survey samples. Journal of Statistical Software 9: 1–19.
- 25. Centers for Disease Control and Prevention, National Center for Health Statistics (2013) National Health and Nutrition Examination Survey: Analytic Guidelines, 1999–2010. U.S. Department of Health and Human Services.
- 26. Dipnall JF, Berk M, Jacka FN, Williams LJ, Dodd S, et al. (2014) Data Integration Protocol In Ten-steps (DIPIT): A new standard for medical researchers. Methods.
- 27. Kroenke K, Spitzer RL (2002) The PHQ-9: a new depression diagnostic and severity measure. Psychiatric Annals 32: 509–515.
- 28. Kroenke K, Spitzer RL, Williams JB (2001) The PHQ‐9. J Gen Intern Med 16: 606–613. pmid:11556941
- 29. Centers for Disease Control and Prevention (CDC) (2009–2010) National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey Questionnaire. Hyattsville, MD: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention.
- 30. Kohonen T (1995) Self-Organizing Maps. Springer Series in Information Sciences, vol. 30. Berlin: Springer Verlag.
- 31. Gabor A, Leach R, Dowla F (1996) Automated seizure detection using a self-organizing neural network. Electroencephalogr Clin Neurophysiol 99: 257–266. pmid:8862115
- 32. Magdolen J, Rappelsberger P, Dorffner G, Flexer A, Winterer G (1997) Evaluating multi-layer perceptrons and self-organising feature maps as a tool for identifying psychiatric disorders in EEG. Psychiatry Research: Neuroimaging 68: 171–172.
- 33. Köhn HF, Hubert LJ (2006) Hierarchical cluster analysis. Wiley StatsRef: Statistics Reference Online.
- 34. Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences 55: 119–139.
- 35. Hastie T, Tibshirani R, Friedman J, Franklin J (2005) The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer 27: 83–85.
- 36. Dipnall JF, Pasco JA, Berk M, Williams LJ, Dodd S, et al. (2016) Fusing Data Mining, Machine Learning and Traditional Statistics to Detect Biomarkers Associated with Depression. PLoS One 11: e0148195. pmid:26848571
- 37. Friedman J, Hastie T, Tibshirani R (2001) The elements of statistical learning: Springer series in statistics Springer, Berlin.
- 38. Schonlau M (2005) Boosted regression (boosting): An introductory tutorial and a Stata plugin. Stata Journal 5: 330.
- 39. Friedman J, Hastie T, Tibshirani R (2000) Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The annals of statistics 28: 337–407.
- 40. Black MM, Cutts DB, Frank DA, Geppert J, Skalicky A, et al. (2004) Special Supplemental Nutrition Program for Women, Infants, and Children participation and infants’ growth and health: a multisite surveillance study. Pediatrics 114: 169–176. pmid:15231924
- 41. US Census Bureau (2008) Current Population Survey: Definitions and explanations. Population Division, Fertility & Family Statistics Branch.
- 42. Archer KJ, Lemeshow S (2006) Goodness-of-fit test for a logistic regression model fitted using survey sample data. Stata Journal 6: 97–105.
- 43. Simon GE, VonKorff M, Piccinelli M, Fullerton C, Ormel J (1999) An international study of the relation between somatic symptoms and depression. N Engl J Med 341: 1329–1335. pmid:10536124
- 44. Kapfhammer H (2006) Somatic symptoms of depression. Dialogues Clin Neurosci 8: 227. pmid:16889108
- 45. Poulin C, Shiner B, Thompson P, Vepstas L, Young-Xu Y, et al. (2014) Predicting the risk of suicide by analyzing the text of clinical notes. PLoS One 9: e85733. pmid:24489669
- 46. Tran T, Phung D, Luo W, Venkatesh S (2015) Stabilized sparse ordinal regression for medical risk stratification. Knowledge and Information Systems 43: 555–582.
- 47. Mykletun A, Jacka F, Williams L, Pasco J, Henry M, et al. (2010) Prevalence of mood and anxiety disorder in self reported irritable bowel syndrome (IBS). An epidemiological population based study of women. BMC Gastroenterol 10: 1.
- 48. Whitehead WE, Palsson O, Jones KR (2002) Systematic review of the comorbidity of irritable bowel syndrome with other disorders: what are the causes and implications? Gastroenterology 122: 1140–1156. pmid:11910364
- 49. Persoons P, Vermeire S, Demyttenaere K, Fischler B, Vandenberghe J, et al. (2005) The impact of major depressive disorder on the short‐and long‐term outcome of Crohn's disease treatment with infliximab. Aliment Pharmacol Ther 22: 101–110. pmid:16011668
- 50. Sanna L, Stuart AL, Berk M, Pasco JA, Girardi P, et al. (2013) Gastro oesophageal reflux disease (GORD)-related symptoms and its association with mood and anxiety disorders and psychological symptomology: a population-based study in women. BMC Psychiatry 13: 1.
- 51. Shim L, Talley NJ, Boyce P, Tennant C, Jones M, et al. (2013) Stool characteristics and colonic transit in irritable bowel syndrome: evaluation at two time points. Scand J Gastroenterol 48: 295–301. pmid:23320464
- 52. Crofts D, Michel VM, Rigby A, Tanner M, Hall D, et al. (1999) Assessment of stool colour in community management of prolonged jaundice in infancy. Acta Paediatr 88: 969–974. pmid:10519339
- 53. Cryan JF, Dinan TG (2012) Mind-altering microorganisms: the impact of the gut microbiota on brain and behaviour. Nature reviews neuroscience 13: 701–712. pmid:22968153
- 54. Penninx BW, Milaneschi Y, Lamers F, Vogelzangs N (2013) Understanding the somatic consequences of depression: biological mechanisms and the role of depression symptom profile. BMC Med 11: 129. pmid:23672628
- 55. Maes M, Kubera M, Leunis JC, Berk M (2012) Increased IgA and IgM responses against gut commensals in chronic depression: Further evidence for increased bacterial translocation or leaky gut. Journal Of Affective Disorders 141: 55–62. pmid:22410503
- 56. Jacka FN, Cherbuin N, Anstey KJ, Butterworth P (2014) Dietary patterns and depressive symptoms over time: examining the relationships with socioeconomic position, health behaviours and cardiovascular risk. PLoS One 9: e87657. pmid:24489946
- 57. Dash S, Clarke G, Berk M, Jacka FN (2015) The gut microbiome and diet in psychiatry: focus on depression. Current opinion in psychiatry 28: 1–6. pmid:25415497
- 58. Wu GD, Chen J, Hoffmann C, Bittinger K, Chen Y-Y, et al. (2011) Linking long-term dietary patterns with gut microbial enterotypes. Science 334: 105–108. pmid:21885731
- 59. David LA, Maurice CF, Carmody RN, Gootenberg DB, Button JE, et al. (2014) Diet rapidly and reproducibly alters the human gut microbiome. Nature 505: 559–563. pmid:24336217
- 60. Albenberg LG, Wu GD (2014) Diet and the intestinal microbiome: associations, functions, and implications for health and disease. Gastroenterology 146: 1564–1572. pmid:24503132
- 61. Kim KA, Gu W, Lee IA, Joh EH, Kim DH (2012) High fat diet-induced gut microbiota exacerbates inflammation and obesity in mice via the TLR4 signaling pathway. PLoS One 7: e47713. pmid:23091640
- 62. Dipnall JF, Pasco JA, Meyer D, Berk M, Williams LJ, et al. (2015) The association between dietary patterns, diabetes and depression. J Affect Disord 174: 215–224. pmid:25527991
- 63. Olver JS, Hopwood MJ (2012) Depression and physical illness. Med J Aust 1: 9–12.
- 64. Shim RS, Baltrus P, Ye J, Rust G (2011) Prevalence, treatment, and control of depressive symptoms in the United States: results from the National Health and Nutrition Examination Survey (NHANES), 2005–2008. The Journal of the American Board of Family Medicine 24: 33–38. pmid:21209342
- 65. Abbas OA (2008) Comparisons Between Data Clustering Algorithms. Int Arab J Inf Technol 5: 320–325.
- 66. Hagenaars JA, McCutcheon AL (2002) Applied latent class analysis: Cambridge University Press.
- 67. Eshghi A, Haughton D, Legrand P, Skaletsky M, Woolford S (2011) Identifying groups: A comparison of methodologies. Journal of Data Science 9: 271–292.
- 68. Freund Y, Schapire R, Abe N (1999) A short introduction to boosting. Journal-Japanese Society For Artificial Intelligence 14: 1612.
- 69. Natekin A, Knoll A (2013) Gradient boosting machines, a tutorial. Front Neurorobot 7.
- 70. Dipnall J, Pasco J, Berk M, Williams L, Dodd S, et al. (2017) Why so GLUMM? Detecting depression clusters through graphing lifestyle-environs using machine-learning methods (GLUMM). Eur Psychiatry 39: 40–50.