Complex genetic dependencies among growth and neurological phenotypes in healthy children: Towards deciphering developmental mechanisms

The genetic mechanisms of childhood development in its many facets remain largely undeciphered. In the population of healthy infants studied in the Growing Up in Singapore Towards Healthy Outcomes (GUSTO) program, we have identified a range of dependencies among the observed phenotypes of fetal and early childhood growth, neurological development, and a number of genetic variants. We have quantified these dependencies using our information theory-based methods. The genetic variants show dependencies with single phenotypes as well as pleiotropic effects on more than one phenotype and thereby point to a large number of brain-specific and brain-expressed gene candidates. These dependencies provide a basis for connecting a range of variants with a spectrum of phenotypes (pleiotropy) as well as with each other. A broad survey of known regulatory expression characteristics, and other function-related information from the literature for these sets of candidate genes allowed us to assemble an integrated body of evidence, including a partial regulatory network, that points towards the biological basis of these general dependencies. Notable among the implicated loci are RAB11FIP4 (next to NF1), MTMR7 and PLD5, all highly expressed in the brain; DNMT1 (DNA methyl transferase), highly expressed in the placenta; and PPP1R12B and DMD (dystrophin), known to be important growth and development genes. While we cannot specify and decipher the mechanisms responsible for the phenotypes in this study, a number of connections for further investigation of fetal and early childhood growth and neurological development are indicated. These results and this approach open the door to new explorations of early human development.

number of candidate genes that exhibit no significant pairwise dependence with a single phenotype, and would therefore be missed altogether by common genetic association methods. These can be characterized as fundamentally pleiotropic loci.
The specific purposes of this overall effort were rather different from most genetic studies. Rather than searching for a handful of highly significant causal genes (which is typical for a disease research) we focused on attempting to reveal biological determinants of growth and neurocognitive development in healthy children by finding multiple less significant genetic correlates, and to elucidate the specific dependencies among neurological development, physical development and SNPs of infants in the GUSTO study. We wished to identify candidate genomic regions, genes and/or regulatory interactors that may be involved in these developmental processes. Since synergy of biological effects is common, we sought to identify as many genetic signals as possible, including some that exhibit relatively low significance by themselves, in order to collect multiple pieces of evidence that might collectively point to a set of candidate genes or loci within the genome, and then to biological pathways or networks. The compilation of extensive regulatory and gene expression data on implicated genes allowed us to implicate a number of developmental processes. Notable were the large number of connections to brain-specific and brain-related expression and processes known to affect brain phenotypes. While the wide range of information that is integrated in this analysis suggests several intriguing conclusions, the outstanding limitation of this study is that, to our knowledge, there is no comparable data set that can be used for cross validation. Nevertheless, the resulting candidate dependencies identified by our method are indirectly validated using multiple public databases.
In the Results section, we present the outcomes of a consecutive set of analyses. We examined dependencies between longitudinal growth parameters of head circumference, and neurological development scores of two-year olds; next, we looked at the genetic dependencies of each of these phenotypes separately; and finally, determined the pleiotropic three-way dependencies among phenotypes and specific SNP's. Significant dependencies were found in each of these steps, and these sets of genetic variants collectively implicated some of the same processes. Nomenclature:

Relationships between fetal and early childhood growth and neurological development
To determine if there were effects on neurological development of fetal and early childhood growth profiles, we looked for dependencies among various data variables that represented aspects of these processes. We used our information theory methods, which assume no models (see Methods Section 4.3) [13][14][15], to examine dependencies between growth phenotypes and neurological measurements. Our initial attempts to detect dependencies between the raw growth measurement data points (head circumference) and neurological measurements (e.g., Bayley phenotypes) of infants at two years of age led to relatively poor statistics, probably caused by characteristics of the growth data including the variable times of growth measurement, the noise in these single point measurements and missing data, particularly in the fetal growth data sets. To resolve this problem, we fit the growth data of the entire population to a parameterized Gompertz-like model resulting in a population mean curve and we then estimated individual deviations and used the model parameters estimates for each subject as the growth phenotype variables to examine for dependency together with the neurological data (see Methods section 4.2.3 for details and fit statistics). The growth model involves three parameters that describe the final growth limit, a growth rate parameter and the non-linearity of growth deceleration. These three growth curve parameters were analyzed for two-and three-way dependencies with the neurological phenotypes, including Bayley [16], Infant Toddler Social Emotional Assessment (ITSEA) and Child Behavior Check List (CBCL). The pairwise dependence measures and the three-way dependence measure (see Methods section 4.3) were calculated for 1073 subjects and permutation tests were performed to generate p-values (described in Methods section 4.4). We found a number of effects in both the twovariable and three-variable cases. As shown in Table 1, the strongest effect (lowest p-value) for two-way dependencies was clearly between the composite Cognitive Bayley phenotype and any growth parameter. The correlations indicated between phenotypes suggest that there is a relationship and, hence, the possibility of common causes. Furthermore, such correlations suggest that stringent corrections for multiple tests may not be appropriate (see Methods section 4.4.1). We found later that there were strong pleiotropic genetic effects for the Social-Emotional composite scale, the Adaptive scale, and the limiting pre-natal head circumference. A number of other phenotype dependencies were observed, and overall there was a clear relationship between robust growth and the Bayley phenotypes at age two. In our view it is best to consider these dependencies not as a collection of pairwise effects, but as a network of interdependencies implicating relationships among growth and Bayley phenotypes. We will address the network properties further when we consider regulatory effects implicated by our analysis. These results support the idea that there are significant dependencies between fetal and early childhood growth and neurological development that should be investigated further and suggest a strong biological connection between early growth and the development of the brain. This suggests that in order to explore the biological sources of the dependence, genetic effects on both of these phenotype classes should be examined.

Pairwise genetic relationships with neurological and growth variables
To explore the genetic relationships, we first examined the mutual information scores (see Methods section 4.3) between the 495,719 SNPs (a subset of 557,070 SNPs after preprocessing) and the five composite Bayley phenotypes for 433 subjects. The details of the acquisition of the Bayley phenotypes are provided in [16]. The subjects and the SNPs were those without any missing data values (see Methods section 4.2.4). The pairwise analysis shown in Table 2 reports permutation-based p-values as described in the Methods section 4.4.1 (only genetic effects with p-values better than 2.7x10 -6 are shown). The collective conclusion derived from the number and nature of the implicated genes is that there are significant genetic influences on the neurodevelopmental phenotypes. We flagged these loci for further analysis (Fig 1). The genetic effects of one of these variants (NELL1) is shown in Fig 8A. The possible confounding effects of the different ethnicity were calculated as well. While a few SNPs have significant confounding effects, most do not. These results and calculations are discussed in Methods section 4.4.2. Although in studies aimed at identifying causative SNPs, as is typical in GWAS, the p-value cutoff for significant SNPs is typically 5x10 -8 , driven largely by considerations of multiple hypothesis testing. The majority of our SNPs fall short of this cutoff and only two are better. This cutoff, however, has been shown to be very stringent, not taking into account correlations between variables (which we have in abundance among both SNPs and phenotypes), and is specifically meant to assure significance for causal SNPs [17][18][19][20][21][22]. In this paper we argued against applying this cutoff, or performing other common corrections for multiple hypothesis testing, since our goal is not to search for causal SNPs, but to detect a set of biologically relevant SNPs that may be statistically weaker on their own, but together can implicate pathways and processes of growth and neurological development. We therefore decided to use a higher p-value cutoff to allow for SNPs with weaker signals in this population to be collected for our downstream knowledge-based analysis. Since Bayley and Growth phenotypes are of different type (categorical vs numerical), we used two different cutoffs for selecting associated SNPs. We used 2.7x10 -6 as a p-value cutoff for SNPs associated with Bayley phenotypes and 8x10 -6 as a cutoff for growth associated SNPs (see Methods section). Tables 2 and 3 show the corresponding Bayley and Growth associated SNPs. The detailed descriptions of these considerations and methods are found in the Methods section.
Similarly, we examined the mutual information scores between the 448,658 SNPs and the three growth parameters for 1053 subjects. The subjects and the SNPs were those without any missing data values (see Methods section 4.2.4). Table 3 shows the pairwise genetic effects with permutation-based p-values better than 8x10 -6 .
These loci were associated with the three growth parameters, linf, lambda and alpha, considered as phenotypes (Table 3, Fig 2). Two loci showed notably strong effects (rs12734338 near PPP1R12B gene, rs6672510 near PLD5 gene). The former is a protein phosphatase subunit, which is implicated as the most significant celiac disease risk locus outside of the HLA region. This intronic SNP, rs12734338, was reported specifically for the Celiac risk effect [23]. The SNP rs9691259, with the highest score for alpha dependence, is notable since it is located near genes IGFBP3 and IGFBP1. Gene IGFBP3 produces insulin-like growth factor binding protein 3 directly involved in growth pathways, affecting growth factor stabilities, and also released by astrocytes in the brain [24]. Furthermore, rs9691259 is located between the 5' end of IGFBP3 and the closest known enhancer (at coordinate 46,515,654 of genome build hg19). Thus, a regulatory effect is a reasonable conjecture for this genetic association.  Table 2). SNPs with p-value<2.7x10 -6 (red line) are highlighted and labeled. https://doi.org/10.1371/journal.pone.0242684.g001

Pleiotropic effects: Genetic locus dependence with pairs of neurological and growth variables
We used the three-way dependence method to discover pleiotropic genetic variants that were simultaneously interdependent with two phenotype variables, one each from the neurological and growth phenotype sets. The genetic variants in these three-way dependencies are not discovered by any pairwise dependence (see Methods section 4.3). For the three-way dependency calculation, we used 495,719 SNPs (without missing data), five composite Bayley phenotypes (Adaptive, Cognitive, Social-emotional, Motor and Language), and three growth model phenotypes (linf, lambda, and alpha) measured for 428 of 1073 subjects (without any missing values). This calculation identified 53 SNPs with candidate dependency for both neurological and growth phenotypes (Table 4, Fig 3). The locus with the most significant dependency is the SNP in the RAB11FIP4 gene, a highly brain specific gene (p-value of 2x10 -8 ). This locus is associated with growth phenotype lambda and Bayley phenotype Adaptive (Table 4 and Fig 3) and is contiguous to the NF1 gene (neurofibromatosis), and therefore implicated in growth in the neural system. This variant is in an intron. The next most significant locus is within the DNMT1 gene (DNA methyl transferase 1, with a p-value of 6.5x10 -8 ). This is a synonymous variant in an exon. Finally, the next locus is intronic to LHFPL2 and near ARSB (p-value 1.1x10 -7 ), both brain-expressed genes. LHFPL2 has been reported to affect Parkinson's and Alzheimer's risk [25,26].
Because our three-way analysis uses the symmetric measure Delta3, which is the product of three factors, the asymmetric Deltas corresponding to each variable (see Methods section 4.3), it is not possible to determine which variable dependencies dominate. In order to capture dependencies that have only one or two large factors that might not be seen by the symmetric Delta3, we also examined each individual factor. These measures are specific for each variable (Δ 1 for growth phenotypes, Δ 2 for Bayley phenotypes, and Δ 3 for SNPs). For the asymmetric Delta analysis the same five Bayley phenotypes, three growth model parameters, SNPs, and subjects were used. Each asymmetric Delta, Δ 1 , Δ 2 , and Δ 3 identified 77 SNPs (S1 Table), 106 SNPs (S2 Table), and 117 SNPs (S3 Table), respectively, but with higher p-values. These analyses uncovered only a couple of additional loci (TRANK1, for example) suggesting that most of the collective dependencies detected by the symmetric delta are relatively balanced in their phenotype pleiotropies.

X and Y SNPs
If analyzed together with other SNPs, the X and Y SNPs overwhelm the statistical signal due to the genotype patterns distinguished by sex. As a result, the dependencies for the X and Y chromosomes were assessed separately and the subjects separated by sex. The results for all combinations of phenotypes and gender are listed in Table 5. The subjects for each analysis, and the preprocessing for this analysis are shown in S5 Table. The X-linked DMD gene (dystrophin) is  a notable locus with 2 SNPs implicated by the Delta3 score for females. Note that the dystrophin gene has been previously reported to affect brain development [27]. Note that the number of subjects here was considerably smaller, after separating them by gender and removing subjects with missing data, resulting in a substantial loss of statistical power. With p-values less than 1.5x10 -5 some of these are not very significant, but we include them here as candidates for potentially important pathways. The two loci with the lowest p-values lie in the pseudo-autosomal regions Par1 and Par2 (at the ends of the X chromosome) respectively. The first is a gene of unknown function, however it is located over 40kbps from the implicated SNP. The second locus, the sprouty gene locus (SPRY3), implicated in males, is a gene reported to be involved in placental development [28]. Note that the two phenotypes in the dependency with SPRY3 are linf, related to head circumference, and Adaptive, a composite Bayley phenotype. The best 10 SNPs with respect to p-values in each phenotype category are listed in S4 Table.

Linkage disequilibrium SNPs
Recall that a large number of the original 933,886 SNPs with high mutual information between each other were removed to reduce redundancy before conducting the dependency analysis. To find additional SNPs potentially implicating other candidate genes, we searched for all possible SNPs in Linkage Disequilibrium (LD) with 82 SNPs previously identified using two-way and three-way dependency analysis (the SNPs in Tables 2-4). The LD was calculated for the   Table), they provided no new information about other potential genes that might affect the phenotypes (they were either within the same gene intron/exon or in the same intergenic region). Hence, the disequilibrium, while strong in many cases, did not add to our list of potential biological influences.

Gene interaction
The genetic dependencies reported in the previous sections are pairwise associations or pleiotropic effect variants. We expect that there may also be interactions involving multiple variants that contribute to the overall dependencies. Since the three-way measure can assess two variant effects on a phenotype, we calculated the interaction between each locus already implicated above and all other variants. For this calculation, 39 single locus effect SNPs that have been noted in neurological development or growth (see Tables 8 and 9) were combined with 495,718 other SNPs for each phenotype. The p-values for these measures were calculated using the same permutation methods as for the single locus effects. This resulted in the detection of interactions between loci detected in pairwise dependencies and loci not seen with any significant other dependence, as presented in Table 6.
There are a number of notable interacting pairs here, for example, the variant at the sphingosine-1-phosphate receptor 2 gene shows significant interaction with DNMT1 and both of these genes are strongly expressed in the placenta. The locus at HTR7P1 shows interactions with a diverse range of other loci, on eight different chromosomes. It is clearly an interaction hub of some kind. The significance of multiple interactions, including the RAB11FIP4 and PAK6 loci is currently unclear, but intriguing.

Variant annotation.
To investigate the potential biological interactions implicated by the genetic dependencies, we integrated the candidate sets of SNPs identified by two-way and three-way analysis (from Tables 2-4). The list contains 230 unique SNPs after removing 3 SNPs lacking mapping information. We re-annotated the candidate variants based on their location in the genome using Variant Effect Predictor (VEP, https://www.ensembl.org/vep). Functional annotation of these 230 SNPs showed that majority were non-coding and located either in the intergenic or intronic regions of the genome (Fig 4A). There were two neutral coding SNPs and one missense SNP (rs2064317) located in the coding region of TULP1 gene.

Regulatory functional analysis.
The majority of the SNPs identified in GWAS studies to date are located in the noncoding regions of the genome, and even though they may have been implicated simply because of their linkage disequilibrium with a causative SNP, they are equally likely to point to regulatory elements [29,30]. As most of the SNPs we identified are also located in non-coding regions of the genome, it is likely that there are some regulatory effects. To carry out this analysis we used RegulomeDB [31] (http://www.regulomedb.org/). Of the 230 SNPs, 148 SNPs were scored as having potential regulatory effects (Fig 4B and S6  Table) with two known eQTL SNPs, rs2164807 in the regulatory region of ATOH8 gene, and rs525410 in the intronic region of LAMC2 gene.
Furthermore, 17 SNPs were scored by RegulomeDB as having strong regulatory functions, indicated by the top 5 categories, namely 1b, 1f, 2a, 2b, and 3a (see Fig 4B for description). These SNPs together with their 69 transcription factor (TF) co-regulators (see S8 Table) are part of a regulatory network governing child's growth and neurocognitive development. Regulatory genetic networks underlying a phenotype arise from regulatory SNPs affecting the transcription factor recognition sequences. To reconstruct a transcription regulatory network, we connected the SNPs (annotated here by their nearest genes) with common regulatory interactors/TFs as intermediate components, allowing for a connected sub-network (genes without any connections, were excluded). The network of regulatory interactors ( Fig 5) connected 13 key regulators (SNPs in regulatory regions of ATOH8, CTCFL, LINC01639, CD99, PFKB3, Gene 1 here indicates a locus that has been noted for several reasons: expression profiles, brain or growth specific known effects, or low p-values in a single locus effect (see Tables 8 and 9). The coordinates are for genome build hg19. The p-values are calculated for the three-way interaction measure (Delta3 for two SNPs, one of which is a and the single locus effect SNP). https://doi.org/10.1371/journal.pone.0242684.t006 LAMC2, PPEF1, RIMBP2, LOC101928738, CXorf36, PAK6, ASMTL-AS1, DHRSX) with 38 TF interactors. Of the 13 genes with regulatory SNPs in the network (Fig 5), PAK6 seems to be the central node in the network. PAK6 identified through our three-way dependence analysis belongs to a group of p21-stimulated serine/threonine kinases, and is a key regulator of signal transduction pathways, cellular division regulation, gene transcription, cytoskeleton rearrangement and apoptosis. PAK6 protein expression profile points to highest expression in tissues such as skin, placenta, testis and cerebral cortex [32] (https://www.proteinatlas.org), while the RNA expression profile distinctly points to brain tissue specific expression including cerebral cortex and caudate (https://www.proteinatlas.org, http://gtexportal.org). In a study carried out by Nekrasova et al. [33], it was shown that PAK6 is highly expressed in the brain and PAK5/PAK6 double knockout mice exhibit several locomotor and behavioral deficits. Nekrasova and colleagues concluded that normal expression of these two proteins are required for normal level of activity, and for normal learning and memory, which suggests an important role of PAK6 in neurological and growth development.

Gene expression profiling.
In addition to reconstructing a transcription regulatory network using RegulomeDB, as shown in Fig 5, we analyzed tissue specific gene expression of genes and/or eQTLs associated with our integrated set of SNPs, using the Genotype-Tissue Expression (GTEx) database (http://gtexportal.org). Using GTEx we identified 56 SNPs, out of which 2 were previously detected by RegulomeDB (S6 Table). Moreover, several of these SNPs were also shown to have a modest effect on the expression of their associated genes in tissues such as skeletal muscle, tibial nerve and several brain tissues/regions (see Table 7).

Application of functional mapping and annotation of genome-wide association studies (FUMA).
We combined functional annotation and gene mapping results using We selected two particularly interesting lead risk loci, rs178850 and rs6672510, to analyze using FUMA. The first SNP, rs178850, has the best p-value (2x10 -8 ) of those identified in a three-way dependency with Adaptive (Bayley phenotype) and lambda (growth phenotype). Clustering and visualization of the network was carried out using Cytoscape v.3.3.0 (undirected network and betweenness centrality statistics). The degree of nodes (the number of edges per node) is shown with their color, ranging from orange (the highest degree), to yellow, green, and then blue (the lowest degree). In addition, larger nodes correspond to hubs with higher degree. The edges with high betweenness centrality, whose removal would partition the network into connected subnetworks, are depicted by thick, orange lines. The small blue nodes are additional factors connected to the dependent loci. "Orphan genes" (unconnected nodes) are not shown. Nodes with blue, green, and red rings correspond to loci detected with two-way and three-way analysis (see the legend). https://doi.org/10.1371/journal.pone.0242684.g005

PLOS ONE
This SNP is particularly interesting as it is located in the intronic region of RAB11FIP4 gene on chromosome 17 and is also very close to NF1 and OMG, which is another brain-specific gene located within NF1. Thus this SNP could affect three brain genes. See S9 Fig for the details of the one-SNP three genes structure. All three of these genes are highly expressed in brain. GTEx data from 53 tissue types shows that RAB11FIP4 has the highest expression in all 13 brain tissues in normal samples, which highlights its potential role in neurological development and growth, and its neighbor, NF1, is expressed almost exclusively in the brain and nervous system.
The second interesting SNP, rs6672510, was identified in two-way dependency analysis with lambda (growth rate parameter phenotype) with a p-value of 3.12x10 -8 . This SNP is located in the intronic region of PLD5 gene on chromosome 1. Like in the previous example, it is also shown to be highly expressed in brain, though its associated phenotype is growth.
FUMA generated circular plots (Fig 6), indicating positions of chromatin interactions and eQTLs of the two lead SNPs (see Methods section 4.5.5). In the case of RAB11FIP4, 21 genes were linked to the risk locus, three via eQTL mapping and 18 via chromatin interactions ( Fig  6A). In the case of PLD5, 20 genes were linked through chromatin interactions and one through eQTL mapping ( Fig 6B). While not all the genes identified here are relevant to neurological and growth development, they can serve to identify additional genes and regions that are not indicated by proximity to the genetic variants and could be used in future experimental studies. Specifically, four genes (UTP6, CTC-542B22.2, COPRS, RP11-848P1.5) and three eQTLs (MIR4724, CTD-2349P21.9, RHBDL3) linked to the lead SNP in the RAB11FIP4 gene have been shown to be highly expressed in several brain tissues (https://gtexportal.org). Similarly, five genes (AL590483.1, ZBTB18, EXO1, CHML, KMO) linked to the lead SNP in PLD5 have been shown to have high expression levels in several regions of the brain. The presence of chromatin interactors with expression profiles in brain tissues similar to RAB11FIP4 and PLD5, both of which are highly brain-expressed genes, is therefore highly suggestive of their roles in a regulatory network of neurological development.
For more information obtained from FUMA and GTEx applied to our candidate set of genes see S7 Table. The analysis of our set of candidate genes showed differential expression in frontal cortex, hypothalamus, caudate, nucleus accumbens and putamen, all known for affecting cognitive and motor functions.

Estimated effects of interacting SNPs.
Since the visualization of two-variable dependence with genetic variants is straightforward it is interesting to examine the distribution of the phenotypes in the population of children. Here we show three diverse examples of The normalized effect size (NES) defined as the slope of the linear regression of the effect of the alternative (minor) allele relative to the reference (major) allele, based on hg19 reference genome [34].
https://doi.org/10.1371/journal.pone.0242684.t007 results to illustrate the way in which the genetic variants in this cohort affect the phenotypes. It is clear that the distributions of these quantitative phenotypes are distinctly different. This effect is seen both for the spectrum of Bayley phenotypes in 24-month infants and in growth parameters. We illustrate the result for three phenotypes as a function of the variants labeled by their closest genes NELL1 (composite Bayley phenotype, Motor), PPP1R12B (limit head size, linf), and PLD5 (growth rate, lambda). These genotype-specific profiles are interesting in several ways: the NELL1 stratification suggests that the effect on the Bayley phenotype of the minor allele is recessive to the major allele. For the PPP1R12B profile, the effect of a single minor allele seems to sharply affect the head size distribution. The fact that there are no observed homozygous minor genotypes at all at this position raises the question of whether the shift to smaller head size of the heterozygote may be highly detrimental in the homozygous state. In the third example of PLD5, the major allele appears to be partially recessive to the minor allele, which reduces the average growth rate.
Another, more quantitative, way to compare the distributions for specific SNP genotypes is to use a Chi-square or Kolmogorov-Smirnov (K-S) test, which provides a useful way to visualize pleiotropic dependencies. To illustrate its use, we show K-S tests for pairwise dependencies for NELL1, PLD5 and MTMR7 (see Fig 7). Here the K-S score indicates the p-value of testing the hypothesis that the distributions are the same.
Using the same measure, we can now visualize the pleiotropy by looking at the similarities and differences between phenotype distributions for different values of the second phenotype, as shown in Fig 9 for the pleiotropic variant in the PAK6 gene.
Clearly profiles of the growth (linf, the head circumference at birth) phenotype distribution for subjects with high and average values of Bayley phenotypes (Language score at 24 months) are similar, but the profile of the growth parameter for low values of Bayley phenotypes is distinctly different. While it is often difficult to visualize the complex three-variable dependencies inherent in pleiotropic genetic effects, these measures of similarity seem to provide useful profiles.

Discussion
The finding of significant dependencies among the variables characterizing fetal and early childhood growth and those characterizing neurological development in the GUSTO project data led us to explore the genetic dependencies of these variables. The overall goal of this effort was to gain insight into the underlying biological mechanisms in healthy children and to implicate processes and pathways involved. In order to mobilize the results of the genetic analysis of this large data set into possible insights that point to mechanistic pathways and networks involved in these critical processes we analyzed and integrated the results in several ways. Features of a collection of the genes linked to 22 SNPs that we have tied to the neurological traits, having p-values < 5x10 -6 and either: high brain levels of expression or specificity of expression, or published phenotypic effects related to neurological functions in human studies. The notation (M) in the N column indicates that the dependency was determined for male subjects only. TRANK1, marked with an asterisk, falls just below the p-value threshold so does not appear in Table 4. https://doi.org/10.1371/journal.pone.0242684.t008 We used our three-way dependence measure here to identify complex relationships, in this case pleiotropies, for the first time in human data. Considering that the cohort was not selected for any traits, and appeared to be normal, healthy children, the results were striking. First, we found several genetic dependencies of neurological development as indicated by the five different Composite Bayley scale scores at two years of age. Second, genetic dependencies of the fetal and early childhood growth parameters were also identified using parameters fit to growth data as phenotypes. The set of candidate genes identified using the pairwise measure (mutual information), with potential functions known to affect growth and brain development and function, included some intriguing candidates and were encouraging. We then looked for genetic effects on two phenotypes together, pleiotropic effects, using the three-way measure from our multivariable dependency method, and found another set of interesting candidates. Our information-based dependency measures confer the advantage of reduced sensitivity to undersampling relative to a model fitting approach, so that the number of subjects and the potential complexity of the dependencies in this work yield results that permutation tests suggest are significant.  8). The scores, indicating the similarity between the distributions, show the dominance of the major allele for NELL1 and the dominance of the minor allele for PLD5, and MTMR7 (not shown in Fig 8). https://doi.org/10.1371/journal.pone.0242684.g007 The largely disjoint sets of SNPs in the three classes (affecting growth parameters, Bayley phenotypes, and both together) is perhaps surprising, since one might expect that a SNP affecting both neurological development and early growth should have a significant presence for two-way dependence for each class of phenotypes. As we have discussed in previous work on multiple dependencies [15], this is not necessarily the case. To further explore this disjoint effect, we looked at the two-way dependencies for each of the SNPs identified in the three-way analysis and confirmed that there were no significant two-way dependencies. The initial lists of results tell the full story. While the interpretation of this observation is unclear, it seems to indicate that the source of the three-way effects is largely distinct from the two-way effects.
The use of our three-variable dependency measure has been shown to yield a number of interesting results that could not be detected using only two-way methods [15], which has significant implications for the way in which human phenotype data are analyzed. Finding threeway effects that are distinct from any two-way effects represents a sharp shift of approach and should be considered in future studies.
Keep in mind here that our insights are based on attempted interpretation of the effects of SNPs that fall largely in intergenic regions and introns. While this means that we are attempting to implicate some genes by their proximity to the SNPs, we also have used regulatory data analysis, and gene expression profiles to attempt to pull pieces of the puzzle together.
The fetal and early childhood growth parameters and the neurological development show a pattern of dependency on one another, and the genetic effects on both classes of phenotypes that we see are striking. It is yet unclear what the most important biological pathways involved in these effects are, but it is intriguing that the patterns are rather consistent in the prevalence  https://doi.org/10.1371/journal.pone.0242684.g008 of brain-specific or brain-related genes. It is not surprising that SNPs near genes that are expressed in the brain and CNS are implicated in the neurological development, but this pattern is also present in the three-way dependencies with growth and neurological development. It is clear, perhaps not surprisingly, that the overall growth of the head circumference and the development of the infant brain are strongly coupled. It is probably worth further investigation to also determine the extent to which the growth of the early brain may be involved in regulating the overall growth of the fetus.
To explore the biological relevance of the candidate SNPs identified using our two-way and three-way dependency measures, we compiled the set of 230 variants, including LD SNPs and those located on the X and Y chromosomes. Functional annotation of the integrated set of SNPs showed that majority are in intronic and intergenic regions, so we examined the potential regulatory functions of this set using RegulomeDB and identified two eQTL SNPs, selected using three-way dependency analysis between neurological and growth phenotypes (S6 Table). The eQTL SNPs rs2164807 (p-value of 1.39E-06; identified through dependency between <Adaptive, linf, SNP>) and rs525410 (p-value of 4.56E-07; identified through <Social-Emotional, lambda, SNP>) are located in the regulatory regions of ATOH8, a transcription factor involved in nervous system development (GO:0007399) [42] and LAMC2 implicated in neurite outgrowth among other functions [43], respectively.
The many genes identified in this work represent the multiple pieces of evidence that can point to processes and pathways. While this integration is clearly at its outset, we can illustrate something of its value by a comparison of the p-values linking SNPs to Bayley phenotypes (either two-or three-way dependencies), relative levels of expression in the brain, placenta and other relevant information, and the attribution of effects of variants in genes on human phenotypes as recorded in the literature. To illustrate this kind of integration, we compiled two tables of relevant SNP variants that could be linked to neurological development. While it is somewhat artificial to separate growth and brain development in the presence of so many pleiotropic effects, we do so for simplicity. They should be considered together. Table 8 identifies the SNPs, the genes nearby or containing the SNPs, the expression levels and effects linked to brain and neurological development or linked to relevant human traits reported in the literature.
Similarly, we compiled a table of relevant SNP variants we identified that could be linked to growth in a broader sense and therefore could be directly relevant to fetal and early childhood development. Table 9 identifies 17 of these SNPs, the genes nearby or containing the SNPs, the expression levels and links to the literature.
The X-linked gene, DMD, did not have a low enough p-value to be included in the above tables (1.5x10 -5 ), but it is particularly relevant to brain and fetal and early childhood development, and should be kept in mind as a possible player in some cases. This is the gene mutated in Duchenne muscular dystrophy. The dystrophin protein provides a key part of an actinbinding, multifunctional unit, a complex that provides a key component of an astrocyte "foot" that engages neurons in the developing brain [20]. There is now clear evidence of developmental disturbances that result in neuropsychiatric abnormalities in children, particularly males with mutations in DMD [73]. Dystrophin is also widely expressed, and therefore likely is engaged in more functions than only in brain and muscle as part of the dystrophin associated complex. We should therefore consider the link to the DMD gene in this study as a pointer for future investigation.
We have explored the integration of the identified set of predicted regulatory SNPs (annotated by their nearby genes) in another way by constructing a regulatory network to find key genes and/or transcription factors potentially involved in neurological and growth development and evaluated their expression profile in normal tissues using GTEx database [34] (https://gtexportal.org). The transcription factor regulatory network constructed by Regulo-meDB (Results section, Fig 5) points to key genes, most of which were identified through our three-way dependency measure. Examples include PAK6, which is a gene central to signal transduction and cellular regulation. PAK6 is involved in several cellular processes, such as cytoskeletal dynamics, cell motility, gene transcription, and death and survival signaling, and is highly expressed in several brain-tissues (https://gtexportal.org). Another notable example is MPPED1, proposed as the most abundant transcript in the brain [74], particularly in frontal cortex and cerebral cortex, based on GTEx, HPA (https://www.proteinatlas.org), and FAN-TOM5 [75].
While exploring the tissue-specific gene expression and regulation database (GTEx), we identified additional 53 eQTL SNPs, most of which indicated expression in several tissues of the brain, muscle and nerves (S7 Table). To capture additional functional information, we used FUMA analysis, as described in the methods and the result sections (Fig 6). Two loci were probed for their chromatin effects: rs178850 (p-value of 2.02x10 -8 , in the intronic region of RAB11FIP4 gene, identified by three-way dependency: <Adaptive, lambda, SNP>); and rs6672510 in the intronic region of PLD5 gene (4.46x10 -8 ; identified by two-way dependency; <lambda, SNP>) (Fig 6). Both variants are indicated as having intra-chromosome interactions using the chromatin interaction mapping data.
It is interesting that while RAB11FIP4 gene expression is not exclusive to the brain, its expression in brain tissues is higher than in all other tissues reported by GTEx database. RAB11FIP4 has the highest expression in cortex and frontal cortex. Note also that the neighboring gene NF1 (containing the oligodendrocyte myelin glycoprotein gene, OMG) is well known to affect neural growth, and is highly expressed in brain and thyroid. As illustrated in Fig 6A, eQTL mapping of RAB11FIP4 gene identified three genomic regions in chromosome 17, MIR4724, CTD-2349P21.9, and RHBDL3. The non-coding micro RNA MIR4724 is involved in post-transcriptional regulation of gene expression. The non-coding transcript CTD-2349P21.9 is highly expressed in several brain tissues-it has the highest expression in cerebellar hemisphere and cerebellum (responsible for coordination and voluntary movement) compared to all 53 tissue types reported in GTEx database. A similar pattern is observed for RHBDL3 gene. Expression of this gene is the highest in brain tissues, particularly in frontal cortex and cortex (responsible for cognition, memory, and language).
Additionally, four genes, UTP6, CTC-542B22.2, COPRS, RP11-848P1.5, linked via chromatin interaction to the RAB11FIP4 locus, were shown to have high expression profiles in all brain tissues (https://gtexportal.org). Taken together all these relationships point strongly to their important role in neurological and growth development in early stages of life.
High expression levels in several brain tissues in GTEx database are also observed for PLD5 gene, with the highest expression in cerebellar hemisphere second to aorta tissues followed by cerebellum. Additionally, chromatin interaction pointed to five genes-AL590483.1, ZBTB18, EXO1, CHML, KMO-with expression levels in several tissues of the brain. Two of these genes, AL590483.1 and ZBTB18, are also expressed highest in cerebellar hemisphere and cerebellum. Expression in cerebellar hemisphere and cerebellum points to the potential role of these genes in movement and activity, fully consistent with our finding of its pleiotropic effect on neurological and growth phenotypes.
While the wide range of information that is integrated here suggests several intriguing conclusions, primarily that the brain-specific, or fetal/placenta-specific character of most of the implicated genes points to brain development as central to growth and infant neurological development, the outstanding weakness of this study is that, to our knowledge, there is no comparable data set that can be used for cross validation. While the number of GUSTO subjects is substantial, it was not statistically sufficient, resulting in some of the candidate relations included in the collection of evidence to be on the border of significance when considered alone. The arguments in favor of collecting a large number of candidate relations, including those that are borderline significant, are substantial if any patterns can be ascertained. This integration of the evidence from our analysis and the knowledge from previous work has allowed us to consolidate such a body of evidence related to neurological development and fetal and early childhood growth in healthy infants that should provide the basis for many future investigations. The results of this study thus represent an initial effort to implement multi-variable genetic analyses to generate a collection of genetic results that can be marshalled to form specific biological hypotheses that need further examination. Further studies will need to provide some validation from independent data sets, as well as capturing existing biological evidence of developmental pathways involving the identified gene candidates and regulatory networks.

Description of key features of the GUSTO data.
The GUSTO study of Singapore is the one of the most comprehensive birth and parent-offspring longitudinal cohort studies. It focuses on phenotypic measurements, genetic and epigenetic observations and medical records with detailed study from gestation through the early years of the child's life [3,4].
The primary purpose of the GUSTO cohort study is to evaluate the role of developmental factors and influences, including genetic and environmental factors, that affect growth and health. The other objectives are to identify maternal effects on offspring and association with early lifestyles and nutrition that may influence growth and neurocognitive development.
The GUSTO study is an ongoing cohort study that began in 2009. The pregnant women aged 18 years and above were recruited when they attended their first trimester antenatal dating ultrasound scan clinic at Singapore's two major public maternity units, the National University Hospital (NUH) and the KK Women's and Children's Hospital (KKH) between June 2009 and September 2010. The mothers had to be Singapore citizens or permanent residents with Chinese, Malay or Indian ethnicity and homogeneous parental ethnic background, intending to deliver in NUH or KKH and to reside in Singapore for the following 5 years. Mothers receiving chemotherapy, psychotropic drugs or who had type 1 diabetes mellitus were excluded from the study. The women also agreed to donate birth tissues to the study at delivery, i.e., cord blood, cord, and placenta.
Ethics approval and consent. Written informed consent was obtained from all women who participated in the study. Approval for the study was granted by the ethics boards of both KKH and NUH in Singapore. These boards are the Centralized Institute Review Board and the Domain Specific Review Board, respectively.
The recruitment of the mothers for GUSTO cohort study was completed in September 2010. 1,163 pregnant women were recruited: 56% of parents were Chinese, 26% were Malay, and 18% were Indian. The women were on average 30 years old, ranging between 18 and 46 years.
Women recruited in the first trimester returned to the hospital at 19-21, 26-28 and 32-34 weeks of gestation for ultrasound scans to assess gestational age and growth. Detailed interviews were conducted in the clinic at the time of recruitment, and at about 26-28 weeks gestation. Birth tissues were obtained, and anthropometric measurements of the newborn were conducted within 24 hours of birth. During infancy, the babies were examined at home at 3 weeks, 3 months, and every 3 months thereafter until 15 months of age. The children were then evaluated at the clinic at 18, 24, and 36 months, and the Bayley scale scores used in this work were acquired at 24 months.

SNPs, Phenotype data, and growth parameter data used for the analysis-results elucidated in the GUSTO study.
The acquired genotype data consists of 933,886 SNPs from 1,073 infants and parents, as previously reported. The phenotype data consists of 10,378 features from 1,237 infants and parents, which include ethnicity, gender, anthropometric measurements, socioeconomic measurements, and neurological phenotypes such as Bayley scales of infant and toddler development, Brief Infant Sleep Questionnaire (BISQ), Child Behavior Check List (CBCL), and Infant Toddler Social Emotional Assessment (ITSEA). The phenotypes data, of particular note the Bayley scale data, were collected by professionals.
In this paper, we only focused on the infant data. Moreover, in this paper we only consider three types of information: genotype (SNPs), neurological, and growth data. The neurological data consists of the following subsets: where categorical features are qualitative variables that take on non-numerical values (words). All neurological data used in this paper was measured from 6 months to 48 months. Furthermore, when analyzing the genetic component of the infant development, we used only 5 aggregate or composite Bayley scales: Cognitive, Language, Motor, Adaptive Behavior, and Socialemotional, measured at 24 months. We simply refer here to Bayley phenotypes when directly referring to these 5 aggregate scales.
The growth data we used consists of three parameters of Gompertz-like growth model fits that describe fetal head circumference growth as a function of gestational age. The growth parameters were available for 1,053 infants (see Section 4.2.3).
We analyzed multiple pairwise and three-variable dependencies. For each type of dependency, we only used infants with values in all analyzed variables, and vice versa, we removed genetic variables (SNPs) with missing values (see Section 4.2.4). The following is the summary of the different types of dependencies we analyzed and the corresponding number of SNPs and samples used in each type:

Preprocessing of data
The flow of the data analysis using Delta measures is shown schematically in Fig 10. There are three principal stages of the analysis: Preprocessing, Delta Computation, and Statistical Evaluation, leading to genetic candidates. Note that by "gene candidates" we refer to nearby genes to the implicated SNPs. Although we use the closest gene to a SNP to indicate a locus, in each case we also examine the region of the genome to determine if there are other nearby genes of interest.
In the preprocessing stage, we generate input data files for the Delta software from the raw data of SNPs, neurological phenotypes, and growth parameters. The input data must be discrete, represented by positive integers, so all continuous data must be binned. To account for specific structures and properties of data subsets, SNPs, neurological phenotypes, and growth parameters were preprocessed individually.

Preprocessing SNP data.
The genotype data for the infants of the GUSTO project consists of 933,886 SNPs, obtained using the Illumina Omniexpress & exome array. SNPs with call rates < 95%, or minor allele frequency < 5%, or those that failed Hardy-Weinberg Equilibrium test were excluded from the analysis. Out of 1071 infant subjects, 2 infants with no genotype information were removed. To prepare the data for our analysis, we reduced the amount of redundancy and gaps in the SNP data. Fig 11 summarizes the SNP filtering steps.
Preprocessing of genotype data starts with the removal of constant SNPs that show no variation among all infants in the study (Step 1). At Step 2, the completely correlated SNPs, i.e., SNPs that are in complete linkage disequilibrium, are "collapsed" for the dependency analysis, keeping one SNP per correlated group of SNPs. Note that these correlated SNPs are omitted only at the stage of dependency detection, since they do not add any more information about dependency and put back into the analysis once the candidate dependencies have been detected. Preprocessing continues with Step 3 by removing SNPs with more than 25% of missing values. At Step 4, we compute mutual information for all pairs of SNPs and "collapse" SNPs with high (over 1.2) mutual information, keeping one representative of each mutual information cluster. At Step 5, we remove SNPs with extreme distribution of genotypes, which are the SNPs that show no variation in more than 95% of infants. In our analysis, because of the gender differences in the SNP variables and the potential differences between male and female growth rates, we decided to eliminate effects of gender; that is, we looked only for those effects that were common, and therefore Step 6 of the preprocessing removes the SNPs from X, Y chromosomes. Fig  11 shows the number of removed or collapsed SNPs at each step after all the preprocessing steps. There are 557,070 SNPs in 1071 infants selected for the dependency analysis.
We performed the dependency analysis separately for SNPs sets from X, Y chromosomes, including the pseudoautosomal regions, which were removed during preprocessing at Step 6 for male and female infants. The number of male and female subject as well as the number of SNPs used in each two-way and three-way dependency analysis are provided in S5 Table. Fig  12A and 12B shows the distributions of growth and Bayley phenotypes divided by male and female infants used in the dependency analysis. The differences in the distributions between males and females were observed in growth phenotypes linf and lambda. However, no

Preprocessing neurological data.
The original phenotype data contains 10,378 features, consisting of categorical (taking on a small set of word answers to questionnaires) and numerical phenotypes. The phenotype data includes various observations such as anthropometric measurements and questions about child environment at different time points for 1,237 individuals. We focused on neurodevelopmental data of each child, consisting of Bayley scales, BISQ, CBCL, and ITSEA. Preprocessing of categorical and numerical neurological phenotypes was done separately. In this study we only used the Bayley scale data.
During preprocessing, we removed categorical phenotypes with more than 10 different categories. Having variables with too many possible values strongly weakens the statistical power of dependency detection. To convert the categorical phenotypes into the input for the dependency analysis, their values were encoded using integers.
To preprocess the numerical phenotypes, we examined the distributions of values and it was commonly appropriate to discretize their values into four bins as follows: (-1,l], (l,μ], (μ, r], (r,1), where μ is the mean and l and r are the medians of the values below and above of the mean. Each bin was encoded with an integer. Similar to Step 5 of the SNP preprocessing (Fig  11), categorical and numerical phenotypes that have the same value in more than 95% of infants were removed.

Preprocessing growth data and the growth model description.
A non-linear mixed effects (NLME) Gompertz-like model was used to fit the growth data (head circumference) to obtain three parameters characterizing growth of each individual subject. These describe the final growth limit for head circumference (linf), nonlinearity of head growth deceleration at around 20 weeks gestation (lambda), and an early velocity of head circumference (alpha). The growth of each infant is represented by these three parameters. We further discretize these growth parameters using the same approach as in the case of numerical neurological phenotypes.
Model description. Subject head circumference growth trajectories from early pregnancy to early childhood (ended at 54 months) were characterized using the following NLME model This functional form was selected from among several candidate models based on the AIC and examination of residual plots. As in the standard three-parameter Gompertz model [75], fixed-effect (population-level) parameters are used to characterize the growth limit (L p ), ratio of lower to upper limit (α p ), and growth rate (β p ) for the entire population. To this model, two more fixed-effect parameters are added to characterize the rate (θ p ) and nonlinearity (λ p ) of the deceleration in growth rate that begins around 20 weeks gestation. Finally, to account for variation in growth trajectories between subjects, subject-level effects were inferred for each subject for these parameters: growth limit (L i ), ratio of lower to upper limit (α i ) and nonlinearity of growth deceleration (λ i ), (linf, alpha, lambda). The subject level random effects were assumed to be uncorrelated with each other or with the error term (ε ij ).
Parameter Estimation for growth model. The model was fit to head circumference data measured on subjects from four birth cohort studies included in the Bill and Melinda Gates Foundation knowledge integration database. The combined dataset included a total of N = 11,818 subjects, each contributing between 2 and 18 measurements of head circumference, measured between 8 and 290 weeks post-menstrual age [77]. Note that a 5-parameter growth curve was fit to the entire population and allowed for subject-specific deviations from the population mean curve in 3 parameters (linf, alpha, lambda) as shown in Table 10. After that the individual estimates for these 3 parameters across 1,191 subjects of GUSTO cohort study were used as the input growth data in our downstream dependency analysis. Calculations and fitting to the individual infant data were completed using the nlme package [78] in the R statistical software (https://CRAN.R-project.org/package=nlme [79]).

Missing data: Selecting optimal subsets by linear programming.
Once the data preprocessing is complete, we needed to face the problem of missing data among the variables. When performing the dependency detection using Delta software, missing data can cause significant fluctuations and decrease reliability of the results. We therefore selected subsets of data that reduce the number of missing values while keeping as many subjects and variables as possible. Many clustering and bi-clustering methods are suitable for this task, but we decided to use a simple linear programming optimization method.
Linear programming is a method for optimization of a linear objective function z, subject to linear inequality constraints allowing us to maximize the number of variables and subjects while minimizing the amount of missing values (assuming all missing values are encoded as -2) as follows: where n is the number of subjects, m is the number of variables, b ij is a binned value of a variable j in a subject i, M j is the number of missing data in a variable j, and τ is a threshold for missing values for each variable. Constraint (2) ensures that we remove a subject if each of the m variables has missing values. Note that constraint (2) assumes that a missing value is represented by -2 and all other values are non-negative, therefore, if a subject has at least one variable with non-missing (nonnegative) value (out of m variables), then the sum of binned values of all variables would be higher than -2m. However, if values of all m variables are missing for a subject, the sum of binned values is equal to -2m. Constraint (3) ensures the amount of missing data (the fraction M j /n) is below a certain level (τ) for each variable. In order to find the solution to our linear programming problem, we need to provide the value of τ. During the analysis of the pairwise dependencies between the neurological phenotypes and the growth parameters, we used τ = 25%. For the analysis of all other types of dependencies τ was set to 0.
Using our linear optimization method, we selected optimal subsets of variables and samples to be used in the dependency analysis, as described in Section 4.3.

Multi-variable dependency using information theory methods
Biological data is filled with various dependencies since it is obtained from complex systems with many interactions. Therefore, we need detection of multivariable dependencies of diverse kinds in order to effectively analyze biological data. We have recently introduced an information theory-based set of dependency measures and implemented the discovery of multivariable dependencies in a large set of variables capitalizing on a distinct advantage of separating the detection of the dependence from defining the nature of the dependence [13][14][15]. In general, information theory measures have several advantages: they are inherently model-free and non-parametric in nature, and they exhibit only modest sensitivity to undersampling [80]. We have described these methods in several papers previously [13][14][15] and will briefly summarize the methods here for up to three variables, which is the maximum number used in this paper.
Our information theory-based method iteratively searches through a set of variables (e.g., SNPs, growth parameters) and systematically detects strong dependencies with increasing degree, starting with the pairwise dependencies, then three-variable dependencies, and so on. In this paper, we limited our method to only pairwise and three-variable dependencies. To measure a general dependence between two variables, X and Y, we use mutual information I (X,Y), defined as IðX; YÞ ¼ HðXÞ þ HðYÞ À HðX; YÞ; ð4Þ where H(X) and H(Y) are single entropies of variables X and Y and H(X,Y) is their jointentropy.
To measure a general dependence between three variables, X, Y, and Z, we use symmetric delta � DðX; Y; ZÞ. Before providing the definition for the symmetric delta, we need to introduce interaction information, which was proposed long ago as a multivariable generalization of mutual information [80]. For three variables interaction information I(X,Y,Z) is defined as IðX; Y; ZÞ ¼ IðX; YÞ À IðX; YjZÞ: ð5Þ Given Eq 5, we define differential interaction information Δ as the difference between values of successive interaction information arising from adding variables: Here Δ X is called the asymmetric delta for the target variables X. In order to detect a fully cooperative dependence among the variable set, we want any single measure to be symmetric.
As a result, we define a more general measure � D, called the symmetric delta (or delta), by multiplying Δ with all possible choices of the target variable: The key property of the symmetric delta is that if any of the three variables is independent of the others, then the measure is zero. Note that although we focus on three-variable case here, this definition can be generalized to any number of variables.
Note also that the asymmetric deltas are related in a subtle way: where O is the multi-information, called total correlation when introduced by Watanabe [ The dependency can sometimes be usefully interpreted as a relation among variables described as logical functions such as AND, OR and XOR, albeit for more than a binary alphabet. The Delta can effectively detect an XOR type of function, for example, which has no pairwise dependence and therefore no mutual information between any pairs of its variables.
The Delta measures and the methods for optimally computing these measures across large data sets have been implemented in software we refer to here as the "Delta software".

Delta p-values by permutation test.
In order to estimate the significance of the dependency scores calculated by Delta software, we carried out a permutation test by generating randomly shuffled input files and examining the distributions of resulting scores. We used two criteria for analyzing the statistical significance: (1) its information score's p-value, and (2) whether the score is above or below the threshold calculated from the maximal random scores. The information scores obtained as described below were both those not adjusted for ethnic confounding and the adjusted values (see section 4.4.2) To obtain unadjusted p-values of dependency scores we follow the permutation strategy proposed by Churchill and Doerge [82]: we shuffle the input data, breaking the connections between variables, compute the dependency scores of all shuffled tuples, and count how many randomized scores are above the original score of interest. We repeat this procedure 1000 times tallying the number of scores above the score of interest. The p-value is then the fraction the exceeding randomized scores take in the total number of tuples times 1000. Note that when determining the statistical significance of pairwise dependencies, we permuted the values of the phenotype variables thus breaking the phenotype-SNP relationships in the data, while at the same time preserving all linkages between SNPs. Similarly, for three-variable dependencies, we independently permuted the values of Growth phenotypes and Bayley phenotypes, thus breaking the relationships not only between phenotype and genotype variables, but also between Growth and Bayley phenotypes. Note that these p-values are unadjusted, not accounting for multiple hypothesis testing.
Although we decided not to perform conventional multiple hypothesis testing, since the goal of our paper is not to search for causal SNPs, we acknowledge that due to the large number of SNPs in our analysis the number of false positive may be high. To shed some light on the amount of false-positives in our analysis, we followed the approach of Churchill and Doerge [82] and calculated Family-Wise Error Rate (FWER). To calculate FWER, we find an absolute maximum randomized score for each shuffle, construct a distribution of 1000 maximum randomized scores (since we performed 1000 shuffles), and find how many scores from this distribution are above the original score of interest.
FWER is the probability of getting at least one false positive result given a large number of comparisons we made (the number of SNPs in the analysis), so it is highly conservative and it is not a surprise that many of our top dependencies have a high value of FWER. Nevertheless, a number of our results showed low values of FWER. Furthermore, we used FWER to compare the confounding effect of ethnicity on our pairwise dependencies (see the next section).

Ethnicity of subjects and possible confounding effects.
Since the subjects selected for data analysis are of three ethnicities, and it is clear that there are allele frequency differences among the Chinse, Indian and Malay populations, it was necessary to examine the possibility of confounding effects of the ethnic differences on our genetic results. Because the number of subjects is rather small, we divided the subject population into two groups: (1) Chinese, and (2) Indian + Malay. The full population of infants with no missing Bayley phenotypes had 433 subjects, 258 of which were Chinese and 175 were Indian or Malay. We then used a binary variable ε i indicating ethnicity (Chinese or Indian-Malay) to examine the relevant conditional probabilities, and determined the confounding effects of the ethnicity on the mutual information measures. The probabilities sum to one: P(ε 1 ) + P(ε 2 ) = 1, For the set of SNPs, S, and the set of Bayley phenotypes, B, and The adjusted mutual information then is pðsÞpðbjsÞlogðpðbÞÞ Table 11 shows the values of adjusted mutual information for the dependencies from Tables  2 and 3.
The changes for most of the p-values are rather small, some almost non-existent. However, we note that there are several that improve modestly upon adjustment, and a couple, CERS6 and LRRTM3 for which the confounding effect is dramatic, and significantly reduces this dependency.

Variant annotation.
Functional annotation was carried out using Ensemble Variant effect predictor (VEP; https://www.ensembl.org/vep) for two-way and three-way dependency sets. VEP determines the effect of genomic variants on genes, their transcripts, and protein sequence, as well as regulatory regions. Additional identifiers for each variant generated by VEP includes information such as gene symbol, variant specificity (such as exonic, intronic, UTRs), splice site (donor/acceptor), transcription factor binding sites, synonymous codon changes and frameshift variants.

Regulatory analysis.
The majority of the identified variants in the two-way and threeway dependency sets are located in the non-coding regions of the genome including intronic, intergenic, upstream or downstream from genes and in 3' and 5' UTRs. We examined their potential effect on regulatory functions using RegulomeDB [31] (http://www.regulomedb.org). Regulo-meDB, an integrated database, includes all available ENCODE transcription factor ChIP-seq, histone ChIP-seq, FAIRE, and DNase I hypersensitive site data [83], transcription factor ChIP-seq data available from the NCBI Sequence Read Archive [84][85][86][87][88][89][90][91][92] as well as a large collection of eQTL We show the population wide mutual informations and corresponding p-values, with the adjusted mutual informations and p-values, followed by the FWER value. The three sets of SNP-phenotypes separated by thick lines are those for which: upper-the absolute change in p-values is less than 25% of the original; middle-the change in p-value is less than 100%; lower the change in p-value is greater than 100%. Note that there are some dependencies for which the p-values improve on adjustment for ethnicity confounding.

Genotype-Tissue Expression identification.
Since the phenotypes of interest in our study were neurological and growth related, we examined our set of prioritized gene candidates using the Genotype-Tissue Expression (GTEx v7; http://gtexportal.org) portal, a catalogue of tissue-specific and shared regulatory expression quantitative trait loci (eQTL) variants. This allowed us to acquire additional information on healthy human gene expression patterns in multiple tissues [103]. The output includes the levels of gene expression across all tissue types as well as within tissues, some of which are of interest as they are involved in neurological and growth processes such as expression levels in skeletal muscles, tibial nerve and multiple tissues of brain. GTEx database (http://gtexportal.org) captures the estimated effect size of an eQTL allele on gene expression, which allows for identifying genes, whose expression is affected by genetic variation, providing information on variant's potential involvement in phenotype.

Regulatory interaction network analysis.
The NetworkAnalyzer package [104][105][106] available in Cytoscape v.3.3.0 (http://www.cytoscape.org) was used for clustering and visualization of both regulatory and direct protein-protein interaction network. No notable results from this are reported. NetworkAnalyzer allows for computing a set of graph parameters for undirected and directed networks. In particular, we used betweenness centrality for clustering and visualization of both regulatory and direct protein-protein interaction networks.

Functional mapping and annotation of genome-wide association studies (FUMA).
To further evaluate the candidate set of variants identified by our two-way and three-way dependency analysis, we used FUMA v1.3.3c [35] (http://fuma.ctglab.nl), which uses a set of statistically significant SNPs as an input and provides functional annotations. FUMA uses data from positional mapping, including eQTL mapping, and 3D chromatin interaction mapping (Hi-C for 14 human tissues including prefrontal cortex and hippocampus), to predict potential regulatory effects from chromatin states at the position of the SNP of interest, and MAGMA gene expression analysis [35], selecting 53 tissue types from GTEx [34]. We used default parameters during the analysis. Out of the set of resulting SNPs, we focused on two interesting loci located in RAB11FIP4 and PLD5 (see the results section and Fig 6).
Supporting information S1  Growth model parameters (linf, lamda and alpha) and Bayley neurological phenotypes for both pair-wise and three-variable dependency. SNPs that occur in more than one category (gender or phenotype) are identified with the same color. The "exm" SNPs that were excluded from the downstream analysis are highlighted in red. (XLSX) S5 Table. Preprocessing of SNPs in X and Y chromosomes. (XLSX) S6 Table. The RegulomeDB score for 148 candidates associated with either neurological, growth parameters or both using Mutual Information and Delta3 dependency measures respectively. Variants that returned "no data" are not included. (XLSX) S7 Table. Tissue specific gene expression evidences using GTEx and regulatory function scores from RegulomeDB for the candidate SNPs. The SNP rs178850 has the best p-value (2x10 -8 ) of those identified in a three-way dependency with Adaptive (Bayley phenotype) and lambda (growth phenotype). This SNP is located in the intronic region of RAB11FIP4 gene on chromosome 17 and is also very close to NF1 and OMG, which is located within NF1. All three of these genes affected by rs178850 are highly expressed in brain.