Network Modeling of Crohn’s Disease Incidence

Background Numerous genetic and environmental risk factors play a role in human complex genetic disorders (CGD). However, their complex interplay remains to be modelled and explained in terms of disease mechanisms. Methods and findings Crohn's Disease (CD) was modeled as a modular network of patho-physiological functions, each summarizing multiple gene-gene and gene-environment interactions. The disease resulted from one or few specific combinations of module functional states. Network aging dynamics was able to reproduce age-specific CD incidence curves as well as their variations over the past century in Western countries. Within the model, we translated the odds ratios (OR) associated to at-risk alleles in terms of disease propensities of the functional modules. Finally, the model was successfully applied to other CGD including ulcerative colitis, ankylosing spondylitis, multiple sclerosis and schizophrenia. Conclusion Modeling disease incidence may help to understand disease causative chains, to delineate the potential of personalized medicine, and to monitor epidemiological changes in CGD.


Introduction
Crohn's disease (CD) is a complex genetic disorder presumed to result from the interplay between susceptible genotypes and (still unknown) environmental risk factors in a given individual. Patients typically suffer from chronic diarrhea, abdominal pain and weight loss. CD seems to reflect a loss of immune tolerance of the host toward bacteria present in its digestive tract [1]. Several cell types present in the intestinal mucosa contribute to CD pathophysiology including epithelial cells, dendritic cells, lymphocytes, etc. As a whole, CD is characterized by an intestinal barrier dysfunction, an inflammation of the mucosa containing Th1/Th17 orientated T-cells and the development of fibrosis.
To date, genome-wide association studies (GWAS) have identified more than 140 CD susceptibility loci, which allowed the identification of biological pathways centrally involved in the disease [2]. The associated polymorphisms do not usually alter the peptidic chain of encoded proteins [3,4] but rather affect regulatory DNA sequences [5,6]. Most of the disease-associated polymorphisms exhibit odds ratios (OR) lower than 1.5 (ref 2). Search for common copy number variations through the genome reported limited significant associations [7]. Epistasis was also limited [8]. Finally, mutations with a strong phenotypic effect have been reported but in only a small number of CD patients with very early onset [9][10][11]. Thus, at the opposite of classic Mendelian disorders, all these findings support a diffuse causality in CD [12].
Diffuse causality is a well-known property of networks, here the biological network formed by CD susceptibility gene products [1,2,13]. It is now acknowledged that in many cases a network model is suitable to describe living systems. In such biological network models, physiological functions or molecular pathways are associated with network modules assumed to act independently [14,15]. The (patho-) physiological status of an organism may then be defined by the activity status of all the functional modules at a given time. We accordingly developed a network-based model of susceptibility to CD and derive from it an expression for the disease age-specific incidence rates and disease propensity of at-risk alleles.

A disease network model with functional modules
For a given disease, we assume that only a limited number (N) of functional modules are pertinent to the disease status. In the case of CD plausible candidates are, for example, Th1/Th17 orientation of lymphocytes, intracellular autophagy or bacterial sensing [1]. This set of modules forms a sub-network (referred here to as the "CD network") of the whole human biological network. The state of each module is either permissive for CD or protective against CD. The disease is assumed to occur when all of the modules involved in its pathology are in a permissive state (Fig 1).
The structure and activity of each module depend on environmental stimuli to which the organism is exposed. They are also influenced by constitutive structural and regulatory variations within the genes. Stochastic events may also contribute to the ontogenesis and activity of each module. As a whole, a functional state of a module must be seen as the first level of integration of the gene-environment interactions. A consequence of this model is that disease-associated risk alleles (respectively environmental risk factors) contribute more or less to the propensity of a particular module to become morbid, but they usually do not determine it entirely. The module structure and activity may thus vary from one individual to another, even among monozygotic twins. This feature agrees with the relatively low disease concordance rate among monozygotic twins in CD [16].
A delayed occurrence of the disease is the rule for many complex genetic disorders. As an example, CD usually occurs in young adults and appears exceptionally in the first years of life, indicating that -at least some-modules do not function in a disease-permissive state at birth. Many hypotheses may be invoked to explain this finding, including cumulative effects impacting the module function with time (e.g. immune response to enteric infections), exposure to environmental factors in adulthood only (e.g. cigarette smoking or alcohol consumption), the absence of specific modules in childhood (e.g. underdeveloped Peyer patches in the gut or absence of sexual hormones before puberty), etc. Whatever the causes, we thus assumed an ontogenetic period for the functional modules.
For simplification, we shall postulate below that each module is initially in a protective state but a more general model is presented in S1 File. Stabilization of the modules in a mature state occurs along the development of the organism. Whether the stabilized state is permissive or protective depends on both environmental exposures and structural or regulatory genetic variations (Fig 1). As CD is normally a life-long disease, the mature modules are assumed to stay in the adopted state for long periods. However, environmental and stochastic events may ultimately affect the functionality of the modules, with the possibility of subsequent conversion at a low rate. To model the evolution of the network activity after the ontogenesis of the modules, we adapted the model of organism longevity proposed by Gavrilov and Gavrilova and inspired from a general theory of system failure [17]. According to this model, death is a consequence of the aging of a network built with non-aging elements. By analogy, disease is viewed here as a consequence of the stochastic switch of a module from a protective state toward a disease- These elements may be proteins, DNA regulatory sequences, small RNAs, metabolites, etc. Links denote physical or biochemical interactions and the circle size is proportional to the connectivity of the corresponding functional element. Nodes contributing to specific functional modules are represented by different colors (here the disease network is composed of four modules: M1 to M4). As an example, Nod2 is a node of the innate immune response module. Grey elements connect the different modules. Elements of the global network that are not involved in CD-associated functions are not represented. (b-c) Due to genetic, environmental and stochastic events, each module is in a protective (Pr) or permissive (Pe) state. (b) Most of the possible combinations of the functional states of the N modules are associated with health (here a single healthy combination is depicted) while (c) only one (or few) results in the CD phenotype. The protective/ permissive states of each module are the result of many factors. Long-term environmental exposure may alter some modules (e.g cigarette smoking which may affect the intestinal permeability, module 1). Genetic mutations may also be deleterious for a given module (e.g. ATG16L1 mutations and autophagy, module 2). External factors may divert a functional element to alternative modules (e.g. the Yersinia effector YopJ affects NOD2 induced NF-kB activation in favor of interleukin-1b secretion, module 3). Finally stochastic events may also affect the structure and function of the modules with functional consequences (module 4).
permissive state (note that the inverse change could also be seen with a consequent disease cure but these very rare cases are neglected in the following developments).
The state of a module at a given age can thus be modelled by a continuous Markov process with 3 states, as diagrammed in Fig 2. Transition of a module to its mature state is assumed to follow an exponential process with parameter 1/τ i , while subsequent state conversion occurs at a constant rate 1/T i for the i th module, with T i far larger than τ i . The mature state may be permissive or protective, with probabilities F i and 1-F i respectively. The probability F i is referred to as the module disease propensity (MDP). On these bases, we derived the reliability R i (x) that the module M i is still in a protective state at age x and the probability that CD arises at a given age (S1 File).

Calculation of age-specific incidence rates
Under the assumption that each module is initially in a protective state, the probability that CD arises before age x can be written: This general model has 3N+1 parameters, where N is the number of modules. To reduce the number of parameters and avoid over-fitting of the data, we made a so-called "homogenization" or "mean-field approximation", whereby the values τ i, T i and F i are replaced in each module (i.e. for each i) by their respective geometric means τ , T and F. The probability that CD Module activity presented as a Continuous Markov Process. At birth each module is in a naive or immature state. Over time the modules stabilize into a state that can be protective or permissive for CD. The change of state is assumed to be an exponential process with rate parameter 1/τ i and it may be towards a permissive or a protective state with probabilities F i and (1−F i ) respectively. Two types of modules are considered (S1 File). In the upper panel the module is protective in its naive state. We also allow for failure of the protective state 2 to the permissive state 3 as an exponential process with rate parameter 1/T i , assumed to be far slower than the initial process. In the lower panel we show modules that are permissive in their naive state. We allow failure from the permissive state to the protective state with rate parameter 1/T i . The corresponding transition matrices for the Markov Processes are shown on the right of the figure Diamond: unstable or naive state, Square: stable or mature state, Yellow color: permissive state, Purple color: protective state.
doi:10.1371/journal.pone.0156138.g002 occurs before age x for this 4-parameter model is then written: The age-specific incidence rate of CD is, by definition, equal to which can be written as: This equation predicts an exponential increase followed by a peak and a slow decrease at advanced ages, as generally observed for age-specific CD incidence curves.

Impact of genetic polymorphisms
We also investigated how OR of the disease-associated alleles, measured in GWAS studies, are related to our model. In the above analysis, MDPs were defined as averages over the whole population, notwithstanding genetic polymorphisms. To go beyond this simple analysis, we considered a polymorphism α with two alleles, one protective, α P , and the other, α R , at-risk for CD. The frequency of the at-risk allele α R in the population (risk allele frequency, RAF) is denoted p α . For each module M i , we introduced the MDP F i (α R ) over the subpopulation carrying the allele α R , (i.e. either the homozygote (α R, α R ) or heterozygote (α R, α P ) genotypes). In addition we denoted F i (α P ) the MDP over the rest of the population, which genotype was (α P, α P ). We assumed that a given genetic locus α predominantly affects a single module M i(α) among the N modules of the CD network, and thus simply denoted F(α R ) the MDP of this module. For rare diseases like CD (for which the OR can be approximated by the relative risk), the OR of the at-risk variant at locus α can be expressed (S2 File) as a function of p α and F(α R ):

Results
Fitting the age-dependent incidence curves for CD Extensive fitting of the 4-parameter non-linear model to published data, using a quasi-Newton method to minimize squared residuals, gave excellent fits to the data, with the model explaining about 98% of the variance. Several sets of parameters well fitted the tested data sets with values of τ, N, F and T ranging respectively from 7y to 12y; 6 to 24; .38 to .79 and 1150y to 26300y. In all the tested data sets a model with 12 functional modules, with an expected mean time to stabilization τ of 8 years, gave among the best fits (Fig 3). This consistency was observed among sexes in a population-based registry from Northern France [18]. It was also observed in countries exhibiting very different disease prevalence rates (may be except for the oldest people in one dataset). In Sweden [19], where the disease is ancient and frequent, the values of F and T were slightly lower than those observed in France (F = 0.59 and T = 830 years) while in Korea [20], where the disease is rare and recent, the value of F was lower (0.53) with a ten-fold higher estimated value of T (12.300 years). Since the number of biological modules and their time of maturation are likely to be constant between populations, constancy between data sets was reassuring and we fixed these parameters (N = 12, τ = 8) in subsequent analyses. Of note, the best values of F were higher than 0.5 indicating that, on average, a module more often adopts a disease-permissive state once stabilized. Large values of T confirmed that the functional state of a module is a persistent life-long status in most people.
Of course, the model per se does not provide any information on the function of the module. However, in the report of the largest GWAS for CD, between 10 and 14 functional modules have been derived from genetic analyses 2 : inflammatory response, defence response to bacteria, IgG binding, innate immune response, T cell co-stimulation, B cell receptor signalling, cytokine-mediated signalling, interferon gamma-mediated signaling, T cell receptor complex, T cell activation and autophagy, ubiquitination and NF-kB, TGFβ signaling, and RORγt. It thus appears that the number of 12 modules proposed here is in good concordance with the literature.

Consequences of environmental changes
CD incidence increased significantly during the 20 th century in Western countries and many authors agree that this increase was caused by an environmental change associated with the modern occidental way of life [21]. Looking at data from Olmsted County Minnesota from 1950's to 1980's [22], we observed that the proposed model with N = 12 and τ = 8y remained valid in most cases, allowing to adjust T and F values for each decade (Fig 4). The obtained value of F increased from 0.51 to 0.63, consistently with the increased incidence of CD. T also increased from about 350y to 1000y suggesting that at the beginning of the outbreak, the occurrence of environmental risk factor(s) temporarily destabilized the functional modules toward the disease-permissive state with a subsequent transient decrease of T. Later, exposure of the whole population resulted in a re-stabilization of the modules with higher MDP values and again large T values. Age-specific CD incidence rates observed in several populations, compared with model predictions. Parameters τ, T, N and Φ were first estimated from several published age-specific incidence curves; the values of τ = 8 years and N = 12 were retained for the model and the estimated values of T and Φ updated. Fitted data sets in a) French male population-based register [18], b) Females from Northern France [18], c) Sweden [19] and d) Korean males [20]. Reported data are shown as red dots while the fitted theoretical curves are in blue. To further explore these findings, we investigated the impact of new environmental risk factor (s) on the age-specific incidence curves in our model (S3 and S4 Files). We assumed that an increasing proportion of the population was exposed to the new environmental risk factor(s) and computed the evolution of the modeled age-specific incidence curves for the decades around the time t 50 representing the moment where half of the population has been exposed to the risk factor(s). We used the parameters derived from the preceding analyses (τ = 8y, N = 12, F before = 0.51, F after = 0.63). Assuming a stable environment before and after the transition, T was set identical before and after the environmental changes (equal to the present value of 1240y). The parameter reflecting the duration of the transition did not notably affect the curves for a wide range of values (not shown). Under these conditions, the curves displayed an increasing incidence peak between the ages of 20y and 30y, which stabilized about 30y after t 50 (Fig 5A). However, the effect of the new environmental risk factor(s) was difficult to detect before t 50 . A second peak was observed forty years after t 50 in the oldest people, and disappeared after a few decades. This unexpected evolution of the age-specific incidence curves is in fact also observed in real long-term follow-up datasets [23].
Based on the computed age-specific incidence curves, we derived the annual incidence rates in the population from -20y to +80y around t 50 (Fig 5B). The delayed capacity to detect the impact of the environmental factor was confirmed: less than a quarter of the maximum annual incidence rate over time was observed at t 50 . The incidence increased until year 40 after t 50 with a small decrease thereafter. This secular trend was concordant with CD literature with often a global incidence increase during 3 or 4 decades followed by a small decrease [24].
Comparing the computed curves and the reported data on CD incidence during the 20 th century in Western countries, it was possible using our model to propose some dates corresponding to t 50 and then to speculate on putative risk factors. CD was reported in 1932 by Crohn and colleagues in New-York [25]. It initially developed in white, urban, middle-class people and then extended to the whole population. Population-based data with long-term follow-up suggest that a quarter of the maximum incidence was reached in the 40's in USA [22], in the 50's in Sweden [19], in the 60's in United Kingdom [26] and later in Southern Europe. At the same time, half of the population was equipped with a home refrigerator in these countries [27]. These observations further argue for the hypothesis of a role of refrigerated food in CD [28].

Impact of at-risk alleles
We also considered the impact of genetic variations on the fate of the modules of the CD network. The OR distribution corresponding to the dataset of 140 CD-associated risk alleles derived from GWAS displays a maximum value for OR%1.1 (ref 2). However, the lowest values of OR are unavoidably under-represented due to inherent limitations of GWAS statistical power. We thus corrected for this bias (S5 File) and obtained a plausible estimation of the exact distribution ν of significant ORs (Fig 6A). Then we established from Eq 5 an explicit relationship between this distribution ν, the RAF distribution ρ, and the distribution g of the variables F(α R ) over all loci (S6 File): According to this formula the propensities F(α R ) over all loci were very narrowly distributed in the vicinity of F (Fig 6B). Hence, the huge majority of all at-risk alleles are associated with Evolution of the age-specific incidence curves following an environmental change. a) Using the transition model described in Fig A in S4 File, age-specific incidence curves were computed for several decades before and after the transition time t 50 defined as the time of exposure of half of the population to a spreading environmental risk factor (it corresponds to reference 0 on the curves). Parameter values τ = 8y, N = 12, Φ 1 = 0.51, Φ 2 = 0.63, T = 1240y and T * = 350y were derived from other datasets (Fig 3) b) Temporal variation of the global incidence rates computed from -20y and +80y around t 50 .
doi:10.1371/journal.pone.0156138.g005 nearly the same MDP, close to F, with no value higher than 0.75 ( Fig 6B). Thus, at-risk alleles have each limited effects at the population scale, a finding which is in accordance with previous reports, even for the most at-risk alleles [29]. Of note, F(α R ) is a population average of the distribution of individual propensities. As a comparison, for an allele causing a Mendelian trait (i.e. a disease with a single module network), F(α R ) would be its penetrance. Interestingly, if for the huge majority of individuals, at-risk alleles have limited functional effects, this does not preclude the possibility that a very small fraction of allele carriers exhibits high individual MDPs corresponding to strong functional effects of the at-risk alleles.

Application to other complex genetic disorders
Complex genetic diseases are all characterized by an interaction between multiple genetic and environmental risk factors. For disorders mainly affecting the young adults, age-specific incidence curves most often resemble each other with an exponential increase toward a peak of incidence followed by a slower decrease of incidence in the oldest people (Fig 7). These diseases thus appear as good candidates for applying our model. (Note that, in contrast, for ageingrelated and degenerative disorders, the curves are most often monotonously ascending and do not reach a peak. If this finding does not argue for the use of our model it does not discard the rationale underlying the proposed model for these disorders. It may only indicate that the ageincidence curves are truncated before the peak of incidence due to life expectancy of human beings in case of ageing-related and degenerative diseases).
We fitted the age-specific incidence curves available for ulcerative colitis (UC) [18], schizophrenia [30], multiple sclerosis [31] and ankylosing spondylarthritis [32]. The values of τ fluctuated from 10 to 19 years while the values of N fluctuated from 6 to 17. Interestingly, for UC, a lower number of modules than for CD was predicted. This could be seen contradictory with the fact that CD and UC share most of their susceptibility alleles. However, despite common genetic risk alleles, several functional modules like autophagy or innate immunity seem to be specific to CD and may thus explain the discrepancy. Finally, and as expected, for all the tested chronic diseases, T was always large. Overall, these results suggest that the proposed model also applies to other complex genetic conditions.

Discussion
The model proposed here is based on the representation of biological functions as a modular network. The functional states of the modules are seen as random variables affected by geneenvironment interactions. The disease is then defined by a limited number of modules, each in a given at-risk functional state. Aging dynamics of the functional network allows explaining epidemiological findings like the age-dependent incidence curves (and their variations across time and space) or the disease risk attributable to susceptibility alleles for CD and other complex genetic disorders.
The concept of biological network is now widely acknowledged by biologists. The modular nature of the biological networks is also widely accepted [15]. The main originality of our model is to integrate gene-environment interaction at the level of biological modules instead of at the level of the whole organism/network. In other words, the reaction norm defining the phenotype from its genetic background and its environmental exposure is displaced to a lower scale, which can be seen as a sub-phenotype. This way of thinking is logical if one reasons in terms of biological function, which is a direct consequence of functional states of cells or even molecules. The whole phenotype of an organism (here a morbid condition) thus needs to be dissected and analysed at lower levels and must be seen as a systemic property of a hierarchical network.
The proposed model strongly challenges the current reductionist understanding of disease causality. The phenotype is fully determined by the functional status of biological modules but the functional status of the modules themselves are not fully predictable. The only predictable thing is their respective MDPs, which are themselves a consequence of genetic, environmental and gene-environment parameters. However, MDPs are only propensities and it is thus impossible to fully predict the status of a given module, and by consequence of the module network. Application to other complex genetic disorders. The 4-parameter model was fitted to published datasets for (a) French male ulcerative colitis [18]; (b) schizophrenia [30]; (c) multiple sclerosis [31]; (d) ankylosing spondylitis [32]. Published data are shown as red dots and the computed curves as blue lines. Accordingly, the disease is fully determined neither by the DNA sequence (the genome) nor by the exposure to environmental factors (the exposome) nor by any combination of genetic and environmental factors. Additional factors must be taken into account, namely stochastic events that draw CD-permissive or CD-protective modules randomly with their respective propensities. As a result, the model leads to an individual-centred notion of health, disease risk and preventive actions. This opinion is fully supported by the incomplete concordance rates between monozygotic twins (who share their genetic and environmental backgrounds) in most of complex genetic disorders.
Finally and more practically, the proposed model may be used as a tool for public health decision-makers. As shown for CD, overseeing the age-dependent incidence curves may help to follow the impact of environmental changes and to test the plausibility of putative risk factors on disease outbreaks.