Using spatio-temporal surveillance data to test the infectious environment of children before type 1 diabetes diagnosis

The “hygiene hypothesis” postulates that reduced exposure to infections favours the development of autoimmunity and childhood type 1 diabetes (T1D). On the other hand, viruses, notably enteroviruses, are suspected to trigger T1D. The assessment of the possible relationships between infections and T1D still defies the classical tools of epidemiology. We report the methods and results of a geographical approach that maps the addresses of patients to a communicable diseases surveillance database. We mapped the addresses of patients at birth, infancy and T1D diagnosis to the weekly estimates of the regional incidences of 5 frequent communicable diseases routinely collected since 1984 by the French Sentinel network. The pre-diagnostic infectious environment of 3548 patients with T1D diagnosed between 0.5 and 15 years was compared to those of 100 series of age-matched “virtual controls” drawn randomly on the map. Associations were classified as “suggestive” (summer diarrhea, SD, and varicella, V) when p < 0.05, or “significant” (influenza-like infections, ILI) when they passed the Bonferroni correction for multiple testing. Exposure to ILI and SD was associated with T1D risk, while V seemed protective. In the subset of 2521 patients for which we had genome-wide data, we used a case-only approach to search for interactions between SNPs and the infectious environment as defined by the Sentinel database. Two SNPs, rs116624278 and rs77232854, showed significant interaction with exposure to V between 1 and 3 years of life. The infectious associations found should be taken as possible markers of patients’ environment, not as direct causative factors of T1D. They require replication in other populations. The increasing public availability of geographical environmental databases will expand the present approach to map thousands of environmental factors to the lifeline of patients affected by various diseases.


Introduction
The natural history of T1D makes the search for environmental markers difficult. Indeed, it takes a silent period lasting several months up to several years for the autoimmune reaction to achieve the near-complete destruction of ß-cells, as indicated by the appearance of autoantibodies in a child's serum long before T1D is diagnosed [1][2][3]. The rate of ß-cell destruction varies across patients for unknown reasons, but seems accelerated in young children [4]. Environmental factors can affect susceptibility, triggering of autoimmunity, and possibly preclinical course of T1D. The limited concordance of T1D (30-40%) across monozygotic twin pairs is the main proof of major non-genetic risk factors [5][6][7][8] that are thought to interact with the genetic background of predisposed children. The genetic background that predisposes to childhood T1D nowadays may not be the same as 30 years ago [9].
The currently increasing incidence of early forms of T1D in many developed countries [10][11][12][13][14] suggests a deleterious remodeling of infants' environment, either because ancient protective factors have vanished or because exposures causing T1D have emerged or have become more frequent. Incidence and recent increase in incidence of T1D, however, are different across European countries [15].
Infectious exposures are the main environmental suspects, but there is no direct evidence of their implication in T1D risk or natural history in humans. Based on experimental observations in laboratory rodents, the "hygiene hypothesis" postulates that the reduced exposure to infectious agents that children enjoy nowadays has altered the longstanding balance between the developing immune system and microbiological pressure, favoring the occurrence of allergic or autoimmune diseases [16,17]. Indeed, children in developed countries are less exposed to infections than 50 years ago, a result of antibiotics, vaccination, as well as improved hygiene and sanitation. It is thus tempting to think that this may be an underlying cause for the increase in childhood T1D incidence [18,19]. An inverse relation between increasing T1D incidence and decreasing tuberculosis, rheumatic fever, and pinworm infestation is found at a population level [18,20,21], but the almost complete disappearance of these prototypical diseases decades ago cannot explain the more recent increase in childhood T1D. Widespread treatment with antibiotics has decreased the number of infectious episodes while modifying the physiological immune stimulation afforded by intestinal microbiota [22]. Vaccinations, now applied to most infants, prevent measles, mumps, and rubella that affected the vast majority of them in the recent past. On the other hand, viral infections may be favored by community-acquired transmission in day-care centers and pre-school nurseries. Rodent models show that viruses may instead induce autoimmune diabetes through their action on ß cells or on the immune system [2,3][23][24][25]. Viruses, notably enteroviruses [26], have been suspected to cause human T1D [27]. Pioneering studies have examined specific viral infections in biological samples from T1D children compared to controls, or have used questionnaires to describe environmental events retrospectively. However, results of these studies are discordant [28][29][30].
In conclusion, early exposure to viruses may protect against T1D or, on the contrary, trigger it. It is also possible that some viral infections protect while other viruses trigger T1D mechanisms.
The epidemiological search for possible infectious determinants of T1D, and for possible gene-environment associations, is difficult. Indeed, given the low incidence of the disease and its protracted and insidious course [2,3], cohorts require a long, costly and prospective observation of infectious events in millions of children of the general population. A more promising approach has been developed in the Teddy and BabyDiet cohorts, where specific environmental factors are studied prospectively among children at risk for T1D such as siblings or offspring of T1D patients [31,32]; those who develop T1D or autoantibodies to ß cell antigens are compared with those remaining disease-free and autoantibody-negative. On the other hand, the case-control design requires retrieving viral events from the past histories of children, with a high risk of recall bias.
An alternative to these classical epidemiological designs can be, as we propose in this paper, to take advantage of the development of the new systems of surveillance of communicable diseases in the general population that provide data with an ever better granularity in time and space (see webpage of the International Society for Disease Surveillance [33] for a list of local and national experiments and resources). Moreover, these databases are increasingly publicly available to researchers without cost. It is then straightforward to map the address of a patient at a given time of his life to know whether or not he was living at that time in an area where there was an influenza-like epidemic, or an epidemic of any of the diseases under surveillance. France established a national real time monitoring system called "Sentinel" in 1984 [34], which relies on sentinel general practitioners who report their clinical encounters of 5 frequent communicable diseases. In this work, we show how the geolocation of the patients of a cohort can be used to test if infectious exposures were different in T1D patients and in the general population. This may be done by generating "virtual controls" which provide the reference values. This also enables the search for possible gene-environment associations.

Patients
The patients of this study (n = 3548) were selected from the large multicenter Isis-Diab cohort which was established in 2007 and involves 99 participating centers covering most of continental France. Inclusion criteria for the current study were birth after 01/01/1980 in continental France, Caucasian ethnicity and age at clinical diagnosis between 0.5 and 15 yrs (Table 1). T1D was defined according to the American Diabetes Association [35] and by the presence of at least one class of autoantibodies to glutamic acid decarboxylase, insulin, or islet antigen-2.

Ethics
The research protocol was approved by the Ethics committee of Ile de France (DC-2008-693) and the computer security and confidentiality guarantees given to patients were approved by the Commission Nationale Informatique et Libertés (DR-2010-0035). The ClinicalTrials.gov identifier was NCT02212522. All patients provided written informed consent for participation in the study and donation of samples. We obtained written informed consent from the next of kin, caretakers, or guardians on behalf of the children enrolled in the study.

Geocoding
Addresses of patients at birth, between birth and 4 years, and at time of T1D diagnosis were collected as part of an environmental questionnaire filled in by patients. Geocoding of the patients' addresses was done using the ArcGIS 9.3.1 system, the ArcView software, and the BD ADRESSE1 V2 database provided by the French National Geographic Institute (http://professionnels.ign.fr/bdadresse).

Assessing the environment of patients
Geocoding was used to map the addresses of the patients at the 3 times defined above with public geographical databases, in order to characterize the patient's infectious exposures from birth up to the clinical onset of T1D. The weekly infectious exposures of patients to a selected set of frequent communicable diseases were assessed using the French Sentinel System [34] created in 1984, which collects the corresponding information in real time from sentinel general practitioners. Weekly incidence data are available on the web (http://www.sentiweb.fr). We studied the past exposure of patients to three communicable diseases of early childhood (measles, mumps and varicella), and to two communicable diseases which affect children of all ages: influenza-like illnesses and acute diarrheas. The weekly surveillance of influenza-like illnesses started in 1984. The weekly surveillance of measles and mumps, which were very frequent at that time [36], started at the same date. The weekly surveillance of varicella and acute diarrheas started in 1990.
The association of T1D with past infectious exposures was studied using several different definitions of exposure to an infectious environment (see S2 File). We assessed the infectious environment of a child during various time-windows in his life (Table 2) using the regional estimate of the incidences at the places of life. A French region is a Local Administrative Unit of level 1 (LAU1) according to European Union definition (http://ec.europa.eu/eurostat/web/

Table 2. Variables used for the study of the Infectious environment (INF-E).
The past exposure of patients was studied for measles, mumps, varicella, influenza like illnesses, and acute diarrheas during different exposure windows. It was abstracted using two methods: "cumulative" is the total of the regional incidences around the case during the window; "high" is the mean of the 10 largest regional incidences to which the patients or the VC were exposed. See Material and Methods, and S2 File for details on the measures of environmental infectious exposures.
Columns: Exposure window (method); Influenza like illnesses exposure; Measles exposure
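The two exposure summaries defined in the Table 2 legend can be illustrated with a minimal Python sketch. It is not the code used in the study: the data layout (a dictionary keyed by region and week start date), the function names, and the simplification of a single region of residence over the whole window are assumptions made for illustration only. When fewer than 10 weeks are available, the sketch averages the available values, which is also an assumption.

```python
from datetime import date, timedelta

def window_incidences(weekly, region, start, end):
    """Weekly regional incidences whose week starts inside [start, end).

    `weekly` is a hypothetical layout: {(region, week_start_date): incidence}.
    """
    return [inc for (reg, week_start), inc in weekly.items()
            if reg == region and start <= week_start < end]

def cumulative_exposure(incidences):
    """'cumulative': total of the regional incidences during the window."""
    return sum(incidences)

def high_exposure(incidences, k=10):
    """'high': mean of the k largest regional incidences (k = 10 in Table 2).

    Assumption: with fewer than k weeks, the mean of the available weeks is used.
    """
    top = sorted(incidences, reverse=True)[:k]
    return sum(top) / len(top) if top else 0.0

# Illustrative data only: 12 weeks in a hypothetical region "R1".
weekly = {("R1", date(1990, 1, 1) + timedelta(weeks=i)): float(i + 1)
          for i in range(12)}
inc = window_incidences(weekly, "R1", date(1990, 1, 1), date(1990, 3, 26))
print(cumulative_exposure(inc), high_exposure(inc))
```

The same two functions would be applied identically to a case and to its virtual control, so that only the place and period of residence differ between the two.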

Genome wide genotyping and imputation
A subset of 2521 patients from the Isis-Diab cohort had a whole-genome scan (age at diagnosis: 6.9 ± 3.5 yrs SD). Genotyping was performed at the Centre National de Génotypage (Evry, France) and interpreted at the Genome Québec Innovation Center (Montréal, Canada) with SNP imputation performed at the Riken Institute (Kanagawa, Japan). Technical details are provided in the S1 File.

Design
Environmental study. We used a case-control design with a 1:1 matching in which controls are "virtual controls" (VC), not real persons. The past infectious environment of the T1D patients was compared to that of these VCs. To define the VCs, we followed the key epidemiological principle which is that controls should be obtained from the same source population as the cases [39]. The difficulty to overcome was that the Isis-Diab cohort does not cover uniformly the whole French territory, and that its coverage has varied with time: in the initial years of the cohort, patients were recruited by the first cooperating centers in a much narrower territory than the current one (which is almost all France). To overcome this difficulty, we performed the VC sampling algorithm within successive 2-year classes of birth (starting in 1941, ending in 2011).
VC sampling algorithm: The French territory was partitioned using a grid of 20 x 20 = 400 cells of width 0.7425° longitude and height 0.495° latitude (see S1 Fig). To model the time variation of the recruitment of patients, we considered the spatial distribution of the cohort by classes of 2 years of birth. For each period of two years, we denoted $b_i$ the number of patients in the cohort who were born in grid cell $i$, and $B$ the sum over all grid cells, namely $B = \sum_i b_i$. Thus $B$ is the number of patients of the cohort born in the period of 2 years of birth to which belongs the case for which a virtual control is searched, and $b_i$ describes the spatial dispersion of the cases of the cohort at that period. The weight of a cell $i$ in the cohort is $w_i = b_i / B$. A small $w_i$ indicates that few patients of the cohort were born in cell $i$ during the 2-year period considered. The variation of the $w_i$ with time models the temporal and spatial variation of the recruitment of patients.
For each case in the cohort, an age-matched VC was sampled through a two-step process. In the first step, the grid cell $i$ containing the VC was randomly chosen with probability $w_i$. In the second step, the precise geographical coordinates of the VC within this "large" grid cell $i$ were obtained by sampling a "small" 200m x 200m elementary unit within cell $i$. This sampling was done using the pps R package by Jack G. Gambino (Functions for PPS sampling, i.e. sampling with probability proportional to size: http://CRAN.R-project.org/package=pps). The local density of population needed by the pps algorithm was obtained at the 200m x 200m precision level from the French National Institute of Statistics (INSEE) 2009 database (https://www.insee.fr/fr/statistiques/2520034). Finally, the age of the VC was taken randomly (uniform distribution) between the age of the case minus 6 months and the age of the case plus 6 months.
This algorithm was repeated 3548 times to obtain a first set of 3548 VCs. Then the whole process was repeated 100 times, in order to generate 100 sets of 3548 VCs. The environmental variables were then compared between the group of 3548 patients and each of the 100 groups of 3548 VCs.
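The two-step sampling described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the study's implementation: the study used the pps R package, whereas this sketch draws a single unit with probability proportional to its population using the standard library `random.choices`. The cell identifiers, unit identifiers and population counts are hypothetical.

```python
import random

def cell_weights(births_per_cell):
    """w_i = b_i / B for one 2-year birth class, with B the sum of the b_i."""
    total = sum(births_per_cell.values())
    return {cell: b / total for cell, b in births_per_cell.items()}

def sample_virtual_control(births_per_cell, units_by_cell, case_age, rng):
    """Draw one virtual control for a case of age `case_age` (in years).

    births_per_cell: {cell_id: b_i} for the case's 2-year birth class.
    units_by_cell:   {cell_id: [(unit_id, population), ...]} for the
                     200m x 200m units (hypothetical layout).
    """
    weights = cell_weights(births_per_cell)
    cells = list(weights)
    # Step 1: choose the large grid cell i with probability w_i.
    cell = rng.choices(cells, weights=[weights[c] for c in cells], k=1)[0]
    # Step 2: choose a small unit within the cell, proportional to population.
    units = units_by_cell[cell]
    unit = rng.choices([u for u, _ in units],
                       weights=[pop for _, pop in units], k=1)[0]
    # Age drawn uniformly within 6 months of the age of the case.
    age = rng.uniform(case_age - 0.5, case_age + 0.5)
    return cell, unit, age

# Illustrative data only.
births = {"A": 3, "B": 1}
units = {"A": [("A1", 120), ("A2", 30)], "B": [("B1", 50)]}
rng = random.Random(0)
one_set = [sample_virtual_control(births, units, 7.0, rng) for _ in range(5)]
print(one_set)
```

Calling `sample_virtual_control` once per case yields one set of VCs; repeating the whole loop with fresh random draws would give the 100 sets used for the comparisons.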

Gene environment analysis
The possibility of gene-infectious environment (GxE) interactions was tested with a case-only method described by Khoury [40] where the exposures to the considered environmental factors are compared among the three genotypic groups at each SNP position.

Statistics
In the environmental study, exposures of cases and controls were compared with the conditional logistic regression R-package CLogit. A test was reported as indicative of a possible difference when the median p value computed over the 100 comparisons of the cases with the 100 sets of virtual controls was <0.05. It was reported as significant when the median p value was below the Bonferroni limit p<0.05/N, where N = 76 (total number of exposure variables in Table 2).
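The decision rule stated above (median p value over the 100 case versus VC comparisons, then comparison with 0.05 and with the Bonferroni limit 0.05/76) can be written as a short Python sketch. The function name and the returned labels are illustrative assumptions; the p values themselves would come from the conditional logistic regressions, which are not reproduced here.

```python
from statistics import median

def classify_exposure(p_values, n_tests=76, alpha=0.05):
    """Classify one exposure variable from its p values over the VC sets.

    n_tests = 76 is the number of exposure variables in Table 2.
    """
    m = median(p_values)
    if m < alpha / n_tests:      # below the Bonferroni limit
        return "significant"
    if m < alpha:                # nominal level only
        return "indicative"
    return "none"

# Illustrative p values only (100 comparisons each).
print(classify_exposure([0.0001] * 100))
print(classify_exposure([0.01] * 100))
print(classify_exposure([0.2] * 100))
```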
In the gene-environment study, we used the case-only method as implemented in PLINK [41]. We first searched for GxE association by using Wald tests based on the linear regression model. Results were visualized using a Manhattan plot (using ggplot2 package in R). Dominant, recessive, and genotypic analyses were performed. A test was considered as indicative of a possible association when p was between 0.05/N and 1/N, and significant when p<0.05/N where N = 7,329,768 (total number of imputed SNPs).
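To make the case-only test concrete, the following Python sketch regresses an exposure value on a coded genotype among cases only and applies the thresholds given above. It is a simplified stand-in, not PLINK: the p value uses a normal approximation to the Wald statistic (an assumption that is reasonable only for large samples), only the dominant and recessive codings are shown, and all function names and data are hypothetical.

```python
import math

def encode(minor_allele_count, model):
    """Code a genotype (0, 1 or 2 minor alleles) under a genetic model."""
    if model == "dominant":
        return 1.0 if minor_allele_count >= 1 else 0.0
    if model == "recessive":
        return 1.0 if minor_allele_count == 2 else 0.0
    raise ValueError("model must be 'dominant' or 'recessive'")

def wald_test(x, y):
    """Simple linear regression of exposure y on coded genotype x.

    Returns (slope, standard error, two-sided p value). The p value uses
    a normal approximation, which is an assumption of this sketch.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    rss = sum((yi - my - beta * (xi - mx)) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    p = math.erfc(abs(beta / se) / math.sqrt(2))
    return beta, se, p

def classify_snp(p, n_snps=7_329_768):
    """Thresholds from the text: significant if p < 0.05/N, indicative if p < 1/N."""
    if p < 0.05 / n_snps:
        return "significant"
    if p < 1 / n_snps:
        return "indicative"
    return "none"

# Illustrative data only: minor allele counts and exposures of 6 cases.
genotypes = [0, 0, 1, 1, 2, 0]
exposure = [1.0, 2.0, 3.0, 5.0, 4.0, 1.5]
x = [encode(g, "dominant") for g in genotypes]
beta, se, p = wald_test(x, exposure)
print(beta, se, p, classify_snp(p))
```

With a binary coding, the slope equals the difference in mean exposure between the two genotype groups, which is the quantity the case-only design compares.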

Environmental study
Among the variables that we used to measure the potential exposure to infectious environment, the total influenza-like illnesses (ILI) burden between the age of 1 year and time of diagnosis was found significantly larger in T1D children than in controls after Bonferroni correction (Fig 1). T1D children also showed a significantly greater past exposure to acute summer diarrheas, and a lower past exposure to varicella, but these differences did not survive the Bonferroni correction (Fig 1).

Fig 1. Red/blue dots indicate that the patients were more/less exposed than the controls. The two lines stand for test significance at the 0.05 level without (bottom line) and with (upper line) Bonferroni correction for multiple testing (see Table 2 for the definition of INF-E variables).

Gene-infectious exposures
When the infectious exposures were assessed between 1 and 3 years (the child being no longer covered by maternal antibodies), there was an indication of a possible interaction between genetics and infectious exposure. Associations of 2, 1 and 10 SNPs were found with acute diarrheas, ILI, and varicella respectively (Table 3). rs116624278 and rs77232854 survived the Bonferroni correction for an interaction with varicella (Fig 2). rs116624278 is an intergenic SNP located between the PGRMC2 and JADE1 genes, in a QTL for chronic obstructive pulmonary disease, heart rate and osteoarthritis. rs77232854 is an intergenic SNP located between PIGG and PDE6B. Both SNPs seem to be in inactive chromatin regions. When cumulative exposures to infectious diseases were assessed over the entire period of 0 to 3 years of age, there was an indication of a possible association for 7 SNPs located on chromosome 11 and for 6 SNPs located on chromosome 16 with acute diarrheas (Table 3). Two other SNPs showed association with varicella, and 5 SNPs with ILI, but none of these associations survived the Bonferroni correction.

Discussion
This study detected differences in infectious exposures between future T1D patients and control children that occurred at the location of residence, while social environment was comparable. This observation supports the existence of a relationship between infections and T1D occurrence. A weakness of our study is that it is exclusively based on addresses of residence and brings no proof that the child living in the area of residence of infected individuals has actually developed the infection reported by the sentinel surveillance network, or even has been exposed to infected individuals. This is why, in this manuscript, "exposures" should be understood as "potential exposures". Our results suggest that T1D is associated with an exposure to more influenza-like illnesses and summer diarrhea during infancy, and to less varicella. These opposing relationships of T1D with different infectious exposures do not support a simple view based on the hygiene hypothesis alone [18]. The development of autoimmunity and T1D may be favored by the interaction of diarrhea with enteroviruses [26] or other members of the gut microbiome [42]. Outside of the gut, T1D may be favored by the previous interaction of viruses with the immune system [23]. For varicella, recent data regarding the presence of the VZV in the neurons innervating the islets of Langerhans could be used to speculate on a protective mechanism that may result from varicella infection [43,44]. However, we think that one should resist the temptation to extrapolate biological interpretations from purely statistical observations, unless clear mechanisms can support this speculation, which is not the case here.
It would be very interesting, if feasible, to test if the appearance of autoantibodies to ß-cell antigens correlates with infectious exposures, especially because it would make it possible to estimate the time interval between exposure and the earliest detectable manifestation of autoimmunity to ß cells. Indeed, the natural history of childhood T1D studied in siblings and offspring of patients with T1D indicates that autoimmunity often declares itself in early childhood [2,3]. These observations inspired our exploratory approach by focusing our environmental search on early times of life.
There is an almost unlimited number of possible combinations that can be analyzed by our spatio-temporal approach. First, any complex combination of environmental factors could be considered. Second, an environmental factor can target its victim at different possible ages: this is the reason why this work has tested different windows of exposure, but more could be easily considered with the same approach. Even though the number of possibilities is daunting, exploring them is feasible. We can for example study the consequences of an exposure during the 2nd trimester of pregnancy just by matching the dates of this second trimester with the values of the environmental databases at the place of residence of the mother. Ultimately, this is becoming a simple problem of computing power availability, and everything suggests that the computing power available to researchers will not limit the ambitions of this kind of research. Until environment can be defined more exhaustively at the individual level, our study should be viewed as an exploratory proof-of-concept approach of a multifactorial disease whose environmental causes are yet largely unknown.
Rappaport and Smith have proposed the "exposome" as a new paradigm in which signatures elicited by environmental factors can be detected in serum or in circulating cells [45]. More specifically, next generation sequencing can be used to search for viruses "blindly" in plasma, a promising approach used in the Teddy cohort [46]. Once significant exposures have been identified in blood, it may be possible to determine their sources in the pre-disease environment. Our approach of children's exposures could be complementary to this strategy, either by providing suspect factors to future exposome analyses, or by testing exposome-driven information across the environmental landscape. The blood exposome and the external exposures can be combined in a cohort or in a case-control approach. More specifically, serology microarrays may soon provide tools for specific pathogen discovery in serum of patients at time of appearance of β-cell antigen antibodies and T1D diagnosis.
The other part of our study was an attempt to cross genetic and environmental information. For this purpose, we used the case-only method proposed by Khoury [40], which compares the level of exposure to the chosen environmental factors across the genotypes at each SNP position. The strong constraint of the case-only approach is that the tested environmental factors need to be independent from genetics. Indeed, if a given child's infection is influenced by his genotype, a statistical interaction will be found between infection and genotype, which will have no meaning for T1D research. A handful of SNPs were found associated with the studied infectious exposures, indicating possible gene-environment interactions. It is difficult to interpret these statistical observations in the light of a plausible biological interaction between infectious exposures and genetic factors. Notably, the two SNPs that showed a significant interaction with varicella, rs116624278 and rs77232854, were not located within known coding or regulatory sequences of genes directly relevant to immune processes or beta-cell autoimmunity. Their interaction with the varicella exposure eludes our understanding, but might be thought of within a protective context for varicella towards T1D. Our lack of biological understanding of the gene-environment interactions should not be felt as too disappointing since it is common to the vast majority of genomic studies in multifactorial diseases [47,48]. Our blind genome-wide exploration observed no association of infectious exposures with SNPs located within gene loci known to be involved in infectious or immune response, including HLA class II. This may be due to the lack of involvement of these genes versus the studied exposures, or to the heterogeneity of T1D mechanisms [49]. 
Indeed, the genetic predisposition to T1D, including HLA genotypes, is known to be highly heterogeneous at the individual level since patients' genotypes do not carry the same combination of susceptibility alleles. Thus it is possible that particular groups of genes play a role only in specific subsets of T1D patients by interacting with certain environmental factors. If very distinctive of groups of patients, such interaction could escape statistical detection when a whole-genome search is crossed blindly with markers of infectious environment, as performed in the current study.
In conclusion, the current study supports the contribution of infectious exposures to recent childhood T1D epidemiology, yet should be viewed as a methodological proof-of-concept attempt that awaits confirmation and replication.