Cohort profile: Celiac disease genomic, environmental, microbiome and metabolome study; a prospective longitudinal birth cohort study of children at-risk for celiac disease

The Celiac Disease Genomic, Environmental, Microbiome and Metabolomic (CDGEMM) study is an international prospective birth cohort of children at risk of developing celiac disease (CD). The CDGEMM study has been designed to take a multi-omic approach to predicting CD onset in at-risk individuals. Participants are required to have a first-degree family member with biopsy-diagnosed CD and must be enrolled prior to the introduction of solid food. Participation involves providing blood and stool samples longitudinally over a period of five years as well as answering questionnaires related to the participant, their family, and environment. Recruitment and data collection have been ongoing since 2014. As of 2022 we have a total of 554 participants and the average age of the cohort is 56.4 months. A total of 54 participants have developed positive antibodies for CD and 31 have confirmed CD. Approximately 80% of the 54 participants with CD have developed it by 3 years of age. To date we have identified several microbial strains, pathways, and metabolites that occur in increased abundance before CD onset and have previously been linked to autoimmune and inflammatory conditions, as well as others that occur in decreased abundance before CD onset and are known to have anti-inflammatory effects. Our ongoing analysis includes expanding our metagenomic and metabolomic analyses, evaluating environmental risk factors linked to CD onset, and mechanistic studies investigating how alterations in the microbiome and metabolites may protect against or contribute to CD development.

Introduction
Celiac disease (CD) is an autoimmune condition characterized by enteropathy and triggered by the ingestion of dietary gluten in genetically predisposed individuals [1]. CD is common, with a global prevalence of 1.4% in the general population that varies according to age, sex, and region [2]. Compatible human leukocyte antigen (HLA) genetics are necessary but not sufficient to develop CD: approximately 40% of the population have permissive genetics, but only 3% will develop CD during their lifetime [3]. The incidence of CD is rising in children [4] and adults, and is expected to triple by 2050 [5]. However, neither the genetic predisposition nor the trigger has changed, which suggests that other environmental factors are contributing to this rapid rise in incidence. Factors such as early life infections [6], infant feeding practices [7], antibiotic exposure [8], and viral pathogens [9] have all been implicated in contributing to CD onset. These factors have also been shown to influence the microbiome [10,11]. However, to explore these factors, detailed data and sample collection must begin before disease onset [12].
CD is an ideal model to prospectively study autoimmune conditions and the role of environmental factors contributing to CD onset as high-risk children can be identified from birth using HLA genetics [13]. The trigger, gluten, is known and can be accurately traced in terms of timing of introduction and quantity in the diet, and CD onset can be identified by measuring known highly specific autoantibodies in serum [14]. To date three prospective birth cohort studies [13,15,16] of children at-risk for CD (due to having genetic compatibility and a family history of CD) have generated information that has improved our understanding of early CD onset in children at increased risk of CD and shaped clinical feeding recommendations [17]. For example, these studies have identified HLA genetics as the strongest risk factor for developing CD [13,15,16] and found that neither early gluten introduction [15] at 4 months of age nor late introduction of gluten [13] at 12 months of age affected the risk of CD onset.
The Celiac Disease Genomic, Environmental, Microbiome, and Metabolomic (CDGEMM) study was designed to fill knowledge gaps by focusing on early environmental factors and their interaction with the intestinal microbiome on the earliest steps in CD pathogenesis contributing to the loss of tolerance to gluten and onset of autoimmunity [18]. The specific aims of CDGEMM are (1) To study alterations in the infant's microbiome and metabolome from birth in relation to tolerance vs. immune response that leads to the autoimmune insult characteristic of CD; (2) To investigate the impact of specific bacterial strains and bacteria derived metabolites on gut mucosal molecular pathways contributing to CD pathogenesis; and (3) To use multi-omic statistical analysis to identify biomarkers of CD and to predict the chance of CD development. The aim of this cohort profile paper is to provide a comprehensive description of the CDGEMM cohort as a resource for the scientific community to access data and potentially establish collaborative efforts. It includes an overview of the study design, description of the baseline characteristics, and summary of results to date.

Study design and study population
The CDGEMM cohort was established to collect prospective, longitudinal, detailed metadata and samples for multi-omic analysis. The CDGEMM cohort recruits pregnant mothers after the first trimester and infants up to 6 months of age from the United States and Italy. Recruited children are characterized as having an increased risk of CD due to having a parent or sibling with CD. Based on previous studies, 75% of these recruited children will have HLA genetics permissive for CD and, depending on their genetics, a 10%-26% chance of developing CD by 8 years of age [13,15].
The study design and sample collection methodologies for the cohort have previously been described [18] and are shown in Fig 1. Briefly, when possible, mothers and infants are enrolled in the study prior to birth to collect samples at delivery and in the first 7-10 days of life. Following this early sample collection, stool is obtained from children every 3 months and blood is collected every 6 months for the first 3 years of life. Stool is collected every 6 months and blood is collected yearly from 3 years of age until completion of the study at either 5 or 10 years of age. Detailed clinical data is obtained regularly as shown in Fig 1 and topics included in the questionnaires are listed in Table 1.
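The collection intervals described above can be expressed as a small schedule generator. The following is a minimal Python sketch, not the study's own tooling: the function name `visit_schedule` is hypothetical, and it assumes that visits fall exactly on multiples of 3 months and that month 36 is the last visit on the early-life schedule. The delivery and 7-10 day samples are omitted, and Fig 1 remains the authoritative schedule.

```python
def visit_schedule(end_years=5):
    """Map age in months to the sample types due at that visit.

    Assumed reading of the text (illustrative only):
      - up to 36 months: stool every 3 months, blood every 6 months
      - after 36 months: stool every 6 months, blood every 12 months
    """
    if end_years not in (5, 10):
        raise ValueError("the study completes at either 5 or 10 years of age")

    end_month = end_years * 12
    schedule = {}
    for month in range(3, end_month + 1, 3):
        if month <= 36:
            stool_due = True
            blood_due = month % 6 == 0
        else:
            stool_due = month % 6 == 0
            blood_due = month % 12 == 0

        samples = []
        if stool_due:
            samples.append("stool")
        if blood_due:
            samples.append("blood")
        if samples:  # skip months with nothing due (e.g. month 39)
            schedule[month] = samples
    return schedule


if __name__ == "__main__":
    for month, samples in visit_schedule(5).items():
        print(f"{month:>3} months: {', '.join(samples)}")
```

Under these assumptions, `visit_schedule(5)` yields 16 stool and 8 blood collection points after the neonatal period.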

Ethics
This study was approved by the MassGeneral Brigham Human Research Committee Institutional Review Board (IRB) and by the Comitato Etico (Italian IRB) of the Sapienza University of Rome, which serves as the coordinating center in Italy. Parental consent was obtained for each child enrolled.

Recruitment
To date, we have recruited 554 participants, 262 (47.3%) in the US and 292 (52.7%) in Italy. Because we aim to recruit children with a family history of CD prior to birth, innovative recruitment strategies were required. In the US, we employed social media to engage with and identify this unique population. Specifically, we utilized the online community of bloggers and dedicated patient advocacy groups to promote recruitment using Twitter, Facebook, Instagram, blogs, and podcasts. IRB approved language was provided to all individuals for use on their social media site. A link to our dedicated study website (with both English and Italian translations) and an email address were provided so that interested families could contact the study team to either enroll or receive more detailed information about study participation. Social networking both nationally and internationally has been a strong source of recruitment for all sites.
In addition to social media, the study teams in the US and Italy have presented at conferences or expos (related to celiac disease and/or gastroenterology) to reach a local population. Patient referrals from other pediatric gastroenterology providers across the US and Italy have also contributed to successful recruitment. In Italy specifically, most patients are recruited in clinic or through flyers posted in referring clinic offices. A short survey of all enrolled families showed social media to be the most successful recruitment source in the US (87%) and patient referrals to be the most prominent source in Italy (80%).

Retention
Given that this is a longitudinal study which requires serial sample and data collection over the course of 5-10 years, the study team has focused on methods to retain participants and keep both the families and children engaged in the study. Current retention rates are 82% in the US and 70% in Italy, which may be attributable to the activities described below. Upon enrollment into the study, families are provided with a welcome packet that includes:
• Welcome letter from the study team
• Handout with all study team members and contact information
• Handout about the impact the study will have on celiac disease research
• Invitation to join a Facebook group
• Study calendar outlining upcoming visits and what is required (stool, blood, data)
• Progress certificate with spots to add a sticker upon completing each year of the study
All sample packages are mailed directly to participants' homes (for blood and stool collection), and the study coordinator schedules bloodwork for the families at a local lab. Additionally, the study team sends out study updates in the form of a newsletter alerting families to new developments or analyses that are ongoing, publications, or sample collection updates. As a follow up to the newsletter, the study team hosts video calls with families to discuss the study updates and answer any questions they may have.
While strong communication with these participants and their families has been crucial for retention, other creative ways of keeping families involved include:
• A private Facebook group for families to join and get to meet other families, some of whom may be local to them
• A video and book created by the study team designed to outline for the children what the team does with the samples they collect and why they collect them, allowing the children to better understand their participation, especially as they get older
• A birthday card for each child with a sticker to place on their progress certificate
• A certificate and personalized lab coat for each child at study completion
• For local US families, a party hosted by the study team in June 2017 and June 2019 at MGH for families to come meet the team and other enrolled participants
The focus of retention techniques has been to ensure each subject feels involved in the study beyond just sending in samples and data, and to ensure each child and family understands the impact they are making on CD research.

Impact of COVID-19
In March 2020, the CDGEMM study (across all sites) was placed on hold due to the COVID-19 pandemic. All staff members were remote; however, approved study staff were able to package stool kits to mail to participants so that they could remain on track with specific study time points. Participants were instructed to collect the stool around the appropriate time point and keep it in their home freezer until the lab resumed full operations. In the US, all blood samples were on hold from March to June 2020, and in Italy from March to September 2020. Even after blood sampling resumed, several time points were missed because participants were unable to reach an outside lab, because of lab shutdowns and staffing shortages, and because several hospitals allowed only urgent cases to be seen, not elective bloodwork. All samples that were missed or delayed were logged appropriately.
The pandemic also led participants to either drop out of the study or become lost to follow up, most commonly because of refusal of study bloodwork. According to study records, 9.2% of participants in the US and 17.6% in Italy dropped out of the study from March 2020 onwards. Additionally, given that participants were not able to obtain bloodwork and upper endoscopy procedures were not available for several months, it is possible that CD diagnosis was delayed in some participants. In the future we aim to examine whether SARS-CoV-2 has served as a viral trigger of celiac disease and thus increased the incidence of disease in our cohort.

Baseline characteristics
The CDGEMM cohort is currently composed of 554 participants enrolled from 42 US states and 8 regions of Italy. Baseline demographic data and clinical characteristics are displayed in Table 2. The average age of the overall cohort is 56.7 ± 23.6 months, with an age range of 0-9 years. Across all sites, we have enrolled 9 sets of twins and 67 sets of siblings. Additionally, 48% of participants are female and 58% were born via vaginal delivery.
For the purposes of this study, families are allowed to choose whether and when they introduce gluten to their child(ren). The average age at gluten introduction is 8.9 months in the US and 8.1 months in Italy. In the first year of life, parent(s)/guardian(s) report feeding method (breast fed, formula fed, or a combination of both). Of the reported data, in the US, 44.75% of the cohort were primarily breast fed, 15.85% were primarily formula fed, and 10.85% received both breast milk and formula. In Italy, 22.83% of the cohort were primarily breast fed, 21.06% were primarily formula fed, and 6.28% received both breast milk and formula. Further data are listed in the S1 Table.

Samples collected
To date, over 120,000 samples have been collected and 170,000 data points obtained. Several different blood sample components are obtained at each study visit, including serum, blood clots, whole blood, PAXgene RNA, and peripheral blood mononuclear cells (PBMCs). Additionally, at designated sample collection time points stool is collected both with and without RNAlater solution. Data are obtained from parent-reported surveys and from pediatrician offices (child height, weight, and immunization records). All samples and data listed in

Development of CD to date
To date, 54 participants (US, n = 17; Italy, n = 37) have had a positive tissue transglutaminase IgA (tTG IgA) result in our research lab and/or in a clinical lab. All participants were advised to follow up with local physicians to repeat serology testing and, based on those results, either confirm the diagnosis with a biopsy or follow ESPGHAN criteria and begin a gluten-free diet (GFD) without biopsy. Per our study protocol, the following diagnostic criteria are used to categorize all participants who test positive for celiac disease autoantibodies in our research lab:
• CD by Biopsy: Upper endoscopy performed revealing Marsh 3 histology.
• CD by ESPGHAN criteria: tTG IgA value that is greater than 10 times the upper limit of normal, EMA positive (at least on one occurrence).
• CD Autoimmunity (CDA): tTG IgA value that is positive (or DGP IgG positive if subject is IgA deficient) on at least two occurrences.
• Potential CD: tTG IgA value that is positive (whether above or below 10 times the upper limit of normal), EMA positive, but histology negative.
• Transient CD: tTG IgA value that is positive on one occurrence (but not 10 times the upper limit of normal), EMA positive or negative, and then normalization of labs (without adhering to a GFD).
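The categories above can be read as a decision procedure. The following Python sketch is one possible encoding, offered for illustration only: the source does not state the order in which overlapping criteria are evaluated, so the precedence used here (histology first, then ESPGHAN, then transient, then CDA) is an assumption, as are the record fields and the names `SerologyRecord` and `classify`. It is not the study's own software and is not a clinical tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SerologyRecord:
    """Hypothetical per-participant summary (field names are assumptions)."""
    positive_occurrences: int          # tTG IgA positive (or DGP IgG if IgA deficient)
    max_ttg_fold_uln: float            # highest tTG IgA as a multiple of the upper limit of normal
    ema_positive: bool                 # EMA positive on at least one occurrence
    histology: Optional[str] = None    # "marsh3", "negative", or None if no biopsy
    normalized_on_gluten: bool = False # labs normalized without a gluten-free diet


def classify(record: SerologyRecord) -> str:
    """Assign a category using an assumed precedence of the listed criteria."""
    if record.positive_occurrences < 1:
        return "No CD autoantibodies"
    if record.histology == "marsh3":
        return "CD by Biopsy"
    if record.histology == "negative" and record.ema_positive:
        return "Potential CD"
    if record.max_ttg_fold_uln > 10 and record.ema_positive:
        return "CD by ESPGHAN criteria"
    if (record.positive_occurrences == 1
            and record.max_ttg_fold_uln <= 10
            and record.normalized_on_gluten):
        return "Transient CD"
    if record.positive_occurrences >= 2:
        return "CD Autoimmunity (CDA)"
    return "Unclassified"  # e.g. a single positive result still under follow-up


if __name__ == "__main__":
    example = SerologyRecord(positive_occurrences=2, max_ttg_fold_uln=12.0, ema_positive=True)
    print(classify(example))
```

A record that matches none of the rules, such as a single positive result that has not yet been repeated, is returned as "Unclassified" rather than forced into a category.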
Thus, to date, 16 participants have confirmed CD by biopsy, 15 have confirmed CD by ESPGHAN criteria, 9 have CDA, 6 have potential CD, and 8 have transient CD. Of those with transient CD, 4 are currently pending clinical workup and the other 4, who have no symptoms and are on a gluten-containing diet, are being monitored through study labwork at the set time points. Given that this study is ongoing, participants classified as CDA, Potential CD, and Transient CD are closely followed with both study and clinical labs unless they drop out of the study.

PLOS ONE

Ongoing activities and future plans
We previously published work examining the influence of HLA genetics and environmental factors, including birth delivery mode, infant feeding type, and antibiotic exposure, on gut microbiota composition, function, and the metabolome prior to the introduction of gluten [10]. We performed multivariate association, cross-sectional, and longitudinal analyses using metagenomic and metabolomic data collected at three consecutive timepoints (birth, 3 months, and 6 months of age) from 31 children selected from the CDGEMM cohort [10]. Through this work, we identified taxonomic and functional shifts in the developing gut microbiota of children at risk of CD, linking genetic and environmental risk factors to detrimental immunomodulatory and inflammatory effects. After identifying shifts in the microbiota even before the trigger of CD was introduced, we next aimed to identify microbiota signatures predictive of CD onset in children from the CDGEMM cohort who went on to develop CD. Here, we again assessed alterations in the gut microbiota and metabolome before disease onset, using shotgun metagenomic sequencing and untargeted metabolomic analysis to perform cross-sectional and longitudinal analyses in the 18 months prior to disease onset in 10 participants who developed CD and at the corresponding time points in 10 matched control participants [19]. Our longitudinal analysis identified several microbial species, strains, pathways, and metabolites that occurred in increased abundance before CD onset and have previously been linked to autoimmune and inflammatory conditions, while others occurred in decreased abundance before CD onset and are known to have anti-inflammatory effects. We also identified previously unreported microbes, pathways, and metabolites that could serve as CD-specific biomarkers [19]. We are now in the process of expanding this analysis to include the additional participants who have developed CD.
In other work we utilized a multi-omic approach to examine the microbiota and metabolome of breast milk from mothers enrolled in the CDGEMM study with CD compared to those without CD [20]. We identified differences in bacterial and viral species/strains and in functional pathways between these two groups. We are now expanding these findings in context with the matched infant stool samples from birth to understand how the breast milk microbiota may influence microbiome engraftment in neonates, with potential impact on their future clinical outcomes.
Our next steps include using machine learning to investigate which environmental factors other than gluten may influence the development of CD, taking a mechanistic approach to understanding how microbes identified and isolated from CDGEMM participants may contribute to or protect against the development of CD, and analyzing genomic data, single cell sequencing of isolated immune cells, and epigenetic data. In addition to the samples described, we are also generating organoids from duodenal tissue obtained from CDGEMM patients who undergo esophagogastroduodenoscopy (EGD) for suspected CD. The long-term goal of the CDGEMM study is to combine in-depth metadata with multi-omic information inclusive of genomic, microbiome, and metabolomic data and to utilize advanced statistical approaches to predict who may develop CD before it occurs, and thereby learn how to prevent it.

Strengths and limitations
The strengths of this cohort include robust metadata and longitudinal sample collection from a well characterized international cohort of children at risk of developing CD. We are predominantly utilizing the fecal microbiome as a surrogate marker of the intestinal microbiome, which may be deemed a limitation of the cohort. However, we are obtaining duodenal tissue from participants who undergo EGD, which can be used for future analyses, including analyses of the intestinal microbiota. In addition, feces must be used as a marker for longitudinal analysis because it is neither feasible nor ethical to perform serial endoscopies in otherwise healthy children. Another limitation pertains to the use of self-administered questionnaires, which can lead to unreliable and missing data. This limitation is common to longitudinal cohort studies.
In order to understand the earliest steps of CD pathogenesis we must begin detailed data and sample collection and analysis before the onset of disease. The CDGEMM cohort serves as a guide for prospective longitudinal study designs to better understand the role of gut microbiota in disease pathogenesis and to develop therapeutic targets to reestablish tolerance and/or prevent autoimmunity. We welcome collaborations with those interested in efforts to prevent chronic inflammatory disease.