Global Carrier Rates of Rare Inherited Disorders Using Population Exome Sequences

Exome sequencing has revealed the causative mutations behind numerous rare, inherited disorders, but it is challenging to find reliable epidemiological values for rare disorders. Here, I provide a genetic epidemiology method to identify the causative mutations behind rare, inherited disorders using two population exome sequences (1000 Genomes and NHLBI). I created global maps of carrier rate distribution for 18 recessive disorders in 16 diverse ethnic populations. Out of a total of 161 mutations associated with 18 recessive disorders, I detected 24 mutations in either or both exome studies. The genetic mapping revealed strong international spatial heterogeneities in the carrier patterns of the inherited disorders. I next validated this methodology by statistically evaluating the carrier rate of one well-understood disorder, sickle cell anemia (SCA). The population exome-based epidemiology of SCA [African (allele frequency (AF) = 0.0454, N = 2447), Asian (AF = 0, N = 286), European (AF = 0.000214, N = 4677), and Hispanic (AF = 0.0111, N = 362)] was not significantly different from that obtained from a clinical prevalence survey. A pair-wise proportion test revealed no significant differences between the two exome projects in terms of AF (46/48 cases; P > 0.05). I conclude that population exome-based carrier rates can form the foundation for a prospectively maintained database of use to clinical geneticists. Similar modeling methods can be applied to many inherited disorders.


Introduction
Recent advances in next-generation sequencing (NGS) technology have revolutionized the field of clinical genetics [1][2][3][4]. This technology has facilitated the identification of novel causative genes for >3,000 inherited disorders, which are currently annotated in Online Mendelian Inheritance in Man (OMIM) [2,3]. Most of these disorders are referred to as rare or orphan diseases because of their low incidence [5]. In clinical practice, molecular genetic testing is already being applied to screen for these inherited disorders [6]. However, the epidemiological information on many inherited disorders is insufficient and inconclusive. Particularly for rare diseases, epidemiology remains largely unexplored by clinical geneticists and researchers [7]. The total global prevalence of all monogenic disorders at birth has been calculated to be several percent [5]. In Canada, it has been estimated that single-gene disorders may account for approximately 40 percent of cases in pediatric practice [8]. Therefore, the public health impact of Mendelian diseases is a topic of growing interest worldwide. Reliable estimates of the populations affected by inherited diseases have become increasingly important to guide efficient allocation of public health resources in each country, region, and city [7,9,10].
The lack of epidemiologic studies of inherited disorders is particularly acute in developing countries with limited resources [11][12][13]. Most epidemiologic research has been conducted with individuals from Europe and North America, who represent only a fraction of the global population [11,12]. In developing countries, consultation rates, data collection methods, and population-based registries for inherited disorders vary considerably by urbanization grade and ambient environment [11][12][13].
To overcome these limitations, I analyzed the global carrier rates of rare inherited disorders using geographical population exomes. The global map of carrier rates showed strong population specificity, and the predictions were comparable in accuracy to estimates obtained in clinical practice. This is an initial global overview of the carrier rates of genetic disorders using population exome sequences.

Strategy for epidemiological research on Mendelian disorders using population exome sequences
As an initial study toward determining the genetic epidemiology of inherited disorders, genotyping pipeline data from the 1000 Genomes (1000G) [14,15] and National Heart, Lung, and Blood Institute (NHLBI) [16,17] projects were screened for variations with the potential to affect protein integrity (Fig 1). The dataset included the exome and its surrounding intronic sequences for 1,092 individuals (525 males, 567 females) of 14 ethnic origins and 6,503 individuals (2,443 males and 4,060 females) of two ethnic origins. Population demographics are summarized in S1 Table. Caucasians comprised 34.7% and 66.1% of subjects from the 1000G and NHLBI groups, respectively. Asian and Hispanic populations, which were represented only in the 1000G, constituted 26.2% and 16.6% of the group, respectively. A total of 65.9% were female. Many samples were from within the United States; a minority were from China, Japan, Colombia, Mexico, Puerto Rico, Finland, England, Spain, Germany, Italy, Nigeria, and Kenya. The populations under study are likely depleted of individuals with rare genetic disorders, but when the prevalence is close to 0 (<0.25%), the carrier rate under Hardy-Weinberg equilibrium is usually approximated as follows:
$$\text{carrier rate} = 2pq \approx 2p$$
where p and q indicate allele frequencies and p + q = 1 (p < 0.05; q > 0.95).
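The Hardy-Weinberg step can be illustrated with a short Python sketch. It assumes the standard heterozygote frequency 2pq and its rare-allele approximation 2p; the function names are illustrative and are not part of the original analysis. The input allele frequencies are the SCA values quoted in the abstract, and the outputs are Hardy-Weinberg expectations, not the values reported in Table 1.

```python
def carrier_rate_exact(p):
    """Expected carrier (heterozygote) frequency 2pq under Hardy-Weinberg, with q = 1 - p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return 2.0 * p * (1.0 - p)


def carrier_rate_approx(p):
    """Rare-allele approximation 2p, reasonable when q is close to 1."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return 2.0 * p


# SCA allele frequencies as quoted in the abstract (used here only as example inputs).
sca_allele_freq = {
    "African": 0.0454,
    "Asian": 0.0,
    "European": 0.000214,
    "Hispanic": 0.0111,
}

for population, p in sca_allele_freq.items():
    exact = carrier_rate_exact(p)
    approx = carrier_rate_approx(p)
    print(f"{population:9s} p={p:.6f}  2pq={exact:.6f}  2p={approx:.6f}")
```

For rare alleles the two values are nearly identical; the gap widens as p approaches the 0.05 bound stated above.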

Disease carrier states of Mendelian disorders
As expected, among the 15 genetic diseases detected, the most common was SCA, with a frequency of 1 in 66.6 (1.50%) (Table 1). In contrast, MCPH1 was the rarest disorder, with a frequency of 1 in 14,160 (0.0071%). In addition, carrier prediction unexpectedly revealed a high carrier rate (1 in 254.0) for the CEP152 mutations that cause SCKL5. Carrier statistics are fully reported in Table 1.

Carrier rate variability by race and ethnicity
Carrier frequencies for disease-causing mutations varied significantly by racial and ethnic group, although the sample sizes for Hispanics and Asians were small [7,9,10]. Fig 2 shows the global map of carrier distribution of eight causative mutations for three Mendelian disorders. For example, an average of 0.11% of individuals were carriers for Miller syndrome, but the frequency ranged from 0.18% of European individuals to 0% of Africans, Asians, and Hispanics (Fig 2C). This higher frequency in Europeans has not been reported previously and suggests that the European population is an appropriate target for Miller syndrome screening. Among 15,190 haploid exomes, causative alleles for seven disorders (SCA, SCKL5, Primary immunodeficiency, Canavan disease, Pustular psoriasis, CRPT1 and AGS6) were present to varying degrees in both African and European populations (Fig 2 and Table 1). In contrast, mutations for the other eight disorders (CPHD2, RCD, MCPH1, PCH1B, Miller syndrome, FDLAB, GCCD4 and MEGDEL syndrome) were observed only in Europeans and were not detected in other populations. There were no carriers for any of the 18 inherited disorders in the dataset from Asian populations.

Estimated carrier rates correspond to those seen in clinical practice
SCA is an inherited blood-related disorder that affects hemoglobin and is characterized primarily by chronic anemia and periodic pain episodes [18,19]. A mutation in the HBB gene, commonly called Hemoglobin S (HbS), causes SCA [18,19]. SCA is common among persons whose ancestors came from tropical regions, particularly Sub-Saharan Africa, South America, Saudi Arabia, India, and Mediterranean countries (e.g., Italy, Greece, and Turkey) [18,19]. The CDC has reported that in the United States, SCA affects approximately 90,000-100,000 persons, most of whom are of African descent [20]. The disease occurs in about 1 in every 500 African-American births and 1 in every 36,000 [20] (or 1,000-1,400 [21]; the incidence rate is controversial) Hispanic-American births. However, highly accurate epidemiological studies based on clinical practice are still rare.
The prevalence rates of SCA in Hispanic Americans are controversial (1 in 36,000 [20] or 1 in 1,000-1,400 [21]), but the carrier rate projected here could support both estimates depending on the ancestral origin (Table 2). Taken together, the exome-based estimates corresponded to those in the clinical prevalence survey and were comparable in accuracy to estimates obtained in clinical practice.

Screening priority for genetic testing
Current genetic testing is generally performed according to the ranking of carrier rates of the target mutations. Yet, because of the large number of rare disorders, the data available for designing targeted genetic testing panels are insufficient in clinical practice. This is particularly true for recently identified causative genes. Here, I demonstrated that exome-based methods make it possible to identify a small number of high-priority nonsense and missense mutations linked to genetic disorders (Table 1). For example, the data suggest that, among six causative mutations for PCH1B, only one mutation (p.Asp132Ala) should be high priority for EXOSC3 mutation screening in European populations, whereas the other mutations appear to be quite rare (Table 1). The ranking of carrier rates of mutations was as follows: p.Asp132Ala (NHLBI EA, 0.128%) > p.Val80Phe (0.0116%) = c.475-1269A>G (0.0116%) > other mutations (0%). In the case of Miller syndrome, for which mutations have been reported in several papers, three mutations [p.Arg135Cys (NHLBI EA, 0.0607%), p.Gly152Arg (0.120%), and p.Arg346Trp (0.161%)] should be given first priority for DHODH mutation screening in Europeans but not Africans. A different tendency was obtained for SCKL5: two mutations [p.Lys667Arg (NHLBI AA, 1.29%) and p.Tyr678* (0.176%)] predominated in African populations (Table 1). These frequent mutations were also detected in the 1000G dataset [p.Lys667Arg (1000G AFR, 1.22%) and p.Tyr678* (0.203%)]. Taken together, these data will allow the formulation of a suitable mutation panel that can be applied to determine the priority of genetic testing in clinical practice. I further searched for undetected mutations using the Exome Aggregation Consortium (ExAC), which aggregates exome data from 60,706 unrelated individuals from a variety of large-scale sequencing projects and categorizes them into six races (Table 3). The ExAC dataset detected an additional 29 mutations, although these data did not provide country-by-country genetic epidemiology of inherited disorders (Table 3). This result suggests that larger sample sizes and/or the combined use of several large exome sequencing projects could allow for more accurate prediction of carrier rates.
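The prioritization described above amounts to sorting candidate mutations by carrier rate. The following Python sketch shows one way to do this, using the EXOSC3 (PCH1B) percentages quoted in the text as inputs; the helper name and the alphabetical tie-breaking rule are assumptions made for illustration.

```python
# Carrier rates (%) for EXOSC3 variants as quoted in the text.
exosc3_rates = {
    "p.Val80Phe": 0.0116,
    "c.475-1269A>G": 0.0116,
    "p.Asp132Ala": 0.128,
}


def rank_by_carrier_rate(rates):
    """Return (variant, rate) pairs from highest to lowest rate.

    Ties are broken alphabetically by variant name so the order is deterministic.
    """
    return sorted(rates.items(), key=lambda item: (-item[1], item[0]))


for rank, (variant, rate) in enumerate(rank_by_carrier_rate(exosc3_rates), start=1):
    print(f"{rank}. {variant}: {rate}%")
```

Variants that were not observed at all (0%) would simply sort to the bottom of such a panel.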

Consistency of data between two different exome sequencing projects
I next examined the extent of the differences between the two exome-based estimates by comparing carrier rates in African and European ancestries between the 1000G and NHLBI datasets. A pair-wise proportions test [24] was used to test the null hypothesis that the proportions in the two estimates were equal. This test is referred to as a z-test because the statistic is as follows:
$$z = \frac{p_1/n_1 - p_2/n_2}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
where $\hat{p} = (p_1 + p_2)/(n_1 + n_2)$, p and n denote the mutation count and the total count, respectively, and the indices (1, 2) refer to the first and second column of the table. A pair-wise proportion test between the two exome resources showed no significant differences between the two exome studies (46 cases; P >> 0.05), except in two African cases (P < 0.05) (S3 Table). This finding raises the possibility that exome-based predictions are free from various arbitrary sources of error (e.g., diagnostic capacity) and may serve as an objective indicator.
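A minimal Python sketch of a pooled two-sample z-test for proportions is given below. It is a textbook version without continuity correction, not the exact procedure used for S3 Table, and the allele counts in the example are hypothetical values chosen only to resemble frequencies of about 1.2% and 1.3%.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-sided z-test for equality of two proportions.

    x1, x2: counts of the mutant allele; n1, n2: total alleles examined.
    Returns (z, p_value). No continuity correction is applied.
    """
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        # Both samples are all-mutant or all-reference: no evidence of a difference.
        return 0.0, 1.0
    z = (x1 / n1 - x2 / n2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts: 6 of 492 alleles versus 57 of 4,406 alleles.
z, p_value = two_proportion_z_test(6, 492, 57, 4406)
print(f"z = {z:.3f}, P = {p_value:.3f}")
```

With counts this similar in proportion, the test does not reject the null hypothesis at the 0.05 level, which is the pattern reported for most comparisons between the two projects.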

Risk simulation and mutation detection rate of autosomal recessive disease
Finally, a simple deterministic formula was introduced to predict the mutation detection rate of genetic risk using exome studies, assuming a single-gene disease with an autosomal recessive inheritance pattern. The formula for the mutation detection rate (D) of Mendelian disorders was as follows:
$$D = 1 - \{1 - p(1 - \sigma)\}^{N}$$
where p refers to the mutation carrier rate in each population, and σ indicates the error rate of exome sequencing. N refers to the number of exomes available for epidemiological analysis. Fig 3 shows the simulation curve for the mutation detection rate. This prediction equation is applicable to general cases of predicting the incidence of inherited disorders. It is responsive to parameters that affect carrier rate and data accuracy, and it is independent of the distribution of fitness effects. The epidemiological study was performed using a total of 7,595 samples from the NHLBI and 1000G datasets, and a target mutation with a carrier rate of 0.001 in this group could theoretically be detected with a probability of 99.95% under the condition of σ = 0.01. When the ExAC dataset was used under the same conditions, the probability of non-detection was 7.70E-25%. Exome sequencing errors are now generally small (σ < 0.01) and thus have a small effect on mutation detection rates (S1 Fig).
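The two probabilities quoted above can be checked with a few lines of Python. The sketch evaluates D = 1 - {1 - p(1 - σ)}^N, the formula given in the Methods; the function names are illustrative. The miss probability is computed directly rather than as 1 - D, because D rounds to exactly 1.0 in double precision for the ExAC sample size.

```python
def miss_probability(p, sigma, n):
    """Probability that a mutation with carrier rate p is seen in none of n exomes.

    sigma is the sequencing error rate, so p * (1 - sigma) is the chance that a
    single exome carries the mutation and it is called correctly.
    """
    return (1.0 - p * (1.0 - sigma)) ** n


def detection_rate(p, sigma, n):
    """Mutation detection rate D = 1 - {1 - p(1 - sigma)}^n."""
    return 1.0 - miss_probability(p, sigma, n)


# 7,595 samples (NHLBI plus 1000G), carrier rate 0.001, error rate 0.01.
d_combined = detection_rate(0.001, 0.01, 7595)
print(f"Detection rate, N = 7,595: {100 * d_combined:.2f}%")

# ExAC sample size under the same conditions.
miss_exac = miss_probability(0.001, 0.01, 60706)
print(f"Non-detection probability, N = 60,706: {100 * miss_exac:.2e}%")
```

Both printed values agree with the figures in the text (99.95% and 7.70E-25%).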

Discussion
During the past several decades, biomedical research has identified the causative genes for >3,000 Mendelian disorders [1][2][3][4]. NGS results have provided empirical evidence that the genetic architecture of Mendelian disease is one of many rare causal mutations, although NGS has not yet identified all genetic mutations [2][3][4]. Despite the accumulation of significant genetic data, the epidemiology of Mendelian disorders remains largely unknown. The initial study here demonstrated that genetic risk prediction using exome sequences accurately revealed carrier frequencies for rare Mendelian mutations with a small margin of error (Fig 2 and Table 2). The estimation algorithm was successfully applied to developing countries and showed strong regional specificity of causative alleles (Fig 2). This study also set priorities by aligning causative mutations with their carrier rates (Table 2). The accumulation of these data will make it possible to perform closely focused diagnostic genetic tests in specific countries and cities and to plan clinical services, assess priorities, and monitor prevalence trends. I have recently shown that exome-based epidemiology also has the potential to provide a clue to understanding the penetrance of each mutation [25].
A recent exome-based study [26], which focused on common diseases of interest, also successfully performed risk prediction for target genetic disorders of newborn screening, age-related macular degeneration (ARMD), and drug response across two populations (African American and European). Their results and mine suggest that NGS data could yield useful information for applying genetic screening of genetic disorders in clinical practice.
With the exception of Asian populations, the populations have a wide range of genetic variation, and regional specificity is largest in African populations [14,15,27]. Therefore, the recent analysis [26], which did not break down African allele frequency estimates by region, possibly does not reflect the complex genetic structures in African populations. It is rational to analyze country-by-country and ethnicity-by-ethnicity epidemiology by using 1000G (Fig 2).

Data quality and limitations
The simulation studies here suggest that larger sample sizes or combination studies will allow for more accurate prediction of genetic risk (Fig 3). The ExAC data highlighted the usefulness of a large population size. Note, however, that the present ExAC data also contain individuals sequenced as part of various disease-specific studies and do not reflect the complex genetic structures in African populations.
There are also some logistical issues that must be addressed when performing genetic epidemiological studies. The first limiting factor is consanguineous marriage [28][29][30], which is an irregularity from the standpoint of population genetics. This practice largely influences the prevalence rate of autosomal recessive disorders [28,29]. Most recent studies have used whole-exome sequencing of individuals from consanguineous families to identify rare coding variants underlying rare disorders [2][3][4], and some rare heritable disorders may never occur with outbreeding. Rates of consanguinity (e.g., marriage between cousins) vary greatly between and within countries and regions, but the prevalence is highest in North Africa, the Middle East, and South Asia and among migrant communities in North America, Europe, and Australia [29,30]. At present, about 20% of the world's population lives in communities with a preference for consanguineous marriages [29]. Public understanding of the genetic risk of consanguinity is still low in these countries [29,30]. The currently accepted view is that consanguinity infrequently causes genetic disease, so it is important to provide evidence-based recommendations for genetic counseling and screening for consanguineous couples and not to provoke unnecessary alarm. The research here may help disseminate an overview of the reproductive risks associated with consanguinity once the sample sizes are further extended. Intriguingly, recent research also suggests that the genomic inbreeding coefficient of individuals is unexpectedly high, to varying degrees, even in the 1000G data [31].
The second limiting factor is prenatal genetic counseling and testing. SCA, for which the U.S. Preventive Services Task Force (USPSTF) recommends screening [30], is a good example. Recent advances in prenatal genetic diagnosis make it easier than ever to gather more information on individuals prior to their birth [32,33]. It is, therefore, crucial to consider the potential effect of abortion on the prevalence rates.
The third limiting factor is the mode of inheritance. The initial dataset in this study was originally derived from individuals with no cognitive impairment. Predicting risk has been successful for diseases that follow a simple mode of recessive inheritance, but risk prediction is challenging for autosomal dominant traits in this dataset. To analyze autosomal dominant disorders, it is necessary to sample the general population in a specific area independently of phenotype.
The fourth limiting factor comprises the experimental limitations and uncertainties in identifying causative disease mutations. Disease-causing mutations are often assigned too readily, without analysis of the potential effect of the mutations [34,35], and the population exomes may not have read coverage over all of the causative loci. Some causative mutations may not yet have been reported, and others will arise de novo in the future, as the past has already shown [36][37][38][39]. In addition, the degree of penetrance of the mutations remains largely unknown, and some reported disease mutations may in fact not be disease causing [40]. Therefore, the carrier rates could be underestimated or overestimated. I suppose that discordance between the carrier and prevalence rates of each mutation could provide a clue to understanding penetrance as well as screening priority.

Carrier rate in developing countries
One of the greatest merits of exome-based epidemiology is that part of the public health surveillance of genetic disorders can easily be conducted even in developing countries. According to the World Health Organization (WHO), congenital and inherited disorders increasingly contribute to perinatal morbidity and mortality in developing countries [41]. Despite this fact, many countries in Africa, South Asia, and South America still lack national policies and recommendations regarding screening for developmental abnormalities [12]. Genetic epidemiological studies have the potential to provide scientific evidence of genetic risks in most countries and to disseminate public health advice. Given the lack of sampling depth in these countries, it seems that the ethnic groups who need the information and counseling the most have the least sampling. The geographical portfolio of exome-based prediction could be expanded to more disorders and more countries. Furthermore, on this basis, key infrastructure requirements must be placed in sociopolitical frameworks, and medical resources must be allocated for institutions in both developed and developing countries.

Analysis of genetic mutations using two representative population exome projects
Genotyping pipelines from the 1000G (Phase 1) (http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/) and NHLBI (http://www.nhlbi.nih.gov/) projects were collected in VCF format. The dataset consisted of a total of 15,190 haploid exomes from high-coverage exome sequence data derived from 14 + 2 ethnic groups. The NHLBI data contain individuals sequenced as part of various disease-specific studies and may not fully reflect the precise genetic population structures, whereas 1000G collected healthy individuals. The validity of a part of the NHLBI dataset was previously assessed by NHLBI using Sanger sequencing [novel singleton variants, 143/145 (99%); novel nonsingleton variants, 316/323 (98%)] [17]. The genotype accuracy of 1000G was estimated at 97.4% (20,687/21,235) by comparison with the HapMap genotype calls [15]. The 1000G and NHLBI datasets (VCF files) were filtered with Variant Tools (http://varianttools.sourceforge.net/Annotation/HomePage) and Microsoft Excel by total read depth, the number of individuals with coverage at the site, the fraction of mutation reads in each heterozygote, and the average position of mutation alleles along a read. Eighteen recessively inherited diseases were selected on a trial basis from the literature (published from 1957 to 2014), NCBI OMIM (http://www.ncbi.nlm.nih.gov/omim), and PubMed (http://www.ncbi.nlm.nih.gov/pubmed). Causative mutations for inherited disorders were derived from these datasets based on the corresponding chromosome position (UTR, coding, intron, and splice site). ClinVar and HGMD were additionally reviewed to collect the mutations. Identified mutations were then classified by mutation type, allele frequency, racial group, and clinical impact. Information on mutation types, positions, reference sequences, and pathogenicity was retrieved from NCBI dbSNP (http://www.nlm.nih.gov/SNP/) and the UCSC genome browser (http://genome.ucsc.edu/) to generate exome-based epidemiology. Statistical analysis, including carrier rate (%), was performed with Excel. The ExAC Browser (http://exac.broadinstitute.org/) was additionally searched for the mutation alleles of the 18 inherited disorders. A global map of carrier rate distribution was manually constructed for 15 recessive disorders collated from literature sources. A world map was obtained from Free Editable Worldmap (http://free-editable-worldmap-for-powerpoint.en.softonic.com/) and modified.
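As a rough illustration of how an allele frequency can be read from a site-level VCF record, the Python sketch below parses the INFO column of a single data line. It assumes the record carries the reserved AC (alternate allele count) and AN (total allele number) keys; project-specific files may use different keys, and the example line is hypothetical. The study itself used Variant Tools and Excel, not this code.

```python
def allele_frequencies_from_vcf_line(line):
    """Return one allele frequency per ALT allele from a tab-delimited VCF data line.

    Assumes the INFO column (8th field) contains AC and AN entries.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 tab-separated columns")

    info = {}
    for item in fields[7].split(";"):
        key, _, value = item.partition("=")
        info[key] = value  # flag-type entries get an empty string

    if "AC" not in info or "AN" not in info:
        raise KeyError("INFO column lacks AC and/or AN")

    total_alleles = int(info["AN"])
    alt_counts = [int(count) for count in info["AC"].split(",")]
    if total_alleles == 0:
        return [0.0 for _ in alt_counts]
    return [count / total_alleles for count in alt_counts]


# Hypothetical record: 6 alternate alleles observed among 492 called alleles.
example = "1\t12345\t.\tA\tG\t100\tPASS\tAC=6;AN=492;DP=20000"
print(allele_frequencies_from_vcf_line(example))
```

Filtering on read depth or coverage, as described above, would be applied before this step.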

Pair-wise proportion tests of data consistency between two different exome resources
To project the performance of risk prediction based on analyses of exome sequence studies, I statistically compared exome-based estimates with the clinical prevalence survey. Data consistency was assessed by pair-wise comparisons between populations, testing whether two estimates differed significantly (two-sample test for equality of proportions with continuity correction). The standard hypothesis test was H0: π1 = π2 against the two-sided alternative H1: π1 ≠ π2. The pair-wise proportion test can be used to test the null hypothesis that the proportions (probabilities of success) in two groups are the same. In a two-way contingency table where H0: π1 = π2, this should yield results comparable to those of the ordinary χ2 test.
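The relationship between the proportion test and the χ2 test can be made concrete with a small Python sketch. For a 2 × 2 table, the Pearson χ2 statistic without continuity correction equals the square of the pooled z statistic; the optional Yates correction shown here is the textbook form. The function names and the example counts are assumptions for illustration only.

```python
import math


def pooled_z(x1, n1, x2, n2):
    """Pooled z statistic for two proportions x1/n1 and x2/n2."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return 0.0 if se == 0.0 else (x1 / n1 - x2 / n2) / se


def chi2_2x2(x1, n1, x2, n2, correct=False):
    """Pearson chi-square statistic for the 2 x 2 table [[x1, n1-x1], [x2, n2-x2]].

    With correct=True, the Yates continuity correction is applied.
    """
    a, b = x1, n1 - x1
    c, d = x2, n2 - x2
    n = n1 + n2
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a zero margin: the table carries no information
    diff = abs(a * d - b * c)
    if correct:
        diff = max(0.0, diff - n / 2.0)
    return n * diff ** 2 / denom


def chi2_p_value_df1(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical allele counts in two exome resources.
stat = chi2_2x2(6, 492, 57, 4406)
print(f"chi2 = {stat:.5f}, z^2 = {pooled_z(6, 492, 57, 4406) ** 2:.5f}")
print(f"P (uncorrected) = {chi2_p_value_df1(stat):.3f}")
print(f"P (Yates)       = {chi2_p_value_df1(chi2_2x2(6, 492, 57, 4406, correct=True)):.3f}")
```

The continuity correction can only shrink the statistic, so the corrected test is the more conservative of the two.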

Mutation detection simulation of inherited diseases
To perform mutation detection simulation based on population exome sequences, a deterministic formula, $D = 1 - \{1 - p(1 - \sigma)\}^{N}$, was used to predict the mutation detection rate of genetic risk using exome studies, assuming a single-gene disease with an autosomal recessive inheritance pattern. The variable p refers to the mutation carrier rate in each population, and σ indicates the error rate of exome sequencing. N refers to the number of exomes available for epidemiological analysis. The simulation curve for the mutation detection rate was calculated and drawn using the R 3.13 statistical software (http://www.r-project.org/) together with the RColorBrewer package (http://cran.r-project.org/web/packages/RColorBrewer/index.html).
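Inverting the detection formula gives the number of exomes needed to reach a target detection rate, N ≥ ln(1 - D) / ln{1 - p(1 - σ)}. The Python sketch below implements this inversion and tabulates a few points of the simulation curve; it is an illustration written for this purpose (the original curve was drawn in R), and the function names and the 95% target are assumptions.

```python
import math


def detection_rate(p, sigma, n):
    """Mutation detection rate D = 1 - {1 - p(1 - sigma)}^n."""
    return 1.0 - (1.0 - p * (1.0 - sigma)) ** n


def required_sample_size(p, sigma, target):
    """Smallest n for which detection_rate(p, sigma, n) >= target."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    per_exome = p * (1.0 - sigma)
    if not 0.0 < per_exome < 1.0:
        raise ValueError("p * (1 - sigma) must lie strictly between 0 and 1")
    # log1p(-x) evaluates ln(1 - x) accurately for small x.
    return math.ceil(math.log1p(-target) / math.log1p(-per_exome))


# Points along the curve for a carrier rate of 0.001 and an error rate of 0.01.
for n in (1, 10, 100, 1000, 10000, 100000):
    print(f"N = {n:>6}: D = {detection_rate(0.001, 0.01, n):.4f}")

print("Exomes needed for 95% detection:", required_sample_size(0.001, 0.01, 0.95))
```

Because N enters as an exponent, the required sample size grows roughly in inverse proportion to the carrier rate.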

Fig 1. Strategy for epidemiological research on Mendelian disorders using exome sequences. A flow chart used to study the geographic prevalence shows the process of mutation detection using the 1000G and NHLBI datasets. A total of 15,190 haploid exomes were screened for 161 causative mutations linked to 18 genetic disorders. Several platforms (NCBI dbSNP and UCSC Browser) were used to assess the validity of mutations and examine previous information on gene annotations and alleles. doi:10.1371/journal.pone.0155552.g001

Fig 2. Geographical minor allele frequency distribution for the causative mutations of three representative Mendelian disorders. Pie areas are proportional to the minor allele frequency of the causative mutations for three inherited diseases (A: SCA, B: Pustular psoriasis, C: Miller syndrome). 1000G and NHLBI (2 + 14) populations are displayed separately. The thick white circle indicates the absence (0%) of mutations in the population. The right bar chart shows the mutation minor allele frequency in each population. A world map was obtained from Free Editable Worldmap (http://free-editable-worldmap-for-powerpoint.en.softonic.com/) and modified. doi:10.1371/journal.pone.0155552.g002

Fig 3. Risk simulation and mutation detection rate of autosomal recessive disease. The simulation graph depicts the theoretical mutation detection probability of high-penetrance genetic mutations (under the condition of σ = 0.01) that are associated with inherited disorders. The simulation sample sizes range from 1 to 100,000. The y-axis corresponds to the detection rate of causative mutations. doi:10.1371/journal.pone.0155552.g003

Table 1. Estimated carrier rates of 15 Mendelian disorders by race, ethnicity, and country. The information about the mutations and carrier rates is shown in this table. Pustular psoriasis caused by is yet described in OMIM. The abbreviations are as follows: AA, African Americans; EA, European Americans; ASW, Americans of African Ancestry in SW; CEU, Utah Residents (CEPH) with Northern and Western European ancestry; CHB, Han Chinese in Beijing; CHS, Southern Han Chinese; CLM, Colombian from Medellin; FIN, Finnish in Finland; GBR, British in England; IBS, Iberian population in Spain; JPT, Japanese in Tokyo; LWK, Luhya in Webuye; MXL, Mexican ancestry from Los Angeles; PUR, Puerto Ricans from Puerto Rico; TSI, Toscani in Italia; YRI, Yoruba in Ibadan.

Table 2. Comparison of predicted exome-based carrier rates with previous clinical estimates. The P-value is calculated from Chi-square tests between two carrier estimates.

Table 3. Estimated carrier rates of 17 Mendelian disorders using ExAC data. The carrier rates of Mendelian disorders were estimated using the ExAC dataset. Childhood cardiomyopathy (MIM no description) and Usher syndrome type 1J (USH1J) (#614869) were detected in ExAC but not in 1000G and NHLBI. ExAC populations are largely divided into six races (African, Latino, European (non-Finnish), European (Finnish), South Asian, and East Asian) and Other.