DALIA: a comprehensive resource of Disease Alleles in the Arab population

The Arab population encompasses over 420 million people and is characterized by genetic admixture and a consequently rich genetic diversity. A number of genetic diseases have been reported for the first time from this population. Additionally, a high prevalence of some genetic diseases, including autosomal recessive disorders such as hemoglobinopathies and Familial Mediterranean Fever, has been found in the population and across the region. There is a paucity of databases cataloguing genetic variants of clinical relevance from the population. The availability of such a catalog could have implications for precise diagnosis, genetic epidemiology and prevention of disease. To fill this gap, we have compiled DALIA, a comprehensive compendium of genetic variants reported in the literature and implicated in genetic diseases reported from the Arab population. The database aims to act as an effective resource for population-scale and sub-population-specific variant analyses, providing a ready reference that aids clinical interpretation of genetic variants and genetic epidemiology, and facilitates rapid screening and evaluation of evidence on genetic diseases.


Introduction
The Arab population spans twenty-three Arab States with a total population of around 420 million individuals, roughly 5% of the world population [1,2]. The rich genetic diversity of the Arab population results from genetic admixture following waves of migration, which formed one of the largest geocultural units in the world [3]. A number of genetic diseases are prevalent in the Arab population. These include hemoglobinopathies such as sickle cell disease (HbS) and thalassemias, Familial Mediterranean Fever, osteopetrosis syndromes, glucose-6-phosphate dehydrogenase (G6PD) deficiency and several metabolic disorders [4]. The prevalence of genetic disorders in the Arab population has been extensively reviewed by a number of authors [4][5][6][7][8]. It is widely believed that well-curated and annotated genetic datasets are urgently required to establish disease epidemiology, especially in genetically isolated subpopulations, which bear a particularly heavy disease burden [4]. A few efforts in the past have attempted to systematically catalog the genetic diseases in the Arab population. The first major initiative was the Arab Genetic Disease Database, put together by Teebi and colleagues [9]. This was followed by a more systematic effort in later years, resulting in the creation of the Catalogue of Transmission Genetics in Arabs (CTGA), a comprehensive resource mapping genetic diseases in the Arab population. The database, created by the Centre for Arab Genomic Studies (CAGS) [10], provides information about phenotypes and related genes. In addition, a few resources have emerged that collect genetic variants in Arab populations. These include the Greater Middle East (GME) initiative [10,11], which sequenced multiple individuals in the region to create a comprehensive variome resource. In addition, the Qatar Genome Programme (About Qatar Genome, retrieved from https://qatargenome.org.qa/node/5) and a dataset of over 1005 Qatari exomes and genomes [12] aim to provide an overview of genetic diversity in the region. More recently, a genome-scale database, al mena [13], was created by our group, integrating datasets to provide a comprehensive allele frequency resource for the population, in addition to providing allele frequencies of variants associated with diseases. A few more population-scale initiatives are underway, including the Saudi Human Genome Project [14] and the 1000 Arab Genome Project [14,15].
With the increasing adoption of sequencing in clinical settings, a number of new genetic variants and diseases are being reported from the region. Nevertheless, there has been no structured effort to curate these genetic variants in a systematic format that allows comparison and enables genetic epidemiology analyses. Our previous studies demonstrate how a systematic, manually curated resource of genetic variants could help establish the genetic epidemiology of disease-causing variants in the population, with implications for better diagnosis of disease [13,14,16].
To fill this gap, we have created a comprehensive, manually curated resource of genetic variants in the Arab population. This database, DALIA (Disease Alleles in Arabs), provides clinicians, patients and researchers with a ready reference to genetic variants published from Arab populations. In addition, it serves as a resource for genetic epidemiology.

Literature coverage and curation
A list of relevant publications was retrieved using the pubmed.mineR tool [17], with country names used to query for publications describing "mutation", "variant" or "polymorphism". An exhaustive list of PubMed IDs was retrieved for each of the 23 countries which have a significant number of Arabs. These include the 22 Arabic-speaking countries of the Arab League (Algeria, Bahrain, the Comoros Islands, Djibouti, Egypt, Iraq, Jordan, Kuwait, Lebanon, Libya, Morocco, Mauritania, Oman, Palestine, Qatar, Saudi Arabia, Somalia, Sudan, Syria, Tunisia, the United Arab Emirates and Yemen), as well as Israel, which has a significant number of Arabs [18][19][20][21][22][23].
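The retrieval step can be illustrated with a short Python sketch that builds one boolean query string per country from the three search terms. The function name `build_query` and the query syntax are assumptions made for illustration only; the retrieval described above was performed with pubmed.mineR, and the exact query form it used is not stated here.

```python
# Hypothetical helper: builds a PubMed-style boolean query for one country.
# The AND/OR syntax is an assumption, not the query actually submitted.
TERMS = ("mutation", "variant", "polymorphism")

def build_query(country, terms=TERMS):
    """Combine a country name with the search terms using AND/OR."""
    term_clause = " OR ".join(f'"{t}"' for t in terms)
    return f'"{country}" AND ({term_clause})'

# Illustrative subset of the 23 countries listed above.
countries = ["Algeria", "Bahrain", "Qatar"]
queries = {c: build_query(c) for c in countries}
print(queries["Qatar"])
```

Building the query per country keeps the resulting PubMed ID lists separable by country, which matches the per-country curation described in the text.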
Each of the full-text articles was retrieved and manually curated to capture an extensive array of information, including country of origin and ethnicity, the variant type, and the methods used for the annotation of variants. Special emphasis was placed on including only papers that described variants in these countries, and not those that merely reported them.
Each of the variants was further normalised to the GRCh37/hg19 build of the human genome for genomic location and variant position, and represented according to the Human Genome Variation Society (HGVS) nomenclature. Gene names were similarly normalised to the HUGO Gene Nomenclature Committee (HGNC) nomenclature. The Mutalyzer tool [24] was used to check the consistency of the normalised variants. Wherever applicable, dbSNP IDs and ClinVar IDs were added for variants mapping to the respective databases. Disease names were also normalised according to the annotations in the Online Mendelian Inheritance in Man (OMIM) [25][26][27] database and linked wherever they could be consistently mapped.
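As a rough illustration of the kind of syntactic check that precedes full normalisation, the Python sketch below parses only the simplest HGVS form, a coding-DNA substitution on a RefSeq transcript. It is not a substitute for Mutalyzer and does not validate against a reference sequence; the function name and the placeholder variant string are hypothetical.

```python
import re

# Matches only simple coding-DNA substitutions, e.g. "NM_000000.0:c.100A>G".
# Deletions, insertions, intronic offsets and protein-level names are out of scope.
SIMPLE_SUB = re.compile(r"^(NM_\d+\.\d+):c\.(\d+)([ACGT])>([ACGT])$")

def parse_simple_substitution(hgvs):
    """Return the parsed fields, or None if the string is not a simple substitution."""
    match = SIMPLE_SUB.match(hgvs.strip())
    if match is None:
        return None
    transcript, position, ref, alt = match.groups()
    if ref == alt:
        return None  # a substitution must change the base
    return {"transcript": transcript, "position": int(position),
            "ref": ref, "alt": alt}

# Placeholder identifier, not a real curated variant.
print(parse_simple_substitution("NM_000000.0:c.100A>G"))
```

A check of this kind can flag obvious transcription errors in the curation sheets before the entries are passed to a dedicated tool.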
For each of the variants, additional information on the zygosity of the variant and the method or assay used for its identification was also collected. To ease annotation, a prefilled list of techniques was provided as a drop-down. The entire activity was performed on a template available to all annotators online through Google Docs as well as offline through templates in Microsoft Excel. The annotators individually filled in variants for each country. Along with the template, the annotators were also given tutorial slides on the curation guidelines so as to maintain uniformity across all curators.

Quality control and compilation of annotations
The annotation sheets containing the variants from each annotating group were compiled. Each of these sets was then independently cross-checked by a different team member to eliminate manual errors. The corrected sets were compiled into the master sheet. Each of the entries was verified for (i) accuracy and consistency of annotations as per the HGVS nomenclature for variants, (ii) accuracy of the variant position and of the reference and alternate alleles on the hg19 genome build, and (iii) uniformity and consistency of annotations, using the Mutalyzer tool.

Allele frequencies and genetic epidemiology
The allele and genotype frequencies of the variants were systematically compiled for global as well as Middle Eastern populations. Middle Eastern datasets were chosen because no publicly available dataset is specifically representative of the Arab populations, and datasets from the region could, at least in part, provide insights into the prevalence of variants in the region. The global allele frequencies were derived from the ExAC [28] and 1000 Genomes [29] databases, while the allele frequencies in Middle Eastern populations were compiled from the al mena database. The al mena database encompasses data from the Qatar subpopulations representing African, Arabian, Bedouin, Persian and South Asian ancestry subgroups, as well as data from the subpopulations represented in the Greater Middle East (GME) study [11]: Arabian Peninsula, Central Asia, Israel, Northeast Africa, Northwest Africa, Syrian Desert and Turkish Peninsula. Allele frequency differences were tested for significance using Fisher's exact test. Comparisons were made for each subpopulation against the population average and against the global frequencies derived from the ExAC and 1000 Genomes databases.
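To make the comparison concrete, the following Python sketch computes a two-sided Fisher's exact test on a 2x2 table of alternate and reference allele counts in two populations. It is a minimal standard-library implementation for illustration; the function name and the example counts are hypothetical, and the software actually used for the tests is not specified in the text.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are populations; columns are alternate and reference allele counts.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x alternate alleles in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = 0.0
    for x in range(lo, hi + 1):
        p = prob(x)
        # Sum all tables that are as likely as, or less likely than, the observed one.
        if p <= p_obs * (1 + 1e-7):
            total += p
    return total

# Hypothetical counts: 10 of 200 chromosomes in a subpopulation versus
# 2 of 2000 chromosomes in a global dataset carry the alternate allele.
p_value = fisher_exact_two_sided(10, 190, 2, 1998)
print(p_value < 0.01)
```

Working from allele counts rather than frequencies matters here, because the test depends on the number of chromosomes sampled in each dataset.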

Analysis of genes under natural selection
We used two different metrics, the Integrated Haplotype Score (iHS) and the fixation index (Fst), to identify genes under natural selection. Genes in the top 1% by iHS score in the Qatar population were searched against the genes in the database. Pairwise Fst scores [30] were then computed between the overall Qatari population (QALL) and each of the African (QAFR), Arab (QARB), Bedouin (QBED), Persian (QPER), and South Asian (QSAS) subpopulations.
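The two steps can be sketched in Python as follows. The Fst function uses Wright's heterozygosity-based definition for a single biallelic site with two equally weighted populations, which is an assumption: the estimator of reference [30] is not described here and may differ. The gene names and scores in the example are placeholders.

```python
def wright_fst(p1, p2):
    """Wright's Fst at one biallelic site from two allele frequencies.

    Assumes equal population weights: Fst = (H_T - H_S) / H_T.
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    if h_total == 0:
        return 0.0  # site is fixed for the same allele in both populations
    return (h_total - h_within) / h_total

def top_fraction_genes(ihs_by_gene, fraction=0.01):
    """Return the genes in the top `fraction` ranked by absolute iHS score."""
    ranked = sorted(ihs_by_gene, key=lambda g: abs(ihs_by_gene[g]), reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    return set(ranked[:n_top])

# Placeholder data: 200 hypothetical genes with made-up scores.
scores = {f"gene{i}": i / 100 for i in range(200)}
database_genes = {"gene199", "gene5"}
print(top_fraction_genes(scores) & database_genes)
print(wright_fst(0.2, 0.6))
```

Intersecting the top-ranked set with the database gene list mirrors the search described above, while Fst is evaluated per variant between a pair of populations.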

Database and web interface design
The database was developed using MongoDB, a popular open-source NoSQL database system, in view of its flexibility and scalability for population-scale variant curation. The web server was configured on Apache 2.4.12, and the interface was coded in AngularJS and PHP.

Data compilation
We compiled a total of 3577 genetic variants from 368 genes, associated with 1984 diseases, from over 1113 publications originating from 23 countries. Of these, a total of 2790 variants were unique, of which 2679 variants (96%) were exonic, while the remaining included 110 non-exonic variants (26 3'UTR, 24 intronic, 17 splicing, 12 5'UTR, 11 upstream, 1 ncRNA intronic, and 1 downstream).
Apart from the genetic context, associated information on the gene, the disease, the population from which the variant was reported, and a spectrum of computational scores predicting pathogenicity of the variants, including SIFT [31], PolyPhen [27] and CADD [25], was also compiled for each of the variants. Allele frequencies of the variants across population-scale datasets like gnomAD [32], along with linkouts to other relevant databases including OMIM, ClinVar [33] and dbSNP [34], were also compiled systematically.

Database features
The database is designed to provide a user-friendly interface to the data compilation. The user can search the resource in multiple query formats, including genomic location, rsID or gene name. Population-specific variants can also be viewed by searching for the name of a country. A complete list of example formats is available on the homepage (Fig 1). Each search returns a comprehensive result containing details of the gene, the disease caused, and the ethnic and geographic details of the patient, along with allele frequencies and variant annotation scores (Fig 2).
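One way such multi-format search could be dispatched is sketched below in Python: the query string is classified by pattern before the corresponding lookup is run. The patterns, the category names and the short country list are assumptions for illustration; the formats actually accepted are those listed on the DALIA homepage, and the production interface is written in AngularJS and PHP.

```python
import re

# Assumed query patterns; the formats accepted by DALIA itself may differ.
RSID = re.compile(r"^rs\d+$", re.IGNORECASE)
LOCATION = re.compile(
    r"^(?:chr)?(?:[1-9]|1\d|2[0-2]|X|Y|MT?):\d+(?:-\d+)?$", re.IGNORECASE
)
COUNTRIES = {"qatar", "egypt", "tunisia"}  # illustrative subset only

def classify_query(text):
    """Guess which kind of search a free-text query represents."""
    query = text.strip()
    if RSID.match(query):
        return "rsid"
    if LOCATION.match(query):
        return "location"
    if query.lower() in COUNTRIES:
        return "country"
    return "gene"  # fall back to a gene-name lookup

for example in ("rs5030872", "chrX:153760000-153775000", "Qatar", "ATP7B"):
    print(example, classify_query(example))
```

Classifying the query first lets a single search box serve all four lookup types without asking the user to choose one.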

Comparison with other databases
The variants in DALIA were compared with the variants in ClinVar and in the public version of HGMD to assess whether the variants covered in the present compilation are also covered in other databases. Comparison of the 2790 unique genetic variants in DALIA with ClinVar revealed that 2074 variants were shared with ClinVar. A similar comparison revealed that 2197 variants were shared with the public version of HGMD.
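A comparison of this kind reduces to a set intersection once variants are expressed with a common key. The Python sketch below assumes a key of chromosome, hg19 position, reference allele and alternate allele; this key and the toy records are assumptions, since the matching criterion used for the comparison is not stated.

```python
def shared_variants(ours, other):
    """Return the variants present in both collections.

    Each variant is a (chromosome, position, ref, alt) tuple on the same build.
    """
    return set(ours) & set(other)

# Toy records with placeholder coordinates, not real curated variants.
dalia = [("X", 1000, "C", "T"), ("13", 2000, "G", "A"), ("1", 3000, "A", "G")]
external = [("X", 1000, "C", "T"), ("1", 3000, "A", "C")]

overlap = shared_variants(dalia, external)
print(len(overlap), "of", len(set(dalia)), "unique variants are shared")
```

Matching on the full tuple rather than on position alone avoids counting different alternate alleles at the same site as shared.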

Allele frequency analysis
We compared the allele frequencies of the variants across four population-scale datasets: 1000 Genomes, ExAC, Greater Middle East (GME), and Qatar. The former two represent global populations, while the latter two represent population data from the Middle East. The Qatar and GME allele frequencies were retrieved from the al mena database. A total of 564 variants had a reported allele frequency in at least one of the four datasets considered, while 255 had frequencies in all four datasets.
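The two counts reported above correspond to "any" and "all" coverage across the four datasets, as in the following Python sketch. The dataset labels, the record layout and the variant identifiers are hypothetical; a frequency of 0.0 is treated as reported, and only a missing value counts as absent.

```python
DATASETS = ("1000G", "ExAC", "GME", "Qatar")  # assumed labels

def coverage_counts(freqs):
    """Count variants with a frequency in at least one, and in all, datasets.

    `freqs` maps a variant id to {dataset label: allele frequency or None}.
    """
    in_any = sum(
        1 for f in freqs.values()
        if any(f.get(d) is not None for d in DATASETS)
    )
    in_all = sum(
        1 for f in freqs.values()
        if all(f.get(d) is not None for d in DATASETS)
    )
    return in_any, in_all

# Placeholder records for three hypothetical variants.
example = {
    "var1": {"1000G": 0.01, "ExAC": 0.02, "GME": 0.0, "Qatar": 0.05},
    "var2": {"GME": 0.001},
    "var3": {},
}
print(coverage_counts(example))
```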

Genes and variants under natural selection
Since individual-level genotypes were available only for the Qatar population, we used this dataset to analyse whether any of the genes harbouring the genetic variants showed signals of natural selection. Briefly, the top 1% of |iHS| scores encompassed 1368 genes, of which a total of 10 overlapped with the genes in our compendium. Pairwise Fst scores were further computed for each of the variants in these genes. The 10 genes in the DALIA database had a total of 12 genetic variants reported from the Arab population. The genes and annotations are summarised in S1 Table. Of specific note is the ATP7B gene, which is associated with Wilson disease. A total of 37 variants in the ATP7B gene associated with Wilson disease were reported in the DALIA database. The variants had an allele frequency ranging from 0 across QAFR, QARAB, QPER and QSAS to 0.777 in the QBED Qatar sub-population, and from 0 across the GME-CA, GME-NWA and GME-TP subpopulations to 0.702 in the GME-AP sub-population of the GME dataset. The iHS scores and the pairwise Fst scores for the variants are plotted in Fig 4. The iHS and Fst scores for the remaining 9 genes are included in the Supplementary file (S2-S10 Figs).

Integration of datasets for understanding genetic epidemiology
We further explored whether the dataset could provide insights into the genetic epidemiology of Mendelian diseases and traits. To this end, we analysed the allele frequencies of the variants in the Qatar exome/genome dataset as well as the GME datasets. This analysis revealed a total of 94 variants whose allele frequencies were significantly different (Fisher's exact test, p-value < 0.01) in one or more populations or subpopulations. Among these 94 variants, we found 4 variants mapping to the G6PD gene, associated with G6PD deficiency or favism, a prevalent disease in the Middle East. To provide a common standard for comparison of the pathogenicity annotations for the variants, we reclassified them according to the ACMG/AMP guidelines for interpretation of sequence variants [35]. Two of the four variants were found to have sufficient evidence to be annotated as pathogenic or likely pathogenic. The variants, annotations and allele frequencies in each of the populations and sub-populations are summarised in Table 1.
For further analysis, we additionally compiled all known pathogenic or likely pathogenic variants in the G6PD gene reported in ClinVar as well as other public databases. This resulted in a final list of 44 unique variants. We further analysed allele frequencies for these variants across the four databases. Of the 44 variants, 21 had known allele frequencies in one or more of the four population datasets, and 6 had allele frequencies in either or both of the Middle Eastern population datasets (Qatar and GME).
Further analysis of these 6 variants revealed that the variants had allele frequencies ranging from 0.0 to 0.187 across different sub-populations. Some of the variants showed significant population specificity, as in the case of rs5030872, which had allele frequencies in GME_ALL (0.001) and GME_NWA (0.01) but has not been reported in any of the Qatar or 1000 Genomes datasets. Similarly, rs2515904 has allele frequencies reported in the Qatar (0-0.187) and 1000 Genomes (0-0.173) population datasets, but not in the GME and ExAC populations, indicating differences even between the Qatar and GME populations. Some variants showed intrapopulation variation, for example rs2515904, which had allele frequencies in QALL

Conclusions
In summary, DALIA fills the need for a relevant, manually curated and annotated database of variants associated with genetic diseases in the Arab populations. The DALIA database provides searchable access to over 2700 genetic variants and associated information published from the region. Apart from serving as a ready reference for clinicians and clinical geneticists, the database, as we illustrate with examples, could provide insights into the genetic epidemiology of diseases as well as clues to prevalent genetic variants in subpopulations, which is pertinent to developing approaches for cost-effective diagnosis and screening of genetic diseases. With population-scale genome programmes now underway in the region [14,15,36], we expect that DALIA will become an important resource enabling appropriate analysis and interpretation of genomic data, as well as a central resource for the genetic epidemiology of diseases. Similarly, with whole exome and whole genome sequencing becoming prevalent in clinical settings for the diagnosis of genetic diseases [36][37][38][39], the accelerated discovery of genetic variants in diseases is expected to benefit the DALIA database in years to come.
To the best of our knowledge, DALIA is one of the largest compilations of disease-associated genetic variants from the Arab populations.

[Figure: Plot of the iHS and pairwise Fst scores along the gene locus for all known variants of the gene ABCA4. The array of lines at the bottom represents all known variants, and the exon structure of the gene is shown beneath it.]