Investigation of mutations in the HBB gene using the 1,000 genomes database

Mutations in the HBB gene are responsible for several serious hemoglobinopathies, such as sickle cell anemia and β-thalassemia. Sickle cell anemia is one of the most common monogenic diseases worldwide. Due to its prevalence, diverse strategies have been developed for a better understanding of its molecular mechanisms. In silico analysis has been increasingly used to investigate the genotype-phenotype relationship of many diseases, and the sequences of healthy individuals deposited in the 1,000 Genomes database appear to be an excellent tool for such analysis. The objective of this study is to analyze the variations in the HBB gene in the 1,000 Genomes database, to describe the mutation frequencies in the different population groups, and to investigate the pattern of pathogenicity. The computational tool SNPEFF was used to align the data from 2,504 samples of the 1,000 Genomes database with the HG19 genome reference. The pathogenicity of each amino acid change was investigated using the databases CLINVAR, dbSNP and HbVar and five different predictors. Twenty different mutations were found in 209 healthy individuals. The African group had the highest number of individuals with mutations, and the European group had the lowest number. Thus, it is concluded that approximately 8.3% of phenotypically healthy individuals from the 1,000 Genomes database have some mutation in the HBB gene. The frequency of mutated genes was estimated at 0.042, so that the expected frequency of being homozygous or compound heterozygous for these variants in the next generation is approximately 0.002. In total, 193 subjects had a non-synonymous mutation, of whom 186 (7.4%) had a deleterious mutation. Considering that the 1,000 Genomes database is representative of the world’s population, it can be estimated that fourteen out of every 10,000 individuals in the world will have a hemoglobinopathy in the next generation.


PLOS ONE | https://doi.org/10.1371/journal.pone.0174637 | April 5, 2017

Introduction
Understanding the relationship between phenotype and genotype in the clinical setting is one of the main objectives of traditional research [1]. However, studies on a large number of mutations are problematic, primarily because of the experimental analyses they require. In contrast, in silico analysis is faster and easier to execute, yields more results, and costs less, thus making it more efficient. This type of analysis is based on alterations in the sequences of nucleotides and/or amino acids and their comparison with the native sequence to correlate the effect of these alterations on the phenotype of the individual [1,2,3,4]. Mutations in the HBB gene, which is located on chromosome 11p15.5 [5], are responsible for several serious hemoglobinopathies, such as sickle cell anemia and β-thalassemia. Hemoglobinopathies are a set of hereditary diseases caused by the abnormal structure or insufficient production of hemoglobin. Sickle cell anemia and β-thalassemia can lead to serious anemia and other life-threatening conditions [6]. Sickle cell anemia is one of the most common monogenic diseases worldwide. It is estimated that 312,000 people are born with sickle cell anemia every year, and the majority of these individuals are native to Sub-Saharan Africa [7]. Thus, it is important for the public healthcare system to detect heterozygous carriers of hemoglobinopathies, as they can produce homozygous and double heterozygous individuals with serious clinical conditions [8].
The 1,000 Genomes Project is an international consortium organized with the objective of sequencing a large number of individual genomes representative of the world's population. The consortium aims to better characterize the sequence variation of the human genome and to enable the investigation of the relationship between genotype and phenotype. Thus, the 1,000 Genomes Project enables a more precise study of variants in genome-wide association studies (GWAS) and better localization of variants associated with diseases in different population groups [9].
The objective of this study is to track variations in the β-globin gene (HBB); to describe the frequencies of mutations in different population groups using the 1,000 Genomes databank, which provides a comprehensive resource of human genetic variation [9] relative to the HG19 reference genome [10]; and to investigate the pattern of resulting pathogenicity.

Methodology
To perform this study, data from 2,504 samples deposited in the 1,000 Genomes database were used; these open-access sequences were aligned with the HG19 reference genome using the SNPEFF tool [11]. This program annotates and records the effects of both genetic variations and amino acid alterations. The resulting data were visualized in the Integrative Genomics Viewer (IGV) [12], a high-performance visualization tool for the interactive exploration of genomic datasets. The mutations were tracked at the nucleotide and amino acid levels, and the population frequency, type, and position of each mutation were recorded.
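The recording step described above (carrier counts and allele frequencies per variant) can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `make_record` and `tally_genotypes`, the variant labels, and the four-sample genotype data are all hypothetical, and the input only imitates the tab-separated layout of a VCF file with a `GT` genotype column per sample.

```python
def make_record(variant_id, genotypes):
    """Build one tab-separated VCF-style data line (toy helper, hypothetical)."""
    fixed = ["11", "1", variant_id, "A", "T", ".", "PASS", ".", "GT"]
    return "\t".join(fixed + genotypes)


# Toy input: a header line plus two invented variants in four invented samples.
TOY_LINES = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\tS4",
    make_record("toy_variant_1", ["0|0", "0|1", "1|0", "0|0"]),
    make_record("toy_variant_2", ["0|0", "0|0", "0|1", "0|0"]),
]


def tally_genotypes(lines):
    """Count heterozygous and homozygous-alternate carriers per variant."""
    results = {}
    for line in lines:
        if not line or line.startswith("#"):
            continue  # skip blank and header lines
        fields = line.split("\t")
        variant_id = fields[2]
        genotypes = fields[9:]  # one genotype string per sample
        het = hom_alt = alt_alleles = 0
        for genotype in genotypes:
            alleles = genotype.replace("|", "/").split("/")
            # "0" is the reference allele, "." is a missing call
            n_alt = sum(1 for allele in alleles if allele not in ("0", "."))
            alt_alleles += n_alt
            if n_alt == 1:
                het += 1
            elif n_alt == 2:
                hom_alt += 1
        results[variant_id] = {
            "het": het,
            "hom_alt": hom_alt,
            # diploid assumption: two alleles per sample
            "allele_freq": alt_alleles / (2 * len(genotypes)),
        }
    return results


summary = tally_genotypes(TOY_LINES)
for variant_id, stats in summary.items():
    print(variant_id, stats)
```

With real data, the same tally would be run per population group to obtain group-specific frequencies; a dedicated VCF parser would be preferable to manual string splitting.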
To investigate the pathogenicity of these mutations, five different prediction tools (POLY-PHEN [13], SIFT [14], PROVEAN [15], PANTHER [16], and MUTPRED [17]) and three databanks (CLINVAR [18], dbSNP [19] and HbVar [20]) were used, as shown in Fig 1. Each predictor uses distinct characteristics to determine the effect of the mutations in relation to the information obtained regarding the structure and function of the protein. It is important to highlight that the results of all predictors provide additional evidence of pathogenicity; thus, five predictors were analyzed to improve accuracy. The determination of the pathogenicity of each mutation is based on four pieces of evidence: (i) CLINVAR, (ii) dbSNP, (iii) HbVar, and (iv) predictors.
Table 1. All observed mutations were heterozygous and already had SNP IDs. The mutations with the highest allelic frequencies were as follows: (i) rs334 had a total frequency of 0.0274 (African and American populations); (ii) rs33930165 had a frequency of 0.0034 (only in the African population); and (iii) rs33950507 had a frequency of 0.0028 (East and South Asian populations), as shown in Table 2.
Synonymous mutations were encountered in 16 (7.6%) samples and were excluded from the investigation of pathogenicity performed by the database predictors because they do not alter the amino acid sequence.
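As a rough consistency check, the allele frequencies reported above can be converted back into approximate allele counts. The sketch below assumes two HBB alleles per sample (2 × 2,504 = 5,008 alleles in total); because the reported frequencies are rounded, the counts are approximations, and the names `reported_frequencies` and `implied_allele_count` are illustrative only. Since all observed mutations were heterozygous, each allele count also approximates the number of carriers.

```python
N_SAMPLES = 2504
TOTAL_ALLELES = 2 * N_SAMPLES  # assumption: two HBB alleles per sample

# Allele frequencies as reported in the text (rounded values).
reported_frequencies = {
    "rs334": 0.0274,
    "rs33930165": 0.0034,
    "rs33950507": 0.0028,
}


def implied_allele_count(frequency, total_alleles=TOTAL_ALLELES):
    """Approximate number of mutant alleles implied by a rounded frequency."""
    return round(frequency * total_alleles)


implied_counts = {
    rsid: implied_allele_count(frequency)
    for rsid, frequency in reported_frequencies.items()
}

for rsid, count in implied_counts.items():
    print(f"{rsid}: about {count} mutant alleles among {TOTAL_ALLELES}")
```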

Discussion
Mutations in the HBB gene are distributed unevenly among the different population groups. The African population was the most affected, accounting for 73.2% of the individuals with mutations in this gene, while the European population was the least affected, accounting for 4.3% of such individuals.
The three mutations with the greatest frequency were (1) rs334 (AFR and AMR); (2) rs33930165 (AFR); and (3) rs33950507 (EAS and SAS). The rs334 mutation is responsible for hemoglobin S, known as HbS, which causes sickle cell anemia. The rs33930165 mutation is responsible for hemoglobin C, or HbC [41], which is more frequent in the African population [42,43]. In addition, the rs33950507 mutation is responsible for hemoglobin E, or HbE [41], which is involved in β-thalassemia described in Asian populations [44].
The available data show that variants rs33986703, rs63750783, and rs281864900 are responsible for β-thalassemia and are described in Asian populations [45,46,39]. Variants rs11549407 and rs33971634 are also β-thalassemia mutations but are common in European populations [47,24]; rs33971440 and rs35578002 are commonly found in populations of the Mediterranean region [48,49,34].
Although the HBB gene is well studied, some mutations in this gene are not well known and are poorly described in the literature. This is the case for the variants rs111645889, rs33958637, rs1135071, rs33943001 and rs33912272, for which no scientific papers discussing their epidemiology were found. CLINVAR [18] is one of the most widely used databases in clinical and pathological analyses related to mutations. However, not all mutations of the HBB gene are registered in this database (rs35578002, for example, is absent), and conflicting results have been observed when comparing predictors with the CLINVAR, dbSNP and HbVar databases to estimate the pathogenicity of each mutation, specifically regarding the clinical significance of mutations rs111645889, rs33946267, rs33958637, rs35578002 and rs33912272.
It is important to emphasize that all samples deposited in the 1,000 Genomes Project, an international consortium aimed at producing a public catalog of human genetic variability, belong to individuals without clinical manifestations of any disease.
The SNP rs35578002 is not available in CLINVAR and has no information on clinical significance in the dbSNP database. Predictors consider this variant benign, but the HbVar database classifies it as a damaging mutation. This variant is the β-thalassemia mutation Cd29 (C>T), which in homozygosis causes hemolytic anemia and ineffective erythropoiesis [34]. This mutation was described in Mediterranean populations. One possible explanation for the inconsistent information about the clinical significance of this variant is that it is a synonymous mutation in a splice region that is critical for RNA processing, causing thalassemia as described in HbVar.
Also noteworthy is the mutation rs33946267. According to the literature, this mutation leads to the formation of Hb D-Punjab. This mutation is generally asymptomatic but may occasionally cause moderate hemolytic anemia, similar to the manifestations of sickle cell anemia, when associated with other hemoglobin variants, such as HbS or β-thalassemia mutations. Its initial distribution suggests that it is more prevalent in the central region of Asia, but due to migration, it can be found in several other regions [50].
According to the results, 8.3% of the phenotypically healthy individuals of the 1,000 Genomes database have a mutation in the HBB gene in heterozygosis. This means that roughly eighty-three out of every 1,000 individuals carry a mutant allele of the gene. The frequency of mutated genes was estimated at 0.042, so that the expected frequency of being homozygous or compound heterozygous for these variants in the next generation is approximately 0.002. In total, 193 subjects had a non-synonymous mutation, meaning that approximately 7.7% had a change that affects the sequence of amino acids. Of these, 186 (7.4%) have a deleterious mutation based on available data on the clinical significance of these mutations (Table 3).
Considering that the 1,000 Genomes database is representative of the world's population, it can be estimated that fourteen out of every 10,000 individuals in the world will have a hemoglobinopathy in the next generation.
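The arithmetic behind these estimates can be reproduced with a short sketch. It assumes Hardy-Weinberg proportions (random mating) and that every carrier holds exactly one mutant allele, which matches the observation that all mutations were heterozygous. Using the 186 carriers of deleterious mutations for the final figure is one reading that reproduces the reported "fourteen out of every 10,000"; the text does not state this step explicitly. The function names are illustrative.

```python
N_SAMPLES = 2504
CARRIERS_ANY_MUTATION = 209   # individuals with any HBB mutation
CARRIERS_DELETERIOUS = 186    # individuals with a deleterious mutation


def mutant_allele_frequency(carriers, samples=N_SAMPLES):
    """Pooled mutant allele frequency q, one mutant allele per carrier."""
    return carriers / (2 * samples)


def expected_affected_fraction(q):
    """Expected fraction with two mutant alleles (homozygous or compound
    heterozygous) under Hardy-Weinberg proportions: q squared."""
    return q ** 2


q_any = mutant_allele_frequency(CARRIERS_ANY_MUTATION)
q_deleterious = mutant_allele_frequency(CARRIERS_DELETERIOUS)

print(f"carrier fraction: {CARRIERS_ANY_MUTATION / N_SAMPLES:.3f}")
print(f"q (any mutation): {q_any:.3f}")
print(f"q^2 (any mutation): {expected_affected_fraction(q_any):.3f}")
per_10000 = expected_affected_fraction(q_deleterious) * 10_000
print(f"expected affected per 10,000 (deleterious only): {per_10000:.1f}")
```

Pooling all variants into a single frequency q treats homozygotes and compound heterozygotes together, which is why q squared covers both cases.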
Regardless, new studies are needed to validate the clinical consequences of the mutations with undefined pathogenicity. Considering the absence of pathophysiological knowledge about the newly identified mutations, the use of in silico predictors (in an orderly and criteria-based manner) emerges as a possible tool to aid in decision-making with respect to diagnostic, preventative, and treatment measures.