Comprehensive analysis of mitochondrial and nuclear DNA variations in patients affected by hemoglobinopathies: A pilot study

The hemoglobin disorders are the most common single gene disorders in the world. Previous studies have suggested that they are deeply geographically structured and a variety of genetic determinants influences different clinical phenotypes between patients inheriting identical β-globin gene mutations. In order to get new insights into the heterogeneity of hemoglobin disorders, we investigated the molecular variations on nuclear genes (i.e. HBB, HBG2, BCL11A, HBS1L and MYB) and mitochondrial DNA control region. This pilot study was carried out on 53 patients belonging to different continents and molecularly classified in 4 subgroup: β-thalassemia (β+/β+, β0/β0 and β+/β0)(15), sickle cell disease (HbS/HbS)(20), sickle cell/β-thalassemia (HbS/β+ or HBS/β0)(10), and non-thalassemic compound heterozygous (HbS/HbC, HbO-Arab/HbC)(8). This comprehensive phylogenetic analysis provided a clear separation between African and European patients either in nuclear or mitochondrial variations. Notably, informing on the phylogeographic structure of affected individuals, this accurate genetic stratification, could help to optimize the diagnostic algorithm for patients with uncertain or unknown origin.

Introduction Worldwide, the most frequent genetic diseases are represented by a very heterogeneous subgroup of congenital hemolytic anaemias, which includes β-thalassemia and sickle cell anaemia [1]. Individuals with β-thalassemia have a diverse range of clinical manifestations, from an asymptomatic state to a transfusion-dependent form, depending mainly on the β-globin gene mutations [2,3]. However, individuals with identical β-thalassemia genotypes can exhibit variable clinical severities [4], because of the high complexity of the genetic background associated with the disease [5,6]. The term sickle cell disease (SCD) encompasses a group of disorders characterized by the presence of at least one hemoglobin S allele (HbS; p.Glu6Val in HBB) and a second HBB allele carrying pathogenic variant resulting in abnormal hemoglobin polymerization. SCD (HbS/HbS) caused by the homozygous HBB variant p.Glu6Val is the most common cause of SCD in the US. SCD caused by compound heterozygous HBB pathogenic variants includes sickle-hemoglobin C disease (HbS/HbC) and two types of sickle β-thalassemia (HbS/β + -thalassemia and HbS/β˚-thalassemia). Other β-globin chain variants such as D-Punjab, O-Arab, and E also result in SCD when inherited with HbS. The marked variability in phenotype severity is due to the different genotypes of the disease and the co-presence of other independent genetic factors that may worsen or alleviate the phenotype [7,8]. Single nucleotide polymorphism (SNPs) in several genetic modulators and cis-regulatory elements involved in the regulation of human HbF, i.e. HBG2, B cell CLL/lymphoma 11A (BCL11A), Kruppel-like factor 1 (KLF1) and MYB, can delay fetal-to-adult hemoglobin (Hb) switch, ameliorating the severity of β-thalassemia and sickle cell disease [9][10][11][12][13][14][15][16][17][18][19][20].
In the last decades, mitochondrial DNA (mtDNA) has been proven to be an effective tool to reconstruct the maternal origins of individuals and geographically map samples, but it also plays a key role in various human diseases. Due to its maternal inheritance and the lack of recombination [26,27], its differentiation is generated only by the sequential accumulation of new mutations from a female ancestor that makes the mtDNA a molecular register along maternal lines. Over millennia, the molecular differentiation gave rise to groups of mtDNAs that share a set of variants (called haplotypes) acquired from the same common female ancestor. Even though whole genomic approaches are now answering to questions related to the origin and diversification of species, the mtDNA continues to be an important source of information to reconstruct the genetic history of specific populations. Studies based on sequence variation especially in the noncoding control region have allowed the classification of distinct geographic and ethnic affinities [28][29][30][31][32][33][34].
Nowadays, a detailed worldwide mtDNA phylogeny is available, where different haplogroups are defined by specific mutational motifs and restricted to specific areas and population groups [35]. The development of phylogeography is closely related to mtDNA analysis, which was described as one of the best tools to study population variability [36].
The consequences of mitochondrial mutations have been also evaluated in human ageing as well as inherited and acquired diseases [37,38], however the only study focusing on the association of mtDNA variations with β-thalassemia claimed that mitochondrial mutations may increase disease susceptibility or the development of phenotype-associated disease in affected individuals, although concluded that there was no strong proof for the association of these polymorphisms with β-thalassemia [39]. Doubtless, as mentioned above, there should be a correlation between hemoglobinopathies and geographic origins of patients [40,41], that could be deepened also by considering the mtDNA variability.
The aim of this pilot study was to provide a first comprehensive analysis of hemoglobin and uniparental marker variants in 53 patients with hemoglobinopathies coming from different countries, in order to depict the phylogeographic structure of affected individuals.

Results and discussion
The study was carried out on 53 patients (M/F: 38/15) affected by hemoglobinophaties who referred to our laboratory at the University of Perugia/Hospital Santa Maria della Misericordia, in Umbria a region of Central Italy.
Except for the first patient for whom we do not know the origin, 16 patients (30.2%) were European, 28 (52.8%) African, four (7.6%) Asian and four (7.6%) American. Complete information about geographical characteristics and genotypes were reported in Table 1, while clinical phenotypes were detailed in S1 Text and S1-S4 Tables.
All sickle cell disease patients carried the same β genotype, while the presence of the -α 3.7 / αα genotype was demonstrated in five patients and the -α 3.7 /-α 3.7 genotype was highlighted in one patient. Sixteen out of 20 (80%) were African patients, two were Americans (10%) and two were Europeans (10%) ( Table 1).
Moreover, seven out of eight patients included in the cohort of compound heterozygotes who carried HbS/HbC genotype (patients 46-51, 53), while the other patient bore HbC/ HbO-Arab genotype (patient 52). -α 3.7 /αα genotype was reported in two HbS/HbC cases. All eight patients were African (Table 1).  Table).

Hemoglobin phenotype modulating SNPs: Geographical distribution
Interestingly, we observed that frequencies of the five analysed SNPs were related to the geographical origin of patients (Fig 1). Although in our cohort there were only four Asian cases, all investigated SNPs were detected. Conversely, in the other patients (i.e. Europeans, Africans, Americans), the frequency of each SNP was considerably different. HBS1L-MYB rs9399137 T>C was not detected in both our African or American populations, while HBG2 rs7482144 C>T was not detected in European patients (Fig 1, Table 2, S1A and S1B Fig).
In particular, HBG2 rs7482144 [C>T] was found in nine patients, five African, three Americans and one Asian (S1A Fig, Table 2). The absence of this variation in European subjects did not completely fit with the allele frequencies retrieved from GnomAD database, which reported T alleles in the European population, although in a lower frequency (3839 T alleles and 10893 C alleles) (S1B and S1C Fig).

PLOS ONE
in 33/53 patients: 14 Africans, ten Europeans, four Asians, four Americans and no precise ethnic information were available for one patient (S1A Fig, Table 2). The allele frequencies of both BCL11A rs1427407 and BCL11A rs10189857 were in agreement with those reported in GnomAD database (Fig 1B and 1C). HBS1L-MYB rs28384513 [A>C] was identified in ten African patients, nine Europeans and two Asians, while it was not detected in American patients ( Fig 1A, Table 2). Notably Gno-mAD database reported the G allele also in this population, although in a lower frequency (9164 T alleles and 5824 G alleles) (S1B and S1C Fig). The low number of American patients included in our cohort (i.e. three pts) could explain the discrepancy between our results and the database.
Finally, the HBS1L-MYB rs9399137 [T>C], was detected only in 6/53 analysed patients, five European patients and one Asian (S1A and S1B Fig, Table 2). No African patients with HBS1L-MYB rs9399137 [T>C] were found in our cohort although this represented the largest group. Allele frequencies obtained by GnomAD database reported C alleles in American and African populations, although at lower frequencies (S1B and S1C Fig).
In conclusion, for three out of five SNPs our results were in agreement with GnomAD database. The only discrepancy regards HBG2 rs7482144 [C>T] which was not detected in European patients and HBS1L-MYB rs9399137 [T>C] which was not detected in Africans although these patients represented 30.2% and 52.8% of the cohort, respectively. These discrepancies may be due to a selection bias as European patients represented only Italian and Albania geographical areas, while 24/28 African patients came from seven Sub-Saharan countries and only 4/28 came from two North Africa countries. (Fig 1)

Analysis of mtDNA sequence variation and integration with hemoglobin marker variants
Mitochondrial control region sequences ranged from nucleotide position (np) 16024 to np 210, thus encompassing the HVSI and most of the HVSII ( Table 2). The overall sequence alignment of samples revealed 99 polymorphic sites (S), represented by ten transversions and two indel (one deletion and one insertion). Nucleotide diversity (π) across all samples was estimated at 0.0173. The haplotype diversity was very high (Hd = 0.993) as well as the average number of nucleotide differences (k = 13,052).
When aligned to the revised Cambridge Reference Sequence, a total of 44 different haplotypes were obtained and classified through their mutational motifs into 13 different macrohaplogroups: B, D, G, JT, L0, L1, L2, L3 � , M � , N9, R0, UK and X (Fig 2A). Seven haplotypes were shared between two or three samples, but five couples of those individuals declared to be maternally related. For the remaining two haplotypes, one (HT02) was found in two brothers from Algeria and one sample from Morocco the other (HT04) was detected in one patient from Ivory Coast and one from Benin.
In order to better define the phylogenetic relationships between samples, we have constructed the network illustrated in Fig 3. Haplogroup labels are reported according to the Phylotree nomenclature [42]. The network topology showed two major mtDNA clusters, corresponding to super-haplogroups R and not-R. Within R, JT and H haplogroups prevailed. The latter formed a star-like cluster mainly represented by sub-haplogroups H1, H2 and H7. R0a, HV, B and UK were also represented. The not-R cluster included mostly L1, L2 and L3 � sub-lineages, followed by L0 and D; also N9, M, X and G were revealed.
It is known that macro-haplogroup L represents the most ancestral mitochondrial lineage of all living modern humans and, if excluding the derived L3 branches M and N, it represents the majority of the typical sub-Saharan mtDNA variability. Within L, L2 is the most common haplogroup in Africa and it has been previously observed throughout the continent [43]. Our findings agree with the haplogroup frequency and origin of our samples specifically from Nigeria and Ivory Coast (but also Congo and Benin), except for Syria, from which two samples (brothers) derived. L3 � was found in two samples from Nigeria, one from Guinea, one from Ghana and one from Italy. Even if L haplotypes account for <1% of mtDNAs in Europe, with L1b being the most common haplogroup, frequencies widely change, ranging from 3% in Southern Europe to 0.7% in Central Europe and 0.5% in Northern Europe.
Within R0, only one sample R0a and one HV1 were found, while the remaining patients were H, thus corresponding to a total of 11/53 (21%). Haplogroup H is the most frequent mitochondrial lineage in Europe (~40%) with a declining pattern from Western Europe towards the Near East and Caucasus (~10-20%). It was also found in Africa, especially the North region, with a peak of H1 in Maghreb and this is in accordance with the origin of our H samples (Italy, Albania, Morocco and Algeria).
As for the macro-haplogroup M, samples originated from Venezuela and Cambodia and belonged to the D4 sub-branches (D1f1 and D4b2d), which exactly reflect the geographic maximum distribution of these lineages (South America and East Asia, respectively) [44]. Even patient 53 from Morocco, belonging to M1a2a mitochondrial haplogroup, was congruent with the frequency distribution of the lineage, unlike African patients 28 and 29 (brothers) and 40, which belonged to G1a and M30c respectively, both sub-lineages typical of Asia.
The East Asian B4 haplogroup revealed in patient 27 from Dominican Republic could result odd, but we should keep in mind that a subclade of B4b is one of the five haplogroups found among indigenous people of the Americas, thus explaining a lower distribution also in Central America.
Patient 13 from China belonging to N9a10a perfectly reflects the East Asian origin and the distribution peak of the lineage [45].
After evaluating relationships between mitochondrial phylogeny and the geographic origin of individuals, we moved to a different level of resolution, including also the nuclear SNP data in a new network (S2 Fig, Table 2). We considered the mitochondrial control-region haplotypes, whose phylogenetic relationships are illustrated in the Fig 3, together with zygosity of the five variable genes, trying to detail the patients' genotype. Patients 5, 14, 29 and 45 were the only homozygous wildtype for all variables: two from Italy (both belonging to the R0 haplogroup) and two from Nigeria (carrying L0 and G mitogenomes). A total of 41/53 patients showed the prevalence of wildtype homozygosity, while only six individuals (patients 6, 10, 21, 25, 34 and 44) were heterozygous for three polymorphisms. Fourteen patients were homozygous mutants for one SNP, while only patient 15 was homozygous mutant for three variables. It is to be highlighted that all individuals within L mitochondrial macro-haplogroup were homozygous wildtype for HBS1L-MYB rs9399137 [T>C], as patients harbouring the JT mitochondrial lineage were homozygous wildtype for HBG2 rs7482144 [C>T]. Conversely, C/ EBPE rs45496295 [C>T] was always found as homozygous wildtype, thus it was excluded from the representation.
A further analysis was performed, trying to obtain a graphical overview of the possible integration between mtDNA control region variability and nuclear mutations associated to hemoglobinopathies subgroups ( Table 2, S2 Text and S5 Table). In S3 Fig, different phenotypical subgroups were highlighted in the mitochondrial phylogenetic tree: β-thalassemia (β-thal) patients, sickle cell/β-thalassemia (HbS/β + or β 0 ), sickle cell disease (HbS/HbS) and compound heterozygotes (HbS/HbC and HbO-Arab/HbC). The overall network revealed a predominant distribution of the hemoglobinopathies subgroups within clades (S3 Fig). In particular, except for three haplotypes belonging to samples from Albania (patient 26) and South America (patiens 27 and 41) that lied in the R clade, sickle cell disease seemed to be common among not-R haplogroups, especially African lineages (L0, L1, L2 and L3 � ). It is to be noticed that no sickle cell disease patients belonged to macro-haplogroup R0 (inclusive of R0a, H and HV haplogroups) (Fig 2C). A similar distribution was observed for compound heterozygotes patients, all Africans; they lied in not-R clade (M1, L2 and L3 � haplogroups), except for patient 52, from Morocco, that belonged to H1 lineage and it was the only patient characterized by HbC/ HbO-Arab genotype. An opposite distribution could be observed for thalassemic patients, which spread among R haplogroups, except for four samples from South America (patient 7), Syria (patients 11 and 12, brothers) and Cambodia (patient 15). In particular, the R0 clade seemed to be overall characterized by β-thalassemic individuals. On the other hand, sickle cell/ β-thalassemia patients (HbS/β + or β 0 ) seemed to be widespread in the network.
When we separated data in four different networks, one for each hemoglobinopathies subgroup coloured with the geographic origin of the patient, it was clear that Asian samples were all thalassemic, but clustered far from European thalassemic patients (S3E Fig). No African patient was affected by thalassemia (with the exception of a couple of brothers from Algeria). If excluding two samples with sickle cell/β-thalassemia patients (HbS/β + or β 0 ) ( To resume, we observed that European patients fall prevalently into the thalassemic (β + /β + , β 0 /β 0 and β + /β 0 ) or sickle cell/β-thalassemia (HbS/β + or β 0 ) subgroups, while Africans into the remaining two (i.e. HbS/HbS and HbS/HbC) (S4 Fig). The clear structuring and sub-division between African and European clades could be detected also in the principal component analysis (PCA) performed by considering all variable sites (both nuclear and mtDNA SNPs) for the 53 patients (Fig 4). Moreover, American samples lied in two quarters, close to Europeans. The graph showed also three European samples close to the African quarters (left quarters): Albanian patients 17 (X haplogroup) and 26 (JT), and patient 35 (L3 � ) from Italy. On the other hand, five African samples stand close to the vertical line that separates left and right quarters: patient 24 (L3 � ) from Guinea; patients 29 (G), 43 (L3 � ), and 45 (L0) from Nigeria; and patient 44 (L0) from Congo. Interestingly, all these patients together with three Europeans (patients 17,26,35) carried at least HbS allele, which is very frequent in the Sub Saharan Africa. The prevalently European quarter (the first quarter on the right) shows four African patients (patients 3 and 4 siblings, and patients 48 and 52). The two most distant patients from African samples are from North Africa (Morocco and Algeria) confirming the closeness to European ones. Intriguingly, only two thalassemic samples lie on the left half of the PCA: they are patients 11 and 12, the only samples from West Asia, while the other two Asian samples were from China and Cambodia, and fall into the right half of the graph.
Even if the past and recent migrations could partially alter the picture, a high level of population and geographic structuring still persist.

Conclusions
This study represents the first comprehensive analysis of nuclear and mitochondrial DNA marker variations in patients with hemoglobinopathies and showed a clear separation between African and European patients allowing for depicting the phylogeographic structure of affected individuals. Further investigations on a larger cohort of patients, as well as on additional mitochondrial and nuclear variants using standard and Next Generation Sequencing methodologies, will allow to establish possible correlations between genetic data and clinical behaviour. Taking all together, this pilot study based on hemoglobin and uniparental marker variants could represent an integrated and powerful tool for enliven the multifaced pictures of hemoglobinophaties.
Peripheral blood samples were collected during routine medical examinations and genomic DNA was isolated through salting-out procedure [47].
At diagnosis and/or during follow-up all patients or their parents/guardians provided informed consent for sample collection, storage and for research collaborative studies. Protocols were approved by the Ethics Committee for Clinical Experimentation of the University of Perugia (protocol no. 2017-01) and analyses were performed in agreement with the Helsinki declaration.  [48].

Genomic DNA analysis
The SNP frequency in the different populations were obtained by GnomAD database (Genome Aggregation Database, https://gnomad.broadinstitute.org/).
Sequences were assembled and aligned to rCRS using Sequencher™ 5.10 (Gene Codes Corporation). Whenever electropherograms showed ambiguities, new PCR amplifications and sequencing reactions were performed.
Final assembling encompassed at least the entire Hyper Variable Sequence HVSI and part of HVSII (from np 16024 to np 210). All mitochondrial control-region sequences were deposited in GenBank with accession numbers MT176183-MT176235.
For each sample, haplotypes were registered and classified in haplogroups according to the respective mutational motifs by referring to HaploGrep2 software based on PhyloTree database build 17, www.phylotree.org [42].
Genetic variation indices were estimated by using DnaSP 5.1 software [50]. The evolutionary relationships among haplotypes were visualized through the construction of a median-joining network using Network 10 software (www.fluxus-engineering.com).
By default, all polymorphisms relative to rCRS were weighted by a value of ten, except for the highly variable site 16519 and the variations in the poly-C stretch around np 16189 that were down-weighted to one. Conversely, triple weights were assigned to the mutational motifs 16223 and 73 in order to strengthen the topology of the network.
Principal component analyses (PCA) were performed using Excel software implemented by XLSTAT, as described elsewhere [51] in order to explain the variance of multivariate data (mitochondrial and nuclear DNA variation, considered separately in each sample).