Epidemiology of Mycobacterium tuberculosis lineages and strain clustering within urban and peri-urban settings in Ethiopia

Background Previous work has shown differential predominance of certain Mycobacterium tuberculosis (M. tb) lineages and sub-lineages among different human populations in diverse geographic regions of Ethiopia. Nevertheless, how strain diversity is evolving under the ongoing rapid socio-economic and environmental changes is poorly understood. The present study investigated factors associated with M. tb lineage predominance and rate of strain clustering within urban and peri-urban settings in Ethiopia. Methods Pulmonary Tuberculosis (PTB) and Cervical tuberculous lymphadenitis (TBLN) patients who visited selected health facilities were recruited in the years of 2016 and 2017. A total of 258 M. tb isolates identified from 163 sputa and 95 fine-needle aspirates (FNA) were characterized by spoligotyping and compared with international M.tb spoligotyping patterns registered at the SITVIT2 databases. The molecular data were linked with clinical and demographic data of the patients for further statistical analysis. Results From a total of 258 M. tb isolates, 84 distinct spoligotype patterns that included 58 known Shared International Type (SIT) patterns and 26 new or orphan patterns were identified. The majority of strains belonged to two major M. tb lineages, L3 (35.7%) and L4 (61.6%). The observed high percentage of isolates with shared patterns (n = 200/258) suggested a substantial rate of overall clustering (77.5%). After adjusting for the effect of geographical variations, clustering rate was significantly lower among individuals co-infected with HIV and other concomitant chronic disease. Compared to L4, the adjusted odds ratio and 95% confidence interval (AOR; 95% CI) indicated that infections with L3 M. tb strains were more likely to be associated with TBLN [3.47 (1.45, 8.29)] and TB-HIV co-infection [2.84 (1.61, 5.55)]. Conclusion Despite the observed difference in strain diversity and geographical distribution of M. tb lineages, compared to earlier studies in Ethiopia, the overall rate of strain clustering suggests higher transmission and warrant more detailed investigations into the molecular epidemiology of TB and related factors.

Introduction Tuberculosis (TB) is a chronic infectious disease caused by species of the Mycobacterium tuberculosis complex (MTBC). Except for Mycobacterium tuberculosis (M. tb), which is the primary cause of human TB, other members of the MTBC are believed to have adapted to different animal hosts and therefore they may have reduced fitness to cause human infection [1,2]. Beside environmental and socio-economic factors, the biology and epidemiology of human TB has likely been shaped by the historical interaction between MTBC members and its host [2,3]. The genetic variation between MTBC species contributes to the ambiguities concerning disease presentation, frequency of transmission and clinical progress [2,4]. This is particularly true for M. tb, where the interaction of genotypic variation among different strains with human genetic polymorphism play a prominent role in the epidemiology of TB diseases [4][5][6][7]. The overall epidemiology of MTBC species is influenced by the environment, with its frequency and distribution being dependent on social, economic, and ecological causes [4,8].
Although, there are no well-established classical factors that are known to be strongly associated with disease phenotype, immunological studies have suggested that some M. tb strains and lineages are more virulent and/or more infectious than others [9]. It has been stated that some strains that belong to the modern MTBC Lineages are more capable of inducing higher inflammatory response than lineages of the same clade (Haarlem, high; Beijing, low) [10]. However, difference in pathogenicity and lineage specific rate of transmission are important only when considered together with the host genotype and geographical location [11].
Although, it is still challenging to investigate the influence of bacterial and host genotype on the development of different forms of TB in humans, disease phenotype seems to be associated with a bacterial genotype [2,6]. According to other published reports, L4 seemed more likely to be associated with Pulmonary TB (PTB) while L2 and L3 were linked with extra-pulmonary TB (EPTB) disease, such as TB meningitis and TB in cervical Lymph Nodes (TBLN) [12][13][14][15].
Another comparative study showed that strains of the East African Indian (L3) and Euro-American (L4) lineages were negatively associated with extra thoracic disease as compared to strains of the East Asian lineage (L2) [16]. These studies thereby suggest that species diversity and their interaction with host biology affects the pathophysiology and natural course of TB disease [2,17]. For example, a study conducted in Tanzania has shown that chronic signs of TB disease, such as weight loss, have been more associated with L4 strains than with strains of the Indo-Oceanic (L1) lineage [18]. In addition to factors associated with human genetics such as ethnicity, biological and clinical determinants of an individual, such as HIV and body mass index, have shown significant difference on disease phenotype and rate of transmission across major M. tb Lineages [16, [19][20][21].
Different alternative molecular identification methods have been used to estimate rates of disease transmission, which is generally inferred by comparing genotypic clustering between patient isolates from a given epidemiological setting [10,22]. In other words, successful transmission of particular genotypes has been reflected through an increase in the frequency and consistency of strain domination over time in defined populations [16,23]. However, despite recently developed advanced molecular diagnostic tools, both the nature of genotype variations and the characteristics of the host immune response to certain types of M. tb strains are largely unknown in many TB high burden settings [24,25]. Particularly in countries like Ethiopia, where there is high prevalence and high transmission rate and a diversified population of bacterial species [26][27][28][29], molecular identification of the agents can be an important component of the knowledge base required to improve on previous achievements of the national TB control program. Taking all this into account, the present study investigated factors associated with M. tb lineage predominance and rate of strain clustering within the context of urban and peri-urban settings in Ethiopia.

Study design and setting
A multi-centre health facility based cross-sectional study was conducted in Ethiopia during 2016 and 2017. As part of the Ethiopia Control of Bovine Tuberculosis Strategies (ETHICO-BOTS) project, four hospitals, two private clinics, and fourteen health centers located in urban and peri-urban areas, were purposively selected from four different regions of Ethiopia. Addis Ababa was the largest study site and constituted of Addis Ababa city and the surrounding special zone of Oromiya region while the remaining three study sites were located in the regional urban cities of Mekele in Tigray, Gondar in Amhara, and Hawassa in Southern Nations Nationalities, and Peoples' region.

Study population
Recruitment of participants at selected health facilities was carried out according to the national guideline standard case definition criteria. All presumed TB cases were initially considered as potential source of the study population. Then those patients clinically diagnosed with PTB or TBLN were asked for informed consent and enrolled consecutively. Recruitment of PTB cases was done at all selected governmental health facilities. TBLN patients were enrolled from all four study sites; however they were only recruited from the Pathology Units of three governmental hospitals and two private clinics because of lack of diagnostic facilities and skilled professionals for fine-needle aspirate (FNA) cytology examination at governmental health centers. Included cases from both groups were those eligible for first-line Anti-TB treatment. Known MDR (multi drug resistant) TB cases and EPTB patients other than those with TBLN were excluded in this study.

Data collection
Clinical and demographic information was collected from recruited TB cases using a pretested structured questionnaire. Following the routine care service, consented PTB and TBLN participants were requested to provide spot sputum and FNA samples, respectively. Care providers (nurses) working at directly observed therapy (DOT) centres collected sputum specimens using sterile containers. FNA specimens were collected from the selected hospitals and private clinics by experienced pathologists who performed FNA cytology examination as part of their routine diagnostic service. According to the standard procedure, FNA collection was performed using a 21-gauge needle attached to a 10 ml syringe and specimens were collected into cryo-tubes with sterile phosphate buffer saline (PBS). Samples were kept at -20˚C at remote study sites until transported on ice boxes to the Armauer Hansen Research Institute (AHRI) TB laboratory where the clinical samples were stored at -80˚C until processed for mycobacterial culture. Clinical sample handling and laboratory procedures were performed according to a previously published protocol [27].

Mycobacterial culturing
Samples collected in the study were processed and cultured for mycobacteria using standard procedures established at the AHRI TB laboratory [27,30]. Specimen samples were inoculated on Löwenstein-Jensen (LJ) medium slants supplemented with either glycerol or pyruvate and incubated at 37˚C. The slopes were examined weekly for up to eight weeks for any visible growth. Bacterial colonies identified as Acid-Fast Bacilli by ZN staining [27] were saved as frozen stocks in 20% glycerol as well as heat-inactivated in 500μl distilled H 2 O at 80˚C for 60 min; the latter samples were used for subsequent molecular identification.

Genotype analysis and comparison with global databases
Spoligotype patterns were converted into binary and octal formats and compared with previously reported strains in the international SITVIT2 database [8] hosted by Institute Pasteur de la Guadeloupe. Here, spoligotypes shared by more than one strain were designated as shared types and were assigned a shared international type (SIT) number according to the SITVIT2 database, while patterns that were not recognized in the latest online version of the database were labelled as "New" if the pattern was identified for more than one strain and "Orphan" if the pattern was unique to only one strain. Further lineage classification for corresponding nomenclature was done using the 'Run TB-Lineage' online tool from linked databases (http:// www.miru-vntrplus.org/MIRU/index.faces and http://tbinsight.cs.rpi.edu/run_tb_lineage. html). Here, major lineages were predicted using a conformal Bayesian network (CBN) analysis while knowledge based Bayesian network (KBBN) analysis was used to predict the corresponding sub-lineages.

Data management and statistical analysis
All genotype outputs from the computer assisted analyses were imported to SPSS and merged with clinical and demographic data. The final clean dataset was exported to STATA and Rsoftware to perform further statistical analysis. Two of the main outcome variables, clustering rate and M. tb lineages, were categorized as binomial scale of measurement. In the first category, "clustered" referred to two or more isolates sharing identical spoligotyping patterns while isolates that did not have shared patterns was defined as "unique". Here, three different logistic regression analysis methods were performed to identify and compare factors associated with strain clustering. The first Bivariable analysis was performed to estimate a crude (unadjusted) odd ratio for each independent categorical variable while the second multivariable logistic regression analysis was used to estimate adjusted odd ratio (AOR with 95% CI) that better reflect the likelihood of included variable associated with rate of strain clustering. The third model (hierarchical logistic regression) was preferred to adjust for the effect of regional variations, the first level factor that often attributed with strain clustering, where host-related clinical factors and spoligotype-based M. tb lineage classification were considered as second level factors. Variables included in the second model were reconsidered and used to compare the corresponding adjusted estimates (AOR with 95% CI) generated from the third (Multilevel) model which was done using STATA software with the recommended (melogit) command. The multivariable logistic regression was used to determine the clinical characteristics or disease phenotypes associated with dominant M. tb lineage. In both cases, R-package Software commands were used to perform bivariable and multivariable logistic regression. Before running the multivariable logistic regression analysis, stepwise backward elimination technique was applied to select independent variables. Initially, all clinically relevant factor variables were included in the full model. Then using the specific statistical command (Step) under R-studio, the software program automatically generated all possible alternative models having lists of dependent and independent variables. Finally, according to the Likelihood Ratio-test and to minimize the effect of confounding variables, a relatively better fitted model with potential explanatory variables that has the lowest akaki information criteria (AIC) was selected. Independent relationship of variables was decided based on different cut-off point for statistical significance level (α: < 0.05; < 0.01 and < 0.001) and interpretation of key findings was reported using the adjusted estimates (AOR with 95% CI).

Ethical considerations
This study was part of the ETHICOBOTS project, which obtained ethical clearance from the Federal Ministry of Science and Technology (Ref. No: 301/001/2015), the AHRI/ALERT Ethics Review Committee (Project Reg. No: PO46/14) and from University of Gondar Institutional Review Board (Review number: O/V/P/RCS/04/45/2016). Support letters were obtained from Regional State Health Bureaus and health facilities. Enrollment of study participants was done after written informed consent was secured and signed agreements were received from all participating health facilities. Detailed information about the risks and benefits of the study as well as confidentiality of the research data was a prerequisite for study participation.

Genetic diversity of Mycobacterium tuberculosis lineages
All 258 isolates provided in the S1 Table were genotyped by LSP as M. tb while being intact for RD9. When the isolates were spoligotyped 84 different patterns were identified, of which 58 SIT patterns were already recognized in the SITVIT2 database (accounting for 231/258 (89.5%) of the isolates). Among these patterns, 32 M. tb isolates were singletons while 25 designated shared patterns, each with 2 to 40 isolates, accounted for 85.7% (198/231) of all isolates with identified SIT patterns. The remaining twenty five unique orphan patterns and two isolates with a new shared spoligotype pattern (Table 2), representing 27 (10.5%) of the total isolates, were not yet recognized by the SITVIT2 database. As presented in Table 3, over half of the isolates 145/258 (56.2%) were represented by five of the dominant SIT patterns, including SIT25 (n = 40), SIT149 (n = 36), SIT53 (n = 32), SIT26 (n = 17), and SIT37 (n = 11).
The alternative KBBN classification showed a predominance of the CAS (34.9%) sub-lineage among strains defined as L3. T (15.9%), T3-ETH (15.1%) and Haarlem (10.9%) were the most common sub-lineages of L4. There was a significant difference in geographical distribution between strain types; all LAM families of L4 (LAM, LAM3 and LAM5) were observed in the northern part of the country (Gondar and Mekele). Similarly, the CAS families (L3), which were highly dominant in the Gondar area, were rather rare around Hawassa. The Manu, Haarlem and T families (all of L4) accounted for the majority of strains identified in the Hawassa region (Fig 2).

Factors associated with strain clustering and predominance
The overall clustering rate aggregated from 26 (25 SIT and one new) shared patterns was 77.5% (200/258). Our multivariable analysis (  isolates of a unique pattern could be considered to have resulted from reactivation of latent infection or were else presumably acquired outside of the study population [33]. Considering that hierarchical logistic regression analysis was performed to minimize the observed heterogeneity due to geographical location. After controlling for the effect of regional variations adjusted estimates generated from the final model showed that the rate of strain clustering was inversely associated with TB-HIV co-infection and comorbidity with other chronic illnesses. As shown in Table 4 A second multivariable analysis was performed in relation to the clinical characteristics of the two most predominant lineages (L3 and L4). As shown in Table 5, in comparison to L4 strains of M. tuberculosis, the odds for TBLN cases infected with L3 was three and half fold Table 3

Discussion
Despite the observed difference in strain diversity and distribution of M. tb lineages across regions, high percentage of shared patterns suggested a substantial overall strain clustering rate around urban and peri-urban settings in Ethiopia. Altogether, a predominance of known SIT patterns resulted in an overall strain clustering rate of 77.5% in the current study, with a range of 69-88% across the study regions ( Understandably, at national level, some population groups have likely contributed more to such TB incidence rate than other groups. Particularly, the risk of TB transmission around urban areas is known to be higher than among sparsely populated societies and rural communities [24,29]. Because of the simultaneously

PLOS ONE
ongoing expansion of urbanization and emerging socio-economic conditions around urban areas in Ethiopia (increasing population size and density e.g. through expanding slums, congregation into condominiums, growing manufacturing and service sector), the pattern of TB transmission among those living and working in the urban and peri-urban areas is postulated to differ in strain diversity and clustering, compared to that of the general population [29], the majority (85%) of which are rural communities. Despite previous achievements in reducing national TB morbidity and mortality [35], summarized reports of data from the global burden of TB diseases in the last two decades have shown a declined rate in reducing the prevalence and mortality ratio in Ethiopia. Essentially, there has been a higher rate of new TB cases (incidence) in the last few years than what was expected from the previous trend [35 , 36]. Accordingly, a diverse range of strains of M. tb lineages, many previously not registered in spoligotyping databases, continue to circulate and maintain a high rate of transmission of TB in Ethiopia. Similarly, as would be expected, the observed diversified type of M. tb strain and

PLOS ONE
lineage distribution in the current study closely matched with studies analyzed in the two most recent TB reviews that showed specific lineage predominance across different geographical locations in Ethiopia [29,34]. This means, the same two major lineages, L4 and L3 (Fig 1), were predominant [29, 30, 34], as were the five most common SIT patterns (Table 2) [14, 29, 37, 38]. As shown in Figs 1 and 2, the observed significant difference in proportions of strain types across the four study sites, has also been noted from previous studies in Ethiopia [29,34]. Those less prevalent M. tb lineages, which included the Ethiopian (L7), the Beijing (L2), and the IO (L1) lineages, were identified from samples collected at sites located in the northern regions (Gondar and Mekele). Strains of L7, which was first reported by Firdessa et al [14, 28,  37, 39] and that seem highly confined to Ethiopia, remain more prevalent in the north of the country. The two SIT patterns (SIT1729 and SIT910) that we identified in this region are the same as for those strains that were previously classified as L7 [8,14].
Taking into account the observed geographical difference, the current study investigated the contribution of bacterial genotype and host related factors associated with rate of strain clustering. While comparing clustered genotyping patterns of the two most predominant M. tb lineages, a relatively higher percentage of shared L3 patterns were identified as compared to clustered patterns that belonged to L4. Despite limited discriminatory power of the spoligotyping method, an increased rate of M. tb transmission is generally inferred by comparing clustered genotyping patterns of clinical isolates from a given epidemiological setting [10]. In contrast, cases with isolates of a unique pattern could be considered to have resulted from Table 4. Conventional and hierarchical (multi-level) logistic regression modeling methods were used to identify factors associated with strain clustering based on spoligotyping.  observe any difference in clustering rate with respect to site of infection. This might be because of limited power of the study that could not control for all possible effects of confounding factors. Although, the differences in strain virulence and immunogenicity have been investigated in experimental studies, whether this phenotypic variation plays a role in human disease remains unclear [3,6]. Therefore, it is believed that investigating the clinical epidemiology of dominant M. tb lineages among host populations would allow understanding of possible host-pathogen interaction. In this regard, one of the findings that emerged from this study is that clinical factors, which are often associated with host immunity, appeared to differ significantly between L3 and L4, the two most dominant lineages. According to the multivariate analysis (Table 5), the likelihood of detecting L3 among TBLN cases and HIV co-infected patients was significantly higher than for L4. However, a summary report generated from the updated version of the international Mycobacterium tuberculosis spoligotyping global database has shown a higher rate of CAS (L3) infection among HIV co-infected cases than other widely prevalent sub-lineages [8]. The observed discrepancy might be due to the interaction effect of sub-lineages or the possibility of co-infection within the same host. Our analysis was performed based on major M. tb lineage classification. Although it is often associated with host immunity, Osório et al.

Factor variables Proportion of cases n (%) Three logistic regression analyses
(2018) stated that due to selective advantage of extrinsic factors, within-host bacterial diversity seems to contribute to difference in disease progression [4]. For example, certain groups of L4 strains are found to be more virulent in terms of disease severity and to display higher rates of human-to-human transmission, but only at some specific geographical locations [2]. In favour of that, and as compared to L4, the current study identified significantly lower rate of L3 strains among TB cases diagnosed with other concomitant chronic illnesses (Table 5). Certainly, any immune-compromised condition and HIV interferes with bacterial virulence might lead to endogenous reactivation [20, 25,41], suggesting that less virulent MTBC species could progress to active TB disease in immune-compromised patients. For example, TB patients infected with M. africanum were more likely to be older, HIV infected, and severely malnourished than those infected with M. tb [42]. Although the mechanisms are not yet clear, the influence of bacterial and host genotype on the development of different forms of TB in humans is well documented. In this regard, the findings observed in this study seem to agree with others that suggested a possible relationship between L3 and EPTB disease [12,38]. Correspondingly, a significantly higher rate of PTB was often associated with L4, while more EPTB disease, such as TB meningitis and TBLN, was attributed to L3 [13, 15, 38].
Generally, because of a complex network related with many other proximal and distal determinants, M. tb strain clustering or lineage specific effects on disease presentations may not always be fully explained by some particular risk factors and it is difficult to quantify the biological effect using numerical estimates [43]. As a result of that, most of the previously reported epidemiological studies in humans have come up with inconsistent findings [2]. It is known that heterogeneity is a defining feature of TB, which is certainly common in molecular studies [43]. However, although the need for additional clinical evidence is obvious, disease phenotypes can possibly be determined by genotype features of specific strains, suggesting that different M. tb lineages could be more frequently present in specific clinical phenotypes and disease presentations than in others [2].

Limitation
Spoligotyping has its limitations and may not truly detect ongoing changes (genetic differences) in a population and thereby not be the best tool for investigation of transmission networks [22]. Alternative molecular diagnostic tools, such as MIRU-VNTR and especially whole genome sequencing, have shown to have better discriminatory power for investigating strain clustering and to confirm the ongoing rate of active TB disease transmission [14,22]. Similarly, the fairly small sample size, uneven representation of strains from the study sites, and further categorization into different levels of factor variables, have reduced the power of our statistical analysis. Hence, the numerical estimates may not truly imitate the biological interaction or effect modification on host-related factors and specific M. tb lineages. Not only systematic and measurement errors, but the current study also recognized selection and recall bias where selected isolates were subjected for spoligotyping based molecular analysis. However, we have tried to minimize some of the anticipated measurement errors and known confounding effects. For instance, alongside with internal quality control procedures for the identification of lineages, SITVIT patterns were compared with alternative lineage classifications generated from linked databases (KBBN and CBN) and further verified using SNP based predictions. In addition, the multivariate analysis has considered and used to adjust the expected effect of regional variation on TB lineage predominance and related strain clustering.

Conclusion and recommendation
Despite differences in geographical variations, the overall clustering suggested higher transmission of TB disease among human populations living around urban settings in Ethiopia. This Spoligotyping-based investigation showed that the rate of strain clustering was relatively higher among patients infected with L3 strains of M. tb as compared to L4. Regarding hostrelated factors, strain clustering rate was inversely associated with patients diagnosed with TB-HIV co-infection and comorbidity with other chronic illnesses. On the other hand, as compared to M. tb L4, active TB disease due to L3 strains was three times higher among TBLN patients and it was more likely to be associated with TB-HIV co-infection, while inversely associated with other concomitant chronic disease.
Altogether, the current findings add up to previous indications and contribute to evidence base on the continuous flux in the spectrum of TB infection and disease progression. Although it is difficult to be conclusive on a fixed categorical relationship between strain sub-lineages and disease type, as there is some other supportive evidence, disease phenotypes can possibly be determined by genotypic features of specific strains. Considering the complex pathogenesis of human TB disease and the interaction effect of other predisposing environmental factors, it seems that active infection due to specific M. tb lineages might be associated with specific clinical phenotypes and disease presentation.
Generally, considering the ongoing shift and heterogeneity of TB disease, clinical and public health interventions should be alongside with molecular evidence for targeting high-risk groups based on location, social determinants, disease comorbidities and related bacterial strain predominance. However, as the dynamics of socioeconomic transformations exert pressure on how people live and interact, large scale studies using advanced molecular techniques, like whole genome sequencing, should further reveal the degree to which the genetic variation influences disease epidemiology and phenotype in different population groups over time.
Supporting information S1