Can You Judge a Disease Host by the Company It Keeps? Predicting Disease Hosts and Their Relative Importance: A Case Study for Leishmaniasis

Zoonoses are an important class of infectious diseases. An important element determining the impact of a zoonosis on domestic animal and human health is host range. Although for particular zoonoses some host species have been identified, until recently there have been no methods to predict those species most likely to be hosts or their relative importance. Complex inference networks infer potential biotic interactions between species using their degree of geographic co-occurrence, and have been posited as a potential tool for predicting disease hosts. Here we present the results of an interdisciplinary, empirical study to validate a model based on such networks for predicting hosts of Leishmania (L.) mexicana in Mexico. Using systematic sampling to validate the model predictions we identified 22 new species of host (34% of all species collected) with the probability to be a host strongly dependent on the probability of co-occurrence of vector and host. The results confirm that Leishmania (L.) mexicana is a generalist parasite but with a much wider host range than was previously thought. These results substantially change the geographic risk profile for Leishmaniasis and provide insights for the design of more efficient surveillance measures and a better understanding of potential dispersal scenarios.


Introduction
Zoonoses are an important class of neglected [1,2] or emerging infectious diseases [3][4][5][6], accounting for more than 60% of human infectious diseases. Wildlife species that are hosts for pathogens play a fundamental role in zoonoses, threatening domestic animal and human health and global biodiversity. Although for particular zoonoses some hosts have been identified [7][8][9][10][11][12][13][14], there have been few systematic empirical studies carried out to identify the host range and the relative importance of the different hosts within that range for a given zoonosis.
Additionally, most work on disease ecology over the last 20 years has focused on singlehost, single-agent systems. Recently however, there has been increasing interest in the more complex case of multi-host systems [15][16][17][18][19][20], with the realization that many zoonoses have potentially ample host ranges.
The relative importance of a species as a disease host will be highly multi-factorial, with risk factors covering many different scales, from the micro to the macro. However, there are two particularly important elements that come into play: host competence (the ability to transmit the parasite to a new host or vector) and the frequency of contact between host and pathogen or, in the case or vector borne diseases, host and vector [21].
Although it is intuitively clear that the "relative importance" of a host will depend on both its competence and its frequency of contact with the vector, it is a somewhat ambiguous concept in that it depends, for instance, on whether we are talking about human transmission or of maintaining an enzootic transmission cycle. We will take here relative importance to be associated with the probability of infection of an individual of a potential host species. Abstractly, this is a highly multi-factorial function P(C | X 1 , X 2 ,. . ., X N ), where, for instance, X 1 could represent host competence and X 2 frequency of host-vector contact. Although a host may be highly competent, if it only has infrequent contact with any disease vector, then the frequency of infected individuals will be low. Conversely, a host and a vector may have frequent contact, but the host may have low competence. All else being equal, the most important hosts will be those that have frequent contact with the disease vectors and are competent. Unfortunately, gathering information about these two aspects, especially for emerging or neglected diseases, is difficult and resource intensive [7,8,22,23]. Furthermore, how these fundamental aspects interact in multi-host systems is quite distinct from their single-host analogues. For instance, the fact that host-pathogen competencies may differ greatly among the hosts can potentially lead to a dilution effect [24][25][26].
Another important differentiating factor is that, multiple-host systems provide for much richer and complex scenarios for the dispersion of a disease from one geographic region to another [27]. As the characteristics of the host range play a crucial role in the emergence risk of a novel human pathogen and of the optimal interventions for combating the zoonosis [16] the importance of predicting and identifying potential disease hosts has been widely recognized [28][29][30]. To do so by exhaustive, systematic search through all possible hosts would be prohibitively resource intensive. At the same time, good data often only exist for a few (presumed) focal species. As it is unknown what part of the host range has already been discovered, the undiscovered part constitutes a type of 'epidemiological dark matter' [31].
An early attempt at systematising the search for potential hosts [32], in the case of Ebola, considered a heuristic approach based on expert knowledge, which was used to then filter the list of potential candidates. As such, it is both subjective and subject to model bias. More recently, other methods have appeared: In [33] a small group of four suspected hosts was used as a starting point for including biotic effects indirectly by calculating the fundamental niche of these four mammal species and considering the geographical correspondence with the niche distributions of the vectors. This paper was more concerned with including information about a particular set of potential hosts into corresponding risk maps rather than identifying new hosts per se. In contrast, in [34], a classification model using a supervised learning technique was used to predict other potential rodent reservoirs based on the predictive value of a set of potentially distinguishing characteristics of already known ones. Note that this paper was concerned with the potential hosts of a large number of pathogens considered all together and therefore could not discriminate against potential hosts of one disease versus another. In this case only categorical and no spatial information was used. Moreover, as it is based on supervised learning it can be affected by bias in the data defining both the class and in the predictors. This is in evidence in that the most predictive factor found was the number of literature citations for a given species. In [35], the authors considered biotic factors as potential predictive variables for describing the geographic range of Ebola rather than trying to predict which mammals are the most likely hosts.
In contrast, in Stephens et al [36], a general framework was presented using Complex Inference Networks based on the degree of co-occurrence between different species, for inferring potential biotic interactions. The framework is also capable of including in other variable types, at distinct spatial resolutions, such as environmental layers normally associated with abiotic variables [37] allowing for a comparison of the relative importance of biotic versus abiotic factors.
The methodology differs from those of [32][33][34][35] by using as model inputs only purely spatial data, using point collection data to proxy spatial distributions of taxa and co-occurrences to infer potential biotic interactions. In particular, it uses no auxiliary information, such as expert knowledge, as in the case of [32]; fundamental niche distributions of taxa, as in the case of [33]; or specific categorical data associated with the relevant taxa, as in the case of [34].
Networks are an important tool in ecological studies [38][39][40][41]. However, their local structurein the sense of two nodes and a link as the base element-represents an already known relation, such as in a food web [38], or in a contact network representing ticks, vertebrates and pathogens, as in [42]. In this case the local structure of the network, i.e., the individual nodes and links, only represent what is known. However, the global properties of the network can lead to new insights from an eco-systemic or community viewpoint and also to specific predictions. In contrast, Stephens et al. [36] use the local structure of networks to infer and discover previously unknown relations, such as the relation between vector and host. Although nodes are taxa, the local structure of the network is different to a traditional ecological network in that the links represent the degree of overlap between the distributions of the corresponding taxa with the idea that statistically significant degrees of co-occurrence can be an indicator of potential biotic interaction between the associated taxa, such as between a host and a vector.
In determining the host range of a zoonosis, an exhaustive empirical analysis of all potential hosts is prohibitively difficult, hence the importance of theoretical models, such as that presented in [36], for guiding observation and experiment. Although consistent with known results, it is important to note that the theoretical predictions of [36] have not previously been tested experimentally. A theoretical model needs to be validated by experiment, as this is the only way to truly determine if the model works. In this vein, most ecological modelling of zoonoses remains untested, in that theoretical predictions are not validated using a suitable experimental validation framework. This paper presents the results of an interdisciplinary study carried out to experimentally test the predictions of [36] using, as a test, the case of Leishmaniasis in Mexico.
Leishmaniasis is a significant, yet neglected tropical disease, with 350 million people in 98 countries worldwide living at risk of developing one of the many forms of the disease [43]. It is caused by infection with one of several different species of protozoan parasites of the genus Leishmania, which maintain their life cycle through transmission between an insect (sandfliesgenus Lutzomyia) and a mammalian host. In Mexico, the most epidemiologically important species is Leishmania (L.) mexicana, though the presence of other species has been confirmed. Eleven species of Lutzomyia are considered to have potential medical importance. Of these, three are known vectors of either cutaneous or visceral Leishmaniasis, while four others have been found infected with L. (L.) mexicana [44].
Besides the clinical and social importance of Leishmaniasis [45] and the acknowledgment of its zoonotic nature [46], the identification of wildlife hosts for these parasites is sparse and non-systematic. Prior to the present study, only 8 mammalian species had been identified as hosts in Mexico [47][48][49]. This potential lack of knowledge of parasite hosts greatly increases the difficulty of formulating theoretical approaches to explaining and predicting disease spread or for planning better and sustainable control measures [50].

Rationale for the modelling methodology
The general modelling methodology of [36], is based on the idea that biotic interactions can be inferred from the locations of taxa as a function of space and time. Although biotic and ecological interactions in general are very complex, it is reasonable to state that the spatio-temporal distributions of taxa, or other ecological variables, reflect all of the factors and their causal interactions that determine them. In [36], the degree of co-occurrence between taxa was taken as an observable measure with which potential interactions could be inferred. Although cooccurrence is not equal to biological interaction, a significantly non-random co-occurrence distribution is a necessary condition for a biotic interaction between taxa, and as such it can be used to formulate hypotheses that can be checked experimentally. However, it is clearly not a sufficient condition. In the spirit of niche modelling, a biotic variable that co-occurs with a target taxon can be understood as being a niche component in the same sense as any abiotic variable, such as temperature. In fact, one would generally expect a closer causal relation between biotic variables than with abiotic variables. For example, the distribution of prey species for a predator, such as a carnivore, should clearly influence the latter's distribution more significantly than temperature or precipitation.
In the case of many zoonoses, the predominant interaction between vector and host is due to the former feeding on the latter. This obviously requires a coincidence in space and time. Species that offer blood meals can maintain the presence of vector populations independently of the capacity to harbour a given pathogen. In other words, the interaction between host and vector is a necessary but not sufficient condition for the transmission of the pathogen. The total number of encounters between vector and potential host depends on many factors, including the abundance of both species. However, a key factor is the geographical overlap between them, as the greater the overlap the greater the probability of an encounter. Thus, for two host species, identical in all respects except their relative geographical overlap with the vectors, the host species with the larger overlap will be epidemiologically more important. Thus, vector-mammal geographical overlap is a necessary but not sufficient condition for both a feeding interaction and a pathogen transmission interaction.
Of course, there may be geographical overlap between species due to other reasons than a direct biotic interaction. Even if species distributions were random there would be overlap. It is therefore necessary to measure overlap relative to a null hypothesis, such as that associated with random distributions. Additionally, it may occur that there is a non-random overlap due to the existence of one or more confounding factors; for example, an abiotic variable, such as temperature. This can only be quantified by controlling for the presence of such a factor. As it is obviously infeasible to control systematically for every potential factor, a logic must be presented for considering a particular candidate. In summary: although geographical overlap is not a sufficient condition for biological interaction, it is necessary, and as such can be used to construct models that can then be checked explicitly by experiment to see to what extent it is predictive.
The explicit example considered in [36] was the identification of potential hosts for Leishmaniasis by studying co-occurrences between the vector species and the potential host species. A Complex Inference Network summarising the co-occurrence distributions was deduced that showed the most important potential mammal hosts for each sandfly species. Although the full network contains a great deal of structure and information, in terms of experimental validation each network observable requires an experimental protocol to be able to measure it. In particular, to work at the species level for the vectors, and associate and confirm hosts for a given vector species, would require collecting sandflies and genotyping their blood meals, as well as collecting potential host species and confirming the presence of the pathogen. In the present experimental study, we restricted attention to only potential host species and tested them for the presence of the pathogen considering the vector at the genus level only.

Inferring vector-hosts interactions
The explicit model for predicting potential hosts was created using a database of point collections for one Class, Mammalia, and one genus, Lutzomyia, of the class Insecta. The mammal data set contains 37,297 unique point collections from geo-referenced localities for 419 terrestrial mammals occurring in Mexico-the full potential host range (GBIF; www.gbif.org, and CONABIO; www.conabio.gob.mx). For Lutzomyia, there were 270 collections points taken from published literature and from national collections: Instituto de Diagnóstico y Referencia Epidemiológica (InDRE, Mexico City), the Colección Entomológica Regional, Universidad Autónoma de Yucatán (UADY, Mérida) and the Laboratorio de Medicina Tropical at the Universidad Nacional Autónoma de México (UNAM, Mexico City).
First, we divide up a geographic region of interest into spatial cells, x α ,-in the present case Mexico-here we used a uniform grid of 3,337 rectangular cells of size 25km x 25km. The choice of an appropriate cell size is known in geography as the "modifiable areal unit problem". In terms of forming a spatial grid, there are at least two important considerations: The sizes of the statistical samples of the variables and their degree of correlation. Too fine a grid and there will be no co-occurrences, too rough and there will be little to no discrimination. It was checked explicitly in [36] that the relative ranking of mammals by the model was quite insensitive to the cell size over the range 5km to 100km. See also [51]. One then counts co-occurrences in each spatial cell between different taxa, or other variables. In the present case, the co-occurrences are between the presence of Lutzomyia, B i , and the presence of each distinct mammal species, I k .
We take the taxon distribution, B i (Lutzomyia), and a subset of potential niche variables. We are interested in the probability P(B i | I 0 ) = NB iAND I 0 /N I 0 , where N BiAND I 0 is the number of spatial cells where there is a co-occurrence of the taxon B i and the niche variables I 0 , which we take here to be biotic variables, and N I 0 is the number of cells where the niche variables take their stated values. The niche profile I 0 (x α ) associated with a spatial cell x α then determines the probability of the distribution variable, B i (x α ), in that cell, and one now has a predictive model. The problem of calculating P(B i | I') directly is that both N Bi AND I 0 and N I 0 are likely to be zero when the number of taxa or niche variables considered simultaneously is large, as there will tend to be no co-occurrences of so many variables. This can be ameliorated by considering a reduced number of both class and feature variables. For instance, P(B i | I k ) is determined by the number of co-occurrences of the taxon Bi and the particular niche variable I k and, in principle, allows us to find the most important statistical associations between the niche variables and the taxa distributions. However, P(B i | I k ) being a probability does not account for sample size. For example, if P(B i | I k ) = 1, this may be as a result of there being a coincidence of B i and I k in one spatial cell or 1,000. Obviously, the latter is more statistically significant. To remedy this we consider the following test statistic a binomial test which measures the statistical dependence of B i on I k relative to the null hypothesis that the distribution of B i is independent of I k and randomly distributed over the grid, i.e., P(B i ) = N B i =N, where N B i is the number of grid cells with point collections of species B i and N is the total number of cells in the grid. The sampling distribution of the null hypothesis is a binomial distribution where, in this case, every cell is given a probability P(B i ) of having a point collection of B i . The numerator of eq (1) then, is the difference between the actual number of co-occurrences of B i and I k relative to the expected number if the distribution of point collections were obtained from a binomial with sampling probability P(B i ). As we are talking about a stochastic sampling the numerator must be measured in appropriate "units". As the underlying null hypothesis is that of a binomial distribution, it is natural to measure the numerator in standard deviations of this distribution and that forms the denominator of eq (1). In general, the null hypothesis will always be associated with a binomial distribution as in each cell we are carrying out a Bernoulli trial ("coin flip"). However, the sampling probability can certainly change.
The quantitative values of ε(B i |I k ) can be interpreted in the standard sense of hypothesis testing by considering the associated p-value as the probability that |ε(B i |I k ) | is at least as large as the observed one and then comparing this p-value with a required significance level. In the case where N I k > 5-10 then a normal approximation for the binomial distribution should be adequate, in which case ε(B i |B k ) = 1.96 would represent the standard 95% confidence interval. When a normal approximation is not accurate then other approximations to the cumulative probability distribution of the binomial must be used.
As ε increases monotonically with the frequency of co-occurrence, we interpret a statistically significant positive correlation as inferring a potential biotic interaction. Here, between sandflies and the corresponding mammal, which in this ecological setting one would naturally interpret as the mammal being a blood source for the sandfly, and therefore a potential host. The higher the value of (P(B i | I k )-P(B i )) the greater the degree of spatial overlap between the species distributions and therefore the greater the risk posed by the corresponding mammal. Negative values of ε correspond to spatial overlaps that are less than one would expect from the null hypothesis.
The 419 mammal species were ranked according to ε. The resultant list serves as a predictive risk model, with the hypothesis that the highest ranked mammals correspond to the most important hosts, where, in the absence of other information, we assume that host competence is equal and importance is associated with the degree of spatial overlap between sandflies and mammal. All else being equal more overlap means more vector-host encounters. It should be noted that the method is not determining the physiological capacity of a mammal species to be a host but, rather, its potential epidemiological importance given that presence of mammal hosts is a necessary condition for the presence of the pathogen.

Constructing risk maps from inference networks
A corresponding biotic geographic risk model can be computed by calculating the probabilities P(B i |I 0 ), or proxies thereof, for each spatial cell. When I 0 is of high dimension, this can be done using different classification models, such as neural networks, discriminant analysis, etc. A particularly transparent, simple and effective approximation is the Naive Bayes approximation: where, in the first equality, Bayes rule has been used, and in the second it has been assumed that the niche variables I k are independent. The product here is over the N niche variables under consideration as conditioning factors for B i . In the case of the relationship between Lutzomyias and mammals, N represents the number of mammal species. A score function that can be used as a proxy for P(B i |I 0 ) is where B i is the complement of the set B i . For example, if B i is the set of cells with presence of taxon B i then represents the set of cells without presence. S(B i | I 0 ) is a measure of the probability to find the distribution variable B i when the niche profile is I 0 . It can be applied to a spatial cell x α by determining the niche profile of the cell, I 0 (x α ). As an example, for two biotic niche variables, B 2 and B 3 , that take values 1 (corresponding to the fact that there is a point collection associated with that cell) and 0 (there is no point collection associated with the cell), the four possible biotic niche profiles of any cell are (B 2 , B 3 ) = (0,0), (0,1), (1,0) and (1,1). The score contributions of each biotic variable are S(B i |B 2 ) and S(B i |B 3 ), calculated using the above formula. Hence, S(B i | I 0 ) = S(B i | B 2 , B 3 ) = S(B i |B 2 ) + S(B i |B 3 ). Thus, for any given spatial cell x α one can assign a niche profile, i.e. values of B 2 and B 3 , from whence it is possible to assign a corresponding score. If there is no statistical association between B i and B 2 or B 3 then the corresponding score contributions are zero.
An overall zero score signifies that the probability to find B i is the same as would be found if B i were distributed randomly. If the score is positive then there is a higher than random probability to find B i present and on the contrary if the score is negative. As each niche factor is treated separately in ε(B i |I k ) or S(B i |I k ) we can thus evaluate the relative contribution of any given niche factor and compare it to the contribution of any other. By determining the set of presence/no presence attributes in a spatial cell, eq (3) can be applied to each cell thereby determining the relative risk of that cell for the presence of Lutzomyias. As taking the ranked list as a predictive model involves several important assumptions; it is essential to test the model with experimental data. Obtaining the relevant data required sampling the spatial grid by collecting mammals and sandflies from different geographic points and then testing them for the presence of the parasite.

Sampling design for model testing
The sampling sites were selected as follows: a 25 x 25 km grid (as discussed above) was superimposed on a map of Mexico and this was used as template to determine the sampling sites. The sampling was stratified according to altitude so that only grid squares at < 2000 masl were used. Also, we excluded any grid square with >50% of water or urban cover. Sometimes the 25 x 25 km grid square selected was sub-divided in order to have more than one sampling site per square. Once the grid was established we selected 52 localities at random in 10 Mexican states and covering many different eco-regions and associated with a large selection of vegetation types. A random sample of spatial cells was necessary in order to fairly validate the model. If we had targeted the sampling to only those cells predicted to be highest risk then we would not be able to discriminate between high and low risk in the validation. Of course, once the model is validated it can be used with confidence in the future to preferentially identify those regions of highest risk where, for example, surveillance efforts should be targeted. The sampling was carried out by four field groups who, over a period of two years, collected 922 taxonomically identified specimens of 70 distinct species (for more details of animal sampling see S1 Text).

Laboratory methodology
Tissue samples were taken to the Laboratorio de Inmunoparasitología of the Unidad de Investigación en Medicina Experimental of the National Autonomous University of Mexico, UNAM, where PCR tests were carried out to identify the presence of the pathogen Leishmania (L.) mexicana.
DNA purification. DNA from animal tissues was purified from approximately 25 mg of starting material. DNA extractions were done with a DNeasy Blood and Tissue kit (QIAGEN, Germany), following the manufacturer's instructions. Genomic DNA was used for PCR-based amplification.
Oligonucleotides. To determine the presence of Leishmania we used oligonucleotides based on the Leishmania mini-circle kinetoplast DNA sequences that are conserved among species. The primer sequences used for amplification were 5´-CTRGGGGTTGGTGTAAAA TAG-3´(L.MC-1S) and 5´-TWTGAACGGGRTTTCTG-3´(L.MC-1R) [52]. For the species Leishmania (L.) mexicana we took primer IR1, designed by [53], which corresponds to the 32 final nucleotides of the conserved sequences from the 3' region of the small subunit of the 18S ribosomal gene as our forward primer: IR1 (5'-GCT GTA GGT GAA CCT GCA GCA GCT GGA TCA TT-3'), the reverse primer was LM17 (5 0 -CCC CTC TCC TCC TCC CC-3 0 ) [54].
Polymerase chain reaction amplification. The PCR was performed using 50 μl of the following reaction mixture: Taq PCR Master Mix (QIAGEN) (provides a premixed solution containing Taq DNA Polymerase, PCR buffer, MgCl2 and dNTPs), 100 ng of the corresponding oligonucleotides, and DNA from tissues, we used 1 μl of tissue extract corresponding to 100 ng of DNA. The amplification was performed in a Perkin Elmer 2720 thermocycler using different conditions, which depended on the oligonucleotides used. For L.MC-1S/ L.MC-1R Leishmania genus we used the PCR amplification with Leishmania mini-circle kinetoplast DNA specific primers was performed with 30 cycles of denaturation (95°C for one minute), annealing (55°C for one minute), and polymerization (72°C for one minute). For IR1/LM17 L. (L.) mexicana we used 35 cycles of 1 min at 94°C, 1 min at 65°C and 1 min at 72°C. In all cases the cycles were preceded by another cycle at 94°C for 5 min and a final extension cycle of 72°C for 7 min.
An analysis of the sensitivity of the primers L.MC-1S/ L.MC-1R was made with DNA from cultured promastigotes of L. (L.) mexicana using: 10 ng, 1 ng, 100 pg, 10 pg, 1pg, 100 fg, 10 fg and 1 fg DNA. PCR products were analyzed using electrophoresis in 1.5% agarose gels in TAE 1 × at 80 V. Gels were stained with 0.5 μg/ml ethidium bromide and photographed under a UV light source.
The PCR amplification products from two positively identified animals to genus and species were sequenced. The sequences were then compared and aligned using the National Center for Biotechnology Information, U.S. National Library of Medicine, Basic Local Alignment Search Tool (BLAST).

Results
In Table 1, we show the list of collected species with the number of individuals that tested positive for the presence of the parasite Leishmania (L.) mexicana and the number that tested negative. Of the 70 mammal species collected, approximately 1/6 of all species present in Mexico, 24 (34%) had one or more samples that tested positive for the presence of the Leishmania (L.) mexicana parasite. Thirteen species of bats, and one of squirrel, were identified for the first time as Leishmania hosts in Mexico. Of the total number of collected individuals (N = 922), 62 tested positive, yielding an average infection rate across all species of 6.7%, although infection rates varied greatly, both temporally and spatially, exhibiting considerable heterogeneity, from 0% to 60%, across distinct collection sites and season of the year. In addition to the percentage that tested positive we also include the 95% confidence interval limits using the Wilson score interval [55,56] for that percentage relative to the null hypothesis that all mammal species had the same baseline infection rate of 6.7%. However, for those species where the number of positives is zero we also calculate the exact probability to obtain this result, (1-p) N , where p is the baseline infection rate and N is the number of negative collections. Fig 1 shows the ranked values of ε, the statistical measure of degree of co-occurrence used to infer potential biotic interactions (see Eq (1) of the Methods section) for all mammal species as determined from the complex network exhibited in [36] (S1 Table) and considering Lutzomyia as a genus. The horizontal axis represents the null hypothesis that sandflies are distributed randomly with respect to mammal distributions, in other words P(B i | I k ) = P(B i ) and hence ε = 0. To determine the extent to which collection biases can influence the overall distribution we randomly redistributed all collections over those spatial cells that had at least one collection. This has the effect of removing correlations between one species and another while at the same time preserving any bias associated with under sampling of certain geographical areas. This random re-assortment was repeated 50 times and average values for ε determined for each species.
To test the predictive power of the risk model, and to better visualize the relationship between ε and the percentage of species that tested positive, we group the list of mammal species ranked by ε into deciles [57] each decile corresponding to 10% of the list, and compute the average value of ε for each decile. The result can be seen in Fig 2. The relative proportion of positives, P, is a strongly increasing function of ε. Note that this regression is only to demonstrate the predictive power of the underlying classification model based on ε, i.e., the statistical significance of co-occurrence. Similarly, in Fig 3 we see the relative correlation between the average value of ε and the percentage of individuals identified as positive. Once again, the relative proportion of positives is a strongly increasing function of ε. Of course, P is a multi-factorial function, P(X 1 , X 2 ,. . ., X N ) that depends on many factors, such as host competencies. Essentially, here we are considering a regression model for P(X 1 , X 2 ,. . ., X N ) with respect to the variable X 1 = ε and ignoring the rest as we do not have the relevant information to include them. Seen as a logistic regression at the species level, the associated relation is: Logit P = -1.415 + 0.186 Ã ε, with a p value of 0.03 on the regression coefficient. This confirms the statistically significant relation between ε as a statistical measure of geographical overlap and the probability to be a host of Leishmania (L.) Mexicana.
In Fig 4 we show a graph of prevalence at the species level versus ε for the 24 positive species, while in Fig 5, we compare the disease risk maps determined by using only the eight previously confirmed hosts of Leishmania versus one using the set resulting from our analysis, where by risk we mean probability of presence of the vector. A clear distinction can be seen between the areas of higher risk between the two models with the present model indicating a much higher degree of risk of presence of Lutzomyia in other than the southeast of Mexico.

Discussion
Host range is an important factor in the dynamics of a zoonosis, both as a variable that affects the overall risk of presence of the disease as well as in terms of determining optimal interventions. An essential factor in determining the importance of a host is its co-distribution with the disease vector. In this paper, we reported the results of an extensive, interdisciplinary, empirical investigation carried out to test the predictions of a model to predict the relative importance of mammal hosts for the pathogen Leishmania (L.) mexicana associated with the emerging disease Leishmaniasis in Mexico. Once again, we emphasise that importance here is an "all else being equal" notion, i.e., that the greater the overlap between species the greater the probability of a vector-host interaction and therefore a greater number of infected individuals and a greater probability of transmission. Of course, many other factors-host competence, host abundance etc. will influence the epidemiological importance of a given host.   As the distribution of Fig 1 predicts the most important potential mammal hosts of Leishmania (L.) mexicana, the deviation of the set of real ε values from the two random benchmarks considered, the ε = 0 line and the average random ε curve, shows that the distribution of Lutzomyia is strongly, positively correlated with a large number of mammal species, and, further, that this fact is not explainable by collection biases. Furthermore, the strong asymmetry of the distribution, with 149 potential hosts associated with statistically significant positive correlations with sandflies, compared with only two species with statistically significant negative correlations, is consistent with a hypothesis that sandflies are generalists that are capable of feeding on, and potentially infecting, whatever potential mammal species are available, as has been suggested for several mosquitoes [58]. If sandflies were specialists, associated with a few focal species, one would expect to see high ε values only for those particular species, due to the fact that the focal species are an important and necessary biotic element in the niche of the sandfly, while the other mammals would be incidental and therefore one would expect to see either random association or a positive association mediated by, say, abiotic variables.
The most important results of this paper are in Figs 2 and 3, where we clearly see the significant correlation between the probability for a species or individual to be a host and ε as a measure of the statistical significance of the degree of overlap between vector and host distributions. This is evidence that although many other factors, such as species abundance, species competence etc., enter, the fact that a vector and host must co-occur in order to interact leaves a significant predictive imprint. In other words, the more overlap the more opportunity for host-vector interactions. The multi-factorial nature of the complex relation between host and vector is implicit in that the relation between host probability and ε, although statistically significant, is not characterized by a very high value of R 2 . Especially noteworthy is the decile 1 result, where 3 out of 7 species were identified as hosts-Baiomys taylori, Peromyscus maniculatus and Peromyscus eremicus. These were collected from sites in Jalisco and Nuevo Leon, states from where collections of Lutzomyia were previously scarce or non-existent. However, recently, it has been confirmed that various species of Lutzomyias are relatively common in these areas. Thus, we believe that a part of the low ranking of these species is due to a systematic bias in the historic collection of Lutzomyias towards the southern part of Mexico. The impact of this on the relation between the percentage of positive host species or individuals and ε is substantial. A regression using only the first 9 deciles yields an R 2 = 0.92 versus 0.44 for all 10 deciles for the species-ε relation, while for the individuals-ε relation the corresponding figures are 0.65 and 0.39. Figs 2 and 3 also illustrate what we mean by host importance. The fact that the percentage of infected individuals decreases as a function of epsilon is consistent with the fact that in the higher deciles (higher ε values) there is a higher probability of a vector encountering an infected host than in the lower deciles (lower ε values). This does not imply, of course, that the pathogen is transmitted. It is however, once again, a necessary if not sufficient condition.
Besides validating both the general methodology as well as the specific model for Leishmaniasis of [36], our results also provide an extensive list of new hosts for Leishmania (L.) mexicana in Mexico that substantially changes what we know about the transmission cycle of the pathogen and the potential efficacy of interventions and/or surveillance efforts. As 33% of collected species tested positive for the presence of L. (L.) mexicana our results are completely consistent with the prediction that L. (L.) mexicana is a very generalist pathogen and that Lutzomyia is a very generalist genus. Furthermore, if we consider the probability that a species is a true negative at the 95% confidence level with respect to the null hypothesis of a 6.7% infection rate across all species, then we see that there are no true negatives. The closest is Heteromys desmarestianus with N = 30 and a probability of 88% of being a true negative. Interestingly, though, Heteromys desmarestianus has previously been identified as a host [49,59]. We also note that, of the 24 identified host species, only four-Phyllostomus discolor, Peromyscus eremicus, Glossophaga commissarisi and Glossophaga soricina-are associated with a prevalence that, with 95% confidence according to the Wilson score interval, is greater than the average prevalence across all species. Moreover, the first two species were associated with only one collection so that one would not expect the Wilson score interval to be reliable. The exact probability for an observed 100% prevalence with N = 1 is 6.7%.
Our principle result, as stated previously, is the validation of the methodology and the explicit model of [36] for Leishmaniasis. However, we may also further analyse our experimental results. We have noted that they are quite consistent with the hypothesis that most host species have prevalence values that are compatible with the null hypothesis of a constant prevalence of 6.7% across species. For those positive species with N > 20 in fact prevalence is very stable. Of course, heterogeneities are to be expected. This can be due not only to intrinsic differences in host competence but also, potentially, to spatial heterogeneity associated with epidemiological "hotspots" where many variables together may be favourable for transmission. In Fig 4 we see that there is no noticeable relation between prevalence and ε. We would argue that there should be a dependence if prevalence is averaged for a given species over several geographical locations that systematically sample the range of that species. However, the sampling intensity of the present data is not capable of showing such effects at a species by species level. The fact that the species by species infection rates are consistent with a relatively constant 6.7% prevalence is compatible with the hypothesis that the competence of the different mammal host species is relatively homogeneous. However, as can be seen in Fig 4, the degree of dispersion of the data is large. Partly this is due to the fact that several of the observed prevalences are associated with very small sample sizes. For instance, the species with prevalence of 100% are associated with only one individual. These can be considered as outliers. This heterogeneity in sample size at the species level is one reason why a coarse grained analysis, as observed in Figs 2 and 3, is more appropriate. Taken at face value, the relative homogeneity of prevalence would also argue against any potential dilution effect due to higher biodiversity as this depends on strong competence heterogeneity amongst hosts. In fact, these results would be consistent with the fact that Lutzomyias do not differentiate very much among different potential sources of blood meal [60].
The results of this study stand in stark contrast to our previous understanding of the hosts of different species of Leishmania: Of the more than 2000 mammal species on the American continent only about 50 (2.5%) have been identified as hosts of Leishmania [44], while in Mexico, 8 out of 419 (2.1%) have been identified as hosts [but see 61]. If our result that 33% of collected mammal species are hosts is extrapolated to other non-collected species then potentially hundreds of mammal species could be implicated as hosts of Leishmania. Of the collected species that tested positive, 13 were bats, identified for the first time as hosts of Leishmania (L.) mexicana in Mexico [61]. We also identified the grey squirrel as a peri-domestic host species with a high degree of contact with human settlements [62].
If sandflies are generalists, in that they feed off a large number of species across many genera, and pathogen competencies are relatively uniform, as is consistent with our observations, then one would also hypothesize that Leishmania (L.) mexicana is also a generalist in that it can infect a large number of potential host species. In the case of Chagas disease, the impressive genetic plasticity and associated adaptability of Trypanosoma cruzi has been amply studied [63][64][65], as well as its epidemiological implications. Our results suggest that L. (L.) mexicana should also demonstrate a high degree of genetic plasticity and adaptability to be able to infect such a wide array of mammal species. Recent work on the genetics of Leishmania seems to be consistent with this viewpoint [66,67].
Given that previously there were only eight confirmed host species of Leishmania in Mexico, the results of this paper change the risk landscape for this neglected disease in Mexico, both geographically, and in terms of its control or elimination, with similar potential consequences for other countries. Although it is known that Lutzomyias have an ample geographic range and therefore are a risk element for Leishmaniasis in many areas of Mexico, what the present work demonstrates is that there are a large number, and potentially very many more, of hosts involved in the transmission cycle of Leishmaniasis. This complicates both interventions and surveillance. In terms of surveillance we would argue that our model yields a good first approximation as to which host species to survey-those that have been identified as hosts and have the highest ε values. Of course, further field and laboratory work should be carried out to better understand the underlying factors influencing host competence, both at the species level and across different geographical locations for the same species. It is also necessary to test species that are high on the model list but up to now have not been checked. In terms of interventions, with such a large host range, that spans both sylvatic and peri-domestic species, it will be very difficult to eliminate any enzootic transmission cycle, with consequential difficulties in the long term elimination of the vector. Geographically, as can be seen in Fig 5, a risk map derived from the distributions of the eight host species known before the results of this study is completely different to a risk map associated with the 30 species of host that we have now confirmed indicating much higher risk in states such as Jalisco and Nuevo Leon that have until recently been considered low risk compared to the south east.
Our results show that within species collection data there is a great deal of useful information about interspecific interactions and community structure that may be deduced with innovative modelling techniques, such as complex inference networks, and applied to important problem areas such as multi-host diseases. It also shows the importance of the systematic and unbiased collection of data associated with the distributions of potential vectors and potential reservoirs. For instance, the models created showed a higher risk for presence of Lutzomyia in the north of Mexico, along the border with the US, than had previously been the case. In fact, recent human cases have been reported in this region, and recent field work has shown that the presence of Lutzomyia is more extensive than had been previously thought [44,68].
Supporting Information S1