Using Biotic Interaction Networks for Prediction in Biodiversity and Emerging Diseases

Networks offer a powerful tool for understanding and visualizing inter-species ecological and evolutionary interactions. Previously considered examples, such as trophic networks, are just representations of experimentally observed direct interactions. However, species interactions are so rich and complex it is not feasible to directly observe more than a small fraction. In this paper, using data mining techniques, we show how potential interactions can be inferred from geographic data, rather than by direct observation. An important application area for this methodology is that of emerging diseases, where, often, little is known about inter-species interactions, such as between vectors and reservoirs. Here, we show how using geographic data, biotic interaction networks that model statistical dependencies between species distributions can be used to infer and understand inter-species interactions. Furthermore, we show how such networks can be used to build prediction models. For example, for predicting the most important reservoirs of a disease, or the degree of disease risk associated with a geographical area. We illustrate the general methodology by considering an important emerging disease - Leishmaniasis. This data mining methodology allows for the use of geographic data to construct inferential biotic interaction networks which can then be used to build prediction models with a wide range of applications in ecology, biodiversity and emerging diseases.


Introduction
A fundamental underlying goal of biology is to model the distribution of biota and identify their interactions, thus permitting both an understanding of current distributions and the possibility of predicting future ones [1]. Such models have important applications, such as in biodiversity [2] and emerging diseases. Networks offer an important tool for understanding and visualizing biotic interactions and have been used in a variety of contexts [4,5,6]. They are constructed by linking nodes of the network, usually species that have a known interaction, such as in trophic webs. However, as it is not feasible to exhaustively track the large numbers of ecological interactions, the question arises: can biotic interaction networks be constructed other than by direct observation, using other available data?
There is evidence that the evolutionary dynamics of interspecies interactions create rich geographic mosaics [7]. Moreover, phylogenetic research has shown that species are conservative when it comes to the taxa with which they interact, both spatially and temporally. As an example relevant to this paper, blood sucking insects have evolved phenotypic traits to optimize hostseeking and feeding [8]. Co-distributions of host and parasite will then reflect the strong biotic relation that exists between them. Similarly, as a reflection of the potential confrontation of species, co-occurrence could also engender an interaction in the absence of a pre-existing one [9]. We are thus led to consider distributional data for constructing inter-species interaction networks.
Collection data offer an important proxy for modeling distributions. Here, we show how such data can be used to infer potential inter-species interactions, construct an associated network and, further, show how that network can be used to construct prediction models. Collection data are already widely used in biodiversity informatics [10,11], and have been principally used for constructing species distributions from abiotic niche variables only. The data are taxonomic in nature and georeferenced, the set of point collections of a species in a geographical region giving a sampling for the distribution of the species in that region. Of course, there is an important question of sample bias in the data [12,14] (see also the Materials and Methods section), though its extensive use and utility, even in areas where data are scarce [13], is testament to the fact that it can yield important information if treated carefully. Additionally, in the case of urgent problems of great social impact, such as that of emerging diseases, it is important to try to leverage the data that actually exist, at least until better, more bespoke, data become available.

Results
Dividing up a geographic region into spatial cells, x a , we take as our underlying variable of interest, B i (x a ), a measure of the distribution of the ith taxon in the spatial cell x a . The specific form of B i is determined by the available data -relative or absolute abundance, presence/absence or presence only. A fundamental object of interest is P(B i (x a )|I(x a )), the probability that the distribution measure B i (x a ) takes a certain value in the spatial cell x a conditioned on, I(x a ), which is composed of, in principle, all biotic and abiotic variables that affect species distributions, and which constitute the biotic and abiotic profiles of the corresponding niche [10]. An example of interest would be that B i (x a ) represents presence of the ith species in the spatial cell x a .
As we have no underlying theory with which to construct P(B i (x a )|I(x a )) we will use a data mining approach to estimate it, using point collection data as a proxy for the actual distribution of taxa. Point collection data here represent the set of georeferenced localities (latitude, longitude and date) of museum voucher specimens. It is important to remember that the distribution of taxa is a direct result of the past and present interactions of all relevant causative factors -climactic, phylogenetic, co-evolutionary, ecological etc. Hence, part of the task of any analysis is to determine, out of the myriad of factors that contribute to I, which ones are the most predictive in determining a particular distribution. An immediate problem is that, as every spatial cell is unique, for each x a one has a statistical sample of size one and hence P(B i (x a )|I(x a )) = 0 or 1.
To overcome this, one first constructs the relationship between B i and I, such as P(B i | I), via a sampling of all spatial cells in order to obtain the relationship between a given distribution measure and the associated niche variables. With this in hand, a ''profile'' of any given spatial cell x a can be constructed in terms of the biotic and abiotic niche variables and the relationship between B i and I used to determine P(B i (x a )|I(x a )) (see Materials and Methods section).
As P(B i | I) involves counting the number of spatial cells where there is a co-occurrence of the ith species with a particular configuration of the niche variables I, if I is of high dimension then the number of cells where there are co-occurrences will be small or zero. We thus restrict attention for the moment to the case where I is a single variable, I j , so that P B i jI j À Á~N Bi&Ij N Ij , where N Bi&Ij is the number of cells with a co-occurrence of the distribution variable B i and the niche variable I j , and N Ij is the number of cells with niche variable Ij. In the case where Ij is also a taxon distribution, and we consider presence, then P (B i |B j ) measures the probability of presence of taxon B i given the presence of taxon B j and is thus a measure of the statistical association between B i and B j . As P (B i |B j ) does not take into account statistical confidence however, we consider rather which also measures the degree of confidence one can have in the statistical association between B i and B j relative to the null hypothesis, P(B i ), that the distribution of B i is independent of B j and distributed with this probability over the region of interest. Essentially, this is a one-sided binomial test where the null hypothesis is that the distribution of B i is random over the sample space; in this case the cells of the region of interest. It can, of course, be useful to consider other null hypotheses. For instance, one could use as null hypothesis P(B i | A) where A represents a set of abiotic factors, or the result of a niche-model such as GARP or MaxEnt [15,16]. Values of |e(B i |B j )| greater than a certain threshold (see Materials and Methods section) measure the degree to which the data is consistent with the null hypothesis. In the case where the binomial distribution associated with P(B i ) can be approximated by a normal distribution then values of |e(B i |B j )|.2 would indicate an inconsistency between the data and the null hypothesis to the 95% confidence level. For any pair of taxa, B i and B j , taken as nodes of a network, a link between them, whose ''strength'' is given by e(B i |B j ), or P (B i |B j ), can be graphed. The resulting interaction network then offers a visualization of the inferred statistical dependencies between different taxa. Note that, contrary to networks that are common in the literature, that represent known interactions, such as between predator and prey in a trophic web [15], this network represents statistical associations from which inferences about real causal interactions can be made and then tested. Higher order statistical associations, such as P (B i |B j B k ), can also be examined. Such an interaction could be represented by three nodes, with links from B j and B k to B i , and would represent the degree of statistical dependence of taxon B i on the co-occurrence of the taxa B j and B k . From the network, for a given node B i , a ranked list of values of e(B i |B j ), or P (B i |B j ), can be taken as a model for predicting the most important potential biotic interactions of the species B i . To determine P (B i |I 9 ) when I 9 is high dimensional, a statistical model must be used to approximate it. A very useful and transparent one, that can be deduced using only the properties of the network, is the naive Bayes approximation [22] (see the Materials and Methods section), wherein a score function, S (B i | I9), that is a monotonic function of P (B i | I 9 ), can be constructed. The score consists of a sum of contributions from each niche variable, both biotic and abiotic, from which it is possible to observe which are the most important niche variables.
As an example of the general methodology we consider an important emerging disease -Leishmaniasis -a vector borne disease widely distributed in tropical regions that is estimated to affect 12 million people in 88 countries. Since Leishmaniasis is a zoonotic tropical disease, sylvan reservoirs are crucial to the maintenance of the parasite in ecological communities and, further, are intimately associated with human transmission [18]. Reservoirs of Leishmania can be classified as primary and incidental, according to their importance in the long-term transmission of the parasite, being considered incidental if they are dead ends that do not transmit to vectors [19]. Although direct experiment could determine to which type a given reservoir belongs, when there are many potential reservoirs other alternatives, such as that presented here, are more feasible.
We used collection data points for 427 terrestrial mammal species occurring in Mexico as potential or confirmed reservoirs and 11 species of Lutzomyia as confirmed or potential vectors for Leishmania. The description of the data set can be found in the Materials and Methods section. Lutzomyia is a genus of ''sand flies'' that in the New World is responsible for the transmission of the Leishmania parasite. Only females suck blood for egg production. In Mexico there is little information about which vectors are involved in transmission of the parasite in different geographic regions. The only confirmed vector is Lutzomyia olmeca olmeca [20]. However, several species have been found with the parasite -Lutzomyia olmeca olmeca, Lutzomyia cruciata and Lutzomyia ovallesi [21]. With respect to transmission of the visceral form of the disease the principle vector Lutzomyia longipalpis has been collected in Mexico but has not been reported with the parasite. For the secondary vector Lutzomyia evansi, there exists only one collection in the state of Chiapas which was without infection [22]. In Mexico, there are only eight mammal species found infected with Leishmania mexicana parasites, responsible for the cutaneous form of the disease, identified in the state of Campeche in southern Mexico [23,24,25]; a very small number when compared to the total number of potential reservoirs. It is important, therefore, to be able to predict which currently unidentified mammals are most likely to be important as actual or potential reservoirs for the disease. As a measure of statistical association we consider P(v i |m j ) and e(v i |m j ), where v i represents the ith vector and m j the jth potential reservoir. There are 4697 potential vector-reservoir pairs. In Figure 1 we show the 241 most important positive associations (highest values of e) between Lutzomyia as vectors and mammals as suspected and confirmed reservoirs for Leishmania. The vector species are marked as red nodes, while the confirmed reservoirs are marked as green. The darker the link, the stronger is the associated statistical dependence between the associated Lutzomyia and mammal.
The connectivity of the network is related to the geographical distribution of the different species and has consequences for the way in which a parasite could propagate across the network from one geographical region to another. The separated subnetwork corresponds to L. anthophora, a species indigenous only to the north of Mexico and the United States. For Lutzomyia nodes, the vertex degree dictates with how many mammals a given vector shares important positive statistical associations, while, for mammal nodes, the vertex degree tells us how many vectors are potentially exploiting the mammal. A high vertex degree for a given vector shows that it could potentially exploit many different mammals.
Moreover, if there are many connections to mammals that are not connected to other vectors, then all else being equal, it would be evolutionarily suboptimal for the vector not to exploit them. L. cruciata and L. longipalpis, in particular are associated with large numbers of mammals that have no statistical relation with other vectors. On the other hand, L. olmeca, L. ovalesi, Lutzomyia shannoni and Lutzomyia panamanensis are all within a highly connected part of the network that corresponds geographically to the peninsula of Yucatan, where many mammals are associated with several different vectors. In such circumstances, a vector may adopt a strategy of specializing to a smaller group of species in order to avoid competition. Interestingly, four of the eight infected rodent reservoirs -Peromyscus yucatanicus, Ototylomys phyllotis, Reithrodontomys gracilis and Heteromys gaumeri, all restricted to the peninsula of Yucatan, have very high vertex degrees, a fact that associates them with higher risk, as potentially many different vector species can exchange parasites with them.
Besides offering substantial insight into the ecological interactions between potential vectors and reservoirs of a disease, the interaction network can also be used to obtain predictive models.
Here we consider two such models -one for directly predicting the most important potential disease reservoirs and another for predicting a measure of disease risk for a given geographic area. Turning first to the prediction of potential reservoirs: with e(v i |m j ) in hand, for a given vector v i , we can construct a ranked list, from maximum to minimum value, of e(v i |m j ), over all pairs (v i ,m j ), i.e., a ranking of the links of a given node according to their strength. Those mammals with the highest values of e are predicted to correspond to the most important potential reservoirs for that vector. In Table 1 we show the results for the highest 150 values of e(v|m j ), where to obtain the list we have grouped together the different Lutzomyia species into one group v to form a list of 427 values of e(v|m j ) as a function of j. The highest ranked mammals have the highest degree of statistical correlation with Lutzomyia, with the implication that these mammals are the most important potential reservoirs for Leishmania. By grouping together the different Lutzomyia species we are considering association between a given mammal species and the different species of the Lutzomyia genus present in Mexico, rather than with individual species, thus increasing the sample size and allowing for more robust statistics. A secondary logic for this is also that the biomass of parasite that can pass from vector to mammal in a given spatial cell depends on the number of different vector species that are present in that cell. Thus, a mammal with a high probability of co-occurrence with more than one Lutzomyia will, all else being equal, present a higher degree of risk of having the Leishmania parasite transmitted to them than one that has a high degree of occurrence with only one species.
Such a ranked list provides a general model for predicting the most important likely reservoirs for any given disease. Note that, of the eight infected reservoirs of Leishmania in Mexico, six of them, including the four confirmed, appear in the top 7% of ranked predictions of most important potential reservoirs. If we take as null hypothesis that the confirmed reservoirs are distributed randomly in the ranked list, then the probability that they appear with their actual rankings is less than 10 28 , thus showing that the model's results are statistically significant and that the model predicts very well, especially given the relative lack of information on which it is based, in that at no point was information on confirmed reservoirs used to ''train'' the model. Of course, one could argue that, all else being equal, there should be a higher degree of co-occurrence between Lutzomyia species and those mammals that are most widespread, as these will have had a higher probability of having being tested as potential reservoirs. Of course, if this were true, it would greatly reduce the predictive power of the model. We tested this hypothesis on a subset of 360 mammal species where distribution data was readily available. The positions of the confirmed reservoir species ranked according to their area of distribution were: 25, 152, 154, 200, 224, 230, 249, 255 and 257; while when ranked according to our prediction model the positions were 4, 6, 8, 22, 27, 28, 40, 88 and 130. As a simple statistical comparison one can compare the mean rank from both methods using an independent two sample t-test. The test statistic value is 5.4 corresponding to a p value of less than 0.001 clearly indicating that the predictive power of our model cannot be explained by assuming that those species with larger distributions are more likely to be confirmed reservoirs.
The third step we will take is to construct a predictive model to quantify disease ''risk'' in any given geographic cell. Here we take as risk measure the probability that disease vectors are present, while the prediction itself is based only on biotic factors, i.e., the presence of potential mammal reservoirs. Explicitly, a score function, S (B i | B), for predicting class membership is constructed, where B i is associated with the ith vector species and B represents the presence of mammal species, B 1 , …, B N , and related to the posterior classifier probabilities, P (B i | B), using the naive Bayes being associated with directed links from vector to reservoir in the network. The advantage of this approximation is that the contribution of each biotic niche variable, B i , is independent of the rest, so that, in the case where abiotic variables are also explicitly included, the relative importance of both biotic and abiotic factors can be studied. As one would expect in the present case, biotic variables play a more important role than abiotic ones, due to the direct dependence of a vector on its associated reservoirs. With S(B) in hand, the biotic niche profile of any geographical area can be determined using a ranked list of niche characteristics and allows one to see at a glance which species are playing an important role.
In Figure 2 we see the results for the grid partition of Mexico we used earlier. The redder/whiter the area the higher/lower the predicted probability for finding Lutzomyia based only on cooccurrence with mammal species, the mid-range being associated with the probability P(B i ) associated with the null hypothesis. Also shown is the georeferenced set of point collections of Lutzomyia. As can be seen, the agreement is good, though there are one or two outliers. Finally, on the map we also see those geographical regions where cases of Leishmaniasis have been reported. The shaded regions correspond to ''municipios'' (municipalities) where Leishmaniasis cases have been reported in the last 40 years. Note that the area of different municipios can vary greatly. In regions where there is no cross-hatching there are no cases that have been reported to the Secretaria de Salúd Pública (Governmental Public Health Agency) in Mexico. This does not necessarily imply that there are none, as there is no obligatory reporting of cases of Leishmaniasis in Mexico. In this sense reported cases are the equivalent of presence data, while no reported cases does not imply ''absence''. A noteworthy feature of the map is that there are no areas with reported cases where the model does not predict a higher than random probability for presence of Lutzomyia. In interpreting the apparent overprediction several comments are in order: First of all, as mentioned, the quality of reporting data of cases of Leishmaniasis varies significantly from state to state in Mexico. Secondly, the map is of degree of risk due to biotic factors only; the output being a score that measures the probability of Lutzomyia being present in a given spatial cell. In that sense, it is a map associated with only one type risk factor, all be it an important and necessary one for the presence of the disease in the human population which, obviously, depends on many other factors. By including such factors, for example, abiotic or sociodemographic variables, more complex risk models can be simply created using our methodology.

Discussion
The main contribution of this paper is to show how biotic interaction networks may be constructed inferentially using a data mining approach applied, in this case, to point collection data, rather than by direct observation, and to show that these networks can be used, not only to understand and visualize potential interspecies interactions, but also to formulate prediction models. The important area of emerging diseases was used as a test bed to show the utility of the approach. The main logic of this methodology is that current distributions of biota, as proxied by point collection data for the example given here, adequately reflect all causal influences, both biotic and abiotic. The task, for a given set of input variables, is to discriminate which ones are of greater influence for a particular distribution. In this paper we used only biotic variables. A statistical dependence between two species infers, but does not prove, a direct biotic causal relationship. Thus, for a pair of nodes the strength of the link between them measures the degree to which two species tend to co-occur. If they co-occur in a statistically significant way we are prompted to identify as a plausible explanation a vector-reservoir interaction.
In the case of Lutzomyia and mammals this understanding comes from the natural potential direct causal relationship there: that the Lutzomyia feed on the corresponding mammal. The properties of the corresponding biotic network show to what extent a given vector is exploiting its potential food sources, evolutionary dynamics giving a logic as to why this usage should be optimal. From the network, the corresponding list of predicted reservoirs for a given Lutzomyia is not based on the physiological possibility that a given mammal is a reservoir but, rather, on the fact that a mammal with a high fraction of co-occurrences is more likely to be an important food resource for Lutzomyia than one with a small fraction and, therefore, that there is greater transmission of the parasite from one to the other. Moreover, as e(v i |m j ) increases as the range of the mammal m j grows, then this measure also predicts the degree of importance of the reservoir, a reservoir of small range being of less potential impact, all else being equal, than one of ample range. As mentioned, the utility of the model is clearly in evidence, given that all known reservoirs in Mexico are highly ranked in the complete list of 427 possible candidates.
To create spatial prediction models we used a model that utilized only information that came from the biotic interaction network. The associated score is a measure of the probability that Lutzomyia are present, which we can take as a proxy for the probability that the disease is present. To relate this to the number of cases in a more sophisticated model would require the inclusion of socio-economic and socio-demographic variables among others.
The results of this paper clearly lead us in the direction of making corresponding hypotheses that can be verified by further empirical research. Our ranked list of potential reservoirs is, as emphasized, based on the relative importance of the potential reservoir in terms of what biomass of parasite can potentially be harbored in a given spatial cell rather than what mammals are physiologically capable of being reservoirs. To test this, the following scenario may be envisaged: consider the known distribution of a given mammal from the list; select spatial cells at random from this distribution; in each cell capture the chosen mammal species and test for the presence of Leishmania. The appropriate metric is the proportion of spatial cells in which specimens were found with the parasite or, alternatively, if sufficient statistics may be obtained, e(cells with specimens with parasite | total cells with specimens). This would be repeated for different mammal species. The hypothesis is that a highly ranked species will yield higher values for these two metrics than a low ranked one. To facilitate testing the hypothesis, the most appropriate species would be those chosen from different points in the ranked list that are common in a given geographical region and easy to capture. Of course, many mammals simply do not have any geographical overlap with the vectors. Strictly speaking one should consider these mammals too and test for presence of the parasite. Common sense would dictate that for those species far away from the known distribution of the vectors there is effectively zero probability of finding the parasite thus obviating the need to explicitly check these areas. Work is currently being planned to undertake these tests.

Materials and Methods
The data set consisted of point collection data associated with one Class, Mammalia, and one genus -Lutzomyia. The mammal data set consisted of 37,297 point collections from georeferenced localities for 427 terrestrial mammals occurring in Mexico. The data were obtained from museum voucher specimens from national and international museum collections, public electronic databases (MaNIS; www.manis.gob.mx, and CONABIO; www. conabio.gob.mx) and published records [27,28]. For Lutzomyia, there were 270 point collections, taken from published literature and from national entomological collections (Instituto de Diagnóstico y Referencia Epidemiológica (InDRE, Mexico City), the Colección Entomológica Regional Universidad Autónoma de Yucatán (UADY, Mérida) and the Laboratorio de Medicina Tropical at the Universidad Nacional Autónoma de México (UNAM, Mexico City), associated with 11 species. For both data sets, each locality was georeferenced to the nearest 0.01 degrees of latitude and longitude using 1:250,000 topographic maps (INEGI; www.inegi.gob.mx, Instituto de Geografía, Universidad Nacional Autónoma de México; www.igeograf.unam.mx). Point collection data was, of course, not collected in order to provide an unbiased sampling of underlying species abundance and therefore must be considered carefully to understand potential statistical biases that might be present. The utility and limitations of point collection data have been amply discussed in [12,14].
With respect to the data set for Mexican mammals, this data has been collected over a period of more than 100 years with a consequently large number of collectors [24,25]. Hence, although the data has not been collected systematically, it has probably led to an adequate sampling. Additionally, mammals are the best known and collected group in Mexico. In the case of Lutzomyia the coverage is less but still represents the best available. With the registered cases of Leishmaniasis, unfortunately, there is no compulsory reporting of these in Mexico. So one can infer where the disease is present but not where it is absent. In problems of great social impact, such as that of emerging diseases, it is important to try to leverage the data that actually exists, at least until better more bespoke data becomes available. Parasite detection studies in potential reservoirs have been carried out principally in the state of Campeche. Van Wynsberghe et al [21] analyzed the evolution of the infection using parasitological methods in 29 naturally infected rodents. The mammals belonged to four species: Sigmodon hispidus (2), Oryzomys melanotis (12), Ototylomys phyllotis (9) and Peromyscus yucatanicus (6). In a second study [22], infection by Leishmania mexicana was detected in eight mammal species using two methods -in culture and PCR. The Leishmania parasite was confirmed by both methods in six species: O. phyllotis, Heteromys gaumeri, O. melanotis, P. yucatanicus, S. hispidus, and Heteromys desmarestianus. In the other two species it was confirmed using only via one of the methods: in culture for Marmosa mexicana and by PCR for Reithrodontomys gracilis.
As collection data is fundamentally tied to a taxonomic classification, it is natural to describe the biota in terms of taxa and consider the spatio-temporal distribution of a species for example. For a data set that covers a spatial area A and time interval T one may divide the space and interval into spatiotemporal cells, (x a , t b ) which form a mesh that partitions both the geographic region and time interval. The labels x a and t b simply indicate the particular spatio-temporal cell we are considering. A point collection associated with this cell is such that it corresponds to a latitude and longitude within the spatial cell x a and to a collection date in the temporal cell t b . We can consider the distribution of the set of species, B(x a , t b ) = (B 1 (x a , t b ), … B NB (x a , t b )), where B i (x a , t b ) is a measure of the distribution of the ith taxon in a spatial cell x a , in the time interval t b . A natural realization of B i (x a , t b ) would be the abundance of the taxon i in the spatial cell x a , in the time interval t b as measured by its frequency or relative frequency. A less discriminating realization for B i (x a , t b ) would be a function that indicates only presence or presence/absence in the geographic region x a in the time interval t b . As B i (x a , t b ) is a stochastic variable, the distribution of any taxon B i (x a , t b ) is described by a probability distribution, P(B i (x a , t b )), whose evolution, in principle, depends on both biotic factors, B j (x r , t s ), associated with other species, and abiotic factors, A(x r , t s ) = (A 1 (x r , t s ), …, A NA (x r , t s )) , such as temperature, precipitation etc., where we consider cells x r , t s that may be different to a given cell x a , t b to indicate that, in principle at least, there may be statistical associations between a given spatio-temporal cell and others. The full ecological niche at x a and t b can be described by a vector I(x a , t b ) = (A 1 (x a , t b ),…, A NA (x a , t b ); B 1 (x a , t b ),…, B NB (x a , t b )).
A full model would consist of determining P (B i (x a , t b )) = F (I (x r , t s )), relating the distribution of a subset of biota at one place and time to all biotic and abiotic factors considered at different places and times. Of course, there are no underlying fundamental principles on which to build the function F. We therefore adopt a non-parametric ''data mining'' approach, modeling the distribution directly using available data, rather than constructing an a priori parametric model. An advantage of this approach is that the observed distribution is a direct result of the past and present interactions of all relevant causative factors -climactic, phylogenetic, co-evolutionary, ecological etc. Nothing is omitted. However, an observation of P (B i (x a , t b )) in itself does not provide a predictive model. To create such a model we consider the problem as a classification task, relating a class, such as the class of cells with presence of a given species, to a feature vector I using the conditional probabilities P (B i | I). Converting the problem to one of classification is very natural from the point of view of presence or presence/absence. In the case of abundance a coarse graining of the abundance data in a given spatial-temporal cell is required. This can be achieved in many ways, depending on how many classes are posited and the criterion by which a given abundance fits in a given category. For example, one might classify abundance into three categories -Low, Normal and High -where Low is any abundance at least one standard deviation below the average and High is any abundance at least one standard deviation above the average. One can then naturally consider the conditional probability that a High abundance of species B i is found given a High abundance of species B j . Of course, in order to do this, one requires abundance data in the first place. As this is less common than presence or presence/absence data, and simply not available in the context of emerging diseases such as Leishmaniasis, we will here focus on the latter. For the same reason, in the following, we will also restrict attention to the spatial dependence of the distributions and ignore the temporal aspect, as the data simply is not capable of reliably describing temporal changes.
The class, we will take to be a taxon distribution, B i , while the feature vector set is taken to be a subset of niche variables I'(I. In this case, I 9 , represents a niche profile with both biotic and abiotic components which constitute the biotic and abiotic profiles of the niche. For a given taxon, B'(B, and niche variables, I'(I, our chief object of study is the probability P(B i | I' 9 ) = N BiAND I 9/N I 9, where N BiAND I 9 is the number of spatial cells where there is a cooccurrence of the taxon B i and the niche variables I 9 , and N I 9 is the number of cells where the niche variables take their stated values. The niche profile I 9 (x a ) associated with a spatial cell x a then determines the probability of the distribution variable, B i (x a ), in that cell, and one now has a predictive model. Note that, although we concentrate on biotic variables in the present paper, in the current approach, all niche variables can be treated on a democratic footing. The problem of calculating P (B i | I9) directly is that both N Bi AND I 9 and N I 9 are likely to be zero when the number of taxa or niche variables considered simultaneously is large, as there will tend to be no co-occurrences of so many variables. This can be ameliorated by considering a reduced number of both class and feature variables. For instance, P (B i | I k ) is determined by the number of co-occurrences of the taxon B i and the niche variable I k and, in principle, allows us to find the most important statistical associations between the niche variables and the taxa distributions. However, P(B i | I k ) being a probability does not account for sample size. For example, if P(B i | I k ) = 1 this may be as a result of there being a coincidence of B i and I k in one spatial cell or 1,000. Obviously, the latter is more statistically significant. To remedy this we consider the following test statistic which measures the statistical dependence of B i on I k relative to the null hypothesis that the distribution of B i is independent of I k and randomly distributed over the grid, i.e., P B i ð Þ~N Bi =N, where N Bi j is the number of grid cells with point collections of species B i and N is the total number of cells in the grid. The sampling distribution of the null hypothesis is a binomial distribution where, in this case, every cell is given a probability P(B i ) of having a point collection of B i . The numerator of equation (1) then, is the difference between the actual number of cooccurrences of B i and I k relative to the expected number if the distribution of point collections were obtained from a binomial with sampling probability P(B i ). As we are talking about a stochastic sampling the numerator must be measured in appropriate ''units''. As the underlying null hypothesis is that of a binomial distribution, it is natural to measure the numerator in standard deviations of this distribution and that forms the denominator of equation (1). In general, the null hypothesis will always be associated with a binomial distribution as in each cell we are carrying out a Bernoulli trial (''coin flip''). However, the sampling probability can certainly change. For instance, one could take as null hypothesis a binomial distribution with sampling probability P(B i |M = 1) = N Bj /N M = 1 , where M here is a binary variable associated with the fact that a niche-variable model, such as GARP or MaxEnt, says whether the species B i is present or absent. N M is then the number of cells where the niche model says there is presence. Taking P(B i |B k ,M) relative to the null hypothesis P(B i |M) tells us how the presence of species B j is associated with the presence of B k in the context of cells where a niche model has indicated the presence/absence of B k . In other words, how B k affects the distribution of B i in those places where the niche model says B k is present/absent.
The quantitative values of e(B i |B k ) can be interpreted in the standard sense of hypothesis testing by considering the associated p-value as the probability that |e(B i |B k )| is at least as large as the observed one and then comparing this p-value with a required significance level. In the case where N Bj §5{10 then a normal approximation for the binomial distribution should be a decent approximation and in this case e(B i |B k ) = 2 would represent the standard 95% confidence interval. In the case where a normal approximation is not accurate then other approximations to the cumulative probability distribution of the binomial must be used.
In the case where I k = B k , another taxon, then P(B i |B k ) and e(B i |B k ) are measures of the statistical association between the two taxa, e(B i |B k ) having the added advantage of having built into it the degree of statistical confidence that one may have about the association. Note that such a statistical association does not necessarily prove that there is a direct ''causal'' interaction between the two taxa. Rather, it allows for a statistical inference that may be validated subsequently.
From either P(B i |B k ) or e(B i |B k ), an inferential interaction network between taxa can be constructed where the nodes are the taxa and the links represent the degree of statistical dependence of one on the other. The links must represent the degree of interaction as otherwise one has a uniform fully connected network. This can be done, for instance, by only showing the principle interactions above a certain threshold of e or P, or by having the link width or size depend on their values. Note that such an interaction network, being based on point collection data, is inferential with respect to real biotic interactions between the taxa. This is distinct to other networks where network links are determined observationally. P (B i |B k ) and e(B i |B k ) are measures of pair-wise dependencies between taxa. They can be generalized to take into account higher order interactions. For instance, e(B i |B k B m ) measures the statistical interaction between the joint presence of taxa B m and B k and that of taxon B i .
Probabilities P (B i |I 9 ), where I 9 is of high dimension, can be constructed using different classification models, such as neural networks, discriminant analysis etc. A particularly transparent, simple and effective approximation is the Naive Bayes approximation [26] with where, in the first equality, Bayes rule has been used, and in the second it has been assumed that the niche variables I k are independent. The product here is over the N niche variables under consideration as conditioning factors for B i . In the case of the relationship between Lutzomyia and mammals, N represents the number of mammal species. A score function that can be used as a proxy for P (B i |I 9 ) is where B i is the complement of the set B i . For example, if B i is the set of cells with presence of taxon B i then B i represents the set of cells without presence. S(B i | I 9 ) is a measure of the probability to find the distribution variable B i when the niche profile is I 9 . It can be applied to a spatial cell x a by determining the niche profile of the cell, I 9 (x a ). As an example, for two biotic niche variables, B 2 and B 3 , that take values 1 (corresponding to the fact that there is a point collection associated with that cell) and 0 (there is no point collection associated with the cell), the four possible biotic niche profiles of any cell are (B 2 , B 3 ) = (0,0); (0,1), (1,0) and (1,1). The score contributions of each biotic variable are S(B i |B 2 ) and S(B i |B 3 ), calculated using the above formula. Hence, S(B i | I 9 ) = S(B i | B 2 , B 3 ) = S(B i |B 2 )+S(B i |B 3 ). Thus, for any given spatial cell x a one can assign a niche profile, i.e. values of B 2 and B 3 , from whence it is possible to assign a corresponding score. If there is no statistical association between B i and B 2 or B 3 then the corresponding score contributions are zero. An overall zero score then signifies that the probability to find B i js the same as would be found if B i were distributed randomly. If the score is positive then there is a higher than random probability to find B i present and on the contrary if the score is negative. The geographical region of interest for the data of the present study is Mexico. Within this specified region there is an important question of how to choose an appropriate mesh size. The right degree of coarse graining is essentially governed by the size of the data set available relative to the data necessary to construct a given probability function. For instance, to calculate P(B i , B k ), where B i represents presence of species i in grid cell x a : If the mesh size is too small then the probability of a co-occurrence of species i and k is very small. On the other hand, if the mesh size is too big then, as well as a lack of statistical significance, discrimination will also be lost. A reasonable estimate of the appropriate cell size can be determined by assuming that the N collections are distributed randomly in an area A. An appropriate cell size is then A 1/2 /N, which corresponds to having, on average, one collection per cell. Given that we are emphasizing here pairwise associations between species, the appropriate value of N is the average number of collections for any species. A more sophisticated methodology is to consider the number of co-occurrences as a function of cell size and look for the maximum of this function. This can be done for a particular pair of species, or one may consider an average over different pairs. For our study we used 3,337 square cells of linear size 25 km which corresponds to an average number of point collections of about 20.
Checks were made with other cell sizes of 5 km, 10 km, 50 km and 100 km to assure the robustness of our conclusions. In Table 2, for the ranked list of potential reservoirs we see how the average position in the ranked list changes as a function of cell size. This shows that the relative ranking is quite insensitive to the cell size, as the z-scores of the average rank of six of the known reservoirs relative to the expected average rank if the distribution were random are highly statistically significant. In other words, the predictions as to which species are most likely to be reservoirs are robust to large changes in the cell size. In general, the absolute values of epsilon will change as a function of cell size, principally due to the effect of reducing the number of co-occurrences as one passes to large cell sizes or to very small cell sizes. However, relative values of epsilon will remain quite stable.