Minimal genetic differentiation of the malaria vector Nyssorhynchus darlingi associated with forest cover level in Amazonian Brazil

The relationship between deforestation and malaria in Amazonian Brazil is complex, and a deeper understanding of this relationship is required to inform effective control measures in this region. Here, we are particularly interested in characterizing the impact of land use and land cover change on the genetics of the major regional vector of malaria, Nyssorhynchus darlingi (Root). We used nextera-tagmented, Reductively Amplified DNA (nextRAD) genotyping-by-sequencing to genotype 164 Ny. darlingi collected from 16 collection sites with divergent forest cover levels in seven municipalities in four municipality groups that span the state of Amazonas in northwestern Amazonian Brazil: São Gabriel da Cachoeira, Presidente Figueiredo, four municipalities in the area around Cruzeiro do Sul, and Lábrea. Using a dataset of 5,561 Single Nucleotide Polymorphisms (SNPs), we investigated the genetic structure of these Ny. darlingi populations with a combination of model- and non-model-based analyses. We identified weak to moderate genetic differentiation among the four municipality groups. There was no evidence for microgeographic genetic structure of Ny. darlingi among forest cover levels within the municipality groups, indicating that there may be gene flow across areas of these municipalities with different degrees of deforestation. Additionally, we conducted an environmental association analysis using two outlier detection methods to determine whether individual SNPs were associated with forest cover level without affecting overall population genetic structure. We identified 14 outlier SNPs, and investigated functions associated with their proximal genes, which could be further characterized in future studies.


Introduction
In the Amazon region, there is an increasing understanding that land use and land cover (LULC) changes caused by agricultural activity, logging, and road construction are modifying PLOS   privately owned or protected, and the field studies did not involve protected or endangered species. For each collection site, forest cover in a 1 km radius around the site was calculated; this radius was selected to reflect the approximate maximum flight range of Ny. darlingi [39]. Forest cover was calculated using European Space Agency (ESA) Sentinel 2A satellite imagery from the closest possible date to the field collection, minimizing cloud coverage. Radiometric and atmospheric corrections were performed using the ESA's Sen2Cor software v2.4.0 [40]. Maximum likelihood supervised classification was used to assign each pixel as forest or nonforest using spectral bands 2, 3, and 4. The percentage forest cover was calculated for each collection site using the formula % Forest Cover ¼ P n j¼1 a ij A 100 ð Þ, where a ij corresponds to the forest area and A corresponds to the total area of the landscape.
For each municipality group, collection sites where at least 10 Ny. darlingi were collected were selected to cover the maximum possible range of forest cover percentage within each municipality group. Collection sites were split approximately into quintiles by forest cover percentage (level 1 =~0-20% forest cover, level 2 =~20-40% forest cover etc.). Nyssorhynchus darlingi were available from all 5 forest cover levels for two municipality groups (Lábrea and Cruzeiro do Sul-area), and from 3 levels for the remaining two municipality groups (Presidente Figueiredo and São Gabriel da Cachoeira), for a total of 16 collection sites across the four municipality groups. Genomic DNA was extracted from individual Ny. darlingi using the Qiagen DNeasy Blood & Tissue Kit (Hilden, Germany), and DNA concentrations measured using a Qubit Fluorometer (Life Technologies, Carlsbad, CA, USA). For genotyping, 15 Ny. darlingi with DNA concentration �0.5 ng/μL were selected from each level 1 and level 5 collection site, and 10 from each level 2, 3, and 4 site, for a total of 190 Ny. darlingi (Table 1).

nextRAD genotyping-by-sequencing
Genomic DNA from the 190 individuals was converted into nextRAD genotyping-bysequencing libraries by SNPsaurus, LLC as in [41]. Genomic DNA was first fragmented with Nextera reagent (Illumina, Inc., San Diego, CA, USA), which also ligates short adapter sequences to the ends of the fragments. The Nextera reaction was scaled down to fragment 3 ng of genomic DNA (the kit is optimized to fragment 50 ng). Fragmented DNA was then amplified using the Phusion Hot Start Flex DNA Polymerase (New England Biolabs, Inc., Ipswich, MA) for 25 cycles at 75˚C, with one of the primers matching the adapter and extending 8 nucleotides into the genomic DNA with the selective sequence TGCAGGAG. Thus, only fragments starting with a sequence that can be hybridized by the selective sequence of the primer will be efficiently amplified. The nextRAD libraries were sequenced on a HiSeq 4000 with two lanes of 150 bp reads (University of Oregon). All sequencing reads were uploaded to the NCBI SRA database (BioProject ID: PRJNA545461).

Sequence processing
Raw sequence reads were analyzed using STACKS v2.3b [42,43]. Briefly, the STACKS pipeline collects raw sequencing reads together into matching stacks, then builds a catalog of putative consensus RAD loci, which span the length of the amplified RAD fragments, by combining stacks from multiple individuals. The pipeline then matches individuals against the catalog of loci, and calls SNPs for each individual at each locus based on a maximum likelihood framework. Finally, the set of loci is filtered based on their frequencies in the study populations. Low-quality reads were dropped using the STACKS process_radtags program, and ustacks was used to align reads into stacks, with the minimum depth of coverage required to create a stack set to 3, the maximum distance allowed between stacks set to 4, and the maximum distance allowed to align secondary reads to primary stacks set to 6. A catalog of putative loci was built using 24 representative Ny. darlingi individuals (S3 File) collected from Brazil and Peru, including 5 individuals from the current study, 4 from [44], 2 from [33], and the remainder from Chu et al. (manuscript submitted). All 24 individuals were sequenced, and the reads processed in ustacks, using the methods described above. The catalog was built using the STACKS cstacks program, allowing 4 mismatches between stacks, with gapped alignments enabled. The catalog loci were mapped to the Ny. darlingi genome (AdarC3) [45] using BWA MEM with default parameters [46]; only loci mapping to the genome were retained. After processing in ustacks, stacks from the 190 individuals in the current study were searched against the catalog, and SNPs called, with the STACKS sstacks, tsv2bam, and gstacks programs using default settings.
To control for quality and sequence coverage variation among individuals, only individuals that matched to at least 10,000 catalog loci were included in the analysis (n = 164, including at least 6 individuals from each municipality group/deforestation level combination; Table 1). The STACKS populations program was used to select the first SNP from each RAD locus found in all 4 municipality groups, and in at least 50% of the individuals in each municipality group, with the minimum minor allele frequency set to 0.02 and the maximum observed heterozygosity set to 0.7. A bash script including the STACKS parameters used is included as S4 File, and the STRUCTURE file used for subsequent analyses is included as S5 File.
Population structure analysis STRUCTURE v2.3.4 [47] was run using the program StrAuto [48] for 10 replicates each of K = 1 to 8, with a burn-in of 100,000 generations and an MCMC chain of 1,000,000 generations. The Evanno method [49] implemented in STRUCTURE Harvester [50] was used to determine the optimal number of genetic clusters. CLUMPP v1.1.2 [51] was run using default settings, and STRUCTURE plots visualized, using the R v3.5.2 [52] package pophelper v2.2.7 [53]. As a less computationally intensive method to investigate substructure within municipality groups, fastStructure [54] analysis was run for the full dataset (to confirm that results were comparable to STRUCTURE results) and then for each municipality group separately using default settings for five replicates each of K = 1 to 10. The replicate runs were combined using CLUMPP and visualized in pophelper as above.
A hierarchical Analysis of Molecular Variance (AMOVA), with individuals grouped into forest cover levels within municipality groups, was calculated using the poppr.amova() function in the R package poppr v2.8.2 [59]. Pairwise F ST values with confidence intervals using 999 bootstrap samples were calculated using the stamppFst() function in the R package StAMPP v1.5.1 [60]. Isolation by distance was examined by plotting the geographic distance between each collection site vs. Prevosti's genetic distance (calculated using the dist.genpop() function in adegenet) and calculating a Mantel test to compare the two distance matrices using the mantel.randtest() function in ade4.

Outlier analysis
Two methods were used to identify SNPs associated with forest cover. First, latent-factor mixed modelling (LFMM) [61] was run using the R package LEA v2.4.0 [62]. In preparation for LFMM, a sparse non-negative matrix factorization (sNMF) [63] analysis completed using LEA was run with ten repetitions of K = 1 through 10. The optimal number of populations, where the cross-entropy curve was at a minimum, was four. The sNMF results were used to impute missing genotypes using the LEA impute() function. Five repetitions of LFMM were run on the imputed dataset, with a burn-in of 5000 and 10000 iterations for each, adjusting for four latent factors (as suggested by optimal number of populations from the sNMF analysis), with forest cover percentage as the explanatory variable. The z-scores from the five runs were combined, and the p-values adjusted for multiple testing using the Benjamini-Hochberg procedure as recommended by the authors [61]. SNPs with an adjusted p-value of 0.05 were considered outliers.
As a second method to identify SNPs associated with forest cover, bayenv2 [64] was run, following conversion of the vcf file in PGDSpider v2.1.1.5 [65], with each collection site as a population. Three replicate covariance matrices were computed, using 100,000 iterations. As no significant differences between the three matrices were detected using the cortest() function in the R package psych v1.8.12 [66], the first matrix was used. bayenv2 was run using the correlation matrix and the standardized forest cover percentage for each population, for 100,000 iterations, using the -c flag to include non-parametric tests. SNPs that were within both the top 5% of Bayes factors and the top 10% of Spearman's ρ values were considered outliers. The final set of outlier SNPs includes only SNPs identified using both LFMM and bayenv2.

Gene ontology analysis
Genes in the annotated Ny. darlingi genome scaffolds [67] located within 100 kb of each outlier SNP were investigated for gene function. A wide search window was selected because this was intended to be a broad, exploratory investigation with the goal of hypothesis generation. Drosophila orthologs of all genes were investigated. In addition, a gene ontology (GO) enrichment analysis was performed using the R package topGO v2.34.0 [68] to determine whether particular GO terms were enriched among these genes compared to the rest of the genome. Separate Fisher tests were calculated for each sub-ontology (BP: biological process, CC: cellular component, MF: molecular function) using the weight01 algorithm and a cut-off p-value of 0.01.

Population structure
STRUCTURE Harvester analysis determined that the optimal number of genetic clusters was two, though the estimated natural log probability of the data (lnPr(X|K)) started leveling off at K = 5 (Fig A in S1 File). STRUCTURE results for K = 2-6 are shown in Fig 2, with individuals grouped by municipality group and forest cover level. Overall, there is evidence for structure among the four municipality groups, particularly between the Cruzeiro do Sul area and the other three municipalities, but no evidence of substructure among forest cover levels within each municipality group. fastStructure results were consistent with the STRUCTURE results (Fig B Panel A in S1 File), and separate fastStructure analyses for each municipality group confirmed that there was no evidence for microgeographic structure within each municipality group (Fig B Panels B-E in S1 File).
A PCA similarly showed separation among CRU, PRE, and LAB/SAO along the first two dimensions, and between LAB and SAO along the third dimension (Fig C in S1 File). Though the Bayesian Information Criterion (BIC) of the k-means clustering algorithm used in preparation for DAPC indicated that the optimal number of clusters was one, the Akaike Information Criterion (AIC) indicated that the optimal number of clusters was three or four (Fig D in S1  File). Setting the number of clusters to four discriminated perfectly among the four municipality groups, while setting the number of clusters to three collapsed SAO, LAB, and one individual from PRE into a single cluster (Fig 3).
By hierarchical AMOVA ( Table 2), 95% of the total genetic variation was within or between individuals, 4.5% was between municipality groups, and only 0.5% was among forest cover levels within municipality groups, supporting weak genetic differentiation between the municipality groups and very little differentiation among forest cover levels. Pairwise F ST values between municipality groups ranged from 0.026 (Lábrea vs. São Gabriel) to 0.075 (Presidente Figueiredo vs. Cruzeiro do Sul area) ( Table 3). There was evidence for isolation by distance (Mantel r = 0.84, p = 0.001, Fig E in S1 File).

Outlier and gene ontology analyses
To determine whether there were individual SNPs associated with forest cover percentage, we conducted environmental association analyses using the 5,561 SNP dataset. LFMM detected 67 SNPs associated with forest cover percentage, and bayenv2 detected 139. Of these, 14 SNPs were detected by both methods (Figs F-G in S1 File). In the Ny. darlingi genome scaffolds, 150 genes are located within 100 kb of these 14 SNPs. These genes have a variety of functions in numerous structural and cell signaling pathways (S6 File). One (ADAC006142) encodes a venom allergen. The topGO analysis found six GO terms significantly enriched among these 150 genes compared to the rest of the Ny. darlingi genome (Table A in S1 File): GO:0035278 (miRNA mediated inhibition of translation), GO:0018149 (peptide cross-linking), GO:0005 576 (extracellular region), GO:0003810 (protein-glutamine gamma-glutamyltransferase activity), GO:0052689 (carboxylic ester hydrolase activity), and GO:0035091 (phosphatidylinositol binding).

Discussion
We did not find evidence of microgeographic genetic structure among 164 Ny. darlingi collected from multiple forest cover levels within four Amazonian Brazil municipality groups based on model-and non-model based analyses of genotypes at 5,561 SNP loci. This could indicate that there are high levels of gene flow across populations of Ny. darlingi in areas with different forest cover levels, consistent with studies in other insects in South America [69,70] (but see [71]). Our results are not consistent with a previous study in Ny. darlingi [23], that found microgreographic structure between two municipalities in Brazil with divergent forest cover levels based on analysis of~2,000 SNP loci generated using ddRADseq. It is possible that the structure between the two municipalities in the Campos et al. study was due to unmeasured differences between the two municipalities not related to forest cover, such as differences in breeding site ecology or vector control activities. The current study more comprehensively  investigates the relationship between forest cover and population structure, as it includes intermediate forest cover levels, as well as four replicate municipality groups. It is also possible that the adaptive response to forest cover differences varies among Ny. darlingi populations in different locations due to other factors in the external environment.
Anthropogenic changes in forest cover may produce adaptive phenotypic changes in mosquito populations not reflected in genomic studies, particularly among Ny. darlingi, as this species has been shown to display a high degree of plasticity in life history traits [72] and biting behavior [44]. Deforestation has been shown to affect the survival and reproductive fitness of An. gambiae [73], An. arabiensis [74], and An. minimus [75]. Additionally, it is possible that whole genome sequencing, or identification of structural variants [76,77] could identify genetic adaptation to forest cover within Ny. darlingi populations not reflected in our SNP dataset.
We identified weak to moderate genetic differentiation (F ST = 0.026-0.075) among four municipality groups across Amazonian Brazil separated by 800-1,600 km. This is consistent with a previous study using microsatellites that found similar F ST values comparing Ny. darlingi collected from localities in central and western Amazonian Brazil separated by comparable geographic distances to the current study [34]. However, it contrasts with previous findings that Ny. darlingi from Amazonian Brazil belong to a single genetic population [33,35,36]. This discrepancy could be the result of a combination of increased resolution provided by nextRAD genotyping-by-sequencing [23], different geographic scales of analyses of these studies, and the use of different collection sites. It is clear that more research into the population genetic structure of Ny. darlingi at both continental and regional scales is needed.
The differentiation that we identified between the municipality groups could be the result of isolation by distance (IBD), isolation by barrier (IBB), isolation by resistance (IBR), or isolation by environment (IBE). It is not possible to differentiate between these possibilities with the sampling scheme of the current study because of the lack of intermediate sampling points between these municipality groups. The overall low level of differentiation between municipality groups, with the highest F ST (0.075) between the two most geographically separated municipality groups (Presidente Figueiredo and Cruzeiro do Sul area), in combination with a significant Mantel test, is suggestive of IBD. Several previous population genetics studies of Ny. darlingi using nuclear and mitochondrial markers detected IBD [35,36,78,79], including one study that genotyped Ny. darlingi at eight microsatellite loci from municipalities in Amazonian Brazil at a similar geographic scale to the current study [34]. However, other studies of Ny. darlingi have not found evidence of IBD [33,80,81]. Additionally, there are geographic barriers between the municipality groups in the current study, including rivers, primary forest, and extensive areas of unsuitable habitats. It is possible that the particular pattern of barriers Ny. darlingi population genetic structure across forest cover levels separating the municipality groups could explain the fact that individuals from SAO and LAB are more similar than other pairs of municipality groups that are separated by a similar distance (Table 3). Future studies could explore the effects of these barriers on genetic divergence between these Ny. darlingi populations.
We identified 14 SNPs associated with forest cover percentage using two outlier detection methods. These SNPs were located within 100 kb of 150 genes, among which 6 GO terms were over-represented compared with the rest of the Ny. darlingi genome. We acknowledge that this type of analysis is limited both by the highly fragmented nature of the current Ny. darlingi genome assembly [45], and by the possibility of false positives [82]. Therefore, we present these results cautiously, with the intention of generating hypotheses for future studies.

Conclusions
Using nextRAD genotyping-by-sequencing, we report weak genetic structure among four municipality groups in the Amazonian Brazil, but a lack of microgeographic structure across forest cover levels within these municipality groups. These results do not preclude an adaptive response of Ny. darlingi to deforestation in the Amazon, but indicate that such an adaptive response was not associated with genome-wide differentiation. Additional studies using whole genome sequencing and an improved Ny. darlingi genome assembly should be undertaken to further explore this topic. Table A