Positive Selection Pressure Drives Variation on the Surface-Exposed Variable Proteins of the Pathogenic Neisseria

Pathogenic species of Neisseria utilize variable outer membrane proteins to facilitate infection and proliferation within the human host. However, the mechanisms behind the evolution of these variable alleles remain largely unknown due to analysis of previously limited datasets. In this study, we have expanded upon the previous analyses to substantially increase the number of analyzed sequences by including multiple diverse strains, from various geographic locations, to determine whether positive selective pressure is exerted on the evolution of these variable genes. Although Neisseria are naturally competent, this analysis indicates that only intrastrain horizontal gene transfer among the pathogenic Neisseria principally account for these genes exhibiting linkage equilibrium which drives the polymorphisms evidenced within these alleles. As the majority of polymorphisms occur across species, the divergence of these variable genes is dependent upon the species and is independent of geographical location, disease severity, or serogroup. Tests of neutrality were able to detect strong selection pressures acting upon both the opa and pil gene families, and were able to locate the majority of these sites within the exposed variable regions of the encoded proteins. Evidence of positive selection acting upon the hypervariable domains of Opa contradicts previous beliefs and provides evidence for selection of receptor binding. As the pathogenic Neisseria reside exclusively within the human host, the strong selection pressures acting upon both the opa and pil gene families provide support for host immune system pressure driving sequence polymorphisms within these variable genes.


Introduction
The pathogenic Neisseriae possess two obligate human pathogens; Neisseria gonorrhoeae (Gc), the etiological agent of gonorrhea, and Neisseria meningitidis (Mc), which can cause bacterial meningitis [1]. For optimum infectivity, both neisserial pathogens need to form pili [2,3] as well as express Opa protein which serves as an afimbrial adhesion [4,5]. Both pathogenic species are capable of changing these surface antigens by one of two means; either by phase variation which causes on/off switching of gene expression, or via bona fide gene variation where the chemical composition of the surface antigen is modified [6]. In fact, the opa and pil genes of the pathogenic Neisseria spp. are perhaps the best studied prokaryotic systems of both phase and antigenic variation [6]. Both phase and antigenic variation can occur within the pil and opa systems, as multiple gene copies are maintained on the chromosome. Also, the Neisseria are naturally competent, allowing for the introduction of new alleles to a population. As these variable genes encode proteins that are surface exposed, each serves as an antigen that encounters immune surveillance. Consequently, it has been speculated that diversification of these multiple chromosomal genes is driven by selection pressures emplaced by the host immune system [7]. Due to this, a pathogen's genes encoding antigenic proteins and the antigen-binding sites of immune genes are speculated to be under similar selection pressures, as each are in an evolutionary "arms race" for survival [8].
Type IV pili are primarily composed of PilE polypeptides encoded by the pilE expression locus [9]. The pilE gene varies through unidirectional recombination with one of several noncoding pil genes known as pilS that reside elsewhere on the chromosome. Neisseria harbor variable numbers of pilS loci, which may contain several pil gene copies at each locus (Gc strain MS11 contains 19 pil gene copies) [10][11][12]. Each pilS gene copy lacks the 5 0 150 bp of pilE, yet contains both constant and variable regions, with the variable regions (termed minicassettes; of which there are 6) being responsible for the formation of PilE antigenic variants following recombination of pilE with pilS [13]. Minicassette mc2, located towards the 3 0 end of the gene, is regarded as the hypervariable segment of the pil genes, with this encoded segment of the PilE polypeptide experiencing the greatest exposure to the host immune system [14]. Consequently, the pilus filament, which is composed of thousands of copies of PilE polypeptide, is assembled in such a way that any conserved amino acid residues are shielded from the external environment [15].
As both Opa proteins [32,33] and PilE polypeptide [34,35] constantly encounter the immune system, it has been assumed that variation of these gene sequences would be driven by strong immune selection pressure [8]. Thus, by maintaining nonsynonymous polymorphisms within the surface-exposed portions of antigen-encoding genes, and, by harboring multiple gene copies containing these distinct nonsynonymous polymorphisms, positive selection for such changes should allow these pathogens to better evade the host immune response. However, the presence of conserved regions within the genes, and the need to maintain protein structural integrity, may cause selection pressure to differ across the genes themselves. Additionally, Mc typically resides on the extracellular surface and must traverse the mucosal barrier of the nasopharynx in order to disseminate and colonize the meninges [36], while Gc characteristically inhabit the urogenital tract and rarely disseminate. Therefore, the difference between the extracellular and intracellular lifestyles of these pathogens may cause differential evolution of their extracellular proteins. In this study, the origins of polymorphisms in each system were investigated between species (i.e., between N. gonorrhoeae and N. meningitidis; termed interspecies), between multiple strains of a single species (either within N. gonorrhoeae or N. meningitidis; termed interstrain), as well as within a single strain of either N. gonorrhoeae or N. meningitidis (termed intrastrain). By differentially assessing the selective pressure on the variable genes from Neisseria, we demonstrate that immune pressure does indeed drive nonsynonymous changes within the variable gene sequences that encode the surface-exposed regions of the proteins, thus facilitating immune evasion.

Data acquisition
All sequences were obtained through the National Center for Biotechnology Information genome database (ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/) between April 14, and June 23, 2015. This analysis included pil and opa genes from 20 fully and partially sequenced strains of Gc along with 20 fully and partially sequenced strains of Mc (S1 Table). Annotated nucleotide and amino acid sequences obtained from these strains were used as a database to conduct a BLAST homology search (10 −3 e-value cutoff) [37]. Collectively, the annotated and BLASTinferred sequences included 381 pil genes and 309 opa genes from Neisseria (Gc: pil n = 237, opa n = 138; and, Mc: pil n = 144, opa n = 17).

Amino acid and nucleotide alignments
Sequences were parsed based on characteristic elements of each gene (ie. the minicassettes of pil, the variable regions of opa). Such editing left 219 pil gene sequences (Mc: n = 71; Gc: n = 148), and 86 opa gene sequences (Mc: n = 17; Gc: n = 69). The strains and accession numbers of genes can be found in S1 Table and the sequences can be found in S1 and S2 Data. The parsed amino acid sequences were subjected to alignment utilizing MAFFT version 7.245 [38], allowing 2 alignments to be constructed between Pil and Opa sequences. These amino acid alignments were used to assemble nucleotide sequences into codon alignments through the use of PAL2NAL [39] (S1 Fig). For downstream analysis, phylogenies of the nucleotide alignments were constructed using FastTree version 2.1.8 which infers approximately-maximum-likelihood phylogenetic trees [40]. This program implements the general time reversible model (GTR) of nucleotide evolution [41] with the evolutionary rate heterogeneity determined through CAT approximation [42], and local support values calculated with the Shimodaira-Hasegawa test [43]. Although CAT likelihoods are typically higher than those calculated by the discrete gamma model, this method is considered an accurate estimator of evolutionary site rates for alignments >50 sequences. However, further inference utilizing this discrete gamma model, which rescales trees determined under the CAT approximation to optimize the Gamma likelihood under 20 different rate categories, revealed only slight differentiation among branch lengths between the CAT and gamma model trees. Additional construction of phylogenetic trees utilizing the GTR and HKY85 substitution models implemented in PhyML 3 [44] generated phylogenetic trees that were comparable amongst themselves, yet bore a slight variance in topology among trees constructed under Fasttree. As phylogenetic trees constructed under different methods in Fasttree and PhyML were comparable and only differed based on the program utilized in their construction, the datasets are presumed to be robust and differences in phylogenetic tree topologies are due to anomalies between the programs (S2 and S3 Figs). However, as FastTree is more accurate than PhyML 3 and other distance-matrix methods that are generally utilized for large datasets [40], trees generated by Fasttree were utilized for all downstream analysis.
Calculation of polymorphisms and informative sites, d N and d S Synonymous and nonsynonymous DNA substitution rates were assessed using gene alignments subjected to polymorphism analysis by DnaSP version 5.10.1 [45] and the PAML version 4.8 suite of tools [46]. DnaSP analysis included the use of a sliding window approach to visualize both synonymous and nonsynonymous polymorphisms across both intra-and interspecies alignments which identified the number of singleton and parsimony informative sites for each alignment. The method of Yang and Nielsen, YN00 [47], was employed in the PAML suite to calculate the number of synonymous polymorphisms per synonymous site (d S ) and the number of nonsynonymous polymorphisms per nonsynonymous site (d N ). This method is more reliable than those proposed by Nei and Gojobori [48] and Ina [49] as, unlike other approaches, both the HKY85 [50] and F84 [51] models are utilized to account for the transition/transversion rate bias and codon frequency when calculating the d S and d N through an approximation method. Additionally, baseml, another program within the PAML suite, was utilized to determine the rate of nucleotide substitutions [47]. Baseml analysis involved the application of seven different nucleotide substitution models (JC69, K80, F81, F84, T92, TN93, and REV) toward the pil nucleotide alignment in order to determine the best-fit model for the dataset by determining the 2δ of the upper limit of the log-likelihood from each derived tree.

Estimation of substitution saturation
As substitution saturation of one or more sequences causes a loss of phylogenetic information that can lead to erroneous conclusions, the test of substitution saturation implemented by the DAMBE program was applied toward both the opa and pil nucleotide alignments [52][53][54]. Substitution saturation occurs when nucleotide substitutions within a site occur so frequently that the position is said to be saturated with polymorphisms, when this occurs, a true phylogenetic tree can no longer be assembled from the dataset. This test of substitution saturation calculates a critical index of substitution (I SS.C ) which serves as the cut-off value in which sequences will no longer produce the true phylogenetic tree. The I SS.C value is calculated from the critical tree length (the tree length with a probability 0.95), the sequence length of the alignment, and the number of operational taxonomic units. The I SS.C can be compared to the index of substitution saturation (I SS ) calculated from the data, with a significantly smaller observed I SS value, as determined by a two-tailed Student's T test, indicating that the data is not affected by substitution saturation and is suitable for phylogenetic analysis. As the sensitivity of the method in detecting phylogenetic signals is reduced by unresolved sites and those containing gaps, only fully resolved opa and pil sites were used for the calculation of the I SS. To determine the operational taxonomic units (OTUs) within the opa and pil datasets, the 16S rRNA sequences from each Neisseria strain were aligned with MAFFT [38], and sequences that contained < 97% identity were considered as a separate OTU. This analysis identified a total of three different OTUs within the Neisseria dataset (data not shown).

Detection of recombination
To determine whether the detected polymorphisms arose due to frequent recombination, the standard measure of linkage disequilibrium, Dʹ, was estimated for each alignment [55]. The ratio of significant phylogenetically informative sites (p < 0.05) to the number of possible pairwise comparisons was used to determine if the number of detected sites would be expected under pure randomness. This test was performed with DnaSP [45]. Putative gene conversion events between pil and opa were detected using Geneconv v. 1.81 [56,57] and displayed with Circos [58]. Geneconv determines the length of fragments that contain similar quantities and patterns of polymorphisms between paired alleles. Polymorphic sites of paired alleles are then randomized 10,000 times to generate a random sample of fragments given the observed number of polymorphic sites. The observed fragments are then compared with the permuted lengths to determine whether they are longer than expected by chance and to compute p values. Statistically significant gene conversion fragments were further parsed to remove any detected recombination events that occurred entirely within constant regions.

Computational analysis for the determination of sites under selective pressures
To determine sequence divergence from neutral expectations, mathematical tests that apply the neutral theory of evolution as the null hypothesis were employed. These "tests of neutrality" can provide insights into intraspecific (Tajima's D) and interspecific (d N /d S, Poisson Random Field Model, phylogenetic analysis by maximum likelihood) selection pressures [59][60][61]. Tajima's test, the Poisson Random Field Model (PRFMLE) and phylogenetic analysis by maximum likelihood (PAML) were used to estimate the direction of selection pressures acting on the pil and opa genes.
The Tajima test uses the D T statistic, which is computed from the difference in the expectation (S) and the variance of the average number (k) of (pairwise) nucleotide differences between DNA sequences [62]. By applying the Tajima test via DnaSP, a sliding window approach allowed calculation of D T along various intervals in the alignments [45]. Additionally, the M7 and M8 site models of CODEML (in the PAML suite) were also employed [63][64][65]. All site models use the codon rather than a single nucleotide as the unit of evolution [63]. The M7 model cannot account for positively selected sites (d N /d S 1), while the alternative M8 model allows for positive selection (d N /d S > 1) . Positive selection is indicated if the M8 likelihood score is significantly better than that of M7. Likelihood ratio tests (LRTs) were performed on the log-likelihoods in order to compare the levels of significance between the null (M7) and alternative (M8) models [64]. The Poisson Random Field Model (PRF) [66], applied through the PRFMLE program, estimates the intensity of selection on synonymous and nonsynonymous sites [60,67]. The PRF model estimates the expected number of polymorphic sites in a population [66,68,69], permitting the values of the per locus mutation rate parameter (μ) and the scaled selection pressure (γ) to be estimated through a likelihood equation. The unscaled value of γ would be equal to the selective advantage of the cell expressing a given gene with the sign (+ or -) of the variable conveying the direction of selection pressure [7].

Structural prediction of conserved and variable regions
Protein structures of Pil and Opa were obtained from RCSB Protein Data Bank (PDB) [70]. These structures, along with the constructed protein alignments and phylogenetic trees, were used to visualize regions determined to be under positive selection with The Consurf Server [71][72][73][74]. The structures of Pil 2HIL [15] and Opa 2MAF [75] were used in this analysis. Protein structures were depicted with UCSF Chimera v 1.10.2 [76].

Calculation of solvent accessible surface areas
In order to determine solvent exposed regions of strain MS11 Opa (2MAF) [75] and strain C30 (a variant of strain MS11) Pil (2HIL) [15] proteins, GETAREA, a subroutine of the FANTOM program was utilized [77]. This program efficiently calculates the solvent accessible surface area through analysis of Cartesian coordinates stored in PDB format files. Additionally, the continuum solvation models implemented by this program were determined to produce protein conformations containing lower total energy values than those calculated from the empirically derived parameters of other methods [78,79].

Intra-and interspecies homology varies among Neisseria genes
Phylogenetic trees constructed from nucleotide sequence alignments can provide insights into gene divergence based upon sequence polymorphisms, insertions and deletions. Therefore, pil and opa phylogenies reveal a high degree of divergence between the two Neisseria species with both genes forming a species-specific cluster (Fig 1A and 1B). However, within a species, sequence conservation between isolates from similar geographic locations, disease manifestations (Gc), or serogroups (Mc) is relatively weak. Instead, strains of Gc and Mc from distinct locales share few homologous genes (third ring in Fig 1), implying that the observed polymorphisms within pil and opa have occurred independently, following the divergence of the two species, yet not before allowing polymorphisms to emerge prior to strain differentiation (colored boxes closest to the tree in Fig 1) and geographical isolation (third ring in Fig 1). The majority of pil and opa genes exhibit intraspecific homology, with a single pil gene of Gc and two pil genes of Mc (arrow 1; Fig 1A), along with two opa genes of Mc (arrow 3 in Fig 1B) serving as outliers.

Lack of substitution saturation validates phylogenetic analysis of opa and pil
Despite the high degree of polymorphisms identified within the variable opa and pil genes, analysis indicated that the index of substitution saturation calculated for both was significantly less (p < 0.0001) than the critical index of substitution saturation (S2 Table; corresponding to the closest number of OTUs determined for the analyzed pathogenic Neisseria strains, OTU = 4). Therefore, neither opa nor pil have experienced an excess of sequence polymorphisms required to drive sequences past the critical index of substitution saturation. This indicates that any conclusions drawn from the alignments and phylogenetic analyses are not rendered invalid due to substitutions present within these sequences.

A positive selection-dominated landscape in Neisseria variable genes
As pil [10,13] and opa [24,80] sequences are known to contain conserved and variable regions, protein alignments were constructed in order to delineate the conserved segments. Sliding window analysis revealed that the majority of detected polymorphisms occurred within two or more gene sequences and were therefore parsimony-informative (Fig 2). These parsimonyinformative polymorphisms were detected throughout the length of the opa (Fig 2A) and pil ( Fig 2B) alignments. Consequently, nucleotide sequence variation is not restricted to just the variable regions of the genes. The array of parsimony-informative sites within the opa and pil genes were then used to differentiate between synonymous and nonsynonymous polymorphisms. Determining the number of sequences that carry a synonymous or nonsynonymous polymorphism allows for the d N and d S variables for that sequence alignment to be calculated, thereby permitting the estimator of selection pressure, the d N /d S ratio, to be ascertained. As nonsynonymous polymorphisms result in a change to the resulting amino acid sequence, these polymorphisms are assumed to occur under selective conditions. Consequently, a d N /d S ratio greater than one is an indicator of positive selection. Examination of d N and d S in the opa and pil alignments revealed that there was a greater number of nonsynonymous than synonymous changes present in  Table 1). Therefore, positive selection appears to have influenced the divergence of these genes.

Differential biases in synonymous and nonsynonymous polymorphism locations
A sliding window approach was used to graphically depict the quantity of synonymous and nonsynonymous polymorphisms within the analyzed genes. The majority of polymorphisms (either synonymous or nonsynonymous) lie within the variable regions of opa (Fig 3A and 3B). However, compared to opa, relatively few polymorphisms are present within pil (as indicated by the height of the peaks in Fig 3C and 3D), yet, when present, they also occur within the variable regions, as seen in the minicassettes. Examination of interstrain polymorphisms revealed that while the amount and location of synonymous polymorphisms differ amongst species (as indicated by the strain-specific lines in Figs 4A, 4C, 5A and 5C), the extent and position of   (Fig 4B and 4D) and pil (Fig 5B and 5D) are fairly homologous (as seen by the similar location and intensity of polymorphisms of strain-specific lines). Further analysis of interstrain polymorphisms revealed that the nonsynonymous polymorphisms detected within Mc opa genes occurred almost exclusively within the three variable regions of the gene, with the majority occurring within HV1 and HV2 (Fig 4B). In contrast, Gc opa genes lacked any detected sequence polymorphisms at the 5 0 end of the alignment, even within the semivariable region (due to this Fig 4C and 4D begin downstream of the semivariable region at nucleotide 300 in the alignment, which allows greater expansion of the sliding window on the x-axis). Despite this, the nonsynonymous polymorphisms present within each Gc strain followed a similar pattern, with the majority of polymorphisms occurring within the two hypervariable domains. However, some nonsynonymous polymorphisms were observed within the membrane spanning region between the two HV regions. Overall, this analysis demonstrates almost exclusive selection for protein-altering substitutions occurring within the exposed hypervariable domains in both Mc and Gc, implying selection pressures within these regions (as evidenced in the d N /d S ratio of~1.79, Table 1).
Within pil genes, both Gc and Mc contain higher levels of synonymous substitutions near the 5 0 end of the variable regions (within mc5; Fig 5A and 5C) and display slightly elevated levels of nonsynonymous substitutions throughout the variable regions (Fig 5B and 5D). Therefore, the majority of base pair substitutions detected within the polymorphism analysis (Fig 2b)  are synonymous within the 5 0 end of pil. The comparative absence of nonsynonymous substitutions within the pil variable regions may be due to the high degree of heterogeneity, which may cause non-continuous alignments for analysis. Nonetheless, the number of detected nonsynonymous polymorphisms within pil is still sufficient to skew the d N /d S ratio in favor of positive selection as the calculated d N /d S ratio is~1.27 ( Table 1).
Analysis of linkage equilibrium within the variable genes of Neisseria spp.
Evidence exists for horizontal gene transfer occurring within opa [20] and intragenic recombination with pil [81,82]. Consequently, the standard measure of linkage equilibrium, D 0 , can be estimated. Alleles are said to be under linkage equilibrium if D 0 is less than the 0.05 proportion that would be expected under pure randomness. Estimation of D 0 for opa and pil using the two-tailed Fisher exact test revealed that frequent recombination does not account for the detected polymorphisms occurring among species, as both interspecies and interstrain analysis show evidence for linkage disequilibrium ( Table 2). The presence of constant regions within the gene sequences may have skewed the above analysis. Consequently, the variable regions were analyzed separately to determine if polymorphisms encountered within exposed regions are attributable to recombination. However, such independent analysis of the variable regions did not differ from the previous inference as opa SV, HV1-2, and pil mc1-6 were found to be under linkage disequilibrium among and within species (Table 2). However, significant D 0 values were detected between the majority of intrastrain opa (~88% of strains analyzed) and pil (~97% of strains analyzed) genes analyzed. This indicates that the majority of Gc and Mc strains contain opa and pil genes that are in linkage equilibrium (S3 and S4 Tables).
In order to detect any interspecies, as well as inter-and intrastrain horizontal gene transfer events between these variable genes, the Sawyer test was applied using Geneconv [56,57]. This analysis revealed evidence for recombination between the variable regions of both pil and opa (Fig 6), with the majority of these recombination events occurring between strains of Neisseria (interspecies =~8%, interstrain =~83%, and intrastrain =~9%). Despite the fact that opa contained more inter-and intrastrain recombination events (~81% of all interstrain and~76% of all intrastrain recombination events), pil presented more detectable interspecies recombination events (~89% of all interspecies recombination events that were detected). Furthermore, while the majority of these interspecies events were due to recombination of variable regions (~97% in opa and 100% in pil), a small portion of allelic exchanges were detected within opa (~3%). Therefore, these data indicate that pil is exchanged more often between species while opa transfer is more prevalent between and within strains. However, due to the fact that all interspecies and interstrain opa and pil genes are in linkage disequilibrium (Table 2), the majority of these detected recombination events (~91%) do not account for the polymorphisms detected in this analysis.

Confirmation of selection pressures within the variable genes of Neisseria
Various tests were employed to determine if aligned sequences fell under the neutral mutation hypothesis. Tajima's test was used to determine both inter-and intrastrain selection pressures. Analysis of entire alignments did not reveal significant evidence for selection, save for purifying selection within the opa genes of Gc strain e03.04 (data not shown). However, sliding window analysis revealed statistically significant (p < 0.05) positive selection within the interstrain The alignments of the entire length of pil (mc6-mc1) and opa as well as the individual pil minicassettes (mc1-6) and opa variable regions (SV and HV1-2) were analyzed for evidence of linkage disequilibrium. The number of possible pairwise comparisons and the number of statistically significant pairwise comparisons as determined by the two-tailed Fisher exact test were used to determine D 0 .
analysis of Gc pil (71% within minicassettes), while purifying selection was detected within the interstrain analysis of Mc pil (100% within constant regions) (Fig 7B and Table 3). All other statistically significant D T values (p < 0.05) denoted purifying selection of intrastrain opa (Mc = 1, Gc = 4) and pil (Mc = 5, Gc = 4). Of the sites predicted to be under the influence of purifying selection, less than half (opa = 18%, pil = 24%) lie within the predefined constant regions indicating a functional role for these sites (Table 3). In fact, previous data has identified the presence of pil-derived small RNAs (sRNAs) arising from experimentally verified non-   [83][84][85]. Further investigation into the Gc MS11 small transcriptome identified two non-canonical promoter motifs upstream of other pil-derived sRNAs (data not shown). Examination among the pil genes of pathogenic Neisseria revealed strong conservation of these promoter motifs (Fig 7C) as homologs were identified within every pil gene examined (n = 219). However, investigation into selection pressures utilizing non-coding nucleotide substitution analysis revealed heterogeneity (α = 0.12) among the entire pil alignment utilizing the general-time-reversible model (REV) as the best-fit model to our dataset (as calculated by the 2δ of the log likelihoods; data not shown). Therefore, the relative heterogeneity of pil within the locations harboring the motifs was analyzed independently. While motif 1 lies within a variable minicassette (mc5; shown by a black bar in Fig 7B), motif 2 is contained within a conserved region of the gene and, therefore, putative promoters within this region were found to be highly conserved (~97%; Fig 7C and indicated by a black bar between mc3 and mc2 in Fig 7B). Furthermore, although codon line), and the variable regions (gray boxes), and for opa the membrane spanning regions (red boxes). The majority of positively selected sites within pil and opa occur within the variable regions. Two pil non-canonical promoter motifs detected from small transcriptome studies of N. gonorrhoeae strain MS11 were found to be highly conserved among the 219 pil genes analyzed, as shown by sequence logos (c) in which the height of the nucleotides in the logo demonstrate their conservation among pil. Their locations within pil are shown by black bars in mc5 and the constant region between mc3 and mc2 (b). The relative percentage of amino acids present within the HV1 (d) and HV2 (e) variable regions of the Opacity proteins. The positively selected regions within HV1 and HV2 are shown as diamonds above their respective amino acid positions. As can be seen, nonsynonymous polymorphisms are highly prevalent within these positively selected regions with the amino acids present within these regions being highly varied.
doi:10.1371/journal.pone.0161348.g007 Table 3. substitution models were utilized, about 16% of variable sites determined to be under purifying selection contain promoter motif 1 (as indicated by arrow 1 in Fig 7B) all of which contain a perfectly conserved non-canonical promoter sequence (5 0 -AATATGT-3 0 ) indicating that this portion of the variable minicassette is most likely placed under purifying selection at the nucleotide level in order to maintain this promoter element [85]. The PRFMLE program was used to analyze full-length alignments and estimate the per locus mutation rate parameter (μ) and scaled selection pressure (γ). All gene alignments were tested for evidence of interspecies, interstrain (Table 4), and intrastrain (S5 Table) selection pressures along with those that may be dependent on geographical location. By assuming a Poisson distribution, PRFMLE detected evidence of positive selection in all interspecies and interstrain analysis of pil and opa (Table 4). Interestingly, opa and pil intrastrain analysis revealed positive selection within the majority of Gc strains (positive selection indicated withiñ 68% of pil and~44% of opa), with less prevalence exhibited within individual strains of Mc (positive selection indicated within~33% of pil and 0% of opa) (S5 Table).
Investigation into selection pressures using PAML corroborated the PRFMLE data. Two selection models were performed on interspecies alignments of pil and opa in order to perform a likelihood ratio test (2Δℓ) and compare the null model (d N /d S 1) with an alternative model (d N /d S > 1). The resulting likelihood values (ℓ) were used to calculate the log likelihood ratio. As the resulting p-values of each alignment were less than 0.00001, the null model (d N /d S 1), which states that polymorphisms arise under neutral conditions, was rejected and the alternative model was presumed to provide a better fit to the observed data. Genes were determined to be under positive selection by the d N /d S ratio of the alignments as established under the alternative model ( Table 5). The use of the Bayes Empirical Bayes method was then applied to ascertain the location of positively selected sites in the alignment. This analysis identified the interspecies alignment of pil, along with the interstrain alignments of pil and opa as being under positive selection. While the full-length interspecies alignment of opa did not have a A likelihood ratio greater than w 2 1% = 9.21 with d.f. = 2 shows statistically significant evidence for the existence of positively selected sites in these genes [64]. d N /d S ratio greater than one, investigation into the site classes revealed that a portion of sites showed evidence for positive selection (~17% had a d N /d S ratio equal to 4.06). The variable minicassettes contained the majority (~97%) of positively selected amino acids within pil. In opa, the hypervariable domains contained the majority of positively selected amino acids (~75% in hypervariable and~12% in semivariable) (Fig 7A) [19]. The fact that the majority of positively selected sites are harbored within the hypervariable regions of opa ( Fig 7A) conflicts with previous beliefs that the CEACAM binding domains located within HV1 and HV2 exist in specific combinations that are maintained to conserve receptor binding [86,87]. Therefore, further investigation into the hypervariable regions of Opa was performed to determine if nonsynonymous polymorphisms are biased and show preference for specific types of amino acids (ie. charged, polar, hydrophobic) within the HV1 and HV2 domains. Comparison of the amino acids encoded within the entire Opa alignment with those found within the hypervariable domains can provide insight into any patterns that may be present within these regions. Analysis revealed that the hypervariable domains exhibited variable types of amino acids, with the majority of nonsynonymous sites located within the positively selected regions (Fig 7D and 7E). Not surprisingly, hydrophobic amino acids are less prevalent within both hypervariable regions (analysis of all Opa proteins revealed that, on average,~22% occurred within the entire protein,~14% within HV1, and~16% within HV2) as these regions are exposed to the external environment. However, the positively selected regions within HV2 contain a relatively high amount of these amino acids (~23% hydrophobic aa). Conversely, positively selected regions within HV1 do not display the same propensity for hydrophobic amino acids (~7%), and instead contain a higher percentage of negatively (~10% in HV1 and~6% in Opa) and positively (~18% in HV1 and~12% in Opa) charged amino acids compared to the entire Opa protein. Finally, in both the HV1 and HV2 positively selected regions, the quantity of uncharged polar amino acids (~27% in HV1 and~28% in HV2) are higher than those found in the entire protein (~15%). Therefore, positive selection within both HV1 and HV2 differs, as amino acids within these regions are distinct from the entire protein and are additionally varied from the HV regions that were not determined to be under the influence of positive selection.
Solvent accessible surface analysis performed on the Opa 60 protein of strain MS11 within the PDB databank [75] revealed that relatively few amino acids are predicted to buried (onlỹ 26% of all amino acids) with an average solvent-accessible surface area ratio <20%. However, the majority of these buried amino acids are hydrophobic (~27% of all buried amino acids), with the membrane spanning regions (~37%) and the hypervariable domains (~5% in HV1 and~10% in HV2) containing nearly half of the buried hydrophobic amino acids. Conversely, about half of the amino acids in Opa are predicted to be solvent exposed (~42% of all amino acids) with an average solvent-accessible surface area ratio >50%. Interestingly, when these positions were analyzed against the entire Opa protein alignment, the majority of these solvent-exposed amino acids are hydrophobic (~42% of all types of amino acids are hydrophobic). However, as would be expected, a little over half of the hydrophobic solvent-exposed amino acids are found within membrane spanning regions (~55%). Nonetheless, the hypervariable domains also contain a relatively large amount of solvent-exposed hydrophobic amino acids, with a little less than half of them predicted to be under positive selection as determined by PAML (~11% present within HV1,~20% of which are under positive selection, and~17% within HV2,~52% of which are under positive selection) (S6 Table).

Structural prediction of selected sites
In order to visualize the location of sites predicted to be under the influence of selection, multiple sequence alignments were used to construct 3D protein assemblies via experimentally determined structures available through PDB. This provided further insights into the location of amino acids predicted to be under the influence of positive selection and into the forces acting on these virulence genes among species and strains. The majority of positively selected sites within Opa (Fig 8) were located within the variable regions SV, HV1, and HV2 [19,23,24,75]. Solvent accessible surface area analysis revealed that a little over half of these positively selected sites (~52%) lie within solvent-exposed regions, while the majority of the remaining positively selected sites contain variable surface-exposed ratios (~38% contain ratios between 20-50%) indicating that their surface-exposure remain undetermined (S6 Table). Corresponding to previous data [24], the more constant membrane spanning and periplasmic regions contained several positively selected sites. All sites determined to be under positive selection within PilE proteins (Fig 9) lay within the pre-determined variable regions [13]. Further solvent accessible surface area analysis of these sites revealed that a little over half of the positively selected sites (~54%) are predicted to be solvent exposed. However, similar to Opa, the majority of remaining amino acids contain solvent-exposed surface areas between those of buried and solventexposed amino acids (~36% contain ratios between 20-50%) (S7 Table).

Discussion
The maintenance of multiple, variable gene copies that encode a single antigenic determinant within pathogenic bacteria is believed to facilitate immune evasion. The evolution and maintenance of these variable genes has been attributed to inter-or intragenic transfer of alleles [7,87,88], mutation [81], or a combination of these driven by diversifying selection [81,88]. The three dimensional structure of Opa was obtained from PDB ID 2MAF [75]. The Consurf Server was used to align the amino acid multiple sequence alignment of the Opa proteins used in this study to the sequence of the 2MAF Opa60 structure [71][72][73][74]. Polymorphisms present in the Opa proteins from this study were depicted with UCSF Chimera, where cyan sites indicate the highest degree of variability and maroon sites indicate the most conserved regions [76]. The ribbons denoting the SV, HV1, and HV2 regions are shown larger than the other areas of the protein. The positively selected sites determined from the Bayes Empirical Bayes theorem implemented through PAML are shown as gray spheres [46]. This analysis shows that the three variable regions of the protein show strong evidence for being under positive selection, with the conserved regions showing relatively no evidence of sequence diversity or selection. Previous analyses of closely related opa sequences (11 opa gene sequences from 14 N. gonorrhoeae strains) have attributed the majority of sequence polymorphisms to intragenic recombination among existing alleles with little evidence to support the horizontal transfer of genes among strains, even within mixed gonococcal infections [87]. A separate analysis utilizing the same dataset of N. gonorrhoeae opa genes, in addition to N. meningitidis opa alleles (297 opa gene sequences from pathogenic and nonpathogenic strains of N. meningitidis; [86]) determined that Mc opa genes are distinct from Gc opa genes as they typically harbor genetically identical alleles [89]. Consequently, this implies that host immune pressures act to suppress the number of unique combinations of HV1 and HV2 regions in Mc Opa variants [86], but increase the number and diversity of opa alleles in Gc [89]. Similarly, investigation into pil sequence polymorphisms has been limited to only two N. meningitidis strains (1 pilE and 8 pilS sequences from strain Z2491 and 1 pilE and 8 pilS sequences from strain MC58) and provided strong evidence for sequence polymorphisms arising via intragenic recombination and horizontal gene transfer [81]. However, to date, no comprehensive analysis of sequence polymorphisms and selection pressures within the variable opa and pil genes of multiple species and strains of pathogenic Neisseria from distinct geographical locations has been determined. Therefore, this study sought to clarify the degree of genetic variability and deduce the presence of selective pressures acting upon available opa and pil sequences.
A proposed mechanism to describe the genetic variability displayed within the opa [20] and pil [81,82] alleles implicates horizontal gene transfer and subsequent recombination of opa and The three dimensional structure of Pil was obtained from PDB ID 2HIL [15]. The Consurf Server was used to align the amino acid multiple sequence alignment of the Pil proteins used in this study to the sequence of the 2HIL structure [71][72][73][74]. Polymorphisms present in the Pil proteins from this study were depicted with UCSF Chimera, where black sites indicate the highest degree of variability and white sites indicate the most conserved regions [76]. The ribbons denoting the variable minicassettes are shown larger than the other areas of the protein. The positively selected sites determined from the Bayes Empirical Bayes theorem implemented through PAML are shown as grey spheres [46]. This analysis shows that the variable regions of the protein show strong evidence for being under positive selection, with the conserved regions showing relatively no evidence of sequence diversity or selection.
doi:10.1371/journal.pone.0161348.g009 pil genes. While statistically significant horizontal transfer events of opa and pil genes among and between species and strains was detected (Fig 6), only intrastrain genetic transfer could account for the polymorphisms detected within the majority of strains determined to have significant D 0 values (within~88% of opa-encoding strains and~97% of pil-encoding strains) ( Table 2 and S3  and S4 Tables). Therefore, the preponderance of polymorphisms is attributable to linkage equilibrium within strains. Additionally, phylogenetic analysis revealed that divergence of these variable genes is only dependent upon the species, as the vast majority of opa and pil genes contain polymorphisms, insertions and deletions that are unique to either N. meningitidis or N. gonorrhoeae, and display no predisposition towards disease severity (ring 4 in Fig 1), serogroup (ring 4 in Fig 1), or geographical distribution (ring 3 in Fig 1) (as indicated by the solid lines above the tree in Fig 1). Consequently, this implies that, in addition to having diverged prior to geographical distribution, these genes are not linked to other virulence determinants and only intrastrain horizontal gene transfer events are evident in variant forms of these alleles.
While homologous recombination appears to account for the majority of polymorphisms detected within the variable genes of pathogenic Neisseriae, selection pressures have been presumed to drive antigenic variation. Efforts were taken to extrapolate whether outside forces or the intrinsic tendency for recombination and purifying selection of deleterious mutations drives the observed polymorphisms. Initial investigation indicated a role for positive selection in both the interspecies analysis of pil and opa and the Mc interstrain analysis of pil and opa as the d N /d S ratios were greater than one (Table 1). However, interstrain analysis of Gc opa genes did not show evidence for positive selection as the d N /d S ratio was less than one (Table 1). This is rather surprising as previous analysis revealed that N. meningitidis opa alleles were less diverse than those isolated from N. gonorrhoeae, indicating strong positive selection in the latter [89]. However, while these conclusions were drawn from a robust sample of N. meningitidis (297 sequences) and N. gonorrhoeae (154 sequences) opa alleles, each species was acquired from a single geographic location (the Czech Republic [86] for N. meningitidis and Sheffield and London, England for N. gonorrhoeae [87]). In contrast, the data used in this study utilized a more diverse sample of opa genes as six N. meningitidis and sixteen N. gonorrhoeae strains from distinct geographic locations (including China, France, Germany, and Africa for N. meningitidis and Sweden, North America, and Russia for N. gonorrhoae) were sampled. Therefore, while phylogenetic analysis indicated that the presence of polymorphisms was not linked to geographical isolation of strains (Fig 1), the more diverse data set used in this analysis may have provided a larger array of polymorphisms yielding a more robust d N statistic.
To further extrapolate sites under selective pressures, a sliding window approach was implemented. This analysis revealed evidence for positive selection within interspecies and interstrain pil genes (Fig 7b), while intrastrain analysis of opa and pil detected purifying selection within several strains ( Table 3). The absence of positively selected regions within a single strain as compared to their presence among species and strains indicates a greater divergence of pil genes in the latter categories (interspecies and interstrain). Such divergence is evidenced in the greater number of nonsynonymous polymorphisms between strains and species resulting in a significant D T statistic (p < 0.05) that indicates positive selection as the driving force behind this divergence (Fig 7b). The negatively selected sites detected from the intrastrain analysis of pil may similarly be masked in the interspecies and interstrain analysis by the high degree of sequence polymorphisms (Table 1 and Fig 2b). While purifying selection is presumed to occur within constant regions to conserve necessary functional and/or structural elements for the encoded proteins, only a portion of sites under intrastrain purifying selection within both pil and opa occurred within predefined invariable regions. Due to this, it may be speculated that variable regions under purifying selection also encode structural elements for the encoded proteins. Furthermore, although identified through a codon substitution model, a portion of the variable minicassettes containing negatively selected sites within pil share sequences homologous to experimentally verified non-canonical internal promoter elements [83,85]. Consequently, further investigation into a small RNA transcriptome of Gc strain MS11 revealed the presence of two non-canonical promoter motifs upstream of pil-derived sRNAs which correspond, in sequence, to experimentally verified pilE and pilS promoters (data not shown; [83,85]). Investigation into their conservation revealed that these promoter motifs are maintained as highly conserved elements (100% conservation of motif 1 and~97% conservation of motif 2) within all species and strains of pathogenic Neisseria analyzed regardless of their location in predefined invariable (motif 1 in mc5) or constant regions (motif 2) (Fig 7C). Therefore, these regions may be placed under negative selection to maintain the internal promoters. Additionally, with regard to Opa, as CEACAM binding by Opa is determined by the hypervariable domains [25][26][27], and, as negative selection for CEACAM binding has been previously determined in vivo [90], it is likely that these cells are maintaining a preferred binding potential as the negatively selected sites within the variable regions of opa occurred within the hypervariable domains as previously speculated [6].
Additional investigation into selective pressures with PRFMLE identified evidence for positive selection in all interspecies and interstrain analysis, which were confirmed in interspecies pil and interstrain opa and pil by PAML, as these analyses determined that the d N /d S ratios were greater than one (Tables 4 and 5). While full-length interspecies analysis of opa by PAML did not reveal evidence for positive selection, further differentiation revealed that a portion of the interspecies opa sequence (~17%) contained a d N /d S ratio greater than one (d N /d S = 4.06). While these tests of neutrality differ on the extent of selection within the opa and pil variable genes of Neisseria, both conclude that portions of these genes are under the influence of positive selection. The statistically significant positively selected sites determined through a Bayes Empirical Bayes approach were primarily located within the exposed variable domains of Opa and PilE (Fig 7 and S2 and S3 Figs). The existence of positively selected sites within the pil conserved (detected within~6% of sites under positive selection) and semivariable (detected withiñ 30% of sites under positive selection) regions was unanticipated, as previous analysis of PilE only detected positive selection within the hypervariable domains [81]. However, these sites were obtained from investigation of 18 pil sequences from two N. meningitidis strains (N. meningitidis Z2491 and MC58), while this study analyzed 219 individual pil genes from multiple strains of N. meningitidis and N. gonorrhoeae. Similarly, the high degree of positively selected sites within the hypervariable domains of opa (detected within~71% of sites under positive selection) conflicts with previous beliefs that the CEACAM binding domains located within HV1 and HV2 exist in specific combinations that are maintained to conserve receptor binding [86,87]. Indeed, a high degree of amino acid variability was found to exist within these positively selected hypervariable regions with relatively little evidence for any pattern associated with the polymorphisms (Fig 7D and 7E). While both HV1 and HV2 contain fewer hydrophobic residues than those found in the entire alignment (~14% of amino acids in HV1,~16% of amino acids in HV2, vs~22% in Opa), the regions determined to be under positive selection within HV2 contain a substantially higher proportion of hydrophobic amino acids (~23%). Furthermore, as many of the predicted solvent-exposed hydrophobic amino acids within HV2 were also determined to be under the influence of positive selection (~52%), tThis apparent selection for hydrophobic residues may aid in CEACAM binding. This is further supported as previous analysis of CEACAM1 identified critical roles for hydrophobic amino acid residues in Opa interactions [91,92]. The lack of preference for hydrophobic amino acids within the selected regions of HV1 (only 20% of the solvent-exposed hydrophobic amino acids) and the tendency for this region to contain a higher amount of charged or polar amino acids (~10% negatively charged,~18% positively charged,~27% uncharged polar) than what is found in the entire protein alignment (~6% negatively charged,~12% positively charged,~15% uncharged polar) does not appear to confer an advantage in CEACAM binding as evidence suggests hydrophobic interactions are also involved in receptor binding for this domain [91,92]. Other host adhesion receptors, however, contain hydrophilic interfaces with a significant number of charged residues involved in receptor binding (such as CD2 and CD58), therefore if the selected regions in HV1 are involved in receptor binding they may confer differential receptor specificity [92]. Consequently, this analysis was able to further extrapolate selected sites between available Opa and Pil encoding genes among a multitude of N. meningitidis and N. gonorrhoeae species and strains.
In conclusion, analysis of variable alleles within pathogenic Neisseria has allowed insights into the evolution and maintenance of the opa and pil genes across geographical locations, species and strains. Phylogenetic and gene conversion analysis have indicated that homologous recombination via horizontal gene transfer within strains is the most likely source of novel polymorphisms (S3 and S4 Tables). Since there is evidence for horizontal exchange of opa and pil between and within species (Fig 6), the fact that statistically significant horizontal transfer events were only detected within intrastrain polymorphisms should be further elucidated, perhaps through more sensitive tests for linkage equilibrium. However, it is likely that physical separation due to the typical sites of infection for N. meningitidis and N. gonorrhoeae has not allowed for significant interspecies horizontal exchange of pil and opa genes, a factor which has most likely played a role in the divergence of these genes between N. meningitidis and N. gonorrhoeae (Fig 1). Regardless, selection pressures applied by the host immune system appear to be the catalyst for these gene conversion events as the majority of polymorphisms are nonsynonymous causing d N /d S ratios to be greater than one (Table 1). While the variable portions of these genes contained the majority of positively selected sites (Fig 7), the degree of positive selection in these regions was great enough to skew the d N /d S ratio of the entire gene in favor of positive selection (Table 1) despite evidence for purifying selection in certain regions of the genes (Table 3). Therefore, due to the large quantity of nonsynonymous polymorphisms located mainly within exposed variable portions of the encoded peptides (Figs 3, 4 and 7), strong positive selection pressures were detected within these variable regions of opa and pil (Tables 1, 4 and 5). Indeed, the occurrence of positive selected surface exposed regions is not unique to Neisseria, as this has also been observed in vampire bat venom, and has led to the theorization that these polymorphisms act to oppose the immunological resistance developed in prey animals [93]. Therefore, within Neisseria, these strong positive pressures correspond to the obligate pathogenic nature of this pathogen, as it undergoes persistent scrutiny from the host immune system.  Table. The opa and pil genes utilized in this study. The species and strain of Neisseria is given along with the Genbank accession number, locus tag, and the chromosomal location of the gene. In some instances, the gene was lengthened to include important sequence features or shortened to remove repeat sequence elements; in either case the genomic location of the gene used in this paper is shown. (DOCX) S2 Table. Test of substitution saturation of the opa and pil nucleotide alignments.  Table. Proportion of amino acids in Opa predicted to be buried or solvent exposed. (DOCX) S7 Table. Proportion of amino acids in Pil predicted to be buried or solvent exposed.