Positively selected amino acid replacements within the RuBisCO enzyme of oak trees are associated with ecological adaptations

Phylogenetic analysis by maximum likelihood (PAML) has become the standard approach to study positive selection at the molecular level, but other methods may provide complementary ways to identify amino acid replacements associated with particular conditions. Here, we compare results of the decision tree (DT) model method with ones of PAML using the key photosynthetic enzyme RuBisCO as a model system to study molecular adaptation to particular ecological conditions in oaks (Quercus). We sequenced the chloroplast rbcL gene encoding RuBisCO large subunit in 158 Quercus species, covering about a third of the global genus diversity. It has been hypothesized that RuBisCO has evolved differentially depending on the environmental conditions and leaf traits governing internal gas diffusion patterns. Here, we show, using PAML, that amino acid replacements at the residue positions 95, 145, 251, 262 and 328 of the RuBisCO large subunit have been the subject of positive selection along particular Quercus lineages associated with the leaf traits and climate characteristics. In parallel, the DT model identified amino acid replacements at sites 95, 219, 262 and 328 being associated with the leaf traits and climate characteristics, exhibiting partial overlap with the results obtained using PAML.


Introduction
RuBisCO is one of the best-studied enzymes and is often used as a model protein in evolutionary studies. During photosynthesis, RuBisCO binds CO 2 to the Calvin cycle intermediate ribulose-1,5-bisphosphate (RuBP), thereby acting as the essential entry point for carbon into the biosphere. Due to its imperfect ability to distinguish between CO 2 and O 2 , RuBisCO also a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 catalyzes the oxygenation of RuBP, giving rise to the energy-dissipating process of photorespiration. Compared to other catalysts, RuBisCO is a sluggish enzyme, with a catalytic turnover rate (k cat c ) of about 3 s −1 in terrestrial plants [1]. Alongside these catalytic imperfections and its large molecular weight, RuBisCO also represents a significant nitrogen investment, typically accounting for 25-30% of the leaf total nitrogen in C 3 plants [2].
The photosynthetic process adapts to abiotic stress, such as high temperature or water deficit [3,4,5], by optimizing leaf conductance (stomatal and mesophyll) governing CO 2 diffusion [6] and by adjustments in the activity and concentration of RuBisCO and other rate limiting enzymes [7,8,5]. Temperature and CO 2 concentration at RuBisCO active sites are the main driving forces of RuBisCO evolution and adaptation [9,10,11,12,13,14,15]. Computational analysis of carbon uptake at the leaf [16] and canopy level [17] also suggests that optimization of RuBisCO kinetics in modern C 3 plants depends on the temperature regime and CO 2 concentration. Therefore, plants from dry environments and plants with high leaf mass per area have the lowest CO 2 diffusion, and tend to have higher RuBisCO affinity for CO 2 [12,18]. By contrast, plants possessing the C 4 carbon concentration mechanism have faster, but less CO 2 specific RuBisCO [19,20,21,22,23,24]. High temperatures decrease the ratio of CO 2 /O 2 dissolved in the leaf liquid media, and directly decreases the affinity of RuBisCO for CO 2 [25]. Accordingly, adaptation to higher temperatures can be achieved by a greater specificity of RuBisCO for CO 2 (S c/o ), thereby reducing the loss of carbon due to photorespiration. Selection pressure on RuBisCO with increased S c/o in hot environments has been demonstrated in some thermophilic red algae [26] and in terrestrial plants [12]. Because of the trade-off between RuBisCO affinity for CO 2 and maximum carboxylase activity (k cat c ), the selection for increased affinity for CO 2 would inevitably take place at the expense of decreased k cat c [13,14]. Such fine-tuning of RuBisCO kinetic traits is attributed to environmentally driven changes at the molecular level, most likely amino acid replacements within the catalytic large subunit.
In higher plants and green algae, the structure of RuBisCO consists of eight chloroplastencoded large (L, 50-55 kDa) and eight nucleus-encoded small (S, 12-18 kDa) subunits assembled into a hexadecamer [27]. Large subunits possess the active site and therefore primarily determine RuBisCO kinetic traits [28], although recent studies demonstrate that S-subunits can also influence catalysis [29,30,31]. Directed mutagenesis and a variety of recombinant RuBisCOs from plastome-transformed plants allowed identifying molecular changes in L-subunit that translate into changes in RuBisCO catalysis, as well as determining how they affected photosynthesis and plant growth [32,33,34,35,36,37]. Recent studies have demonstrated the relationship between amino acid polymorphism in the L-subunit of RuBisCO and catalytic efficiency in natural vegetation and crops, by comparing both distant phylogenetic lineages [18,38,39] and closely related species [15,40] of land plants.
Studies comparing the rates of non-synonymous and synonymous substitutions along phylogenies have demonstrated that positive Darwinian selection is acting on RuBisCO within most lineages of plants, but is restricted to a relatively small number of residues [41,42,43,44,45,46,47,48,49]. Results derived from analyses of RuBisCO molecular adaptation complement trends in RuBisCO kinetics and confirm the predominant role of some environmental and physiological factors driving RuBisCO evolution. For example, signatures of positive selection are associated with changes in intracellular concentrations of CO 2 driven by carbon-concentrating mechanisms, both in algae and terrestrial C 4 plants [43,48,49,50].
Mapping positively selected residues within the protein structure helps to locate catalytically important regions of RuBisCO, and suggests candidate amino acid replacements which could be implemented to optimize RuBisCO performance in crops [42,48,49]. However, the effect of an amino acid replacement on protein properties could vary in the presence of other mutations, either individually or together, because of the molecular sign epistasis among mutations [51]. These epistatic interactions impose strong selective constraints on amino acid replacements and also may explain the failure of most attempts to improve RuBisCO catalysis by single point mutations [34,52]. In agreement with this prediction, positive selection analysis must also account for co-adaptive amino acid replacements through the identification of coevolutionary signatures to find how key residue changes affect RuBisCO structure and function. Coevolutionary studies have been applied to various proteins [53,54,55], but only recently to RuBisCO [47,56]. It has been shown that coevolution of residues is common in RuBisCO of land plants and there is an overlap between coevolving and positively selected residues [56].
Evolutionary analyses are needed to identify adaptive changes in the Rubisco sequence, but the drivers of such evolution must also be investigated. In this paper, we used a predictive model called decision tree (DT), which is able to statistically associate a combination of environmental variables to variation in the amino acid residues. A DT can be used for both classification (classification tree) and regression (regression tree) tasks. We used this model for classification tasks, which are frequently employed in applied fields such as engineering and medicine [57,58,59,60,61]. A DT implicitly performs feature selection and requires relatively little effort for data preparation. The analysis is straightforward, the results shown graphically and they can be easily interpreted.
The objective of the study was to investigate molecular adaptation of Quercus RuBisCO to particular ecological conditions and to test if leaf morphological traits are associated with adaptive amino acid substitutions. To achieve this, we compared two different methodologies: the DT model and phylogenetic analysis by maximum likelihood. We selected oak (Quercus) species as a model group for this study because this genus contains a large number of species (ca. 500) inhabiting a wide range of environments. Both evergreen and deciduous oak species have contrasting leaf morphology [62], and therefore variable diffusive limitations to CO 2 transfer from the atmosphere to the site of carboxylation [63]. Finally, oaks are often an ecosystem-defining species in most broad-leaved forests worldwide making them an ecologically important group.

Taxon selection and sampling
A total of 174 species in Fagales were selected for the study (S1 Table). These species belong to the Fagaceae (n = 170) and Nothofagaceae (n = 4). Within Fagaceae, the majority of the species belong to Quercus (n = 158; ca. 30% of the total number of Quercus species).
Each species was classified according to its geographic distribution, prevalent climate and leaf habit (S1 Table). The geographic distribution area of each species was assigned according to Govaerts et al. (1998) [64] and information found in publicly available databases [65, 66, 67, 68]. The prevalent climate was obtained by overlapping the species geographical distribution in our study and the Köppen-Geiger world map of climate classification [69]. To simplify the analysis, fifteen Köppen-Geiger climate types were grouped into six: 1) tropical (including climates Af, Am and Aw according to Köppen-Geiger classification); 2) arid steppe (Bsh and Bsk); 3) temperate with dry winter and hot or warm summer (Cwa and Cwb); 4) temperate with dry summer and hot or warm summer (Csa and Csb); 5) temperate or cold without dry season and hot or warm summer (Cfa, Cfb, Dfa and Dfb) and 6) cold with dry summer and hot or warm summer (Dsa, andDsb) (S2 Table). Regarding the leaf habit, species were classified as evergreens (those species retaining their leaves during the whole year), deciduous (when losing all leaves during the unfavourable season) and semi-evergreen (those species that lose some leaves during the unfavourable season, depending on its length and severity).
Leaves from all species were sampled from living collections of Jardín Botánico de Iturrarán (Parque Natural de Pagoeta, Aya, Guipúzcoa, Spain), with the exceptions of Q. palmeri, Q. baloot and Q. vaccinifolia, which were collected from The Cheviton Barton collection (Bevon, UK). For each species, leaf density was calculated from leaf thickness and leaf mass area (LMA) measurements performed on fully expanded leaves that developed in the external part of the tree canopy (i.e., exposed to full solar irradiation). The leaf thickness of each species was measured on two discs (disc area = 0.33 cm 2 ) per leaf from five fully hydrated leaves, collected from three to five different individuals. The leaf thickness was measured using a digital contact sensor GTH10L coupled to an amplifier GT-75AP (GT Series, Keyence Corporation, Japan) [70]. Afterwards, LMA of each disc was calculated as the ratio between the dry weight and the area. The dry weight was obtained after drying the leaf discs in a ventilated oven at 60˚C until constant weight (typically after 2 days).

DNA sequencing
Total genomic DNA was extracted from leaf material using the DNeasy Plant Mini Kit (Qiagen Ltd., Crawley, UK) according to the manufacturer's protocol.
All PCR reactions were performed using the BioMix Red reagent mix (Bioline Ltd., London, UK). The PCR program for the amplification of the rbcL comprised an initial denaturation at 95˚C, 2 min, and 36 cycles of 93˚C for 30 s, 53˚C for 30 s and 72˚C for 3.5 min, and a final extension at 72˚C for 30 min. The PCR program for the amplification of the matK gene comprised an initial denaturation at 95˚C for 2 min, followed by 35 cycles of 30 s at 94˚C (denaturing), 45 s at the annealing temperature of 56˚C, 2 min at 72˚C (extension), and a final extension phase of 7 min at 72˚C. The microsatellites were amplified using the following PCR conditions: 95˚C for 2 min, and 35 cycles of 95˚C for 30 s, 50˚C for 30 s and 72˚C for 2 min, and a final extension at 72˚C for 5 min. The rbcL and matK PCR products were separated on 2% agarose gels buffered with 1X TAE and purified using Roche High Pure PCR Product Purification Kit (Roche Diagnostics Corporation P.O., Indiana, USA). Chloroplast gene sequencing was performed using an ABI 3130 Genetic analyzer with the ABI BigDyeTM Terminator Cycle Sequencing Ready Reaction Kit (Applied Biosystems, Foster City, California, USA). For microsatellites, we used the ABI 3130 XL Genetic analyzer and fragment analysis was performed using the GeneMapper software v4.1 (Applied Biosystems). The DNA sequences from the chloroplast markers were aligned using Clustal X [75] and manually adjusted with Bioedit v.7.2.5 [76]. All variable sites were checked against the original sequence chromatograms, and doubtful regions were sequenced again. All newly generated sequences were submitted to the GenBank (S3 Table).

Phylogenetic analyses
We inferred the phylogenetic relationships from the nucleotide data using Bayesian inference (BI). We constructed a phylogeny using rbcL sequences from the 158 Quercus species (denoted Quercus large dataset) (Fig 1). The tree topology was not fully resolved for this group when using only one gene. Because we require a robust phylogeny to detect adaptive evolution by maximum likelihood, we chose a subset of species to construct a multilocus tree with betterresolved topology. The tree was constructed with a concatenated alignment of 45 rbcL, 43 matK and 42 SSRs for Quercus species (denoted Quercus small dataset) (Fig 2). Tree topologies using rbcL were congruent with those based on the use of multiple genes, with both leading to similar lists of amino acid sites detected to have evolved under positive selection. Finally, we constructed the phylogeny for all 174 species containing the Fagaceae and Nothofagaceae species (denoted Fagales henceforward) (Fig 3).
Nucleotide sequences were translated into amino acid sequences with MEGA 5 software [78] and aligned online using MAFFT [79]. The optimal DNA substitution model was determined by Modeltest v.3.7 package [80,81] by comparing available models using Bayesian information criterion (BIC). BI was performed in MrBayes version 3.2 [82] allowing different models for each region (rbcL, matK and SSRs). Markov Chain Monte Carlo (MCMC) used two independent runs of 1 × 10 6 generations. Trees for Quercus small and Fagales datasets were sampled every 300 generations. For the Quercus large dataset the MCMC used independent runs of 5 × 10 6 generations and trees were sampled every 100 generations. The first 25% of the runs were discarded as burn-in. The trees sampled before reaching a stable posterior probability (PP) were excluded from the consensus. A majority rule consensus of the remaining trees from the two runs was edited in FigTree v 1.4.0 [77] and used as BI tree.

Maximum likelihood tests for positive selection
Six different codon based models were applied using the codeml program of the PAML package version 4.7 to test for the presence of positive selection [83]. These models were compared for goodness-of-fit to the data and phylogenies using the Likelhood Ratio Test (LRT) and the best model was used to estimate the nonsynonymous-to-synonymous rates ratio (ω = d N /d S ). This ratio represents the selective pressures acting on the protein-coding gene with values of ω = 1, ω < 1, and positive ω > 1, being indicative of neutral evolution, purifying selection and positive selection, respectively. BI trees were used as the reference topologies for the PAML analyses (Figs 1-3).
Site models allow the ω ratio to vary among codons in the protein [84,85]. Model M1a assumes the same selection pressures on all branches of the phylogenetic tree. In this model, codons can either evolve neutrally or under purifying selection, and thus the estimated values of ω < 1 and/or ω = 1. Model M2a allows for an extra category of codon site compared to M1a which can evolve under positive selection (ω > 1). Model M8a assumes a discrete beta distribution for ω, which is constrained between 0 and 1 including a class with ω = 1. Model M8 allows the same distribution as M8a with an extra class of codons under positive selection with ω > 1.
Branch-site models allow ω to vary both among sites in the protein and across branches on the tree with the aim to detect positive selection affecting a few sites along particular branches (known as foreground branches). The branch-site A model was applied for branches leading to species with high or low leaf density; deciduous, evergreen or semi-evergreen species and species living in climates 1, 2, 3, 4, 5 and 6. When the number of species inhabiting a particular climate represented less than 15% of the total species analysed, then this climate was discarded for the branch site test. Model A1 allows 0 < ω < 1 and ω = 1 for all branches and also two additional classes of codons with fixed ω = 1 along pre-specified foreground branches while restricted as 0 < ω < 1 and ω = 1 on background branches. The alternative model A allows 0 < ω < 1 and ω = 1 for all branches and also two alternative classes of codons under positive selection with ω > 1 along pre-specified foreground branches while restricted to values of 0 < ω < 1 and ω = 1 on background branches.
We performed three LRTs to compare the nested site models M1a-M2a, M8-M8a and branch-site models A-A1. The LRTs values are calculated as twice the difference in the loglikelihood values of the models being compared, with the degrees of freedom being the difference in the number of parameters estimated in each of the models. LRT value can be approximated to a chi-square distribution. For the comparisons between M1a-M2a, M8-M8a and A-A1 the df was 2, 1 and 0.5, respectively.

Coevolution analyses
CAPS software [86] was used to test for dependencies among amino acids on the RuBisCO structure. We used the Bayesian trees of 158 Quercus large and 174 Fagales as topology references for the analyses. CAPS compares the correlated variance of the evolutionary rates at two sites. This variance is corrected by the amount of divergence between the sequences compared using either the synonymous nucleotide substitutions or, alternatively, amino acid replacements as a relative measures of time. For each protein alignment, the corresponding BLOSUM matrix was applied depending on the average sequence identity. The significance of the results was evaluated by randomization of pairs of amino acid sites in the alignment, calculation of their correlation values, and comparison of the real values with the distribution of 10,000 randomly sampled values. An alpha value of 0.01 was applied to minimize the number of false positives. The level of substitutions per synonymous site weighted the correlated variability among amino acid sites in order to normalize parameters by the time of sequence divergence. The method detects phylogenetic-independent coevolution. We also conducted an analysis of the statistical support for each of the coevolving amino acid pair using a non-parametric bootstrap analysis. Briefly, for each pair exhibiting significant coevolution signatures we shuffled the sequences across the tree and we then re-ran CAPS on the new resulting alignment. We then identified coevolving pairs of amino acid sites and checked for the presence of the pair identified in the original non-random alignment. We repeated this procedure 1000 times and for each pairs of coevolving sites determined its frequency as the number of times it is detected in the 1000 replicates divided by 1000. A pair of coevolving sites was considered to be significantly represented in the bootstrap procedure when its frequency was equal or larger than 0.8.

Decision tree model
Decision tree (DT) model analysis ("rpart" package in R v3.1.1) [87] was used to relate the proportion of amino acids present in all variable positions of the L-subunit of RuBisCO to species-specific traits (geographic area, climate, leaf habit and density), denoted as external variables.
For each variable position, the program builds a DT as follows. First, a question is found based on the analysis of all three external variables to split the species. Then, based on that question, the species are separated into two groups, in which the variability of that site is as low as possible. The analysis is repeated for each subgroup using all three external variables. The process continues until the lowest entropic error (xerror) for the entire DT is obtained [88]. The quality of the DT is categorized by its xerror as a function of the proportion of correct predictions and the complexity of the tree. The lower the xerror, the higher the relationship between the external variable and the variable site. Only DTs with xerror < 1 were selected. The program also calculates the importance of each external variable within the predictive model.
An advantage of DTs is that no statistical assumptions (about the independence, the distribution, the variance, etc.) are needed. The main limitation of DTs is to identify the optimal tree under certain criteria, so algorithms are employed to give an approximate solution.
Complete sequences of the chloroplast matK gene and nuclear SSRs were obtained for 43 and 42 species, respectively (Quercus small dataset). The phylogenetic tree constructed for the Quercus small dataset with rbcL, matK and SSRs is well resolved with posterior probabilities > 50% (Fig 2). The tree topology was similar to that of Manos et al. (1999) [89] based on combined chloroplast DNA and nuclear internal transcribed spacers (ITS).

Positive selection in Quercus rbcL
LRTs for positive selection (Table 1) indicated that the free-ratio model, that allows estimating ω for each of the branches of the tree, was significantly better than the models that do not allow for differences in ω values among tree branches (p-value = 0.0001). Models M2a and M8 both pointed to positive selection on rbcL in Quercus (small and large datasets) and Fagales. The three datasets exhibited positive selection at the amino acid sites 95, 219 and 328 (Table 1). In Quercus, asparagine (Asp) and serine (Ser) occurred at site 95, however threonine was found (Thr) in Nothofagus antarctica and N. procera (S4 Table). In the three groups, valine (Val) and leucine (Leu) occurred at site 219, and alanine (Ala) and Ser were found at position 328. In both the Quercus small and Fagales datasets, site 262 (Val or Ala) was positively selected. Sites 251 and 475 appeared as positively selected only in the Quercus large dataset. Isoleucine (Ile) and methionine (Met) occurred at residue 251, and Leu and Val occurred at residue 475. Site 145 in Fagus, Lithocarpus and Nothofagus was either a Val or Ala, although all Quercus species, Castanopsis carlesi, Castanea pumila and Lithocarpus densiflorus shared Ser.
In the Fagales dataset, LRT (Table 2) indicated that the branch-site model A (ω 2 = estimated, in branches leading to deciduous or evergreen species or belonging to climate 5, see S1 and S2 Tables) was a significantly better fit to the data than the null model A1 (ω 2 = 1, fixed) (p-value = 0.0001). However, no positively-selected sites were identified under the branch-site model A in either the Quercus small or large datasets. A total of five sites (95, 145, 251, 262 and 328) appeared as positively selected in Fagales, each exhibiting a posterior Bayesian probability greater than 0.90 (Table 2). In branches leading to evergreen species, Asp95Ser replacement was found at least two times (Q. germana, C. carlesii) ( Table 3). In branches leading to deciduous species, i) Ile251Met replacement was found at least six times (Q. aliena, Q. fabri, Q. griffithii, Q. muehlenbergii, Q. serrrata var. brevipetiolata and Q. wutaishanica); ii) Ala262Val replacement occurred on one branch leading to C. sativa; iii) Ala328Ser replacement occurred in branches leading to Q. eugeniifolia and Q. seemani, although Ser328Ala replacement occurred instead in branches leading to N. procera and N. antarctica. In branches leading to species in climate 5, i) Ser145Ala replacement took place in at least six times (F. crenata, F. japonica, F. lucida, F. sylvatica, N. procera, and L. hancei), while Ser145Val replacement occurred at least three times in branches leading to N. menziesii, N. moorei and N. antarctica, ii) Ala262Val replacement occurred in the branch leading to C. sativa; iii) Ala328Ser

Analysis of dependent evolution among amino acid sites in rbcL
Analysis of coevolution in rbcL identified 29 pairs of coevolving amino acids in Fagales dataset with a total of 14 non-redundant amino acids involved (Fig 4 and S5   with the nested model A (that aims to detect positive selection along particular lineages, forward branches). a According to information in S1 Table. b Number of species labelled as forward branches.

Decision tree model
In the Quercus large dataset (158 species), the DT model pointed to a link between the external variables (geographic distribution, climate and leaf habit and density) and the RuBisCO L-subunit variable sites 95, 219, 262 and 328 (Table 4, Fig 5). The xerrors calculated for each variable site were 0.89, 0.39, 0.44 and 0.59, respectively. According to the xerror, the sites that were best explained by the external variables were 219 and 262 followed by 328 and 95.  The xerror corresponding to the best DT found for each variable site, and relative importance (%) of the external variables (geographic area, climate and leaf habit and density) calculated for each resolved sites are shown. The lower the xerror, the higher the relationship between the external variables and the variable site. The external variable with the higher relative importance is the most important external variable explaining the variability in the site. n.a. denotes not selected external variable. https://doi.org/10.1371/journal.pone.0183970.t004

Molecular evolution of RuBisCO in Quercus (Fagaceae)
The leaf habit (evergreen, semi-evergreen and deciduous) was the external variable that best explained variability at site 95, followed by the geographic area, climate and leaf density ( Table 4). The most important variable for sites 219 and 262 was geographic area, followed by climate and leaf density for site 219 and leaf habit, leaf density and climate for site 262.
Site 328 was best explained by climate (2 or 5), followed by geographic area, leaf habit, and density (Table 4, Fig 5).  S1 Table for details on the external variables: Geographic distribution, climate and leaf habit and density). Numbers above each tree correspond to the RuBisCO L-subunit variable site according to the spinach sequence (AJ400848.1). First level presents the proportion of amino acids in each variable site (brackets). The external variable that allows the best separation of species is shown over the line. The second level presents the distribution of amino acids (in brackets) after the first split. Subsequent divisions are performed until the lowest xerror for the entire DT is obtained (symbolized as squares). Taking as an example the RuBisCO L-subunit variable site 95, the first level shows the separation of the 158 species between those that present N (121) and those that present S (37). Over the line, leaf habit is indicated as the external variable that gives the best split among the four external variables, with evergreen and semievergreen species having a proportion of N/S of 79/8. On the other hand, deciduous species present a proportion of N/S of 42/29. The latter group is further split using geographic area as the best external variable into a group of species from North and Central America having a N/S proportion of 29/9, and a group of species from Eurasia and Asia having a proportion of 13/20. The relative importance of each external variable is shown in Table 4.

Both phylogenetic analyses by maximum likelihood and decision tree analyses highlighted the same amino acid substitutions
Methodologically different approaches have been used to study molecular adaptation of RuBisCO to particular ecological conditions in oak trees (Quercus). Phylogenetic analysis by maximum likelihood (PAML) [90] is a standard method to identify positive selection at the molecular level. In the present study, six RuBisCO L-subunit sites (95, 219, 251, 262, 328 and 475) were identified by models M2a and M8 to have evolved under positive selection in Quercus large and small datasets ( Table 1), all of which were previously reported in other groups of plants [41,42,43,44]. For the same Quercus datasets, we compared the results of another method, the DT model. DT linked RuBisCO L-subunit sites 95, 219, 262 and 328 to distribution, climate, leaf habit and density ( Table 4). All four variable sites resolved by the DT model were positively selected according to maximum likelihood analyses (Tables 1 and 4). The analytically simple DT method combined with PAML provided evidence of a link between amino acids replacements in RuBisCO and specific phenotypes [84,85]. The combination of both methods constitutes a powerful tool to identify causal links between genetic variants and adaptation of the L-subunit of RuBisCO.
According to the DT model analysis, replacement at site 95 was linked to the leaf habit as the most important external variable (Table 4). Since site 95 evolved under positive selection (Table 1), and evergreen and deciduous species typically display different mesophyll conductance (g m ) influencing the CO 2 concentration at the site of carboxylation [63], this result provides support to the idea that CO 2 availability shapes RuBisCO evolution [14,91].
The six RuBisCO sites under positive selection in Quercus large and small datasets (Table 1) were located in functionally important subunit interfaces within the RuBisCO complex (95, 219, 251, 262, 328 and 475). Site 95 was hypothesized to be involved in interactions between RuBisCO and RuBisCO activase [92,93]. Sites 219 and 262 were reported to be involved in interactions between large and small subunits [94]. Site 262 is located in loop 3 in a hydrophobic core in the C-terminal α-β barrel domain and could influence holoenzyme thermal stability and catalysis [95]. Site 251 seems to be involved in dimer-dimer interactions within the large subunits [96] and sites 328 and 475 are located close to the active site and in the C-terminus, respectively [95,97].

Evidence for CO 2 as a major factor driving RuBisCO evolution in Fagales
In Fagales, species with evergreen leaves had RuBisCO residues 95 and 262 under positive selection, species with deciduous leaves had residues 251, 262 and 328 under positive selection, and species inhabiting climate number 5 had residues 145, 262 and 328 under positive selection (Table 2). In species with evergreen leaves, amino acid replacements in position 95 (Table 3) may be linked to an increased affinity of RuBisCO for CO 2 (i.e., low values of the RuBisCO Michaelis-Menten constant, K c ). The work by Galmés et al. [18] linked the Asp95Ser replacement to high affinity for CO 2 (low K c ). The dataset of evergreen Fagales had an average LMA of 117.5 ± 0.4 g m -2 . This high LMA, and specifically high leaf density, could have been associated with high resistance to CO 2 internal diffusion [63,97]. Moreover, species with high LMA also tend to present lower values of stomatal conductance [98]. Hence, RuBisCO of evergreen Fagales probably works at relatively low CO 2 partial pressures. Taken together, these results suggest that amino acid replacements at position 95 in evergreen Fagales may lead to RuBisCO with increased affinity for CO 2 . Unfortunately, our attempts to extract active RuBisCO in different Fagales failed due to the high content of polyphenols and other secondary metabolic compounds. Future efforts will demand testing different extraction buffers to obtain sufficient active enzyme and determine key RuBisCO kinetic parameters. In species with deciduous leaves or inhabiting areas lacking a dry season (climate number 5), replacements at site 328 (Table 3) could be related with decreased RuBisCO CO 2 affinity (i.e., high values of K c ). Galmés et al. [15] reported low specificity factor (S c/o ) and high maximum carboxylase turnover rate (k cat c ) and K c in Limonium species having serine at the site 328. Those kinetic values are associated with an increase in the chloroplast concentration of CO 2 . Within Fagales, species with deciduous leaves or belonging to temperate climate (climate number 5) had average LMA value of 87.6 ± 0.2 and 95.4 ± 0.4 g m -2 , respectively. Low LMA and low leaf density, together with the absence of a dry season, may favour high CO 2 concentration in the stroma of the chloroplast, via indirect effects on leaf conductance to CO 2 [98]. RuBisCO in deciduous species or inhabiting climate number 5 could have adapted towards a higher k cat c and lower CO 2 affinity (high K c ) and S c/o , although this could be confirmed with future kinetic analyses.
In total, twenty-nine residue pairs in Fagales RuBisCO L-subunit were linked through intramolecular coevolution, representing c.a. 3% of L-subunit residues (14 out of 476) (Fig 4  and S5 Table). Many of the coevolving residues detected in the present study were already reported as coevolving in previous studies including different land plant lineages [47,56]. Our results showed that both positive selection and coevolution affect some sites. For example, site 95 was positively selected within evergreen species (Tables 1 and 2) and appeared as coevolving with site 309 (S5 Table). By contrast, site 145 was positively selected within species of climate number 5 (Tables 1 and 2) and appeared as coevolving with seven sites (30,142,143,270,340,353, 470) (S5 Table). Three of the coevolving pairs, 30-340, 142-470 and 270-309 coevolve in terms of their physic-chemical properties, including molecular weight and hydrophobicity (S5 Table). Coevolving sites may be located within structurally and/or functionally important positions. For example, Kellogg and Juliano (1997) [99] reported the importance of sites 142 and 145 for dimer-dimer association, and sites 225 and 449 could be important for the interaction between large and small subunits [100]. Knowledge of co-evolution networks operating in RuBisCO L-subunit of Fagales provides useful information on target substitutions to improve the catalytic performance of RuBisCO.

Concluding remarks
Based on our research, it is reasonable to postulate that finely tuned biochemical properties of Quercus RuBisCOs have evolved as a result of environmental pressures. Such evolution is manifested by positively selected amino acid replacements within the large subunits of Quercus RuBisCO, which are likely related to different physiological and environmental traits. These changes could have fine-tuned RuBisCO catalytic efficiency and may have facilitated Quercus adaptive radiation into diverse ecological niches. The DT model and phylogenetic analysis by maximum likelihood identified the same amino acid replacements associated with ecological adaptation and positive selection.
Supporting information S1 Table. Studied species from Fagaceae and Nothofagaceae families from the order Fagales. The genus, subgenus and section are indicated along with information on the species distribution, climate, and leaf habit and density. Data on the geographic distribution and leaf habit were obtained from Govaerts et al. (1998) [64] and publicly available databases [65, 66, 67, 68]. The climate types were obtained by overlapping the species geographical distribution and the Köppen-Geiger world map of climate classification [69]. Fifteen different Köppen-Geiger types of climates were grouped into six: 1 = tropical, 2 = arid steppe, 3 = temperate with dry winter and hot or warm summer, 4 = temperate with dry summer and hot or warm summer, 5 = temperate or cold without dry season and hot or warm summer, 6 = cold with dry summer and hot or warm summer. The species leaf density was calculated from leaf thickness and leaf mass area (LMA) measurements. The three columns on the right correspond with [1] classification. (PDF) S2 Table. Köppen-Geiger climate types assigned to each of the 174 species (S1 Table) based on the geographic area. To simplify the analysis, 15 different Köppen-Geiger types of climates were grouped into six: 1) tropical (including climates Af, Am and Aw according to Köppen-Geiger classification); 2) arid steppe (Bsh, Bsk); 3) temperate with dry winter and hot or warm summer (Cwa, Cwb); 4) temperate with dry summer and hot or warm summer (Csa, Csb); 5) temperate or cold without dry season and hot or warm summer (Cfa, Cfb, Dfa, Dfb), and 6) cold with dry summer and hot or warm summer (Dsa, Dsb). (PDF) S3 Table. Fagaceae and Nothofagaceae GenBank accession numbers for rbcL and matk genes. (PDF) S4 Table. RuBisCO L-subunit 19 variable sites identified in 174 Fagales species, grouped in 30 haplotypes (i.e., groups of species with identical L-subunit sequence). Variable sites identified when Quercus large dataset (158 species) were analyzed separately are marked in grey (9 variable sites and 21 haplotypes). Species marked with an asterisk were used to construct the Quercus small dataset (45 species) phylogeny based on rbcL, matK and SSRs. (PDF) S5 Table. Coevolution pairs within the L-subunit of RuBisCO. Significant correlation coefficients are shown (p < 0.001). Mean D1 and Mean D2 values correspond to the mean distance calculated for each coevolving position based on BLOSUM as calculated in the method of Fares and Travers (2006) [1]. The bootstraps are based on 100 resampling, the confidence level associated with the pair that coevolves (values greater than 75% are significant after a non-parametric resampling over the tree). Atomic distances were calculated from 3D crystal structure wherever available by measuring the average Euclidean distance between atoms of two amino acids (Å). Atomic distances are not used here as evidence of coevolution but rather as additional supporting information in the identification of functional and structural coevolution. A test for variability in hydrophobicity and molecular weight has been also conducted giving as a result the pair 30-340, 142-470 and 270-309. n.a. not available. (PDF)