Comparative Analysis of Predicted Plastid-Targeted Proteomes of Sequenced Higher Plant Genomes

Plastids are actively involved in numerous plant processes critical to growth, development and adaptation. They play a primary role in photosynthesis, pigment and monoterpene synthesis, gravity sensing, starch and fatty acid synthesis, as well as oil and protein storage. We applied two complementary methods to analyze the recently published apple genome (Malus × domestica) to identify putative plastid-targeted proteins, the first using TargetP and the second using a custom workflow utilizing a set of predictive programs. Apple shares roughly 40% of its 10,492 putative plastid-targeted proteins with the Arabidopsis (Arabidopsis thaliana) plastid-targeted proteome as identified by the Chloroplast 2010 project, and ∼57% of its entire proteome with Arabidopsis. This suggests that the plastid-targeted proteomes of apple and Arabidopsis differ and, interestingly, points to differential targeting of homologs between the two species. Co-expression analysis of 2,224 genes encoding putative plastid-targeted apple proteins suggests that they play a role in plant development and intermediary metabolism. Further, an inter-specific comparison of Arabidopsis, Prunus persica (Peach), Malus × domestica (Apple), Populus trichocarpa (Black cottonwood), Fragaria vesca (Woodland Strawberry), Solanum lycopersicum (Tomato) and Vitis vinifera (Grapevine) also identified a large number of novel species-specific plastid-targeted proteins. This analysis also revealed the presence of alternatively targeted homologs across species. Two separate analyses revealed that a small subset of proteins, one representing 289 protein clusters and the other 737 unique protein sequences, is conserved among the seven plastid-targeted angiosperm proteomes. The majority of the novel proteins were annotated to play roles in stress response, transport, catabolic processes, and cellular component organization.
Our results suggest that the current state of knowledge regarding plastid biology, based preferentially on model systems, is deficient. New plant genomes are expected to enable the identification of potentially new plastid-targeted proteins that will aid in studying novel roles of plastids.


Introduction
The plastid is an intracellular organelle derived from an endosymbiotic event wherein a free-living autotrophic photosynthetic bacterium was phagocytized by a separate heterotrophic organism [1]. These organelles have since become essential to plant survival and have been documented to participate in numerous biological processes including photosynthesis, storage of oils, and proteins, pigment synthesis and storage, monoterpene synthesis [2], gravity sensing [3], and starch and fatty acid synthesis [4]. Over an extensive period of evolution, large parts of the plastid genome are hypothesized to have integrated into the nuclear genome [5]. In higher plants, the vast majority of proteins constituting the plastid proteome are encoded by genes physically resident in the nuclear genome, with about 120 genes retained in the plastid genome, a number which varies between species [6]. Comparative genomic analysis between Arabidopsis (Arabidopsis thaliana) and cyanobacteria indicates that 18% of the Arabidopsis protein-coding genes were derived from events involving transfer of genetic material from the plastid to the nucleus [7]. In part, exchange of genetic material and related biological functionality has necessitated an orchestration of processes between the plastid and nucleus where the nucleus actively exerts control on all aspects of plastid function.
Plant cells have developed intricate mechanisms to import nuclear-encoded proteins to or across the three plastid membranes (outer envelope, inner envelope, and thylakoid). Multiple protein transport pathways have been shown to aid protein transport across the inner and outer plastid envelopes; however, the vast majority of the plastid proteome is transported via the Tic/Toc pathway [8]. In order to utilize this pathway, most proteins possess a signal peptide which interacts with chaperones and is later cleaved. Stromal-targeting peptide sequences, while not conserved, possess some similarities in amino acid composition. These targeting peptides typically contain a relatively high abundance of serine and threonine residues [9,10] and are positively charged [11]. There are also some proteins that do not have any canonical signaling peptides and yet localize to plastids [12,13,14]. Therefore, signal prediction programs provide a good starting point for understanding the plastid-targeted proteome of any new species, but their predictions require experimental validation.
Due largely to the technical complexity of whole plastid proteome characterization, transcriptome or genome sequences have become widely used datasets for predicting plastid-targeting motifs. Such an approach also enables the identification of plastid-targeted proteins in a spatial and temporal context. Subcellular localization has been predicted with software such as PCLR [15], iPSORT [16], TargetP [17,18], and PREDOTAR [19], amongst many other programs. Most prediction methods exploit the presence of an N-terminal signal sequence to predict cellular localization. Of these, TargetP was reported to be the most successful in prediction and was comparable to PCLR, with each having sensitivity values of 0.72 [20].
The Rosaceae family represents a unique diversity in fruit development, which is unrealized in the many model plants whose genomes have been sequenced [21,22,23,24,25,26]. Pomes (apples and pears), stone fruits (cherry and peach), and aggregate fruits (strawberry and raspberry) display a diversity that suggests the presence of novel metabolic processes, and is supported by a large number of genes which far exceeds the number of genes in Arabidopsis. While fruit development in these Rosaceae species differs vastly, the ubiquitous process of the plastidial transition from a chloroplast to chromoplast is often assumed to be conserved. Within Rosaceae fruit, plastids play extremely important roles in determining fruit quality and organoleptic appeal as they are the site for synthesis of carotenoids [27,28], monoterpenes [2], fatty acids [4] and aromatic amino acids. Many of these compounds have been linked to human health and nutrition [29,30]. Plastids are also important in converting starch into various types of carbohydrate and sugars in developing fruits [31].
The plastid structure has also been reported to differ between different tissues of the fruit. Phan [32] reported the presence of a large single granum comprised only of stacked thylakoid membranes in the plastids of the endocarp tissue of apple. In addition, chloroplasts with leaf-like thylakoid and grana organization in the outer six cell layers of mature apple fruit and presence of globular chromoplasts in epidermal cells were described. It is expected that differences in structure, physiology and biochemistry in the Rosaceae fruit plastids as well as other non-model systems will assist in identifying novel processes associated with plastids in plants.
In this study we tested three primary hypotheses using a bioinformatics approach: (1) the total number and composition of the plastid-targeted protein coding genes in apple, a model representative of Rosaceae, differ from those of the taxonomically different Arabidopsis; (2) the plastid-targeted protein coding genes are under transcriptional control during apple fruit development; and (3) there is a subset of plastid-targeted protein coding genes that is unique and novel to each plant species.
In order to test the first hypothesis, we performed an in-depth computational analysis predicting the plastid-targeted proteome of apple and compared it with that of Arabidopsis. This identified a much larger number of plastid-targeted genes in apple, with nearly 4000 plastid-targeted protein coding genes being unique to apple. The second hypothesis was tested by reanalyzing publicly available apple transcriptome data, which revealed co-expression profiles of plastid-targeted genes and their association with development and metabolism. Finally, the third hypothesis was tested by extending the custom analysis workflow to an inter-genera comparison with six other published genomes: Arabidopsis thaliana, Vitis vinifera, Prunus persica, Populus trichocarpa, Fragaria vesca, and Solanum lycopersicum, resulting in the identification of plastid-targeted proteins unique to each species. A core set of 737 Arabidopsis thaliana proteins, highly enriched in photosynthesis and primary metabolism gene ontology (GO) terms, was identified as having homologous plastid-targeted proteins in all investigated species.

Materials and Methods
TargetP-based prediction of Malus × domestica plastid proteome
The Malus × domestica predicted protein set was obtained from the apple genome sequencing project [26]. Protein sequences were analyzed using TargetP using plant networks with default parameters [17,18]. All sequences with predicted chloroplast transit peptides were compiled into a new dataset and were sorted based on length using USEARCH [33].

Figure 1. Venn diagrams displaying the predicted plastid-targeting proteins unique to apple compared to Arabidopsis. Two plastid-targeting methods, TargetP and a custom analysis method, were used to predict genes encoding plastid localized proteins. Sequences in these data sets were compared to Arabidopsis plastid-targeted proteins from the Chloroplast2010 project using USEARCH. Genes not clustered to Arabidopsis were compared between prediction methods displaying a high agreement between the methods. Venn diagrams were constructed using Venny (Oliveros, 2007).

Custom protein targeting analysis
A part of the functional annotation pipeline was applied to identify organelle 'plastid' targeted gene products encoded by the apple genome [26]. The peptide sequences were analyzed first through InterProScan [34] results provided by the genome consortium [26], followed by in-house analysis using SignalP [17], Predotar [19] and TMHMM [35]. InterPro provided the domain annotations, and any genes/peptides with transposable element/domain annotations were filtered out before further analysis. The next steps of the pipeline employed: (1) SignalP to predict localization to the mitochondrial or plastid or secretion pathway, plus providing signal peptide cleavage sites, (2) Predotar to predict localization to either or both the mitochondrion or plastid, and (3) TMHMM to identify predicted transmembrane domains in the protein sequences. After collecting these annotations, standardized protocols for assigning the annotations were adopted [24].
The higher-quality evidence code, reviewed computational analysis (RCA), was assigned if scores of 0.75 or greater were predicted by TargetP and Predotar and two or more transmembrane domains were predicted by TMHMM. The inferred from electronic annotation (IEA) evidence code was assigned for scores of 0.5-0.749 from TargetP and Predotar and a single transmembrane domain predicted by TMHMM. The majority of these annotations received IEA evidence codes. If a gene product had plastid targeting predicted by TargetP and Predotar and membrane-spanning domains identified by TMHMM, then the suggested location of the targeted protein was 'plastid membrane'. The Inparanoid algorithm [36] was used to find orthologous genes and paralogous genes that arise by duplication events. The pipeline was discussed in the Fragaria vesca genome paper [24]. For this study, the analysis included the peptide sequences from 22 species, including Arabidopsis thaliana, Brachypodium distachyon, Caenorhabditis elegans, Chlamydomonas reinhardtii, Danio rerio, Escherichia coli, Fragaria vesca, Glycine max, Homo sapiens sapiens, Zea mays, Malus × domestica, Mus musculus, Neurospora crassa, Oryza sativa, Physcomitrella patens, Populus trichocarpa, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Selaginella moellendorffii, Sorghum bicolor, Synechocystis, and Vitis vinifera, to cover the tree of life with emphasis on fully/nearly complete and published genomes. Peptide sequences were downloaded from Phytozome.net for grapevine, Selaginella, Physcomitrella, Chlamydomonas, Glycine, and Populus; from the genome portal [26] for Malus; and from Gramene [37] for rice, sorghum, maize and Arabidopsis. The remaining sequences were downloaded from Ensembl [38,39].
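The score bands described above can be illustrated with a minimal sketch. The function names are hypothetical, and the way the two predictor scores and the transmembrane count are combined (both predictors must reach at least the IEA band) is an assumption made for illustration, not the pipeline's documented logic.

```python
from typing import Optional

RCA_MIN = 0.75  # scores at or above this fall in the RCA band
IEA_MIN = 0.50  # scores from 0.50 up to (but excluding) 0.75 fall in the IEA band


def score_evidence(score: float) -> Optional[str]:
    """Map a TargetP or Predotar score to an evidence code, or None if too low."""
    if score >= RCA_MIN:
        return "RCA"
    if score >= IEA_MIN:
        return "IEA"
    return None


def tm_evidence(n_tm: int) -> Optional[str]:
    """Map a TMHMM transmembrane-domain count to an evidence code."""
    if n_tm >= 2:
        return "RCA"
    if n_tm == 1:
        return "IEA"
    return None


def suggested_location(targetp: float, predotar: float, n_tm: int) -> Optional[str]:
    """Assumption: a plastid call requires both predictors to reach the IEA band."""
    if score_evidence(targetp) is None or score_evidence(predotar) is None:
        return None
    # Overlapping targeting and membrane annotations suggest 'plastid membrane'.
    return "plastid membrane" if tm_evidence(n_tm) else "plastid"
```

Under these assumptions, a protein scoring 0.9 and 0.8 with two predicted transmembrane domains would be labelled 'plastid membrane', while the same scores with no transmembrane domain would give 'plastid'.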

Identification of sequences unique to apple datasets compared to Arabidopsis
The Arabidopsis thaliana plastid-targeted gene set was obtained from the Chloroplast 2010 project website (www.plastid.msu.edu) [40]. Arabidopsis embryo lethal mutants were analyzed using TargetP [17,18] and any chloroplast-targeted proteins were added to the aforementioned dataset, as these were omitted from the Chloroplast 2010 database. Proteins predicted to target the apple plastid were then compared to plastid-targeted proteins from Arabidopsis thaliana using USEARCH [33]. Predicted plastid-targeted proteins were compared using two conditions: first, a global USEARCH was performed using 40% amino acid identity and 40% coverage (40/40), and second, a global comparison was performed using 50% amino acid identity with 50% coverage (50/50). Header files of proteins unique to the M. × domestica dataset were compared between the TargetP-based method and the custom analysis to investigate any bias associated with either prediction technique.
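The effect of the 40/40 and 50/50 conditions can be sketched as a filter over precomputed alignment hits. The `Hit` record, the protein identifiers, and the definition of coverage as a percentage of the query are hypothetical; this sketch does not reproduce the USEARCH alignment itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    query: str       # apple protein ID (hypothetical)
    target: str      # Arabidopsis protein ID (hypothetical)
    identity: float  # percent amino acid identity
    coverage: float  # percent coverage (assumed to be of the query)


def unique_queries(all_queries, hits, min_id=40.0, min_cov=40.0):
    """Return the queries with no hit passing both thresholds."""
    matched = {h.query for h in hits
               if h.identity >= min_id and h.coverage >= min_cov}
    return set(all_queries) - matched


queries = ["MDP1", "MDP2", "MDP3"]
hits = [
    Hit("MDP1", "AT_a", identity=62.0, coverage=90.0),
    Hit("MDP2", "AT_b", identity=45.0, coverage=55.0),
    Hit("MDP3", "AT_c", identity=38.0, coverage=80.0),
]
unique_40 = unique_queries(queries, hits, 40.0, 40.0)
unique_50 = unique_queries(queries, hits, 50.0, 50.0)
```

In this toy input only MDP3 lacks a match at 40/40, whereas both MDP2 and MDP3 lack one at 50/50: a stricter threshold can only leave more sequences unmatched.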
USEARCH-based multispecies comparative analysis of predicted plastid-targeted proteomes
Predicted coding sequences were collected from the genomes of Fragaria vesca (Woodland Strawberry) [24], Vitis vinifera (Grapevine) [22], Solanum lycopersicum ITAG1 release (Tomato) [23], Prunus persica (Peach) [21] and Populus trichocarpa (Black Cottonwood) [25]. Protein sequences were analyzed with TargetP [17,18] using default parameters to predict localization. Sequences predicted to be plastid-targeted were organized into new files for each species. Comparisons were performed for each plastid-targeted dataset using USEARCH [33] 40/40 and 50/50 global parameters against the Arabidopsis thaliana plastid-targeted dataset, the entire Arabidopsis TAIR V10 protein set (Arabidopsis.org) [41], the predicted proteins from Solanum lycopersicum, as well as a file comprising the predicted plastid-targeted protein sequences of the other six species. All datasets were first sorted by length using USEARCH.
Further analysis was performed with USEARCH to identify those proteins present in the predicted plastid proteomes of all investigated species. To perform this analysis, the Arabidopsis thaliana putative plastid-targeted protein set was compared using USEARCH 40/40 global parameters separately against the plastid-targeted proteins from woodland strawberry, grapevine, tomato, peach, black cottonwood, and apple. Output files were then analyzed to identify those Arabidopsis sequences which had a match in the plastid-targeted proteomes of all species.
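The final step, retaining only those Arabidopsis sequences with a match in every species, amounts to a set intersection. The sketch below assumes the USEARCH output has already been reduced to one set of matched Arabidopsis identifiers per species; the identifiers and function name are hypothetical.

```python
def core_set(matches_by_species):
    """Intersect the per-species sets of matched Arabidopsis IDs.

    matches_by_species maps a species name to the Arabidopsis IDs that had
    a match in that species' predicted plastid-targeted proteome.
    """
    sets = [set(ids) for ids in matches_by_species.values()]
    if not sets:
        return set()
    return set.intersection(*sets)


matches = {
    "strawberry": {"AT_1", "AT_2", "AT_3"},
    "grapevine": {"AT_1", "AT_3"},
    "tomato": {"AT_1", "AT_3", "AT_4"},
}
shared = core_set(matches)
```

Here only AT_1 and AT_3 are matched in all three toy species, so they form the shared set.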
UCLUST-based multispecies comparative analysis of predicted plastid-targeted proteomes
A second comparative analysis was performed using the clustering feature of the USEARCH package, UCLUST. In this analysis, the plastid-targeted protein sequences from the seven examined species were compiled into a single file and sorted by length. UCLUST was performed at 50% identity. The output was parsed to identify protein clusters with members from all seven species, as well as those clusters containing sequences from only one species.
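The parsing step can be sketched as follows, assuming the UCLUST output has already been converted to (cluster, species, sequence) records. That record layout and the function name are assumptions for illustration; the native UCLUST output format is not parsed here.

```python
from collections import defaultdict


def classify_clusters(assignments, all_species):
    """Split clusters into those spanning every species and those from one species.

    assignments: iterable of (cluster_id, species, sequence_id) tuples.
    Returns (shared_cluster_ids, single_species_cluster_ids).
    """
    species_in = defaultdict(set)
    for cluster_id, species, _seq_id in assignments:
        species_in[cluster_id].add(species)
    wanted = set(all_species)
    shared = {c for c, s in species_in.items() if s == wanted}
    single = {c for c, s in species_in.items() if len(s) == 1}
    return shared, single


records = [
    (0, "apple", "a1"), (0, "peach", "p1"), (0, "tomato", "t1"),
    (1, "apple", "a2"), (1, "apple", "a3"),
    (2, "peach", "p2"), (2, "tomato", "t2"),
]
shared, single = classify_clusters(records, ["apple", "peach", "tomato"])
```

Single-species clusters include both singletons and clusters whose members are paralogs from one species, as in cluster 1 of the toy input.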

Determination of Jaccard's similarity coefficients
Two separate techniques were used to create the similarity matrices based upon Jaccard's coefficients. To calculate the value of an individual cell (the distance between species A and species B) we first determined if two genes were considered homologous. If a [...] parameters. Alternatively, in the UCLUST-based approach two genes match if they belong to the same cluster.

Table 4. Datasets comprised of putative plastid-targeted proteins unique to each species were clustered against the predicted plastid-targeted protein sequences of Arabidopsis thaliana (At) from Chloroplast2010 and TAIR V10, Solanum lycopersicum (Sl), and a database consisting of the plastid-targeted proteins from all 6 species using USEARCH. Sequences were clustered globally with 40% coverage with 40% identity (40/40). Results suggest that a large portion of the plastid-targeted proteins may be unique to each respective species. doi:10.1371/journal.pone.0112870.t004

Blast2GO Gene Ontology analysis and GO term enrichment analysis
Sequences for all genes encoding unique or shared plastid-targeted proteins in the investigated apple, Arabidopsis, grapevine, peach, strawberry, black cottonwood, and tomato datasets were analyzed via Blast2GO [42,43]. BLASTP was performed using the NCBI nr database with Blast2GO default parameters. Gene ontology mapping and annotation were also performed using default parameters with the August 2012 database. Following GO annotation, an InterProScan [34,44] was performed and results were merged with the GO annotations. Annotation augmentation was performed using ANNEX [45], followed by GO-slim with the goslim_plant.obo database. Kyoto Encyclopedia of Genes and Genomes (KEGG) information was downloaded from the KEGG Pathway Database [46,47]. Datasets comprising proteins unique to each plastid-targeted proteome, as well as those shared between all seven species, were investigated using Single Enrichment Analysis with agriGO [48] to identify enriched GO terms.
Analysis was performed using the Fisher test for significance, with p-values adjusted using the Yekutieli multiple-test adjustment and the minimum number of mapping entries set to three. The significance level was set at 0.01 and all GO terms with p-values lower than this cutoff were reported as enriched.
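The enrichment test described above can be approximated with a short sketch: a one-sided Fisher exact test computed from the hypergeometric tail, followed by the Benjamini-Yekutieli adjustment. This is an illustration under stated assumptions rather than agriGO's implementation; in particular, the one-sided form of the test and the reading of "minimum mapping entries" as a minimum count of annotated query genes are assumptions, and all function names are hypothetical.

```python
from math import comb


def fisher_enrichment_p(x, n, K, N):
    """P(at least x annotated genes in a query of n), given K annotated genes
    in a reference of N (one-sided Fisher exact test)."""
    tail = sum(comb(K, k) * comb(N - K, n - k)
               for k in range(x, min(K, n) + 1))
    return tail / comb(N, n)


def yekutieli_adjust(pvalues):
    """Benjamini-Yekutieli adjusted p-values, returned in the input order."""
    m = len(pvalues)
    c_m = sum(1.0 / j for j in range(1, m + 1))
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m * c_m / rank)
        adjusted[i] = running_min
    return adjusted


def enriched_terms(term_counts, n, N, alpha=0.01, min_entries=3):
    """term_counts maps a GO term to (x, K): query and reference counts."""
    tested = {t: xk for t, xk in term_counts.items() if xk[0] >= min_entries}
    terms = list(tested)
    pvals = [fisher_enrichment_p(tested[t][0], n, tested[t][1], N) for t in terms]
    return {t: q for t, q in zip(terms, yekutieli_adjust(pvals)) if q < alpha}
```

As a toy usage, `enriched_terms({"GO:A": (20, 25), "GO:B": (2, 100)}, n=50, N=1000)` skips GO:B for having fewer than three query entries and reports GO:A, since 20 of the 25 annotated reference genes fall within a query of 50.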

Analysis of apple fruit gene expression
In order to ascertain if genes encoding plastid-targeted proteins in apple are expressed in fruits, as well as to identify co-expressed gene sets, microarray data from a previously published experiment were used [49]. The Janssen study measured the relative expression of about 13,000 features designed from apple fruit expressed sequence tags (ESTs) at 8 time points ranging from 0 days after anthesis (DAA) to 146 DAA. All EST sequences utilized in the microarray experiment were retrieved from NCBI and a BLASTX was performed against the predicted apple protein set generated from the apple genome [26]. The EST expression data were then assigned to the top protein hit. Sequences which were previously found to be plastid-targeted were extracted and their respective expression data were analyzed by determining expression relative to the lowest measured mean expression value. The Log2 of relative expression data were imported into and analyzed with Multi-Experiment Viewer [50,51]. Sequences were clustered using the Cluster Affinity Search Technique [52] with Pearson correlation and a threshold of 0.8. Blast2GO [42] was used to assign annotation to those proteins with associated gene expression data. Single Enrichment Analysis was performed with agriGO [48] as previously described; however, a chi-square test was used instead to determine statistical significance.
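The transformation and similarity measure used before clustering can be sketched as below. This is not an implementation of the Cluster Affinity Search Technique; it only shows the log2 relative-expression step and a pairwise Pearson check against the 0.8 threshold. Taking the lowest mean as a single value supplied by the caller is an assumption, and the function names are hypothetical.

```python
from math import log2, sqrt


def log2_relative(profile_means, lowest_mean):
    """Log2 of expression relative to the lowest measured mean (must be > 0)."""
    return [log2(v / lowest_mean) for v in profile_means]


def pearson(a, b):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def coexpressed(a, b, threshold=0.8):
    """True if two profiles meet the correlation threshold."""
    return pearson(a, b) >= threshold


rising = log2_relative([2.0, 4.0, 8.0], lowest_mean=2.0)
falling = log2_relative([8.0, 4.0, 2.0], lowest_mean=2.0)
```

In this toy input the rising and falling profiles are perfectly anti-correlated, so they would not be grouped together, whereas any profile is trivially co-expressed with a scaled copy of itself.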

Results
Predicted Plastid-targeted Proteomes of Malus × domestica
The apple genome has a total of 57,386 predicted genes [26], nearly 30,000 more genes than Arabidopsis [41,53]. We analyzed the complete apple gene set for cellular localization using two approaches, namely TargetP [17,18] and a custom prediction method (see Materials and Methods for details). TargetP predicted the presence of 10,492 plastid-targeted proteins in the apple genome, while the custom gene ontology-based analysis predicted 9,882 genes, with an overlap of 9,256. Each data set was then clustered with the Arabidopsis plastid-targeted protein set using USEARCH [33] with 40% identity and 40% coverage (40/40 parameters) to identify homologous protein sequences. The TargetP method and custom analysis predicted 6,209 and 5,789 plastid-targeted proteins, respectively, to be unique to the apple dataset. The two methods agreed upon 5,318 proteins (86% and 92%, respectively) uniquely targeted to apple chloroplasts and absent from those of Arabidopsis thaliana (Figure 1). Alternative clustering using 50% identity and 50% coverage (50/50 parameters) resulted in less clustering with Arabidopsis sequences and, consequently, increased the number of proteins predicted to be unique to apple. Using these parameters, 7,110 sequences were predicted to be unique to the apple plastid proteome by TargetP, 6,639 by the custom analysis, and a set of 6,131 was agreed upon by the two methods.
In order to identify prediction biases between the custom analysis and TargetP, an agriGO [48] GO term enrichment was performed on the proteins predicted to be differentially targeted. No significant GO terms were found to be enriched in the 1,236 proteins predicted to target the plastid by TargetP alone. However, agriGO identified the GO terms oxygen binding (GO:0019825, p-value 3.2e-05), hydrolase activity (GO:0016787, p-value 6e-05) and catalytic activity (GO:0003824, p-value 3.3e-04) as enriched in the plastid-targeted proteins unique to the custom analysis.
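The counts reported in this section can be cross-checked with simple arithmetic. The variable names are illustrative, and the custom-only count is derived here by subtraction rather than quoted from the text.

```python
# Totals reported for the two prediction methods and their overlap.
targetp_total, custom_total, both_total = 10_492, 9_882, 9_256

targetp_only = targetp_total - both_total  # predicted by TargetP alone
custom_only = custom_total - both_total    # predicted by the custom analysis alone

# Proteins without an Arabidopsis match at 40/40, and the agreed-upon set.
targetp_unique, custom_unique, agreed_unique = 6_209, 5_789, 5_318

pct_of_targetp = round(100 * agreed_unique / targetp_unique)
pct_of_custom = round(100 * agreed_unique / custom_unique)

print(targetp_only, custom_only, pct_of_targetp, pct_of_custom)
```

The TargetP-only difference reproduces the 1,236 proteins examined for enrichment, and the two ratios round to the reported 86% and 92%.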

Expression analysis of genes encoding plastid-targeted proteins in Malus × domestica
In order to test the hypothesis that plastid-targeted protein coding genes are under transcriptional control as the apple fruit develops, we reanalyzed data from a previously published microarray-based analysis of developing apple fruit [49]. Of the 13,000 unigene microarray probe sets studied, 2,698 were determined to map back to putative plastid-targeted proteins identified in this study, representing a total of 2,224 unique sequences. Clustering of expression data using MultiExperiment Viewer [50,51] identified 92 different expression clusters; however, only 64 of these had 5 or more members. Over 50% of the genes fit into 9 co-expression clusters. These co-expressed genes were annotated using Blast2GO to infer their functions. Expression data for each cluster are provided in File S1, and their associated GO term information in File S2.
Of the 11 main clusters investigated, only the two most populous clusters of co-expressed genes contained GO terms which were determined to be significantly enriched by agriGO analysis (Table 1). Cluster 1 had a single enriched biological process GO term, photosynthesis (GO:0015979), along with 12 enriched cellular component GO terms, of which thylakoid (GO:0009579) had the lowest p-value. Cluster 2 was enriched in the biological process GO terms lipid metabolic process (GO:0006629), secondary metabolic process (GO:0019748), biosynthetic process (GO:0009058), catabolic process (GO:0009056), transport (GO:0006810), establishment of localization (GO:0051234), and localization (GO:0051179). Additionally, one molecular function GO term, catalytic activity (GO:0003824), was enriched in cluster 2, along with 11 cellular component GO terms.
In order to determine if genes encoding plastid-targeted proteins were indeed expressed within the fruit of apple, data from a previous study were analyzed [49]. The initial microarray experiment was a large-scale analysis representing 13,000 of the ∼57,000 apple genes, and was designed around many significant physiological events occurring during apple fruit development. These 13,000 genes were compared to the genes encoding predicted plastid-targeted proteins described earlier in this study. About 20% of the genes (2,224 genes) encoding predicted plastid-targeted proteins mapped back to genes represented in the Janssen study. Analysis with MultiExperiment Viewer revealed that the majority of these genes were co-expressed in 9 clusters. To show how these expression profiles may relate to important fruit developmental events, expression profiles for the co-expressed genes were overlaid with those events described in Janssen et al. (Figure 2). An additional event, plastid globule accumulation, was also added, as it was noted in developing apple fruits alongside the unstacking of photosynthetic membranes [54]. As seen in Figure 2, the gene expression of these clusters and their GO terms coincide to some extent with the processes occurring within the apple fruits. Many of the biological process GO terms and KEGG pathways associated with each gene expression cluster suggest that expression of genes encoding plastid-targeted proteins may coincide with these important events. The expression of Cluster 1 closely mirrors the photosynthetic activity of apple fruit tissue, with highest expression occurring in young, photosynthetically capable fruit, and expression decreasing as the fruit matures and its photosynthetic capability declines. Additionally, the expression of genes in Cluster 2 appears to mirror the development of carotenoids, volatile compounds, and maturation of fruit, with expression lowest in young fruit and increasing as the fruit reaches maturity.
In particular, genes whose products are involved in lipid metabolic processes, secondary metabolic processes, biosynthetic processes, and catabolic processes, as determined via GO term enrichment, are strong candidates for further study of their participation in apple fruit volatile production. Cluster 11 is particularly interesting as it is comprised of genes whose expression peaks at a single time point (60 DAA); however, the associated KEGG pathways and GO terms do not suggest a connection to the significant fruit processes of cell expansion and starch accumulation occurring at that time point. Blast2GO analysis revealed that 15.2% of the entire plastid-targeted proteome of apple lacked GO term information.
However, the set of 2,224 genes represented in this study is better characterized, as it contains only 78 (3.5%) sequences with no associated GO terms. Of course, the mere expression of a gene does not indicate that a functional protein is present within the fruit plastids, as this process could be affected or controlled at a number of levels including translation, interaction with chaperone proteins, redox state of the plastid, presence of appropriate translocation proteins, protein and mRNA stability and turnover, and likely many other factors. Regardless, the data presented in this study indicate that the expression of genes encoding plastid-targeted proteins is dynamic in the fruit of Malus × domestica.

Blast2GO was used to determine GO terms associated with all predicted plastid-targeted proteins. Enrichment analysis was performed with agriGO to identify significantly enriched GO terms. Gene Ontology terms are provided for biological process (P), molecular function (F), and cellular component (C).

Prediction and comparative analysis of plastid-targeted proteomes
In order to identify plastid-targeted proteins in seven species of interest (Arabidopsis thaliana, Prunus persica, Malus × domestica, Populus trichocarpa, Fragaria vesca, Solanum lycopersicum, and Vitis vinifera), plastid-targeting predictions were primarily performed using TargetP. TargetP was selected in order to be consistent with previously published work in this area and because previous studies have found TargetP to be the most reliable single prediction program [17,20]. TargetP analysis revealed a large variance in the percentage of total transcripts encoding putative plastid-targeted proteins between the investigated species. The largest of these datasets belonged to Malus × domestica, with 18.3% of its nuclear-encoded proteins predicted to be plastid-targeted, while the lowest was that of Populus trichocarpa, with only 9.9% (Table 2). Header information for the predicted plastid-targeted datasets is provided in File S3.

Comparison of plastid-targeted proteomes with model systems
The predicted plastid proteomes for Malus × domestica, Fragaria vesca, Populus trichocarpa, Prunus persica, Vitis vinifera, and Solanum lycopersicum were independently compared with the Arabidopsis plastid proteome dataset as well as the entire Arabidopsis protein set using USEARCH. In Arabidopsis, the Chloroplast2010 project (www.plastid.msu.edu) [40] identified 5,181 unique genes encoding plastid-targeted proteins using software predictions and direct experimental evidence. Since this dataset did not represent embryo lethal mutants, an additional 201 genes predicted to encode embryo lethal plastid-targeted proteins from the SeedGenes database [55] were added to the dataset used in this study for comparative analysis, bringing it to a total of 5,382 sequences. Comparison at 40% identity and 40% coverage (40/40) reveals that about 50% of each predicted plastid proteome has a likely homolog in the Arabidopsis plastid-targeted proteome subset, while about 60-70% of the proteins have likely homologs in the entire Arabidopsis protein set (Table 3). Further comparison with the 50/50 clustering parameters lowers these estimates significantly.
Additional comparison of putative plastid-targeted protein sequences for all six species was performed against the plastid-targeted proteome of Solanum lycopersicum, another model system for plastid biology research. This analysis showed that smaller proportions of the plastid proteomes had homologs in the tomato plastid proteome (ranging from 32-48%) than in the Arabidopsis proteome (40-60%) (Table 4). Strawberry and apple had the lowest similarity with the predicted plastid proteomes of both Arabidopsis and tomato, while peach had the highest.

Identification of unique plastid-targeted proteins
Two separate analyses were performed to identify the plastid-targeted proteins unique to each of the seven species examined. The first consisted of a USEARCH-based comparison of predicted plastid-targeted proteins against the plastid-targeted protein sequences from the other six species. A second comparison utilized a clustering technique with UCLUST [33]. In this analysis the plastid-targeted proteins from all species were clustered together, and clusters containing singletons or sequences from a single species were identified and further analyzed. In both the USEARCH and UCLUST-based analyses, a significant proportion of each predicted plastid proteome was found to be unique to that species (Table 5). The proportion of uniquely targeted proteins ranges from 16.3% in Prunus persica to 41.5% in Fragaria vesca in the USEARCH method and from 20.6% in Prunus persica to 46.8% in Fragaria vesca in the UCLUST method. However, the majority of the protein sequences have a homolog in at least one of the investigated species (Figure 3).

Figure 4. Biological process GO term composition of proteins shared between the 7 investigated plastid-targeted proteomes using two separate techniques. Two separate analyses were performed to identify proteins within the predicted plastid-targeted proteomes of seven species. The first, UCLUST with 50% identity, generated 15,750 clusters, 289 of which contained a member from all 7 species. USEARCH comparison performed at 40% identity and 40% coverage identified 737 sequences in the Arabidopsis thaliana putative plastid-targeted dataset which had a match in a protein sequence from the plastid-targeted sequences from all of the other 6 species. Sequences were analyzed via Blast2GO to determine the biological processes in which they partake. doi:10.1371/journal.pone.0112870.g004
In order to determine if uniquely targeted proteins were in fact completely unique to each species, as opposed to alternatively targeted, a comparison with USEARCH 40/40 parameters was performed against datasets comprising the entire predicted proteomes of the other 6 species (Table 5). This investigation revealed that a significant number of proteins that lacked homology with interspecies plastid-targeted proteins in fact have alternatively targeted homologs. This difference was most pronounced in Arabidopsis and Malus × domestica, where the percentage of proteins with homologs increased by 15.3% and 16.8%, respectively. Roughly 7-10% more proteins had homologs when using this matching scheme in the other investigated species. Information on these uniquely plastid-targeted proteins is provided in File S4.
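The distinction drawn above between alternatively targeted and truly species-specific proteins can be expressed as a pair of set operations. The sketch assumes the two USEARCH comparisons have already been reduced to sets of query identifiers with a match; the identifiers and function name are hypothetical.

```python
def classify_unique(candidates, plastid_hits, proteome_hits):
    """Split proteins lacking a plastid-targeted homolog in other species.

    candidates: IDs in one species' predicted plastid-targeted proteome.
    plastid_hits: IDs with a match in the other species' plastid-targeted sets.
    proteome_hits: IDs with a match in the other species' entire proteomes.
    Returns (alternatively_targeted, species_specific).
    """
    no_plastid_homolog = set(candidates) - set(plastid_hits)
    alternatively_targeted = no_plastid_homolog & set(proteome_hits)
    species_specific = no_plastid_homolog - set(proteome_hits)
    return alternatively_targeted, species_specific


alt, specific = classify_unique(
    candidates={"p1", "p2", "p3", "p4"},
    plastid_hits={"p1"},
    proteome_hits={"p1", "p2"},
)
```

In the toy input p2 has a homolog elsewhere that is not predicted to be plastid-targeted, so it counts as alternatively targeted, while p3 and p4 have no homolog in either comparison.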
GO term enrichment analysis was performed on the species-specific plastid-targeted protein sequences from both the UCLUST and USEARCH-based analyses. Comparisons were performed using Fisher's exact test with the entire predicted plastid proteome as a reference. This analysis revealed significantly enriched GO terms for both comparative techniques. In the USEARCH-based technique, the majority of enriched GO terms were found in the unique apple plastid-targeted sequences; the most significant were DNA metabolic process (GO:0006259, p-value 8.30E-60) and cellular macromolecule metabolic process (GO:0044260) (Table 6). Additionally, the GO term nucleic acid binding (GO:0003676) was enriched in poplar as well as apple. No significant GO terms were found for grape, strawberry or tomato. More GO terms were found to be enriched in the UCLUST-based comparison (Table 7). In this analysis, transcription regulator activity (GO:0030528) and transcription factor activity (GO:0003700) were enriched in the Arabidopsis and tomato datasets. DNA binding (GO:0003677) was enriched in apple, Arabidopsis and tomato. Cell death (GO:0008219) and death (GO:0016265) were also enriched in the apple UCLUST50 dataset.
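The enrichment comparison described above can be illustrated with a minimal sketch of a one-sided Fisher-type test for a single GO term. It assumes that the species-specific proteins are a subset of the reference plastid proteome, so that the p-value reduces to a hypergeometric tail. The function name `enrichment_p_value` and the counts are hypothetical, and the sketch is not the Blast2GO implementation, which may treat the reference set and multiple-testing correction differently.

```python
from math import comb


def enrichment_p_value(k: int, n: int, K: int, N: int) -> float:
    """One-sided (over-representation) p-value, P(X >= k).

    N: proteins in the reference set (e.g. a whole predicted plastid proteome)
    K: reference proteins annotated with the GO term
    n: proteins in the test set (e.g. the species-specific subset)
    k: test-set proteins annotated with the GO term
    """
    total = comb(N, n)
    upper = min(n, K)
    # math.comb returns 0 when the second argument exceeds the first,
    # so impossible tables contribute nothing to the sum.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Hypothetical counts, for illustration only (not taken from Tables 6 or 7).
p = enrichment_p_value(k=5, n=5, K=5, N=10)
print(f"p = {p:.5f}")
```

In practice one such test is run per GO term, so the resulting p-values would normally be corrected for multiple testing before applying a cut-off.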
While a significant amount of functional annotation was performed with Blast2GO, a large proportion of the proteins predicted to be unique to each species lack any associated GO term (Table 8). Proteins predicted to be unique to the plastids of Arabidopsis appear to be the best characterized, with 98% containing some form of GO information, followed by those of Prunus persica with 81.9%. Solanum lycopersicum displays the least amount of GO information, with only 56.5% of its uniquely plastid-targeted proteins having associated GO terms.
Figure 5. Molecular Function GO term composition of proteins shared between the 7 investigated plastid-targeted proteomes using two separate techniques. Two separate analyses were performed to identify proteins within the predicted plastid-targeted proteomes of seven species. The first, UCLUST with 50% identity, generated 15,750 clusters, 289 of which contained a member from all 7 species. USEARCH comparison performed at 40% identity and 40% coverage identified 737 sequences in the Arabidopsis thaliana putative plastid-targeted dataset which had a match in a protein sequence from the plastid-targeted sequences from all of the other 6 species. Sequences were analyzed via Blast2GO to determine the molecular functions of proteins at level 3. doi:10.1371/journal.pone.0112870.g005
Table 9. GO terms enriched in Arabidopsis thaliana members of the 289 plastid-targeted protein clusters shared between all species investigated.

Analysis of proteins conserved between all plastid-targeted proteomes
Two separate analyses were performed to identify proteins predicted to be targeted to the plastids of all seven angiosperms studied. The first workflow used a semi-global approach in which UCLUST [33] clustered the proteins of all predicted plastid proteomes at 50% identity. The second used a global approach in which the predicted plastid proteome of Arabidopsis thaliana was compared with every other species' predicted plastid proteome at 40% identity and 40% coverage. Arabidopsis proteins with a matching protein in all six other species were considered conserved across the plastid proteomes.
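The cluster-based definition of conservation can be sketched as follows. This is an illustrative example only: it assumes that cluster assignments have already been parsed into `(cluster_id, species, protein_id)` tuples, and the function name, species labels and protein identifiers are hypothetical. It does not reproduce the UCLUST output format.

```python
from collections import defaultdict


def classify_clusters(assignments, all_species):
    """Split cluster ids into conserved (every species present) and
    species-specific (members from exactly one species, singletons included)."""
    species_in_cluster = defaultdict(set)
    for cluster_id, species, _protein_id in assignments:
        species_in_cluster[cluster_id].add(species)

    required = set(all_species)
    conserved = sorted(c for c, s in species_in_cluster.items() if s == required)
    specific = sorted(c for c, s in species_in_cluster.items() if len(s) == 1)
    return conserved, specific


# Toy input with three hypothetical species and made-up protein ids.
toy_species = ["sp_A", "sp_B", "sp_C"]
toy_assignments = [
    (0, "sp_A", "a1"), (0, "sp_B", "b1"), (0, "sp_C", "c1"), (0, "sp_C", "c2"),
    (1, "sp_A", "a2"), (1, "sp_B", "b2"),
    (2, "sp_C", "c3"),
]
conserved, specific = classify_clusters(toy_assignments, toy_species)
print("conserved clusters:", conserved)
print("species-specific clusters:", specific)
```

Clusters that fall into neither list (cluster 1 in the toy data) contain members from some, but not all, species.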
The first analysis, using UCLUST at 50% identity, identified 289 clusters of proteins which had at least one member from all seven species. These 289 clusters contain 497 unique sequences from Arabidopsis thaliana, 773 from Malus × domestica, 384 from Vitis vinifera, 392 from Fragaria vesca, 545 from Populus trichocarpa, 439 from Prunus persica, and 478 from Solanum lycopersicum. Blast2GO analysis reveals that these proteins are involved in a large number of biological processes (Figure 4) and molecular functions (Figure 5). GO term enrichment was performed by selecting a single Arabidopsis thaliana protein sequence and utilizing agriGO to compare with those GO terms from the entire Arabidopsis thaliana predicted plastid-targeted proteome. This dataset contains 56 GO terms which were enriched with a p-value cut-off of 0.01 (Table 9). The lowest p-values are associated with the GO terms photosynthesis (GO:0015979), carbohydrate metabolic process (GO:0005975), macromolecule modification (GO:0043412), protein modification process (GO:0006464) and thylakoid (GO:0009579), respectively.
Analysis with USEARCH 40/40 identified 737 unique protein sequences from Arabidopsis thaliana with a matching protein in the predicted plastid-targeted proteomes of Solanum lycopersicum, Prunus persica, Vitis vinifera, Malus × domestica, Fragaria vesca, and Populus trichocarpa. As in the UCLUST50 analysis, the top three biological processes in this dataset as determined by Blast2GO are cellular component organization (GO:0016043, 254 proteins), response to stress (GO:0006950, 251 proteins), and carbohydrate metabolic process (GO:0005975, 210 proteins) (Figure 4). Again, the most populous molecular function GO terms mirror those in the UCLUST50 analysis, with the majority falling into six categories: organic cyclic compound binding (GO:0097159, 298 proteins), small molecule binding (GO:0036094, 215 proteins), transferase activity (GO:0016740, 164 proteins), protein binding (GO:0005515, 183 proteins), hydrolase activity (GO:0016787, 115 proteins), and nucleic acid binding (GO:0003676, 107 proteins) (Figure 5). GO term enrichment using agriGO identified 59 GO terms to be enriched with a p-value cut-off of 0.01 (Table 10). The lowest p-values are associated with the GO terms photosynthesis (GO:0015979), thylakoid (GO:0009579), macromolecule modification (GO:0043412), protein modification process (GO:0006464), and generation of precursor metabolites and energy (GO:0006091), respectively. Comparing the USEARCH40/40 dataset with the UCLUST50 dataset reveals that the two methods agree upon 439 shared plastid-targeted sequences, with 58 present only in the UCLUST50 dataset.
Proteins conserved in this study were further compared with those from GreenCut2. GreenCut2 represents a collection of 597 nuclear-encoded proteins determined to be conserved across 20 photosynthetic eukaryotes, but absent in non-photosynthetic organisms [56].
A total of 677 unique GreenCut2 loci from Arabidopsis thaliana (the original set of 710 contained 33 redundant entries) were compared, using Venny [57], with those identified as shared between the seven species examined in this study by both USEARCH 40/40 and UCLUST 50%. This comparison identified 70 proteins present in all three datasets, and a substantial set unique to each dataset. Sequence information from this comparison is provided in File S6.
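A three-way comparison of this kind amounts to partitioning the union of three identifier sets into the seven regions of a Venn diagram. The sketch below shows that partition with plain Python sets. The function name `venn3_regions` and the locus identifiers are hypothetical placeholders, and the code is not the Venny tool itself.

```python
def venn3_regions(a, b, c):
    """Return the seven mutually exclusive regions of a three-set Venn diagram."""
    a, b, c = set(a), set(b), set(c)
    return {
        "all_three": a & b & c,
        "a_and_b_only": (a & b) - c,
        "a_and_c_only": (a & c) - b,
        "b_and_c_only": (b & c) - a,
        "a_only": a - b - c,
        "b_only": b - a - c,
        "c_only": c - a - b,
    }


# Made-up locus identifiers standing in for the three datasets.
usearch_4040 = {"L1", "L2", "L3", "L4"}
uclust_50 = {"L2", "L3", "L5"}
greencut2 = {"L3", "L4", "L6"}

regions = venn3_regions(usearch_4040, uclust_50, greencut2)
for name, members in regions.items():
    print(name, sorted(members))
```

Because the regions are disjoint, their sizes sum to the size of the union, which is a convenient sanity check on any reported overlap counts.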
To examine the overlap of homologous proteins between each pair of species, the Jaccard similarity coefficient was determined for the USEARCH40/40 analysis (Table 11). This comparison shows that the plastid-targeted proteomes of apple, strawberry, Arabidopsis and poplar are most similar to that of peach, while the predicted plastid-targeted proteomes of peach, tomato and grape are most similar to that of Arabidopsis. An additional modified Jaccard similarity coefficient matrix was generated with the UCLUST50 analysis. In this matrix the predicted plastid-targeted proteomes of all species are most similar to that of peach, while that of peach is most similar to apple.
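For reference, the standard Jaccard similarity coefficient of two sets A and B is |A ∩ B| / |A ∪ B|. The sketch below computes a pairwise matrix under the assumption that each species is represented by a set of shared identifiers (for example, cluster ids). The species labels and ids are hypothetical, and the modified coefficient used for the UCLUST50 matrix is not reproduced here.

```python
def jaccard(a, b):
    """Standard Jaccard similarity: size of intersection over size of union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_matrix(families_by_species):
    """Pairwise Jaccard coefficients keyed by (species_x, species_y)."""
    names = sorted(families_by_species)
    return {
        (x, y): jaccard(families_by_species[x], families_by_species[y])
        for x in names
        for y in names
    }


# Toy data: each hypothetical species maps to the cluster ids it occupies.
toy = {"sp_A": {1, 2, 3}, "sp_B": {2, 3, 4}, "sp_C": {3}}
matrix = jaccard_matrix(toy)
for (x, y), value in matrix.items():
    print(f"{x} vs {y}: {value:.3f}")
```

The matrix is symmetric with ones on the diagonal, so the most similar partner of a species is the off-diagonal maximum in its row.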

Discussion
This study reveals several interesting aspects about the constitution of predicted plastid-targeted proteomes for the species analyzed. A large portion of a plant's nuclear genome is dedicated to plastid-targeted proteins, many of which lack identity to plastid-targeted proteins in other species. Some plastid-targeted proteins have identity with potentially alternatively targeted proteins in other systems. Of the predicted plastid-targeted proteins in Arabidopsis, 737 have significant identity to predicted plastid-targeted proteins in each of the six other investigated species, suggesting an evolutionarily conserved core set of plastid-targeted proteins. The caveat is that TargetP accuracy is not well defined, and has been shown to differ between experiments [58]. van Wijk and Baginsky determined that TargetP has a 35% false positive rate, suggesting that the predicted datasets established here will be greatly reduced upon future confirmatory experiments. Experimental approaches to characterize the proteome of plastids or some of their constituents have relied largely upon mass spectrometry techniques [59,60,61,62]. A study in 2004 focused on the isolation and classification of the constituents of the Arabidopsis thaliana chloroplast proteome, resulting in the identification of 604 nuclear-encoded proteins [60]. However, TargetP (at that time) was only able to correctly predict the plastid localization of 62.3% of these proteins, with 6.1% predicted to target the mitochondria, 8.1% predicted to be secreted, and 23.5% predicted to have "any other location". When excluding the envelope proteins, TargetP chloroplast localization accuracy increased to 67.2%.
Table 11. Jaccard's similarity coefficient matrix for seven species based on predicted plastid-targeted proteomes.
An additional study, which identified 241 stromal proteins from Arabidopsis thaliana chloroplasts through MALDI-TOF MS and nano-LC-ESI-MS/MS, found that a much higher proportion (88%) were correctly predicted to be chloroplast-targeted by TargetP [61]. Yet another study of 916 nuclear-encoded Arabidopsis plastid-targeted proteins revealed that 86% were correctly predicted using TargetP [63].
Such studies indicate that localization prediction methods need to be improved. Possible reasons for the variable prediction accuracy include the complexities of the experimental system, sequence data, dual targeting, splice variants, the presence of lesser characterized transport systems, or simply a lack of understanding of the mechanisms of localization. As genomes become better characterized and targeting prediction improves, our ability to understand the commonalities and diversities of plastid compositions and functions will also likely improve.
While it may be expected that plastid-targeted proteins would be highly conserved in this 7-species analysis, a previous study demonstrated a striking lack of similarity between the plastid proteomes of Arabidopsis and Oryza sativa [20]. In that study, a predicted Arabidopsis chloroplast proteome containing 2,100 proteins shared only 900 with the 4,800 plastid proteins in Oryza sativa. These 900 proteins were largely involved in transcription, energy, and metabolism. It would not be surprising to see this shared set of proteins shrink substantially as the number of species compared increases. A large focus has been put on the identification of proteins found only in photosynthetic organisms, termed the GreenCut [64]. The first draft of this protein set contained 349 proteins conserved in photosynthetic eukaryotes and absent in non-photosynthetic organisms. This was later updated, generating GreenCut2 with 597 conserved proteins [56]. The comparison of the conserved plastid-targeted proteins identified in this study by the UCLUST50 and USEARCH40/40 methods reveals that there is only a minor overlap with GreenCut2 (Figure 6). Additionally, clusters do not consist solely of members from either a single species or all seven. Instead, there are a substantial number of clusters in our analysis with members from various combinations of species. These clusters could be helpful in the identification of proteins involved in plastidial pathways or traits conserved within a set of species. The comparison of more closely related angiosperms may in fact yield more similar plastid proteomes, while including more distantly related angiosperms likely reduces this similarity. It is worth noting that these differences could potentially occur due to the loss or gain of chloroplast transit peptides, rearrangement of protein domains, or gene duplication, to name a few plausible mechanisms.
Differences in the outcome of comparative analysis projects can be attributed to the choice of alignment or clustering methods. Previous comparative proteomic and genomic studies in microbes have utilized alignments of 50% identity and 50% coverage to predictively resolve paralogs from orthologs [65,66], and 90% nucleic acid identity has been used in algae (Bayer et al., 2012). A previous study which compared plastid proteomes between Arabidopsis and rice utilized BLASTP with an e-value cut-off of 10xe−10 [20], while that of GreenCut utilized a BLASTP mutual best hit between Chlamydomonas and Arabidopsis and human to identify paralogs, co-orthologs, and orthologs of Chlamydomonas [64]. GreenCut2 further expanded upon these parameters, again utilizing mutual best hit analysis with a BLASTP cut-off of 10xe−10 to identify orthologs, and included sequences with over 50% amino acid identity as in-paralogs [56]. For this study, a predicted plastid-targeted protein was considered "unique" to a species if a global USEARCH alignment at 40% identity over 40% of the query sequence matched no sequences from the other 6 species. These parameters were chosen instead of 50% identity and 50% coverage as more matches were identified with a higher confidence with paralogous proteins removed. UCLUST, based on global alignments, was then utilized to identify clusters with 50% amino acid identity, presumably including both orthologs and in-paralogs [33]. Of course, the functionality of a protein cannot be ascertained through sequence identity alone. This study may not identify genes that have undergone rearrangement yet retain similar gene product functions, or examples of convergent evolution.
The assumption in this study is that if such rearrangements occur and functions are retained, or if divergent plants evolve the same function in separate gene sequences, an appropriate sequence from at least one of the six other species would retain sufficient sequence identity for a 40/40 alignment to occur. However, the datasets that we present could be further investigated in the future through the use of a BLASTP-based comparison, and with other genomes as they become available.
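The 40/40 criterion and the distinction between shared, alternatively targeted and species-specific proteins can be summarized in a short sketch. It assumes that alignment hits against the other species have already been parsed into records carrying percent identity, percent query coverage and a flag for whether the target is predicted to be plastid-targeted. The `Hit` record, the inclusive thresholds and the example values are all hypothetical, and this is not the USEARCH output format.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    target: str
    identity: float          # percent identity of the alignment
    query_coverage: float    # percent of the query sequence aligned
    target_is_plastid: bool  # target predicted to be plastid-targeted


def passes(hit, min_identity=40.0, min_coverage=40.0):
    """Assumed inclusive thresholds for the 40/40 criterion."""
    return hit.identity >= min_identity and hit.query_coverage >= min_coverage


def classify(queries, hits, min_identity=40.0, min_coverage=40.0):
    """Label each query by the best category supported by its passing hits."""
    status = {q: "species-specific" for q in queries}
    for hit in hits:
        if hit.query not in status or not passes(hit, min_identity, min_coverage):
            continue
        if hit.target_is_plastid:
            status[hit.query] = "shared plastid-targeted"
        elif status[hit.query] == "species-specific":
            status[hit.query] = "alternatively targeted homolog"
    return status


# Made-up hits against proteins from other species.
toy_hits = [
    Hit("q1", "t1", 55.0, 80.0, True),
    Hit("q2", "t2", 45.0, 60.0, False),
    Hit("q3", "t3", 90.0, 20.0, True),  # fails the coverage threshold
]
labels = classify(["q1", "q2", "q3", "q4"], toy_hits)
print(labels)
```

Raising both thresholds to 50 in this sketch would move `q2` into the species-specific category, which illustrates how the reported proportions depend on the chosen parameters.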
The unique plastid-targeted proteins within each of the investigated species possess varied levels of functional information. Despite the lack of spatial and temporal transcriptome and proteome expression context, this information has a large referential value for future work. Blast2GO analysis and GO term enrichment analysis provide a glimpse of the processes in which these proteins are likely participating within the plastids. However, only a few GO terms were found to be enriched, most of which differed between species. This suggests that there are likely no specific classes of proteins which fall into this category of "unique" for the plastid proteomes of each species. As expected, due to the substantial research performed on it, a large proportion of the proteins unique to the Arabidopsis plastid proteome (98%) possess associated GO terms (Table 8). However, tomato, which is used in the scientific community to understand and characterize the chloroplast to chromoplast transition, has the lowest percentage of unique proteins with GO terms, at 56.5%. Our analysis reinforces the lack of understanding about plastid biology, especially in non-model systems, and the urgent need for further functional characterization of novel biological processes that these organelles harbor.
Figure 6. Comparison of conserved plastid-targeted protein datasets with GreenCut2. Arabidopsis thaliana loci associated with conserved proteins within the plastid-targeted proteomes identified in this study were compared with those of GreenCut2. A total of 70 sequences were found to be conserved between the three datasets. doi:10.1371/journal.pone.0112870.g006
Plastids play an integral part in plant development, photosynthesis, and several other known biochemical processes. However, their role in fruits has remained uncharacterized. Several important biochemical processes for the synthesis and storage of pigments, nutraceutical and medically important compounds, as well as aromatic compounds, reside in fruit plastids. These components are important both for consumer appeal and for nutritional value. In addition, various physiological disorders, such as sunscald in apple, are associated with the inability of fruits to adequately quench excessive energy from sunlight. By furthering our understanding of plastid function in non-model plant systems and organs such as fruit, novel mechanisms for enhancing photosynthetic efficiency and crop productivity could be discovered.
The results presented in this work indicate that the current state of knowledge regarding plastid biology, mostly derived from model systems, is not comprehensive enough. In each plant species evaluated in this work, plastids are predicted to host a plethora of biological and metabolic processes, necessitating subsequent wet-lab validation in non-model systems. New plant genomes are expected to enable the identification of potentially new plastid-targeted proteins that will aid in studying novel roles of plastids in plant development, metabolism and adaptation.

Conclusions
While previous studies have advocated the integration of multiple protein localization prediction techniques [20], it appears that no significant difference exists between a custom analysis utilizing multiple approaches and TargetP analysis for identifying plastid-targeted proteins in apple. Such results suggest that initial data mining with only TargetP may, in fact, be sufficient, depending upon the application. This is with the caveat that TargetP has an approximate 35% false positive rate in detecting plastid-targeted proteins [58]. As the understanding of protein localization improves, and complete genome sequences from a larger number of plants become available, such predictive techniques will likely become a more reliable method to generate draft plastid proteomes.
The TargetP-based analysis indicates that a large subset of a plant's nuclear-encoded proteome is predicted to be localized to the plastid. However, the proportion of transcripts encoding plastid-targeted proteins varies, comprising 10-20% of the transcriptome depending upon the species investigated. Of the nuclear-encoded plastid proteome, there appears to be a substantial subset that is species-specific. Many of these proteins have homology to proteins not predicted to be plastid-targeted in other systems, indicating that it may be common for proteins to gain or lose targeting peptides during evolution. If this is indeed the case, it would be interesting to investigate the evolutionary and mechanistic context of the gain or loss of targeting peptides across species.
Using two comparative methods, a USEARCH-based approach as well as a semi-global UCLUST-based approach, we showed that very few plastid-targeted proteins are conserved between the predicted nuclear-encoded plastid proteomes. This value varied based upon the comparative technique and user parameters, but our predictions identify 497 and 737 Arabidopsis proteins which have predicted plastid-targeted homologs in all other examined angiosperms. GO term enrichment analysis suggests that specific functions are significantly conserved in plastids, namely photosynthesis, many metabolic processes, transport, and cell death. Knowledge about these conserved proteins can be utilized in future studies to better understand and potentially predict those proteins which are plastid-targeted in other non-model systems, and additionally to identify novel plastid-targeted proteins.
The expression of genes encoding plastid proteins appears to be very diverse within the fruit developmental continuum, with 64 significant expression patterns detected (those containing five or more genes). However, most of the genes investigated can be clustered into nine expression patterns. These expression patterns can be overlaid with important milestones in the development of fruit to find plastid proteins which may be responsible for novel fruiting milestones or processes. While expression data are available for ∼13,000 genes, subsequent developmental transcriptomics, metabolomics, and proteomics investigations are expected to provide a comprehensive understanding of the roles of plastids in apple fruit development.

Supporting Information
File S1 Expression data from Janssen et al. for Malus × domestica genes predicted to encode plastid-targeted proteins. Expression data for each cluster of co-expressed genes with sequence header and expression data in the form of Log2 of expression relative to the lowest value.

(XLSX)
File S2 GO term information for Malus × domestica plastid-targeted protein clusters.

(XLSX)
File S3 Header information for predicted plastid-targeted proteins from the investigated seven species. Excel file containing identifiers of predicted plastid-targeted protein sequences generated through TargetP analysis.

(XLSX)
File S4 Plastid-targeted proteins unique to each examined species with GO term information. Excel file containing the header information and associated GO terms.

(XLS)
File S5 Plastid-targeted proteins shared between all species. Excel file containing the header information for the UCLUST50 analysis separated by species. Information is provided for the 737 Arabidopsis thaliana sequences shared in the USEARCH comparative analysis.

(ZIP)
File S6 Blast2GO annotation file containing header and GO term information for 70 protein sequences present in GreenCut2, UCLUST50, and USEARCH40/40 analyses.

(ZIP)