Ribosome profiling reveals changes in translational status of soybean transcripts during immature cotyledon development

To understand translational capacity on a genome-wide scale across three developmental stages of immature soybean seed cotyledons, ribosome profiling was performed in combination with RNA sequencing and cluster analysis. Transcripts representing 216 unique genes demonstrated a higher level of translational activity in at least one stage by exhibiting higher translational efficiencies (TEs) in which there were relatively more ribosome footprint sequence reads mapping to the transcript than were present in the control total RNA sample. The majority of these transcripts were more translationally active at the early stage of seed development and included 12 unique serine or cysteine proteases and 16 2S albumin and low molecular weight cysteine-rich proteins that may serve as substrates for turnover and mobilization early in seed development. It would appear that the serine proteases and 2S albumins play a vital role in the early stages. In contrast, our investigation of profiles of 19 genes encoding high abundance seed storage proteins, such as glycinins, beta-conglycinins, lectin, and Kunitz trypsin inhibitors, showed that they all had similar patterns in which the TE values started at low levels and increased approximately 2 to 6-fold during development. The highest levels of these seed protein transcripts were found at the mid-developmental stage, whereas the highest ribosome footprint levels of only up to 1.6 TE were found at the late developmental stage. These experimental findings suggest that the major seed storage protein coding genes are primarily regulated at the transcriptional level during normal soybean cotyledon development. Finally, our analyses also identified a total of 370 unique gene models that showed very low TE values including over 48 genes encoding ribosomal family proteins and 95 gene models that are related to energy and photosynthetic functions, many of which have homology to the chloroplast genome. Additionally, we showed that genes of the chloroplast were relatively translationally inactive during seed development.

Introduction Gene expression is regulated at multiple points such as transcriptional, post-transcriptional, translational, and post-translational levels. Although translation determines the proteome, translational regulation in plants is less well understood compared to other regulatory steps such as transcription and post-transcription. The regulation of gene expression, including at the translational level, is crucial to ensure specific proteins are expressed at the appropriate times and levels in response to genetic and environmental stimuli [1][2][3]. Thus, the understanding of translational regulation is a major focus in recent years [4][5][6]. In higher plants, translational regulation plays significant roles in the different developmental processes that control the expression of developmental and stage-specific as well as tissue-specific gene products [7,8].
Genome-wide analyses of gene expression quantify the abundance of mRNA either by microarray or, more recently, by RNA sequencing. However, neither approach provides information on translation of mRNA into protein. Ribosome profiling is a recently developed technique for studying the regulation of gene expression at the translational level [4,9]. This approach is based on high throughput sequencing of ribosome protected mRNA fragments and determines the exact position of ribosomes on mRNA. Generally, transcript abundance is used as the indicator for the gene expression measurement. Sometimes, there is a poor correlation between mRNA abundance and protein levels which is partially due to the translational regulation [6,10]. Whole-proteome mass spectrometry is the direct and powerful approach to measure the changes in protein abundance, but this method can detect only a fraction of protein products in the cell [6]. Ribosome profiling and mass spectrometry are highly complementary approaches to study gene regulation at the translational level. However, ribosome profiling itself allows mRNA abundance and protein translation to be examined in the same sample with high accuracy. One of the advantages of this technique is the measurement of translational efficiency (TE) which is calculated using normalized mRNA abundance and ribosome footprint abundance [4,6]. Higher translational efficiency (TE) values indicate the greater potential of mRNA to be translated into protein.
The maturation phase of soybean seed development has been broadly classified into three major stages. These represent early maturation seed (25-50 mg fresh weight, green seed), midmaturation (100-200 mg fresh weight, green seed), the stage when the biosynthetic capacity of the seed is maximal and proteins and oils are accumulated at a high rate, and late maturation (300-400 mg fresh weight, yellow seed) when the seed are undergoing dehydration and desiccation [11,12]. Different classes of seed storage proteins, such as glycinin, conglycinin, lectin, and trypsin protease inhibitors, accumulate to high levels during these developmental stages [12][13][14][15][16]. Changes in seed developmental stages are accompanied by changes in gene expression as revealed by transcript profiles of soybean genes [11,12] and their post-transcriptional regulation by small RNAs [17]. So the investigation of changes in the translational status or capacities of transcripts during soybean seed development would add an additional component toward dissecting gene regulatory networks.
Ribosome profiling is an emerging technique which allows us to study the translational potential of all genes during soybean seed development. Sequence information obtained by ribosome profiling can be aligned to the predicted gene models (Glyma models) from the current Glycine max reference genome [18], followed by transcript quantification and annotation. TEs calculated from transcripts bound to the ribosomes versus total mRNA transcript levels can be utilized to compare the relative translational values of each gene. Here we used cluster analysis of ribosome profiling data across three developmental stages to determine which genes changed in TE values and thus, their translational capacities, across three stages of soybean seed development.

Plant materials
Soybean (Glycine max cv. Williams) plants were grown in a greenhouse and seeds were collected at different developmental stages including early maturation (green 25-50 mg fresh weight seed), mid-maturation (green 100-200 mg fresh weight seed), and late maturation (yellow, 300-400 mg fresh weight seed) as the cotyledons are in the process of desiccation. Immediately, cotyledons and seed coats were separated by dissecting whole seeds and then frozen in liquid nitrogen. Subsequently the tissue was stored at -80˚C.

Preparation of plant cell extracts
A 1 g amount of frozen seed tissue was ground in liquid nitrogen with a mortar and pestle. Multiple beans were used to avoid sampling variation. An aliquot of 100 mg of powdered sample in a microfuge tube was kept frozen at -80˚C for total RNA isolation. The remaining powdered sample was homogenized in 10 mL of polysome extraction buffer (PEB) [200mM Trisacetate pH 8.0, 200 mM sucrose, 200 mM KCl, 10 mM MgCl 2 , 10 mM 2-mercaptoethanol, 2% (v/v) polyoxyethylene (10) tridecyl ether, 1% (v/v) Triton X-100] [3,4,19]. The suspension was filtered through 60 μm nylon mesh and centrifuged at 12,000 rpm for 15 min at 4˚C in a JA-17 rotor (Beckman) to remove cell debris. The supernatant was used in the next step for the nuclease digestion.

Nuclease digestion
At this point, 300 μl of supernatant was transferred to a new tube. Subsequently, 25 μl of CaCl 2 and 30 μl of micrococcal nuclease (equivalent to 750 units) were added to the supernatant and the reaction mixture was incubated for 1 h at room temperature with gentle rotation of the microfuge tube. In order to stop the digestion reaction, 15 μl (20 U/μl) of SUPERaseIn was added. The samples were kept on ice while preparing for the next steps.

Purification of ribosome protected mRNA fragments
Ribosome protected mRNA fragments were purified by the Sephacryl S400 spin column (GE Healthcare Life Sciences) chromatography according to the ribosome profiling kit (Epicentre) manuals. This is a rapid and simplified size exclusion spin-column method to isolate monosomes. The Sephacryl S400 columns are usually supplied in a buffer and the resin was re-suspended by inverting the tube several times. Then it was equilibrated by passing through 300 μl of 1X polysome buffer under gravity flow. Subsequently, 100 μl of nuclease-treated sample was applied onto the column and was centrifuged for 2 minutes at 3000 rpm and the flow through was collected. The ribosome footprint extraction was performed by adding 600-700 μl of RNA extraction buffer and 500 μl of phenol-chloroform. Then the total mixture was centrifuged at 14,000 rpm for 5 minutes at 4˚C to separate the phases. The aqueous phase was collected into a fresh 1.5 ml tube and one volume of Sevag (24:1 by volume of chloroform and isoamyl alcohol) was added, shaken vigorously, and centrifuged at 14,000 rpm for 5 minutes at 4˚C. After the centrifugation, the aqueous phase was placed in a fresh 1.5 ml tube and 2 μl of glycogen, 1/10th volume of 5 M sodium acetate, and 1.5 volumes of ethanol was added. The reaction mixture was stored at -20˚C for at least one hour before centrifugation at 14,000 rpm for 20 minutes to pellet the ribosome footprint fraction. Finally, the ribosome footprint fraction was re-suspended in nuclease-free water and the UV absorbance was taken with the Nanodrop instrument for the next steps of the experiment.

Control RNA fragment preparation for sequencing
Total RNA was extracted from a 50 mg aliquot of powdered sample using the phenol-chloroform extraction method as described above. The extracted RNA was fragmented using fragmentation buffer (80 mM Tris-OAc, pH~8.1, 200 mM KOAc, 60mM Mg(OAc) 2 ). Then, 20 μl of 2X fragmentation buffer for 20 μl of RNA sample was added. The reaction mixture was incubated for 35 minutes at 95˚C. After the incubation, the reaction was stopped by adding 4 μl of 0.5M EDTA.

Selection of desired fragments
At this point, both the ribosome footprint samples and total RNA samples were analyzed with a 15% TBE (Tris-Borate-EDTA)-urea polyacrylamide gel (Invitrogen). The upper (34 nt) and lower (26 nt) oligonucleotide markers (S1 Table) were used to select the fragments in the range of 26 to 34 nt. Then, the ribosome footprints and RNA fragments were extracted from the gel using the standard protocol [5,6].

Ribosome footprints and total RNA sequencing library preparation
There were multiple distinct steps for preparation of the sequencing libraries as previously described [4,6,9]. Briefly, Ribo-Zero™ rRNA removal kits (Epicentre) were used to reduce rRNA. Both fragmented RNA and ribosome footprint samples were dephosphorylated using T4 polynucleotide kinase (New England Biolabs). After the addition of 1 μl of T4 kinase buffer and 1 μl of T4 polynucleotide kinase, the mixture was incubated for 60 min at 37˚C and then inactivated for 10 min at 70˚C. Dephosphorylated fragments were then purified by precipitation as described above. Preadenylated and 3'-blocked linker (1/5rApp/CTGTAGGCACCA TCAAT/3ddC/) (S1 Table) was ligated to the ribosome footprints and RNA fragments. The desired band was excised from the gel and the selected fragments were extracted from the gel.
Subsequently, reverse transcription was performed using the following RT-primer ((Phos)-AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGC(SpC18)CACTCA (SpC18)TTCAGACGTGTGCTCTTCCGATCTATTGATGGTGCCTACAG) as indicated in [5]. The designation (Phos) represents 5 0 phosphorylation, and (SpC18) represents a hexa-ethyleneglycol spacer. After the reverse transcription, the mixture was loaded onto a 15% TBE-urea polyacrylamide gel. The use of the RT-primer as marker facilitated the identification of extended products. These extended products were excised, extracted from the gel, and precipitated as previously described [5,6]. The circularization of gel extracted constructs was performed using CircLigase II (Epicentre). Circularized constructs were used as a template for the final library amplification step using the forward primer (5'-AATGATACGGCGACCACCGA-GATCTACAC-3') and one of the seven indexed reverse primers in S1 Table. Final amplification reactions were performed using 30 s denaturation at 98˚C followed by cycles of 10 s denaturation at 98˚C, 10 s annealing at 65˚C, and 5 s extension at 72˚C. Amplification reactions were carried out for 12 cycles as previously described [5,6]. The products were separated on a 10% non-denaturing TBE polyacrylamide gel. The desired band at~175 nt was cut from the gel and precipitated with sodium acetate and ethanol using standard procedures. trimmed sequences were mapped to all 88,647 high and low confidence soybean gene models from the current assembly Wm82.a2 from Phytozome (Joint Genome Institute) of the Williams 82 genome [18]. These also include all splice variant predictions for 56,044 unique gene models. The ultrafast Bowtie v.1 aligner [20] was used for alignments without any mismatches allowed and with up to 25 alignments to the genome allowed using the following command, v 0 m 25. The most conservative approach is not to allow any mismatches since the short fragment size is predominantly in the range of 26-33 nt. Sequencing reads that aligned to the rRNA sequences of 132 gene models that were falsely annotated by Phytozome as protein containing Glyma models within the nucleolar rDNA region on chromosome 13 were noted. These false annotations result from protein fragments recognized as domains and those reads that matched to the rDNA containing Glyma models were removed from further analysis including determination of the number of mapped reads which yielded 88,515 Glyma cDNA models that were compared. Total hit counts to the Glyma cDNAs from the Bowtie v.1 output were subsequently analyzed by the DESeq package [21] to obtain p-values comparing differential expression between the control and footprint samples with two replications. Hit counts of 0 were converted to 1 to avoid losing data in the DESeq analysis. Ribosome profiling and RNA sequencing data were further normalized in reads per kilobase (kb) of gene model per million mapped reads (RPKM) [22] and values were averaged for the two replicates. Alignments to the 152 kb chloroplast genome (NCBI accession DQ317523.1) were conducted with Bowtie v.1 as for alignments to the coding regions of the nuclear genome with no mismatches allowed. RPKMs were calculated in the same way using the number of total reads from each library that were mapped to the 88,515 nuclear gene models as the normalization factors. Trimmed reads from libraries were also aligned to the entire soybean genome Wm82.a2 from Phytozome (Joint Genome Institute) of the Williams 82 genome [18] as before using Bowtie v1 but with output in SAM format. The SAM format was converted to the BAM format and sorted by Samtools [23]. The Integrative Genomic Viewer (IGV) was used to convert from BAM to TDF format and visualize normalized [count at base × one million / total number of reads] data [24]. Raw sequencing data are available in the SRA (short read archive) of the National Center for Biotechnology Information under accession number SRP055880 and series GSE66580 in the GEO database.

Analysis of translational efficiency (TE)
TE is defined as the number of normalized footprint reads divided by the number of normalized mRNA sequence reads in the form of RPKM [6,9]. TE values were calculated for all the genes across the three seed developmental stages. Gene models were filtered based on their TE values, p-values, and RPKMs as described in the Results section. A higher TE value represents greater potential of mRNA for translation [6,10]. Here we refer to those gene transcripts with TE values >1 as having high values with greater translational potential and those with TE values <0.1 are referred to as having very low TE values and being more translationally silent. The selected genes with high or very low TEs during early, mid and late developmental stages were analyzed by cluster analysis.

Gene model annotations and clustering
Gene model annotations were obtained from PFAM of the Soybean Genome Project in the Phytozome database [25]. Gene models were grouped into families by manual inspection using these annotations. Clustering of gene models with either high or very low translational values in at least one stage of development was performed by the Multi-Experiment Viewer (MeV) [26]. The k-means clustering technique was used with the Pearson correlation distance metric and 5 x 10 6 maximum iterations. The correlation coefficients of biological replicates were calculated to show the reproducibility of the ribosome profiling as well as the control RNA sequencing libraries.

Results
Developmental stage-specific ribosome profiling libraries, sequencing, and sequence analyses Ribosome profiling and RNA sequencing libraries were constructed using soybean cotyledons from three different seed developmental stages of early, mid, and late maturation seed. We dissected cotyledons away from the seed coat for these three stages that represent major milestones of seed development such including nutrient accumulation, storage protein synthesis, and desiccation.
High-throughput next generation sequencing of these libraries was performed on the three stages of soybean seeds, with two biological replicates, resulting in 49 to 113 million reads per sample (S2 Table). These reads were aligned to 88,515 soybean gene models determined by the Soybean Genome Project [18] using the program Bowtie v1 with no mismatches allowed [20]. Read length distributions were found to be very similar between replicates. The read lengths for the control RNA sequencing libraries (S1 Fig Libraries were reproducible as the Pearson correlation coefficient (r) ranges from 0.8 to 0.97. Data were also aligned to the complete soybean reference genome and displayed with the Integrative Genomics Viewer to assess the quality of the footprint libraries. Fig 1 illustrates the base coverage of selected gene models representing various soybean seed protein genes. The reads aligned to exonic regions only as shown for the conglycinin gene model which has six exons. The footprint libraries also had greatly reduced alignments in the 5' UTR and 3' UTR regions as expected. It is also apparent that some gene model calls contain excess 5' UTR and 3' UTR regions that are not covered by the control RNA as illustrated for the lectin and trypsin inhibitor Glyma models. In an independent procedure, the numbers of Bowtie v. 1 read alignments to the mRNA transcripts of various gene models were also counted for the 5' UTR, coding sequence (CDS) and 3' UTR regions as shown in Table 1. The reduction in alignments for the 5' and 3' UTR regions was substantial for each of the seven genes examined indicating that the footprint reads were predominantly located in the CDS regions of the mRNA transcripts examined.

TE values for the abundant seed storage protein transcripts gradually rise during development after their peak mRNA abundance
Soybean seeds produce several classes of highly abundant storage proteins. The PFAM database annotations for the soybean gene models and all splice variants were filtered by keywords to assemble lists of gene models annotated as in the cupin superfamily and then were verified as soybean seed storage proteins such as glycinin, conglycinin, lectin, and Kunitz trypsin protease inhibitor. These seed storage proteins are the major constituents of soybean seed cotyledons that impart the value of soybean as a high protein crop [27]. We analyzed the expression profiles of these 19 storage protein genes including mRNA abundance, ribosome footprint abundance, and translational efficiency during three stages of seed development (Fig 2 and Table 2). Most of the seed storage protein genes are consistently expressed at much higher levels than most other genes, as expected. Interestingly, whereas seed storage protein transcripts showed their highest level of mRNAs at the mid stage of cotyledon development as expected, their highest ribosome footprint levels were found exclusively at the late stage of seed development. Thus, the proportion of ribosome bound transcripts gradually increases during development for all 19 seed protein genes (Fig 2A and Table 2). A representative gene model from each group of seed storage proteins with their relevant expression profiles are graphed in Fig 2B. The seed lectin gene (Glyma.02G012600.1) is the only lectin family member expressed during seed development and results from expression of a single gene [28,29]. Its mRNA abundance sharply increases from 137 RPKMs in the early stage to 455 RPKMs in the mid stage and then it decreases to 377 RPKMs in the late stage ( Fig 2B and Table 2). But in the ribosome footprints data, the peak abundance, 498 RPKMs, was found in the late stage. There is not much difference in the translational efficiency (TE) of the seed lectin gene across development, ranging from 0.44 to 1.3. The highest level of relative translational efficiency was found at the late stage which indicates the attachment of more ribosomes to the lectin mRNAs at this stage compared to the other earlier two stages. Glycinin and conglycinin are the two most abundant seed storage proteins and there are multiple members of this family. For a representative glycinin (Glyma.03G163500.1), the peak mRNA abundance was 1049 RPKMs found at the mid stage, whereas a peak ribosome footprint abundance of 1102 RPKMs was found in cotyledons of the late stage. The translational efficiencies of this particular glycinin range from 0.47 to 1.19. Similarly, the representative conglycinin (Glyma.10G246300.1) gene showed the highest mRNA level of 1896 RPKMs at the mid stage, whereas peak footprint abundance was found at the late developmental stage. The translational efficiencies of this conglycinin range from 0.35 to 0.87. We found a similar expression pattern in terms of mRNA abundance, ribosome footprint abundance, and translational efficiencies for other seed storage proteins such as the Kunitz trypsin protease inhibitors. As shown in Table 2, only two transcripts, one for a glycinin family member and another for a Kunitz trypsin inhibitor, passed both our pval <0.05 and TE> = 1 filtering criteria and only in the late stage C300 as shown in the complete data in S1 File. The cDNA transcripts derived from the indicated gene models from the soybean genome were used as individual references to which libraries from the C300 stage of 300-400 mg immature seeds were aligned. The specific libraries used were the control library C300CT (RP3001C) compared to its footprint library C300FP (RP3001T). The non-normalized total numbers of trimmed and processed reads aligning to either the 5' or 3' UTR regions compared to the protein coding sequences (CDS) of the indicated gene model are shown. Ã represents the ratio of the C300FP/C300CT read counts normalized relative to the CDS region as 1.00 and illustrates the dramatic reduction in the representation of the 5' and 3' UTR regions in the footprint libraries. https://doi.org/10.1371/journal.pone.0194596.t001

Transcripts showing dynamic changes in TE values during development
Both RNA sequencing and ribosome footprint reads were normalized and presented in reads per kilobase of gene model per million mapped reads (RPKMs) [22]. Translational efficiency (TE) was calculated by dividing normalized ribosome footprint reads by the normalized RNA sequencing reads in the form of RPKM [6,10]. TE value is the indicator of how well a particular transcript is translated. Higher TE values indicate more ribosomes are bound to the mRNA and hence there is a higher chance of mRNA translation to synthesize proteins [6,10]. The complete set of ribosome profiling data (S1 File) shows the level of normalized RNA reads, footprint reads and translational efficiency of all soybean annotated genes averaged for both biological replicates and the p-values resulting from DESeq analysis of hit count data from both replicates. Gene models were filtered based on P-values (PVal), translational efficiencies (TEs) and ribosomal footprints (FP) data. Data clustering was also performed using the average of RPKM values from Biological Replicates 1 and 2. In order to obtain transcripts with higher translational potential in different seed developmental stages, all the genes were filtered to retain gene models with Pval<0.05; TE>1 and FP_RPKM> = 1for each of the developmental stages separately. This filter resulted in a final set of 216 unique gene models (excluding alternative splice variants and 13 Glyma models with abnormally high RPKMs), whose transcripts had high TE values in at least one of the developmental stages (Table 3, and the full list in S2 File). We then filtered all the genes with Pval<0.05; TE< = 0.1 and CT_RPKM> = 10 to obtain genes with the more extreme values in which the total mRNA levels were substantially higher than the ribosome footprint values (i.e., very low TE values) during the three stages of development. This resulted in a list of 370 unique gene models with very low TE values that are more translationally silent during at least one of the different developmental stages (Table 3 and S3 File).

Transcripts with higher translational capacity in at least one stage of cotyledon development
The transcripts with higher TE values shown in the S2 File by stage were then grouped into clusters with the Multi-Experiment Viewer (MeV) [26] as shown in the S3 File. Representative clusters are shown in Fig 3 with the annotations of the most abundant types represented in each cluster. The representative cluster in Fig 3A contains 108 unique gene models with high TE values ranging from 1.1 to over 43 during the early stage of seed development. In this cluster, ten different gene models, excluding splice variants are annotated, as proteases or peptidases. Other annotation groups in this cluster include a group of proteins with the PFAM annotation "bifunctional inhibitor/lipid-transfer protein/seed storage 2S albumin superfamily protein." Six transcripts with annotations as gibberellin-regulated proteins and seven various members of the flavonoid pathway stand out as having multiple members along with four transcription factors. The other 37 transcripts were dispersed among many annotation groups each with only one or a few members and another 35 transcripts have unknown functions.
The representative cluster in Fig 3B with 22 unique gene models represents genes that peak in the mid-developmental stage. These transcripts had TE values ranging from 1.02 to over 7.5. Most categories contained only one or at most a few genes. The over-represented annotation groups in this cluster are aldolase-type TIM barrel family protein, GDP-L-fucose pyrophosphorylase, and xyloglucan endo-transglucosylase. Using the same filtering criteria as in the early and mid-stages of development, we identified only 53 transcripts with peak TE values at the late developmental stage that fell into three different clusters. The representative cluster in Fig 3C contains 35 unique gene models with TEs ranging from 1.2 to over 10. Of these, we found the most prominent groups of genes are annotated as alcohol dehydrogenase, aquaporin-like superfamily protein, peroxidases, and eukaryotic aspartyl protease family protein.
Clearly the early C25 stage, and in particular Cluster4, contain the largest number of unique genes that have high TE values. We then searched the entire set of 216 unique transcripts with high TE values in at least one of the three developmental stages (S2 File) for transcripts with abundant annotation classes as protease and peptidases. Table 4 shows that 12 protease and peptidase genes are represented with high TE values in the C25 stage and Fig 4A and 4B show the base coverage for two of these genes, Glyma.06G022600.1 encoding a subtilisin-like serine protease and Glyma.09G249500.1 encoding a serine carboxypeptidase. Both of these have relatively low level expression of approximately 1.3 RPKM in the control RNAs but significantly increased levels of footprint RNAs leading to TE values of 25.5 and 3.4, respectively. A second group of three proteases are represented in the later C300 stage and were more abundantly expressed than those in the early C25 stage.
Interestingly, there are a total of 16 transcripts with annotations as either 2S albumins/lipid transfer proteins or low molecular weight (LMW) cysteine-rich proteins that also presented high TE values in the early C25 stage. These two classes of proteins have been shown to serve as small molecular weight storage proteins in seeds [27]. One Kunitz trypsin inhibitor and one glycinin are the only two of the 19 major seed storage proteins that met the filtering and pvalue criteria as shown in Table 2 previously.
Our data presented in Fig 3 and Table 4 suggest that there may be two phases in the appearance of storage proteins and degradative enzymes in soybean. One is represented by the low molecular weight (LMW) cysteine-rich or 2S albumin proteins and by many serine peptidases and cysteine proteinases in the 25 to 50 mg early maturation seed that have higher TE values. These all have relatively low expression levels. The other phase at the later C300 stage at the 300-400 mg weight range contains some of the more highly expressed proteases but only a few of the major storage proteins. As shown in Fig 2 and Table 2, the abundant storage proteins gradually increase in TE value by approximately 2 to 6-fold from early C25 to the late C300 stages, but most of them do not pass the <0.05 p-value cutoff when comparing the footprint to control RNA-Seq RPKMs, possibly because of the overall very high expression levels of these genes. Table 5 shows three other major categories found in the transcripts with high TE values including gibberellic acid-regulated genes, flavonoid pathway genes, and four transcription factor genes that are found only in the early C25 stage. They include four gibberellin regulated transcripts and two Glyma models annotated as GAST (gibberellic acid stimulated transcript) protein homolog 1. The occurrence of gibberellic acid is generally biphasic with one peak in early seed development promoting cell growth and expansion and another at a later stage of maturation that induces proteases and germination processes [30]. The flavonoid pathway transcripts include two members for the flavonone 3-hydroxlyases, dihydroflavonol 4-reductases, and leucoanthocyanidin deoxygenases (Table 5). Four transcription factors including two MADS box and two NMR-like negative transcriptional regulators are also represented. These four Glyma models were the only transcription factors present in the complete list of 216 genes with high TE values in at least one stage of development that passed the p-value <0.05 and with FP value >1 RPKM.

Transcripts with very low TE values in cotyledons
We then examined the genes with much lower ratios of ribosome bound transcripts versus total mRNA (ie.,very low TE values) by selecting those gene models with Pval<0.05; TE< = 0.1 and CT_RPKM> = 10. We applied the same filtering criteria separately for each developmental stage. We found a total of 302, 218, and 244 primary transcripts in the early, mid and late developmental stages, respectively, (Table 3 and S4 File). These genes of each developmental stage were separately grouped into S5 File with the Multi-Experiment Viewer (MeV) [26].
In contrast to transcript with high TE values, it was clear that the very low TE set of genes shared major annotation groupings in most clusters (S4 File). A representation of the annotation groupings for the entire set of 370 unique transcripts are shown in the pie chart of Fig 5. The largest annotation grouping was clearly the ribosomal proteins at 12% followed by photosynthesis related genes at 9%, including photosystem II reaction center proteins, photosystem I, and PsaA/PsaB proteins, then ATP Synthase/ATPases (6%), NADH dehydrogenases (6%) and translation/RNA binding transcripts at 4%. These annotation groupings included genes that might be involved in chloroplast and mitochondrial function, as for example the photosystem reaction center proteins, ATPases, the NADH dehydrogenases, and ribosomal proteins. In order to determine Glyma models that have high sequence similarity to the plastid genes, we performed nucleotide Blast alignments of these  370 unique Glyma models against the 152 kb soybean chloroplast genome (NCBI accession DQ317523.1) or the 450 kb soybean mitochondrial genome (NCBI accession JX463295) and flagged the Glyma models that had 90% identity over at least 75% of their coding sequence length (S6 File). Table 6 shows a summary of these results. Of the 48 genes that were annotated as encoding ribosomal proteins, 18 were related to the chloroplast genome and 2 to the mitochondrial genome. The remainder shown in Table 7 represents 28 ribosomal protein genes encoding different structural components of the ribosomes and 14 other Glyma models potentially related to translational processes including several polyA or RNA binding proteins and three family members of the TMA7 (translation machinery associated) proteins. All of these clearly nuclear genes had very low TE ratios indicating that they were relatively silent translationally.
Of the 95 genes related to energy/plastid functions, a total of 47 showed significant sequence similarity to the chloroplasts and 17 to the mitochondrial genome and two models had similarity to both organelle genomes (Table 6). Full data for these are shown in S6 File. To determine the profiles of the chloroplast transcripts, we conducted alignments with the original data sets directly against the 152 kb chloroplast genome (NCBI accession DQ317523.1) with no mismatches allowed. As shown in S7 File for all of the coding sequence regions of the chloroplast genome, the highest expression in the control RNA for the CT25 stage was found for the large subunit of the ribulose 1, 5 bisphosphate carboxylase/oxygenase (rbcl) gene with a normalized value of 1411 RPKMs. Many of the photosystem gene transcripts also had relatively high expression levels followed by some of the ribosomal protein genes, then the ATPases, and the NADH oxidoreductases at much lower levels. The normalized RNA footprint levels were consistently lower for all the chloroplast protein genes resulting in very low TE values. All three development stages of the cotyledons exhibited the same general pattern for both the control RNA levels and footprint levels.

Discussion
There are only a very limited number of studies on ribosome profiling of transcripts on a genome-wide scale in plants, primarily of the model plant Arabidopsis [1][2][3]31] with very few reports on agronomic crops as maize [32] or legumes [33] and none from soybean or of seed development in any plant. In order to determine the relative translational capacities of transcripts in seed development, we used ribosome profiling experiments of three stages of immature soybean cotyledons. By clustering the TE values into patterns, we found that the TE values varied dynamically for transcripts during seed development. However, the vast majority of soybean genes, including those encoding the major storage proteins, did not demonstrate high TE values during the normal progression of cotyledon development in soybean. The numbers of unique transcripts that were expressed at an arbitrary cut-off level of greater than or equal to 1 RPKM in our control samples were 14,888 in the C25 early maturation 25-50 mg  Table 7 shows the data for these Glyma models.
https://doi.org/10.1371/journal.pone.0194596.t006 cotyledons, 7,683 in the C100 mid-maturation 100-200 mg cotyledons, and 8,027 in the C300 late maturation 300-400 mg cotyledons (Table 3). Thus, the percentage of transcripts that demonstrated higher ribosome occupancy, ie., TE>1 were 0.91%, 0.94%, and 0.57% in each of the developmental stages, respectively. To our knowledge there are no reports on translational profiling of immature seed development in any plant species. In Arabidopsis, polysomes extracted from the early stages of seed hydration and germination from tissue collected at 0, 6, 26, 48, 72 HAI (hours after imbibition) were profiled on Affymetrix arrays [34]. This approach revealed two shifts in translational regulation, one during the hydration phase in which 435 transcripts had higher TE values and 1204 were low, and another in the germination phase in which 195 transcripts were high and 717 were low.
The immature seed development expression program, especially in a high protein crop as soybean, is dominated by production of large quantities of specific storage proteins. Early molecular studies showed that the abundant storage protein messages account for 50-60% of the polysomal mRNA [35,36]. We found that the number of soybean transcripts showing higher TE values was greatest in the youngest seed at the 25-50 mg stage (Table 3). Examination of our RNA-Seq data [12] from developing Williams cotyledons showed that the number of unique transcripts expressed at > = 1 RPKM was approximately 21,000 to 23,000 in each of several stages of seed development from 5-6 mg, 100-200 mg, and 400-500 mg instead of the 8,000 to 15,000 in this ribosome profiling data. We also observed that the RPKM values of many of the genes were lower by several fold in the control RNA in the ribosome profiling samples. One reason could be that, despite the large sequencing depth of generally over 100 million total sequence reads, more of the sequencing space in the ribosome profiling experiments is occupied by rDNA reads as the technique cannot use poly A selection as used for RNA-Seq, but rather requires fragmentation of the RNA for the control (CT) sample versus the protection provided by the ribosomes in the FP samples. We found only 2-3% of the total mapped sequence reads were to rDNA in the RNA-Seq samples [12] while 20 to 80% of the total mapped reads were aligned to rDNA in the ribosome profiling experiments (S2 Table) and then removed from the analyses informatically. However, the 8.1 to 24.7 million non-rDNA reads mapped to coding transcripts of the Glyma models in the ribosome profiling experiments were sufficient for quantification of most transcripts except possibly for the ones with lowest abundances.

TE values for serine proteases, 2S albumins, and other LMW storage protein suggest two phases of storage protein mobilization during seed development
The cysteine and subtilisin-like serine proteases [37,38] are the most commonly found seed proteases in soybean and are involved in seed developmental processes including the mobilization and degradation of seed storage proteins in soybean and other legumes [37][38][39][40][41][42][43]. Our results showing higher translational efficiencies of serine proteases during early seed development suggest that these proteases might be involved in turning over some proteins before the massive synthesis of the major storage proteins. The expression profiles of the seed storage proteins support this phenomenon where the lowest level of transcripts and footprints is measured in the early stage compared to the mid and late stages (Fig 2 and Table 2). On the other hand, a representative serine protease (Glyma.06G022600.1) has about a 1.3 RPKM level of transcripts in cotyledons of all three stages of seed development, whereas there is a much higher level of footprints, about 32 RPKMs, detected in the early stage compared to the other two stages (Table 4). Hence there is a high level of translational efficiency (TE), at about TE = 25 in the early stage. This implies that there are more ribosomes bound to serine protease mRNAs at this stage which enables more translation. We found eight unique gene models annotated as serine proteases and most of them showed similar expression patterns to the one described above (Table 4). It would appear that the serine proteases play a vital role in the early stages.
In the entire list of 216, there are 16 unique protease gene models that have high TE values in at least one stage of development. Interestingly, several other more abundant proteases, including two aspartyl proteases and a cysteine protease, had higher TE values in the late stage of development after the seeds begin the desiccation process at the 300-400 mg seed weight range (Table 4). Thus, there may be two types of proteases one for the early stages with lower RPKMs but higher TE values, and another class with higher expression and also increased TE values at the more mature stage after the massive increase in storage proteins. These enzymes are likely to play a role in post-germination processes.
The other overrepresented category of transcripts with high TE values possessed the full PFAM annotation of "bifunctional inhibitor/lipid-transfer protein/seed storage 2S albumin superfamily protein". The 2S albumins and other low molecular weight (LMW) cysteine-rich proteins have been determined to serve as storage proteins in diverse plant species [27,44]. Synthesized initially as preproalbumins, the signal sequence is removed first and the proalbumin intermediates are then proteolytically processed to smaller mature 2S large subunits of 8-10 kd and small subunits of 3-4 kd by endopeptidases and proteases [45]. The mature proteins often also have carboxy-terminal heterogeneity likely from clipping by carboxypeptidases. It has also been shown in some species that both the proteases and peptidases are initially physically separated from each other in multi-vesicular bodies on their way to the storage protein vacuoles [46,47]. Thus, these 16 proteins are likely to be substrates for some of the 12 proteases and peptidases that are present during the early stages of seed development.
It is intriguing that both types of transcripts, those encoding either degradative enzymes or 2S albumins, have generally similar expression and TE patterns in the early C25 stage and then decline in the later stages when a second set of storage proteins and degradative enzymes appears. This suggests two phases to the synthesis and processing and degradation of storage proteins-one represented by the earlier expressed and less abundant class of storage proteins (2S albumins and LMW cysteine-rich proteins) versus the more highly expressed, major globulin proteins represented by glycinin and conglycinin transcripts that accumulate in later stages.
The enhanced translation of the peptidases and 2S albumin proteins perhaps aids in the storage and turnover of amino acids during early seed development and provides a pool of amino acids for the synthesis of the major storage proteins, glycinins and conglycinins. The abundant soybean glycinin and conglycinin proteins and the seed proteins of many important legumes and grains are relatively poor in the essential sulfur amino acids, methionine and cysteine. Although it would be attractive to use some of these 2S albumins as backbones for increasing total sulfur amino acids by engineering, some of them are food allergens as is the S2 albumin of Brazil nuts [44,45] although the 2S albumins in soybean are classified as minor allergens [45].
Whether the six gibberellin-regulated proteins that also have high TE values in the C25 stages are also involved in a network that induces these early 2S albumin proteins and the proteases and peptidases is unknown. However, it has been reported that the pattern of accumulation of gibberellins during seed development shows two peaks corresponding to an early phase when they promote cell growth and expansion and the end of the maturation phase when they activate proteolytic enzymes that are used in germination [30]. The need for higher TE values of the seven flavonoid pathway enzymes at this stage is intriguing. The two copies of similar flavonone 3-hydroxylase gene have previously been cloned and described in detail at the molecular genetic level [48]. One of them was shown to be the wp gene that encodes a rare pink flower phenotype in soybean. Flavonoid compounds have many effects on plant development in addition to pigment production. Interestingly, the mutant pink flower line is also associated with a 22% higher seed weight and 4% higher protein content, but the mechanism is unknown.
The molecular mechanisms underlying the observed translational regulation of various mRNAs including alternative expression from upstream orf regions (uORFs) is an attractive subject for further investigation [49,50]. However, one of the current problems with undertaking a complete genomic study of this system would be to first define the UTR for each of the approximately 56,000 unique soybean gene models. For example, as pointed out in Fig 1, the 5'UTR for the seed lectin gene was called as 970 bases, but S1 nuclease protection studies show it to be approximately 28 bp [29]. We have noted the discrepancy in the automated UTR calls made by the Phytozome gene assembly with several of the genes in Table 1 as well. A global experimental determination of the authentic 5' and 3' UTR sequences would require correlation of the gene model calls with a number of RNA-Seq data sets from many tissues and organ systems of the soybean plant under various environmental growth conditions including biotic and abiotic stresses.

Low abundance transcripts encoding MADS box transcription factors and NmrA-like negative regulators showed higher TEs in early development
Translational regulation of mRNAs coding for transcription factors (e. g. yeast GCN4 as the founding example) is a common mechanism of directly connecting cytoplasmic signaling events to altered genome-wide transcriptional program [51,52]. Although transcription factors generally have very low expression, we found two MADS-box transcription factors that had higher TE values in the early stage cotyledons. Members of the MADS-box transcription factor family have a conserved DNA-binding MADS-box domain. The MADS-box family is an important class which is involved in numerous biological processes during plant growth and development [53]. They are mainly recognized to play essential roles in reproductive development [54]. However, MADS-box genes were also found to be expressed in vegetative tissues, root, trichomes, and fruit, suggesting their diverse roles in plant development [55][56][57]. It has been reported that the soybean MADS-box transcription factors are involved in seed development [58]. We found 9 total members of MADS-box transcription family annotations that demonstrated higher translational efficiency during early seed development, but these contained multiple splice variants of the same two gene models. One MADS-box transcription factor encoding gene, Glyma.06G324400.2, showed translational efficiency of 7.08 in cotyledons during the early stage of seed development, while the other, Glyma.04G257100.2 had a TE of 4.76 ( Table 5). The expression of the MADs box is very low in all three stages. However, there is a significant difference in translational efficiencies which indicates more ribosomes are associated with MADS-box transcripts at the early stage. This facilitates the greater potential of MADS-box transcripts to be translated into protein. Our results indicate that MADSbox transcription factors might be playing an important role during the early stage of seed development that requires more protein synthesis than inferred from their transcript levels.
The only other transcription factors with high TE values in the final list of 216 models were two gene models annotated as NmrA-like negative regulators. The wild type NmrA gene is involved in nitrogen metabolite repression in fungi such as Neurospora crassa and Aspergillus nidulans where it represses the induction of transcription factors that induce genes required to utilize alternative sources of nitrogen other than glutamine or ammonia [59]. In soybean, the principal form of nitrogen delivered to the developing embryo has been measured as glutamine at 55% and asparagine at 20% [60]. Thus, under normal conditions, nitrogen would not be limiting and the possible function of a NmrA-like negative regulator could be envisioned, at least for the early stages of the seed filling period.

Very low TE values of many ribosomal family proteins and some energy and plastid related transcripts occur across all three developmental stages
Ribosomal proteins are structural constituents of the ribosomes which are necessary for protein synthesis. Interestingly, we found that the footprint values of some ribosomal protein transcripts were much lower than their control levels across all three seed developmental stages, indicating that they are relatively translationally silent during soybean seed development. One such representative ribosomal protein gene (Glyma.14G213500.1) showed very low TE values of 0.04, 0.03 and 0.04 during early, mid and late stages, respectively (Table 7). During the early stage, the transcript level of this particular gene was 87 RPKMs, whereas the footprint level was only 3 RPKMs. Other ribosomal protein coding genes showed similar expression patterns across the developmental stages. All total, 28 of the 370 unique transcripts with very low TE levels were annotated as ribosomal proteins and an additional 14 appeared to be involved in translational machinery (Table 7). At first glance, this result would appear to be at odds with the fact that abundant seed storage protein transcripts need to be associated with functional ribosomes during the mid and late stages of cotyledon development. However, this observation could indicate that there may be specific types of ribosomes that are associated with translation of storage protein transcripts, and thus the levels of some ribosome proteins decrease during seed development to produce that switch. Previously, it has been reported that expression of over 64 ribosomal protein coding genes increased during the transition of the soybean cotyledons from storage organs to photosynthetic organs that occurs at approximately 4 days after seed imbibition and germination [61]. Recently, polysome profiling showed that ribosomes existed primarily as monosomes in Arabidopsis dry seed and the polysome fraction increased by six hours after imbibition of the seed [34]. The expression of ribosome protein genes also increased significantly during the seed to seedling transition and the authors speculated that this might affect the composition of the translating ribosomes and the selection of the translated mRNAs [34]. Thus, both transcriptional and translational regulation of ribosome proteins may be important in the immature soybean cotyledons as well as the transition of the cotyledons during seedling growth from a storage organ to a catabolic and subsequently photosynthetic organ. Interestingly, ribosome profiling in embryonic stem cells recently revealed evidence that ribosome heterogeneity at the level of core ribosome proteins facilitates ribosomes to preferentially translate specific mRNAs, mediated in part by internal ribosome entry sites [62]. Thus, in contrast to the current paradigm of ribosomes having conserved structure and function in all tissues, some ribosomes may have structural heterogeneity that leads to functional diversity in their preferences for particular transcripts and subsequent translational preferences. However, many studies would be required to directly verify key compositional differences of ribosomes in the different developmental stages of the seed or other plant parts, as opposed to changes in expression and/or translational capacity of homologous gene family members within each ribosome protein.
Some of the ribosomal proteins (20 of 48 total) and many of the energy, photosynthetic, and plastid related genes (66 of 95 total) have significant homology to the soybean chloroplast and mitochondrial genomes (Table 6). In total, 24% (87 of 370) of the unique Glymas in the very low TE set had identity to the chloroplast genome, whereas only 2% (5 of 216) of the high TE set Glymas had identity to the chloroplast genome (S2 File). The fraction with identity to the mitochondrial genome was lower at 8% (31 of 370) unique Glymas in the very low TE set and none in the higher TE transcripts.
Transfers of DNA have occurred during evolution between the genomes of the nucleus and the organelles. One recent report [63] compared the 450 kb soybean mitochondrial genome from the cultivar Aiganhuang and the 150 kb chloroplast genome [64] from PI 437654 to the assembly of the complete soybean genome from Williams 82 [18]. Their report found the nuclear assembly contained a total of 155.2 kb (0.02%) that had significant identity to the mitochondrial genome with most regions less than 500 bp in length [63]. A larger proportion of the nuclear DNA (1.1 Mb or 0.11%) had significant identities to the chloroplast genome with 24 chloroplast genes having complete ORFs, accounting for 32% of chloroplast coding genes [63].
The large number of nuclear genes or gene fragments with high identity to the chloroplast and mitochondria genomes raises the possibility that our alignments to the Glyma models with these annotations were profiling ribosomes from the organelles, especially since the sequence footprints are necessarily about 30 nt, as that is the size protected by a ribosome from the micrococcal nuclease digestion. Thus, there may be some organelle footprints that align with 100% identity to the nuclear Glyma models. If this is the case, our observations of very low TE values of < = 0.1 for some of the ATPases, NADH dehydrogenases, and photosynthetic reaction center proteins, could be reflecting the relative lack of translation for many of the energy and plastid genes during seed development. Although the early and mid-maturation cotyledons are pale green in color and changes in chloroplast ultrastructure have been observed in the immature cotyledons [65], the developing seed is a sink for photosynthate transported as sugars through the seed coat to the developing embryo. Therefore, the demand for chloroplast functions could be low in the developing cotyledons and/or the pool of RNA in the plastids may be kept high. In order to assess the status of the total RNA versus ribosome bound RNAs that map directly to the chloroplast genome, we aligned our data to the soybean chloroplast genome [64] using the same alignment and normalization criteria that were used for the nuclear coding sequences including no mismatches and using the total reads mapped from each library to the 88,515 nuclear gene models as the normalization factors. As found for the alignments to their nuclear counterparts, the ribosome footprint RNAs of the chloroplast transcripts were highly underrepresented compared to the number of alignments with the control RNA samples as shown in the S7 File representing the 102 coding regions in the chloroplast genome. Thus, it appears that the amounts of unbound polycistronic or processed RNAs in the cotyledon chloroplasts are much higher than those associated with the chloroplast ribosomes. For example, in the C25 stage cotyledons, the chloroplast encoded rbcl subunit has the highest level of all the coding sequences at 1411 RPKMs but only 74 RPKMs (0.05 TE value) were found in the footprint sample, and thus associated with active translation of the transcript. The photosystem II protein D2 had the highest TE value at 0.12 (22 FP RPKMs /180 total RPKMs).
A recent study [66] used RNA sequencing to examine the chloroplast translatome in different segments of the 9-day old maize leaf blade. The leaf segments form a natural gradient of cells and plastids at increasing stages of photosynthetic differentiation corresponding to increasing distances from the leaf base. The average TE value for all chloroplast genes in segment 1 was 0.22 and only the atpH gene (1.14 average TE) had a TE value >1. The authors found that the translational efficiency of most chloroplast genes increased during development in segments 4 and 9 which are 3-4 cm and 8-9 cm, respectively, from the leaf base, and then declined in segment 14 near the leaf tip as the photosynthetic apparatus matured. In this respect, our results from the cotyledons more closely resemble segment 1 of the leaf base than the other more photosynthetically active tissues of the leaf.

The major seed storage proteins are primarily regulated at the transcriptional level
During seed development, the accumulation of seed storage proteins is regulated by an integrated genetic and physiological network [67][68][69][70]. The glycinins and conglycinins account for up to 70% of the seed protein content with trypsin inhibitors at approximately 5-10% and lectin at 2-5%. The abundant storage protein messages account for 50-60% of the polysomal mRNA [35][36] and lectin and trypsin inhibitor mRNAs were initially enriched and cloned from immunoselection of polysomal RNAs [71]. Lectin and the other seed proteins contain signal sequences that direct their localization to the endoplasmic reticulum and transit to specialized protein body vacuoles in the seed [36,72]. The transcript levels of the abundant storage protein coding genes rise and fall from the early to mid and late maturation phases of seed development as has been shown with several different molecular [13,15,35] and genomics technologies [11][12]. It has already been reported that transcript levels for conglycinin and some other seed proteins are regulated primarily at the transcriptional level during seed development [14,15].
Here, we extensively investigated the expression profiles of the seed storage proteins such as glycinins, conglycinins, lectin, and Kunitz trypsin protease inhibitors in our ribosome profiling data. In contrast to the less abundant 2S albumins, most of the members of these seed storage protein coding genes showed similar expression patterns in cotyledons, where the highest levels of transcripts were found at the mid stage when the seed fresh weight ranges from 100-200 mg (Fig 2 and Table 2). In contrast, the peak ribosome footprint level was found at the late developmental stage even though there was not much difference in relative translational efficiencies across development (Fig 2). These data indicate that although the trend for the seed protein genes is for a gradual increase of 2 to 6-fold in the relative TE values from the early to later stages, these values were only between 1 to 1.6 TE at their highest, indicating roughly equal levels of the total and ribosome bound fractions. Our conclusion is that accumulation of the major storage proteins primarily reflects transcriptional control in line with the massive increase of their transcript levels during seed development. On the other hand, translational regulation may be important to rebalance the proteome to maintain total seed protein levels in situations when naturally occurring mutations or transgenic RNAi knockouts prevent expression of one of the superabundant major storage proteins genes as has been observed [16]. Translational control may also be very important during environmental perturbations as nutrient deprivation or abiotic stress which were not tested in this report.
In summary, in this first report of ribosome profiling of seed development, we profiled both nuclear and chloroplast genomes and identified a relatively small number of genes with high translational efficiencies in cotyledons during different seed developmental stages. Some of them showed higher TE values at only one of the stages of seed development, such as a number of proteases and peptidases and 2S seed storage albumins in the early stage of maturation, suggesting that increased translation of this class of proteins and their degradative enzymes may be needed in the early stage before the major seed proteins increase dramatically. The TE values for the major seed storage protein transcripts increased from 2 to 6 fold TE values from early to late maturation, but their massive accumulation is primarily reflecting the prominence of their transcriptional control during normal seed development. Interestingly, a number of ribosomal protein genes also displayed very low TE values over all three seed developmental stages, even though immature seeds are an extremely active tissue for synthesis of the abundant storage proteins. Genes of the chloroplast appear to be relatively low in translational output during these immature seed stages in which most of the energy is derived from transport of photosynthate into the seed.