Changes in gene expression between a soybean F1 hybrid and its parents are associated with agronomically valuable traits

Soybean [Glycine max (L.) Merr.] genetic diversity is limited because domesticated soybean has undergone multiple genetic bottlenecks. Its progenitor, the wild soybean [Glycine soja Siebold & Zucc], has not undergone the same intense selection and is much more genetically diverse than domesticated soybean. However, the agronomic importance of diversity in wild soybean is unclear, and its weedy nature makes assessment difficult. To address this issue, we chose for study a highly selected, adapted F4-derived progeny of wild soybean, NMS4-44-329. This breeding line is derived from the hybridization between G. max cultivar N7103 and G. soja PI 366122. Agronomic comparisons were made among N7103, NMS4-44-329 and their F1 and F2 progeny in replicated yield trials at two North Carolina locations. Significant F1 mid-parent heterosis was observed at each location for seed yield (189 and 223 kgha-1, P<0.05 and P<0.10, respectively), seed protein content (1.1g/100g, P<0.01) and protein production per hectare (101 and 100 kgha-1, P<0.01 and P<0.06, respectively). Increased yield, seed protein content and protein production per hectare in the hybrids suggested that wild soybean has the potential to improve agronomic traits in applied breeding. Comparisons of differentially-expressed genes in the hybrid vs. parents identified genes associated with N metabolism. Non-additive changes in gene expression in the hybrids relative to the parents could reasonably explain the improved protein levels in the F1 hybrids. Changes in gene expression were influenced by environmental effects; however, allele specific bias in the hybrids were well correlated between environments. We propose that changes in gene expression, both additive and non-additive, and changes in allele specific expression bias may explain agronomic traits, and be valuable tools for plant breeders in the assessment of breeding populations.


Introduction
Plant breeding consists primarily of creating genetic and phenotypic variation by hybridizing desirable contrasting parents, and selecting recombinant progeny which out-perform the parents. Finding parents that are both desirable and genetically diverse among elite cultivars PLOS  can be challenging for a plant breeder in a crop that has undergone multiple genetic bottlenecks, such as soybean [1,2]. Molecular tools have become more common in plant breeding in recent years to address this problem, because they are extremely useful for identifying and tracking genetic diversity. For example, researchers used a 50K SNP platform to characterize the diversity inherent in domesticated soybean [Glycine max (L.) Merr.] and its wild progenitor [Glycine soja Siebold & Zucc.] [3,4]. This information can be used by plant breeders to identify and select parents which not only have desirable traits, but are also genetically diverse. In many crops, the chance of developing agronomically desirable progeny which are superior to the parents is increased by focusing on the genetic distance between parents. In corn, increased genetic distance is thought to enhance heterosis, which is defined as the improved performance of the F 1 hybrid over either parent [5]. However, in maize, the association between genetic distance and heterosis is not always sufficiently great to have predictive value [6,7]. In soybean, because of the self-pollinating mode of reproduction, heterosis and hybrid vigor do not play a significant role in current production, although the existence of heterosis has been reported [8][9][10]. When F 1 heterosis is present for quantitative traits such as seed yield, it may indicate that the parents contrast sufficiently so that the diversity between the parents can be captured in a transgressive recombinant progeny. The use of heterosis as a screening tool to select desirable parental combinations has been proposed by Brownie and Burton (1993) in soybean. Genetic markers have been widely adapted for plant breeding in many crops, including soybean. Another tool that is just beginning to be utilized by plant breeders is genotyping by sequencing (GBS) [11]. Next generation sequencing allows the generation of sequence data from a variety of plant samples. Analyses of these data requires some specialized knowledge of handling large data sets, infrastructure to handle the data sets, and sufficient funds to sequence the number of samples required for statistical power. Nonproprietary programs, well supported web interfaces, and cloud computing address the first two requirements for dealing with sequencing data. The third requirement, the cost of sequencing, is nearing the threshold of being an affordable alternative to marker based analyses. Both deep sequencing of genomes and sequencing of expressed genes (RNAseq) are becoming more affordable and can be used in GBS.
Although the genetic diversity in the USDA domesticated soybean germplasm collection is substantial and agronomically useful, the genetically diverse progenitor species, wild soybean has received little attention by applied breeders. Thus, the true agronomic importance of diversity preserved in the USDA wild soybean collection is unclear with respect to plant breeding [12]. A central problem in the study of wild soybean is that its viny nature makes agronomic assessment difficult. Not only is the wild soybean itself completely unadapted to row crop production, but so are most of the progeny derived from crosses between G. soja and G. max. To circumvent the problem of poor plant architecture in the wild soybean and, thus, facilitate assessment, we chose to study a rare, highly-selected, adapted and upright F 4 -derived progeny of wild soybean, NMS4-44-329. NMS4-44-329 was developed in the USDA soybean breeding program at Raleigh, NC [13] and is derived from the hybridization of G. max cultivar N7103 [14] and G. soja PI 366122. NMS4-44-329 was hybridized (backcrossed) with its domesticated parent N7103 to produce 3900 F 1 hybrid seeds and 23 F 2 populations.
In this study, we evaluate the agronomic impact of the wild grandparent on yield and seed composition of the F 1 and F 2 progeny from the cross between N7103 and NMS4-44-329. We also correlate detailed gene expression data from RNAseq analyses with agronomic data from the two soybean parents with their F 1 progeny in replicated yield trials in two environments. The parents, N7103 and NMS4-44-329, have similar agronomic performance, but differ substantially in sequence diversity as a consequence of the wild soybean in the pedigree of NMS4-44-329. The availability of NMS4-44-329 allowed us to correlate agronomic traits that contrast between the F 1 hybrids and the parents with differentially expressed genes in the F 1 hybrids. We also took advantage of the diversity contributed by the wild grandparent (PI 366122) to identify alleles that had biased expression (>73% one allele) in the F 1 hybrids. Finally, we used analyses of the genes that were differentially expressed in the F 1 hybrids to gain insight into the molecular basis of agronomic traits and propose that aspects of gene expression and allele specific expression may be useful tools for soybean breeding.

Materials and methods
Parental pedigree NMS4-44-329 is a F 4 -derived breeding line from the hybridization of USDA cultivar N7103 and the USDA G. soja accession PI 366122 [14]. N7103 has 100 percent G. max pedigree and is a F 4 -derived cultivar from the hybridization of USDA breeding line NTCPR90-143 and USDA cultivar 'Pearl' (PI 583367) [15]. Both breeding line and cultivar are maturity group VII materials adapted to the Southeastern USA. NMS4-44-329 is a rare adapted and upright breeding line developed through extensive selection among F 3 plants followed by pedigree selection and yield trials.

Production of seed for the experiment
In 2013 and 2014, hybridizations were made between N7103 as the female parent and NMS4-44-329 as the male parent. Approximately 3,900 F 1 seed were generated at the Central Crops Research Station (CCRS) in Clayton NC. The CCRS is a part of the North Carolina Department of Agriculture and the North Carolina State University Research system. In the winter of 2014-2015, 23 F 1 plants were grown and individually harvested at the USDA Tropical Agriculture Research Station (winter nursery) at Isabella, Puerto Rico, to produce F 1:2 seed. Self-pollinations were removed from the winter nursery prior to harvest, by discarding plants with green hypocotyl color at the seedling stage or white flowers at mid-bloom. Purple flower color and purple hypocotyl are dominant to white flower and green hypocotyl color and pleiotropic. Because the female N7103 has white flowers and green hypocotyls and male breeding line NMS4-44-329 has purple flowers and purple hypocotyls, F 1 plants are expected to have purple color [16]. Parental seed stock sources for the experiment, as well as seed for the four maturity checks (USDA NC-Roy, NCC06-1090, USDA N7003CN and USDA N8002), were produced in 2014 in North Carolina.

Agronomic evaluation
An experiment was developed to collect agronomic data in standard-size breeder plots for seed yield, seed protein and oil content, plant maturity, plant erectness, plant height and 100-seed weight. The experiment was conducted at two locations in 2015 (Caswell Research Station (CRS), Kinston, NC; CCRS, Clayton, NC), and utilized a randomized complete block design (RCBD), with five blocks per location. The CRS is a part of the North Carolina Department of Agriculture and the North Carolina State University Research system. Each block included one F 1 entry and three F 2 entries. Six entries of NMS4-44-329 and N7103 were also included. In addition, each block included the following maturity checks: USDA NC-Roy, NCC06-1090, USDA N7003CN and USDA N8002 [17][18][19]. All maturity checks were replicated two times per block, except for NCC06-1090, which was replicated three times per block. Repeated entries were included within a block to increase the testing precision for heterosis analysis. Over the two locations, a total of ten replications were grown for the F 1 , 30 for the F 2 and 60 for the parents, NMS4-44-329 and N7103. We have included the raw field data as S1 Table. Field evaluation The experiment was planted at Clayton on 28 May 2015, and at Kinston on 13 June 2015, in North Carolina under authority of North Carolina State University and North Carolina Department of Agriculture. Each experimental unit consisted of 3-rows planted with an interrow spacing of 0.97 m, and a planting length of approximately 3.35 m. Approximately two weeks after planting, putative F 1 plants were examined and those which did not have the expected purple hypocotyl color were treated as contaminants and removed. Approximately one percent of seedlings were removed from the hybrid plots using this protocol. At maturity, plots were end trimmed to a uniform length of 2.74 m. All agronomic data for the study were collected only on the center bordered row.

Seed composition analysis
Seed protein and oil content measurements were taken using a Perten DA 7250 near-infrared (NIR) spectrometer (Perten Instruments, Springfield, IL). Equations used for the NIR analysis were provided by Perten for the year 2015. Concentrations were reported in a zero moisture basis.

Statistical analysis
Statistical analyses of all agronomic and morphological data were performed using the Statistical Analysis System 9.4 (SAS Institute, Cary). To identify potential outliers in the agronomic data set, an initial set of analyses of variance was performed for each location separately, using Proc GLM. Genotype was treated as a fixed effect and block as a random effect. Subsequently, a trend analysis of residuals was added using fourth-degree polynomials for the row and column positions of plots, which were treated as fixed effects [20]. Externally studentized residuals were calculated for each plot and trait, and observations were dropped if associated with a studentized residual greater than three. Two observations were dropped from the data set using this approach. The reduced data set was then subjected to an additional analysis of variance combined with trend analysis, again employing row and position effects as covariates. Sequential sums of squares (Type 1) output were evaluated and row and column covariates which were not significant (P < 0.05) were dropped, starting with the highest order covariate. If a higher order covariate was significant, but a lower order covariate was not, the lower order covariate was left in the model. A final model retaining significant (P < 0.05) row and position effects was used in all further analysis of agronomic traits.
After outlier removal and trend analysis, isotropic and two-dimensional anisotropic exponential covariance structures were examined for the dependent variable yield using Proc Mixed. A final model was selected based on likelihood ratio testing (LRT). For the CRS location, the standard independent and identical error distribution (IID) variance-covariance structure was the best fit. For the CCRS location, a two-dimensional geometrically anisotropic variance-covariance structure was the best fit. At a given location, the best-fitting type of variance-covariance structure for yield was then applied to other agronomic trait analyses, since yield was assumed to best capture true field variation. After a final model was selected, single degree of freedom linear contrasts between genotypes were performed using the contrast statement. In addition to separate location analyses, pooled analyses over locations were also performed for each trait using Proc Mixed. Genotype and trend covariates nested within location served as fixed effects, whereas location, genotype x location (GxE) and block nested within location served as random effects. Three different residual covariance structures were evaluated in the analyses over locations: 1) IID, 2) Different variance for each location, IID within location and 3) Separate variance for each location, two-dimensional anisotropic exponential variance within location. Using LRT, the best model had a separate residual variance for each location, with an IID structure within locations. This same model was applied to all other agronomic traits. After final model selection, linear contrasts were again performed using contrast statements. We observed that two samples, representing an F 1 hybrid in Clayton and an N7103 parent in Kinston, were outliers for both the agronomic data and subsequent analyses of gene expression.

RNA isolation and analysis
For RNA analysis, the most recently fully expanded trifiolate was harvested from five plants of the F 1 and parental genotypes in each of 4 blocks at each location, just prior to flowering. Although parental lines were replicated six times per block, only one parental replicate per block was used. Leaves were harvested from Kinston on 15 July 2015 at 9 AM and from Clayton on 10 July 2015 at 9 AM. RNA was isolated using the Qiagen plant RNeasy isolation kit (Qiagen, Redwood City, CA) with on-column DNAse digestion. RNA quality was validated in a 2100 Bioanalyzer (Agilent Technologies, Santa Clara CA). TruSeq RNA libraries from four replicates of each parent and F 1 hybrid were sequenced by HiSeq (Illumina, San Diego, CA). After removing primer sequences and low quality sequences with a Phred score less than 34, between 11 million and 32 million high quality sequences remained. Table 1 shows the number of high quality sequences for each sample. The high quality RNA sequences were aligned to the reference genome (Wm82.a2) using TopHat and the number of sequences that align to the reference genome are shown in Table 1 [21]. Gene expression was analyzed using the Cuffdiff 2 program [21]. Differential expression files and raw sequence files are available from the Gene Expression Omnibus (GSE86608). Information for the sequence and alignment information including, the sample names, a description of the samples, the total number of sequences after trimming, the number of sequences that aligned to the reference genome and the percent of sequences that aligned to the reference genome.
The alignments to the reference sequence were also used to generate variant call files using mpileup. Sequences 2 bp either side of splice junctions were omitted because they have a higher chance of misalignment. Parent specific SNPs were identified that were present in all parental plots sequenced at CRS and their presence in all appropriate parental sequences from CCRS was confirmed. SNPs unique to each parent were identified to generate a list of high quality SNPs specific to each parent. Next, reads that over-lapped these parent-specific SNPs were identified in each F 1 hybrid sequenced and the read depths (number of reads) were recorded from each F 1 sample sequenced. The F 1 hybrids contain one copy of each parental SNP and there was considered to be an allele bias if one allele accounted for more than 73% of the reads consistently across a location. The gene identity and the impact of the variants were determined using SNPEFF [22].
Singular Enrichment Analysis (SEA) of the gene ontologies (GO) was performed at AgriGO using the soybean reference genome Williams 82 [23]. The reference gene list used for SEA contained all of the genes expressed in the leaf tissue.

Agronomic comparison
Agronomic data were evaluated for both parents, the F 1 hybrid, F 2 progeny and the maturity checks. Results from Kinston and Clayton showed that the F 1 hybrid yielded, on average, 189 and 223 kg ha -1 more than the mean of the parents (mid-parent value) (P = 0.05 and 0.10, respectively) (Tables 2 and 3). The combined data over locations revealed an increase in yield in the F 1 hybrids over the mid-parent of 182 kg ha -1 (P = 0.15). The F 2 yields were lower relative to the mid-parent value in both environments. Increases in seed weight of 0.6g (P < 0.01), 0.9g (P< 0.01) and 0.7g (P = 0.11) were observed in the F 1 hybrid relative to the mid-parent in Kinston, Clayton and over both locations, respectively. Maturity Date was significantly different (P < 0.05) in the F 1 hybrid compared to the mid-parent at one location, but the numerical difference was only 1 day. The higher-yielding Clayton environment was associated with taller plants and reduced lodging, but this association was not observed in Kinston.
Agronomic contrasts between the parents and their progeny for soybean grown at Clayton and Kinston, NC in replicated trials in 2015. Differences are from single degree of freedom contrasts and their associated P-values are in square brackets. Yield and harvestable protein P-values are from one-sided hypothesis testing. P = 0.01 represent values of 0.01 or less.
Analysis of seed composition data revealed that the F 1 hybrid had significantly (P < 0.01) greater seed protein content than the mid-parent value at each location (1.1 g/100 g sample on a zero moisture basis, Tables 2 and 3). At both Kinston and Clayton, the F 1 hybrids had significantly (P < 0.01) higher seed protein content than NMS4-44-329. Also, in contrast to the yield data, the elevated protein levels observed in the F 1 hybrids persisted in the F 2 progeny. Protein increased in the F 2 progeny relative to the parents when analyzed by location, and over locations. There was no consistent significant difference in oil levels between the F 1 hybrids and the mid-parent values across environments.
Seed protein production per hectare was also greater for the F 1 hybrid than the mid-parent value at Kinston and Clayton (101 and 100 kgha -1 , respectively, P 0.02, P 0.06). The seed protein production of the F 1 was also numerically superior to each parent at each location and significantly (P<0.05) greater than that of the mid-parent value (P<0.06) averaged over locations (Table 3).

Gene expression
We next determined if differences in gene expression could explain improved yield and seed protein levels observed in the F 1 hybrids in field experiments. Table 4 shows the total number of genes expressed, as well as those genes differentially regulated (! 2 fold change in expression) between the parents and the hybrids. Over 33,000 genes were expressed at both locations. At Kinston, a total of 326 (0.96%) or 321 (0.95%) genes were differentially regulated in the F 1 relative to N7103 or NMS4-44-329, respectively. At Clayton, a total of 452 (1.29%) or 320 (0.91%) of genes were differentially regulated in the F 1 relative to N7103 or NMS4-44-329, respectively. When considering all genes differentially regulated in either parent relative to the hybrids, the correlation between differentially expressed genes between locations was low (R 2 < 0.39). However, 95 genes differentially expressed between N7103 and the F 1 hybrids at both locations were well correlated (R 2 = 0.72). Ninety-four genes were differentially expressed in the same direction in the F 1 hybrids relative to the NMS4-44-329 parent in both locations and with highly correlated expression (R 2 = 0.91). This subset represents genes with genotypespecific expression that, at least in Kinston and Clayton, were not sensitive to environmental effects. Total counts of genes sequences from all samples harvested at Kinston or Clayton (total genes). Counts of transcripts in the F 1 hybrids that: were greater than one parent (>NMS4-44-329 or >N7103); were less than one parent (<NMS4-44-329 or <N7103); were greater than both parents (>both); were less than both parents (<both). Genes that are greater than or less than both parents are the "non-additive" category.
SEA uses ontologies of genes altered in expression in the F 1 hybrid relative to each parent to gain insight into potential physiological differences between the parents and their hybrid at each location. Analyses of the genes altered in expression in the F 1 hybrids relative to the parents in both locations did not show any enrichment in GO categories. Table 5 shows a selected list of relevant enriched ontologies based on comparisons of the F 1 hybrid with individual parents where enrichments were observed. No clear pattern of enrichment emerged for some ontologies. For example, categories related to oxidation-reduction were enriched in both locations among genes that are both up-regulated and down-regulated in the F 1 hybrids relative to the parents. The category of "pectinesterase" was enriched among genes with elevated levels of expression in the parents in Kinston and may be related to the "cell wall" category with a similar pattern of expression. Perhaps the most interesting observation was that genes associated with nitrogen metabolism were enriched in expression in the N7103 parent relative to the F 1 in Clayton where we saw a trend toward increasing seed protein. The subset of genes with genotype-specific expression in the hybrid compared to N7103 were enriched in genes associated with "amine metabolic process" (GO:0009308, P = 0.0004). However, the enriched ontologies associated with N metabolism were not significant among genes associated with the other parent even though the F 1 hybrids have substantially more protein that NMS4-44-329. Expression of most genes in the F 1 hybrids fell between the parental values. We interpreted this to mean that differential expression of these genes was largely additive. Recombining additive gene expression patterns is likely to be the basis of a substantial number of inherited phenotypes. However, analyses of genes up-regulated or down-regulated between the F 1 hybrids and both parents identified a unique category of expression that was not additive. These genes were expressed at a lower or higher level in the F 1 hybrid than both parents and the number of genes falling into these categories is shown in Table 4. This category accounts for less than 0.1% of genes expressed in leaves. At Kinston, 25 genes were expressed at a higher level in the F 1 hybrid than in either parent and 6 genes were expressed a lower level ( Table 6). The Gene ID, the fold change in expression and information about the possible function of these genes, are shown in Table 6. Of note is how few of these genes were well annotated. Two genes, Glyma.03G016300 and Glyma.03G016400, are 68% identical and are associated with ethylene biosynthesis. Both genes were expressed at a higher level in the F 1 hybrid than in either parent. Three calcium-binding proteins were up-regulated in the F 1 hybrids while one calcium-binding gene was down regulated. Two of the up-regulated calcium-binding genes, Glyma.20g034200 and Glyma.07g229500, are over 75% identical to each other. Glyma.20g034200 and Glyma.07g229500 are annotated as a class of calcium binding proteins important in signal transduction called calmodulins. Sequence comparisons of genes with a "non-additive" pattern of expression in Kinston. Identity refers to genes that share amino sequence identity, a 0 means no sequences with > 20% identity were identified, the value in parenthesizes is the identity shared by similar nonadditively expressed sequences. The longest protein sequence for each gene model present in soybase was used for the alignment. Sequence ID is the gene model from soybase. Expression (EXP) is the average fold change between Kinston and Clayton. Function was selected from the soybase, and was the most likely informative option available in soybase. At Clayton, 22 genes were expressed at higher levels and 35 genes were expressed at lower levels in the F 1 hybrid relative to either parent, Table 4. Table 7 shows information about the non-additively expressed genes in Clayton. Many genes down-regulated in the F 1 hybrid were annotated as cell wall related and pectin methyl esterase inhibitors that may be associated with cell expansion. Four genes annotated as asparagine synthases are up-regulated in the hybrid. Alignment of these asparagine synthases sequences revealed that three of them, Glyma.15G071300, Glyma.12G150500 and Glyma.13G181000, are between 62 and 82% identical. However, the other asparagine synthase gene, Glyma.11G171400, is less than 20% identical to the other three. Transcripts of 3 myo-inositol oxygenases, encoding proteins ranging from 68% to 72% identical, increased in expression in the F 1 hybrids. Other related genes also appeared to be differentially expressed in the hybrid, but as a family their pattern was hard to discern. For example, three genes with similar annotations, Glyma.15G093900, Glyma.11g230800, Glyma.11g027700 were differentially regulated in the F 1 hybrids. Glyma.15G093900 may be a Gibberellin 2-beta-dioxygenase and was increased in expression in the F 1 hybrids. The other 2 genes decreased in expression in the F 1 hybrids and may be Leucoanthocyanidin dioxygenase. These three digoxygenases range from 24 to 34% identical. Alignment of proteins encoded by genes with nonadditive patterns of expression has identified poorly annotated genes that share substantial identity with other genes on the list. Glyma.15G211300 and Glyma.03G113200 are 96% identical. Glyma.08G081800 and Glyma.05G126800 are 78% identical. Glyma.10G248900 and Glyma.20G144800 share 86% identity. Glyma.10G130600 and Glyma.20G081400 share 93% identity and are annotated as LEA protein. Glyma.10G168500, Glyma.03G252600 and Glyma.19G250100 are 73-97% identical and are annotated as GDSL like lipases.

Allele specific bias
One advantage of using NMS4-44-329 as a parent is that alleles/genetic regions inherited from G. soja are likely to have more polymorphic SNPs than alleles inherited from a typical G. max genotype. It is possible that some SNPs occur de novo in the recurrent parent, N7103, but it is likely that most SNPs were inherited from the wild grandparent (PI 366122). In most cases, total gene expression is represented by about a 50% contribution of each allele, although in some cases allele specific expression can occur. We defined allele specific bias as one allele representing over 73% of total expression which is greater than 2 standard deviations in this experiment. Over 4390 high quality SNPs representing about 3,157 genes were identified that contrast between N7103 and NMS4-44-329 in the RNAseq data. The total read counts of each SNP were measured in the F 1 hybrids and the ratio of the counts from each SNP determined. Unlike differential gene expression in the F 1 hybrids, the ratio of alleles expressed in the F 1 hybrid for each high quality SNP was well correlated between Kinston and Clayton (r 2 = 0.49). The correlation of gene expression bias greater than or equal to two standard deviations between locations was extremely well correlated (r 2 = 0.97). No allele switched bias between locations. Only 241 SNPs, shown in S2 Table, had expression biases of greater than or equal to 2 standard deviations. These 241SNPs represented variants in 191 genes. Of the 29 genes represented by more than one SNP, only three (noted by an asterisk in S2 Table) were inconsistent in their allele biases. We note that the NMS4-44-329 alternative alleles were down regulated more than the N7103 alleles. One explanation for this bias was that the NMS4-44-329 plants used in the hybridization were derived from a single plant in the F 4 generation. Due to residual heterozygosity (expected 12.5%) in the F 4 plant selection, a heterogeneous seed-lot would have been produced through further self-pollination, so a small amount of genetic variation would have existed between NMS4-44-329 plants used for hybridization. As a result, some F 1 hybrids could be lacking an allele based on the NMS4-44-329 parent lacking the allele. To address this concern, alleles were only used that were not polymorphic in any parental line sequenced. The distribution of alleles is shown in Fig 1. The bins with lower ratios of the NMS4-44-329 alleles were shifted to slightly higher percentages. This might reflect a shift toward reduced expression of NMS4-44-329 alleles in a predominantly N7103 genetic background or a slight underestimation of NMS4-44-329 alleles due to some residual heterozygosity in the original parent as described above. In rice an increased frequency of C to T and G to A SNPs has been reported that is a consequences of the higher frequency of mutation of methylcytosine [24]. We investigated the frequency of this class of SNPs among alleles with a biased gene expression because methylcytosines are epigenetic marks capable of affecting transcription. However, there was no significant difference in the frequency of relevant SNPs between genes with biased expression and those without. The SEA analyses of the ontology of genes displaying expression bias indicated that genes associated with "adenyl ribonucleotide binding" were significantly enriched (3.4X10 -6 ) among genes biased in both directions. No genes were identified with non-additive expression and an allele bias.

Discussion
One goal of this research was to determine whether or not the presence of wild soybean in the pedigree of a parent would lead to heterosis in the progeny. Reported estimates of heterosis for seed yield, and other agronomic traits are generally small in soybean and many other self-pollinated crops (often less than 20%), as compared to cross pollinated species such as corn [9]. However, the F 1 -hybrid had numerically improved yields, seed protein content, and seed protein production per hectare relative to the mid-parent and each parent at each location. Seed weight was also greater for the F 1 hybrid than for the mid-parent value at each location. All of these measures of increased productivity for the F 1 hybrid were consistent with heterosis. Inbreeding depression, another hallmark of heterosis was noted for yield in the F 2 populations at each location, as compared to the F 1 . However, no inbreeding depression was noted for seed protein content suggesting that the F 1 heterosis may be derived from epistasis rather than dominance for that trait. To our knowledge this is the first report of heterosis for seed protein content of soybean. Comparison of seed yield, seed protein content, and seed protein production per hectare in the hybrids and parents suggested that alleles from the wild soybean are potential sources for improvement of agronomic traits. The yield of the F 1 was similar to that of modern elite cultivar N7003CN of similar maturity, indicating that the yield levels in this study were sufficiently high to suggest that the G. soja derived alleles may have important agronomic potential in plant breeding. The frequently reported negative correlation between seed protein content and Gene ontologies and sequence comparisons of genes with a "non-additive" pattern of expression in Clayton. Identity refers to genes that share amino sequence identity, a 0 means no sequences > 20% identical were identified, the value in parenthesizes is the identity shared by similar non-additively expressed sequences. The longest protein sequence for each gene model present in soybase was used for the alignment. Sequence ID is the gene model from soybase. Expression is the average fold change between Kinston and Clayton, negative values indicate higher expression in the hybrid. Function was selected from soybase, and was the most likely informative option available in soybase. https://doi.org/10.1371/journal.pone.0177225.t007 Soybean heterosis seed yield was not observed in this study, providing a rare opportunity to improve both yield and seed protein content simultaneously, using alleles from the wild soybean [25]. The second goal of this research was to compare genes differentially expressed in the leaves of the F 1 hybrid to that of the parents to gain insight into the physiological basis of the yield increase. Identifying genes associated with improved yield was challenging because this is a complex trait, there was a strong environmental effect and the increase in yield of the F 1 hybrids compared to the mid-parent was small. Comparison of the ontologies enriched in each location relative to the mid-parent expression (F 1 >both or F 1 <both in Table 4) failed to identify a GO term consistently enriched or depleted in the F 1 compared to the parents. The biggest improvement in yield compared to the hybrid was between NMS4-44-329 in Kinston (270 kg ha-1) and N7103 in Clayton (307 Kg ha-1). No clear overlap is observed in the ontologies enriched in the F 1 hybrid compared to NMS4-44-329 in Kinston and N7103 in Clayton. Genes with a non-additive pattern of expression could play an important role in yield improvement of the F 1 hybrids. In many cases families of genes that share substantial identity and presumably similar functions shared a non-additive pattern of expression. Analyses of these genes identified good candidates to impact a complicated trait like yield. For example, calmodulin plays a role in transmitting signal cascades and remodeling calmodulin expression could impact multiple traits [26,27]. One gene had a non-additive pattern of expression in the F 1 hybrids compared to NMS4-44-329 in Kinston and N7103 in Clayton. This gene is a methionine sulfoxide reductase annotated as being involved more broadly in oxidation reduction. This gene has been well studied in G. soja as important in tolerance to alkaline stress [28]. Perhaps it plays a role in improving yield as well. Genes with an allele specific expression bias in the F 1 hybrids were enriched for "ADP-binding proteins". We note that this category overlaps a category enriched in high yielding F 1 hybrids in Clayton. Therefore, the impact of "ADPbinding proteins" on yield may be worth further investigation.
The third goal of this research was to compare genes differentially expressed in the leaves of the F 1 hybrid relative to the parents to gain insight into the physiological basis of improved seed protein content. The improvement of the protein content of seed and seed production per hectare derived from the F 1 hybrids over the mid-parent value in both locations has provided a robust dataset to identify genes associated with improved protein content in soybean. Genes associated with the "amine metabolic process" that could potentially impact N metabolism in the leaf and may ultimately impact seed protein content were up-regulated in leaves in one location. Four asparagine synthetase genes were increased in expression in the F 1 compared to both parents in Clayton. Asparagine and asparagine synthase are thought to play an important role in N use efficiency and alterations in this metabolic pathway in the leaves could impact seed composition by affecting N availability [29][30][31]. In addition to four asparagine synthase genes, three of which were highly identical, 15 other genes were non-additively expressed with closely related genes based on sequence identity. Co-expression of orthologous sequences is consistent with the proposal of Burton and Brownie that duplicate gene interactions may be associated with heterosis in soybean. Evaluation of the progeny of the F 1 hybrids for elevated asparagine synthetase expression may provide a method to identify the best lines for elevated seed protein early in the breeding process. Some differentially expressed genes in the relevant categories discussed above are not represented in enriched ontologies. For example, one of the genes with an allele specific bias is annotated as a nitrate transporter that may play a role in improving N balance. The inheritance and expression of these genes can be evaluated in the progeny of the F 1 hybrids.
This research suggests that additive gene expression, non-additive gene expression and allele specific expression bias could be used to evaluate the likelihood that F 1 hybrids will produce agronomically valuable progeny. Non-additive gene expression reported for maize heterotic hybrids relative to their parents appeared to be more common than we report for soybean [32,33]. Historically, genetic markers linked to difficult to select traits and markers associated with QTLs have played an important role in plant breeding. Sequencing based approaches are no longer uncommon in plant breeding, and with less expensive sequencing, it is worth considering using sequencing to evaluate crosses early in the breeding process. F 1 hybrids are generated in the breeding process and are available for nondestructive analyses. Creating these initial crosses is not trivial, but it is not the most difficult or expensive part of the breeding process. If multiple promising parents are selected it may be possible to pick the most promising crosses for further analyses by evaluating the parameters listed above. Future evaluation of progeny between NMS4-44-329 and N7103 will provide a test case to determine if increased expression of asparagine synthase is associated with improve seed protein content in multiple environments and if this pattern of expression persists during seed fill when seed protein content is likely to be impacted. One limitation of this approach is the location specific expression pattern of many of the relevant genes. If the expression of genes varies across years or locations, they may be difficult to associate with the desired trait. Therefore it is worth focusing on changes in gene expression that are not affected by the environment. As the correlation between gene expression and phenotypes are investigated in more environments, it may be possible to gain insight into physiological pathways associated with traits influenced by environmental effects by evaluating changes in gene expression that are not similar across environments. In this study, the allele specific bias was also very stable across locations. In cases where this bias is associated with a targeted trait it may be possible to use this biased expression as leverage for progeny selection in a breeding program.
Supporting information S1