Insights into the Loblolly Pine Genome: Characterization of BAC and Fosmid Sequences

Despite their prevalence and importance, the genome sequences of loblolly pine, Norway spruce, and white spruce, three ecologically and economically important conifer species, are just becoming available to the research community. Following the completion of these large assemblies, annotation efforts will be undertaken to characterize the reference sequences. Accurate annotation of these ancient genomes would be aided by a comprehensive repeat library; however, few studies have generated enough sequence to fully evaluate and catalog their non-genic content. In this paper, two sets of loblolly pine genomic sequence, 103 previously assembled BACs and 90,954 newly sequenced and assembled fosmid scaffolds, were analyzed. Together, this sequence represents 280 Mbp (roughly 1% of the loblolly pine genome) and one of the most comprehensive studies of repetitive elements and genes in a gymnosperm species. A combination of homology and de novo methodologies were applied to identify both conserved and novel repeats. Similarity analysis estimated a repetitive content of 27% that included both full and partial elements. When combined with the de novo investigation, the estimate increased to almost 86%. Over 60% of the repetitive sequence consists of full or partial LTR (long terminal repeat) retrotransposons. Through de novo approaches, 6,270 novel, full-length transposable element families and 9,415 sub-families were identified. Among those 6,270 families, 82% were annotated as single-copy. Several of the novel, high-copy families are described here, with the largest, PtPiedmont, comprising 133 full-length copies. In addition to repeats, analysis of the coding region reported 23 full-length eukaryotic orthologous proteins (KOGS) and another 29 novel or orthologous genes. These discoveries, along with other genomic resources, will be used to annotate conifer genomes and address long-standing questions about gymnosperm evolution.


Introduction
Gymnosperms have undergone 300 million years of evolution since their divergence from the ancestors of modern angiosperms, and they possess enormously complex genomes in comparison [1][2][3]. Increased ploidy level and individual repeats in high copy number are common in angiosperms but are rarely seen in gymnosperms [4,5]. While rapid progress has been made in characterizing the genomes of angiosperms, the same is not true for gymnosperms, in part due to an order of magnitude increase in their size and complexity. Conifers are by far the most important representatives of the gymnosperms, prevalent in a variety of ecosystems and representing 82% of terrestrial biomass [6]. Comparative studies have demonstrated that they are characterized by reduced coding region evolution, retrotransposon proliferation, highly diverged repetitive sequences, accumulation of noncoding regions, and extensive gene duplication [3,4,7,8]. Until recently, sequencing conifers has been nearly impossible due to the assembly complexity and sheer magnitude of sequence (Taxodium distichum: 9.7 Gbp, Picea abies: 19.6 Gbp, Pinus banksiana: 22.3 Gbp) [9]. Recent advances in the cost and utility of second generation, high-throughput sequencing technologies have made it possible for ten conifer reference genomes to be assembled (http://www. pinegenome.org/pinerefseq). Together, these will aid breeding efforts and illuminate mechanisms behind adaptive diversity for managing forest populations.
The largest genera in the order Coniferales, Pinus, comprises over 100 species and accounts for over 40% of global forest plantations [10]. Loblolly pine, a relatively fast-growing and economically important representative of the conifers, is a forest tree species native to the Southeastern United States. Traditional commercial markets for loblolly pine have included lumber, pulp, and paper, but more recently, it has become a major bioenergy feedstock in lignocellulosic ethanol production [11]. Recent estimates [9] place the size of the loblolly pine genome between 21 and 24 Gbp. In the context of completed genome projects, this is approximately seven to eight times larger than the human genome, and about four times the size of the largest angiosperm for which we have a reference genome, Hordeum vulgare (barley) [12]. Large-scale EST discovery projects have generated significant transcriptomic resources in the absence of a reference genome, including 328,662 ESTs, 17,379 UniGenes, and over 200,000 contigs from Roche/454 sequencing [13,14]. The current genomic resources for loblolly pine include 103 Bacterial artificial chromosomes (BACs) totaling 10 Mb. BACs are the most popular genomic DNA construct and have been used extensively for genome characterization and in hierarchical sequencing projects like the human genome [15,16]. BAC libraries offer a source of large genomic inserts (up to 200 Kbp) that facilitate studies of genomic content without a reference sequence [17]. They have been used in genomic studies of other gymnosperms, including Pinus pinaster [18], Picea glauca [19] and Taxodium distichum [20]. In addition to the BAC resource, we have generated just over 90,000 fosmid sequences totaling 265 Mb. Fosmids have been used to characterize several plant genomes, including sugar beet [21], Maire yew [22], cucumber [23], and strawberry [24]. Fosmids are less expensive to generate than BACs and have less cloning bias; their insert size can also be more narrowly controlled [25]. Like BACs, they are an ideal resource for preliminary genome characterization and for improving whole-genome shotgun assemblies.
While several categories of transposable elements (TEs) exist in plant genomes, the Class I elements known as long terminal repeat (LTR) retrotransposons are most frequently associated with genome size variation [26]. Studies across several plant species have noted that polyploidization and rapid proliferation of TEs can each lead to genome expansion [27]. Polyploid angiosperms such as Oryza sativa, Vitis vinifera, and Hordeum vulgare have repetitive content estimates of 35%, 41.4%, and 84% respectively [12,[28][29][30]. The diploid genome of a wild species of rice, Oryza australiensis, with a repetitive content estimate of 76%, has doubled in size over the last three million years through the expansion of a select few retrotransposon families [31]. Phylogenetic inference of gymnosperm lineages provides strong evidence for a recent, large increase in genome size in pines without associated acceleration of phenotypic diversity, suggesting periods of retrotransposon expansion [32]. The repetitive content of conifers Picea glauca, Pinus taeda, and Taxodium distichum have been estimated, through analysis of BAC sequences, to be anywhere from 20% to 90% [7,19,20]. Characterization of Taxus mairei fosmid sequences yielded a repetitive content estimate of 20.8%, excluding tandem repeats [33]. As in many plant genomes, an abundance of TEs has been noted in conifers, specifically those of the Copia and Gypsy families [34]. A handful of these LTRs have been characterized, and their distributions suggest that conifer genomes have very diverse and abundant repetitive sequences [3,7,35]. Most of these elements, however, remain uncharacterized.
In this study, we present a comprehensive analysis of repeats in both the existing BACs (11 sequences provided by [7] and 92 sequences available in Genbank [35] and an extensive novel fosmid resource. The fosmid set is the largest genomic resource generated to date for a gymnosperm, and consists of 90,954 scaffolds assembled from fosmid pools. Interspersed and tandem repeats were detected using both homology-based searches, which identify repetitive elements based on similarity to existing elements in a repeat library, and de novo searches, which scan for repeats without prior knowledge of repeat element sequence. Of particular interest are previously undiscovered repetitive element families, which were analyzed and annotated. In addition, a combination of orthologous proteins and ab initio gene prediction approaches were employed to characterize the gene space. These efforts generated a more comprehensive view of repetitive content for loblolly pine, as well as a bioinformatic framework for full genome annotation.

Fosmid DNA Prep, Library Construction, Sequencing
Genomic DNA from Pinus taeda genotype 20-1010 diploid needle tissue was isolated and extracted using a modified version of the method [36]. The extracted genomic DNA was then sheared to an average size of approximately 40 Kbp and size-purified by pulsed-field electrophoresis. This was followed by ligation to excess vector, pFosTH, (dephosphorylated ends) and vector packaging (extracts from mcrA, B, C strains) to create a particle library. The particle library was then titered, portioned, and converted into E. coli colonies. This procedure resulted in 11 pools of approximately 580+/210% E. coli colonies, each colony containing a single fosmid.
The 11 harvested bacterial colony pools were then amplified in vivo. Fosmid DNA was subsequently purified and digested with the homing endonuclease PI-SceI, which has a 35 bp recognition site. With the isolated DNAs quantified by PicoGreen, the purified insert DNAs were portioned out to create an equimolar super-pool of all 11 component fosmid pools, as well as a set of three nested equimolar super-pools of 8, 4, and 2 small fosmid pools. The two smallest pools were not considered further in this analysis.
The fosmid DNA was processed into paired-end Illumina libraries for subsequent second generation sequencing. The largest super pool (of 11 pools) was converted to a long insert jumping library, and the smaller nested super pool (of 8 pools) was converted to a short insert fragment library. Standard Illumina reagents and protocols were used for library construction. Large fragments were generated with a HydroShear and small (subkilobase) fragments were generated with a Diagenode Bioruptor.
Paired-end libraries were subsequently sequenced on an Illumina GA2x [SCS version 2.9.35, RTA version 1.9.35]. Approximately 45 million 125 bp paired-end reads were obtained from the fragment library while 27 million 125 bp paired-end reads were obtained from the jumping library. Subsequent alignment of the reads to the E. coli and vector genomes was used to determine median insert size. Ungapped single seed alignment was done using Illumina's CASAVA pipeline version 1.7.0. GERALD reported a median insert size of 3,300 bp for the jumping library with a coefficient of variation of approximately 5%. The shorter fragment library had a median insert size of 258 bp with a coefficient of variation of approximately 4%. The reported rates of the non-canonical paired-end alignment orientations were also consistent with high quality libraries.

Fosmid Assembly
Prior to assembly, the reads were processed with multiple methods designed to improve the overall assembly. The reads were quatity (Q20) and adapter trimmed using the ''fastq-mcf'' program, part of the ''eautils'' [37]. Following the trimming, the reads were aligned using BWA to a contaminant database composed of pFosTH, Ecoli, chloroplast and mitochondrion sequences [38]. The mate pairs that had one or both reads aligned to this database were discarded. The fosmid pool sequences were assembled using SOAPdenovo v1.05 [39] with a k-mer size of 63. Four iterations of GapCloser (v1.12) with a SOAPdenovo rescaffolding step in the middle were used for gap closing and assembly improvement. All scaffolds of lengths .200 bp were submitted to Genbank (WGS: APFE01000000).

BAC Sequencing and Assembly
Two sets of loblolly pine BAC sequences derived from the Mississippi Genome Exploration Laboratory (MGEL) and Clemson University Genomics Institute (CUGI) libraries [35,36] were analyzed here. Ten BAC clones (Genbank Accession Nos. GU477256-GU477266) were individually Sanger sequenced and assembled with Arachne [7]. The average read depth of each BAC assembly ranged from 6x to 16x. Nine of the BAC assemblies were resolved into a single scaffold and the last resolved into two unoriented contigs. An additional set of 92 BAC clones (Genbank Accession Nos. AC241263-AC241362) from the same libraries were sequenced with 454/Roche Pyrosequencing with an average read depth of 7x.
To eliminate potential redundancy between the independently derived fosmid and BAC datasets, three fosmid scaffolds identified as exact matches (99% identity and full coverage) against the longer BAC sequences were removed. In addition, 85,675 assembled fosmid contigs with lengths ,201 bp were removed from the final set.

Repeat Identification
A combined methodology of homology-based and de novo approaches was used to identify the tandem and interspersed repetitive content in the BAC and fosmid sequences ( Figure 1).

Tandem Repeat Identification
Tandem Repeat Finder (TRF) v4.04 [40] was run with default parameters to identify tandem repeats in the BAC and fosmid sequence sets. Repeat units, or periods, of sizes 2 to 500 bp were considered. Mononucleotides (period size of 1) were not included due to the high likelihood of repeat collapse in the assembly process. To accurately assess the coverage of tandem repeats, we filtered overlaps by discriminating against hits with low TRFbased alignment scores. For comparison, these methods were applied to the reference genomes of Arabidopsis thaliana v1.6.7, Vitis vinifera v1.4.5, Cucumis sativus v1.2.2, and Populus trichocarpa v2.1.0, available through Phytozome [41]. In addition, BAC sequences (Table S1) from Picea glauca (white spruce) and Taxus mairei (Maire yew) were analyzed (Genbank Accession Nos: 262263013, 221149199; 327410373-327412295). NCBI-BLASTN v2.2.26 was run with default parameters to verify whether particular satellites were specific to Pinus taeda.

Interspersed Repeat Identification (Homology)
Censor v4.2.27 [42] was used to annotate repeats through homology. Censor utilizes NCBI-BLASTN (by default) to align queries against a database of curated repetitive elements. For this analysis, a custom plant repeat database (CPRD) was constructed using a subset of Repbase v17.07. This database is composed of plant repeats (plnrep, grasrep, mcotrep, dcotrep, oryrep and athrep) as well as five TEs available in the literature and previously classified in forest trees: TPE1 from Pinus elliottii [43], PpRT1 from Pinus pinaster [44], Gymny from Pinus taeda [3], PtIFG7 from Pinus taeda [7], and Corky from Quercus suber [45]. Censor makes available three sensitivity modes (normal, rough and sensitive) along with varying BLAST parameters to balance sensitivity and speed. The normal mode (blastall -p blastn -b 10000000 -Y 1000000 -e 0.00002 -a 64 -r 3 -q -4 -y 80 -X 130 -Z 150 -W 10 -J T -m 8 -i QU -d DB -F 'm D') was selected, as it yielded the same or improved alignments when compared against the sensitive mode for our datasets. Runs were performed in both redundant (a region may be annotated more than once) and non-redundant (the annotation with the highest alignment score is chosen) modes to provide a basis for species comparisons. Full-length hits and repeats that belonged to previously characterized families were identified using a modified version of the 80-80-80 standard [46]. This dictates that alignments should be .80 bp in length with at least 80% identity, and should cover 80% of the reference sequence. In this instance, percent identity was substituted with Censor's sim metric, and is defined as: (match_count/( alignmen-t_length -query_gap_length -subject_gap_length + gap_count).

De Novo Interspersed Repeat Identification
De novo repeat identification focuses on the discovery of repetitive sequence through self-alignment and/or pattern matches for structural features. To implement de novo repeat identification, the BACs (103 sequences) and fosmids (90,954 sequences) were run through the REPET TEdenovo pipeline (v2.0) with default settings [47]. This software wraps several open-source tools to identify, characterize, and derive a consensus sequence for each element. The first stage is self-alignment (all vs all) with NCBI BLAST 2.2.25 to detect high-scoring segment pairs (HSPs) [48]. This is followed by three separate rounds of clustering with three independent algorithms for repeat identification, including Grouper [48], Recon [49] and PILER [50]. Up to three clusters for each element are aligned with the multiple sequence aligner, Map [51], to derive a consensus. A structural search is run simultaneously with LTRharvest to identify highly diverged LTR retrotransposons [52]. The results from this secondary search are clustered with BLASTCLUST to group candidate sequences [53]. The results from both pipelines are combined, and subsequently filtered to reduce the redundancy generated from independently derived consensus sequences.
TEclassifier, a secondary REPET application, provides a firstpass annotation of the consensus sequences. This application delivers annotations by running BLASTER against four databases [47]: Repbase 17.07 (in both nucleotide and amino acid formats), a Pfam protein domain HMM database specifically formatted for REPET, a database of known genes from the host organism (for which we used 366 full-length Pinus taeda cDNAs and 945 Pinus radiata full-length cDNAs [54]), and a database of ribosomal DNA sequences from the host organism (for which we used 52 ribosomal Pinus sequences). Poly-A tails, tandem repeats, open reading frames, terminal repeats, and tRNAs are also are detected at this stage. After the database searches and filters, order-level annotations are provided according to the repeat classification system [46].
To contend with false positives and partial sequences, all sequences identified by REPET were treated as repetitive element seeds. In this context, seeds are de novo consensus sequences used as a library. The seeds from both sequence sets were combined and submitted to a non-redundant Censor search in normal sensitivity mode against the BAC and fosmid references. This enabled the full-length repeats to be mapped and aligned back to the reference. Alignment statistics and positions were recorded in a local MySQL database to facilitate downstream queries. All raw full-length hits were extracted from this database, observing the modified 80-80-80 standard described above.

De Novo Interspersed Repeat Annotation
To effectively combine the homology and de novo search results, sequences corresponding to the full-length hits were extracted and compared against the homology database, CPRD. These alignments were manually classified and annotated into (Class: Order: Superfamily) based on the curated descriptions available from the database following a variation of the three-step classification procedure [46]. Combined with the 80-80-80 standard, BLASTN was used to further classify elements into families. USEARCH 6.0.203 [55] followed to compare the translated de novo set with the ORFs of CPRD to classify repeats at the superfamily level. ORFs were identified using the findorfs utility in USEARCH with default recovery parameters. Finally, since REPET analyzes the structural motifs of each element through TEclassifier, the assignments provided by REPET were used to classify the repeats by Order. The high quality, novel repeats discovered were stored as a multiple sequence FASTA file, and referred to as the Pine Interspersed Element Resource (PIER).
To characterize sequences without a qualifying match against CPRD, clustering techniques were applied to form groups of unique elements. These were subsequently sorted into repeat element families through all versus all alignment. These unannotated sequences were clustered using USEARCH's cluster_fast utility set to 80% similarity. A consensus sequence was generated for each cluster through multiple sequence alignment in MUSCLE [56] and PILER. The coverage of a family was determined by summing the number of base pairs of all full-length sequences belonging to all clusters within the family. The cluster with the most full-length sequences was selected as a representative for further analysis. Multiple sequence alignments of clusters representing high coverage families were visualized in Jalview [57], and the consensus sequence of each cluster was extracted. The consensus of each cluster was analyzed using BLASTN to target the LTR regions, and annotated using LTRdigest [58] with Pfam profiles [59]. In the challenging case of one nested, high-copy element, PtOzark, PILER proved superior at generating an informative consensus. A database of Populus trichocarpa tRNAs [60] was used to find matches to known repetitive elementassociated proteins. Consensus repeats of ten novel, high-copy clusters were run in CENSOR with the normal sensitivity setting against four plant genomes available through Phytozome [41]: Arabidopsis thaliana v1.6.7, Vitis vinifera v1.4.5, Cucumis sativus v1.2.2, and Populus trichocarpa v2.1.0. Two conifer BAC resources were also compared: Taxus mairei [33] and Picea glauca [61] (Genbank Accession Nos: 262263013, 221149199; 327410373-327412295). All references were evaluated to determine whether similar elements exist in other plant species.

Gene Identification
A combination of similarity and ab initio gene prediction approaches were implemented to characterize the coding region of the BAC and fosmid sequence sets. In order to train ab initio gene prediction algorithms, a set of high confidence transcripts were extracted from the Newbler version 2.3 (454 Life Sciences, Branford, CT) assemblies of four Pinaceae species (Pinus taeda, Pinus lambertiana, Pinus palustris, and Picea abies) [14]. Full-length loblolly pine mRNA sequences available in Genbank were also included. Direct evidence for exon prediction was collected from ESTs and cDNAs of the following Pinus species closely related to loblolly pine, according to chloroplast phylogenetic analysis [62]: P. ponderosa, P. attenuata, P. contorta, P. banksiana, P. thunbergii, P. resinosa, P. pinaster, P. elliottii, P. sylvestris, P. radiata, P. patula, P. torreyana, and P. palustris. Alternative EST and cDNA evidence was provided by species further diverged, including: Pinus lambertiana, Picea abies, Picea glauca, Picea sitchensis, Pseudotsuga menziesii and Picea mariana. These transcripts were combined, and redundant sequences removed using USEARCH set to 95% identity. The resulting consensus sequences were used to train gene identifica-tion parameters for Augustus [63] inside of MAKER2 [64]. Identification of genes from the eukaryotic orthologous proteins (KOGS) was performed via CEGMA [65]. Repeat masking was performed in MAKER2 via Repeatmasker [66] and used the PIER library.

Sequence sets
The cumulative base pair lengths of the BAC and fosmid libraries are 11,858,447 and 265,480,119, respectively. Combined, these resources represent approximately 1.26% of the estimated 22Gbp Pinus genome ( Table 1). The molecular technologies used in library construction and sequencing gave rise to different read lengths, and, subsequently, assemblies, between the two sets. There was a three fold difference in their respective N50 lengths (127,167 bp for the BACs and 16,205 bp for the fosmids). Average GC content across the BACs was 37.98%, and 38.09% in the fosmids. The GC content was higher than in other conifers (Taxus mairei with 37.05% and Picea glauca with 37.11%) and dicots (Juglans regia L. with 37.7% [67], Cucumis sativus with 32.25%, Vitis vinifera with 34.55%, and Populus trichocarpa with 33.75%), but lower than in sequenced monocots, such as Oryza sativa (43%) [68].

Tandem repeats
A total of 63,622 tandem repeats were identified in the BACs and fosmids (Table S3). Combined, they covered 7.2 Mbp of the sequence sets (2.60%) ( Table 2). Comparative analysis of other plant genomes revealed species with higher tandem content estimates including: Vitis vinifera (4.44%), Populus trichocarpa (4.24%), and the Picea glauca BACs (3.54%). Others were shown to have a lower tandem estimate, including: Cucumis sativus (1.56%) and the fosmid sequences of Taxus mairei (1.54%). The average length of an annotated tandem repeat array in loblolly pine was 116.72 bp in the BACs and 113.34 bp in the fosmids. Copies of repeat units ranged from 1.8 to 1,780 copies, with an average length of 6 bp and 4.4 bp in the BACs and fosmids, respectively. Minisatellites made up the largest physical portion of the genome (51,997 loci, 1.56%), followed by satellites (6,482 loci, 0.96%), and microsatellites (5,143 loci, 0.09%) ( Table 2, Table 3). Copy number, however, was skewed towards microsatellites, with an average of 25 per array, 10 times that of either the minisatellites or satellites. Within the microsatellites, dinucleotides dominated, making up 74.5% of all the microsatellites. The (AT/TA)n motif alone had 1,869 arrays (38,976 copies, 47% of microsatellites), and constituted the largest physical area of the genome for microsatellite motifs, at 0.03%. AT-rich trinucleotides (ATT/AAT/ATA/ TTA/TAT/TAA)n also made up a significant portion of the total microsatellites, with 347 arrays and 5,328 copies (8.72%). Interestingly, the single array with the highest copy number was a 25 bp period (TGCTTTGCTGCTTAGTCTCTCATAG) with 654 copies (,16 Kbp) in the fosmid contig Pita_fosmi-d_APFE01000000_90533. This novel array, Pita_MSAT16, had no significant similarities when compared against the nucleotide (nt) database of NCBI and Repbase. After filtering for redundancy, tandem repeats annotated 3.31% of the BACs (3.04% of the 11 Sanger sequenced BACs) and 2.59% of the fosmids. However, after removing those that overlapped with interspersed content, Figure 1. Repeat detection methodology. BAC sequences (103) and fosmid sequences (90,954) were analyzed for tandem repeats (TRF), interspersed repeats by homology (CENSOR against CPRD), and interspersed repeats via de novo methods (REPET). Details of the annotation process are also shown. doi:10.1371/journal.pone.0072439.g001 these estimates were reduced to 0.93%, 0.53% and 0.54% of the BACs, fosmids and combined sets, respectively (Table 4).
Among species compared, microsatellite densities, from highest to lowest, were as follows: Populus trichocarpa, Vitis vinifera, Cucumis sativus, Picea glauca, Taxus mairei, Arabidopsis thaliana, and Pinus taeda ( Figure 2, Table S2). In comparisons with other species, the ratio of minisatellites to satellites was large, as expected ( Potential centromeric, telomeric, and heterochromatic repeats were identified using Censor [42], though most were spurious, with an average coverage of 28% and a maximum of 75% identity. Of these spurious matches, 247 fragments were from Zea mays, four   from Zingeria biebersteiniana, one from Sorghum bicolor, and one from Secale cereale. The previously characterized plant telomeric repeat (TTTAGGG)n [69] was detected a total of 237 times across 23 different arrays.

Interspersed Repeat Identification (Homology)
Homology searches against Repbase, using Censor, yielded a total of 14,470 and 175,004 hits to repeat fragments in the BACs and fosmids, respectively. A large portion of elements annotated by unfiltered Censor runs were previously characterized in Populus trichocarpa (15.7% of alignments), followed by Zea mays (13.0%), Sorghum bicolor (12.6%) and Malus domestica (9.9%) ( Figure 3A). After filtering for full-length copies, 94 were identified in the BACs and 989 in the fosmids. Of these, 69.25% were previously characterized in conifers, the majority annotated as the Copia element, TPE1, originally characterized in Pinus elliottii (30.53%) ( Figure 3B). These filtered alignments represent 2.94% of the BAC sequence and 1.34% of the fosmid sequence (Table 4). When accounting for both partial and full-length hits, 57 unique repeat families spanning 14 repeat orders aligned to the fosmid set. The BACs aligned to 28 distinct families spanning 11 orders. All of the families identified in the BACs were also present in the fosmids.  Table S5).
The estimate of partial and full-length repetitive content by homology of Pinus taeda is 27.50% (Table 5). The primary contributions include 6.27% Copia elements, 11.97% Gypsy elements, 3.95% DNA transposons and 0.49% LINEs (Table S5). This estimate is slightly lower than Picea glauca (33.24%), Populus trichocarpa (33.91%), and Vitis vinifera (42.03%) ( Figure 4A). Among the full-length hits, Gypsy represented 2.54% and 1.32% of the sequence in the BACs and the fosmids, respectively. Copia represented 0.86% and 0.43% ( Figure 4B). Alignments with high overall similarity and coverage (88.75% and 93.87%, respectively) to ribosomal RNA were detected in the BACs and represented 0.22% of the sequence set. Two 414 bp PILN1 LINE elements previously characterized in Pinus thunbergii were identified in the fosmids.
The fosmid scaffolds produced significant alignments with IFG7 from Pinus radiata. This LTR aligned 23 times with an average similarity of 88.8% and coverage of 91.5% to the internal portion, and nine times with an average similarity of 91.4% and coverage of 99.1% to its LTRs. IFG7's complete presence was estimated at 0.58% in the BACs and 0.22% in the fosmids (Table 4). Copia4-PTR_I from Populus trichocarpa aligned four times to the internal portion (average similarity of 90% and coverage of 89.7%) with no full-length hits corresponding to the LTRs ( Table 4). The longest alignments in both the BACs and the fosmids were to full-length, presumably autonomous, RLG_Gymny-1 elements [3]. In the BACs, three of these elements were discovered, with an average similarity of 80% and 100% coverage against the consensus. In the fosmids, there were 11 alignments, with an average similarity of 82% and 100% coverage. The Copia element, TPE1, was the highest in copy number in both sets, with 26 alignments in the BACs (average similarity 94.4%, average coverage 98.6%) and 287 in the fosmids (average similarity 94.4%, average coverage 98.6%). In addition, TPE1 was the highest in overall coverage in the fosmids, with 0.4% coverage (Table 4). PGGYPSYX1 from Picea glauca was also discovered ( Figure 3B, Table 4). Homology searches for full-length elements in the selected angiosperms (Arabidopsis thaliana, Cucumis sativus, Populus trichocarpa, Vitis vinifera) yielded less informative results.

Interspersed Repeat Identification (de novo)
The consensus sequences generated from the REPET pipeline were used as seeds to validate the repeats against the original BAC and fosmid sequences. Initial self-alignment in REPET resulted in 3,433 unfiltered hits in the BAC sequences and 1,654,975 unfiltered hits in the fosmid sequences. Clustering with Grouper, Recon, and Piler resulted in 166, 139, and 27 clusters, respectively, in the BACs and 11,405, 3,814, and 10,33 clusters, respectively, in the fosmids. Among the BACs, 1,256 high-scoring segment pairs (HSPs), representing 325 seeds, and covering 21.64% of the BAC sequence, were annotated as LTRs. In the fosmid set, 60,467 HSPs, representing 5,061 seeds covering 13.95% of the fosmid sequence, were annotated as LTRs. Combined, the LTR content spans 61,723 HSPs constructed from 5,386 seeds, and covers 14.28% of the sequence set. 1,518 HSPs covering 28.74% of the BAC sequence were built from alignments of 488 Class I seeds against BAC sequences. 70,860 HSPs covering 19.54% of fosmid sequence were built alignments of 8,097 Class I seeds against fosmid sequences. The total Class I content for both sets is represented by 72,378 HSPs built from 8,585 seeds, and covers 19.94% of the sequence set. Seeds annotated as Class I retrotransposons included LTR content. Comparatively few seeds were annotated as DNA transposons: only 0.66% of the BAC sequence corresponding to 13 seeds and 0.23% of the fosmid sequence corresponding to 155 seeds, for a combined total of 0.25% (168 seeds). Uncategorized sequence accounted for 81 seeds and 842 HSPs totaling 3.26% of BAC sequence and 3,034 seeds and 75,060 HSPs totaling 4.68% of fosmid sequence. In all cases, BAC-derived HSPs were longer than fosmid-derived HSPs, and the ratio of HSPs to seeds for each category was larger in the BACs than in the fosmids. In the BACs, 2,444 HSPs representing 591 seeds covered a total of 32.94%. 5,061 HSPs representing 11,631 seeds covered 24.93% of the fosmid sequence. With the BAC and fosmid sequence sets combined, 155,686 HSPs representing 12,222 repeat seeds covered 25.27% of the sequence (Table S6).
The final set of non-redundant repeat seeds (12,222), with 11,631 from the fosmid set and 591 from the BAC set, returned alignments against 15,747 full-length sequences across both datasets (Table S6). These sequences had an average length of 4,414 bp and represented 25.98% of the sequence. 489 of these could be classified as one of the six characterized Gypsy or Copia families IFG7, Gymny, Corky, TPE1, Copia4-PTR_I, or TY1_PE, and represent 1.04% of the sequence set. Of the remaining 15,258 sequences, none aligned with confidence to families in CPRD. 11,119 of these sequences, however, could be classified manually at various resolutions. These repeats had an average length of 5,577 bp and represented 22.36% of the sequence set. At the Class level, 10,431 sequences, with an average length 5,803 bp, were classified as retrotransposons, while 688, with average length 2,148 bp, were classified as DNA transposons, covering 21.82% and 0.53% of the sequence set, respectively (Table S6). At the Order level, LTRs composed the bulk of the repetitive content, with 6,666 sequences representing 15.3% of the sequence set. DIRS, Penelope, and LINE elements represented a small portion of Class I elements, with coverage of the combined sequence. Within Superfamily, 617 sequences with an average length of 5,282 bp were annotated as Gypsy, and 317 sequences with an average length 5,078 bp were annotated as Copia. They represented 1.18% and 0.58% of the sequence set, respectively. Unclassified sequences composed 3.62% of the sequence sets (13.95% of the repeats), with an average length of 2,172 bp. These sequences did not align significantly against known sequences, and TEclassifier was unable to assign annotations to them.
Clustering of all unannotated sequences yielded 9,415 clusters (subfamilies), of which 7,015 were singletons, 1,357 contained two sequences, 471 contained three sequences, 195 contained four sequences, and 125 contained five sequences. The top 1% of clusters each contained at least nine full-length sequences. All versus all alignments resulted in 6,270 families. 5,155 elements were considered single-copy families, and the remaining 3,125 clusters were grouped into 1,115 families. In total, 10,057 elements were grouped into families, while 5,155 elements remained singlecopy. As a result of the all versus all alignment, 559 full-length elements were grouped into the ten highest-coverage novel families, representing about 2% of the sequence (Table 6). Elements grouped into the top 100 highest coverage families (including known elements) account for about 19 Mbp (Table S7), or 7% of the sequence set, while the top 400 highest families account for over 11% of the sequence set ( Figure 5). Sequences annotated as members of known repeat families account for most of the largest families when compared side by side with the de novo families. 159 elements annotated as TPE1 comprised 0.39% of the sequence set, 162 elements aligned to IFG7 represented 0.34% of the sequence set, 78 elements aligned to Corky accounted for 0.17% of the sequence set, and 24 elements aligned to Gymny comprised 0.11% of the sequence set ( Table 6).
The top ten de novo LTRs and top four previously characterized families (as annotated through homology searches) account for over 7 Mbp of sequence, or 2.56% ( Table 6). None of the novel de novo families annotated in this study have significant alignments, as defined by the 80-80-80 standard, against CPRD, or when compared against the other six plant genomic datasets. These ten represent six Gypsy elements, three Copia elements, and one unknown LTR. The largest family, PtPiedmont, is an LTR retroelement that contains 133 sequences from six different clusters across almost 1 Mbp of sequence (0.35%). The average sequence length of each element in the representative cluster of PtPiedmont is 7,340 bp. This element is characterized by LTRs that are about 1,000 bp long, and has a primer binding site (PBS) directly adjacent to the 39 LTR. The absence of alignments in the internal region limits the possibility of assigning this element to a superfamily. The six novel Gypsies include: PtOuachita, PtBastrop, PtOzark, PtAppalachian, PtAngelina, and PtTalladega. The second largest de novo family, PtOuachita, contains 47 sequences from 2 clusters across 577 Kbp, or 0.21%, of the sequence set. The average sequence length of PtOuachita's representative cluster is 13,058 bp. This element is characterized by LTRs that are about 1,000 bp long and alignments to RNase_H, RVT_3, rve, RVT_1, Asp_protease_2, and Retrotrans_gag protein families ( Figure 6A). PtOuachita aligns spuriously to LTR retroelements found in Brachypodium distachyon, Sorghum bicolor, Physcomitrella patens, Vicia pannonica, and Arabidopsis thaliana. Translated searches yield a 1,750 bp alignment to Gypsy-2_SMo-I at 40.3% similarity. PtBastrop, with 38 sequences covering 379 Kbp (0.14%) of the sequence set, is 15,520 bp long, and is characterized by a 15 bp primer binding site, LTRs that are about 1,100 bp long, and alignments to Retrotrans_gag, Asp_protease_2, RVT_1, rve, RNase_H, and RVT_3 protein families. PtBastrop aligns trivially to LTR retroelements found in Populus trichocarpa, Vicia pannonica, and Medicago truncatula. Translat-   Figure S1). PtOzark aligns trivially to many LTR elements in over ten species in CPRD. Three novel elements were characterized as Copia LTRs. The third largest novel family, PtCumberland, with 67 sequences covering 431 Kbp (0.16%) of the sequence set, has a length of 9,092 bp. This element is characterized by LTRs that are about 1,500 bp long, an 11 bp polypyrimidine tract (PPT) adjacent to the 59 LTR, and alignments to RVT_2, rve, gag_pre-integrs, zf-CCHC, and UBN2 protein domains. PtCumberland aligns to GmCOPIA10 for 3 Kbp at 67.4%. PtPineywoods, with 68 fulllength copies covering 323 Kbp (0.12%) of the sequence set, is 5,373 bp long, and is characterized by LTRs that are 510 bp long, a 39 PPT, and alignments to UBN2, zf-CCHC, gag_pre-integrs, rve, and RVT_2. Translated searches yield a 1,600 bp alignment to Copia-4_PD-I at 59.2% similarity ( Figure 6C). PtConagree, with 50 full-length copies covering 286 Kbp (0.10%) of the sequence set, is 15,552 bp long, and is characterized by 1,060 bp LTRs. It is annotated with a 39 PBS, and alignments to RVT_3, RNase_H, rve, RVT_1, Asp_protease_2, and Retrootrans_gag protein families. PtConagree aligns across a 4 Kbp to Copia-31-I_VV at 70.3% similarity.

Gene Identification
Of the 458 Conserved Eukaryotic Genes considered in CEMGA, only 23 full-length (70% similarity) genes were present in the BAC and fosmid sequence sets. Transcription factors, heat shock, and ribosomal proteins were among the categories represented. The majority of these identifications were found in the fosmid sequences; just one of the 23 was identified in the BAC set. Analysis of the full set of plant orthologous proteins provided for an additional eight full-length proteins. Three of the eight were exostosin proteins thought to be involved in cell-wall synthesis. Other annotations included an aquaporin, ATP synthase, pyrophosphorylase, and two hypothetical proteins. These orthologous proteins were identified in a range of species including Zea mays, Medicago truncatula, Glycine max, Oryza sativa, Populus trichocarpa, and Ricinus communis. Augustus provided the de novo gene identifications, and a total of 21 well-supported and full-length identifications were confirmed. The combined 52 genes identified have lengths ranging from 3 Kbp to just over 8 Kbp, contributing to the annotation of 298 Kbp of sequence. The intron sizes were overall small, all less than 2 Kbp in length. The substantial annotation of genic content given the small percentage of the genome analyzed could be reflective of the pseudogenes known to be prevalent in conifers.

Data Availability
The multi-FASTA CPRD Repbase library, as well as the library generated from this study, PIER, are available at TreeGenes [13] for download (http://dendrome.ucdavis.edu/resources/ downloads.php). The annotated fosmid and BAC sequences can be viewed through GBrowse, also hosted at TreeGenes (http:// dendrome.ucdavis.edu/treegenes/gbrowse/). The ten novel repeats characterized here have been submitted to Repbase. All fosmid sequences have been submitted to Genbank as WGS: APFE01000000.

Discussion
We performed an extensive characterization of repetitive elements in loblolly pine with sequence constructs representing just over 1% of the estimated 22Gbp loblolly pine genome. Our estimate of the total repetitive content comes in two flavors: an estimation that considers all partial alignments to known or de novo repetitive elements (about 86%), and an estimation that considers only elements that satisfy the 80-80-80 rule (about 27%) ( Table 5). Few studies have explored repetitive content by using full-length elements to quantify the relative frequencies of different families. As in previous studies, we noted that LTRs dominated the repeat landscape, contributing to over 60% of the overall repetitive content. As seen in most plant genomes, the Gypsy and Copia LTRs were most prevalent. Our de novo investigation was able to identify a multitude of LTRs that have not been characterized in conifers and showed very little similarity to elements characterized in other species. In addition, we annotated 52 full-length genes through orthologous and de novo techniques that cover approximately 298 Kbp. Here, we discuss these novel contributions to the expansive genome.

Tandem repeats
Tandem repeats are traditionally divided into three classes: microsatellites, minisatellites, and satellites. They can arise from polymerase slippage, can serve as recombination hotspots, act as effectors of gene expression, and are linked to variation [69,[71][72][73]. While the total tandem content for loblolly pine was estimated at 2.6%, it should be noted that the BAC sequences alone had an estimated content of 3.3%, which is much closer to the estimate of 3.4% for the Picea glauca BACs. The large difference between the fosmids and BACs could be explained by the different scaffold lengths, sequence types, and assembly technologies used. A subset of the Pinus BACs were assembled from Sanger data, which is less likely to experience the repeat collapse common in short read assemblies [74].
Microsatellites (simple sequence repeats) are characterized by a repeat unit of 1-8 bp. They are numerous in genomes and are useful as genetic markers due to their polymorphisms [75].
Loblolly pine had a low microsatellite density when compared to other species within and outside the conifer division (Table 2), a finding consistent with previous analyses [76]. For example, Vitis vinifera and Picea glauca are 1.8 to 2.6 times richer in microsatellite content. Dinucleotide repeats lead in density for every species compared. AT/TA motifs ranged from 42.04% of the microsatellite arrays in Vitis vinifera to 55.03% in Cucumis sativus. In all but Taxus mairei, trinucleotides followed, with an average density of 18.3 (microsatellite/Mbp). The third densest microsatellites for Pinus taeda, Arabidopsis thaliana, Vitis vinifera, and Taxus mairei were heptanucleotides. Most significantly, Pinus taeda and Arabidopsis thaliana heptanucleotide arrays make up 16% and 17.2% of their respective microsatellite arrays. Microsatellite density is generally higher in intergenic regions and introns than in genic regions [77], which supports their role in gene regulation. However, the low microsatellite density in loblolly pine is likely not due to higher genic content, but to the prevalence of more complex repeats, such as interspersed retrotransposons.
Minisatellites (repeat unit between 9 and 100 bp) evolve quickly and are predominantly GC-rich. Both micro-and minisatellites exhibit length/copy number polymorphisms, and originate in similar ways. Certain period sizes, especially 20-25 bp (Table 2), are prevalent within Pinus taeda, suggesting they may be conserved. One hypothesis states that not only are these repeats highly conserved, but also species specific [76,78]. Longer satellites and minisatellites have been observed to have unique hybridization patterns when compared between Picea and Pinus [76]. Our discovery of a ,16 Kbp minisatellite, Pita_MSAT16 (Table S4), with a period size of 23 bp, had no significant homology to previously annotated tandem repeats.
Satellites (.100 bp) are prevalent in centromeres, telomeres, and heterochromatin. The distribution of perfect satellites in the five angiosperms differed from previous studies such as [77] and [23]. The methodologies employed to determine these estimates also varies in these studies. Among the gymnosperms examined here, significant variation was noted. The most prevalent satellite from Taxus mairei had 230 bp periods, almost double the size of those in Pinus taeda (123 bp) and Picea glauca (121 bp) ( Table 3). Regions that returned spurious homologies to known telomeric sequences may be due to telomere-like repeats that are highly amplified and form large intercalary and pericentric blocks [76]. Thus, they are present not only on the ends of chromosomes but as repeated components of similar structure elsewhere.
Our results found the (AT/TA)n motif to be most prevalent, while previous studies noted that the (AC)n and (AG)n motifs are most common in conifers [76,79,80]. The frequency of this motif was, however, less than those computed for Vitis vinifera (42.04%) and Cucumis sativus (55.03%). The (AT/TA)n motif was also one of the most common dinucleotides observed in papaya (Carica papaya L) [78]. AT-richness, defined as A%+T%.60% for any given sequence, prevailed in every class of tandem repeat across all species evaluated. In this study, minisatellites in Pinus taeda were 61.45% and 65.22% AT-rich in the BACs and fosmids, respectively. There is a slight trend for lower AT-richness in conifer micro-and minisatellites when compared to angiosperms as seen when Pinus taeda or Picea glauca (at 61.43% AT-richness) are compared against the 85.77% of Cucumis sativus (Table S3). ATrich repeats prevail in dicots but not monocots [77], and apparently not in conifers. However, due to the limited availability of resources for other gymnosperms considered, it cannot be said with confidence that the tandem repeats identified show gymnosperm-specific patterns.

Interspersed content
A typical investigation of repetitive content involves homologybased searches against a database of known repetitive elements. The primary repository for this purpose, RepBase, contains only 15 elements that have been characterized in gymnosperms. The custom database that we created, CPRD, includes another five elements from conifers described in the literature. These 19 and the full contribution of angiosperm elements in RepBase could only annotate 1.4% of the sequence set as full-length elements and represented just four of the top 14 high-coverage novel elements (Table 6). Even with partial element annotations, the total sequence attributed to tandem and interspersed repeats by homology is 28%. Based on the large genome size and knowledge of retrotransposon expansion, we expected a much higher estimate. A de novo approach was critical in describing the significant number of highly diverged retrotransposons. The REPET pipeline combines both sequence self-alignments and structural identifications to do this. The self-alignment portion uses three different local alignment/pattern detection packages (GROUPER, RECON, and PILER) as well as subsequent processes to reduce redundancy and identify a consensus sequence for each element [47]. The structural identification of LTR retrotransposons through LTRharvest is ideal for characterizing low or single-copy elements. The combination of similarity and de novo methods allowed us to identify 29% of the sequence as fulllength elements and nearly 86% as full or partial. In short, we were able to apply both approaches to maximize the sensitivity and specificity needed to find and characterize diverged repeats.
The combined, full-length and partial element estimate of 86% falls just outside the range provided previously for loblolly pine (24% -80%) according to [7], who first examined ten of the 103 BAC sequences. It is comparable to the estimate in Taxodium distichum (90%), but much higher than that in Picea glauca (40%) [19,20]. While this estimate is higher than most angiosperms, a few species including Zea mays (85%) and Hordeum vulgare (84%) [12,81] are reported to have similar amounts of repetitive content. The estimate of full-length retrotransposons in the sequence sets surveyed was 22%, which represents about 87% of the full-length repetitive content. Many angiosperm species have comparable ratios for retroelements, including Sorghum bicolor (70-76%), Zea mays (88%), and Glycine max (72%) [82]. The full-length repetitive sequence captured with both similarity and de novo approaches was greater in the BAC sequences (39%) than in the fosmids (26%). In addition, both classes of TEs had over 1.5x the number of fulllength identifications in the BACs when compared with the fosmids. The LTRs characterized had lengths up to 29 Kbp and could easily be missed in the smaller fosmid sequences. As mentioned previously, the Sanger sequenced BACs may also have a superior assembly in repetitive regions allowing for improved identification.
Two primary divisions exist to describe interspersed repetitive content, generally known as transposable elements. Class I retrotransposons require reverse-transcription. They can be divided into two types, based on the presence or absence of direct repeats at the ends of the element, known as long terminal repeats (LTRs). They are often characterized by their pol and gag domains, which are closely related to retroviral proteins. Class II DNA transposons are much less common, and do not require a reverse-transcription step to integrate into the genome. Instead, a transposase, an enzyme that catalyzes transposition, recognizes the terminal inverted repeats (TIRs), excises the TE, and integrates the transposon into the new acceptor site. Both Class I and Class II exhibit complex biological roles in regulation, suppression, and expression [83]. They can evolve to become fully functional genes or duplicate genes to modify regulation [84,85]. TEs are also known to insert within the sequence of another transposable element, within tandem repeats, or within genes [86][87][88]. In this study, the ratio of Class I to Class II elements (full-length and partial) for Pinus taeda was 41:1. Several full-length DNA transposons were identified with the de novo methodology, while none were confirmed via homology-based methods due to the lack of characterized DNA transposons for conifers. DNA transposons represents 0.53% of the full-length repetitive content and 1.52% of the unfiltered repetitive content in our de novo analysis. A few angiosperms, such as Oryza sativa, are noted to have a much more substantial contribution of DNA transposons relative to retrotransposons [89]; however, most studies have noted that they are at low frequencies when compared with Class I elements [82]. For the angiosperm genomes compared in this study, based on homology, only Arabidopsis and cucumber had near equal contributions from both repeat classes ( Figure 4A).
Among Class I elements, non-LTR retroelements classified as LINEs were only identified in the de novo portion of the analysis and represented 0.71% of the sequence set. LINEs have been found at low frequencies in conifers [90] and vertical transmission has been surmised to be the dominant cause of LINE proliferation in angiosperms [91]. Two ancient LTR superfamilies, Gypsy and Copia, dominate plant genomes and are widespread across chromosomes (consistent with propagation via RNA intermediate) [92]. Though Copia and Gypsy LTR retroelements differ only by the ordering of their RT and INT domains, the ratio between these families varies across plants ( Figure 4A). The conifer BACs and fosmids analyzed here (Pinus taeda, Picea glauca, and Taxus mairei) appear LTR retrotransposon-dense, and DNA transposon and LINE deficient, when compared to the five angiosperms (Figure 4 A). Our evaluation of BAC and fosmid sequences estimated the ratio of Gypsy to Copia in Pinus taeda at 1.9:1, Picea glauca at 1:1.2, and Taxus mairei at 1.7:1 (Table 4; Figure 4A). Picea glauca, and Cucumis sativus showed slightly greater contributions of Copia over Gypsy elements. Populus trichocarpa, however, had a much greater contribution with a Copia to Gypsy ratio of about 3:1 ( Figure 4A). Our full-length analysis was consistent with the homology-based estimates for Pinus taeda ( Figure 4B), with both yielding a ratio of 2:1.
Many angiosperm TEs, like the LTR elements of grass genomes, are evolutionarily young and distinguishable [7]. Gymnosperms are markedly different; one hypothesis is that a few elements inserted early, propagated heavily, and diverged via vertical transmission [90]. Phylogenetic analysis based on hybridization studies of 100 RT fragments of Gypsy and Copia elements in 22 conifer species revealed many gymnosperm-specific elements with similar diversity estimates [90]. Support for this hypothesis is reported in the analysis of Picea glauca BACs in which transposons appear to have accumulated multiple mutations, indels, and rearrangements [61]. This was again supported in the analysis of Pinus taeda BAC sequence, where retroelement frequency distributions support the theory that the genome complexity is largely due to retrotransposon derivatives [3]. In this study, the families identified are numerous, and few annotate to species outside of the Pinus genus (Fig. 3B). In addition, we note that only 26% of the genomic sequence sampled is full-length, while 59% is from partial elements. Fourteen (a combination of ten novel and four previously characterized) of the high-copy families constitute only 2.56% of the sequence set, with none exceeding 0.5% individually. The largest novel LTR family, PtPiedmont, accounts for only 0.35% of the sequence set (Table 6). Together with diverged elements that are still actively transposing, RT polymerase domains, the most conserved regions of retrotransposons [83], constitute most of our spurious alignments to other genomes.

Novel high coverage repeats
Among the ten high coverage novel families, six show similarity to known Gypsy elements, while three show similarity to known Copia elements. Nine of the high coverage novel families (PtConagree, PtPiedmont, PtCumberland, PtOuachita, PtBastrop, PtPineywoods, PtAngelina, PtTalladega, PtOzark) had no significant (fulllength) alignments and are not found in other plant species. The exception, PtAppalachian, aligned to several sequences from other plants due to the embedded PGGYPSYX1-like sequence. Interestingly, there are virtually no LTR artifacts at either end of the PGGYPSYX1 insertion within PtAppalachian. Observation of the high coverage alignments (Figure 6; Figure S1) shows a substantial amount of LTR degradation in PtPiedmont, PtOuachita, PtBastrop, PtAppalachian, PtPineywoods, and PtAngelina, and each family consensus sequence lacks a detectable PBS, PPT, or both. Many sequences within each cluster are thus nonfunctional and not actively proliferating. Within the high coverage elements, a few have multiple copies of reverse transcriptase, but these copies are usually from different RT lineages. In some cases, the LTR aligns partially within the consensus sequence. More specifically, we observed that the region of conservation in the sequence may extend up to 96 bp past the aligned LTR region, as seen in PtAppalachian. Embedding of retroelements within other retroelements, especially in flanking regions, has been found in Beta vulgaris, where separate Copia and Gypsy elements inserted into an older LTR retrotransposon [93]. This is also seen in Brassica species, where nested elements were observed to share similar LTRs [94]. Nesting may have occurred here in PtOzark, as well as in PtPiedmont and PtAngelina. LTR regions delineating the insertion in PtOzark can be detected, along with many Pfam alignments within the region of insertion. The fact that internal LTRs in PtOzark can be detected seems to conflict with the lack of internal LTRs in PtAppalachian. However, the LTRs aligned at only 81% identity, suggesting that significant degradation has already taken place. Furthermore, full-length copies of PtOzark are rare, so it may be a recent insertion. PtPiedmont and PtAngelina both show alignments characteristic of nesting and novel regions unique to each element. These observations, coupled with the fact that about 86% of the sequence set is comprised of low-copy repetitive elements (with individual families accounting for no more than 3% of the sequence set), suggest a complex system of LTRelement retrotransposition. The mechanisms behind the colonization and proliferation of LTRs in gymnosperms are still being elucidated, and remains an interesting topic of research for future studies. Figure S1 Annotated high copy LTR repeat families. Multiple alignments of seven high coverage and novel elements were performed using MUSCLE and visualized in Jalview. The final consensus sequence was exported with substitutions resolved, annotated (LTRdigest), and visualized (AnnotationSketch). High coverage elements include PtPiedmont, PtCumberland, PtBastrop, PtOzark, PtAngelina, PtConagree, and PtTalladega. These, in addition to PtOuachita, PtAppalachian, and PtPineywoods (Figure 6), represent the 10 high coverage, novel elements. (TIFF)     Table S6 REPET seed library composition. Counts for each classification provided by TEclassifier were aggregated for BAC and fosmid datasets. No. HSPs is the number of high scoring hits used to construct the seeds in a particular category. Length is the total length in base pairs of all HSPs in a particular category. Percentage of sequence set is a function of Length and the length of the sequence set. Percentage of repeats is the relative ratio of the category of repeats to all repetitive content. (XLSX) Table S7 De novo annotation summary. Each full-length sequence found in CENSOR was run through our annotation pipeline. Counts for each category were tracked for BAC and fosmid datasets, then combined. Length is the sum of all full-length repetitive content in a particular category. (XLSX) Table S8 High coverage families. Each novel repeat family, aside from the named elements, is represented by one of its members, and each novel repeat element is assigned a unique ID number. Cumulative length and Cumulative percentage of sequence sets represent the aggregation of the top X highest coverage families, with X denoted by the column No. families.

(XLSX)
Table S9 Selected Pfam profiles. 147 profiles were queried through keywords such as ''retrotransposon''. Selected profiles were used in the annotation of the top ten high coverage elements. (XLSX)