Improving the Annotation of Arabidopsis lyrata Using RNA-Seq Data

Vimal Rawat; Ahmed Abdelsamad; Björn Pietzenuk; Danelle K. Seymour; Daniel Koenig; Detlef Weigel; Ales Pecinka; Korbinian Schneeberger

doi:10.1371/journal.pone.0137391

Abstract

Gene model annotations are important community resources that ensure comparability and reproducibility of analyses and are typically the first step for functional annotation of genomic regions. Without up-to-date genome annotations, genome sequences cannot be used to maximum advantage. It is therefore essential to regularly update gene annotations by integrating the latest information to guarantee that reference annotations can remain a common basis for various types of analyses. Here, we report an improvement of the Arabidopsis lyrata gene annotation using extensive RNA-seq data. This new annotation consists of 31,132 protein coding gene models in addition to 2,089 genes with high similarity to transposable elements. Overall, ~87% of the gene models are corroborated by evidence of expression and 2,235 of these models feature multiple transcripts. Our updated gene annotation corrects hundreds of incorrectly split or merged gene models in the original annotation, and as a result the identification of alternative splicing events and differential isoform usage are vastly improved.

Citation: Rawat V, Abdelsamad A, Pietzenuk B, Seymour DK, Koenig D, Weigel D, et al. (2015) Improving the Annotation of Arabidopsis lyrata Using RNA-Seq Data. PLoS ONE 10(9): e0137391. https://doi.org/10.1371/journal.pone.0137391

Editor: Nicholas James Provart, University of Toronto, CANADA

Received: June 12, 2015; Accepted: August 17, 2015; Published: September 18, 2015

Copyright: © 2015 Rawat et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Short read data presented in this paper are available through NCBI Short Read Archive under accession number GSE69222. Short reads from the cold treatment are deposited at the European Nucleotide Archive under accession number PRJEB6701.

Funding: This work was funded by German Research Foundation (dfg.de) grant AP1829-2 to AP. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Arabidopsis lyrata is a predominantly self-incompatible, perennial plant species that diverged from a common ancestor with A. thaliana approximately 10 million years ago [1]. Despite its evolutionary closeness, its genome size is estimated to be between 230 and 245 Mb, or one and a half times as large as the A. thaliana genome [2,3]. Except for A. thaliana, A. lyrata is the only species within the family of Brassicaceae with a reference assembly exclusively based on high quality dideoxy sequencing. This 207 Mb A. lyrata reference assembly attributed the difference in genome size between the two species to the accumulation of many small deletions in the A. thaliana genome, primarily in non-coding regions and transposable elements (TEs) [1]. In addition, A. lyrata has experienced recent genome expansion due to activity of TEs, in particular Copia long terminal repeat (LTR) retrotransposons [1,4,5], which is the basis for species-specific patterns in DNA methylation [6].

As A. lyrata is the closest fully assembled relative of A. thaliana, it serves as an important out-group for evolutionary studies within A. thaliana [7–9]. Moreover, recent advances in sequencing technology have enabled the assembly of an increasing number of Brassicaceae genomes and their close relatives [4,5,10–19], which, together, are leveraged for comparative genomics in this family. Intra- as well as inter-species comparisons, however, heavily rely on the gene annotations of each species involved and high quality annotations even in the non-model species become essential.

Methods for gene model annotation profited considerably from the invention of high-throughput RNA sequencing (RNA-seq) [20,21]. Identification of genuine transcription start and termination sites as well as intron/exon borders is a non-trivial task when using only reference sequences and homology data. Now, information on spliced alignments from RNA-seq data can improve the identification of gene models [22,23] and also enable the annotation of variant isoforms [24]. In particular, the gene annotations of model species have been updated regularly despite only minor changes to the reference genome sequence [25].

The current gene annotation of A. lyrata includes 32,670 genes and was generated using a combination of ab initio gene prediction, homology to known proteins, as well as gene sequences and expression data from related species [1]. Even though the gene models were analyzed for their expression support using RNA-seq data, gene prediction methods integrating RNA-seq alignment information had not been developed at the time the assembly was generated. In a recent study, Haudry and colleagues supplemented the original annotation with additional putatively transcribed regions in order to study the conservation of non-coding sequences among related Brassicaceae species [14]. They integrated the results of additional ab initio gene predictions, RNA-seq data alignments and homology searches against the genes of A. thaliana in order to mask potentially un-annotated coding sequences and regions that recently lost coding potential due to mutations.

Building upon the major efforts of the first annotation of the A. lyrata genome (version-1 from here on), we have updated the gene models using diverse RNA-seq samples. Our annotation (version-2 from here on) has changed the coordinates of 29,141 of the original 32,670 gene models, removed 1,286 and added 1,295 new models. This update corrected hundreds of gene models that were wrongly merged or split in version-1, and also separated transposable element genes from other protein coding genes. Finally, we have analyzed the transcriptional response of A. lyrata to heat stress to show the improved utility of version-2 for the identification of differential isoform usage and pre-mRNA splicing.

Results and Discussion

Improving the A. lyrata gene annotation using transcriptional data

We sequenced the transcriptome of various A. lyrata aerial tissues, including whole rosettes, dissected shoot apices, complete inflorescences, as well as vegetative rosettes exposed to cold and heat stress (see Materials and Methods). In total, we generated over 290 million single-end, strand unspecific short reads using Illumina sequencing technology after poly-A purification (Table 1). Short reads were aligned to the A. lyrata reference assembly [1] using Bowtie v2.1.0 [26] and the splice junction mapper TopHat v2.0.9 [27] (see Materials and Methods). We could align 89% of all reads, out of which 85% aligned uniquely and were used for further analysis. The proportion of unaligned reads was comparable to the proportion of unaligned reads in similar experiments with A. thaliana, which presumably has one of the most complete reference genome sequences. Over 10% of the aligned reads matched to putative intergenic regions indicating that some gene models may have been missed in the original version of the A. lyrata gene annotation. Visual inspection of these intergenic alignments revealed the expected patterns for spliced transcripts indicating instances of unidentified gene models and cases where transcription exceeded known gene boundaries (see Fig A in S1 Dataset).

Table 1. Short read statistics (read numbers in millions).

https://doi.org/10.1371/journal.pone.0137391.t001

New gene models were predicted from short read alignment data using Cufflinks 2.1.1 [22] independently for each tissue. In total, Cufflinks predicted 31,194 distinct gene models across all samples. An additional RNA-seq alignment-guided gene prediction using Augustus v.3.0.1 [28] identified 40,728 gene models, including 27,830 genes, which were supported by at least five RNA-seq reads. Moreover, 30,483 and 30,837 of Augustus predicted gene models overlapped with version-1 and Cufflinks predictions, respectively (see Materials and Methods and Fig B in S1 Dataset).

We combined 31,793 Augustus gene models with evidence of transcription or that overlapped with version-1 gene models to update the A. lyrata gene annotation (Fig 1A). To ensure that we were not excluding any true gene models in version-1, we included 1,430 version-1 gene models that were not overlapping with any of the new gene models, but showed either evidence of expression or featured an ortholog in at least one of the Brassicaceae species A. thaliana [29], Capsella rubella [4], Brassica rapa [10], Schrenkiella parvula [11] and Arabis alpina [5]. This increased the number of gene models to 33,223 (see Materials and Methods). To identify and to correct cases where incorrect gene models may have been introduced into the version-2 annotation, we utilized the very close phylogenetic relationship between A. lyrata, A. thaliana and C. rubella. We compared all gene models that were considerably different between version-1 and version-2 to A. thaliana and C. rubella orthologs (see Materials and Methods). If the length of the version-1 open reading frame was closer to that of the orthologs, we retained the version-1 gene model. This resulted in 548 version-2 gene models being replaced with 688 of the original version-1 gene models (Fig 1B). After additional removal of redundant gene models we obtained a final set of 33,221 non-redundant gene models.

Fig 1. Updating the gene model annotation of A. lyrata.

(A) Left, version-2 gene models predicted by Augustus [28]. Number of gene models overlapping with version-1 (yellow), genes predicted with Cufflinks (red), and genes with expression evidence (blue). Right, gene models of the version-1 annotation. Number of models without overlap to version-2 models (yellow), without orthologs in five other Brassicaceae (red), and without significant expression evidence (blue). (B) Correlation of the lengths of A. lyrata gene models with the length of their orthologous gene models in A. thaliana. Correlations using version-1 gene models (left), version-2 gene models before (middle) and after (right) the homology-based correction of gene models. (C) Length distribution of gene models including genes that were removed or newly added in version-2.

https://doi.org/10.1371/journal.pone.0137391.g001

Based on a recent annotation of A. lyrata TEs [14] and sequence similarity to TE genes of A. thaliana [25], we annotated 2,089 of the protein coding gene models as TE protein coding genes (see Materials and Methods). Without these, version-2 comprised 31,132 gene models, which is ~13% more than in A. thaliana [25]. Although tRNA genes were described in the original analysis of the A. lyrata genome [1], version-1 lacks information regarding these loci. By rerunning tRNAscan-SE [30], we identified 660 tRNA genes coding for all 20 amino acids. For completeness, we also incorporated 170 recently published miRNA genes into the new annotation file [31].

Altogether, we updated the coordinates of 29,141 of the original gene models, removed 1,286 entire (mostly short) gene models, and added 1,295 new models (Fig 1C). Only 2,243 remained unaltered (including 688 version-1 gene models re-introduced due to their superior similarity to orthologs). The new annotation accounted for 31,132 non-TE related gene models including 27,084 multi-exonic genes of which 2,236 featured at least one alternative isoform (Table 2).
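The counts reported in this section can be cross-checked with simple arithmetic. The sketch below only restates numbers quoted in the text; the variable names are illustrative and are not part of the annotation pipeline.

```python
# Counts quoted in the text; variable names are illustrative only.
augustus_kept = 31_793   # Augustus models with expression evidence or version-1 overlap
v1_rescued = 1_430       # non-overlapping version-1 models that were retained
combined = augustus_kept + v1_rescued

final_models = 33_221    # after ortholog-based correction and redundancy removal
te_genes = 2_089         # models annotated as TE protein coding genes
non_te_models = final_models - te_genes

# Fate of the original version-1 gene models.
v1_updated, v1_unaltered, v1_removed = 29_141, 2_243, 1_286
v1_total = v1_updated + v1_unaltered + v1_removed

print(combined, non_te_models, v1_total)
```

The three sums reproduce the intermediate set of 33,223 models, the 31,132 non-TE gene models, and the 32,670 gene models of version-1.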

Table 2. Comparison of version-1 and version-2 annotations.

https://doi.org/10.1371/journal.pone.0137391.t002

Validating differences in gene model structure

Even after the above-mentioned homology-based gene length adjustments, we found cases where the corresponding gene models from the two annotations varied drastically in length. This included instances where multiple version-1 gene models were fused to form a single gene model in version-2 or vice versa (Fig 2). In total, 161 version-1 genes were split (accounting for 530 genes in version-2) and 1,729 version-1 gene models were merged (accounting for 775 gene models in version-2). We randomly selected 14 version-1 gene models that had been split into multiple gene models in version-2, and another 14 gene models that had been merged in version-2, for PCR validation (see Fig C and D in S1 Dataset). One split case did not yield gDNA bands indicating a technical problem in primer design. For three merge cases we obtained cDNA bands of the expected size, but were not able to amplify genomic DNA for primer validation. This was most likely due to large gDNA amplicon size (2.4–5 kbp) and rendered the results of these cases inconclusive. For all 24 remaining cases, PCR results fully confirmed the annotation of the new gene models.

Fig 2. Examples of version-1 gene models split and merged in A. lyrata gene annotation version-2.

(A) Example of a gene model that was split into two gene models in version-2. Reverse transcription-PCR could not confirm the connection of both. (B) Example of version-1 gene models that were merged during the annotation update. Reverse transcription-PCR confirmed presence of a transcript bridging the two version-1 genes.

https://doi.org/10.1371/journal.pone.0137391.g002

A. lyrata version-2 annotation in contrast to other Brassicaceae annotations

For both A. lyrata annotations we predicted orthologous relationships between A. lyrata and five other Brassicaceae species (see Materials and Methods). Using version-2 gene models, 77.5% of genes had an ortholog in at least one species (24,146 out of 31,132) (Fig 3A), compared to 73% for version-1 (23,996 out of 32,670) (see Fig E in S1 Dataset). The number of genes with orthologs in all five Brassicaceae was also slightly higher for version-2 with 15,105 genes versus 14,850 genes with version-1.

Fig 3. Comparing the A. lyrata gene annotation version-2 with the annotations of five other Brassicaceae.

(A) Orthologous gene models shared between A. lyrata (version-2), A. thaliana [29], A. alpina [5], B. rapa [10], C. rubella [4] and S. parvula [11]. (B) Gene, Protein and UTR length distributions of above-mentioned species including the new and old A. lyrata annotations. UTR distribution is only shown for A. lyrata and A. thaliana because of poor UTR annotation in some of the other species.

https://doi.org/10.1371/journal.pone.0137391.g003

The removal of many short gene models in version-2 changed the distribution of gene model lengths (Figs 1C and 3B). Version-1 has an excess of gene models shorter than 1 kb with a second mode around 1.5 kb, a bimodal distribution that was only mirrored by the gene length distribution of B. rapa. In contrast, version-2 had only a single mode around 1.7 kb, similar to the four other species. The length distribution of predicted protein sequences in version-1 had also been distinct from the other Brassicaceae species, and this discrepancy largely disappeared with version-2. A third factor that contributed to the length differences between the genes of version-1 and version-2 was the difference in UTR annotations (Fig 3B). In version-1, 33% of the genes were annotated without UTR information, whereas in version-2 only 5% remained without 3’ and 5’ UTR annotation. The absolute and relative contributions of individual features are shown in Fig F in S1 Dataset. Although an absolute increase in genomic space was observed for all gene features, CDS and UTRs benefited the most. We also observed a slight decrease in intronic genome space, which can be explained by the introduction of splice variants previously missing from the version-1 annotation.

Whether the bimodal distribution in B. rapa reflects similar ambiguity in gene annotations, or mirrors particular characteristics of B. rapa, including its ancient genome triplication and subsequent fractionation, is not known.

New annotation enabled improved identification of alternative splicing events

The availability of multiple isoforms from individual gene models in version-2 enables quantitative expression comparisons between annotated isoforms. We analyzed RNA-seq data from A. lyrata rosette tissues from untreated (WT), heat stressed (HS), and recovered (REC) samples in duplicate (see Materials and Methods). We first analyzed the data for differential gene expression using Cuffdiff v.2.0.2 [32]. WT and REC differed from HS at 3,114 and 2,962 genes, respectively, whereas only 106 genes differed between WT and REC. This indicates, as expected, a strong effect of heat stress on gene expression (see Materials and Methods). Cuffdiff was also used to estimate differential expression between isoforms. We identified differential isoform expression at 283, 15 and 119 genes when comparing WT with HS, WT with REC, and HS with REC, respectively. In contrast, as version-1 does not include different isoforms, which are a prerequisite for isoform expression analysis as implemented in Cuffdiff, it was not possible to run this analysis using version-1.

We investigated differential splicing using a second tool, MATS v3.0.8 [33], which does not rely on prior isoform annotations and only identifies differences in individual splicing events, but not between entire transcripts. With version-2, MATS identified 177, 0 and 130 differential splicing events distributed over 187 distinct gene models in the three comparisons (Fig 4; see Materials and Methods). MATS reported only 99, 1 and 67 events affecting 103 gene models using version-1. The overlap of different splicing events was very high (95 gene models shared between the 103 of version-1 and the 187 of version-2). Thus, almost all gene models with differential splicing events predicted based on version-1 were also predicted using version-2; however, the results based on version-2 revealed many more gene models. This was partially due to newly added genes (10 cases), but most of the improvement came from the updates to exon-intron boundaries of existing gene models, indicating that the new gene annotation improved the overall utility of this resource.

Fig 4. Heat stress induces alternative splicing events.

(A) Examples of differentially expressed isoforms in response to heat stress in A. lyrata. AL3G42820 expresses a second isoform that lacks the middle exon in heat-treated samples (HS). Transcripts from wild-type (WT) and recovery (REC) samples contain all three exons. AL2G15640 retains an intron in response to heat stress (HS) while wild-type (WT) and recovery (REC) samples show partial intron splicing. (B) Number of differential splicing events, including alternative 5’ and 3’ splice sites, mutually exclusive exons, intron retention, and exon skipping events identified with MATS based on version-1 and version-2 annotations.

https://doi.org/10.1371/journal.pone.0137391.g004

The isoform-dependent (Cuffdiff) and -independent (MATS) analyses identified only 37 common gene models. Even though Cuffdiff revealed fewer events as compared to the MATS analysis, it did identify 100 genes with differential isoform usage that were not included in the set of genes with multiple isoforms. This suggests that differential isoform expression analysis profits from prior isoform annotation, but should not rely only on existing isoforms.

Availability of the annotation and gene naming conventions

The version-2 annotation can be found in S2 Dataset. The gene identifiers have been updated following the annotation principles applied to A. thaliana [29]. Gene numbering follows the physical order of genes on chromosomes, where each gene is named “AL” followed by scaffold number, then a “G” (for the first 8 scaffolds corresponding to the eight chromosomes) or “U” (for unanchored scaffolds) and finally a unique number incremented by 10, to leave flexibility for genes that were missed in this annotation. Genes that were removed from version-1 can be found in S3 Dataset. A mapping of the gene model identifiers of version-1 to version-2 can be found in S4 Dataset.
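As an illustration of the naming convention, the following minimal sketch splits an identifier into its components. The regular expression is an assumption derived from the description above and from the identifiers shown in Fig 4 (e.g. AL3G42820); it does not handle transcript suffixes or any other variants that may occur in S2 Dataset, and the names `GeneId` and `parse_gene_id` are hypothetical.

```python
import re
from typing import NamedTuple


class GeneId(NamedTuple):
    scaffold: int    # scaffold number following "AL"
    anchored: bool   # True for "G" (scaffolds 1-8), False for "U" (unanchored)
    number: int      # running number, incremented by 10 along the scaffold


# Assumed pattern: "AL", scaffold number, "G" or "U", running number.
_GENE_ID = re.compile(r"^AL(\d+)([GU])(\d+)$")


def parse_gene_id(identifier: str) -> GeneId:
    """Split a version-2 style gene identifier into its three components."""
    match = _GENE_ID.match(identifier)
    if match is None:
        raise ValueError(f"not a version-2 style identifier: {identifier!r}")
    scaffold, kind, number = match.groups()
    return GeneId(int(scaffold), kind == "G", int(number))


print(parse_gene_id("AL3G42820"))
```

For AL3G42820 this yields scaffold 3, an anchored scaffold, and the running number 42820.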

Conclusions

The updated annotation includes 31,132 gene models with 35,805 transcripts. We also reported 1,304 gene models that were erroneously split or merged in the previous annotation. Validation of these models strongly supported our updates highlighting the importance of employing species-specific RNA-seq data for annotating genomes.

We also provided a first annotation of alternative splicing events in A. lyrata. Using RNA-seq samples for a heat stress experiment we demonstrated the improved utility of the version-2 annotation for differential isoform expression studies. This revised genome annotation advances the reference sequence of A. lyrata as a community resource for comparative and functional studies.

Materials and Methods

Plant material

Arabidopsis lyrata subsp. lyrata MN47 plants were grown in soil under long day conditions (16 hours light, 21°C: 8 hours dark, 16°C). Vegetative rosettes and dissected shoot apices of three week old plants and entire inflorescences of flowering plants were harvested as mock treated samples. For heat stress and recovery treatments we incubated three week old plants at 37°C for 6 hours or for an additional 48 hours at 21°C, respectively. Cold stressed samples were treated as described [6].

Nucleic acids isolation and RNA-seq library preparation

DNA was isolated using the Nucleon Phytopure kit (GE Healthcare). For total RNA isolation, samples were flash frozen in liquid nitrogen and used with the Qiagen RNeasy® Plant Mini Kit, including an on-column DNase I digestion. Total RNA integrity was confirmed on the Agilent BioAnalyzer. Barcoded libraries were constructed using the Illumina TruSeq RNA kit with an average of 1 μg of total RNA as starting material. The manufacturer's protocol was precisely followed with one exception in the cold-treated samples, where 12 PCR cycles were used instead of the recommended 15. The library quality was monitored on a Bioanalyzer 2100 (Agilent) and the libraries were sequenced as 100-bp single end reads using Illumina sequencing.

RNA-seq read mapping and gene predictions

RNA-seq data was mapped to the A. lyrata reference genome assembly [1] using Bowtie v1.0.0 [26] and TopHat v2.0.10 [27]. Cufflinks v2.0.2 [22] was used for de novo transcript identification in all tissues separately. Cuffmerge (from the Cufflinks suite) was used to merge the transcript annotation files obtained separately for the three tissues. In addition, all short reads were aligned to the reference assembly of A. lyrata using BLAT v.34 [34] to generate an evidence file for guided gene prediction using Augustus v3.0.0. An A. lyrata-specific configuration file was generated using the version-1 annotation. To estimate agreement between Augustus and version-1 gene models, gene models with ≥30% overlap (with respect to the shorter gene model) were considered. Gene models supported by five or more RNA-seq reads were considered as expressed irrespective of gene length.
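The overlap and expression criteria can be sketched as follows. This is a simplified illustration rather than the code used for the annotation: it assumes that overlap is measured on the genomic span of each gene model (1-based, inclusive coordinates on the same scaffold), and the function names are hypothetical.

```python
def overlap_fraction(a_start, a_end, b_start, b_end):
    """Overlap length divided by the length of the shorter interval."""
    overlap = min(a_end, b_end) - max(a_start, b_start) + 1
    if overlap <= 0:
        return 0.0
    shorter = min(a_end - a_start + 1, b_end - b_start + 1)
    return overlap / shorter


def models_agree(model_a, model_b, min_fraction=0.30):
    """Apply the >= 30% overlap criterion to two (start, end) gene models."""
    return overlap_fraction(*model_a, *model_b) >= min_fraction


def is_expressed(read_count, min_reads=5):
    """Five or more supporting reads count as expressed, irrespective of length."""
    return read_count >= min_reads
```

Normalizing by the shorter model means that a short gene model lying entirely within a much longer one still counts as fully overlapping.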

To identify cases where wrong gene models were introduced in version-2, we first compared version-2 proteins (23,181 comparable proteins) with corresponding version-1 proteins. A total of 1,037 proteins were identified as outliers, where protein length difference was outside the range of +/- standard deviation of the distribution of length differences. For these cases version-1 and version-2 protein sequences were further compared against the proteins of their orthologs in A. thaliana [29] and C. rubella [4]. If both orthologs were more similar in length to the protein of version-1, the respective version-2 gene model was replaced with version-1.
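A minimal sketch of this two-step correction is given below. The text does not specify how the range of the standard deviation was centered, so the sketch assumes one population standard deviation around the mean length difference; the function names and input layout are likewise hypothetical.

```python
from statistics import mean, pstdev


def length_outliers(pairs):
    """pairs: {gene_id: (v1_length, v2_length)}.

    Returns the ids whose length difference lies outside mean +/- 1 SD
    (assumed interpretation of the outlier criterion).
    """
    diffs = {gene: v2 - v1 for gene, (v1, v2) in pairs.items()}
    mu = mean(diffs.values())
    sd = pstdev(diffs.values())
    return {gene for gene, d in diffs.items() if abs(d - mu) > sd}


def keep_version1(v1_length, v2_length, ortholog_lengths):
    """True if every ortholog is closer in length to version-1 than to version-2."""
    return bool(ortholog_lengths) and all(
        abs(o - v1_length) < abs(o - v2_length) for o in ortholog_lengths
    )
```

Requiring agreement of both orthologs (A. thaliana and C. rubella) makes the replacement conservative: a version-2 model is only reverted when neither relative supports it.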

Ortholog identification

Orthologous gene identification for both version-1 and version-2 was done separately at the protein level using reciprocal best hits with blastall v2.2.25 [35] and an e-value cutoff of 0.001 among five Brassicaceae species.
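The reciprocal best hit criterion can be sketched as follows, assuming that the BLAST results have already been parsed into (query, subject, e-value, bit score) tuples. Ranking hits by bit score and keeping the first hit on ties are assumptions of this sketch, not details given in the text.

```python
def best_hits(hits, max_evalue=1e-3):
    """hits: iterable of (query, subject, evalue, bitscore) -> {query: best subject}."""
    best = {}
    for query, subject, evalue, bitscore in hits:
        if evalue > max_evalue:
            continue  # discard hits above the e-value cutoff
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, max_evalue=1e-3):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(a_vs_b, max_evalue)
    backward = best_hits(b_vs_a, max_evalue)
    return {(a, b) for a, b in forward.items() if backward.get(b) == a}
```

A pair is reported only when both search directions agree, so a protein whose best hit prefers a different partner is left without an ortholog.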

Identification of TE genes in version-2

Version-2 gene models that harbored complete TEs [14] within their coding regions or that were entirely spanned by a TE were annotated as “TE coding genes”. In addition, 3,909 A. thaliana TE genes [25] and the TIGR Brassicaceae-specific repeat database [36] were used to identify TE genes using blastn v2.2.25 [35].
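The positional part of this classification amounts to two interval containment tests. The sketch below is a simplification under stated assumptions: the coding region is represented as a single span from the first to the last coding base, all intervals are (start, end) pairs on the same scaffold, and the function names are hypothetical.

```python
def contains(outer, inner):
    """True if interval `inner` lies fully inside `outer` (inclusive coordinates)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def is_te_coding_gene(gene_span, coding_span, te_spans):
    """A complete TE inside the coding span, or a TE spanning the entire gene."""
    for te in te_spans:
        if contains(coding_span, te) or contains(te, gene_span):
            return True
    return False
```

A TE that only partially overlaps a gene model satisfies neither test and is therefore not sufficient on its own under this sketch.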

cDNA preparation and PCR

Plants were grown on soil under long day conditions until the five-leaf stage, which was reached after approximately three weeks. cDNA samples were prepared from 1 μg total RNA of mock-treated rosettes using the RevertAid First Strand cDNA Synthesis Kit with oligo d(T) primers (Thermo Scientific). Reverse transcriptase minus samples were processed in the same way without enzyme addition. PCR reactions were done in an Eppendorf thermal cycler using a standard program and the products were visualized on agarose gels stained with ethidium bromide. The PCR primer sequences can be found in S5 Dataset.

Differential gene expression and alternative splicing

Cufflinks [22] was used to calculate differential gene expression levels (FPKM) with a p-value < 0.01 and a log2-fold change difference of more than 2. MATS [33] was used to investigate differential splicing events with over 0.01% splicing difference at a p-value < 0.01 and a false discovery rate of less than 1%. To control for false positives, genes with a 10,000-fold or greater expression difference were excluded.
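The gene-level thresholds can be expressed as a simple filter. In this sketch, the log2-fold change criterion is interpreted as an absolute value greater than 2, the 10,000-fold exclusion is applied in both directions, and genes with an FPKM of zero in either condition are excluded because their fold change is undefined; all three points are assumptions, as is the function name.

```python
import math


def is_differential(fpkm_a, fpkm_b, p_value,
                    max_p=0.01, min_abs_log2fc=2.0, max_fold=10_000):
    """Apply the p-value, fold-change and extreme-fold filters to one gene."""
    if p_value >= max_p:
        return False
    if fpkm_a <= 0 or fpkm_b <= 0:
        return False  # fold change undefined; excluded in this sketch (assumption)
    log2fc = math.log2(fpkm_b / fpkm_a)
    if abs(log2fc) <= min_abs_log2fc:
        return False
    # Exclude implausibly large differences (10,000-fold or more).
    return abs(log2fc) < math.log2(max_fold)
```

For example, a gene with FPKM values of 1 and 8 and a p-value of 0.001 passes the filter, whereas the same gene with a p-value of 0.05 does not.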

Supporting Information

S1 Dataset. Supplementary figures.

https://doi.org/10.1371/journal.pone.0137391.s001

(DOCX)

S2 Dataset. General feature formatted (GFF) file describing version-2 annotation.

https://doi.org/10.1371/journal.pone.0137391.s002

(ZIP)

S3 Dataset. GFF file describing genes that were removed from version-1.

https://doi.org/10.1371/journal.pone.0137391.s003

(ZIP)

S4 Dataset. Table describing the mapping of version-1 to version-2 gene models.

https://doi.org/10.1371/journal.pone.0137391.s004

(XLSX)

S5 Dataset. Primer information for gene model validation.

https://doi.org/10.1371/journal.pone.0137391.s005

(XLSX)

Acknowledgments

We thank Geo James Velikkakam for helpful discussion on gene annotation tools, J. de Meaux for seeds of A. lyrata, B. Eilts and R. Gentges for technical support, A. Platts, S. Wright and M. Blanchette for insights in their gene annotation and annotation of conserved non-coding sequences, and M. Koornneef for critical reading of the manuscript.

Author Contributions

Conceived and designed the experiments: KS AP. Performed the experiments: AP AA. Analyzed the data: VR. Contributed reagents/materials/analysis tools: AA BP DKS DK DW. Wrote the paper: VR KS AA AP BP DKS DK DW.

References

1. Hu TT, Pattyn P, Bakker EG, Cao J, Cheng J-F, Clark RM, et al. The arabidopsis lyrata genome sequence and the basis of rapid genome size change. Nat Genet 2011, May;43(5):476–81. pmid:21478890
2. Johnston JS, Pepper AE, Hall AE, Chen ZJ, Hodnett G, Drabek J, et al. Evolution of genome size in brassicaceae. Ann Bot 2005;95(1):229–35. pmid:15596470
3. Lysak MA, Koch MA, Beaulieu JM, Meister A, Leitch IJ. The dynamic ups and downs of genome size evolution in brassicaceae. Mol Biol Evol 2009, Jan;26(1):85–98. pmid:18842687
4. Slotte T, Hazzouri KM, Ågren JA, Koenig D, Maumus F, Guo Y-L, et al. The capsella rubella genome and the genomic consequences of rapid mating system evolution. Nat Genet 2013, Jul;45(7):831–5. pmid:23749190
5. Willing E-M, Rawat V, Mandáková T, Maumus F, James GV, Nordström KJ, et al. Genome expansion of arabis alpina linked with retrotransposition and reduced symmetric DNA methylation. Nature Plants 2015;1(2).
6. Seymour DK, Koenig D, Hagmann J, Becker C, Weigel D. Evolution of DNA methylation patterns in the brassicaceae is driven by differences in genome organization. PLoS Genet 2014, Nov;10(11):e1004785. pmid:25393550
7. Schneeberger K, Ossowski S, Ott F, Klein JD, Wang X, Lanz C, et al. Reference-guided assembly of four diverse arabidopsis thaliana genomes. Proc Natl Acad Sci U S A 2011, Jun 21;108(25):10249–54. pmid:21646520
8. Cao J, Schneeberger K, Ossowski S, Günther T, Bender S, Fitz J, et al. Whole-genome sequencing of multiple arabidopsis thaliana populations. Nat Genet 2011, Aug 28;43(10):956–63. pmid:21874002
9. Long Q, Rabanal FA, Meng D, Huber CD, Farlow A, Platzer A, et al. Massive genomic variation and strong selection in arabidopsis thaliana lines from sweden. Nat Genet 2013, Jun 23;45(8):884–90. pmid:23793030
10. Wang X, Wang H, Wang J, Sun R, Wu J, Liu S, et al. The genome of the mesopolyploid crop species brassica rapa. Nat Genet 2011, Aug 28;43(10):1035–9. pmid:21873998
11. Dassanayake M, Oh D-H, Haas JS, Hernandez A, Hong H, Ali S, et al. The genome of the extremophile crucifer thellungiella parvula. Nat Genet 2011, Aug 7;43(9):913–8. pmid:21822265
12. Wu H-J, Zhang Z, Wang J-Y, Oh D-H, Dassanayake M, Liu B, et al. Insights into salt tolerance from the genome of thellungiella salsuginea. Proc Natl Acad Sci U S A 2012, Jul 9;109(30):12219–24. pmid:22778405
13. Yang R, Jarvis DE, Chen H, Beilstein MA, Grimwood J, Jenkins J, et al. The reference genome of the halophytic plant eutrema salsugineum. Front Plant Sci 2013;4:46. pmid:23518688
14. Haudry A, Platts AE, Vello E, Hoen DR, Leclercq M, Williamson RJ, et al. An atlas of over 90,000 conserved noncoding sequences provides insight into crucifer regulatory regions. Nat Genet 2013, Aug;45(8):891–8. pmid:23817568
15. Cheng S, van den Bergh E, Zeng P, Zhong X, Xu J, Liu X, et al. The tarenaya hassleriana genome provides insight into reproductive trait and genome evolution of crucifers. Plant Cell 2013, Aug;25(8):2813–30. pmid:23983221
16. Oh D-H, Hong H, Lee SY, Yun D-J, Bohnert HJ, Dassanayake M. Genome structures and transcriptomes signify niche adaptation for the multiple-ion-tolerant extremophyte schrenkiella parvula. Plant Physiol 2014, Apr;164(4):2123–38. pmid:24563282
17. Kitashiba H, Li F, Hirakawa H, Kawanabe T, Zou Z, Hasegawa Y, et al. Draft sequences of the radish (raphanus sativus L.) Genome. DNA Res 2014, May 16;21(5):481–90. pmid:24848699
18. Lobréaux S, Manel S, Melodelima C. Development of an arabis alpina genomic contig sequence data set and application to single nucleotide polymorphisms discovery. Mol Ecol Resour 2014, Mar;14(2):411–8. pmid:24128264
19. Liu S, Liu Y, Yang X, Tong C, Edwards D, Parkin IAP, et al. The brassica oleracea genome reveals the asymmetrical evolution of polyploid genomes. Nat Commun 2014;5:3930. pmid:24852848
20. Lister R, O'Malley RC, Tonti-Filippini J, Gregory BD, Berry CC, Millar AH, Ecker JR. Highly integrated single-base resolution maps of the epigenome in arabidopsis. Cell 2008, May 2;133(3):523–36. pmid:18423832
21. Wang Z, Gerstein M, Snyder M. RNA-Seq: A revolutionary tool for transcriptomics. Nat Rev Genet 2009, Jan;10(1):57–63. pmid:19015660
22. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 2010, May;28(5):511–5. pmid:20436464
23. Li Z, Zhang Z, Yan P, Huang S, Fei Z, Lin K. RNA-Seq improves annotation of protein-coding genes in the cucumber genome. BMC Genomics 2011;12:540. pmid:22047402
24. Filichkin SA, Priest HD, Givan SA, Shen R, Bryant DW, Fox SE, et al. Genome-wide mapping of alternative splicing in Arabidopsis thaliana. Genome Res 2010, Jan;20(1):45–58. pmid:19858364
25. Lamesch P, Berardini TZ, Li D, Swarbreck D, Wilks C, Sasidharan R, et al. The Arabidopsis Information Resource (TAIR): Improved gene annotation and new tools. Nucleic Acids Res 2012, Jan;40(Database issue):D1202–10. pmid:22140109
26. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012, Apr;9(4):357–9. pmid:22388286
27. Trapnell C, Pachter L, Salzberg SL. TopHat: Discovering splice junctions with RNA-Seq. Bioinformatics 2009, May 1;25(9):1105–11. pmid:19289445
28. Stanke M, Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 2003, Oct;19(Suppl 2):215–25.
29. The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 2000;408(6814):796–815. pmid:11130711
30. Lowe TM, Eddy SR. tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 1997, Mar 1;25(5):955–64. pmid:9023104
31. Fahlgren N, Jogdeo S, Kasschau KD, Sullivan CM, Chapman EJ, Laubinger S, et al. MicroRNA gene evolution in Arabidopsis lyrata and Arabidopsis thaliana. Plant Cell 2010, Apr;22(4):1074–89. pmid:20407027
32. Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc 2012, Mar;7(3):562–78. pmid:22383036
33. Shen S, Park JW, Huang J, Dittmar KA, Lu Z-X, Zhou Q, et al. MATS: A Bayesian framework for flexible detection of differential alternative splicing from RNA-Seq data. Nucleic Acids Res 2012, Apr;40(8):e61. pmid:22266656
34. Kent WJ. BLAT—the blast-like alignment tool. Genome Res 2002;12(4):656–64. pmid:11932250
35. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol 1990, Oct 5;215(3):403–10. pmid:2231712
36. Ouyang S, Buell CR. The TIGR plant repeat databases: A collective resource for the identification of repetitive sequences in plants. Nucleic Acids Res 2004, Jan 1;32(Database issue):D360–3. pmid:14681434
