De novo transcriptome assembly of the green alga Ankistrodesmus falcatus

Ankistrodesmus falcatus is a globally distributed freshwater chlorophyte that is a candidate for biofuel production, is used to study the effects of toxins on aquatic communities, and is used as food in zooplankton research. Each of these research fields is transitioning to genomic tools. We created a reference transcriptome for A. falcatus using NextGen sequencing and de novo assembly methods including Trinity, Velvet-Oases, and EvidentialGene. The assembled transcriptome has a total of 17,997 contigs, an N50 value of 2,462, and a GC content of 64.8%. BUSCO analysis recovered 83.3% of total chlorophyte BUSCOs and 82.5% of the eukaryotic BUSCOs. A portion (7.9%) of these supposedly single-copy genes were found to have transcriptionally active, distinct duplicates. We annotated the assembly using the dammit annotation pipeline, resulting in putative functional annotation for 68.89% of the assembly. Using available rbcL sequences from 16 strains (10 species) of Ankistrodesmus, we constructed a neighbor-joining phylogeny to illustrate genetic distances of our A. falcatus strain to other members of the genus. This assembly will be valuable for researchers seeking to identify Ankistrodesmus sequences in metatranscriptomic and metagenomic field studies and in experiments where separating expression responses of zooplankton and their algal food sources through bioinformatics is important.


Introduction
Ankistrodesmus is a genus of unicellular, freshwater algae in the family Selenastraceae. These chlorophytes are model organisms for studying cellular physiology in phytoplankton because they are able to survive under many different growth conditions and exhibit rapid growth rates compared to other algal species. For example, Brown and Weis studied the metabolic interconnections between photosynthesis and respiration in A. braunii [1], and Shatilov, et al. used the same species to further our understanding of chloroplast-encoded enzymatic activity within the cell [2]. More recently, Asselborn, et al. showed the potential effects of insecticides on phytoplankton communities [3] and Skorupskaite, et al. determined the best ways to disrupt cell membranes for biofuel production [4], both using Ankistrodesmus as models.

Strain source
We obtained our strain of A. falcatus (Fig 1) from the lab of A. J. Tessier, who originally acquired it in the late 1970s from the lab of C. E. Goulden at the Academy of Natural Sciences in Philadelphia, PA, USA. The strain's provenance prior to that is unknown. The earliest known published work with the strain is Goulden and Hornig [22], in which the authors state that the strain has an unknown origin. Here, we designate this strain AJT.

Growth conditions
We grew A. falcatus in semi-continuous culture under a 24 h:0 h L:D photoperiod with a light intensity of ~100 μmol photons m⁻² s⁻¹ PAR from fluorescent lamps (CH Lighting F32T8/841/ECO) arranged laterally on one side of the culture vessels. Cultures were left to grow at ambient room temperature (20-23˚C). We grew cultures in ASM-1 freshwater algal medium [50] with added vitamins. The added vitamin solution included biotin, thiamine, pyridoxine, calcium pantothenate, B12, nicotinic acid, nicotinamide, folic acid, riboflavin, and inositol at the concentrations specified in Goulden and Hornig [22]. The culture was kept in 5L bottles set up with constant aeration and stirring at 400 rpm to prevent settling. Samples for RNA extractions were taken when the cultures were in exponential phase.

RNA extraction protocol
We extracted total RNA from 100 mL of the A. falcatus stock using a modified procedure for the Qiagen RNeasy Plant Mini Kit RNA extraction protocol. We split the 100 mL sample into two 50 mL aliquots in 50 mL centrifuge tubes and then spun them down at 7000 rpm (5927g; Beckman Coulter J2-21 centrifuge; JA-20 rotor) for 15 minutes. After removing the supernatant, we transferred the pellets to two 2-mL centrifuge tubes. These tubes were spun at 5000 rpm (2340g; Eppendorf AG centrifuge 5424; Eppendorf rotor FA-45-24-11 5424/5424R) for 10 minutes. We again removed the supernatant, and then froze the pellets in liquid nitrogen. Once frozen, we added 450 μL of Buffer RLT with added β-mercaptoethanol (prepared by adding 10 μL β-mercaptoethanol to 1 mL of Buffer RLT) to each tube, and disrupted the cells using a handheld tissue homogenizer. We used the standard Qiagen RNeasy Plant Mini Kit RNA extraction procedure for the remainder of the extraction process. Once extracted, we checked the purity of the RNA using a Nanodrop 2000 and obtained the concentration with a Qubit 4 Fluorometer. We checked the integrity of the RNA by running a sample of the extracted RNA on a 2% agarose gel at 60V for 1 hr. RNA samples were considered good quality if the 260/280 and 260/230 ratios were greater than 1.8 and clear rRNA bands could be observed on the gel without signs of degradation. The sample with the best quality control metrics overall was sent to Vanderbilt Technologies for Advanced Genomics (VANTAGE) for 150bp paired-end (PE) NovaSeq 6000 sequencing targeting 100 million reads per sample. Library preparation was performed at VANTAGE using the Illumina TruSeq RNA sample library prep kit.
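The paired rpm and g values reported above follow the standard relation RCF = 1.118 × 10⁻⁵ × r × rpm², where r is the rotor radius in cm. As a rough consistency check, the sketch below back-calculates the radius implied by each reported pair. The function names are illustrative, and the radii it prints are derived values, not manufacturer specifications.

```python
# Standard relation between relative centrifugal force (x g), rotor radius (cm)
# and rotational speed (rpm): RCF = 1.118e-5 * r_cm * rpm**2
RCF_CONSTANT = 1.118e-5


def rcf_from_rpm(rpm: float, radius_cm: float) -> float:
    """Relative centrifugal force (x g) for a given speed and rotor radius."""
    return RCF_CONSTANT * radius_cm * rpm ** 2


def implied_radius_cm(rpm: float, rcf: float) -> float:
    """Rotor radius (cm) implied by a reported rpm / RCF pair."""
    return rcf / (RCF_CONSTANT * rpm ** 2)


# rpm / g pairs as reported in the extraction protocol
reported_spins = {
    "first spin (50 mL tubes)": (7000, 5927),
    "second spin (2 mL tubes)": (5000, 2340),
}

for label, (rpm, rcf) in reported_spins.items():
    radius = implied_radius_cm(rpm, rcf)
    print(f"{label}: {rpm} rpm at {rcf} x g implies r = {radius:.2f} cm")
```

The same relation, used in the forward direction with a rotor's documented radius, gives the rpm setting needed to reproduce these forces on a different centrifuge.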

Transcriptome assembly
We checked the quality of the raw reads with FastQC [51]. Reads were trimmed and adapter sequences were removed using Trimmomatic [52] with the following parameters: ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 HEADCROP:20. After Trimmomatic, 76.82% of the raw reads (38,277,563 out of 49,830,437 paired reads) remained, and we used these for the transcriptome assembly.
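To make the trimming step concrete, the sketch below mimics the effect of the HEADCROP setting on a single read (removing a fixed number of bases from the start) and recomputes the retained percentage from the read-pair counts given above. The toy read and the function names are hypothetical; this is not Trimmomatic itself, and adapter clipping is not modelled.

```python
def headcrop(sequence: str, qualities: str, n_bases: int = 20) -> tuple:
    """Drop the first n_bases from a read and its quality string,
    mirroring what a HEADCROP step does to each read."""
    return sequence[n_bases:], qualities[n_bases:]


def percent_retained(kept_pairs: int, raw_pairs: int) -> float:
    """Percentage of raw read pairs that survive trimming."""
    return 100.0 * kept_pairs / raw_pairs


# Toy 30-base read (hypothetical, not from the data set)
read = "ACGT" * 7 + "AC"
quals = "I" * len(read)
trimmed_read, trimmed_quals = headcrop(read, quals)
print(len(read), "->", len(trimmed_read), "bases after HEADCROP:20")

# Read-pair counts reported in the text
pct = percent_retained(38_277_563, 49_830_437)
print(f"{pct:.2f}% of read pairs retained")
```

With 150 bp reads, a 20-base head crop leaves at most 130 bases per read for assembly.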
We first created a transcriptome with Trinity [53]. Then, we created several assemblies using kmer lengths of 35, 45, 55, 65, 75, 85, and 95 with Velvet-Oases [54]. The Velvet-Oases assemblies were merged to create one final Velvet-Oases assembly. We combined the Trinity and final Velvet-Oases assemblies with the EvidentialGene mRNA transcript assembly software (EviGene; [55]) with a kmer length of 75. We used EviGene to correct for the various biases attributed to different assemblers. Additionally, EviGene is useful for pulling out potential isoforms and splice variants of each gene and separating these potential variants into an independent 'alternative' file, so that the final assembly is less likely to be full of gene duplicates or isoforms, increasing the confidence that each remaining contig is indeed a unique gene. This step helped ensure that we were left with the most comprehensive and accurate transcriptome assembly across both assembly methods.
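One way to see why several kmer lengths were combined is to count how many kmers a single read contributes at each setting: a read of length L yields L - k + 1 overlapping kmers. The sketch below assumes a full-length trimmed read of 130 bases (150 bp minus the 20-base head crop); reads shortened further by adapter clipping would contribute fewer. The function name is illustrative.

```python
def kmers_per_read(read_length: int, k: int) -> int:
    """Number of overlapping k-mers in a read of the given length."""
    return max(0, read_length - k + 1)


# k-mer lengths used for the Velvet-Oases assemblies
kmer_lengths = list(range(35, 96, 10))

# Assumption: a full-length read after trimming (150 bp minus HEADCROP:20)
READ_LENGTH = 130

for k in kmer_lengths:
    print(f"k = {k}: {kmers_per_read(READ_LENGTH, k)} k-mers per read")
```

Shorter kmers give more overlaps per read, which generally favors recovery of weakly expressed transcripts, while longer kmers are generally better at resolving repeats; merging assemblies across the range is a common way to balance the two.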

Quality control and statistics
We removed any remaining rRNA sequences from the final assembly by downloading the small and large rRNA subunits for A. falcatus from the SILVA database [56] and blasting these sequences against our A. falcatus assembly. Only 6 contigs came back with hits as rRNA subunit sequences and these were removed from the final assembly. We then used Benchmarking Universal Single-Copy Orthologs (BUSCO, version 3) to assess the completeness of the transcriptome by searching our assembly against the BUSCO Chlorophyta_odb10 (creation date: 2017-12-01) and the Eukaryota_odb9 (creation date: 2016-11-02) datasets [57]. We used TransRate [58] to obtain descriptive statistics and to assess the overall quality of the transcriptome assembly. We considered any contigs that had a "good" read mapping percent ("p_good" in the TransRate contig result file) of 0 to be poor quality and removed these contigs from the final assembly.
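The p_good filter described above can be sketched as a simple pass over the per-contig table. In this minimal example only the "p_good" column name comes from the text; the "contig_name" column, the file layout, and every value in the toy table are assumptions made for illustration.

```python
import csv
import io

# Hypothetical excerpt of a per-contig result table. Only "p_good" is named in
# the text; the "contig_name" column and all values here are assumptions.
contigs_csv = """contig_name,length,p_good
contig_1,2462,0.91
contig_2,508,0.0
contig_3,1737,0.45
"""


def split_by_p_good(handle):
    """Return (kept, removed) contig names; contigs with p_good == 0 are removed."""
    kept, removed = [], []
    for row in csv.DictReader(handle):
        target = removed if float(row["p_good"]) == 0.0 else kept
        target.append(row["contig_name"])
    return kept, removed


kept, removed = split_by_p_good(io.StringIO(contigs_csv))
print("kept:", kept)
print("removed:", removed)
```

The list of kept names could then be used to subset the assembly FASTA with any sequence toolkit.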

Gene annotation
We used the de novo transcriptome annotator dammit [59] to annotate our final assembly. This pipeline uses Transdecoder to build gene models and then searches the Pfam-A, Rfam, OrthoDB, and uniref90 protein databases for annotation information with an E-value cutoff of 1 × 10⁻⁵. The putative transcripts were also run through InterProScan to obtain a broader sense of functional annotations.
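The effect of the E-value cutoff can be illustrated with a small filter over database hits. The hits below are made up, the tuple layout is hypothetical, and treating the cutoff as inclusive is an assumption; the pipeline's own handling of the boundary may differ.

```python
E_VALUE_CUTOFF = 1e-5  # cutoff stated in the text

# Hypothetical hits as (transcript, database, E-value); all values are made up.
hits = [
    ("transcript_1", "Pfam-A", 3e-42),
    ("transcript_1", "OrthoDB", 2e-3),
    ("transcript_2", "uniref90", 1e-5),
    ("transcript_3", "Rfam", 0.5),
]


def passes_cutoff(e_value: float, cutoff: float = E_VALUE_CUTOFF) -> bool:
    """True when a hit is at least as significant as the cutoff
    (inclusive boundary is an assumption)."""
    return e_value <= cutoff


# A transcript counts as annotated if any of its hits passes the cutoff
annotated = sorted({name for name, _db, e_value in hits if passes_cutoff(e_value)})
print(annotated)
```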

Genetic distance to other species
Our transcriptome produced a sequence for rbcL (ribulose bisphosphate carboxylase, large subunit) from A. falcatus strain AJT. Since no phylogeny for the genus is available, we sought to evaluate the genetic distance from other Ankistrodesmus species using rbcL. We downloaded all available Ankistrodesmus rbcL sequences from the NCBI nucleotide database, including one of A. falcatus. We aligned the sequences using MUSCLE as implemented in MEGA X (Kumar, et al. 2018) [60]. We visualized genetic distances by creating a neighbor-joining tree [61] and tested it with 500 bootstrap replicates, again using MEGA X.
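For readers unfamiliar with distance-based trees, the sketch below computes pairwise distances from a toy alignment and identifies the first pair that neighbor-joining would merge, using the criterion Q(i, j) = (n - 2) d(i, j) - r_i - r_j. The sequences are hypothetical, and the simple p-distance used here is an assumption: the text does not state which substitution model was selected in MEGA X.

```python
from itertools import combinations

# Toy aligned sequences (hypothetical; not real rbcL data)
alignment = {
    "a": "ATGTCACCACAA",
    "b": "ATGTCACCACAG",
    "c": "ATGTCTCCTCAA",
    "d": "ATGACTCCTCAA",
}


def p_distance(seq1: str, seq2: str) -> float:
    """Proportion of differing sites, skipping columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(seq1, seq2) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def first_nj_pair(dist: dict, names: list) -> tuple:
    """Pair minimising Q(i, j) = (n - 2) * d(i, j) - r_i - r_j,
    where r_i is the sum of distances from taxon i to all others."""
    n = len(names)
    totals = {i: sum(dist[frozenset((i, k))] for k in names if k != i) for i in names}
    return min(
        combinations(names, 2),
        key=lambda p: (n - 2) * dist[frozenset(p)] - totals[p[0]] - totals[p[1]],
    )


names = sorted(alignment)
dist = {frozenset(p): p_distance(alignment[p[0]], alignment[p[1]])
        for p in combinations(names, 2)}

for pair in combinations(names, 2):
    print(pair, round(dist[frozenset(pair)], 3))
print("first pair joined:", first_nj_pair(dist, names))
```

A full neighbor-joining run repeats this step, replacing each joined pair with a new internal node until the tree is resolved; bootstrap support comes from repeating the whole procedure on resampled alignment columns.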

Transcriptome assembly statistics
After quality control, our assembly had 17,997 contigs with an average contig length of 1,737bp and a GC content of 64.8%. The N50 length was 2,462bp and the N70 was 1,726bp (Table 1). This is a substantial improvement over the only available reference transcriptome for Ankistrodesmus (an unknown species with a strain designator of UCP0001), which had an N50 of 1,038bp and an average contig length of 508bp [44]. Differences in sequencing depth, assembly methods, and species' biological variation could all contribute to these differences in assembly metrics.
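The N50, N70, and GC statistics reported above can be computed from contig lengths and sequences as in the following sketch. The toy lengths and sequence are hypothetical and chosen only so that the arithmetic is easy to follow; the function names are illustrative.

```python
def n_stat(lengths, fraction=0.5):
    """Length L such that contigs of length >= L hold at least `fraction`
    of all assembled bases (fraction=0.5 gives N50, 0.7 gives N70)."""
    if not lengths:
        raise ValueError("no contigs")
    threshold = fraction * sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


def gc_percent(sequence: str) -> float:
    """Percentage of G and C bases in a sequence."""
    seq = sequence.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


# Toy contig lengths (hypothetical, not from the assembly)
toy_lengths = [100, 200, 300, 400, 500]
print("N50:", n_stat(toy_lengths, 0.5))
print("N70:", n_stat(toy_lengths, 0.7))
print("GC%:", round(gc_percent("GGCCAT"), 1))
```

Because N70 requires accumulating a larger share of the assembly, it is never larger than N50, which matches the ordering of the two values reported above.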
We used TransRate to examine the alignment and read mapping characteristics of the final assembly. The TransRate results showed that 79.5% of the total reads mapped back to our final assembly.

Gene annotation
The dammit pipeline recovered annotations for 68.89% of transcripts (12,399 out of 17,997), based on homology to entries in the Pfam-A, Rfam, OrthoDB, and uniref90 databases, which is comparable to the currently available Ankistrodesmus transcriptome. Only 9 of these recovered annotations came back as hypothetical proteins, and the remaining 31.11% of transcripts did not have an annotation hit across the protein databases.
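The annotation percentages follow directly from the two counts given above, as this short worked calculation shows. The variable names are illustrative; the counts are those reported in the text.

```python
TOTAL_CONTIGS = 17_997   # contigs in the final assembly
ANNOTATED = 12_399       # contigs with an annotation, as reported

annotated_pct = 100 * ANNOTATED / TOTAL_CONTIGS
unannotated = TOTAL_CONTIGS - ANNOTATED
unannotated_pct = 100 * unannotated / TOTAL_CONTIGS

print(f"annotated:   {ANNOTATED} ({annotated_pct:.2f}%)")
print(f"unannotated: {unannotated} ({unannotated_pct:.2f}%)")
```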
We used InterProScan to obtain broad functional groupings along with the dammit annotation output to confirm that annotations that we expected to see in photosynthetic unicellular eukaryotes were present. These results suggested that the greatest proportion of the annotated genes were related to oxidation-reduction biological processes (including photosynthesis-related functions and electron transport), protein phosphorylation, transmembrane transport, lipid and carbohydrate synthesis and metabolism, and DNA replication, regulation, and repair (Fig 2).

PLOS ONE

Genetic distance to other Ankistrodesmus
We obtained 16 sequences of rbcL in Ankistrodesmus from NCBI (Table 3) and constructed a neighbor-joining tree. We included sequences of two other members of the Selenastraceae, Raphidocelis microscopica and Kirchneriella aperta. No rbcL sequence was available for Monoraphidium convolutum (syn. A. convolutus) or for Ankistrodesmus sp. UCP0001. The rbcL sequence of the AJT strain was virtually identical to that of A. falcatus UTEX101, and the two sequences grouped together in 100% of bootstrap replicates (Fig 3). A. falcatus grouped most closely to one strain of A. stipitatus, but not closely to three other A. stipitatus strains. In general, deeper nodes were weakly supported, and rbcL distances suggest seven or more similarly related subgroups within Ankistrodesmus.

Availability of supporting data
Raw sequence data has been deposited in the Sequence Read Archive (SRA) under the accession PRJNA631045. This Transcriptome Shotgun Assembly project has been deposited at DDBJ/EMBL/GenBank under the accession GIOC00000000. The version described in this paper is the first version, GIOC01000000.

Discussion
A. falcatus is one of the most promising biofuel candidates due to its high lipid productivity compared to other algal species [9, 10] and is often used as a model for studying how changes to resource availability impact the lipid content important for biofuel production. For example, Alvarez-Diaz, et al. showed that manipulating the concentration of phosphorus or nitrogen and altering the light availability increases A. falcatus' lipid productivity substantially [78]. Even though Ankistrodesmus species are common freshwater chlorophytes, the phylogenetic relationships of the genus are poorly defined. We used the results from our transcriptome assembly to investigate the genetic distance between A. falcatus and other Ankistrodesmus species with publicly available rbcL sequences. Our neighbor-joining tree is based on a single chloroplast gene and should not be taken as an attempt to identify phylogenetic relationships among Ankistrodesmus species. However, it is the best available representation of genetic diversity across the genus and indicates that A. falcatus is a reasonable representative of Ankistrodesmus for genomic purposes. In fact, considering that the generic relationships among the Selenastraceae are poorly resolved and most genera appear to be polyphyletic [48], A. falcatus may be a reasonable representative of the whole family.
While discrepancies in assembly statistics are common due to differences in sequencing protocols, assembly methods, and biological variation [80,81], our A. falcatus assembly is comparable to other publicly available, high-quality algal transcriptomes from closely related organisms assembled with similar methods. Chlorophyte transcriptomes range upwards of 100,000 genes depending on assembly method, with average gene lengths between ~1,000 and 3,000bp. Wang, et al. assembled the transcriptome of the green algal model Chlamydomonas reinhardtii with 91,242 genes, an average contig length of 2,691, and an N50 of 4,554 [82]. Desmodesmus sp. WR1, another chlorophyte, has been assembled with 32,823 unigenes and an N50 of 1,905bp [83]. A transcriptome assembled for Scenedesmus acutus has 51,846 genes with an N50 of 1,351 and an average gene length of 824bp [84]. Yu, et al. created an assembly of the chlorophyte Chlorella minutissima UTEX2341 which had 14,905 contigs with an average contig length of 2,998bp [85]. While their study did not focus on a chlorophyte, Lauritano, et al. used a similar assembly pipeline (a combination of Trinity and Velvet/Oases) to create a de novo transcriptome for a dinoflagellate which had an average contig length of 1,490bp and an N50 of 2,055bp [86]. Our A. falcatus transcriptome has a total of 17,997 contigs with an N50 of 2,462bp and an average contig length of 1,737bp, which falls within published ranges of expected values for similar unicellular eukaryotes. While we do not have independent information regarding the actual gene lengths of A. falcatus, the N50 statistic and the average contig length reported here are an improvement over the short N50 (1,038bp) and average contig length (508bp) observed in the currently available Ankistrodesmus sp. transcriptome. It is possible that the currently available assembly is fragmented or missing information due to differences in sequencing depth and assembly methods, resulting in the shorter average contig lengths and a smaller N50 statistic.
Fig 3. https://doi.org/10.1371/journal.pone.0251668.g003
Our BUSCO results suggest that 7.9% of the supposedly single-copy orthologs are duplicated in the A. falcatus genome. The percent of duplicated BUSCOs is expected to be low because they evolve under single-copy control, but duplication percentages have been shown to range from 1.5% to 13% in other eukaryotes (including Drosophila melanogaster, Caenorhabditis elegans, Homo sapiens, Lottia gigantea, and Aspergillus nidulans; [47]). Our BUSCO results are comparable to these expectations, and substantially lower than many other available algal transcriptomes, where duplication is reported as high as 52% [87-89]. Because our assembly does not show a high level of gene duplication, it suggests that although gene duplications occur, there has not been a whole-genome duplication in Ankistrodesmus. Duplicated genes offer material for evolutionary forces to act upon, and some duplication events have been linked to stressful environmental conditions in algal species [90,91]. Selection on these duplicated genes may lead to adaptation within changing environments, and it is possible that the observed, retained gene duplications within the A. falcatus assembly may be a result of such scenarios.

Conclusion
The A. falcatus transcriptome presented here is of high quality and is an improvement over the currently available Ankistrodesmus assembly. Using data that emerged from our sequencing efforts, we created a simple neighbor-joining tree of Ankistrodesmus species. This revealed that A. falcatus appears to be a suitable representative of the Selenastraceae, as well as a good candidate for genomic studies. Though based on limited data, our tree also reinforces prior sequence-based phylogenies of Ankistrodesmus in suggesting the genus is in serious need of taxonomic revision. In both our analysis and other recent reports, distinct strains that are nominally the same species often do not group together. The transcriptome we report here is an important development for studies where community field sample identification may require genomic resources, such as in metagenomic and metatranscriptomic research in freshwater systems where Ankistrodesmus species may be prevalent. Additionally, A. falcatus could potentially be used for biofuel production, and is commonly used as a food source in zooplankton research. This assembly will be valuable to both of these fields as they move further into using genomics and bioinformatics techniques for addressing their central questions.