A pseudomolecule assembly of the Rocky Mountain elk genome

Rocky Mountain elk (Cervus canadensis) populations have significant economic implications to the cattle industry, as they are a major reservoir for Brucella abortus in the Greater Yellowstone area. Vaccination attempts against intracellular bacterial diseases in elk populations have not been successful due to a negligible adaptive cellular immune response. A lack of genomic resources has impeded attempts to better understand why vaccination does not induce protective immunity. To overcome this limitation, PacBio, Illumina, and Hi-C sequencing with a total of 686-fold coverage was used to assemble the elk genome into 35 pseudomolecules. A robust gene annotation was generated resulting in 18,013 gene models and 33,422 mRNAs. The accuracy of the assembly was assessed using synteny to the red deer and cattle genomes identifying several chromosomal rearrangements, fusions and fissions. Because this genome assembly and annotation provide a foundation for genome-enabled exploration of Cervus species, we demonstrate its utility by exploring the conservation of immune system-related genes. We conclude by comparing cattle immune system-related genes to the elk genome, revealing eight putative gene losses in elk.


Introduction
Rocky Mountain elk (Cervus canadensis) were once distributed across much of North America but now inhabit remote areas. Rocky Mountain elk were nearly exterminated from the Rocky Mountains of Alberta and British Columbia in the early 1900s [1], but were restocked between 1916-1920 with elk from the Greater Yellowstone Area [2][3][4][5]. By 1940 elk populations expanded so greatly, that periodic culling was necessary [3,6]. While elk have been reintroduced to many areas, the densest populations are maintained in mountainous remote areas, like the Greater Yellowstone Area. Elk typically avoid the presence of domesticated livestock, yet they will utilize the same grounds for grazing when livestock are absent [7]. This can be problematic for ranchers occupying areas near elk populations like the Greater Yellowstone Area. Elk are known reservoirs for brucellosis, (Brucella abortus) a disease that is highly contagious and poses a risk to livestock and humans [8][9][10]. Because of the potential for causing abortion in cattle, the USDA used vaccines and serologic testing to nearly eradicate B. abortus from domestic herds [11]. Yet in the last 15 years, over 20 cases of transmission to cattle have been traced to wild elk populations in the Greater Yellowstone Area. Attempts to establish long-term immunity through vaccination have proven unfruitful, as elk have negligible adaptive cellular immune responses to existing Brucella vaccines [12]. Because the eradication of B. abortus from cattle herds can cost hundreds of thousands of dollars and current tools make it unfeasible to control infection in wild elk, there is a need to dissect the genetic nature of limited immune responses in elk. With advances in sequencing technology (PacBio, Illumina and Hi-C), we are now able to investigate difference in adaptive immune response at the genomic level by examining the presence and absence of immune system-related genes. Here, we report a chromosomal level reference genome assembly and annotation of the Rocky Mountain elk and perform a preliminary investigation of immune gene loss between elk and cattle.

Animal selection
A long-captive herd in Minnesota provided a healthy adult male Rocky Mountain elk for Pac-Bio sequencing, and another for HiC and Chicago sequencing. White blood cells from six females from the aforementioned herd and six females from Wyoming were used for paired end sequencing, while an an elk calf, captive-born in Iowa, was used for RNA-seq. The research protocol was approved by the National Animal Disease Center Animal Care and Use committee and all animals under the protocol were maintained in accordance with animal care regulations.

Sequencing
For the initial contig assembly we generated a hybrid data set with Illumina PCR-free 150bp paired end reads and PacBio RSII reads produced with P6-C4 chemistry. Chicago and Hi-C libraries were prepared as described previously [13,14]. Both Chicago and Hi-C libraries were prepared similarly, though Hi-C libraries were nuclear-fixed. Briefly, formaldehyde-fixed chromatin was digested with DpnII, and 5' overhangs were sealed with biotinylated nucleotides. Blunt ends were ligated, followed by crosslink were reversed for DNA purification from protein. We then removed biotin that was not internal to ligated fragments. DNA was sheared to a mean length of~350 bp for library construction with NEBNext Ultra enzymes and Illumina-compatible adapters. Biotin-containing fragments were isolated using streptavidin beads before PCR enrichment of the libraries. Both Chicago and Hi-C libraries were sequenced on an Illumina HiSeqX at 2x150bp, attaining totals of 470 million and 500 million reads, respectively.
To prepare samples for PacBio and Illumina sequencing, DNA from purified peripheral blood mononuclear cells was isolated using a Gentra Puregene Blood Kit (Qiagen) and Genomic-tip 500/G kit (Qiagen), respectively, in accordance with manufacturer recommendations. Resulting DNA preparations were quantified using Qubit Broad Range Assay (Ther-moFisher) and assessed for quality via Nanophotometer Pearl (Implen). Prior to Pacific Biosciences (PacBio) library preparation, DNA fragment size was evaluated using the HS Large Fragment 50 Kb method on fragment analyzer (Advanced Analytical Technologies, Inc.) and determined to have an average size of approximately 40 kb. The DNA was sheared to approximately 20kb, size separated using a Blue Pippin using the PAC-30 KB cassette (Sage Science). Libraries were prepared for PacBio sequencing using the large insert library protocol and Illumina sequencing using the TruSeq PCR-free kit per manufacturer recommendations. Long read sequencing was conducted on the Pacific Biosystems RS II. Illumina short read sequencing (150 bp PE) was conducted on the HiSeq 3000 platform in accordance with manufacturer recommendations.
For preparation of RNAseq data tissue samples (skeletal muscle, spleen, kidney, lung, prescapular lymph node and mesenteric lymph node) were collected and stored in RNAlater™ (Ambion) at 4˚C. Excess RNAlater™ was removed following overnight incubation, and samples were stored at -80˚C. For RNA isolation, approximately 50 mg of each tissue were added to 1 ml of TRIzol© (ThermoFisher) and processed according to manufacturer's instructions. Following collection of the aqueous phase, samples were purified using the Purelink© RNA Mini kit (ThermoFisher), following manufacturer's recommendations. RNA quality was assessed using an Agilent Bioanalyzer using the RNA 6000 Nano kit. RNA concentrations were determined using a Nanodrop (ThermoFisher). Sequencing libraries were prepared after ribosomal RNA depletion using the Ribo-Zero H/M/R kit (Illumina) and stranded total RNA-seq libraries were prepared using the Ultra II RNA library prep kit (New England Biolabs) per manufacturer's recommendations. Resulting libraries were sequenced using a HiSeq 3000 (Illumina) and 100 cycle paired-end chemistries.

Genome assembly
An initial genome assembly was generated with Masurca version 3.2.3 [15], attaining a 2,559.8 Mbp genome size in 29,125 contigs with N50 size of 1,224,689bp. Dovetail Genomics scaffolded this assembly using an iterative HiRise analysis informed via alignments of Chicago and then Hi-C libraries with a modified SNAP aligner (http://snap.cs.berkeley.edu). This assembly contained 2,560.5 Mb, with an L90 of 31 scaffolds, and a N90 of 43.374 Mb. 1,004,453,472 Chicago and Hi-C reads were used to scaffold this Dovetail assembly with a Juicer 1.5.6, 3D-DNA 180922, and JuiceBox 1.9.8 [16,17]. Reads were extracted from bam files with Picard 2.9.2 [18]. The Dovetail assembly was masked using RepeatModeler 4.0.7 [19] and RepeatMasker 1.0.8 [20], prior to the alignment of Hi-C reads with BWA mem 0.7.17 [21]. Alignments were processed using Juicer, 3D-DNA [22], and Juicebox [16,17]. The Juicebox assembly strategy consisted of: manually placing all contigs greater than 10kb, incorporating scaffolds at the highest Hi-C signal, placing scaffolds in non-repetitive regions when Hi-C signal was equal between a repetitive and non-repetitive region, repeats were clustered whenever possible, and only obvious mis-joins were edited. The initial Juicebox scaffolding created 34 pseudomolecules, which was then compared to the Cervus elaphus hippelaphus genome assembly (GCA_002197005.1) [23] to reveal the merger of the X and Y chromosomes. A BLASTn [24] of the C. elaphus hippelaphus genome sequence was used to identify coordinates, allowing the correct separation the X and Y chromosome via the heatmap in Juicebox. The 3D-DNA assembly finished with 22,557 scaffolds.
The contigs that could not be integrated into the pseudomolecules were eliminated based on repetitiveness, duplicated heterozygous contigs, RNA-seq mapping potential, and contig size (>500 bp). BEDTools 2.25.0 [25] was used to merge coordinates from mapping these contigs to the pseudomolecules with BLAST+ 2.9 (score >300) and RepeatMasker 1.0.8 [20] masking coordinates. 22,065 contigs were eliminated that were less than 1kb, had at least 90% query coverage, and lacked a single unique mapping RNA-seq read, leaving 35 pseudomolecules, 457 contigs, and a mitochondrial genome. . Due to uneven and excessive coverage in repetitive regions, paired end alignments were set at a max coverage of 30x using jvarkit [32]. Due to the excessive repetitiveness of Chromo-some_14, 50Mbp of this chromosome was not polished.
After polishing, another round of small contig elimination was performed by merging RepeatMasker [20] coordinates and coordinates from BLAST+ 2.9 [24] (score >300, width 1000bp) to the pseudomolecules with Bedtools 2.25.0 [25]. If 90% of query length was repetitive and contained within the pseudomolecules, it was eliminated. BlobTools 1.11 [33] was run with PacBio subread alignments to the genome, and contigs annotated with BLAST [24] to the NT database (S1 Fig). All scaffolds passed contamination screening, resulting in a final assembly containing 35 pseudomolecules, 151 contigs, and the mitochondrion.

Mitochondrial identification and annotation
BLAST+ 2.9 [24] was used to identify the mitochondrial genome by querying the mitochondrial scaffold of the C. elaphus hippelaphus GCA_002197005.1 [23]. Though the mitochondrial genome was identified, it contained three juxtaposed mitochondrial genome duplications. The scaffold was manually corrected using genomic coordinates with faidx in Samtools 1.9 [31]. Genes were annotated in the mitochondrial genome using the Mitos2 webserver [34] with RefSeq 89 Metazoa, a genetic code of 2, and default settings.

Repeat prediction
A final version of predicted repeats was obtained using-sensitive 1 and-anno 1 for EDTA 1.7.9 [35] and with default parameters for RepeatModeler 1.0.8 [19] with RepeatMasker 4.1.0 [20].

Gene prediction
A total of 753,228,475 RNA-seq reads aligned to the genome using Hisat2 2.0. 5

BUSCO
Universal single copy orthologs were assessed using BUSCO 4.0 [53,54], with the eukaryo-ta_odb10 and cetartiodactyla_odb10 datasets in both genome and protein mode.

Identification and verification of immune system-related genes
Immune system-related genes from Bos taurus were found in the GENE-DB database of the International ImMunoGeneTics website (www.imgt.org) [59]. This database is comprised of immunoglobulins (IG), T cell receptors (TR) and major histocompatibility (MH) genes from vertebrate species. A tblastn (2.9.0+) [24] was performed against the elk and cattle genome assembiles (GCF_002263795.1_ARS-UCD1.2) [55], with an e-value cutoff of 1e-3. We removed candidate missing genes based on whether a similar isoform was present in the elk genome. To continue finding candidate missing genes in the elk genome, not found by tBLASTn, we investigated using Bedtools 2.25.0 extracted cattle nuceotide sequences with a BLASTn to the elk genome. Those genes that were still not found via BLASTn [24], were modified to retain 20 bp border sequences with Bedtools 2.25.0, and subjected to another BLASTn [24] to the elk genome. If a gene was still not found, hit sequences in the cattle genome were expanded by 100bp with Bedtools 2. 25

Results and discussion
Here we present the first pseudomolecule assembly of C. canadensis, generated with 1.7 trillion base pairs of sequencing at a 686-fold coverage of the genome.

Genome assembly
An initial assembly was created with MaSuRCA [15,60] generating 23,302 contigs, an L90 of 2,500 contigs, and an N90 of 197,963bp. Through collaboration with Dovetail Genomics and then additional implementation of the Juicer/JuiceBox/3D-DNA pipeline [16,17,22], we generated an assembly of 33 autosomes, an X chromosome, a Y chromosome, a mitochondrial genome, and 151 unincorporated contigs. This result is supported by published cytological studies revealing a haploid set of 34 chromosomes [61]. We utilized synteny to identify homologous chromosomes between elk and red deer, and found that nearly always, elk chromosome sizes fell within the estimated size of the red deer's assembled chromosomes [23] (S1 Table). The only exception is the Y chromosome, which was nearly twice (7.6 Mb) the largest predicted size (4 Mb) of the red deer chromosome. We investigated all putative contaminant contigs from Blobtools [33], and ruled out contamination (S1 Fig), but also took additional steps to ensure the completeness of the genome by mapping reads back to the assembly. We found that we captured the majority of genome, with 90.7% and 87.3% of PacBio CCS reads Illumina DNA-seq aligning to the genome (S2 Table). To evaluate the completeness of the genome we ran BUSCO 4.0.2 [54] (Benchmarking Universal Single Copy Orthologs) on genome. Of the possible 255 and 13,335 genes in the eukaryota and certartiodactyla odb10 datasets, 62% and 88.1% were complete, 2.4% and 2.1% were duplicated, and 3.1% and 2.1% were fragmented, and 32.5% and 9.8% were missing, respectively.

Genome annotation
To obtain a high-quality elk gene prediction, we pursued an extensive annotation of repeats in the genome using two repeat predictors. While EDTA [35] utilizes a comprehensive set of repeat prediction programs to create a repeat annotation, Repeatmodeler/Repeatmasker [19,20] is a long-standing and comparable annotator of repeats that is more reliant on copy number. With EDTA, 25.8% of the genome was marked repetitive, with DNA transposons comprised the largest percentage of repeats in the genome, at 16% (S3 Table). In contrast, RepeatMasker assessed 36.5% of the genome as an interspersed repeat, with 28.8% of the genome being comprised LINE retrotransposons. We merged these repeat annotations with BEDTools [25] to reveal that 38% of the genome is repetitive. This is in contrast to the repetitive content in red deer, estimated at 22.7%. This difference could be due to technological improvements and could stem from the large proportion of gaps in the red deer genome (1.5Gbp) [23]. While together these differences could account for a large disparity in chromosome sizes, only the elk Y chromosome was outside the gapped and sequence length range in red deer chromosomes [23].
To annotate the genes in the genome we generated 1.5 billion paired end reads of sequencing from six tissues, including kidney, lung, mesenteric lymph node, muscle, prescapular lymph node, and spleen. After masking repeat sequences using Repeatmodeler [19] and Repeatmaker [20], we performed five de novo transcript/gene predictions with a soft-masked genome and RNA-seq. The best transcripts were discerned using Mikado [47], followed by clustering with Cufflinks [50] using B. taurus mRNAs to cluster transcripts into gene loci. Using this approach 18,013 genes were predicted to encode 33,433 mRNAs (S4 Table). The functional annotations of these genes were extremely high, with 17,938 of the 18,013 genes or 99.6% being annotated by at least one of: Interproscan or BLAST to NR, NT, and Uniprot (S5 Table). The gene annotation was evaluated for completeness with BUSCO in protein mode. A remarkable "Complete" score improvement is seen in both eukaryota and cetartiodactyla at 97.7% and 92.1%, respectively. These results together suggest that both the genome and the gene prediction are of high quality.

Comparison to related species
By utilizing these new gene predictions we evaluated the conservation of chromosome structure between C. canadensis, C. elaphus hippelaphus, and B. taurus using gene-based synteny with i-ADHoRe [56]. All elk chromosomes were syntenic with all C. elaphus and B. taurus chromosomes, though Y chromosome lacked the genes required for gene-based synteny (Fig  1, Table 1). As has been seen in previous Cervus assemblies [23], multiple pairs of chromosomes are tandemly fused in B. taurus and vise-versa (Table 2). We confirmed previous reports Table 1. Chromosome statistics of the Rocky Mountain elk assembly compared to red deer, with syntenic relationships to red deer, sika deer, cattle, sheep and human.

Cervus canadensis Total length (bp) Repetitive elements (bp) Gene Frequency Red Deer Gene Frequency
Chromosomal Relationships  Table 2). Two inter-chromosomal translocations were inferred between the two Cervus species, both having strong Hi-C support in elk (Fig 1, Table 3). Chromosome_15 and Chromosome_24 of elk, comprised large portions of C. elaphus Ce_Chr_33 and a minor portion of Ce_Chr_8. With the majority of Chromosome_24 homologous to C. elaphus hippelaphus Ce_Chr_8, a 17 MB region of Ce_Chr_33 may have been falsely attached to Ce_Chr_8 in C. elaphus hippelaphus. Another smaller chromosome translocation of 13.6 MB occurred between Ce_Chr_22 and Ce_Chr_3 of C. elaphus, attributed to chromosomes 21 and 25 in C. canadensis. A small region of Ce_Chr_22 was likely falsely attached to Ce_Chr_3 in C. elaphus hippelaphus. Interestingly, both of these translocations are between chromosomes in elk that are fused chromosomes in B. taurus, Bt_Chr_2 and Bt_Chr_5 (Table 3). While it is possible that these translocations occurred since the divergence of these two species, because the B. taurus assembly was used to orient and join scaffolds in the C. elaphus hippelaphus genome assembly, it is likely that these translocations are misassemblies in the C. elaphus hippelaphus genome.

Immune gene loss
A total of 36 Bos taurus immune coding sequences from the IMGT GENE-DB database [59] were lacking from initial investigations of the elk genome, and yet were identified in cattle genome. Despite extensive attempts to identify these genes in the elk genome with tBLASTn, BLASTn of cattle hit sequences, and BLASTn of cattle hit sequences with 20bp borders, we were unable to identify putative elk orthologs (Table 4, S6 Table). However, seventeen putative gene loci were identified in elk using a BLASTn of cattle nucleotide sequences hit by the tBLASTn, an additional twelve were found using the broadened cattle hit sequences with 20bp borders, and seven were confirmed missing from the genome (S6 Table, Table 4). We found a Ce_Chr_8 has a 17Mbp region of Ce_Chr_33, and Ce_Chr_3 has a 13.6Mb region of Ce_Chr_22. P is proximal, q represents distal.
https://doi.org/10.1371/journal.pone.0249899.t003 Table 4. Read mapping of suspected missing genes in the elk genome. complete lack of genomic gaps in these regions, confirming the contiguity of these suspected gene regions. However, RNA-seq aligned to 27/36 of these suspected loci, indicating genomic variation in these regions may prevent their identification. Nevertheless, nine genes lacked a translatable sequence in the elk genome and could not align RNAseq, confirming their absence from both genomic and transcriptomic data. These genes were AY644517_TRGC4, IMGT000049_TRAJ8-1, IMGT000049_TRAJ3, IMGT000049_TRAJ17, IMGT000049_TRAJ42, IMGT000049_TRAJ49, IMGT000049_TRAJ56, KT723008_IGHD, and a homolog of (AY149283_IGHJ1-2,KT723008_IGHJ2-2,NW_001494075_IGHJ1-2) (S6 Table). All of these loci encode components of the T cell receptor: (gamma constant 2), (T cell receptor alpha joining), and (delta chain) or are heavy chains in the immunoglobulin complex (S6 Table). Ruminants, including elk, differ from rodents and humans by the high proportion (sometimes 40-50%) of T cells circulating in the peripheral blood expressing γδ receptors. In all species, γδ T cells are involved in diverse and important roles in not only adaptive, but also innate immune responses [62]. Rearrangements of V (variable), J (joining) and C (constant) regions of the γ chain when combined with the δ chain contribute to the repertoire diversity of the γδ T cell receptor. While future work will be necessary to understand how the loss of these genes affects the cellular immune response in elk, certainly the loss of T-cell receptor diversity is an important consideration in discerning why elk does not develop protective immunity after B. abortus vaccination. Because B. abortus is a facultatively intracellular bacteria, stages of the disease cannot be accessed by antibodies, and thus cellular immune responses must be activated by T cell receptors interacting with antigens on the surface of infected cells [63,64]. In cattle, protection to some bacterial diseases via vaccines is mediated by memory T cells activating effector T cells and some specific cases, effector T cell populations bearing gamma-delta chain receptors. A reduction in the number of available T cell receptor variants could limit or hinder immune responses to some antigens. Thus, this investigation provides a foundation for the development of a viable vaccination strategy in elk, a step towards developing long-term immunity to Brucella.

Conclusions
This genome assembly and annotation of the Rocky Mountain elk is the most contiguous assembly of a Cervus species and will serve as an important tool for genomic exploration of all related Cervids. Elk's loss of immune system-related genes in relation to cattle, may provide a clue to establishing a successful vaccination strategy. This chromosomal assembly of the elk genome will provide an excellent resource for investigating genes involved in elk's poor adaptive cellular immune response to Brucella vaccines.
Supporting information S1