Manual curation for improved genome annotation of the functionally extinct northern white rhinoceros (Ceratotherium simum cottoni)

  • Elena Ruggeri ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    eruggeri@sdzwa.org

    Affiliation Reproductive Sciences, Conservation Science Wildlife Health, San Diego Zoo Wildlife Alliance, Escondido, California, United States of America

  • Julien Prunier,

    Roles Data curation, Software, Writing – review & editing

    Affiliation Genovalia, Centre de valorisation des données en génomique non-humaine, Université Laval, Québec City, Québec, Canada

  • Marc-André Sirard,

    Roles Supervision, Writing – review & editing

    Affiliation Département des Sciences Animales, Université Laval, Québec City, Québec, Canada

  • Barbara Durrant,

    Roles Writing – review & editing

    Affiliation Reproductive Sciences, Conservation Science Wildlife Health, San Diego Zoo Wildlife Alliance, Escondido, California, United States of America

  • Kristin Klohonatz

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Supervision, Writing – original draft, Writing – review & editing

    Affiliation Center for Research on Reproduction and Women’s Health, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America

Abstract

The northern white rhinoceros (Ceratotherium simum cottoni) genome and annotation were previously published, but the annotation contained few genes, many misalignments, and nomenclature that did not match HGNC/VGNC naming conventions, making transcriptional studies very difficult. We used in vivo collected granulosa cells for RNA sequencing and de novo transcript assembly with StringTie to identify the nucleotide sequences of the genes expressed in our samples. Through extensive manual curation we generated a greatly improved genome annotation, increasing gene numbers by 81%. This improved annotation will make it far easier for researchers to use the genome for transcriptional studies in this species.

Background

White rhinoceroses play a key ecological role within African ecosystems. An enduring conservation plan has been developed to reverse the decline of this keystone species [18]. Poaching remains the primary threat, with the northern white rhinoceros (Ceratotherium simum cottoni, NWR) population functionally extinct and the southern white rhinoceros (Ceratotherium simum simum, SWR) threatened. Over the past decade, significant conservation efforts have focused on saving the critically endangered NWR, culminating in the successful application of assisted reproductive technologies (ARTs), including the in vitro production of blastocysts from the final two surviving females [1,2]. These advancements were made possible largely by the SWR, a closely related subspecies that has served as a critical model in developing these techniques [2]. Although the NWR and SWR have been geographically isolated for thousands of years, recent studies confirm that their genomes are remarkably similar at the chromosomal and genomic level [9]. This genetic compatibility makes the SWR a promising candidate for developing and applying reproductive strategies aimed at rescuing the NWR. Genomic resources, such as the newly published NWR genome [9], transcriptomic studies on reproductive cells from SWR [5,7,8], and the ongoing development of induced pluripotent stem cells (iPSCs) [9], are available to contribute to conservation efforts for rhinoceroses.

Genomic strategies and assisted reproductive technologies are essential to counteract loss of biodiversity, hence the urgency in continuing to develop and advance conservation strategies for this species. The NWR reference genome, published in early 2025 [10], marked a milestone in wildlife genomics and allowed white rhinoceros research to further integrate molecular genetic approaches. That study highlighted the close genetic homology between the NWR and SWR [10], leading to the interchangeable use of their genomes for research purposes. However, the published genome and annotation were limited by the samples available and the computational pipeline used. The annotation was not thorough and mixed several gene nomenclatures, which made transcriptional studies difficult to design and interpret. Furthermore, this annotation was generated using BRAKER3 [11], and protein-coding sequences were generated from human, mouse, southern white rhinoceros, domestic horse, and blue whale proteins. Of the 14,274 annotated transcripts, only 7,299 (51%) were called to the correct genetic sequence; 6,763 were assigned to protein names (rendering them difficult to use in downstream pathway analysis); 212 were incorrectly assigned to sequences; and an additional 455 transcripts were assigned to bacterial genes. These faults created an urgent need to improve the annotation.

To improve the NWR genome annotation, we sequenced RNA from granulosa cells, which are reproductive cells, at a variety of developmental stages, with the expectation of capturing a larger diversity of transcripts than the previous annotation [10]. Sequencing reads were aligned to the NWR genome and assembled into transcripts. This analysis of transcriptionally active cells allowed the identification of a significant number of genes and led to a more thorough annotation, allowing future genomic studies to advance the ongoing conservation work in this species [5,7,8].

Results and discussion

Total RNA was extracted and sequenced using a paired-end short-read technology. The sequencing depth averaged 37,364,052 reads per sample (Table 1). These reads were then aligned to the northern white rhinoceros (NWR) genome and assembled into transcripts, which were manually curated based on homology with other rhinoceros species (Ceratotherium simum simum and Diceros bicornis) and the phylogenetically related Equus caballus.
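The alignment and assembly steps were run on Galaxy (see Methods). As a rough command-line equivalent, the sketch below only builds and prints HISAT2, samtools, and StringTie invocations without running them. The sample names, file paths, index name, the samtools sorting step, and the choice of the `--dta` and `-l` options are assumptions for illustration, not settings taken from the authors' workflow, which is given in S1 File.

```python
import shlex


def sample_commands(sample: str, r1: str, r2: str, index: str, outdir: str) -> list[list[str]]:
    """Build per-sample align -> sort -> assemble commands (not executed here)."""
    sam = f"{outdir}/{sample}.sam"
    bam = f"{outdir}/{sample}.sorted.bam"
    gtf = f"{outdir}/{sample}.gtf"
    return [
        # --dta asks HISAT2 to report alignments tailored for transcript assemblers
        ["hisat2", "--dta", "-x", index, "-1", r1, "-2", r2, "-S", sam],
        # StringTie expects coordinate-sorted alignments
        ["samtools", "sort", "-o", bam, sam],
        # -l sets the prefix used when naming assembled transcripts
        ["stringtie", bam, "-o", gtf, "-l", sample],
    ]


def merge_command(gtfs: list[str], merged: str) -> list[str]:
    """Build the command that merges per-sample assemblies into one annotation."""
    return ["stringtie", "--merge", "-o", merged, *gtfs]


if __name__ == "__main__":
    # Hypothetical sample names and file paths, for illustration only.
    samples = ["gc_sample_01", "gc_sample_02"]
    gtfs = []
    for s in samples:
        for cmd in sample_commands(s, f"{s}_R1.fastq.gz", f"{s}_R2.fastq.gz", "CerSimCot1.0_index", "out"):
            print(shlex.join(cmd))
        gtfs.append(f"out/{s}.gtf")
    print(shlex.join(merge_command(gtfs, "out/merged.gtf")))
```

Each sample is assembled on its own and the per-sample GTF files are merged afterwards, mirroring the per-sample assembly and union step described in the Methods.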

Table 1. The sequencing depth of samples used for annotation generation.

https://doi.org/10.1371/journal.pone.0340594.t001

As expected, given the sampling of granulosa cells at various developmental stages, the final annotation after manual curation showed a large increase in both annotated transcripts and genes. The original annotation contained 14,274 functional transcripts, while the new annotation contains 34,385 functional transcripts, an increase of 141% (Table 2). Following the same trend, the original annotation contained only 8,701 functional genes, whereas after manual curation the gene number increased by 81%, resulting in 15,738 functional genes in the new annotation (Fig 1). This substantial increase was primarily due to genes that were not included in the first annotation but were detected as “de novo” transcripts in the granulosa cells. While the total number of identified genes increased substantially, approaching the numbers reported for other well-characterized mammalian genomes (e.g., 20,848 gene models in cow, Ensembl release 113), the average transcript length remained unchanged (Table 2), supporting the quality of the bioinformatic work. The total sequence length of all transcripts represented 48% of the genome assembly, which may appear high but is in line with other reported transcriptome lengths, such as those of rat or human [12,13]. This is likely related to our total RNA extraction and sequencing, which resulted in a transcriptome that includes unspliced transcripts. In addition, this transcriptome included some repeated elements (including LINEs), and we cannot entirely rule out the presence of pseudogenes (full-length or truncated sequences that may be transcribed but not translated into functional coding sequences). Altogether, we significantly improved the genome annotation through the identification of many additional gene sequences.
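The percent increases quoted above follow directly from the transcript and gene counts. A minimal check, using only the counts stated in the text (the function name is ours, for illustration):

```python
def percent_increase(old: int, new: int) -> float:
    """Relative increase from old to new, expressed in percent."""
    return (new - old) / old * 100


# Counts as reported for the original and the new annotation.
transcripts = percent_increase(14_274, 34_385)
genes = percent_increase(8_701, 15_738)

print(f"functional transcripts: +{transcripts:.0f}%")
print(f"functional genes: +{genes:.0f}%")
```

Rounded to the nearest whole percent, these give the 141% and 81% figures reported for transcripts and genes, respectively.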

Table 2. Descriptive Parameters of the Northern White Rhinoceros Genome Annotation.

https://doi.org/10.1371/journal.pone.0340594.t002

Fig 1. Improvement of the northern white rhinoceros genome annotation before and after manual curation.

The values represent the percent increase from the original to the new annotation.

https://doi.org/10.1371/journal.pone.0340594.g001

Evaluation of the “de novo” transcriptome showed that many sequences originated from specific genes, and these were assigned accordingly. Gene names from the previous annotation were also corrected to reflect HGNC/VGNC (HUGO Gene Nomenclature Committee/Vertebrate Gene Nomenclature Committee) naming conventions. BUSCO analysis revealed a completeness of 93.4% (single copy 89.4%, duplicated 3.9%), supporting the quality and completeness of our transcriptome. The resulting annotation is more usable and complete for transcriptional studies in this species.

Conclusions

We significantly improved the annotation of the northern white rhinoceros genome using RNA-seq mapping and transcriptome assembly. This new annotation of the northern white rhinoceros genome CerSimCot1.0 (GCA_021442165.1) is now available and will undoubtedly aid genomic studies and conservation strategies for this and closely related species.

Methods

Four southern white rhinoceros females were anesthetized, and transrectal ovum pickup (OPU) was performed as previously described [5,7]. All procedures, experiments, and methods were reviewed and approved by San Diego Zoo Wildlife Alliance’s Institutional Animal Care and Use Committee (IACUC; protocol number 18−018). To alleviate any pain, all animals underwent anesthesia protocols that were described in the previously referenced publication [7]. OPU was achieved using a customized, ultrasound-guided probe fitted with double-lumen needles. The follicles were first ablated, then the contents were aspirated and rinsed with a warm (37°C) flushing solution (Vigro) containing 12.5 I.U./mL of heparin. After OPU, once the oocytes had been isolated from the collection, fluid and free-floating mural granulosa cells were collected and pipetted directly into RNAlater (Thermo Fisher Scientific, Waltham, MA). In total, granulosa cells from ten follicles were used for this study, and 14 tubes (samples) were used for RNA isolation and sequencing. In more detail, 4 rhino samples (GEO accession: GSE261038) were run in technical replicates, with RNA isolated from separately stored pools of granulosa cells, for a total of 8 granulosa cell samples. The samples in GSE300824 were not run in technical replicates, resulting in a total of six granulosa cell samples. Across the two experiments, two growing, six dominant, and two pre-ovulatory follicles were represented.

Total RNA was isolated from granulosa cells using an Arcturus PicoPure RNA Isolation Kit (Thermo Fisher Scientific, Waltham, MA) per the manufacturer’s instructions. A Qubit 4 Fluorometer (Thermo Fisher Scientific, Waltham, MA) was used for quantification, and a 4150 TapeStation System (Agilent Technologies, Santa Clara, CA) was used to determine RNA integrity number (RIN) values. Only samples with a RIN greater than 6.0 were used for RNA-Seq analysis.
Library preparation and RNA sequencing were performed at the University of California San Diego Institute for Genomic Medicine Center. Following the manufacturer’s protocol, RNA-sequencing (cDNA) libraries were prepared using the Illumina TruSeq Stranded Total RNA Prep with Ribo-Zero Plus for six samples. The libraries were sequenced as 100 bp paired-end reads on an Illumina NovaSeq 6000 (Illumina, San Diego, CA). The raw data files were uploaded to the Gene Expression Omnibus under accession number GSE261038. The remaining eight libraries were prepared per the manufacturer’s protocol using the NovaSeq X plus, Illumina stranded Total RNA with Ribozero plus library preparation kit. These prepared libraries were sequenced as 150 bp paired-end reads on an Illumina NovaSeq 6000 (Illumina, San Diego, CA). The raw data files were uploaded to the Gene Expression Omnibus under accession number GSE300824.

Bioinformatic analysis was performed on the Galaxy web platform using the public server usegalaxy.org [14]. HISAT2 was used to align reads to the northern white rhinoceros (NWR) genome CerSimCot1.0 (GCA_021442165.1) [10,15,16]. Transcript assembly for both annotated and unannotated transcripts was performed with StringTie2, and each animal sample was analyzed individually [17]. The resulting transcripts for individual samples were merged to create a final annotation file representing the union of all samples [17]. The generated annotation file was used for subsequent manual curation. The complete Galaxy workflow can be found in Supplemental Material 1.

Manual curation was performed by determining the nucleotide sequence of each region in the merged StringTie annotation file [17] that had no gene assigned. These sequences were searched individually in NCBI BLAST [18] for homology with other rhinoceros species (Ceratotherium simum simum and Diceros bicornis) and the phylogenetically related Equus caballus [10].
If a sequence had a percent identity of 80% or greater [19] to a known gene, the sequence was assigned to that gene. In addition, regions that were assigned to a gene by Wang et al. [20] were reevaluated through the same manual curation process. Erroneously assigned genes, genes with protein nomenclature (rendering them unusable for pathway analyses), and bacterial genes were discovered during manual curation, and these errors were corrected. Genes were named using the HGNC/VGNC naming conventions for ease of use in further transcriptional studies. Transcriptome completeness was assessed using BUSCO analysis [21]. The resulting annotation was made publicly available in the following GitHub repository (https://github.com/eruggeri/Northern-White-Rhinoceros-Annotation).
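The 80% identity rule lends itself to a small illustration. The curation in this study was done manually through NCBI BLAST, so the sketch below is not the authors' procedure. It only assumes that BLAST hits are available in the standard 12-column tabular layout, and it keeps, for each query, the highest-scoring hit that meets the identity threshold. The query and subject identifiers, the example values, and the tie-breaking by bit score are hypothetical.

```python
import csv
import io

# Column names of the standard 12-column BLAST tabular layout.
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text: str, min_identity: float = 80.0) -> dict[str, tuple[str, float]]:
    """Map each query to (subject, percent identity) of its best qualifying hit."""
    best: dict[str, tuple[str, float, float]] = {}
    reader = csv.DictReader(io.StringIO(tabular_text), fieldnames=FIELDS, delimiter="\t")
    for row in reader:
        pident = float(row["pident"])
        bits = float(row["bitscore"])
        if pident < min_identity:
            continue  # below the identity threshold: leave unassigned
        query = row["qseqid"]
        if query not in best or bits > best[query][2]:
            best[query] = (row["sseqid"], pident, bits)
    return {q: (subject, pident) for q, (subject, pident, _) in best.items()}


if __name__ == "__main__":
    # Hypothetical hits, for illustration only.
    rows = [
        ("MSTRG.1", "geneA_SWR", "97.5", "1200", "30", "0", "1", "1200", "1", "1200", "0.0", "2100"),
        ("MSTRG.1", "geneA_horse", "88.0", "1150", "138", "2", "1", "1150", "5", "1154", "0.0", "1500"),
        ("MSTRG.2", "geneB_horse", "72.3", "400", "111", "4", "1", "400", "10", "409", "1e-40", "300"),
    ]
    text = "\n".join("\t".join(r) for r in rows)
    for query, (subject, pident) in best_hits(text).items():
        print(f"{query}\t{subject}\t{pident}")
```

In this example the first query is assigned to its higher-scoring hit, while the second has no hit at or above 80% identity and stays unassigned.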

Supporting information

S1 File. Galaxy workflow used to generate annotated and unannotated sequences.

This file contains the code lines for the workflow used in Galaxy to generate the annotated and unannotated sequences for the annotation file.

https://doi.org/10.1371/journal.pone.0340594.s001

(PDF)

Acknowledgments

We would like to acknowledge the Reproductive Sciences team, Dr. Jacobo Rodriguez, Veterinary Services, Conservation Science Wildlife Health team members, and the Rhino Rescue Center team and animals at San Diego Zoo Wildlife Alliance for animal management, care, and sample collection.

References

  1. Hildebrandt TB, Holtze S, Colleoni S, Hermes R, Stejskal J, Lekolool I, et al. In vitro fertilization program in white rhinoceros. Reproduction. 2023;166(6):383–99. pmid:37877686
  2. Hildebrandt TB, Hermes R, Colleoni S, Diecke S, Holtze S, Renfree MB, et al. Embryos and embryonic stem cells from the white rhinoceros. Nat Commun. 2018;9(1):2589. pmid:29973581
  3. Zywitza V, Rusha E, Shaposhnikov D, Ruiz-Orera J, Telugu N, Rishko V, et al. Naïve-like pluripotency to pave the way for saving the northern white rhinoceros from extinction. Sci Rep. 2022;12(1):3100. pmid:35260583
  4. Biasetti P, Hildebrandt TB, Göritz F, Holtze S, Stejskal J, Galli C, et al. Ethics at the Edge of Extinction: Assisted Reproductive Technologies (ART) in the Conservation of the Northern White Rhino. J Agric Environ Ethics. 2024;38(1).
  5. Ruggeri E, Klohonatz K, Sirard M-A, Durrant B, Coleman S. Genomic insights into southern white rhinoceros (Ceratotherium simum simum) reproduction: Revealing granulosa cell gene expression. Theriogenology Wild. 2023;3:100055.
  6. Ruggeri E, Young C, Ravida N, Sirard MA, Krisher R, de la Rey M. Glucose consumption and gene expression in granulosa cells collected before and after in vitro oocyte maturation in the southern white rhinoceros (Ceratotherium simum simum). Reprod Fertil Dev. 2022.
  7. Klohonatz K, Durrant B, Sirard M-A, Ruggeri E. Granulosa cells provide transcriptomic information on ovarian follicle dynamics in southern white rhinoceros. Sci Rep. 2024;14(1):19321. pmid:39164442
  8. Gad A, Menjivar NG, Felton R, Durrant B, Tesfaye D, Ruggeri E. Mapping the follicle-specific regulation of extracellular vesicle-mediated microRNA transport in the southern white rhinoceros (Ceratotherium simum simum)†. Biol Reprod. 2024;111(2):376–90. pmid:38775197
  9. Wang G, Korody ML, Brändl B, Hernandez-Toro CJ, Rohrandt C, Hong K, et al. Genomic map of the functionally extinct northern white rhinoceros (Ceratotherium simum cottoni). Proc Natl Acad Sci U S A. 2025;122(20):e2401207122. pmid:40359041
  10. Wang G, Korody ML, Brändl B, Hernandez-Toro CJ, Rohrandt C, Hong K, et al. Genomic map of the functionally extinct northern white rhinoceros (Ceratotherium simum cottoni). Proc Natl Acad Sci U S A. 2025;122(20):e2401207122. pmid:40359041
  11. Gabriel L, Brůna T, Hoff KJ, Ebel M, Lomsadze A, Borodovsky M, et al. BRAKER3: Fully automated genome annotation using RNA-seq and protein evidence with GeneMark-ETP, AUGUSTUS, and TSEBRA. Genome Res. 2024;34(5):769–77. pmid:38866550
  12. Ji X, Li P, Fuscoe JC, Chen G, Xiao W, Shi L, et al. A comprehensive rat transcriptome built from large scale RNA-seq-based annotation. Nucleic Acids Res. 2020;48(15):8320–31. pmid:32749457
  13. Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A, et al. Landscape of transcription in human cells. Nature. 2012;489(7414):101–8. pmid:22955620
  14. Afgan E, Baker D, Batut B, van den Beek M, Bouvier D, Cech M, et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res. 2018;46(W1):W537–44. pmid:29790989
  15. Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015;12(4):357–60. pmid:25751142
  16. Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol. 2019;37(8):907–15. pmid:31375807
  17. Pertea M, Pertea GM, Antonescu CM, Chang T-C, Mendell JT, Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol. 2015;33(3):290–5. pmid:25690850
  18. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:421. pmid:20003500
  19. Pearson WR. An introduction to sequence similarity (“homology”) searching. Curr Protoc Bioinformatics. 2013;42(1):3.1.1–3.1.8.
  20. Guo Y, Ma C, Wang S, Wu X, Yang F, Zeng S. Structural and molecular dysfunctions in granulosa cells: A key contributor to porcine follicular atresia. Reprod Biol. 2025;25(2):101008. pmid:40043493
  21. Tegenfeldt F, Kuznetsov D, Manni M, Berkeley M, Zdobnov EM, Kriventseva EV. OrthoDB and BUSCO update: annotation of orthologs with wider sampling of genomes. Nucleic Acids Res. 2025;53(D1):D516–22. pmid:39535043