A resource for improved predictions of Trypanosoma and Leishmania protein three-dimensional structure

AlphaFold2 and RoseTTAfold represent a transformative advance for predicting protein structure. They are able to make very high-quality predictions given a high-quality alignment of the protein sequence with related proteins. These predictions are now readily available via the AlphaFold database of predicted structures and AlphaFold or RoseTTAfold Colaboratory notebooks for custom predictions. However, predictions for some species tend to be lower confidence than model organisms. Problematic species include Trypanosoma cruzi and Leishmania infantum: important unicellular eukaryotic human parasites in an early-branching eukaryotic lineage. The cause appears to be due to poor sampling of this branch of life (Discoba) in the protein sequences databases used for the AlphaFold database and ColabFold. Here, by comprehensively gathering openly available protein sequence data for Discoba species, significant improvements to AlphaFold2 protein structure prediction over the AlphaFold database and ColabFold are demonstrated. This is made available as an easy-to-use tool for the parasitology community in the form of Colaboratory notebooks for generating multiple sequence alignments and AlphaFold2 predictions of protein structure for Trypanosoma, Leishmania and related species.


Introduction
Machine learning approaches to protein structure prediction have crossed a critical success threshold. While predicting the three-dimensional structure of a protein from sequence alone is still unsolved problem, a multiple sequence alignment (MSA) of the target protein sequence with related proteins provides key additional information. Cutting edge approaches using such MSAs now have the potential to reach very high accuracy. MSAs are the input for AlphaFold2 [1] and RoseTTAfold [2], with AlphaFold2 reaching the highest accuracy prediction at the most recent Critical Assessment of protein Structure Prediction (CASP) competition (CASP14 [3])-an accuracy comparable to experimental protein structure determination. AlphaFold2predicted structures for the near-whole proteome of 21 species [4] has been made publicly available, and tools like ColabFold [5] and the official AlphaFold Colaboratory notebook [6] make custom predictions easily accessible. Trypanosomatids pose a challenge because of their large evolutionary distance from common model eukaryotes [7,8]. This order includes important unicellular human, animal and plant parasites, including the human infective Trypanosoma cruzi, Trypanosoma brucei and many human-infective Leishmania species. T. cruzi and Leishmania infantum are the most deadly of these species and were included in the initial 21 AlphaFold whole proteome predictions [1,4]. Trypanosomatids are members of an early-branching eukaryote linage (Discoba) which also includes the less common, but still deadly, pathogen Naegleria fowleri. Other speciose Discoba lineages are Euglena and Diplonema, unicellular aquatic organisms and important and abundant auto-and heterotrophic plankton respectively. An initial inspection of the AlphaFold database (alphafold.ebi.ac.uk) suggested protein structure prediction accuracy for T. cruzi and L. infantum is often low-particularly for kinetoplastid specific proteins-based on self-reported prediction quality scores. Many of these proteins are vital, like the unconventional kinetochore proteins [9].
Discoba diversity is less well sampled by genomes and transcriptomes than lineages like plants or metazoa, making construction of deep MSAs more difficult. This is important as MSAs encode additional structural information beyond the primary protein sequence alone: They capture evidence for co-evolution of different regions of the primary sequence which may correspond to proximity or interaction in the three-dimensional structure. AlphaFold2 and RoseTTAFold prediction of protein structure is greatly improved by high MSA quality and depth, with high MSA coverage critical for high confidence predictions [1,2]. While new approaches [10] are trying to move beyond multiple sequence alignments, MSAs will remain a powerful source of information.
Currently, the input databases for the AlphaFold database and the ColabFold notebook are of UniRef [11] plus environmental sample sequence databases (BFD, Uniclust and MGNify [12][13][14]). However, these databases have relatively poor coverage of Discoba. It appears that a significant quantity of genomic and transcriptomic data available in the community genome resource TriTrypDB [15,16], the NCBI genome [17], transcriptome shotgun assembly (TSA) [18] and sequencing read archive (SRA) [19] databases are not incorporated. It seemed likely that an improved database is a simple opportunity to improve protein MSAs for protein structure predictions for Trypanosoma, Leishmania and other Discoba species.
Here, protein sequence data was gathered into a comprehensive Discoba database and the ColabFold MMSeqs2-based pipeline [5,20] was modified to also include the result of a HMMER search of Discoba. Using a test set of 30 L. infantum proteins, MSA coverage was always improved, leading to increased AlphaFold2 prediction accuracy in 2/3 of cases. Improvements were greatest for kinetoplastids-specific proteins, with dramatic improvements often possible. The necessary tools to make similar protein structure predictions have been made openly available: The Discoba protein sequence database (for custom searches and MSA generation), Colaboratory notebooks for generating MSAs by HMMER or MMSeqs2 (for use in AlphaFold2 or RoseTTAFold implementations), and a standalone Colaboratory notebook for AlphaFold2 structure predictions based on ColabFold incorporating a search of the Discoba database. These tools are available at github.com/zephyris/discoba_alphafold.

Discoba sequence data
Predicted protein sequences were gathered from 243 Discoba transcriptomes or genomes (S1 Table): 160 transcriptomes and 83 genomes. 152 from cultured populations (almost all axenic) and 91 from single cell samples. 238 were assembled giving a good number of predicted protein sequences (>500). The full set of sequences have been deposited as a Zenodo dataset (version 1.0.0) [21].
Predicted proteins from genome sequencing were gathered from two sources: TriTrypDB: All 53 trypanosomatid species available in TriTrypDB [15,16] release 53, using the provided predicted protein sequences on TriTrypDB where available. For the 17 without predicted protein sequences the translation of all predicted open reading frames (ORFs) over 100 amino acids were used, as kinetoplastids typically have compact genomes with short intergenic sequences and extremely low occurrence of introns [22]. NCBI Genomes: 32 genomes for Discoba species. For the 14 with predicted protein sequences on NCBI the existing prediction was used. For the 18 without predictions, the translation of all ORFs over 100 amino acids were used.
Sequencing read archive (SRA): 17 whole genome sequencing (WG-seq) datasets from axenic cultures of Discoba species and 32 single cell WG-seq datasets. For each, assembly was carried out using Velvet [23,24] (see Genome assembly) and all predicted ORFs over 100 amino acids were used.
Predicted proteins from transcriptome sequencing were gathered from three sources: Transcriptome shotgun assembly (TSA) database: 11 transcriptomes for Discoba species, using protein sequence predicted by TransDecoder [25]. Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP): 2 transcriptomes for Discoba species, using the provided protein sequences which were predicted using TransDecoder. NCBI SRA: 19 mRNA-seq datasets from axenic cultures of Discoba species, 2 mRNA-seq datasets from mixed cultures including a Discoba species and 59 single cell mRNA-seq datasets. For each, transcriptome assembly was carried out using Trinity [26-28] followed by protein sequence prediction with TransDecoder [25] (see Transcriptome assembly).

Transcriptome assembly
Transcriptome assembly from RNA-seq data used a standardised pipeline, with the same approach used for axenic culture, mixed culture and single cell transcriptomic data. Reads were first error corrected using Rcorrector v1.0.4 [29,30] (using Jellyfish v2.3.0 [31, 32]) and corrected reads tidied using TranscriptomeAssemblyTools [33]. Any remaining adaptor sequences were trimmed using TrimGalore v0.6.0 [34] (using Cutadapt v2. 8 [35, 36]) then an assembly was generated using Trinity v2.12.0 [26 -28]. Many of these species use polycistronic transcription with a single spliced leader sequence trans-spliced onto the start of all mRNAs. As such common sequences may affect assembly, a two-step approach was used. First, a trial assembly using 1,000,000 reads (or all reads if fewer were available) was generated and the common spliced leader sequence identified using a custom script. Cutadapt was then used to trim reads to remove the spliced leader, then a final assembly was generated using 40,000,000 reads (or all reads if fewer were available). Very similar transcript sequences were removed using cd-hit-est v4.8.1 (part of CD-HIT [37, 38]) then remaining sequences translated to predicted proteins using TransDecoder v5.5.0 [25] LongOrfs.

Genome assembly
Genome assembly from WG-seq data also used a standardised pipeline. For single cell genomic data, reads were first error corrected using Rcorrector v1.0.4 [29,30] (using Jellyfish v2.3.0 [31, 32]) and TranscriptomeAssemblyTools [33]. For all assemblies, any remaining adaptor sequences were trimmed using TrimGalore v0.6.0 [34] (using Cutadapt v2. 8 [35, 36]) then an assembly was generated using Velvet v1.2.10 [23, 24] using all available reads. As expected coverage and insert size are not necessarily known, a refinement step was used. Reads were aligned to the assembly using bwa mem v0.7.17 [39,40] and insert size and mean coverage determined using samtools v1. 10 [41, 42], then a final assembly was generated using Velvet including these parameters and a minimum coverage threshold of 0.25 the mean trial assembly coverage. All open reading frames �300 bp (all three frames, both strands) were identified using a custom script.

Intrinsically disordered domains
Intrinsically disordered domains were predicted using IUPred2A [50] using a score threshold of 0.5 for classification of a residue as disordered.
AlphaFold2 predictions incorporating the new Discoba protein sequence database were carried out using a modified version of ColabFold [5,20]. The MMseqs2 [51] search of UniRef [11] and environmental sample sequence databases [12][13][14], was supplemented with a HMMER (part of HH-suite) [54] search of the Discoba protein database described here. A MMSeqs2 search of the Discoba protein database was also trialled but use of HMMER for Discoba searches typically gave slightly higher pLDDT, presumably as AlphaFold v2.0 was trained using HMMER MSAs. Predictions were again done using AlphaFold2 parameters from 2021-07-14, not using Amber relaxation and not using PDB templates.

Results
AlphaFold2 self-reports confidence in predictions through two measures: predicted local distance difference test score (pLDDT) [55], a per residue 0 to 100 score with high values showing higher confidence, and predicted average error (pAE), a per residue pair distance score with low values showing lower error. Many AlphaFold2-predicted protein structures for Leishmania infantum and Trypanosoma cruzi (from the AlphaFold database alphafold.ebi.ac.uk [1,4]) had high pLDDT and low pAE. This is an impressive achievement, however using the Alphafold database pLDDT proteome-wide performance of AlphaFold2 can be evaluated quantitatively. For comparison mouse (Mus musculus) was selected as, unlike human proteins, predictions were carried out without special treatment. Overall, L. infantum and T. cruzi protein structure predictions are skewed to lower pLDDT (Fig 1A).
Lower pLDDT could be explained by more disordered protein domains, as predictions for these regions correlate with low pLDDT [1]. L. infantum and T. cruzi proteins do not have a markedly different predicted degree of disorder to M. musculus (Fig 1B) although, as expected [1], pLDDT had a negative correlation with disorder score in all three species (Fig 1C). Alternatively, it may be a limitation due to the depth of the input protein MSAs. pLDDT correlated with number of orthologs detected using OrthoFinder [43][44][45] on a set of 77 reference proteins of diverse eukaryotes (Fig 1D), indicating that MSA depth is a likely explanation.
Unlike M. musculus, the distribution of number of orthologs for L. infantum and T. cruzi was strongly bimodal with many having fewer than 10. These proteins had, on average, markedly lower pLDDTs (Fig 1D). Analysis using more stringent measures of protein specificity to the kinetoplastids showed a similar pattern: Kinetoplastids-specific proteins were identified as those with only reciprocal best sequence search hits among the kinetoplastids (Fig 1E) or those with only orthogroup members among the kinetoplastids (Fig 1F), and both sets had low pLDDT (Fig 1E, 1F). Overall, this confirms that MSA quality is likely the limiting factor for many T. cruzi and L. infantum protein structure predictions.
As much protein sequence data as possible was therefore gathered for Discoba species, drawing upon both TriTrypDB [15,16] (well known to the Trypanosoma and Leishmania community), and lesser known, unpublished or very recent data available via nucleotide sequencing databases, gathered using the NCBI taxonomy browser (see Methods, S1 Table). Many of these proteomes are in UniParc, but seemingly not used to build UniRef100 which is one of the key databases used by the AlphaFold database and ColabFold. Unlike many applications, precise knowledge about sample or species identity, high sample purity and high transcriptome/genome coverage are not critical-therefore the gather was as inclusive as possible. This database ultimately included 238 predicted proteomes, representing 1.45 billion amino acids across 4.3 million protein sequences.
To benchmark any improvements over the AlphaFold database predictions a set of 30 L. infantum proteins were selected: 10 random proteins which have orthologs in many diverse eukaryotes, 10 random proteins which appear unique to the kinetoplastid lineage (no orthogroup members outside the kinetoplastids) and 10 random proteins which appeared 'promising' and likely to have globular domains but with a low pLDDT in the AlphaFold  [1,4]. Scores for very low, low, confident and very high confidence categories are the same as used on the website. B) Distribution of perprotein average IUPred score for the same three species. C) Correlation of per-protein average pLDDT with IUPred score for the same three species. D) Correlation of per-protein average pLDDT with number of orthologs (total number of orthogroup members determined from a diverse set of eukaryotes, see Methods). A random number between 0 and 1 was added to each ortholog count to better represent point density at low ortholog numbers. E,F) Distribution of per-protein average pLDDT for all L. infantum and T. cruzi proteins lacking an ortholog outside of the kinetoplastids, as determined by either E) orthogroup members only in kinetoplastid species (1509 and 7181 proteins respectively) or F) reciprocal best protein sequence search hits only in kinetoplastids species (2361 and 11723 proteins respectively). database. The latter were selected based on size (avoiding small proteins, ≳300 amino acids), lack of low complexity or repetitive regions (≲30% unstructured and manually avoiding repeats), orthologs in few species (≲10), without numerous paralogs, and low average pLDDT (≲60).
To carry out AlphaFold2 protein structure predictions ColabFold was selected as a fast but high accuracy and accessible AlphaFold2 implementation [5,20]. As expected, unmodified ColabFold gave per-protein mean pLDDTs comparable to, but on average slightly lower than, the AlphaFold database for the test proteins (Fig 2A). Lower pLDDTs may be through Colab-Fold's use of MMSeqs2 rather than HMMER, on which AlphaFold2 was originally trained, for MSAs. ColabFold was then modified to generate a HMMER-generated MSA from the Discoba database and append this to the default MSA, before carrying out the AlphaFold2 prediction. This extended MSA improved mean pLDDT for a large majority of protein structure predictions, whether compared to the AlphaFold database or unmodified ColabFold, with less confident predictions seeing the largest improvement (Fig 2B and 2C). pLDDT increase occurred at all confidence levels within a protein. Using the confidence thresholds in the AlphaFold database, the proportion of residues over the threshold for a low confidence (>50), confident (>70, high confidence in backbone structure) and high confidence (>90, likely correct sidechain rotamers) prediction almost all increased for a large majority of proteins (Fig 2D).
Improvement was most marked among the test proteins not conserved outside of the kinetoplastids, especially the ones selected as 'promising' (Fig 2A). Inspection of these predictions showed a range of improvements, best interpreted from plots of pAE which show a pairwise measure of predicted error in residue-residue spacing. Improvements included overall large decreases in pAE (Fig 3A), the first high confidence prediction of any folds (Fig 3B) and the prediction of a single high confidence domain (one contiguous block of low pAE) rather than two subdomains (Fig 3C).

Discussion
This work shows that significant improvement in the pLDDT and pAE of AlphaFold2 structure predictions is possible for Trypanosoma and Leishmania proteins, relative to the publicly available AlphaFold database at alphafold.ebi.ac.uk [1,4] and the open tool ColabFold [5,20] (Fig 2). This was simply by designing a protein sequence database for MSA generation more appropriate for the Discoba branch of eukaryotic life. Easy-to-use tools for MSA generation and AlphaFold2 structure prediction exploiting this Discoba protein sequence database have been made available at github.com/zephyris/discoba_alphafold. Even in these early-branching eukaryotes, the huge advance AlphaFold (and RoseTTAfold) represent can therefore, to a great extent, translate protein structure determination into a genome and transcriptome sequencing problem. Although, experimental protein structure determination will continue to be vital to confirm predictions, explore dynamics and complexes, etc.
Structure prediction improvement was most marked for proteins specific to the kinetoplastids. A large proportion of trypanosomatid parasites' genomes falls into this group-several hundred to thousands depending on definition (Fig 1E and 1F). Many of these proteins lack any domains detectable by primary sequence (sometimes called the 'dark proteome') making a structure prediction a first insight into potential function. However, improvement in protein Predicted protein structures for three example L. infantum proteins showing, from left to right, the AlphaFold database structure, the predicted structure using ColabFold supplemented with a HMMER search of the Discoba database described in this study, the pairwise pAE for the AlphaFold database structure and the pAE for the structure predicted in this study.
https://doi.org/10.1371/journal.pone.0259871.g003 structure prediction at all levels are valuable. It may allow a high-confidence prediction of vital kinetoplastid proteins with orthologs in many parasite species, allowing analysis of high specificity small molecule docking. This is of potential importance for drug development.
Improvement is certainly not guaranteed for any individual protein: Proteins well conserved across diverse eukaryotes will already have deep MSAs giving high confidence structural prediction (cf. Fig 1D). Proteins which have intrinsically disordered domains (eg. many RNA binding proteins) or only gain structure as part of a multisubunit structure (eg. many ribosome proteins) are unlikely to see significant improvement (cf. Fig 1C). Also, proteins which are extremely fast-evolving, or recent innovation found only in a few species, are less likely to benefit. Properties of kinetoplastids chromosome organisation may enable future developments. The order of genes on chromosomes is well conserved [56], sometimes providing additional information which allows identification of orthologs which are difficult or impossible to detect based on primary sequence alone (eg. Basalin [57]) which may allow even deeper MSA generation.
Overall, this work highlights both the importance of sequencing diverse organisms, for example animal pathogens related to human pathogens and non-pathogenic relatives, and ensuring that this data is made available through nucleotide sequencing, genome and proteome databases. It also emphasises that protein sequence data need to be carefully gathered before embarking on important or large-scale structural predictions: Even carefully selected representative databases often retain biases towards model organisms. Similar protein structure prediction improvements are likely possible for other branches of eukaryotic life.
Supporting information S1