Searching algorithm for Type IV effector proteins (S4TE) 2.0: Improved tools for Type IV effector prediction, analysis and comparison in proteobacteria

Bacterial pathogens have evolved numerous strategies to corrupt, hijack or mimic cellular processes in order to survive and proliferate. Among those strategies, Type IV effectors (T4Es) are proteins secreted by pathogenic bacteria to manipulate host cell processes during infection. They are delivered into eukaryotic cells in an ATP-dependent manner via the type IV secretion system, a specialized multiprotein complex. T4Es contain a wide spectrum of features including eukaryotic-like domains, localization signals or a C-terminal translocation signal. A combination of these features enables prediction of T4Es in a given bacterial genome. In this study, we developed a web-based comprehensive suite of tools with a user-friendly graphical interface. This version 2.0 of S4TE (Searching Algorithm for Type IV Effector Proteins; http://sate.cirad.fr) enables accurate prediction and comparison of T4Es. Search parameters and threshold can be customized by the user to work with any genome sequence, whether publicly available or not. Applications range from characterizing effector features and identifying potential T4Es to analyzing the effectors based on the genome G+C composition and local gene density. S4TE 2.0 allows the comparison of putative T4E repertoires of up to four bacterial strains at the same time. The software identifies T4E orthologs among strains and provides a Venn diagram and lists of genes for each intersection. New interactive features offer the best visualization of the location of candidate T4Es and hyperlinks to NCBI and Pfam databases. S4TE 2.0 is designed to evolve rapidly with the publication of new experimentally validated T4Es, which will reinforce the predictive power of the algorithm. The computational methodology can be used to identify a wide spectrum of candidate bacterial effectors that lack sequence conservation but have similar amino acid characteristics. This approach will provide very valuable information about bacterial host-specificity and virulence factors and help identify host targets for the development of new anti-bacterial molecules.


Introduction
Proteobacteria have evolved specific effector proteins to manipulate host cell gene expression and processes, hijack immune responses and exploit host cell machinery during infection. These proteins are secreted by ATP-dependent protein complexes named type IV secretion systems (T4SS). Some T4Es have been identified and shown to be crucial for pathogenicity. To facilitate the identification of putative T4Es, we previously developed a bioinformatics tool called S4TE 1.0 (Searching Algorithm for Type IV secretion system effector proteins) [1].
In the present article, we present the second version of 'S4TE'. S4TE 2.0 is a tool for in silico screening of proteobacteria genomes and T4E prediction based on the combined use of 14 distinctive features. In this updated version, modules searching for promoter motifs, homology, NLS, MLS and E-block are more efficient. A new module has been added in the workflow to locate phosphorylation (EPIYA-like) domains. S4TE 2.0 consists of the S4TE 1.4 tool and a web interface available to non-commercial users at http://sate.cirad.fr. The web interface is designed to make S4TE 2.0 easy to use for biologists and more time efficient. Most of the genomes and plasmids available in the NCBI database of pathogenic bacteria that have a type IV secretion system have been loaded into the S4TE 2.0 database so effectors can be predicted in only a few clicks. S4TE 2.0 offers advanced users an expert mode (S4TE-EM) they can use to customize S4TE 2.0 search parameters (e.g. exclude modules, modify module weightings). In this mode, S4TE 2.0 can be used as 14 independent programs to search for particular features in a given bacterial genome (e.g. NLS, C-ter charges).
A new function for comparative genomics (S4TE-CG) has been added to compare up to four predicted effectomes in just a few seconds.
All S4TE 2.0 results are interactive and linked to NCBI and Pfam databases. Modified or novel searching modules in S4TE 2.0 Promoter motif search. As several T4Es in a given bacterium can be subjected to coordinated regulation with the same protein, e.g. PmrA [2], we used S4TE 2.0 to conduct a search for conserved motifs (potential regulatory motifs) in the short promoter regions of the genes. The aim was to improve S4TE 2.0 prediction of possible regulons of T4Es. Enriched DNA motifs were searched in the intergenic region upstream of the start codon, using MEME [3]. Eight consensus motifs were identified in different bacteria (Fig 1). The corresponding motif search module of S4TE 2.0 extracts the 5' Flanking intergenic regions (5' FIRs) and searches for all these motifs thanks to a position-specific scoring matrix generated from multiple sequence alignments with the promoters of known T4Es. Only alignments with a score above the chosen threshold are selected. The threshold that yielded the highest sensitivity and specificity for each motif in the corresponding bacterium was chosen (Fig 1).

Programming
Homology. BLAST 2.2 was used to compare proteins to search for homologies with known T4Es [4]. The cut-off of the S4TE 1.0 homology module was changed. S4TE 2.0 compares the database containing all known T4Es with the query proteome and returns all orthologs with a cut-off of the expected value (E) <10-4. This E-value cut-off was selected to find accurate orthologs between phylogenetically distant bacterial species. Databases containing proven effectors have also been updated (S3 Dataset).
Nuclear localization signals (NLS). NLS are protein sequences that target proteins in the nucleus of eukaryotic cells [5]. We assume that the occurrence of NLS in a bacterial protein sequence would be a good indicator of secretion. There are two classes of NLS, monopartite and bipartite. In S4TE 2.0, the search for monopartite NLS has been improved according to Ruhanen et al. [6]. We rewrote this module to add more known NLS motifs in the search.  . The new module was tested with a dataset of 32 NLS and 32 no-NLS containing proteins (S1 Dataset). The module selected 24 true positives (TP) and only three false positives (FP). This represents a sensitivity (Se) of 75% and a specificity (Sp) of 91%.
Mitochondrial localization signals (MLS). MLS are signal sequences located in the Nterminus of proteins that are targeted to mitochondria. This sequence is cleaved after translocation of the protein inside the mitochondria [5,7]. To predict MLS in S4TE 2.0, we used the MitoFates tool [8]. MitoFates predicts mitochondrial presequences, a cleavable localization signal located in the N-terminal, and its cleaved position.
E-block. The E-block domain consists of a glutamate sequence rich in C-terminal 30 amino acids and is associated with T4Es translocation in L. pneumophila. Huang et al. showed that an E-block motif is also important for the translocation of T4SS substrates [9]. In S4TE 2.0, the E-block module was modified according to Lifshitz et al. [10]. The E-block was searched in a window of 22 amino acids between position -4 C-terminal and -26 C-terminal. The motif that is searched for is a motif of 10 amino acids containing three or more glutamate (E) residues. The module was tested on 98 E-block and 98 no-E-block containing proteins (S2 Dataset). This module selected 60 TP and only 6 FP (Sensitivity of 61%, Specificity of 94%).
Pfam database. The local Pfam database has been updated to find more eukaryotic domains of known effectors of Legionella pneumophila [10]. Eukaryotic domains were extracted from the whole Pfam database and added to the S4TE 2.0 workflow. All eukaryotic domains used for this search are listed in S4 Dataset.
EPIYA search. EPIYA search is a new module implemented in S4TE 2.0. Bacterial EPIYA effectors are delivered into host cells by T4SS, where they undergo tyrosine phosphorylation at the EPIYA motif and thereby manipulate host signalling by tight interaction with SH2 domain-containing proteins [11]. In H. pylori, EPIYA has been shown to contribute to the secretion of a CagA effector [12]. Moreover the functional versatility of EPIYA motif highlights the importance of this emerging family of bacterial effectors [11]. We searched for conserved EPIYA motifs (EPIYA, ENIYE, NPLYE, EHLYA, TPLYA, EPLYA, ESIYE, EDLYA, EPIYG, EPVYA, VPNYA, EHIYD) in different bacteria that have a type IV secretion system and we searched for hypothetical EPIYA motifs using the motif E-X-X-Y-X.

Result
Validation S4TE 2.0 is a software with 14 independent modules. We tested all the modules independently. The 14 modules were weighted to make S4TE 2.0 efficient. The weighting of each module was calculated according to its performance in finding effectors in L. pneumophila Philadelphia I which has been shown to have the most extensive repertoire of T4Es ever identified, with 286 confirmed effectors [10].
Each module has its own weighting in S4TE 2.0 searches. The weightings were calculated for each module based on their Positive Predictive Value (PPV [PPV = TP/(TP+FP)]) for L. pneumophila ( Table 1).
The S4TE 2.0 prediction threshold was then defined to enable the best prediction by disregarding homology with known effectors. The threshold was chosen by examining the Sensitivity (Se), Specificity (Sp), Positive Predictive Value (PPV), Negative Predictive Value (NPV) and Accuracy (Acc) for thresholds ranging from 40 to 120 on the test dataset (Fig 2). The threshold was set at a score of 72 to obtain the best global PPV possible with the least possible impact on sensitivity.
With this update, S4TE 2.0 prediction is more powerful than that of S4TE 1.0 whose sensitivity was 14% lower. Without homology, sensitivity increased by 25%. Other characteristics including specificity, accuracy and negative predictive value did not change significantly ( Table 2). S4TE 2.0 allows flexible, highly sensitive and specific detection of new putative T4SS effectors.

SATE-CG
S4TE-CG is a new tool designed to compare different repertoires of putative T4Es identified by S4TE 2.0. The corresponding S4TE-CG algorithm is described in Fig 3. The user can compare up to four effectomes simultaneously. S4TE 2.0 results from selected genomes (effectomes) are compared with Blastp 2.2 with an expected value (E) cut-off of <10-4 to find homologous proteins in each effectome. S4TE-CG successively compares all effectomes in a pairwise manner, the overlaps between the effectomes of each genome are calculated and the

Availability and future directions
Command-line interface. S4TE 2.0 is a web interface but the S4TE 1.4 package is freely available to non-commercial users at http://sate.cirad.fr/S4TE-Doc.php. All programming was done using Perl 5.18 and BioPerl 1.6.1. The software runs on Linux platforms (Ubuntu 14.04 and Mac OS X). All required packages and the installation process are described in the user guide included in the package. The user guide also details S4TE options for running S4TE. By default, the command line to launch S4TE is ./S4TE.pl -f "Genbank_file" in the S4TE folder (cd way_to_S4TE/S4TE/). Some options are available for the user to launch S4TE: -c, suppression of a module in the pipeline; -w, modification of the weight of each module in the pipeline; -t, imposition of a threshold for effector selection. Each S4TE module creates a .txt file in the folder way_to_S4TE/S4TE/Jobs/job<Name _of_genome_folder><year><month><day><hour><min> All the results are compiled in the CompilationFile.txt and Results.txt in the same folder. Web interface. The S4TE 2.0 website is powered from scratch on the 'CIRAD web server'. All the features of the web site were tested on common web browsers. S4TE 2.0 found T4Es in large genome database (S5 Dataset) available to all users. A user account is available and necessary to keep your jobs for up to three months, to import your own genome in a S4TE 2.0 temporary database and to ask to add a new proved effector in the database. The addition of an effector to the database must be accompanied by a reference (scientific article) and will be checked manually before the effector is added to the database. Those who subscribed to the newsletter will be notified by email about the addition of new effectors to the database and the effector will be visible in the S4TE 2.0-tab strip. This free account allows users to search for proteins in the S4TE 2.0 database using the name, the locus tag or NCBI number of a protein in the search bar. The account also allows the user to subscribe to the S4TE newsletter that summarizes any changes made to the software and provide updates on the latest research on Type IV Effectors. S4TE 2.0 is a simple and user-friendly tool. S4TE 2.0 is a web-based user-friendly tool that gets results in only a few clicks. The user can locate a chromosome in more than 340 bacterial genomes and plasmids available in the database and the results can be viewed by clicking on run S4TE 2.0 (S5 Dataset).
If the desired genome is not available in the database, the user can import it with a GenBank file (.gbk). S4TE 2.0 will import the file to a temporary database for three months. All S4TE tools (S4TE-EM and S4TE-CG) can then be used on the uploaded genome.
The S4TE 2.0 web page allows users to read some of the news published in the newsletter. Five news items are visible on the S4TE2.0 web page, but all the news can be found by clicking on the bottom right link. Fig 4 presents some results obtained with S4TE 2.0. All the proteins in the selected genome are represented on the S4TE 2.0 web results page. A score was calculated for each protein based on the weighting of each module. Proteins were ranked according to the same score. All proteins whose scores are above the threshold are considered as belonging to the S4TE 2.0 effectome. An iconography was created to help read the list (Fig 4A). Users can find all the details concerning each characteristic of a given protein by clicking on the protein on the web results page. Characteristics are easy to find by highlighting the corresponding sequence in the effector sequence. These characteristics are detailed below the sequence. (B) Distribution of S4TE 2.0 predicted type IV effectors (T4Es) according to local gene density. The predicted T4Es are plotted according to the length of their flanking intergenic regions (FIRs). All A. phagocytophilum genes were sorted into 2-dimensional bins according to the length of their 5 0 (yaxis) and 3 0 (x-axis) FIRs. The number of genes in the bins is represented by a color-coded density graph. Genes whose FIRs are both longer than the median FIR length were considered as gene-sparse region (GSR) genes. Genes whose FIRs are both below the median value were considered as gene-dense region (GDR) genes. In-between region (IBR) genes are genes with a long 5 0 FIR and short 3 0 FIR, or inversely. Candidate T4Es predicted using the S4TE 2.0 algorithm were s plotted on this distribution according to their own 3 0 and 5 0 FIRs. A color is assigned to each of the three following groups: Red to GDRs, orange to IBRs, and blue to GSRs. (C) Genome-wide distribution of predicted effectome according to the G+C content. From outer track to inner track, sense and antisense genes (black), S4TE 2.0 putative T4Es (pink), proved T4Es (turquoise), S4TE 2.0 putative T4Es in genomic region with low G+C content (yellow), S4TE 2.0 putative T4Es in genomic region with high G+C content (blue), G+C � average G+C (red), G +C < average G+C (green). https://doi.org/10.1371/journal.pcbi.1006847.g004 When a user runs S4TE 2.0, in addition to the results page, two graphs are automatically drawn. The first shows the distribution of predicted effectors according to local gene density (Fig 4B). The second one displays the distribution of predicted T4Es according to the G+C content along the genome (Fig 4C).
S4TE-EM expert mode for accurate searching. S4TE-EM is the expert mode of S4TE 2.0. S4TE-EM allows the user to modify the weights of each module and to deactivate one or more modules in the search (Fig 5). The weight of a module can be changed by moving the position of the cursor next to the name of each module. Weightings can be changed between the lowest weighting available for the module and the threshold of S4TE 2.0 (t = 72). The lowest weight is calculated independently for each module as a function of the positive predictive value and corresponds to a value equal to 0.5. Users can also cancel one or more modules in the pipeline by unchecking the box next to the name of the module (Fig 5).
All the modules are independent, and users can use S4TE-EM to locate the same characteristic throughout the genome. For example, if the user disables all the modules except NLS, The right side provides some information about the page. The right side matches the user account. The user account shows all the jobs previously ran in S4TE 2.0 and S4TE-CG. This account makes it possible to search for a protein with the search bar and to ask to add a proven T4 effector in the database. In the central part of the workspace, the user can select a genome in the drop-down menu. In S4TE-EM, the user can change the weighting or disable one or more modules (on the left) shown in the S4TE diagram (on the right) and run S4TE-EM by clicking on the 'Run S4TE-EM' button.  (Fig 3). Information about each effector can easily be found by clicking on the name of the effector in the table. Or users can simply copy and paste the table in a .csv file.

Conclusions
This paper presents updated S4TE software. The computational tool is designed to predict the presence of T4SS effector proteins in bacteria. The identification of T4Es and some characteristics are improved in this update. Compared with a machine learning approach, using S4TE 2.0 to predict T4Es in Legionella and Coxiella species [10,13,14] improved sensitivity (98% for S4TE 2.0 and 89% for Wang et al.) and equivalent specificity (97% for Wang et al. and 93% for S4TE 2.0). S4TE 2.0 is easy to use: only an internet connection and a few clicks are needed to search for T4Es in more than 340 bacterial genomes and plasmids. The results are displayed instantaneously for easy reading. An automated pipeline is also provided to analyze and visualize effector distribution in the genome according to G+C content and local gene density. S4TE 2.0 results are linked to bioinformatics databases like NCBI and Pfam. The S4TE 2.0 database is designed to evolve and will be updated by adding new proven effectors and new bacterial genomes. S4TE 2.0 not only predicts the T4Es but also their subcellular localization (NLS, MLS, prenylation) and the function of these proteins (Coiled coils, EPIYA, Euk-like, etc.). All these features make S4TE 2.0 a powerful software for studies of T4Es. S4TE 2.0 also offers an expert mode, which allows users to make manual adjustments to the weight of the modules. Each module that searches for a feature or a characteristic can be used independently. Thus, S4TE-EM can be viewed and used as 14 independent programs. This could facilitate the annotation of new genomes by looking for specific features such as NLS, prenylation domains, etc.
Finally, S4TE-CG makes it possible for users to compare effectomes to highlight core T4 effectomes and/or accessory T4 effectomes to understand how effectomes evolved and may provide clues to the specificity of different strains.