GLADX: An Automated Approach to Analyze the Lineage-Specific Loss and Pseudogenization of Genes

A well-established ancestral gene can usually be found, in one or multiple copies, in different descendant species. Sometimes during the course of evolution, all the representatives of a well-established ancestral gene disappear in specific lineages; such gene losses may occur in the genome by deletion of a DNA fragment or by pseudogenization. The loss of an entire gene family in a given lineage may reflect an important phenomenon, and could be due either to adaptation or to a relaxation of selection that leads to neutral evolution. Lineage-specific gene loss analyses are therefore important for understanding the evolutionary history of genes and genomes. To perform this kind of study on the increasing number of complete genome sequences available, we developed a new software module called GLADX within the DAGOBAH framework, based on a comparative genomic approach. The software automatically detects, for all the species of a phylum, the presence or absence of a representative of a well-established ancestral gene and, through systematic re-annotation steps, confirms losses, detects and analyzes pseudogenes, and finds novel genes. The approach relies on highly reliable gene phylogenies, on protein predictions and on the analysis of genomic mutations. Together with the evolutionary approach, this evidence provides accurate information for building an overall view of the evolution of a given gene in a selected phylum. The reliability of GLADX has been successfully tested on a benchmark analysis of 14 reported cases. It is the first tool able to study lineage-specific losses and pseudogenizations fully automatically. GLADX is available at http://ioda.univ-provence.fr/IodaSite/gladx/.


Introduction
GLADX is a module included in a software application: DAGOBAH (Gouret et al., 2011). As its name indicates (Gene Loss Analyzer DAGOBAH eXtension), it is dedicated to the automatic detection and analysis of gene losses and pseudogenizations.
All these components form the lab's bioinformatics software platform, called T.O.W.E.R (Tools Operating With Evolutive Resources). GLADX works within the TOWER framework.
For us and for external users, TOWER is now very complex to install, because one has to deploy many software components, many bioinformatics binaries, many databases and a large amount of genomic data.
We therefore chose a virtualization strategy, that is, the installation of all of TOWER's components on a virtual machine image. Several image instances can be started, as virtual computers, on computers that have virtualization software such as VirtualBox or VMWare, and on Clouds.

Technical requirements
We decided to build a 64-bit Ubuntu 11.04 image on VirtualBox 4.1.2 (Oracle TM). Therefore, this image will work efficiently on host computers with a 64-bit architecture.
To run one image of TOWER, we recommend a four-core workstation with 4 GB of RAM (minimal configuration). By default, our image is configured to run with eight cores and 8 GB of RAM (the workstation used for our tests).
Please note that hardware virtualization technology (VT-x for Intel, AMD-V for AMD) has to be activated on the host computer in order to obtain the best performance.
Warning: using hyperthreading with OpenMPI, the software layer used to run bioinformatics software such as Tree-Puzzle and ClustalW in parallel, is not recommended because of reported bugs.
Note: the eight cores are not strictly required for the image; users can modify the scripts as follows:
• in /home/tower/TOWER_1.03/prod/FGX_API/scripts/puzzle_cmd_perl, replace "-np 8" with "-np X", where X is the number of cores you want to use
Regarding the network configuration of the image, NAT mode is set by default. This mode does not allow 'ssh' access, but it is much faster than Bridge mode.
Images can be run with or without the X Window GUI (which is quite slow under emulation). In NAT mode, RDP clients can be used to access a non-graphical image. In Bridge mode, one can use ssh.
Important: the password of the tower user is t0wer, in ssh or graphical mode.
The GLADX image can be downloaded from the following link: GLADX image. First uncompress it, then add it to VirtualBox with the GUI or with the "vboxmanage -registervm …." command.

GLADX launch
GLADX is started when the image boots (with VirtualBox). To start a gene study with GLADX, simply deposit one or several FASTA files (amino acid sequences) in the following directory:
/home/tower/GLADX_DATA
The FASTA files must be named as follows:
EnsemblProteinSequenceName.Taxid.fasta
Each file must contain a sequence in FASTA format, with a header in the following format:
>lcl|EnsemblProteinSequenceName|Taxid|Species~Name|OptionallyADescription
corresponding in this actual example:
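The naming rules above can be checked mechanically before files are deposited in /home/tower/GLADX_DATA. The following Python sketch is an illustration only and is not the example from the original guide: the function names, the regular expression, the identifier ENSP00000000000 and the sequence are all placeholders, and the check that the file name matches the header is an assumption derived from the naming convention, not a documented GLADX behaviour.

```python
import re

# Header layout described in the text:
# >lcl|EnsemblProteinSequenceName|Taxid|Species~Name|OptionalDescription
HEADER_RE = re.compile(
    r"^>lcl\|(?P<name>[^|\s]+)\|(?P<taxid>\d+)\|(?P<species>[^|]+)(?:\|(?P<desc>.*))?$"
)


def expected_filename(header):
    """Return the file name implied by a FASTA header line (Name.Taxid.fasta)."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"header does not follow the lcl|name|taxid|species layout: {header!r}")
    return f"{match.group('name')}.{match.group('taxid')}.fasta"


def check_fasta(filename, text):
    """True if the first line is a valid header that agrees with the file name."""
    first_line = text.splitlines()[0] if text else ""
    try:
        return expected_filename(first_line) == filename
    except ValueError:
        return False


# Placeholder record: identifier and sequence are made up for illustration.
record = ">lcl|ENSP00000000000|9606|Homo~sapiens|example protein\nMKTAYIAKQR\n"
print(expected_filename(record.splitlines()[0]))
print(check_fasta("ENSP00000000000.9606.fasta", record))
```

Running such a check on each file before copying it makes naming mistakes visible early, rather than at analysis time.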

Additional options:
Users can deactivate the automatic start of GLADX at boot by commenting out, with a '#', the line 'su tower -c /home/tower/TOWER_1.03/prod/DGH_2/start' in the file '/etc/rc.local'.
In this configuration, DAGOBAH must be launched with the 'start' command in a Terminal, from the directory /home/tower/TOWER_1.03/prod/DGH_2.
To stop DAGOBAH, press 'CTRL+C' or, alternatively, kill the process.
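Deactivating the automatic start amounts to prefixing a single line of /etc/rc.local with '#'. The Python sketch below shows that text transformation on an in-memory string; the function name and the sample file contents are hypothetical, and the sketch deliberately does not write to /etc/rc.local itself (editing the real file requires root privileges).

```python
START_LINE = "su tower -c /home/tower/TOWER_1.03/prod/DGH_2/start"


def comment_out(text, target=START_LINE):
    """Prefix every line equal to `target` (ignoring indentation) with '# '."""
    lines = []
    for line in text.splitlines():
        if line.strip() == target:
            line = "# " + line
        lines.append(line)
    return "\n".join(lines) + "\n"


# Hypothetical rc.local contents, for illustration only.
sample = "#!/bin/sh -e\n" + START_LINE + "\nexit 0\n"
print(comment_out(sample), end="")
```

Because an already commented line no longer equals the target, applying the function twice leaves the text unchanged.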

Choice of phylum and species studied
With its default parameters, GLADX analyzes lineage-specific gene losses in Euteleostomi (or from the closest ancestor towards the leaves, if the gene appeared later) by studying the orthologous group containing the reference protein given as input. By default, 22 chordate species are used, with the topology shown in figure 1.
Of these 22 species, 21 species of Euteleostomi are studied by default, because the 'orthologs_group_mode' parameter defined in the /home/tower/TOWER_1.03/prod/DGH_2/dagobah.xml file is set to analyze losses in Euteleostomi (taxid = 117571) in lineage mode. However, a larger phylum such as Chordates (including Ciona) can be analyzed by using the taxid 7711. Conversely, a smaller phylum can be studied by using the taxid of any ancestor described in figure 1. The number of species studied in a phylum can be modified by choosing, among the 22 species, those kept in the scope parameters (species_scope_for_phylogeny_study & species_scope_list_for_phylogeny_study).

Produced data and results
Results are automatically produced as .report files and database contents. Report files can easily be read with our user-friendly viewer, FGXView (/home/TOWER/FGXView). The most important result is the final species tree of the species set, on which all the results are pinpointed.

Database contents are of two kinds:
- FIGENIX results, stored in a PostgreSQL relational database: figenix_db
- DAGOBAH results, stored as an ontological database (see supplement 1) that also relies on a PostgreSQL relational database: dagobah_db.
Note that these databases can be deployed on our IODA web site through collaborations.

1) Manipulate databases (figenix_db and dagobah_db) in SQL:
To manipulate these databases, use the following commands in a Terminal:
• To back up the database in SQL format: pg_dump DatabaseName -f SavingFileName
Warning: there is an incompatibility with PostgreSQL versions higher than 8.2 (we used 8.4). When a new database is created, before installing the database you must connect to it (as the postgres user: "sudo su postgres", then "psql DatabaseName") and paste the text found in the /home/tower/jena_with_postgres_higher_than_8.2 file.
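The backup command above can be scripted for both databases. The Python sketch below only builds and prints the command lines; the helper name and the output file names (figenix_db.sql, dagobah_db.sql) are assumptions chosen for illustration. To execute a command, the argument list could be passed to subprocess.run(cmd, check=True) as a user allowed to read the database.

```python
import shlex


def pg_dump_command(database, output_file):
    """Build the argument list for: pg_dump DatabaseName -f SavingFileName."""
    return ["pg_dump", database, "-f", output_file]


for db in ("figenix_db", "dagobah_db"):
    cmd = pg_dump_command(db, f"{db}.sql")  # output file name is a placeholder
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems if a file name contains spaces.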
2) Manipulate the ontological database (dagobah_db only) in OWL:
The ontological results can be exploited with the Protege software and exported as ".OWL" files. To back up the dagobah_db database in OWL without using Protege, go to the DGH_2 directory and launch the following command: owldump NameOfBackup.owl

How to add new species retrieved from Ensembl?
The current GLADX version uses 22 species, but more species can be added with a few manipulations.

2) Modify the tree topology
The binary species tree defined in GLADX must contain all the species chosen for the analyses.

2.1) Database modifications
The species tree topology is stored in the dagobahtreeoflife table of the FIGENIX database (called figenix_db). The topology is described branch by branch: each taxid is linked to its parent taxid and to a description of its rank (class if it is an ancestral node, species if it is a leaf). An ancestral node must be linked to two taxids corresponding to its child nodes.
Warning: if you add new species that are outgroups of the species already present, the farthest ancestor must always be linked to the ghost root taxid 1.
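Before editing the dagobahtreeoflife table, the constraints above (binary ancestral nodes, leaves without children, farthest ancestor attached to taxid 1) can be checked on a list of rows. The Python sketch below assumes rows of the form (taxid, parent_taxid, rank); this row layout, the function name and the internal node identifiers "A1" and "A2" are hypothetical, since the actual column names of the table are not given here. Requiring exactly one node under the ghost root is also an assumption.

```python
from collections import defaultdict

GHOST_ROOT = "1"  # ghost root taxid named in the text


def check_topology(rows):
    """Return a list of problems found in (taxid, parent_taxid, rank) rows."""
    children = defaultdict(list)
    rank_of = {}
    for taxid, parent, rank in rows:
        children[parent].append(taxid)
        rank_of[taxid] = rank

    problems = []
    for taxid, rank in rank_of.items():
        n_children = len(children.get(taxid, []))
        if rank == "species" and n_children != 0:
            problems.append(f"leaf {taxid} has {n_children} child node(s)")
        if rank != "species" and n_children != 2:
            problems.append(f"ancestral node {taxid} has {n_children} child node(s), expected 2")
    roots = children.get(GHOST_ROOT, [])
    if len(roots) != 1:
        problems.append(f"expected exactly one node under ghost root {GHOST_ROOT}, found {len(roots)}")
    return problems


# Toy tree: "A1" and "A2" are placeholder identifiers for ancestral nodes.
rows = [
    ("A2", "1", "class"),
    ("A1", "A2", "class"),
    ("9544", "A2", "species"),
    ("9606", "A1", "species"),
    ("9593", "A1", "species"),
]
print(check_topology(rows))
```

An empty list means the rows satisfy the three constraints; each string in a non-empty list names the node that violates one of them.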

2.2) Provide the branch lengths
The branch lengths of the species tree topology are defined in the file /home/tower/TOWER_1.03/prod/DGH_2/src/project_specific.pl as follows: "tof_branch_length_to_node('taxid','branch_length')." You need to add all the new branch lengths.
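When several species are added, the corresponding lines can be generated rather than typed by hand. The Python sketch below formats a dictionary of branch lengths into lines of the form shown above; the function name and the numeric values are placeholders, not real branch lengths, and the generated lines still have to be pasted into project_specific.pl manually.

```python
def branch_length_facts(lengths):
    """Format {taxid: branch_length} as tof_branch_length_to_node lines."""
    return [
        f"tof_branch_length_to_node('{taxid}','{length}')."
        for taxid, length in lengths.items()
    ]


# Placeholder values for illustration only.
new_lengths = {"9606": 0.01, "9593": 0.012}
print("\n".join(branch_length_facts(new_lengths)))
```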

Can I use another kind of protein database?
Yes, but only in "simple mode". To use the simple mode, modify the dagobah.xml file located in /home/tower/TOWER_1.03/prod/DGH_2/.
1. Change the mode as described below:

GLADX parameters
Numerous parameters are available to adjust the behaviour of GLADX. Some are essential, such as the species and database used, the ortholog detection mode (from the reference sequence used, or from its outermost ortholog, depending on the selected phylum), and the mode of study (with or without verification of putative lost genes). These parameters must be defined before the analysis is launched. They are contained in an XML file accessible at:
"database('Path_database_used')" defines the path of the protein database used. The default path is '../AlgoTools/Blast/db/ensembl'.
B) Parameters defined in the geneloss_event_search agent:
"nucleotide_in_more_by_side(10000)" is the number of nucleotides taken on each side of a TBLASTN hit to output a prediction (the value must be identical to the genelosses_synthetic_analysis value). The default value is 10000.
"orthologs_group_mode(mode('TaxidAncestor'))" is the ortholog sequence analysis mode launched. There are two mode options: lineage or species (cf. article).
In lineage mode, GLADX searches for the subtree that has the TaxidAncestor ancestor as root and contains the reference given as input. All the sequences present in this subtree form an orthologous group. From this orthologous group, it deduces the lineage-specific losses by comparing the species present in the group with the species set selected for the study.
/!\ An agent that systematically analyzes all the nodes of the lineage leading from the selected ancestor to the input reference can be activated => see section G).
In species mode, it searches the phylogeny for the species that have orthologs of the reference protein given as input, up to the TaxidAncestor ancestor, and deduces losses by comparing the species that have an ortholog with the species set selected for the study.
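The comparison made in lineage mode can be pictured as set arithmetic on the species tree. The Python sketch below is a simplified illustration, not GLADX's implementation: it assumes the orthologous group has already been extracted from the gene phylogeny and is given as a set of species taxids, it represents the species tree as a child-to-parent dictionary, and the node identifiers "A1" and "A2" are placeholders. The species it returns are only candidate losses, which GLADX then verifies by re-annotation.

```python
def leaves_under(ancestor, parent_of, species_set):
    """Species of the study set that descend from `ancestor` in the species tree."""
    found = set()
    for species in species_set:
        node = species
        while node is not None:
            if node == ancestor:
                found.add(species)
                break
            node = parent_of.get(node)  # None once the root is passed
    return found


def lineage_losses(ancestor, parent_of, species_set, orthologous_group_species):
    """Species expected under `ancestor` but absent from the orthologous group."""
    expected = leaves_under(ancestor, parent_of, species_set)
    return expected - set(orthologous_group_species)


# Toy species tree (child -> parent); "A1" and "A2" are placeholder ancestors.
parent_of = {"9606": "A1", "9593": "A1", "A1": "A2", "9544": "A2", "A2": "1"}
study_set = {"9606", "9593", "9544"}
group = {"9606", "9544"}  # species represented in the orthologous group
print(lineage_losses("A2", parent_of, study_set, group))
```

In this toy case the only species of the study set without a representative in the group is 9593, so it is the single candidate loss reported.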
The default value is lineage('117571'), which corresponds to a search for species that have no representative of a gene established at least since the last common ancestor of Euteleostomi.
"do_not_study_when_species_exist(['9606','9544'])" defines the species that stop the study if an ortholog exists in the first phylogeny. It should be empty if you want to analyse all the species where the gene is missing. If you need to concentrate on losses in a specific species, give its taxid here: if a database-described ortholog already exists for your species in the first phylogeny, there is no need to continue the study, which saves time. By default the value is empty.
"minimum_size_of_orthologs_group_for_begin_the_study(3)" is the minimum size of an ortholog group required in the first phylogeny to continue the study. The default value is 3.
"search_missing_cause_in_genome(choice)" defines whether GLADX is used in complete mode, that is, whether it searches the genome of the species for which orthologs are missing in the first phylogeny. Choice can be yes or no. If no is chosen, no verification of loss is made, and the results come exclusively from the analysis of the first phylogeny built from the chosen database (making the process much faster). The default value is yes.
"translate_in_gene_to_detect_ortholog_if_necessary(choice)" applies when you have a tree of proteins that you want to translate into genes. It allows comparing two ortholog groups of a gene or two ortholog groups of a protein. Choice can be yes or no. No is faster but slightly less precise.
"force_to_analyse_this_species(['9593','9606'])" forces the annotation of the listed species, even if an ortholog is found by phylogeny in the first step. By default the value is empty.
C) Parameters defined in the best_hit_fgx agent:
"max_nb_managed_hits('5')" is the number of TBLASTN hits retained to continue the analysis. The higher this number, the longer the GLADX analysis can take when the putative homologous sequences tested are not orthologous. The default value is 5.
D) Parameters defined in the genelosses_checkpoint_all_events_by_study agent:
"length_threshold(50)" is the minimum overlap required between an orthologous sequence retrieved by GLADX and a known protein in order to continue the study at nucleotide level. The default value is 50.
"identity_threshold(50)" is the minimum identity required between an orthologous sequence retrieved by GLADX and a known protein to continue the study at nucleotide level. The default value is 50.
"identity_threshold_for_real_gene(70)" is the minimum identity required between a known protein and the reference protein for the known protein to be used in the study at nucleotide level. The default value is 70.
E) Parameters defined in the genelosses_synthetic_analysis agent:
"nucleotide_in_more_by_side(10000)" is the number of nucleotides taken on each side of an orthologous gene to build an alignment with the orthologs retrieved during the study. This is the step just before the reconstruction (the value must be identical to the geneloss_event_search value). The default value is 10000.

F) Parameters defined in the verify_prediction_existence agent:
When GLADX retrieves an ortholog, it systematically checks the database used to see whether an annotation exists at its position. Sometimes previously described genes are present in the same area.
"overlap_threshold(50)" is the minimum overlap, in percent, for a previously described gene in the database to be considered at the same position. The default value is 50.
"identity_threshold_to_conclude_gene_already_exist(70)" is the minimum identity, in percent, for a previously described gene in the database that overlaps the GLADX-retrieved ortholog sequence to be considered the same prediction. The default value is 70.
G) Activation of the gladx_driver agent to automate the search for lineage-specific losses on all nodes:
Activating this agent makes it possible to systematically analyze the lineage-specific losses from all the nodes available along the lineage leading from the selected ancestor to the input reference.
"Targets(['9606'])" defines the species for which GLADX searches for lineage-specific losses. It focuses the search on the species of interest. When no species is specified, GLADX searches for all the lineage-specific losses along the studied lineage. By default the value is empty.
Agent activation is defined with the following markup: <master> <type>Agent_Name</type> … </master>
By default, the gladx_driver agent is deactivated by comment markup. To activate it, the comment markup around the gladx_driver agent must be removed, and the line of the orthologs_group_mode parameter of the geneloss_event_search agent must be commented out.
Note: when new studies are performed with the gladx_driver agent, its orthologs_group_mode(lineage('TaxidAncestor')) parameter defines the ancestor from which the study begins. If, however, the gladx_driver agent is launched after a first round of analysis in default mode, its orthologs_group_mode(lineage('TaxidAncestor')) parameter is not used. In this case, all the nodes of the lineage are analyzed from the ancestor that was defined in the first round in the orthologs_group_mode(lineage('TaxidAncestor')) parameter of the geneloss_event_search agent.
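The set of nodes visited by the gladx_driver agent, that is, every node along the lineage from the selected ancestor to the input reference, can be pictured with a short tree walk. The Python sketch below is an illustration under assumptions, not the agent's code: the species tree is given as a child-to-parent dictionary, the function name is hypothetical, and "A1" and "A2" are placeholder identifiers for ancestral nodes.

```python
def lineage_nodes(reference, ancestor, parent_of):
    """Nodes from `ancestor` down to the parent of `reference`, root-most first."""
    path = []
    node = parent_of.get(reference)
    while node is not None:
        path.append(node)
        if node == ancestor:
            return list(reversed(path))
        node = parent_of.get(node)
    raise ValueError(f"{ancestor} is not an ancestor of {reference}")


# Toy species tree (child -> parent); "A1" and "A2" are placeholder ancestors.
parent_of = {"9606": "A1", "9593": "A1", "A1": "A2", "9544": "A2", "A2": "1"}
print(lineage_nodes("9606", "A2", parent_of))
```

Each node in the returned list would then serve in turn as the TaxidAncestor of a lineage-mode analysis.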