Identifying Single Copy Orthologs in Metazoa

doi:10.1371/journal.pcbi.1002269

Figure 1.

Project workflow.

The analysis workflow is divided into three major steps. The first step (Eukaryotic guide tree construction) constructs the guide tree used to infer duplication and loss events. The second step (Identification of core metazoan gene families) is the core of our method, i.e. the identification of the single copy genes within the eggNOG database. The last step extracts the single copy genes from the EST datasets.


Figure 2.

Gene tree reconciliation process.

Reconciling a gene tree with a (guide) species tree. A) Given the species tree on the left, we need to estimate the most parsimonious number of duplications and losses that explains the topology and distribution of the gene tree (on the right). To count duplications and losses correctly, we need to find the best rooting of the gene tree. To this end, the gene tree is rooted at every possible position, and for each rooting the most parsimonious number of duplications and losses is calculated. The rooting that requires the fewest steps (duplications and losses) is considered the most parsimonious rooting of the gene tree. As an example, the reconciliations for two possible rootings, at positions X and Y, are shown in panels B) and C). Duplication events are indicated with a diamond and losses with a dashed line. Rooting the gene tree at position X, as in B), requires duplication and two losses, while rooting at position Y, as in C), requires 1 duplication and 1 loss. Of the two rootings, position Y is the more parsimonious. The numbers on the internal branches indicate the internal branch of the species tree in A) that they are mapped to. If we were trying to identify single copy genes at the hierarchical level of internal branch 2 on the species tree, then the sub-tree marked with a * in C) would represent a gene family that has been in single copy since this hierarchical level.
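The rooting search described in this caption can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: trees are plain nested tuples, both trees are binary, each gene-tree leaf is labeled with the species it comes from, reconciliation uses the standard last-common-ancestor (LCA) mapping, and losses above the mapped root of the gene tree are not counted. All function names (index_species_tree, lca, reconcile, rootings, best_rooting) are hypothetical.

```python
from itertools import count

def leafset(node):
    """Set of species names below a nested-tuple node."""
    return frozenset([node]) if isinstance(node, str) else frozenset().union(*map(leafset, node))

def index_species_tree(tree):
    """Return parent and depth lookups; each species node is keyed by its leaf set."""
    parent, depth = {}, {}
    def walk(node, par, d):
        key = leafset(node)
        parent[key], depth[key] = par, d
        if not isinstance(node, str):
            for child in node:
                walk(child, key, d + 1)
    walk(tree, None, 0)
    return parent, depth

def lca(a, b, parent, depth):
    """Last common ancestor of two species-tree nodes."""
    while a != b:
        if depth[a] >= depth[b]:
            a = parent[a]
        else:
            b = parent[b]
    return a

def reconcile(gene, parent, depth):
    """Return (mapped species node, duplications, losses) for a rooted binary gene tree."""
    if isinstance(gene, str):
        return frozenset([gene]), 0, 0
    (m1, d1, l1), (m2, d2, l2) = (reconcile(ch, parent, depth) for ch in gene)
    m = lca(m1, m2, parent, depth)
    is_dup = m in (m1, m2)  # a node mapping to the same place as a child is a duplication
    # Each skipped species-tree branch between a node and its child is one loss.
    losses = sum(depth[c] - depth[m] - 1 + is_dup for c in (m1, m2))
    return m, d1 + d2 + is_dup, l1 + l2 + losses

def rootings(gene):
    """Yield every rooting of a binary gene tree (one per branch of the unrooted tree)."""
    adj, label, ids = {}, {}, count()
    def build(node):
        i = next(ids)
        adj[i] = []
        if isinstance(node, str):
            label[i] = node
        else:
            for ch in node:
                j = build(ch)
                adj[i].append(j)
                adj[j].append(i)
        return i
    root = build(gene)
    a, b = adj.pop(root)  # unroot: join the two children of the input root
    adj[a].remove(root)
    adj[b].remove(root)
    adj[a].append(b)
    adj[b].append(a)
    def subtree(u, came_from):
        if u in label:
            return label[u]
        return tuple(subtree(v, u) for v in adj[u] if v != came_from)
    for u in adj:
        for v in adj[u]:
            if u < v:
                yield (subtree(u, v), subtree(v, u))

def best_rooting(gene, species):
    """Return (cost, duplications, losses, rooted tree) for the cheapest rooting."""
    parent, depth = index_species_tree(species)
    scored = []
    for rooted in rootings(gene):
        _, dups, losses = reconcile(rooted, parent, depth)
        scored.append((dups + losses, dups, losses, rooted))
    return min(scored, key=lambda s: s[0])

# Toy example (made-up data): a three-species tree and a gene tree rooted in the wrong place.
species = (("A", "B"), "C")
gene = (("A", "C"), "B")
parent, depth = index_species_tree(species)
as_given = reconcile(gene, parent, depth)[1:]
best = best_rooting(gene, species)
```

In this toy case the gene tree as given needs one duplication and three losses, whereas the search finds a rooting that needs none. Ties between equally cheap rootings are broken arbitrarily here; a real pipeline would need an explicit rule.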


Figure 3.

Eukaryotic guide trees used in the analysis.

The eukaryotic guide trees were constructed from a concatenated alignment of the 40 universally distributed genes [35]. A) The phylogeny supporting the Coelomata hypothesis for the evolution of animals. B) The phylogeny supporting the Ecdysozoa hypothesis for the evolution of animals, created by hand from A). Branch lengths represent the evolutionary distances between the taxa based on their amino acid sequences and were estimated using the same alignments of universal genes. Both trees were used in the gene-tree reconciliation step, so as not to bias subsequent analyses towards either hypothesis. Filled circles represent internal branches that received greater than 95% bootstrap proportion (BP) support. Open circles represent internal branches with greater than 60% BP support.


Figure 4.

Distribution of single copy genes in the analyzed species.

Distribution of single copy genes across all studied species. The tree contains the species analyzed in this study and their relationships as defined by the NCBI taxonomy. The number of single copy genes found in each species is shown, along with a representation of that value as a percentage of all the 1,126 single copy genes and as a percentage of the total number of genes in the genome or EST dataset used. The black bars represent counts from genomes, grey bars from published EST datasets. Species names in bold indicate the species that were used to define the set of single copy orthologs.


Figure 5.

Multigene family reconstruction.

An example of the reconciliation of a proteasome 26S subunit multigene family is shown on the left. Duplications are hypothesized to have occurred on the branches colored in red, while branches hypothesized to have been lost are in grey. The subtree in the dashed box has been identified as being in single copy. The tree on the right is a more detailed view of the same clade. The leaves on the tree are labeled with their species names followed by the protein ID of the specific sequence that was mapped to that position.
