Complete Mitochondrial DNA Sequences of the Threadfin Cichlid (Petrochromis trewavasae) and the Blunthead Cichlid (Tropheus moorii) and Patterns of Mitochondrial Genome Evolution in Cichlid Fishes

The cichlid fishes of the East African Great Lakes represent a model especially suited to study adaptive radiation and speciation. With several African cichlid genome projects being in progress, a promising set of closely related genomes is emerging, which is expected to serve as a valuable data base to solve questions on genotype-phenotype relations. The mitochondrial (mt) genomes presented here are the first results of the assembly and annotation process for two closely related but eco-morphologically highly distinct Lake Tanganyika cichlids, Petrochromis trewavasae and Tropheus moorii. The genomic sequences comprise 16,588 bp (P. trewavasae) and 16,590 bp (T. moorii), and exhibit the typical mitochondrial structure, with 13 protein-coding genes, 2 rRNA genes, 22 tRNA genes, and a non-coding control region. Analyses confirmed that the two species are very closely related with an overall sequence similarity of 96%. We analyzed the newly generated sequences in the phylogenetic context of 21 published labroid fish mitochondrial genomes. Consistent with other vertebrates, the D-loop region was found to evolve faster than protein-coding genes, which in turn are followed by the rRNAs; the tRNAs vary greatly in the rate of sequence evolution, but on average evolve the slowest. Within the group of coding genes, ND6 evolves most rapidly. Codon usage is similar among examined cichlid tribes and labroid families; although a slight shift in usage patterns down the gene tree could be observed. Despite having a clearly different nucleotide composition, ND6 showed a similar codon usage. C-terminal ends of Cox1 exhibit variations, where the varying number of amino acids is related to the structure of the obtained phylogenetic tree. This variation may be of functional relevance for Cox1 synthesis.

Sequence similarity of genomic features. Shown are percent and summed position-wise identities of nucleotide and amino acid sequences and the number of gaps occurring in total, as determined by BLAST-based comparison. Recognized and compared coding features: ND1-6 (NADH dehydrogenase subunit 1-6), COX1-3 (cytochrome c oxidase subunit 1-3), ATP6/8 (ATP synthase F0 subunit 6/8), CYTB (cytochrome b); values on the right hand side of the | symbol refer to amino acid sequences, all other values to nucleotide sequences; CR (control region).   Figure S1 Example of a regression analysis result. For each gene a linear least squares regression was calculated against a reference gene. In this example, pairwise distances (each species-specific sequence against the consensus sequence) for the ND6 genes were related to the distances for the reference 12S rRNA genes. Dots represent distance value pairs per species; red lines indicate the confidence region. The slope value of the regression line is presented in the bar plot in Figure 3. Boxplots were used to examine the distribution of differences in the distance value pairs (i.e. the rate gradient or direction, showing which one of the two genes exhibits more normalized substitutions). In the case of ND6, in each observed species there were more substitutions in ND6 than in the 12S rRNA (only positive values). In tRNA vs. 12S rRNA comparisons the zero was crossed in several cases, hence, for certain species the direction of which gene evolves relatively faster is inverse to that stated for the general trend.

Table S4A
Overview of linear regression statistics (p-distance based). Shown are adjusted R² values and p-values from F-statistics. R² is interpretable as the proportion of the total variability of the distance pairs that is accounted for by the linear model; as each additional model parameter (additi onal data point in the distance vector, i.e. an additional species) increases R², even if it has no statistical power, the adjusted version was applied (taking model complexity into account). F is the ratio of the variance explained by the parameters in the linear model and the residual or unexplained variance; p-values are calculated as the probability of achieving an F that large under the null hypothesis of 'no effect' (a bad model), from an F-distribution with degrees of freedom related to the number of observations/parameters. Regression results for rRNAs, coding-genes and the D-loop may be regarded as reliable; however, some distance pair scatter plots exhibited rather broad distributions lowering R² values and hampering a clear placement of the regression line, they are tagged (*).  Table S4B Overview of Bayesian-estimated mean rates. Shown are the mean rates (see Figure S5 for the definition) along with the median and the lower and upper bound of the highest posterior density (HPD) interval (the credible set that contains 95% of the sampled values) for all genes or partitions.  Table S5A AIC and AICc selected substitution models (with outgroup).
Best nucleotide substitution models were determined out of 88 candidate models (11 substitution schemes, G, I and F) with 8 rate categories for the gamma distribution; a maximum likelihood tree was used as base tree for likelihood calculations, SPR as tree topology search option. Best amino acid substitution models were determined out of all 120 available models (15 matrices, I, G, I+G, and F) with 8 rate categories; starting topology for likelihood calculations was either a fixed BIONJ JTT tree or a maximum likelihood tree (the top 3 scoring models of the latter are shown). Starting topology had an influence on selected models but mostly only on the ordering of the top 3, except for ND6 for which the whole fixed BIONJ top 3 are also given (boxed). In cases were the starting topology had an influence on the selected model, the fixed BIONJ top model is marked with an asterisk (*). Eventually selected models for calculations are highlighted (bold). Models were rated according to the AIC (for tRNA sequences the AIC corrected for small sample size (AICc) was applied). ΔAIC = AIC difference to top scoring model; p = number of model parameters; -lnL = log-likelihood, cc = concatenated; S10

Table S6
Overview on bifurcation support values. Shown are all node support values obtained as described in Methods. Slashes (/) indicate a node being absent in a particular tree; numbers in brackets represent exchanges of species between two nodes (see Figure S2b); underlined values for nodes 22 and 23 imply that in this particular tree indeed two nodes were created while they typically (i.e. in most setups) collapsed (thus, the notation "22/23" in Figure 2); missing values are marked (-). Using mixed models in MrBayes gave exactly the same results as stated for the AIC-selected models (thus these results were omitted

#3 Gb
CDS|rRNA|D-loop Gblocks-filtered p partitioned #1 aa CDS amino acid S12 A B Figure S2 Gene tree topologies. Shown are both generated tree topologies. Topology A was obtained consistently for partitioned data sets #2 and #3 with different computational methods and was therefore presented as valid gene tree in Figure 2; topology B was obtained for data set #1 and includes all below mentioned inconsistencies of species-node assignments within tribes [3 ↔ 11; 9 ↔ 13; 10 ↔ 14]. Numbers at nodes relate to , who also used only coding sequences in their analysis. Third, Pseudolabrus sieboldi and Pteragogus flagellifer switched places, putting P. flagellifer closer to Pseudolabrus eoethinus -this occurred only in the best ML tree and not in the majority rule consensus tree; it is very likely an artifact. In addition, the grouping of ((Ditrema temminckii, Cymatogaster aggregata)(Amphiprion ocellaris, Abudefduf vaigiensis)), i.e. the merging of the families Pomacentridae and Embiotocidae to sister groups, was not supported -but with weak support values for the split (see Table S6). These disagreements showed up similarly in the analyses based on amino acid sequences or on a codon position-wise partitioned data set #1 ( Figure S2). In short, analysis solely based on coding sequences or amino acid sequences did not identify Pomacentridae and Embiotocidae as sister groups (with weak support for their split, though), and all other analyses did. Gblocks reduced data set #3 from 16,084 to 13,833 characters, where, as expected, mainly the D-loop region and rRNA sequences were affected from filtering (see Table S3A). Full and Gblocks-filtered alignments still resulted in the same topology, both under ML and BI.

Figure S3 Phylogenetic relationships among labroid families analyzed in this study.
Shown is a representative phylogram with branch lengths (expected number of substitutions per site) calculated by maximum likelihood (RAxML) on the gene wise partitioned data set #3 (i.e., all sequences except those of tRNA genes).

Figure S4
Amino acid accessibility of COX-1 genes. Shown are scale profiles (% accessible) for Tropheus duboisi (orange), Oreochromis aureus (red), Cymatogaster aggregata (blue), and Alepocephalus agassizii (black); scores are normalized (0,1) with a higher score indicating higher accessibility. High accessibility of the gained C-terminal sequences is obvious (orange and red peak on the right). S14 Table S7 Information criterion-based selection of nucleotide substitution models (without outgroup). Best nucleotide substitution models were determined out of 88 candidate models (11 substitution schemes, G, I and F) with 8 rate categories for the gamma distribution; a maximum likelihood tree was used as base tree for likelihood calculations, best of NNI and SPR as tree topology search option. Eventually selected models for calculations are highlighted (bold); based on the results from Posada and Buckley [67], AIC/AICc or BIC were used for model selection. For comparison purposes models were rated according to AIC, BIC, DT and AICc (except for 4-fold degenerate sites); sequence length was used to approximate sample size in AICc, BIC, and DT calculations. ALT indicates the manual selection of an alternative model, performed when information criteria have chosen models with obvious misoperation of the model fitting or likelihood approach (with extremely high estimates of relative substitution rates). Abbreviations: 4fd, fourfold degenerate; cp, coding position; AIC, Akaike information criterion; AICc, AIC corrected for small sample size; BIC, Bayesian information criterion; DT, decision-theoretic performance-based approach;  Figure S5 Relative rate of molecular evolution. Bars represent the mean rates estimated in a Bayesian MCMC analysis (BEAST 1.7.4), where mean rate is defined as the total number of substitutions per site divided by the total amount of time that the tree represents, or the mean of (rate on i-th branch) weighted by (the length of time of the i-th branch) (i.e., ∑ ∑ ) (for individual bar heights see Table S4b). Group mean values are indicated (from left to right: 13.34, 11.49, 0.28, 0.95, 3.14, 1.14, 0.84). D-loop (3.21) represents the full non-coding region, whereas D-loop Gb (2.42) is based on a Gblocks-filtered multiple sequence alignment. Yellow bars refer to unreliable results as identified by the respective posterior distributions (skewed, long-tailed to the right; under the same gamma prior as used for all other genes or partitions). S19 Table S8 Molecular clock hypothesis tests. Molecular clock tests were performed for all genes using BEAST 1.7.4. Strict and lognormal relaxed clocks were tested on a tree model based on a Yule process tree prior; UPGMA starting trees were used. Two independent MCMC chains were set to 5 MIO steps being logged every 100th step. Parameter distributions were evaluated using Tracer 1.5. The clock-like behavior was validated by means of the standard deviation of the uncorrelated lognormal relaxed clock (UCLD stdev). If the UCLD stdev parameter estimate is close to 0.0 then the data is quite clock-like, if it is greater than 1.0 then the data exhibits substantial rate heterogeneity among lineages. ATP8, ND5, and some tRNAs show increased rate heterogeneity. The YBR is the rate of lineage birth in the Yule model of speciation. In the present setting this parameter estimates the number of lineages born from a parent lineage per substitution per site; all RNA genes exhibit relatively high birth rates. If the covariance (COV) between parent and child branch rates in a tree is significantly positive then branches with fast rates are followed by branches with fast rates; if the COV distribution spans zero (what is the case for all examined genes), then branches with fast rates and slow rates are next to each other (there is no strong evidence of autocorrelation of rates in the phylogenies