Identification of Transcription Factor Genes and Their Correlation with the High Diversity of Stramenopiles

The biological diversity among Stramenopiles is striking: they range from large multicellular seaweeds to tiny unicellular species, and they include many ecologically important autotrophic (e.g., diatoms, brown algae) and heterotrophic (e.g., oomycetes) groups. Transcription factors (TFs) and other transcription regulators (TRs) regulate spatial and temporal gene expression. A plethora of transcriptional regulatory proteins have been identified and classified into families on the basis of sequence similarity. The purpose of this work is to identify the TF and TR complement in diverse species belonging to Stramenopiles in order to understand how these regulators may contribute to their observed diversity. We identified and classified 63 TF and TR families in 11 species of Stramenopiles. In some species we found gene families with high relative importance. Of the 63 TF and TR families identified, 28 were found to be positively correlated with specific traits such as number of predicted proteins, number of flagella and number of cell types during the life cycle. Additionally, we found gains and losses of TF and TR families specific to some species and clades, as well as two families with high abundance specific to the autotrophic species and three families with high abundance specific to the heterotrophic species. This is the first systematic search of TF and TR families in Stramenopiles. Uncovering relationships between these families and the complexity of this group may have considerable impact, considering that the group includes several important pathogens of plants and animals as well as important species involved in carbon cycling. Specific TF and TR families identified in this work appear to be correlated with particular traits in Stramenopiles and may be related to the high complexity and diversity of the group.


Introduction
Stramenopiles are a very diverse group that includes algae and diatoms, the non-photosynthetic oomycetes, and a range of chlorophyll c-containing unicellular and multicellular brown algae [1]. On the one hand, there are Bacillariophyceae and Chrysophyceae, secondarily photosynthetic organisms that engulfed a red alga and adopted it as a plastid approximately 1,300 million years ago [2]. On the other hand, non-photosynthetic Stramenopiles such as the oomycetes do not even have the vestigial plastids found in apicomplexan and euglenoid parasites that originated from phototrophs [2].
In all organisms, the regulation and coordination of growth, development, and cell cycle progression, as well as the physiological and metabolic adaptation to a variable environment, depend on the regulation of gene expression. A major step in this modulation is controlling when and how much RNA is generated from a DNA template. Sequence-specific DNA-binding transcription factors (TFs), each recognizing a family of cis-regulatory DNA sequences, regulate spatial and temporal gene expression by binding to DNA and either activating or repressing the action of RNA polymerase [3]. Additional proteins, the other transcriptional regulators (TRs), are involved in protein-protein interactions or chromatin remodelling [4]. TFs and TRs are modular in nature, composed of semi-independent structural units called domains that can appear in various combinations and arrangements, i.e., domain architectures [5]. Domain architectures have been used to identify TFs and TRs in different approaches [4][5][6]. Pérez-Rodríguez et al. (2010) described a number of TF families according to their domain architecture (a rule system), identifying all domains that are known to occur in this type of regulatory protein and that are generally employed to classify proteins as transcriptional regulators, and built a database with putatively complete sets of TFs and TRs from several plant species [4].
In this study we identified and classified the TFs and TRs of Stramenopiles whose genome sequences are available, based on an extension of the rule system designed by Pérez-Rodríguez et al. (2010) [4]. Furthermore, we were interested in evaluating the correlation, in a proper phylogenetic context, between traits of interest and gene family sizes as a way to offer insights into the processes generating the observed diversity in Stramenopiles.

Identification and classification system
To identify TF and TR family members in the genome sequences of Stramenopiles, we first extended the set of rules established by Pérez-Rodríguez et al. (2010) [4] in order to recover families specific to groups such as Stramenopiles and Fungi. Briefly, we used the current literature to identify the set of domains known to occur in TFs and TRs from Stramenopiles and Fungi (a list of 49 new rules is available in Table S1). This list was established from the available PFAM profile HMMs (v25.0). Additionally, we generated new or updated HMMs (in house models) based on alignments that were either created from the output of PSI-BLAST searches against the NCBI protein database or taken from the literature.
To carry out the identification and classification of TFs and TRs, we retrieved the entire set of predicted proteins of all Stramenopiles species whose nuclear genomes were sequenced, annotated and available at the end of 2010 (Table 1). A profile HMM search with hmmsearch (HMMER v3.0) was performed over all protein sequences using all protein domain models (PFAM v25.0 and in house models). We considered as significant all hits with a bit score larger than the domain gathering cut-off value defined in the model. We classified the proteins into the established families of TFs and TRs based on their domain architecture, i.e., the set of rules (Figure S1).
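The rule-based classification described above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline used in the study: the two rules, the domain names and the cut-off values are hypothetical placeholders, and each rule is assumed to consist of domains that must be present ("should") and domains that must be absent ("should not"), as in Table S1.

```python
# Hypothetical rules for illustration only; the real rules are listed in Table S1.
RULES = {
    "MYB": {"should": {"Myb_DNA-binding"}, "should_not": {"SWIRM"}},
    "HSF": {"should": {"HSF_DNA-bind"}, "should_not": set()},
}


def significant_domains(hits, cutoffs):
    """Keep domains whose bit score is larger than the model's gathering cut-off.

    hits: list of (domain_name, bit_score) pairs for one protein.
    cutoffs: dict mapping domain_name to its gathering cut-off.
    """
    return {domain for domain, score in hits if score > cutoffs[domain]}


def classify(hits, cutoffs, rules=RULES):
    """Return the first family whose rule matches the protein, or None."""
    domains = significant_domains(hits, cutoffs)
    for family, rule in rules.items():
        has_required = rule["should"] <= domains
        has_forbidden = bool(rule["should_not"] & domains)
        if has_required and not has_forbidden:
            return family
    return None
```

In practice the (domain, bit score) pairs would be parsed from hmmsearch output; that parsing step is left out here.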
After the identification and classification of TFs and TRs, we carried out a relative importance analysis. This analysis measures how important a family is in comparison with other families in the same organism according to its size: we represented the importance of a family in a given species as its percentage contribution to the total number of TFs and TRs in that species.
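As a small worked illustration of the relative importance measure, the following Python sketch computes the percentage contribution of each family to the TF and TR total of one species. The family counts in the usage note are invented for the example and are not taken from Table S3.

```python
def relative_importance(family_counts):
    """Percentage contribution of each family to the species' TF + TR total.

    family_counts: dict mapping family name to number of proteins in one species.
    """
    total = sum(family_counts.values())
    if total == 0:
        # No regulators found: every family contributes 0%.
        return {family: 0.0 for family in family_counts}
    return {family: 100.0 * n / total for family, n in family_counts.items()}
```

For example, with hypothetical counts `{"C2H2": 50, "GNAT": 30, "TFIIS": 20}` the function assigns C2H2 half of the total, which makes values comparable between species with very different numbers of regulators.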

Updating in house models
For all in house models we created our own profile HMMs with HMMER 3.0 based on multiple sequence alignments. Furthermore, we defined a gathering cut-off value above which hits are considered significant; to this end, we used the same procedure as described in Pérez-Rodríguez et al. (2010) [4].
To verify our models, we carried out the identification and classification of TFs and TRs in Arabidopsis thaliana to find out whether the HMMs now created with HMMER 3.0 yield the same classification as reported in Pérez-Rodríguez et al. (2010) [4]. We expected to retrieve at least the same members for each of the protein families reported there [4]. Seed alignments and models are available as Datasets S1 and S2, respectively.
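The verification criterion (retrieving at least the previously reported members of each family) amounts to a per-family set comparison. A minimal Python sketch, assuming both classifications are available as dictionaries from family name to a set of protein identifiers; the function name and data layout are illustrative assumptions, not part of the original workflow.

```python
def missing_members(reference, recovered):
    """Return, per family, the reference members absent from the new classification.

    reference: dict family -> set of protein IDs from the earlier classification.
    recovered: dict family -> set of protein IDs from the updated models.
    An empty result means every reported member was retrieved again.
    """
    missing = {}
    for family, members in reference.items():
        lost = set(members) - set(recovered.get(family, ()))
        if lost:
            missing[family] = lost
    return missing
```

Extra members in the new classification are deliberately not reported, since the stated expectation is only that no previously reported member is lost.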

Phylogenomic analysis of Stramenopiles
We inferred the phylogenetic relationships between the species of Stramenopiles using orthologous genes identified with OrthoMCL (http://orthomcl.org/) [7]. We carried out two extra OrthoMCL analyses, one including selected protist species and the other including the same protist species plus some plant species; all the extra species were selected taking into account their phylogenetic relationships and the availability of fully sequenced genomes (Table 2). These OrthoMCL analyses were made in order to determine whether the inclusion of distant species had any effect on the tree topology; the number of phylogenetic clusters retrieved in each OrthoMCL analysis is shown in Table 3. To establish the groups of orthologous genes we used an MCL inflation value of 1.5. Two datasets were retrieved: one with all OrthoMCL clusters containing one and only one orthologous gene for each species, and a second with all clusters in which at most one species was missing. For each retrieved cluster we performed a multiple sequence alignment with MAFFT (--auto option) [8], and we determined the proper evolutionary model using ProtTest (default options) [9]. Alignments were concatenated using FASconCAT [10]. Phylogenetic inference was performed with FastTree [11] and RAxML [12] for the super-matrix approach, with 1000 bootstrap replicates, in a maximum likelihood analysis. All the matrices and trees from RAxML that were used for subsequent analyses were uploaded to TreeBASE (http://purl.org/phylo/treebase/phylows/study/TB2:S12312). Furthermore, we used the PANTHER library version 7.0 (http://www.pantherdb.org/) to assign putative functional annotations to the gene clusters used for the phylogenomic reconstruction.
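The selection of the two cluster datasets can be illustrated with a short Python sketch. It assumes the OrthoMCL output has already been parsed into a dictionary from cluster identifier to a list of (species, gene) pairs, and it assumes that the relaxed dataset still requires at most one gene per species; both points are assumptions made for the example.

```python
from collections import Counter


def select_clusters(clusters, species, max_missing=0):
    """Select clusters with at most one gene per species.

    clusters: dict cluster_id -> list of (species, gene_id) pairs.
    species: the species that should be represented.
    max_missing: number of species allowed to be absent
                 (0 for the strict dataset, 1 for the relaxed one).
    """
    selected = []
    for cluster_id, members in clusters.items():
        counts = Counter(sp for sp, _gene in members)
        single_copy = all(counts[sp] <= 1 for sp in species)
        missing = sum(1 for sp in species if counts[sp] == 0)
        if single_copy and missing <= max_missing:
            selected.append(cluster_id)
    return selected
```

Calling the function with `max_missing=0` and `max_missing=1` yields the strict and relaxed datasets, respectively; each selected cluster would then be aligned and concatenated into the super-matrix.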

Comparative phylogenetics of Stramenopiles
One of our goals was to detect whether any traits co-varied with gene family sizes, as this would suggest a relationship between the given trait and gene family, thus prioritizing families for further studies and experiments. In order to assess the correlation between genomic and phenotypic characters and the sizes of different TF and TR families in Stramenopiles, we built a trait matrix including 4 traits: genome size, number of predicted proteins, number of flagella and number of cell types in the life cycle (Table S2). The covariation between the traits and the gene family sizes could be evaluated with a simple correlation analysis. However, this assumes that the observations (data points, e.g., the number of flagella in different species) are independent and identically distributed; the species under study are related by common descent, which violates this assumption. We can correct for the common evolutionary history by transforming our data points into an, in principle, independent and identically distributed dataset using Phylogenetic Independent Contrasts analysis (PIC) [13][14][15], and then perform the correlation analysis on these contrasts. In this way we can evaluate the correlation between traits and gene family size while controlling for the effect of common evolutionary history. We performed a PIC for each of these traits using the R package APE [16]. Finally, Pearson's correlation coefficient was computed between the contrasts of interest, i.e., traits and the sizes of TF and TR families, using the basic function cor() in R [17]; p-values for the measured correlation coefficients were corrected using the false discovery rate approach in the R package fdrtool [18]. Furthermore, we mapped the gains, losses and high abundance of TF and TR families on the phylogenetic tree. Six additional characters, i.e., cell cover (cellulose or silica cover), presence/absence of chloroplast, presence/absence of pigments, lifestyle, presence/absence of cell wall degrading enzymes and formation of haustoria, were also mapped on the tree (Table S2). Mapping of family sizes and traits was performed using Mesquite (http://mesquiteproject.org/). In order to identify correlations between family size and the 6 additional characters, we used a phylogenetic logistic regression with the Firth correction using the PlogReg algorithm; this method was necessary because the 6 additional characters are categorical [19]. Moreover, the significance of differences in abundance of TF and TR families between the autotrophic and the heterotrophic groups was analyzed using a standard t-test with subsequent false discovery rate correction.
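To make the contrast-based correlation concrete, the following Python sketch implements the standard independent contrasts recursion on a small rooted binary tree and then computes Pearson's correlation between two sets of contrasts. It is an illustration only: the analysis reported here was done with the R package APE and cor(), and the nested-tuple tree encoding, the function names and the toy values in the usage note are assumptions made for the example.

```python
import math


def _pic(node, trait, contrasts):
    """Return (value, branch_length) for a node, collecting standardized contrasts.

    A tip is (name, branch_length); an internal node is ((left, right), branch_length).
    """
    label, length = node
    if isinstance(label, str):
        return trait[label], length
    left, right = label
    x1, v1 = _pic(left, trait, contrasts)
    x2, v2 = _pic(right, trait, contrasts)
    # Standardized contrast between the two daughter lineages.
    contrasts.append((x1 - x2) / math.sqrt(v1 + v2))
    # Ancestral estimate: average weighted by inverse branch length.
    value = (x1 / v1 + x2 / v2) / (1.0 / v1 + 1.0 / v2)
    # Lengthen the branch to account for uncertainty in the estimate.
    return value, length + v1 * v2 / (v1 + v2)


def independent_contrasts(tree, trait):
    """Compute standardized contrasts for one trait over a rooted binary tree."""
    contrasts = []
    _pic(tree, trait, contrasts)
    return contrasts


def pearson(a, b):
    """Pearson's correlation coefficient between two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)
```

With a toy tree such as `((((("A", 1.0), ("B", 1.0)), 1.0), ("C", 2.0)), 0.0)`, a trait and a family-size vector are each converted to contrasts with `independent_contrasts` and then compared with `pearson`; because both traversals visit the nodes in the same order, the contrasts are paired correctly. The resulting p-values would still need a multiple-testing correction, which is not shown.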

Extension of the rules system for the identification and classification of TF and TR families in Stramenopiles
From a thorough literature search, we designed 135 rules for classifying regulatory proteins. Of the entire set of rules, 75% were related to domains occurring in plants and were taken from the previous work of Pérez-Rodríguez et al. (2010) [4]. This set of rules comprises a wide array of families widely distributed across eukaryotes. The other 25% of the rules were built with domains occurring in TFs and TRs of Stramenopiles and/or fungi (Figure S1).

TFs and TRs found in Stramenopiles
We found 38 TF and 25 TR families in the 11 Stramenopiles evaluated, i.e., about 50% of the total of rules included (Table S3). The relative importance analysis (Figure 1) compares families with a high relative importance, such as GNAT (proteins with an acetyltransferase domain), MYB-related (DNA-binding domains from Myb proteins) and C2H2 (structural zinc finger proteins that bind DNA, RNA or proteins), with contrasting families that have a low relative importance across the species, such as CSD, SWI/SNF-BAF60b and TFIIS.

Phylogenomic analysis of Stramenopiles
We could show that the inference of evolutionary relationships among Stramenopiles is robust to the inclusion of distantly related species such as protists, algae and plants (Figure S2). Notably, we found that red and green algae are closely related to Stramenopiles, as suggested previously [20][21][22]. For further analyses we used the tree inferred from all the orthologous gene clusters in which at most one gene representing each species was present (orthologues in a one-to-one relationship), using an inflation value of 1.5, as recommended previously [23] (Figure 2; Dataset S3). We were able to assign functions to 45% of the orthologous gene clusters using the PANTHER library; examples of these gene families are Heat shock protein 70, Histone acetyltransferase, Ornithine decarboxylase and TATA-box-binding protein. However, a large number of gene clusters were not identified with this approach (details in Table S4).

Comparative phylogenetics of Stramenopiles: Trait correlations with TF and TR family sizes
Out of the 63 TF and TR families identified in Stramenopiles, we found 28 families correlated with the presence of flagella, the number of predicted proteins and the number of cell types in the life cycle (Table S5). Moreover, we found no correlation with genome size; this is an expected result given the high variability of genome sizes in the group, even more so when taking into account the C-value paradox [24]. The PIC analysis identified a common set of families that are correlated with all traits (Figure 3), making them a candidate set responsible for the biological diversity shown by this group of organisms. A different subset of families was correlated with both the formation of flagella and the number of cell types. Moreover, several families were specifically correlated with the number of predicted proteins, and three families were specifically correlated with the number of cell types in the life cycle (Figure 3). Taking into account the relative importance analysis described previously (Figure 1), we found families that show both a correlation and a high relative importance, as in the case of the C2H2, GNAT and MYB families.
Mapping families on the phylogenetic tree allowed us to identify gains and losses. Losses are recurrent in different species and clades; however, in species like Hyaloperonospora parasitica and Phaeodactylum tricornutum we found a higher number of them. Some families were lost more than once, as in the case of the zf_A20, HB and TIG families (Figure 2). On the other hand, we found that the transcription factor family LuxR (related to the expression of response regulators in several bacterial species) was gained in 3 autotrophic species (Figure 2). It was reported that a remarkably high number of genes appear to have been transferred between Stramenopile diatoms and bacteria via horizontal gene transfer (HGT) [25]. We carried out a phylogenetic inference (RAxML, 1000 bootstraps, evolutionary model Gamma+I+G) using the eukaryotic versions of LuxR and the corresponding versions from Vibrio cholerae, Vibrio alginolyticus and Bacillus cereus, revealing that the eukaryotic LuxR forms a single, well supported clade. This indicates that if a horizontal gene transfer event took place, it was a single one predating the split between Fragilariopsis cylindrus, Phaeodactylum tricornutum and Thalassiosira pseudonana (Figure S3).
We could also establish that the LIM family, found in other eukaryotes [26], was lost in the oomycete clade and later regained in the Phytophthora group (Figure 2). Furthermore, high abundance in the HMG and HSF families was significant for the autotrophic group, whereas high abundance in the C2H2, C3H and FHA families was significant for the heterotrophic group (Figure 2). Finally, no correlation was found between TF and TR family size and the binary traits cell cover, presence/absence of chloroplast, presence/absence of pigments, lifestyle, presence/absence of cell wall degrading enzymes and formation of haustoria (Table S6).

Discussion
The updated rule system designed by Pérez-Rodríguez et al. (2010) [4] allowed us to rigorously classify proteins into several well-described families of TFs and TRs. This was a key phase in order to avoid the inclusion of false negatives. Furthermore, a phylogenomic analysis of the Stramenopiles species enabled us to reach a reliable approximation of their phylogenetic relationships. This was a key step in establishing a correlation between different traits and TF and TR families, eliminating as much as possible the bias generated by common ancestry in order to obtain values that are statistically independent [13].
The number of predicted proteins, the number of cell types during the life cycle and the number of flagella showed a strong correlation with the size changes of several TF and TR families. Most of these families, however, were positively correlated with more than one specific trait, making it clear that these families are implicated in the complexity of the organisms, but leaving the elucidation of their specific roles in the regulation of these traits for further analyses. On the other hand, several families correlated specifically with the number of cell types during the life cycle and the number of predicted proteins; MYB, MYB_related and CAMTA are examples. It is remarkable that we found a high relative importance of these families in some organisms, making them of great interest for further investigation.
Gain and loss of TF and TR families have been reported to be determinant in the complexity and diversity of many organisms [27]. The Stramenopiles species evaluated in this study are no exception to these findings. For example, we could suggest a gain of the LuxR family in some autotrophic species, most likely resulting from horizontal gene transfer between bacteria and Stramenopiles [25]. We also found an interesting pattern in the LIM family (domains that mediate protein-protein interactions), as it was apparently lost in the oomycete group but later appeared as a gain in the Phytophthora clade. Moreover, we found several families lost in specific species such as the heterotrophic organism Hyaloperonospora parasitica and the autotrophic organisms Aureococcus anophagefferens and Phaeodactylum tricornutum. These families need to be investigated in depth in order to understand whether these losses and gains have an impact on the diversity of specific groups. However, the lack of genome sequences of heterotrophic but non-parasitic organisms could mask more relations between TF and TR families and specific features of Stramenopiles.
On the other hand, high abundance in TF and TR families has been thoroughly analyzed in many groups across the tree of life and seems to correlate with the complexity and diversity of several organisms [27]. We could also establish this relation in the Stramenopiles, taking into account the comparison between the two largest groups (autotrophic and heterotrophic organisms). In protists, the significantly high abundance in HMG (a TR family) could suggest a possible secondary adaptation as a transcription factor [28]; in addition, the highly abundant HSF family is involved in the regulation of heat shock proteins in order to respond to different types of stress [28]. The C2H2 and C3H families have been reported in several eukaryotic organisms across the tree of life; the expansions of these families have been described as independent lineage-specific expansions in parasitic organisms that may be involved in the regulation of their specific characteristics [28]. The high abundance of these families reported in this study for the heterotrophic group (oomycetes) could be implicated in the regulation of specific features of this group, i.e., parasitic features.

Conclusions
Understanding how Stramenopiles species regulate some of their distinct characteristics may be helpful in dealing with those species in this group that cause economic losses around the world, e.g., Phytophthora species, or that play an important role in carbon cycling, e.g., diatoms. We showed for the first time that TF and TR families are correlated with specific traits that may be involved in the complexity and diversity of Stramenopiles. In addition, several losses were established in the Stramenopile group, as well as lineage-specific high abundance in families related to autotrophic and heterotrophic organisms. Currently, the available genome sequences are biased towards pathogenic organisms. The inclusion of more autotrophic organisms, as well as of heterotrophic but non-pathogenic species, e.g., Salisapilia spp., will make possible a more complete identification across the Stramenopiles group in order to clarify the gains, losses and high abundance of TF and TR families.

Table S1 New rules for identification of TFs and TRs in Stramenopiles. 49 rules specific for Stramenopiles were identified using current literature. Models were taken from available PFAM profile HMMs (v25.0) or created based on alignments (in house models). Additionally, for each family the domains that must appear in the family (type = should) and the domains that must not appear (type = should not) are shown. (XLSX)
Table S2 Traits used for phylogenetic independent contrast analysis (PIC) and traits mapped for each species. Continuous traits were used for the PIC analysis (genome size, predicted proteins, number of flagella and cell types); binary traits were used for mapping (cell cover, chloroplast, pigments, lifestyle, cell wall degrading enzymes and haustorium). (XLSX)
Table S3 Number of proteins assigned to a TF or TR family. Columns represent each species and rows represent each of the TF and TR families. For each family the number of proteins classified into it is displayed. (XLSX)
Table S4 Classification of orthologous genes using the PANTHER library. 45% of the orthologous genes were assigned to a specific family in the PANTHER library; however, 65% of the genes were not assigned to a specific family. (XLSX)
Table S5 TF and TR families correlated after PIC + FDR. Families with q-value < 0.05 (positively or negatively correlated) are shown for each of the continuous traits used in this study. (XLSX)

Figure 1. Relative importance of TF and TR families in Stramenopiles. The Y axis represents the relative importance of each family; the X axis represents the 9 species used in this study. For each species, families with a high relative importance that are correlated with the number of predicted proteins, number of cell types and number of flagella are shown (C2H2 in orange, HSF in green and MYB-related in blue). Moreover, some families with a low relative importance are shown in the figure as a control for comparison (CSD in yellow, SWI/SNF-BAF60b in purple and TFIIS in pink). doi:10.1371/journal.pone.0111841.g001

Figure 2. Reconstruction of evolutionary relationships among Stramenopiles and mapping of TF and TR families on the tree. Orthologous gene sets were identified using OrthoMCL with an MCL inflation value of 1.5. All the clusters with at most one orthologous gene for each species of Stramenopiles were selected for phylogenetic inference; see the main text for details. All clades have a bootstrap support of 1, except the clade with a blue dot, which has a bootstrap value of 0.986. Gene family losses appear in green and gains are shown in blue. Families that differ in size between the autotrophic and the heterotrophic group appear in red. Some traits characteristic of autotrophic and heterotrophic organisms are shown in cyan and purple, respectively. doi:10.1371/journal.pone.0111841.g002

Figure 3. Families correlated with traits after phylogenetic independent contrasts analysis and false discovery rate correction. Colors represent each trait analyzed: in blue, number of cell types during the life cycle; in red, presence or absence of flagella; and in green, number of predicted proteins encoded by the nuclear genome. Genome size does not show correlation with any trait. doi:10.1371/journal.pone.0111841.g003

Table 3. Number of phylogenetic clusters generated with OrthoMCL.

Table S6 P-values for phylogenetic logistic regression with Firth correction. P-values for each family after Firth correction are shown for each binary trait used in this study. No families are significantly correlated, since none of the p-values is < 0.05. (XLSX)