Fig 1.
Phylogenetic distribution of the Merlin elements in eukaryotes.
The cladogram was drawn based on [32], and the subdivisions of Bilateria follow the NCBI Taxonomy. Green boxes indicate the presence of Merlin in at least one species of the taxonomic group, and the groups analyzed by tblastn are indicated; gray boxes indicate that no genome sequence is available for the group; white boxes indicate that no Merlin sequence was found; ? indicates that the presence of Merlin remains unclear. G = genus; F = family; O = order; C = class; P = phylum; K = kingdom. Some taxa that have no genome available were omitted from the tree (Chaetognatha, Gnathostomulida, Syssomonas from Holozoa; Colponemidia, Acavomonas from Alveolata; Jakobida, Tsukubamonas from Discoba; Symbiontida from Euglenozoa).
Fig 2.
Main characteristics of Merlin sequences that were identified in this work.
Similar Merlin copies within the same species that share TIR sequences were grouped into families and identified by the letter F followed by a number. Groups of sequences within the same species that lack TIRs and TSDs were divided according to nucleotide divergence and identified by the letter G followed by a number. The residues aligned at the positions of the DDE motif are shown, and conserved residues are highlighted in bold. TSD logos represent the nucleotide usage at each position; the y-axis ranges from 0 to 2 bits. TIR sequences are shown as both the 5’ TIR and the reverse complement of the 3’ TIR. Sequences are majority-rule consensuses derived from the alignment of multiple copies of each family, or individual copies in some cases, and mismatches between the two TIRs are shown as degenerate bases (R = A or G, Y = C or T, S = G or C, W = A or T, K = G or T, and M = A or C). A) The Merlin family from P. purpureum has a conserved DDE motif and TIRs of 39 bp; the 9-bp TSD logo is a frequency plot based on one conserved copy. B) Merlin families (F1 and F2) from E. gracilis carry the conserved DDE motif, 8-bp TSDs and almost perfect TIRs. The TSD logo from F1 is a frequency plot based on one preserved copy. C) Neighbor-joining tree of Merlin transposase proteins found in Monocercomonoides sp. showing at least 7 families (F1-F7) that present different TIRs and no clear TSD consensus. D) Neighbor-joining tree of Merlin-related proteins found in Alveolata (S.r–Stentor roeselii; S.c–S. coeruleus; P.p–Porphyridium purpureum; Sy.m–Symbiodinium microadriaticum; Sy.k–Symbiodinium kawagutii; P.m–Perkinsus marinus) based only on the conserved DDE_Tnp_IS1595 domain. However, the DDE motif is not conserved in all sequences, and TIRs and TSDs were identified in only a few of them. The TSD logo from F2 is a frequency plot based on two conserved copies. The limit of the TIRs from F1 is not clear.
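The TIR representation described above (5’ TIR compared with the reverse complement of the 3’ TIR, with mismatches written as degenerate bases) can be illustrated with a minimal Python sketch. The function names and the toy sequences are hypothetical and are not taken from this work; the sketch also assumes gap-free TIRs of equal length containing only A, C, G and T, which may not hold for real copies.

```python
# Two-base degenerate codes exactly as listed in the caption.
IUPAC_PAIRS = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("GC"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def merge_tirs(tir5, tir3):
    """Merge the 5' TIR with the reverse complement of the 3' TIR.

    Matching positions keep their base; mismatches become a degenerate base.
    Assumption: both TIRs have the same length and contain no gaps.
    """
    rc3 = reverse_complement(tir3)
    if len(tir5) != len(rc3):
        raise ValueError("this sketch assumes gap-free TIRs of equal length")
    merged = []
    for a, b in zip(tir5.upper(), rc3):
        merged.append(a if a == b else IUPAC_PAIRS[frozenset((a, b))])
    return "".join(merged)


# Hypothetical toy TIRs (not real Merlin sequences) with one C/T mismatch.
example = merge_tirs("GGTACCA", "TGATACC")
```

With these toy inputs the single mismatch is reported as Y, while a perfect TIR pair would be returned unchanged.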
Fig 3.
Alignment of the DDE catalytic motif region of Merlin families.
The three conserved blocks of residues surrounding the DDE motif identified by Feschotte [14] are shown. The number of residues between blocks 1 and 2 varies from around 50 to 70 aa, except for the sequences Sy.m_LSRX01000331.1, Sy.m_LSRX01000268.1 and Sy.m_LSRX01000807.1, which present a larger region (around 110 aa). The DDE motif positions are marked with asterisks above the alignment. Shading on the sequences denotes residue conservation: black > 90%; dark gray > 80%; light gray > 60%. All Merlin transposase proteins identified in this work were aligned with the Merlin sequence from C. briggsae (CAE74230). The consensus sequences for the three blocks of other Merlin transposases and IS1016 were obtained from [14] and added to the alignment (Cb–C. briggsae; Tm–Trichuris muris; Ag(A), Ag(B) and Ag(C)–Anopheles gambiae; Sm/Sj–Schistosoma mansoni and S. japonicum; Sj–S. japonicum; Ci–Ciona intestinalis; Dr(A) and Dr(B)–Danio rerio; Hs–Homo sapiens; Rc/Rs–Rickettsia conorii and R. sibirica; Hi/Hs–Haemophilus influenzae and H. somnus; Hp–H. paragallinarum; Mh–Mannheimia haemolytica; Nm–Neisseria meningitidis). Sequences from this work are identified by initials (M.sp–Monocercomonoides sp.; E.g–Euglena gracilis; S.r–Stentor roeselii; S.c–S. coeruleus; P.p–Porphyridium purpureum; Sy.m–Symbiodinium microadriaticum; Sy.k–Symbiodinium kawagutii; P.m–Perkinsus marinus; B.s–Bodo saltans; P.sp–Perkinsela sp.) and the contig/scaffold ID.
Some copies that are identical to others in these regions were omitted from the alignment (Sm_LSRX01000007.1 equal to Sm_LSRX01000898.1; M.sp_LSRY01000007.1 equal to M.sp_LSRY01000927.1; M.sp_LSRY01000020.1 equal to M.sp_LSRY01000078.1; M.sp_LSRY01001805.1 equal to M.sp_LSRY01000732.1; Sc_MPUH01000330.1 equal to Sc_MPUH01000210.1; Sc_MPUH01000105.1 and Sc_MPUH01000701.1 equal to Sc_MPUH01000682.1; Sy.k_VSDK01018746.1, Sy.k_VSDK01013050.1, Sy.k_VSDK01010916.1, Sy.k_VSDK01004385.1, Sy.k_VSDK01021388.1, Sy.k_VSDK01022794.1 and Sy.k_VSDK01027557.1 equal to Sy.k_VSDK01019235.1; Sy.m_LSRX01001116.1, Sy.m_LSRX01000462.1 and Sy.m_LSRX01000026.1 equal to Sy.m_LSRX01002035.1).
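The conservation thresholds used for shading (black > 90%; dark gray > 80%; light gray > 60%) can be illustrated with a short Python sketch. This is only an approximation under stated assumptions: conservation is taken as the frequency of the most common residue in a column, gaps count against conservation, and similar residues are not grouped, whereas the shading software actually used for the figure may apply different rules. The function names and the toy columns are hypothetical.

```python
from collections import Counter


def shade_column(column):
    """Return the shade for one alignment column given as a string.

    Assumptions: identity only (no similarity groups), and gap characters
    ('-') stay in the denominator, so they lower the conservation value.
    """
    residues = [r for r in column.upper() if r != "-"]
    if not residues:
        return "none"
    top_count = Counter(residues).most_common(1)[0][1]
    fraction = top_count / len(column)
    if fraction > 0.9:
        return "black"
    if fraction > 0.8:
        return "dark gray"
    if fraction > 0.6:
        return "light gray"
    return "none"


def shade_alignment(rows):
    """Shade every column of an alignment given as equal-length strings."""
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    return [shade_column("".join(col)) for col in zip(*rows)]


# Hypothetical toy alignment: column 1 is invariant, column 2 is variable.
shades = shade_alignment(["DA", "DG", "DK", "D-"])
```

Because the thresholds are strict inequalities, a column in which exactly 90% of the sequences share a residue falls in the dark gray class in this sketch.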
Fig 4.
Schematic representation of Merlin families containing tandem repeats.
A) Representation of Sy.m_F1 from Sy. microadriaticum containing a complex pattern of repeats in the 5’ region of the element. B) Representation of Sy.k_F1 from Sy. kawagutii showing a 5’ region that is highly divergent among copies in sequence and size, and a second divergent region at the 3’ end that contains tandem repeats in two copies. Copy 1–VSDK01027557.1, Copy 2–VSDK01013050.1. C) Sy.k_F2 from Sy. kawagutii with three divergent regions indicated. Due to missing data, the size of the first region cannot be estimated. The second divergent region contains 4 units of a 12-bp repeat in copy 2, while the third divergent region contains a 163-bp tandem repeat. Copy 1–VSDK01020766.1, Copy 2–VSDK01003368.1.
Fig 5.
Merlin copies from B. saltans and Perkinsela sp.
A) Representation of the most conserved Merlin copy from each species. An alternative internal ATG is shown. Both proteins possess the DDE_Tnp_IS1595 domain, whose coding region is shown in dark color. B) Neighbor-joining tree of Merlin copies based on nucleotide sequences and representation of their genomic context in the Perkinsela sp. genome. C) Neighbor-joining tree of Merlin copies based on nucleotide sequences and representation of their genomic context in the B. saltans genome. ORFs are represented by numbered boxes, and the arrows indicate their direction. Additional information on the genes is available in S6 Table. Colored boxes indicate related ORFs. Red boxes are Merlin copies, and red boxes without an outline are non-coding regions with similarity to the Merlin protein in the tblastn search. The relative position of the alignment with the Merlin reference copy is written in red.
Fig 6.
Unrooted 50% majority rule consensus Bayesian tree (WAG + G) of Merlin and IS1595 group sequences based on the amino acid sequence of the conserved transposase domain DDE_Tnp_IS1595 (168 positions).
Posterior probability values (PP) are indicated near the nodes, and some of the values from derived clades were omitted. The * sign near PP values indicates that the clade was supported with bootstrap higher than 50 in the ML tree. Merlin sequences from different taxonomic groups are highlighted in different colors and identified with initials: AL–Alveolata, CN–Cnidaria, CO–Chordata, NU–Nucletmycea, PR–Protostomia, SA–Stramenopiles, RO–Rhodophyceae, ME–Metamonada, DI–Discoba. Information on Repbase Merlin and IS1595 sequences is found in S1 and S2 Tables, respectively. Merlin sequences characterized in this work are identified by the initials of the taxon followed by the abbreviation of the species name (S.r–Stentor roeselii; S.c–S. coeruleus; P.p–Porphyridium purpureum; Sy.m–Symbiodinium microadriaticum; Sy.k–Symbiodinium kawagutii; P.m–Perkinsus marinus; E.g–Euglena gracilis; B.s–Bodo saltans; P.sp–Perkinsela sp.) and the group or family.