Can RNA-Seq Resolve the Rapid Radiation of Advanced Moths and Butterflies (Hexapoda: Lepidoptera: Apoditrysia)? An Exploratory Study

Recent molecular phylogenetic studies of the insect order Lepidoptera have robustly resolved family-level divergences within most superfamilies, and most divergences among the relatively species-poor early-arising superfamilies. In sharp contrast, relationships among the superfamilies of more advanced moths and butterflies that comprise the mega-diverse clade Apoditrysia (ca. 145,000 spp.) remain mostly poorly supported. This uncertainty, in turn, limits our ability to discern the origins, ages and evolutionary consequences of traits hypothesized to promote the spectacular diversification of Apoditrysia. Low support along the apoditrysian “backbone” probably reflects rapid diversification. If so, it may be feasible to strengthen resolution by radically increasing the gene sample, but case studies have been few. We explored the potential of next-generation sequencing to conclusively resolve apoditrysian relationships. We used transcriptome RNA-Seq to generate 1579 putatively orthologous gene sequences across a broad sample of 40 apoditrysians plus four outgroups, to which we added two taxa from previously published data. Phylogenetic analysis of a 46-taxon, 741-gene matrix, resulting from a strict filter that eliminated ortholog groups containing any apparent paralogs, yielded dramatic overall increase in bootstrap support for deeper nodes within Apoditrysia as compared to results from previous and concurrent 19-gene analyses. High support was restricted mainly to the huge subclade Obtectomera broadly defined, in which 11 of 12 nodes subtending multiple superfamilies had bootstrap support of 100%. The strongly supported nodes showed little conflict with groupings from previous studies, and were little affected by changes in taxon sampling, suggesting that they reflect true signal rather than artifacts of massive gene sampling. In contrast, strong support was seen at only 2 of 11 deeper nodes among the “lower”, non-obtectomeran apoditrysians. These represent a much harder phylogenetic problem, for which one path to resolution might include further increase in gene sampling, together with improved orthology assignments.


Introduction
The insect order Lepidoptera (moths and butterflies; >157,000 spp.; [1]) is arguably the largest single radiation of plant-feeding insects. A prominent element of terrestrial ecosystems, Lepidoptera function as herbivores, pollinators and prey, with substantial impact on humans. Highly destructive as agricultural pests, they have also become icons for environmental conservation, and supply food and fiber to multiple societies [2]. And, they provide important model systems for studies of genetics, physiology, development, and many aspects of ecology and evolutionary biology [3], including the question of why herbivorous insects, 25% of earth's known species, are so species-rich [4][5][6].
A robust phylogenetic framework is essential for all attempts to understand the diversity, adaptations and ecological roles of Lepidoptera. The past decade has seen tremendous advances in our understanding of lepidopteran phylogeny at all levels. Molecular data have proven especially powerful for defining superfamilies and relationships within them. In a remarkable burst of community progress, robust molecular phylogenies for nearly all of the major superfamilies (those containing hundreds to thousands of species), combined with review of the morphological evidence, have been published in the past few years or will be forthcoming shortly. Recent examples (not an exhaustive list) include studies of Bombycoidea [7], Gelechioidea [8], Geometroidea [9][10][11], Gracillarioidea [12], Noctuoidea [13,14], Papilionoidea [15], Pyraloidea [16], Tortricoidea [17], and Yponomeutoidea [18]. In all of these superfamilies, a majority of the major divergences (at least) seem credibly established, though important uncertainties remain. Progress is also now rapid at more subordinate levels.
The past few years have likewise seen the first attempts at "backbone" phylogenies spanning much or all of the order [19][20][21]. A recent such study [22], with the largest gene and taxon sampling to date, used 483 exemplars, representing 115 of the approximately 125 families of Lepidoptera [23], sequenced for up to 19 nuclear protein-encoding genes/14.7 kb. It gave a topology quite similar to those of earlier nuclear gene studies, but with stronger bootstrap support. It also agrees with newly-emerging evidence from whole mitochondrial genomes (e.g., [24,25]; see Discussion). The main conclusions of the Regier et al. study [22] are summarized in Figure 1.
The so-called non-ditrysian lineages ( Figure 1, left side) are mostly species-poor but rich in morphological variation, and often have apparently relictual distributions suggesting great age. Exhaustive comparative-anatomical studies of these groups (e.g., [26][27][28], an early application of Hennigian phylogenetics, yielded many synapomorphies and a well-resolved backbone phylogeny. Although important puzzles remain, the molecular data strongly resolve a majority of these early divergences, recovering previously-recognized major clades including Glossata, Heteroneura and Eulepidoptera ( Figure 1). There is also strong molecular support for several novel proposals, such as apparent non-monophyly of Palaephatidae. The molecular data strongly corroborate the clade Ditrysia, named for the presence in the female Terminalia  [22]. ML topology shown for degen1 (non-synonymous change only) is based on 100 GARLI searches. Bootstrap percentages are degen1 followed by nt123 (all nucleotides), based on 1000 bootstrap replicates with 15 search replicates each. Only values greater than 50% are shown. Branch lengths are arbitrary. '-' = node not found in ML tree for nt123. Numbers in parentheses after taxon names indicate number of families/number of exemplars studied. Names in bold denote clades in which larvae are not typically phytophagous. Names in serif font denote clades in which adults typically bear ultrasound-detecting tympanic organs on the thorax and/or abdomen. Classification follows van Nieukerken et al. [1]. doi: 10.1371/journal.pone.0082615.g001 of separate openings for mating and for oviposition, which contains over 98% of lepidopteran species and 80% of the families.
The superfamilies of Ditrysia, in contrast to the nonditrysians, tend to be species-rich, cosmopolitan and less distinct morphologically, so that major groupings have been difficult to discern. The authoritative morphological hypothesis synthesized by Kristensen and collaborators [23,29,30] postulated only 11 tentative monophyletic groupings among the 33 ditrysian superfamilies recognized. Molecular data markedly strengthen resolution for the initial divergences within Ditrysia.
There is now strong molecular support (Figure 1) for the morphological inference that all Ditrysia apart from Tineoidea form a monophyletic group. Molecular data also strongly support four new or previously uncertain conclusions: (1) The Tineoidea themselves are paraphyletic with respect to all other Ditrysia; (2) Yponomeutoidea and Gracillarioidea are sister groups; (3) Yponomeutoidea and Gracillarioidea together form the sister group to the remaining Ditrysia; and (4), the remaining ditrysians form a strongly supported group consisting of Apoditrysia in an earlier sense [31,32] plus Gelechioidea. Apoditrysia sensu novo [1], now including Gelechioidea, are also supported by several morphological synapomorphies [8,33].
In striking contrast to those in earlier-originating clades, "backbone" relationships in the Apoditrysia sensu lato are almost entirely lacking in strong support from either molecules or morphology, although rogue taxon removal [34] helps somewhat. Recent large-scale molecular studies consistently recover monophyly of some variant of the huge group Obtectomera (107,551 spp.; [1]), originally proposed for families with relatively immobile pupae [31], but support is very weak (Figure 1). Molecular studies also find the large superfamily Gelechioidea to be closely related to Obtectomera, but again with weak support (Figure 1). Within Obtectomera, the morphological working hypothesis recognized a group Macrolepidoptera, consisting of the butterflies (Papilionoidea; 18,363 spp. [1]) and the familiar large moths (inchworms, cutworms, silkmoths and relatives; 5 superfamilies, 72,398 spp.; [1]). Molecular studies have instead consistently separated the butterflies from the large moths, and found that the latter, termed the Macroheterocera [1], are more closely related to the non-macrolepidopteran superfamily Pyraloidea (15,587 spp.; [1]). These findings too, however, have weak bootstrap support (Figure 1). Within Macroheterocera, neither nuclear genes nor morphology provide strong evidence for any relationships at all among superfamilies (Figure 1; but see 24,25). This phylogenetic uncertainty, in turn, limits the power of analyses of the origins, ages and evolutionary consequences of traits hypothesized to promote the spectacular diversification of Apoditrysia, which include 144,524 species in 93 families and 26 superfamilies according to a recent classification [1].
Low support along the apoditrysian backbone probably reflects rapid diversification, as in other major insect radiations [35,36]. The alternative explanation, of pervasive strong conflict among gene trees, found little support in our earlier studies [19]. If short branches resulting from rapid radiation are the problem, it may be feasible to strengthen resolution by radically increasing the gene sample. Empirical tests of this proposition, however, have been few. In this study we assess the potential of massive gene sampling for resolving the apoditrysian radiation by analyzing 741 gene sequences, obtained through RNA-Seq, in 46 exemplars spanning nearly all major lineages of Apoditrysia. The resulting dramatic but non-uniform increase in bootstrap support illustrates both the power and the complexity of the phylogenomic approach.

Taxon sampling and taxon set design
The goal of this study was to assess the degree to which RNA-Seq transcriptome data can increase the support for relationships among the superfamilies of Apoditrysia over that found in our previous 19-gene study [22]. Our 46 exemplars include 42 apoditrysians spanning 16 of 26 superfamilies and 34 of 93 families of Apoditrysia in a recent classification [1]. The distribution of our exemplars across that classification is shown in Table 1, while the collecting locality, accession number and other details for each specimen are given in Table  S1. The only large apoditrysian superfamily (>1,000 species) not sampled was Papilionoidea. The phylogenetic position of Papilionoidea is the focus of a forthcoming independent RNA-Seq study that is yielding results similar to those we report below (A. Y. Kawahara, in litt.) As outgroups we used two non-apoditrysian Ditrysia and two non-ditrysians. For two taxa we used previously published data: for Bombyx mori, we used the published genome (SilkDB; [37]), and for Striacosta albicosta, we reassembled raw sequences from an earlier study that used older sequencing technology [38]. The purpose of including S. albicosta was to gauge how much data can be extracted from such older transcriptome studies, and whether these data can be successfully incorporated into a phylogeny estimate based mainly on newer, larger transcriptome assemblies. For the other 44 taxa we generated transcriptomes de novo by RNA-Seq. We matched the taxa included as closely as possible to those in our previous backbone study [22]. Thirty-eight of the 44 species had been included in that study, and for a majority of these we were able to use the same specimen. Four other species were congeners of taxa in the earlier study, and an additional two belonged to the same subfamily and tribe (see Table S1). These substitutions were made because no more material of the same species or genus, respectively, was available. All of the specimens we sequenced came from the ATOLep collection built by the Assembling the Lepidoptera Tree of Life project (Leptree), and had been stored in 100% ethanol at -80° C, some for more than 20 years.
Taxon sampling in this exploratory study expanded in phases, from 16 to 38 to 46 exemplars, each with a separate phylogenetic analysis, as we sought to characterize the data and develop our informatic and analytical workflows. The initial test set focused (14/16 taxa) on one especially problematic tree region, the hypothesized group consisting of Cossoidea + Sesioidea + Zygaenoidea [30,32]. This assemblage, here termed the "CSZ clade", consists of 5996 species in 19 families according to van Nieukerken et al. [1], who merged Sesioidea into Cossoidea. It is one of very few groupings among apoditrysian superfamilies that is postulated in the morphologybased working hypothesis [29]. It also presents an exceptionally clear superfamily-level contrast in a major life history feature, internal versus external feeding: Cossoidea and Sesioidea are mostly stem borers, whereas Zygaenoidea are mostly external folivores. In analyses with the 19 Leptree genes (14.7 kb), the CSZ clade is only sometimes monophyletic, and always with very weak support [22]. A core subset of Zygaenoidea is reliably monophyletic, but Sesioidea, Cossoidea and Cossidae never are. Relationships of the sesioid families, the cossoid families and subfamilies, and the two aberrant (parasitic) families of Zygaenoidea (Epipyropidae and Cyclotornidae), to each other and to the "core" Zygaenoidea, are almost completely unsupported (e.g., Figure  1). The test data set also included one non-apoditrysian outgroup (Yponomeuta) and one putative apoditrysian outgroup, Bombyx mori.
After testing and improving our protocols using the 16-taxon test set, we added 22 more exemplars representing most of the other major lineages of Apoditrysia, focusing on the other large superfamilies (those with over 2000 species). Another eight taxa were then added for a final, 46-taxon analysis. These eight had been held back from the second analysis because we considered them especially likely to complicate tree estimation, either because they have much less data than the rest (Striacosta albicosta) or because they were previously identified as difficult-to-place or "rogue" taxa [22]. We wanted to see how much the inclusion/exclusion of such taxa would affect the results based on our very large gene samples.
An additional, related benefit to our stepwise increase in sampling is the evidence it provides on the effects of taxon sampling density, which has been of special concern in phylogenomics [ [39][40][41]. Strong conflicts among phylogenies of 16 and 38 and 46 taxa could suggest the presence of false signal due to taxon under-sampling, as could strong support in the RNA-Seq phylogenies for nodes contradicting strongly supported nodes in the much larger Leptree taxon sample ( Figure 1). Successive expansion of the taxon sample could also identify instances in which weak support is increased by denser taxon sampling.
To provide a controlled assessment of the potential benefits of massively increased gene sampling, we compared topologies and branch supports from RNA-Seq analyses both to those from the 19-gene, 483-taxon "backbone" phylogeny [22], and to new 19-gene analyses of 16-, 38-and 45-taxon data sets. The data sets for the 19-gene analyses were taken from the data matrix of Regier et al. [22]. For each species in the RNA-Seq data set, an associated Leptree exemplar from Regier et al. [22], listed in Table S1, was chosen to match it as closely as possible, and was used in our 19-gene analyses. In 38 cases, exactly the same species was used; a closely related substitute was used in six others. For Striacosta albicosta, not included in the "backbone" study, we substituted the con-tribal Agrotis ipsilon, included by Regier et al. [22], in the 19-gene analysis. We thought it unnecessary to substitute for Heliothis virescens, for which we also lack 19-gene data, because it already had a close relative in the 19-gene data set (Helicoverpa zea). Thus, the final 19-gene analysis used 45 exemplars instead of 46.

RNA-Seq data generation
Total RNA was extracted using Promega SV total RNA isolation mini-kits. The great majority of our specimens were adults; four were larvae (see Table S1), with species identifications verified by comparison of COI sequences with those in the Barcode of Life Data System [42]. For larger moths we used the thorax and/or anterior part of the abdomen; for a few smaller ones we used the entire body. RNA extracts were submitted to the University of Maryland-Institute for Bioscience and Biotechnology Research Sequencing Core. The quality of total RNA was assessed by capillary electrophoresis on an RNA chip using an Agilent Bioanalyzer 2100 system. RNA preps of sufficient quality were subjected to poly-A selection and indexed library construction for sequencing on an Illumina HiSeq1000. Following Hittinger et al. [43] our libraries were left un-normalized, so as to favor highly-expressed genes likely to be present in most species and life stages. Libraries were run four per lane, yielding about 110 million 100-bp paired-end reads per taxon.

Sequence quality control and transcript assembly
Reads that did not pass the default Illumina HiSeq1000 "Chastity" quality filter (~5-20% per sample), and those with Phred quality score [44] not greater than 20 at greater than 90% of positions (~5-15% per sample) were discarded. The filtered reads input to assembly (mean = 76M per sample) had median Phred scores greater than 35 for over 95% of the bases in each read.
De novo transcriptome assembly was performed using both Trinity (versions r2012-03-17 and r2013-02-25 [45]) and Trans-ABySS (versions 1.3.2 and 1.4.4; ABySS versions 1.3.3 and 1.3.5 [46,47]), and the results compared (see Table 2) for numbers and length of transcripts using standard assembly metrics such as N50 (the length N for which 50% of all bases are contained in contigs of length L < N). A typical Trinity assembly required greater than 100 GB RAM and finished in 24 to 96 hours using 16 computer cores. A typical Trans-ABySS run required less than 4 GB RAM and a single processor, finishing in 1-2 hours. The same is true for each constituent ABySS run, of which there were 23 per sample (k ranging from 52 to 96 in steps of two). In general, Trinity used more RAM and more compute time, and produced fewer transcripts, than Trans-ABySS, but it produced longer transcripts (Table S2). Combining the Trinity and Trans-ABySS assemblies proved early on to yield a slightly more complete data matrix than either alone, which is why we continued to use both. The added cost of doing so was minimal once a workflow was established.
Some modification of these methods was necessary for reassembly of the Striacosta albicosta transcriptome [37]. We acquired the original 75-bp single-end Illumina reads, which were based on 16 individuals and normalized cDNA, and were not subjected to a "Chastity" filter. Application of our Phred filter eliminated 61% of the reads. We modified Trans-ABySS to work with single-end data, and optimized its k-mer sweep for 75-bp reads (k ranged from 38 to 74 in steps of two). The original assembly contained 16,850 contigs of median length 173 bp; our combined Trinity and Trans-ABySS assembly yielded 336,829 contigs of median length 114 bp, including over 15,000 contigs of median length 351 bp from the Trinity assembly alone.
In the first step of the HaMStR procedure, regions of our transcript assemblies (expressed as amino acid sequences) that matched any one of the 1579 Insecta core-ortholog pHMMs were provisionally assigned to the corresponding orthologous group. To reduce the number of highly divergent, potentially paralogous sequences returned by this initial search, we changed the E-value cutoff defining a "hit" to 10 -5 , from the HaMStR default of 1.0, and retained only the topscoring quartile of hits. In the next HaMStR step, the provisional "hits" from the Insecta search were compared to a "reference taxon" (Bombyx mori), and retained only if they survived a reciprocal best BLAST hit test with that taxon. Once assigned to orthologous groups, protein sequences from our assemblies were aligned using MAFFT [53]. The resulting protein alignments were then converted to the correct corresponding nucleotide alignments, using a custom Perl script that substitutes for each amino acid the proper codon from the original coding sequence.
Following initial orthology assignments, we computed "coverage per base" for each orthologous group, defined as read length x median number of reads mapped to orthologous group sequences ÷ median length of orthologous group sequences. Read mappings used Bowtie (version 0.12.8; [54]), allowing up to four mismatches.

Data matrix construction and paralogy filtering
Our orthology determination pipeline often yields multiple sequences for a particular taxon-locus combination, which can reflect the presence of, among other possibilities, multiple orthologs, heterozygosity, alternatively spliced transcripts, paralogy (including inparalogs; [55]), and sequencing errors. One general approach for reducing this variation to a single sequence, as required for phylogenetic analysis, is exemplified by the "REPRESENTATIVE" option in HaMStR [48]. This procedure chooses the single sequence (or concatenation of nonoverlapping fragments) with the best pairwise alignment to a chosen reference taxon. We developed an alternative that accommodates the uncertainty in orthology determination by combining the set of sequences into a single consensus sequence, using nucleotide ambiguity codes [56] as necessary. Consensus sequences were generated by providing the alignment of the nucleotide coding sequences corresponding to the amino acid sequences passing our filtering steps, described above, to the consensus_iupac BioPerl subroutine [57]. There are two principal motivations for this "CONSENSUS" approach. The first is a desire to incorporate all information about specific nucleotide states for positions that might reasonably be inferred to be orthologous, including those where orthologous relationships among genes between pairs of taxa are many to one, and many to many, as well as cases of polymorphism. A second motivation is to mitigate the effects of mistaken orthology determination and other errors, including those resulting from incorrect choice of a single representative sequence, by in effect reducing the weight of positions at which transcription fragments differ. By including more available transcription fragments, moreover, CONSENSUS can potentially yield longer total sequences than REPRESENTATIVE, as has been our experience. However, degenerating nucleotide sites that vary among transcripts could result in dilution of phylogenetic information, if the single best sequence chosen by REPRESENTATIVE were almost always the most phylogenetically appropriate one. The approach that works best is thus an empirical question, which we addressed by performing both procedures and comparing the results.
Despite the filters described above, inspection of our initial 1579 alignments revealed obvious paralogs. An extreme example is orthologous group 412460 of the Insecta Hmmer3-2 database, annotated there as acetyl-CoA acetyltransferase, a type of thiolase. In our data, HaMStR search returned two divergent sets of sequences for this ortholog group, which upon BLAST search matched two different members of the thiolase gene family in a noctuid moth. No single E-value threshold can eliminate problems of this kind, so we turned to direct scrutiny of gene trees (e.g. [58][59][60]). Using the initial 16 test taxa, a maximum likelihood (ML) gene tree was constructed for each orthologous group using all matching sequences, and provided as input to the program PhyloTreePruner [61]. If the sequences for a particular taxon form a polyphyletic group, the program prunes the gene tree to the maximal subtree in which the non- polyphyly criterion is met for all taxa. For the 16 test taxa, PhyloTreePruner pruned 838 of the 1579 gene trees to some degree. For this exploratory study we have taken a very conservative action based on these results, using for all subsequent phylogenetic analyses only the 741 genes in which no evidence of paralogy was found in the test taxa, and entirely ignoring the remaining genes; alternative possibilities for future studies are considered in the Discussion section. Following application of the paralogy filter, the 741 putative ortholog alignments were concatenated, adding gaps for missing data as necessary using a custom Perl script. For all phylogenetic analyses the nucleotide matrix was subjected to degen1 coding (version 1.4; [62]), and sites not represented by sequence data in at least four taxa were subsequently removed. "Degen" uses degeneration coding to eliminate all synonymous differences among species from the data set, resulting in phylogeny inference based only on non-synonymous nucleotide change. This procedure was shown in our previous backbone study [22] to generally improve recovery of deep nodes. At deeper levels in the Lepidoptera, inclusion of synonymous change in any form, even as part of a codon model, sometimes introduces conflict and systematic error, due to compositional heterogeneity [21,22]. Analysis under degen1 can be viewed as a computationally efficient approximation to a purely "mechanistic" amino acid model, i.e., one based on the genetic code but not incorporating empirical transition frequencies between amino acids [21,63,64]. Sequences and alignments for the 19-gene analyses were extracted from Table S4 of the Leptree backbone study [22]. Nine of these genes are present in the Insecta Hmmer3-2 database. The PCR amplicon codes of these nine from Regier et al. [22] are: 40fin, 109fin, 192fin, 262fin, 265fin, 268fin, 3007fin, 3070fin, and CAD. Five of these genes were eliminated by our paralogy screen, while the following four, listed by their numbers in the Insecta Hmmer3-2 database, were included among the 741 used in phylogenetic analyses: 413101 -262fin; 412564 -268fin; 412293 -265fin; 412031 -40fin.

Phylogenetic analysis
Maximum likelihood phylogenetic analyses used GARLI (Genetic Algorithm for Rapid Likelihood Inference; version 2.0 [65]) and grid computing [66,67] via a web service on molecularevolution.org [68] based on tools developed by Bazinet et al. [69] that include post-processing with DendroPy [70], R [71], and custom Perl scripts. The majority of the phylogenetic analyses were completed using the BOINC volunteer computing platform [72] (http:// boinc.umiacs.umd.edu). We used a GTR+I+G nucleotide model together with GARLI default settings, including random stepwise addition starting trees, except that we halved the number of successive generations yielding no improvement in likelihood score that prompts termination (genthreshfortopoterm = 10000), as suggested for bootstrapping in the GARLI manual. Memory requirements ranged from 800 MB for the 16-taxon, 741-gene analysis to 3500 MB for the 46-taxon, 741-gene analysis. Each best tree was selected from 100 GARLI search replicates, while bootstrap analyses consisted of 1000 replicates. Insufficient search effort during bootstrapping has been shown to artificially depress bootstrap support (BP) values [22]. A rough guide to the effort needed is provided by our initial 100 ML search: if the best tree topology is found only rarely, multiple search replicates per bootstrap replicate may be necessary. We tested each of our data sets for the effect of increased search effort on BP values, at levels of one, five, and ten search replicates per bootstrap replicate. We found a significant increase in BP values for several analyses using five search replicates instead of one, but did not find a significant improvement using ten search replicates instead of five. Thus, all results presented here used five search replicates per bootstrap replicate.
The 741-gene and 19-gene data matrices have been deposited in Dryad (doi:10.5061/dryad.02qv3). The Illumina reads have been deposited in the NIH Sequence Read Archive, as BioProject PRJNA222254.

Data matrix properties
The paralogy-filtered matrix of 741 genes contained from 742,017 to 873,036 nucleotide positions and was 80-93% complete, depending on the number of taxa included and the orthology determination procedure used (Table S3). Thus, overall matrix completeness was slightly higher than in the 14.7 kb, 483-taxon Leptree analysis [22]. Completeness was fairly consistent among the 44 newly-sequenced taxa, ranging from, e.g., 67% to 84% for the 46-taxon, 741-gene CONSENSUS matrix (Table S2). Our reassembly of the previously-published Striacosta albicosta sequence reads [38] yielded sequence for 1138 orthologous groups, whose median sequence length was 147 bp. Thus, in the paralogy-filtered 46-taxon data matrices, for example, Striacosta has approximately half the data of our other taxa (Table S2). Coverage per base (Table S4) averaged 103× for 15 test taxa, with a range of 31× to 334×.

Phylogenetic results
The tree of maximum likelihood found for both the 46-taxon, 741-gene CONSENSUS data set and its REPRESENTATIVE counterpart is shown in Figure 2, together with bootstrap values for the CONSENSUS and REPRESENTATIVE 46-taxon, 741-gene data sets and the 45-taxon, 19-gene data set. A phylogram version of the same tree is given in Figure S1. ML cladograms and bootstrap values for all other data sets are given in Figures S2-S5.
The two alternative procedures for determining a single sequence per taxon-locus combination for phylogenetic inference when orthology search returns multiple "hits", i.e., REPRESENTATIVE and CONSENSUS, yielded identical ML topologies, and nearly identical bootstrap values (Figure 2). A marked difference between the two procedures was observed in the 38-taxon analysis, for which finding the best tree topology took considerably more search effort for REPRESENTATIVE than for CONSENSUS: out of 100 ML searches, the best tree topology was found 25 times for the CONSENSUS matrix, but only once for the REPRESENTATIVE matrix. However, we found no such difference for either the 16-or 46-taxon analyses; in those cases, a comparable amount of search effort for each procedure was required to find the best tree topology. An experiment described in Supplementary Text S1suggested that the greater search effort required for REPRESENTATIVE in the 38-taxon case stems from conflicting signal in a small proportion of nucleotide positions in that matrix which are left ambiguous in the CONSENSUS matrix. The most dramatic pattern in the results is the much greater frequency, across all taxon sets, of strong support for nodes subtending multiple superfamilies in the 741-gene analyses than in either the corresponding 19-gene analyses or the 483taxon "backbone" study. For example, in the 46-taxon, 741gene ML topology of Figure 2, there are 22 nodes within Apoditrysia that subtend taxa assigned to different superfamilies in either the newest classification [1] or its immediate predecessor [23]. Of these, 11 have bootstrap support (BP) of 100%, two additional nodes have BP ≥98%, and one additional node has BP >80%, for a total of 14/22 nodes with "strong" or "very strong" support ( Figure 2). In contrast, of 23 nodes subtending multiple superfamilies in the ML topology for the 45-taxon, 19-gene matrix ( Figure S1), none have BP ≥80%; only one has BP >70%, and only three have BP >50% (Figures 2, S1).
Strong deeper-node support in the 741-gene analyses is not spread evenly across the Apoditrysia, but is restricted almost entirely to a clade consisting of Obectomera sensu van Nieukerken et al. [1] + Gelechioidea + Pterophoroidea ( Figure  2). Of the 12 nodes within and including that clade which subtend multiple subfamilies in recent classifications, 11 have BP = 100% and all have BP >80%. In contrast, of the 11 such nodes elsewhere among the Apoditrysia, none have BP=100% and only two have BP >80%.
Tree topology changed little as taxon sampling expanded for 741 genes. Table 2 summarizes the main differences in topology and bootstrap support among the 16-, 38-and 46taxon analyses. In no comparison among trees for different numbers of taxa were there incompatible nodes that each had strong bootstrap support. Thus, there is little evidence for artifactual strong support resulting from taxon undersampling. The most notable conflict concerns monophyly of the putative CSZ clade. In the 16-taxon analysis, which includes only one apoditrysian (Bombyx) apart from the putative CSZ clade, that clade gets 82% bootstrap support ( Figure S5). In contrast, the 38-and 46-taxon analyses, which include many other apoditrysian lineages, find the CSZ assemblage to be paraphyletic with respect to the clade Obtectomera + Gelechioidea + Pterophoroidea. Bootstrap support for this conclusion, however, is only 43% and 59% for 38 and 46 taxa, respectively (Figures 2, S3). The most striking instance of decline in bootstrap without change in topology involves the grouping of the "CSZ clade" constituents with the Obtectomera, to the exclusion of other apoditrysians. The 90% bootstrap support for this grouping in the 38-taxon analysis falls to 27% in the 46-taxon analysis, which includes three additional nonobtectomeran superfamilies.
The evidence is stronger for a positive effect of taxon sampling density on node support. The clearest examples are the contrasting positions of Noctuoidea and Drepanidae in the 38-versus 46-taxon, 741-gene analyses. In the 38-taxon analysis ( Figure S3), which is missing several small groups (Cimeliidae, Axiidae, Doidae) that may or may not represent distinct superfamilies of Macroheterocera [1,23], Noctuoidea are grouped with Drepanidae, but with weak support (BP = 54%). When the three missing groups are added, as part of the 46-taxon analysis, Drepanidae and Noctuoidea are no longer paired, but the new positions of these two taxa, together with those of the newly-added families, are all supported by BP=100%. A beneficial effect of denser taxon sampling on node support is also suggested by the generally lower support in our new 19-gene analyses of 16, 38 and 46 taxa than in our previous 19-gene, 483-taxon study [22]. For example, bootstrap support for Apoditrysia, 98% in Regier et al. [22], is only 58% here in the 19-gene, 45-taxon analysis. Moreover, unlike the 483-taxon study, the 19-gene, 45 taxon analysis also fails to support monophyly for Pyraloidea and for Macroheterocera. An interaction between gene and taxon sampling is suggested, finally, by the fact that the 45-taxon, 741-gene analysis supports the monophyly of both Pyraloidea and Macroheterocera with BP=100%.

Discussion
Our results suggest that the expansive gene sampling yielded by RNA-Seq may be able to strongly resolve intersuperfamily relationships throughout a clade consisting of Obtectomera sensu van Nieukerken et al. [1] plus Gelechioidea and Pterophoroidea (at least), comprising over two-thirds of the species of Lepidoptera. But, might these high bootstraps be misleading? Multiple authors have urged caution in the interpretation of bootstrap support in phylogenomic studies (e.g., [40,73]) or even abandonment of bootstraps altogether in favor of other support measures [74]. If random error is sufficiently reduced by massive gene sampling, strong but misleading bootstrap support might arise from even subtle forms of pervasive systematic error, such as minor compositional heterogeneity or slight differences in the relative abundance of strongly-conflicting gene tree topologies, as well as from long-branch attraction due to the typically sparse taxon sampling in phylogenomics.
How could we judge whether the strong support seen in our results is artifactual? That explanation would gain credence if the strongly supported nodes repeatedly conflicted with groupings that were robustly supported, or at least consistently monophyletic, in previous studies. In fact, however, the topology of the RNA-Seq phylogeny of Figure 2 is closely similar, though not identical, to that of the 483-taxon, 19-gene study ( Figure 1) and to those of earlier molecular studies [19][20][21]. It is also consistent, in topology and node support levels, with recent studies using whole mitochondrial genomes [24,25]. All strongly-supported relevant nodes from previous nuclear gene studies are also strongly supported by the RNA-Seq analysis. Nowhere in the tree does a strongly supported node in the phylogenomic study contradict a strongly supported node in any earlier study. Moreover, it appears that limited taxon sampling, rather than inducing artifacts, can be better overcome by the RNA-Seq data than by the 19-gene data: in the 38-and 46-taxon analyses, the RNA-Seq data strongly support the monophyly of Pyraloidea, for which previous molecular and morphological evidence is definitive, whereas the 19-gene data fail to group the two pyraloid exemplars.
A second reasonable expectation, if strong support in the phylogenomic results were largely artifactual, is that such support should be distributed across all levels in the tree. Indeed, some of the forces that can produce strong false signal, such as convergence in amino acid composition and long branch attraction, should be more likely for deeper than for shallower divergences. But in fact, within Apoditrysia, strong support from RNA-Seq is concentrated in the subordinate clade Obtectomera, while the deeper divergences have uniformly weak support.
These observations -agreement of strong support with previous groupings, and decreasing signal strength with increasing depth of divergence within Apoditrysia -suggest that such strong support as we find in the RNA-Seq results is real rather than artifactual. They further suggest that even with 741 genes, we are still data-limited: we do not yet have enough characters to fully resolve all stages of the rapid radiation of the Apoditrysia. On the plus side, however, it also appears that, unlike many previous phylogenomic studies, we are not working with levels of divergence at which strong bootstrap support, even from entirely non-synonymous change, is both inevitable and often misleading [40,73,74].
If, as we argue, the strong support seen in our 741-gene analyses is real, it appears that further taxon sampling could quickly produce major advances in our understanding of the huge clade Obtectomera. Precise definition of this clade has been difficult, and the placement of multiple superfamilies has been unclear. Our results suggest that there is a sharp discontinuity between superfamilies that are and are not strongly supported as near relatives of the Macroheteroceran moths. If this distinction holds up under further taxon sampling, it would be reasonable to use it to define the Obtectomera, which would then include both Gelechioidea and Pterophoroidea. It appears that RNA-Seq may be able to definitively resolve all or nearly all relationships within Obtectomera so redefined. There is very strong support for monophyly of Macroheterocera sensu van Nieukerken et al., and for Mimallonidae as the sister group to these. It might make sense to include Mimallonidae in Macroheterocera. There is also very strong support for a sister group relationship of Mimallonidae + Macroheterocera to Pyraloidea.
All of the superfamilies of Macroheterocera are sampled here, and relationships among them, with one possible exception, are all strongly supported. The basal divergence is between a clade consisting of Cimeliidae + (Doidae + Drepanidae) and one containing the remaining four superfamilies; an identical or similar division, albeit weakly supported, is seen in previous molecular studies. The first grouping corroborates the recent incorporation of all three families into Drepanoidea sensu novo [1], and increases the evidence for removal of Doa from Noctuoidea, despite its possession of the two main noctuoid morphological synapomorphies. Within the clade consisting of Noctuoidea, Geometroidea, Bombycoidea and Lasiocampoidea, the latter two are strongly grouped, and only the node uniting these with Geometroidea (BP=84%) has bootstrap support of less than 100%. The position of Epicopeiidae, weakly supported in all previous studies, strongly corroborates their transfer from Drepanoidea to Geometroidea [1,22]. The close relationship between Geometroidea and Bombycoidea + Lasiocampidae suggested here may explain why Epicopeiidae sometimes grouped (weakly) with the latter in earlier studies [20].
Although Papilionoidea, formerly grouped with the "big moths", were not included in this study, one can confidently predict, from earlier studies (Figure 1), that they would fall among the "lower" Obtectomera. In Figure 2 this would mean, somewhere between the base of Obtectomera and the base of Pyraloidea + Macroheterocera. This prediction has recently been strongly confirmed by studies based on mitochondrial genomes [24,25] and on RNA-Seq (A. Y. Kawahara, in litt.), although the exact sister group of the butterflies will not be known until sampling of the non-macroheteroceran superfamilies of Obtectomera is complete.
While prospects for resolving the Obtectomera sensu lato look promising, the outlook is less bright in the "lower," i.e. nonobtectomeran, Apoditrysia. In this tree region only two nodes subtending multiple current or former superfamilies get bootstrap support approaching conclusive levels ( Figure 2). There is 99% bootstrap support for a clade consisting of Cossoidea sensu stricto [23] plus Castniidae, formerly placed in Sesioidea [23,38]. If this grouping holds up under further RNA-Seq sampling, it may be useful to redefine Cossoidea to conform to it. Such a definition would re-exclude Sesiidae, included here by van Nieukerken et al., [1], for which no strong placement has been discovered. Within the putative Cossoidea sensu novo, only a single inter-family relationship gets notable bootstrap support, namely, the novel pairing of Castniidae with Dudgeonidae (BP=98%). The relationships of the four cossid subfamilies sampled, to each other and to Castniidae + Dudgeonidae, have weaker support (BP = 71-81%). Elsewhere in the non-obtectomeran Apoditrysia, no bootstrap value exceeds 59%. Phylogenetic relationships in the Cossoidea-Zygaenoidea-Sesioidea complex will clearly need much further work.
Why are the "lower" Apoditrysia such a difficult phylogenetic problem, in comparison to lepidopteran lineages of both greater and lesser age? Several complementary explanations seem plausible. Cladogenesis might have been particularly rapid at the base of Apoditrysia as compared to later on, resulting in especially short internal branches. Alternatively, the rate of subsequent extinction might have been high, reducing the taxon sample available for reconstructing rapid cladogenesis. Or, these divergences might be harder to reconstruct simply because they are older than those in Obtectomera, leaving more time for synapomorphies to be overwritten by subsequent substitution. Increasing the gene sample might allow us to overcome the first and third effects. To overcome the second effect, we would want to sample taxa as densely as possible, but would face limits set by extinction. Fortunately, as our results so far have shown, gene and taxon sampling are to some degree interchangeable. Therefore, more gene sampling might help in this case as well. Thus, further expanding the gene sample may be critical to further resolution of the lower Apoditrysia, no matter why these lineages are so refractory to phylogenetics.
One immediate way to increase our gene sample would be to relax our severe initial interpretation of the PhyloTreePruner results, under which only genes for which no evidence of paralogy was found were considered suitable for phylogenetic analysis. Following Kocot et al. [61], one could recover some of the information thereby lost by estimating bootstrap support for the individual gene trees and avoiding pruning when support is weak. One could also include the partially incomplete pruned gene trees, from which the apparent paralogs have been deleted, in phylogeny calculations. While these measures might be useful, a potentially more profitable approach in the long run would be to address the underlying problem that led us to PhyloTreePruner in the first place. The Insecta Hmmer3-2 database was a highly useful starting point, but for two reasons it is not ideal for studies within Lepidoptera. First, it contains only the 1579 genes that were identifiably orthologous across six very divergent arthropod and annelid genomes. Comparisons restricted to Lepidoptera would undoubtedly yield a much higher number of useful genes; for example, the complete proteome of the diamondback moth (Yponomeutoidea: Plutellidae: Plutella xylostella) is close to 15,000 genes [75]. Second, presumably because most of the taxa on which the database is built are so divergent from Lepidoptera, many of its putative ortholog groups appear to include sequences that are non-orthologous in Lepidoptera. Therefore, it would be useful to have a new database of Lepidoptera-specific gene models for orthology determination in the Apoditrysia. Such an effort could capitalize on a growing set of annotated lepidopteran genomes and transcriptomes, which now includes multiple apoditrysians as well as a member of the sister group to Apoditrysia [75][76][77][78].

Summary and Conclusions
This study explored the potential of next-generation sequencing to conclusively resolve relationships among the superfamilies of advanced ditrysian Lepidoptera (Apoditrysia), which were very weakly supported in previous nuclear gene studies. We used RNA-Seq to generate 1579 putatively orthologous gene sequences across a taxonomically broad sample of 40 apoditrysians plus four outgroups, to which we added two taxa using previously published data. Phylogenetic analysis of a 46-taxon, 741-gene matrix, resulting from a strict filter that eliminated ortholog groups containing any apparent paralogs, yielded dramatic overall increase in bootstrap support for deeper nodes within Apoditrysia as compared to results from previous and concurrent 19-gene analyses. High support was restricted mainly to the huge apoditrysian subclade Obtectomera broadly defined, in which 11 of 12 nodes subtending multiple superfamilies had bootstrap support of 100%. The strongly supported nodes showed little conflict with groupings from previous studies, and were little affected by changes in taxon sampling, suggesting that they reflect true signal rather than artifacts of massive gene sampling. Additional taxon sampling has the potential to definitively resolve obtectomeran superfamily relationships. In contrast, strong support was seen at only 2 of 11 deeper nodes among the "lower", non-obtectomeran apoditrysians. These represent a much harder phylogenetic problem, for which further increase in gene sampling, together with improved orthology assignments, offers one potential path to resolution.  Text S1. Exploration of differences between CONSENSUS and REPRESENTATIVE methods for determining a single sequence per taxon-locus combination for phylogenetic inference. (DOCX)