The author(s) have made the following declarations about their contributions: Conceived and designed the experiments, analyzed the data, contributed reagents/materials/analysis tools, and wrote the paper: The International Aphid Genomics Consortium.
¶ Membership of the International Aphid Genomics Consortium is provided in the Acknowledgments.
The authors have declared that no competing interests exist.
The genome of the pea aphid shows remarkable levels of gene duplication and equally remarkable gene absences that shed light on aspects of aphid biology, most especially its symbiosis with
Aphids are important agricultural pests and also biological models for studies of insect-plant interactions, symbiosis, virus vectoring, and the developmental causes of extreme phenotypic plasticity. Here we present the 464 Mb draft genome assembly of the pea aphid
Aphids are common pests of crops and ornamental plants. Facilitated by their ancient association with intracellular symbiotic bacteria that synthesize essential amino acids, aphids feed on phloem (sap). Exploitation of a diversity of long-lived woody and short-lived herbaceous hosts by many aphid species is a result of specializations that allow aphids to discover and exploit suitable host plants. Such specializations include production by a single genotype of multiple alternative phenotypes including asexual, sexual, winged, and unwinged forms. We have generated a draft genome sequence of the pea aphid, an aphid that is a model for the study of symbiosis, development, and host plant specialization. Some of the many highlights of our genome analysis include an expanded total gene set with remarkable levels of gene duplication, as well as aphid-lineage-specific gene losses. We find that the pea aphid genome contains all genes required for epigenetic regulation by methylation, that genes encoding the synthesis of a number of essential amino acids are distributed between the genomes of the pea aphid and its symbiont,
Aphids are small, soft-bodied insects with elaborate life cycles that include all-female, parthenogenetic generations that alternate with sexual generations (
During the spring and summer months, asexual females give birth to live clonal offspring (see photo). These offspring undergo four molts during larval development to become (A) unwinged or (B) winged asexually reproducing adults. Winged individuals, capable of dispersing to new plants, are induced by crowding or stress during prenatal stages. After repeated cycles of asexual reproduction, shorter autumn day lengths trigger the production of (C) unwinged sexual females and (D) males, which can be winged or unwinged in pea aphids, depending on genotype. After mating, oviparous sexual females deposit (E) overwintering eggs, which hatch in the spring to produce (F) wingless, asexual females. In some populations, especially in locations without a cold winter, the sexual and egg-producing portions of the life cycle are eliminated, leading to continuous cycles of asexual reproduction (photo by N. Gerardo; illustration by N. Lowe).
Phloem sap is rich in simple sugars but contains an unbalanced mixture of amino acids. This unbalanced diet is compensated for by the intracellular mutualistic bacterium,
(A) Transmission electron micrograph showing elongate
Aphids, which are essentially plant parasites, have evolved complex life cycles involving extensive phenotypic plasticity
Here we present the genome sequence of the pea aphid,
The haploid pea aphid genome of four holocentric chromosomes (three autosomes and one X chromosome) was estimated by flow cytometry for the sequenced pea aphid line LSR1.AC.G1 to be 517 Mb (
The assembled regions of the pea aphid genome have the lowest GC content of any insect genome sequenced to date; at 29.6%, pea aphid GC content is 5.2% lower than that of
Prior to this project, fewer than 200 pea aphid genes had been sequenced. Thus, we performed automated gene predictions to aid study of the pea aphid gene repertoire. High-quality gene models with either partial or full-length EST and/or protein homology support computed by NCBI's gene prediction pipeline serve as a core set of 10,249 protein-coding gene models and are integrated into the public RefSeq databases at NCBI. Since the number of gene models with EST or protein homology support is expected to be smaller than the true number of protein-coding genes in the pea aphid genome, additional gene models were calculated using six additional gene prediction programs and combined, using GLEAN
Gene Modeling Software | Prediction Type | Gene Models | mRNAs | Number of Exons Per mRNA | Average mRNA Length | Average Exon Length | Total Number of Exons | Total Exon Length |
NCBI RefSeq | Evidence | 11,089 | 11,308 | 7.6 | 1,908 bp | 251 bp | 86,018 | 21.6 Mb |
NCBI Gnomon | | 37,994 | 37,994 | 3.9 | 887 bp | 222 bp | 149,183 | 33.3 Mb |
Augustus | | 33,713 | 40,594 | 5.3 | 982 bp | 223 bp | 147,909 | 33.1 Mb |
Fgenesh | | 30,846 | 30,846 | 4.5 | 1,048 bp | 232 bp | 139,357 | 32.3 Mb |
Fgenesh++ | | 26,773 | 26,773 | 4.9 | 1,148 bp | 236 bp | 130,509 | 30.7 Mb |
Maker | | 23,145 | 23,145 | 6.0 | 854 bp | 142 bp | 138,596 | 19.8 Mb |
Geneid | | 62,259 | 62,259 | 2.9 | 553 bp | 194 bp | 177,361 | 34.5 Mb |
Genscan | | 32,320 | 32,320 | 3.5 | 844 bp | 241 bp | 112,777 | 27.3 Mb |
GLEAN | consensus | 36,606 | 36,606 | 4.3 | 943 bp | 220 bp | 156,578 | 34.5 Mb |
GLEAN(-refseq) | consensus | 24,355 | 24,355 | 2.8 | 657 bp | 233 bp | 68,632 | 16.0 Mb |
OGS 1.0 | NCBI RefSeq + non redundant GLEAN | 34,604 | 34,821 | 4.3 | 1,024 bp | 241 bp | 148,081 | 35.7 Mb |
NCBI RefSeq models are subdivided into 10,249 protein-coding models completely or partially based on EST or protein alignments, plus 840 pseudogene models containing debilitating frameshift or nonsense codons and noncoding RNAs. Where alternative transcripts exist, the primary transcript variants in RefSeq and Augustus were used in the mRNA/exon calculations. All exon calculations are based on coding sequences only. Average mRNA length does not include UTR sequences. OGS, Official Gene Set (RefSeq coding genes + non-redundant GLEAN).
The combined total of 34,604 gene predictions includes unsupported
We took advantage of the first genome for a hemipteran species to perform a whole genome-based species phylogeny of the insects. The resulting phylogeny, based on 197 genes with single copy orthologs, is congruent with previous phylogenetic analyses
The phylogeny is based on maximum likelihood analyses of a concatenated alignment of 197 widespread, single-copy proteins. The tree was rooted using chordates as the most external outgroup. Bars represent a comparison of the gene content of all species included in the analysis (scale on the top). Bars are subdivided to indicate different types of homology relationships: black: widespread genes that are found with a one-to-one orthology in at least 16 of the 17 species; blue: widespread genes that can be found in at least 16 of the 17 species and are sometimes present in more than one copy; red: widespread but insect-specific genes present in at least 12 of the 13 insect species; yellow: non-widespread insect-specific genes (present in fewer than 12 insect species); green: genes present in insects and other groups but with a patchy distribution; white: species-specific genes with no (detectable) homologs in other species (striped fraction corresponds to species-specific genes present in more than one copy). The thin red line under each bar represents the percentage of
Beyond these comparisons—which are based on BLAST searches of aphid genes against other insect gene sets—we employed a phylogeny-based homology prediction pipeline
Analysis of the pea aphid phylome revealed 2,459 gene families that appear to have undergone aphid lineage-specific duplications, a number greater than that of any other sequenced insect genome (
(A) Size distribution of the major lineage-specific groups of in-paralogs (i.e., paralogs resulting from duplications occurring after the split of the lineages leading to the pea aphid and the louse
To provide a time scale for the origin of aphid-specific duplications, we estimated the synonymous distances (dS values) among all paralog pairs, which were identified using a within-genome reciprocal best BLAST hit. Because the sequenced line showed some heterozygosity, divergence between truly paralogous gene pairs could be confounded with allelic variation, but this should be a problem only for very close pairs of paralogs, since divergence values for allelic variants in most systems are generally very low (<1%). The large majority of gene pairs have higher divergence (dS > 0.05) than this allelic variant cut-off value, and thus can be assumed to represent true paralogs. Paralog pairs display a wide range of dS values, suggesting that gene duplication has occurred for an extended time in the pea aphid lineage. The elevated gene duplication rate appears to have started early in aphid evolution, since the oldest paralog pairs within the pea aphid genome show dS values that are comparable to the dS values for ortholog pairs between pea aphid and
Vertical dotted lines show the estimated average dS between orthologs from different aphid species. 1:
The pea aphid, similar to other non-dipteran insects, possesses a single candidate telomerase gene and the canonical arthropod telomere repeat of TTAGG
Approximately 38% of the assembled genome is composed of TEs. We identified 13,911 consensus TE sequences in the pea aphid genome using REPET, a TE annotation pipeline. The consensus TE sequences were grouped by sequence similarity and classified according to their structural and coding features into 1,883 TE families (consisting of two or more consensus sequences) and 1,672 singletons. Within the 1,883 TE families, we manually curated 85 families including the largest families representative of widespread TE groups, such as LTRs, LINEs, SINEs, TIRs, and Helitrons (
We show the mean identities of (A) TE copies in the pea aphid genome to their consensus reference sequence, (B) LTR super-families, and (C) TIR super-families. The consensus reference TE sequences contain the most frequent nucleotide at each base position and are thus approximations of the ancestral TE sequences, correcting for mutations affecting a small number of copies. Hence, the identity here is a proxy for TE family age, with recent families having high identity (few differences from the ancestral state), and allows the ordering of transposable element invasions of the pea aphid genome. Note that the repeat order “Others” (
Order | Number of Families | Number of Curated Families | Number of Copies | Numbers of TE Copies for Curated Families | Coverage (% of the Genome) | Coverage of Curated Families (% Genome) |
TIRs | 320 | 38 | 46,155 | 11,063 | 4.382 | 1.656 |
LINEs | 178 | 15 | 24,579 | 6,230 | 3.066 | 0.939 |
LTRs | 69 | 17 | 11,199 | 5,405 | 1.365 | 0.741 |
SINEs | 63 | 7 | 12,462 | 4,767 | 1.002 | 0.480 |
MITEs | 20 | 3 | 5,104 | 2,461 | 0.420 | 0.250 |
Polintons | 17 | 3 | 1,583 | 768 | 0.255 | 0.089 |
Helitrons | 12 | 2 | 2,881 | 2,055 | 0.248 | 0.167 |
Others | 1,204 | NA | 402,346 | NA | 27.117 | NA |
Total | 1,883 | 85 | 506,309 | 32,749 | 37.856 | 4.321 |
Terminal inverted repeats (TIRs) and long interspersed elements (LINEs) are the most represented orders in the pea aphid genome. The repeat order named “Others” includes repetitive regions that match to pea aphid consensus TEs but could not be classified by the REPET pipeline because they lack structural features and similarities to other known TEs, and thus are not manually curated.
Like the hymenopteran honey bee and parasitic wasp
Methylated C nucleotides in CpGs—the sites of known DNA methylation in pea aphid—are prone to deamination to uracil, after which DNA repair machinery can produce thymidine. Thus, an excess of CpG sites over those expected at random can provide evidence for purifying selection maintaining CpG sites for methylation. This approach has been used previously to successfully predict methylated genes
CpG ratios were calculated using RefSeq data for each insect species. For each sequence the observed (obs) CpG frequency and the expected (exp) CpG frequency were calculated. The expected CpG frequency was calculated based on the GC content of each sequence and the CpG ratio was calculated as obs/exp. The frequency of each CpG ratio was plotted against the observed/expected ratio. A bimodal distribution was observed for
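The observed/expected CpG ratio described in this legend can be illustrated with a short Python sketch. The exact normalization used in the analysis is not given here, so the code assumes one commonly used definition, in which the expected CpG count is (number of C × number of G) / sequence length; the function name `cpg_obs_exp` and the toy sequence are hypothetical.

```python
import math


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio of one sequence.

    Assumed definition (not confirmed by the text):
        expected CpG count = (count of C * count of G) / sequence length
    Returns NaN when the sequence has no C or no G, so the ratio is undefined.
    """
    seq = seq.upper()
    length = len(seq)
    n_c = seq.count("C")
    n_g = seq.count("G")
    if length < 2 or n_c == 0 or n_g == 0:
        return math.nan
    # "CG" cannot overlap itself, so str.count gives the true dinucleotide count.
    n_cpg = seq.count("CG")
    return (n_cpg * length) / (n_c * n_g)


# Toy sequence: 3 C, 3 G, 3 CpG in 8 bases.
ratio = cpg_obs_exp("ACGTCGCG")
print(round(ratio, 3))
```

Applied to every sequence in a gene set, the resulting ratios can be binned into a histogram to look for the kind of bimodal distribution mentioned in the legend.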
Micro RNA and small interfering RNA gene silencing participates in regulation of eukaryotic gene expression
miRNA biogenesis is initiated in the nucleus by the Drosha-Pasha complex, resulting in precursors of around 60–70 nucleotides named pre-miRNAs. Pre-miRNAs are exported from the nucleus to the cytoplasm by Exportin-5. In the cytoplasm, Dicer-1 and its cofactor Loquacious (Loq) cleave these pre-miRNAs to produce mature miRNA duplexes. A duplex is then separated and one strand is selected as the mature miRNA whereas the other strand is degraded. This mature miRNA is integrated into the multiprotein RISC complex, which includes the key protein Argonaute 1 (Ago1). Integration of miRNAs into RISC will lead to the inhibition of targeted genes either by the degradation of the target mRNA or by the inhibition of its translation. All components of the miRNA pathway have been identified in the pea aphid. Shown are the number of homologs in
Most aphid species harbor the obligate, mutualistic, primary symbiont,
Although this sequencing project was designed to target the genome of
Besides
The pea aphid genome provides a first opportunity for an exhaustive search for genes of bacterial origin in the genome of a eukaryotic host showing persistent associations with heritable bacterial symbionts. Besides their ancient association with
Screening of the genome project data for bacterial sequences revealed a large number of genes of apparent bacterial origin, even after vector contaminants had been screened out. However, a majority of these were on small contigs (mostly under 5 kb) that did not contain evident aphid sequence; PCR experiments on a subsample of such genes supported their identity as bacterial contaminants in the dataset rather than as true transferred genes (
Our findings indicate that overall aphids have acquired few functional genes via lateral gene transfer from bacteria. However, these few genes may be critical in the maintenance of the symbioses exhibited by aphids.
The pea aphid genome provides insight into the intimate metabolic associations between an insect host and obligate bacterial symbiont, revealing how the pea aphid's amino acid and purine metabolism might be adapted to support essential amino acid synthesis and nitrogen recycling by
A global view of the metabolism of the pea aphid as inferred from genome sequence data is available at AcypiCyc, a dedicated BioCyc database (see
The schematic shows hypothetical relations based on the annotation of amino acid biosynthesis genes in the two organisms.
Analyses revealed an additional unusual trait with implications for metabolism. Neither the aphid nor
The aphid immune system is expected to be critical in determining responses to microbial symbionts
Previously sequenced insect genomes (fly, mosquitoes, honeybee, red flour beetle) have indicated that the immune signaling pathways, including IMD and Toll pathways shown here, are conserved across insects. In
Plant volatiles are important cues for host plant recognition by aphids. In insects, such cues enter the antennae, bind to odorant-binding proteins (OBPs)
We identified 15 genes encoding putative OBPs and 13 putative CSP genes. By comparison, other insects also have more OBPs than CSPs
We identified 79 genes in the OR family, including intact, partially annotated genes, and putative pseudogenes. An ortholog of the highly conserved
The pea aphid GR family contains at least 77 genes. There are six members of the well-conserved sugar receptor subfamily and no homologs of the highly conserved carbon dioxide receptors found in holometabolous insects
Aphids are responsible for transmission of 28% of known plant viruses and show four modes of virus transmission: (1) non-persistent (stylet-borne), (2) semi-persistent (foregut-borne), (3) persistent circulative, and (4) persistent propagative
As an herbivore, the pea aphid is likely to overcome plant chemical defenses, at least in part, by employing detoxification enzymes, including cytochrome P450 monooxygenases (P450s), glutathione
The osmotic pressure of phloem sap is significantly greater than that of aphid hemolymph
As hemimetabolous insects, aphids undergo incomplete metamorphosis, passing through a series of molts involving four immature instars to reach the adult stage. Aphids display a wide range of adult phenotypes (
The majority of genes involved in axis formation, segmentation, neurogenesis, eye development, and germ-line specification in the embryo are well-conserved. Genes playing critical roles in
In arthropods, chitin contributes to the structure of the cuticle (i.e., the lining of the tracheae, foregut, and hindgut; and the exoskeleton). There are three major classes of chitin-binding proteins. The pea aphid genome contains a large expansion of the first class, genes containing the R&R consensus sequence
Genes of the highly conserved TGF-β, Wnt, EGF, and JAK/STAT signaling pathways, all utilized in development, have undergone several aphid-specific duplications and losses. Multiple paralogs of
The pea aphid genome contains 640 putative sequence-specific transcription factors. Most of the transcription factor families are similar in size and composition to those of other insects. However, the pea aphid genome encodes significantly more zinc-finger-containing proteins than other insects with sequenced genomes. Although the number of bHLH-encoding genes is similar to that of other insects, orthologs of the
JH has been implicated in regulating aphid reproductive polyphenism
Gene Name | Abbreviation | Pea Aphid Gene Prediction | Pea Aphid CpG Methylation | ||||
Juvenile Hormone Acid Methyltransferase | JHAMT | ACYPI255574 | Not found | FBgn0028841 | NM_001127311 | XM_001119986 | NM_001043436 |
ACYPI568283 | Not found | ||||||
Cytosolic Juvenile Hormone Binding Protein | JHBP | ACYPI154871 | XM_964351 | XM_625097 | NM_001044203 | ||
Juvenile Hormone Epoxide Hydrolase | JHEH | ACYPI275360 | Not found | FBgn0010053 | XM_970006 | XM_394354 | NM_001043736 |
ACYPI189600 | Not found | FBgn0034405 | XM_394922 | ||||
ACYPI307696 | FBgn0034406 | ||||||
Juvenile Hormone Esterase^a | JHE | ACYPI381461 | Not examined | ||||
Juvenile Hormone Esterase Binding Protein | JHEBP | ACYPI563350 | FBgn0035088 | XM_964394 | NM_001047009 | ||
Hexamerin | Hex | No homolog | XM_961866 | NM_001110764 | |||
XM_962135 | NM_001098717 | ||||||
NM_001101023 | |||||||
Methoprene-tolerant | Met | hmm126914 | Not examined | FBgn0002723 | NM_001099342 | NM_001114986 | |
Allatostatin | Ast | hmm252834 | Not examined | FBgn0015591 | XM_001809286 | NM_001043571 | |
Allatostatin receptor | ACYPI008623 | Not examined | FBgn0028961 | XM_397024 | NM_001043570 | ||
FKBP39 | ACYPI003035 | Not examined | |||||
Chd64 | ACYPI003572 | Not examined | FBgn0035499 | XM_392114 | |||
Broad | Br | ACYPI008576 | Not examined | FBgn0000210 | XM_001810758 | NM_001040266 | NM_001043511 |
XM_001810798 | XM_393428 | ||||||
Retinoid X receptor (ultraspiracle) | RXR (usp) | ACYPI005934 | Not examined | FBgn0003964 | NM_001114294 | NM_001011634 | NM_001044005 |
a. The predicted juvenile hormone esterase is identified by the characteristic GQSAG motif and does not show significant homology to other known JHEs.
Aphids exhibit plasticity in meiosis and the cell cycle, allowing for both sexual reproduction and parthenogenesis. Most genes involved in meiosis and the cell cycle in vertebrates and yeasts are present in the pea aphid genome, while other sequenced insect genomes show lineage-specific losses of individual genes or gene family members
The cell division cycle typically consists of four phases: two growth phases (G1 and G2), a DNA synthesis or replication phase (S), and mitosis (M). Distinct and overlapping sets of regulatory genes are required for orderly progression through these phases. (A) Genes important for G1 and S phase progression are similar in number to other insects (orange box). G1/S Cyclin/Cyclin-dependent kinase (Cdk) protein complexes, along with E2F transcription factors, are critical for entry into G1 and progression into DNA replication and are opposed by cell cycle inhibitors such as p21/p27 family members and pRb/p107 family (Rbf) members, respectively. (B) Genes important for G2 and M phases have expanded in pea aphids (blue box). Polo kinases, Aurora kinases, Cdc25 phosphatases, and G2/M Cyclin/Cdk protein complexes are all critical for promoting entry into and progression through mitosis and meiosis. Negative regulators of Cdk1 and entry into mitosis include the Wee1/Myt1 kinase family. However, while Cdk1 has undergone aphid-specific duplication, no expansion of its activation subunits, Cyclins A and B, has been observed. Expanded gene families are in bold italics. Copy number was compared to that in
Neuropeptides and biogenic amines are cell-to-cell signaling molecules that act as hormones, neurotransmitters, and/or neuromodulators
Circadian clocks are internal oscillators governing daily cycles of activity and are proposed to underlie responses to day-night cycle, the most important cue triggering aphid reproductive polyphenism. In
Shown is a schematic representation of pea aphid orthologs of the circadian clock genes arranged in a two-loop model, as proposed for
Aphid sex determination is chromosomal. Females have two X chromosomes and males have only one
Major results from analyses of the pea aphid genome can be summarized as follows:
Extensive gene duplication has occurred in the pea aphid genome and appears to date to around the time of the origin of aphids.
The aphid genome appears to have more coding genes than previously sequenced insects, although a precise gene count awaits better assembly and further functional annotation of the genome. The increased gene number reflects both extensive duplications and the presence of genes with no orthologs in other insects.
More than 2,000 gene families are expanded in the aphid lineage, relative to other published genomes; examples include families involved in chromatin modification, miRNA synthesis, and sugar transport.
Orphan genes comprise 20% of the total number of genes in the genome. Many are found in EST libraries, suggesting they are functional.
As the first genome sequenced for an animal with an ancient coevolved symbiosis, the pea aphid genome reveals coordination of gene products and metabolism between host and symbionts. Amino acid and purine metabolism illustrate apparent cases of biosynthetic pathways for which different enzymatic steps are encoded in distinct genomes. These preliminary findings of host-symbiont coordination will be enhanced by the availability of genomes for three pea aphid symbionts, including the obligate nutritional symbiont
Selenocysteine biosynthesis is not present in the pea aphid, and selenoproteins are absent.
Several genes were found to have arisen from bacterial ancestors. Some of these genes are highly expressed in bacteriocytes and may function in regulation of the symbiosis with
The immune system of pea aphids is reduced and specifically lacks the IMD pathway; this unusual loss may be either a cause or a consequence of the evolution of intimate bacterial symbioses.
As a specialized herbivore, the pea aphid must overcome plant defenses, and the pea aphid genome provides candidates for genes involved in critical insect-plant interactions.
The unusual developmental patterns of aphids, involving extensive polyphenism, may be facilitated by duplications of many development-related genes.
Our analysis of the pea aphid genome has begun to reveal the genetic underpinnings of this animal's complex ecology—including its capacity to parasitize agricultural crops, its association with microbial symbionts, and its developmental patterning. One project benefiting from the availability of the genome sequence is the investigation of aphid saliva proteins
The parental line of the sequenced aphid clone, LSR1, was collected in a field of alfalfa (
The genome size of LSR1.AC.G1 was estimated from single heads of seven asexual females by flow cytometry as described in
3.13 million Sanger sequence reads were produced on 3730 sequencing machines (Applied Biosystems, Foster City, CA, USA) and assembled using the Atlas assembly pipeline, representing about 464 Mb of sequence and about 6.2× coverage of the (clonable)
We took two complementary approaches to automated gene prediction. First, for high-quality evidence-based gene models, we used the NCBI evidence-based RefSeq pipeline. Second, because EST and protein homology evidence was insufficient for the RefSeq pipeline to generate a comprehensive gene model set, we supplemented the RefSeq models with a GLEAN
The NCBI RefSeq pipeline uses a combination of homology searching with
Our supplemental GLEAN consensus gene model set of 36,606 was generated with input gene model sets from six different gene predictors: Augustus, FgenesH, FgenesH++, NCBI Gnomon, Maker, and NCBI RefSeq. Of these gene models, 12,251 overlapped RefSeq gene models by 100 bp or more, and in these cases, the RefSeq models were used. The final automated gene model set contains 34,604 gene models (
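The merging rule described above (a GLEAN model is replaced by RefSeq whenever the two overlap by 100 bp or more) can be sketched in Python as follows. This is a minimal illustration, not the pipeline actually used: it assumes each model is reduced to a single span on a scaffold with 1-based inclusive coordinates, and the names `Model`, `overlap_bp`, and `merge_gene_sets` are hypothetical. The text does not state whether overlap was measured on whole-model spans or on exons.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    name: str
    scaffold: str
    start: int  # 1-based, inclusive (assumed convention)
    end: int    # 1-based, inclusive (assumed convention)


def overlap_bp(a: Model, b: Model) -> int:
    """Number of shared bases between two spans; 0 if on different scaffolds."""
    if a.scaffold != b.scaffold:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def merge_gene_sets(refseq: List[Model], glean: List[Model],
                    min_overlap: int = 100) -> List[Model]:
    """Keep all RefSeq models plus GLEAN models that overlap no RefSeq
    model by min_overlap bp or more."""
    kept = [g for g in glean
            if all(overlap_bp(g, r) < min_overlap for r in refseq)]
    return list(refseq) + kept


refseq = [Model("R1", "s1", 1000, 3000)]
glean = [
    Model("G1", "s1", 2500, 4000),  # 501 bp overlap with R1: replaced by RefSeq
    Model("G2", "s1", 2950, 5000),  # 51 bp overlap with R1: kept
    Model("G3", "s2", 1000, 3000),  # different scaffold: kept
]
merged = merge_gene_sets(refseq, glean)
print([m.name for m in merged])

# The reported counts are consistent with this rule:
# 36,606 GLEAN - 12,251 overlapping = 24,355 non-redundant GLEAN models,
# and 24,355 + 10,249 RefSeq protein-coding models = 34,604.
assert 36_606 - 12_251 == 24_355
assert 24_355 + 10_249 == 34_604
```

The all-against-all comparison is quadratic and is only meant to make the rule explicit; a genome-scale implementation would sort or index the intervals per scaffold.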
Using results of computational annotation as a baseline, members of the International Aphid Genomics Consortium manually curated over 2,000 genes of biological interest. Briefly, sequences of target genes from other arthropods were used in BLAST searches of the RefSeq gene set, Gnomon predictions, scaffolds, and unassembled reads. Homology of putative aphid genes was verified using a combination of reciprocal BLAST and information garnered from PhylomeDB and other phylogenetic analyses. Gene models (e.g., starts and stops, exon boundaries) were then manually refined based on available EST and full-length cDNA support, as well as alignment with homologs from other taxa. Manual curation was facilitated by an Apollo instance directly integrated with AphidBase (see below).
The pea aphid assembled genome sequence data has been comprehensively scanned and annotated to highlight transcription evidence. ESTs, EST contigs, and full-length cDNAs have been mapped to the genome using SIM-4, whereas homologs in other insect genomes or Uniprot have been identified by high-throughput BLAST searches. All of the approximately 170,000 ESTs and 200 full-length cDNAs, as well as gene models generated by different programs (Augustus, RefSeq, Genscan, Maker, Snap, GeneID, Gnomon, and Fgenesh) and RefSeq and Glean gene model repertoires, were loaded into a GMOD-Chado database
One hundred and ninety-seven genes with single-copy orthologs in all species included in the analyses were selected to infer a species phylogeny. Alignments performed with MUSCLE were concatenated into a super-alignment containing 14,922 positions. The removal of positions with gaps in more than 50% of the sequences resulted in a final alignment of 90,512 positions. This alignment was used for Maximum Likelihood (ML) tree reconstruction as implemented in PhyML v2.4.4
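The gap-filtering step, in which alignment positions with gaps in more than 50% of the sequences are removed, can be sketched as follows. This is an illustrative Python version under simple assumptions (sequences held as equal-length strings, `-` as the only gap character); the function name `drop_gappy_columns` and the toy alignment are hypothetical, and the software actually used for this step is not named in the text.

```python
from typing import List


def drop_gappy_columns(alignment: List[str],
                       max_gap_fraction: float = 0.5) -> List[str]:
    """Remove columns whose gap fraction exceeds max_gap_fraction.

    A column with exactly 50% gaps is kept, because the criterion is
    'gaps in MORE than 50% of the sequences'.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n_seqs = len(alignment)
    keep = [
        col for col in range(length)
        if sum(row[col] == "-" for row in alignment) / n_seqs <= max_gap_fraction
    ]
    return ["".join(row[col] for col in keep) for row in alignment]


toy = ["MK-LV",
       "MKA-V",
       "M--LV",
       "MK--V"]
trimmed = drop_gappy_columns(toy)
print(trimmed)
```

In the toy alignment only the third column (three gaps out of four sequences) is removed; the fourth column, with exactly half gaps, is retained.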
We reconstructed the complete collection of phylogenetic trees, also known as the Phylome, for all
Prediction of orthology is a fundamental step in the functional annotation of newly sequenced genomes. Reciprocal BLAST best hit is often used for genome-wide orthology detection, but phylogeny-based orthology predictions are considered more accurate, especially at large evolutionary distances or when gene duplication and loss is rampant
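For readers unfamiliar with the reciprocal best hit criterion mentioned above, the following Python sketch shows the idea: two genes are paired only if each is the other's top-scoring hit. It assumes pre-computed similarity hits supplied as (query, subject, score) tuples; the function names and the toy scores are hypothetical, and the sketch is not the pipeline used in this study.

```python
from typing import Dict, Iterable, List, Tuple

# (query_id, subject_id, score); this tuple layout is an assumption.
Hit = Tuple[str, str, float]


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each query to its highest-scoring subject (first one wins on ties)."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit],
                         b_vs_a: Iterable[Hit]) -> List[Tuple[str, str]]:
    """Return (a, b) pairs where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


a_vs_b = [("a1", "b1", 250.0), ("a1", "b2", 90.0),
          ("a2", "b2", 120.0), ("a3", "b2", 200.0)]
b_vs_a = [("b1", "a1", 245.0), ("b2", "a3", 198.0), ("b2", "a2", 118.0)]
rbh_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh_pairs)
```

In the toy data, `a2` and `a3` both have `b2` as their best hit, but `b2` points back only to `a3`, so `a2` is left unpaired. Cases of this kind, which arise after gene duplication, are where tree-based orthology prediction is expected to be more informative.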
A list of orthology-based transfer of functional annotations was built based on phylogeny-based orthology relationships with
The duplication events defined by the above-mentioned species overlap algorithm that only comprised paralogs from
Putative pairs of paralogs were identified as pairs of genes following a reciprocal best hit criterion (RBH) within the
The pea aphid has four chromosomes
TEs were identified and annotated using the “REPET” (
These TE consensus sequences, representing ancestral copies of TE subfamilies, were clustered into groups for family identification using the GROUPER clustering method. Each family (i.e., group) was characterized on the assumption that the most populated, well-characterized TE category in a group of consensus sequences defines the order to which the group belongs. Eighty-five families containing at least five TE consensus sequences were then manually curated using multiple sequence alignments, phylogenies, and Hidden Markov Models
The pea aphid genome was annotated with all the subfamilies of TE consensus sequences using the second part of the REPET annotation pipeline. This pipeline is composed of TE detection software—BLASTER
TEs often insert into other TEs, fragmenting them. A specific “long join” annotation procedure was therefore performed, using age estimates of repeat fragments to correctly identify fragments from the same repeat. The percent identity between a fragment and its reference TE/repeat consensus can be used to estimate the age of TE fragments.
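The use of percent identity as an age proxy can be illustrated with a small Python sketch. It assumes that each fragment has already been aligned to its consensus (equal-length strings, `-` for gaps) and that gapped columns are ignored; both are simplifying assumptions, and the function name `percent_identity` and the toy sequences are hypothetical rather than part of REPET.

```python
def percent_identity(fragment: str, consensus: str) -> float:
    """Percent of aligned, non-gap columns where the fragment matches the consensus."""
    if len(fragment) != len(consensus):
        raise ValueError("fragment and consensus must be pre-aligned to equal length")
    compared = 0
    matches = 0
    for f, c in zip(fragment.upper(), consensus.upper()):
        if f == "-" or c == "-":
            continue  # assumption: gapped columns are skipped, not counted as mismatches
        compared += 1
        matches += f == c
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


consensus = "ACGTAGCTAC"
fragments = {
    "copy_old": "ACGTTGCAAC",    # two substitutions relative to the consensus
    "copy_young": "ACGTAGCTAC",  # identical to the consensus
}
identities = {name: percent_identity(seq, consensus) for name, seq in fragments.items()}

# Higher identity suggests a more recent insertion, so sort youngest first.
ranked = sorted(identities, key=identities.get, reverse=True)
print(identities, ranked)
```

Fragments with similar identity to the same consensus are plausible candidates for having originated from one copy, which is the intuition behind joining them.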
CpG analysis was performed as described in
During whole-genome sequencing of the pea aphid clone LSR1.AC.G1, 24,947 sequence reads corresponding to the
A BioCyc metabolism database
Special thanks to J. Colbourne for insightful discussions on genome project organization and future directions.
† Principal Investigator
* Analysis Group Leaders
antimicrobial peptide
chitin-binding domain
carboxyl/choline esterases
chemosensory proteins
G protein-coupled receptor
gustatory receptors
glutathione
juvenile hormone
major facilitator superfamily
Maximum Likelihood
Neighbor Joining
odorant-binding proteins
odorant receptors
P450 monooxygenases
peptidoglycan recognition proteins
reciprocal best hit
RNA Induced Silencing Complex
transposable element