Three Groups of Transposable Elements with Contrasting Copy Number Dynamics and Host Responses in the Maize (Zea mays ssp. mays) Genome

Most angiosperm nuclear DNA is repetitive and derived from silenced transposable elements (TEs). TE silencing requires substantial resources from the plant host, including the production of small interfering RNAs (siRNAs). Thus, the interaction between TEs and siRNAs is a critical aspect of both the function and the evolution of plant genomes. Yet the co-evolutionary dynamics between these two entities remain poorly characterized. Here we studied the organization of TEs within the maize (Zea mays ssp. mays) genome, documenting that TEs fall within three groups based on class and copy number. These groups included DNA elements, low copy RNA elements and higher copy RNA elements. The three groups varied statistically in characteristics that included length, location, age, siRNA expression and 24∶22 nucleotide (nt) siRNA targeting ratios. In addition, the low copy retroelements encompassed a set of TEs that had previously been shown to decrease expression within a 24 nt siRNA biogenesis mutant (mop1). To investigate the evolutionary dynamics of the three groups, we estimated their abundance in two landraces, one with a genome similar in size to that of the maize reference and the other with a 30% larger genome. For all three accessions, we assessed TE abundance as well as 22 nt and 24 nt siRNA content within leaves. The high copy number retroelements are targeted similarly by siRNAs among accessions, appear to be born of a rapid burst of activity, and may be currently transpositionally dead or limited. In contrast, the lower copy number group of retroelements are targeted more dynamically and have had a long and ongoing history of transposition in the maize genome.


Preliminaries
When comparing the amount of siRNA that maps to a given TE in two different cultivars, care must be taken to account for differences in the copy number of that TE. The relevant variable is the number of siRNA hits per copy; one possible measure is the associated RPKM divided by the coverage. Also, when comparing count-like data with a χ² test, the counts themselves must not be rescaled. Instead, the expected number of occurrences of an event (i.e. siRNA hits on a given TE subfamily) should reflect the fact that the number of TE copies differs across cultivars.
Before continuing, we need to establish the definition of the RPKM and its relation to the number of copies. We start with the RPKM of the $i$-th TE subfamily:

$$\mathrm{RPKM}_{\mathrm{TE},i} = \frac{10^{6}\, H_i}{L_i\, M} \tag{1}$$

where $H_i$ is the number of reads mapping to the $i$-th subfamily, $L_i$ is its length in kb and $M$ is the total number of mapped reads against the UTE. An estimate of the number of copies from the number of mapped reads could be

$$\mathrm{CN}_i = \frac{H_i\, rl}{L_i\, cov} \tag{2}$$

where $rl$ is the length of the reads in kb and $cov$ is the estimated coverage of the library. Equivalently, one could write the estimated copy number in terms of $\mathrm{RPKM}_{\mathrm{TE},i}$:

$$\mathrm{CN}_i = \frac{\mathrm{RPKM}_{\mathrm{TE},i}\, M\, rl}{10^{6}\, cov} \tag{3}$$

Here is a
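The definitions in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the RPKM is taken in its standard form (reads per kb of subfamily length per million mapped reads), the copy number is taken as the read depth over the subfamily divided by the library coverage, and all function and argument names are hypothetical.

```python
def rpkm(hits, length_kb, total_mapped):
    """Reads per kb of TE length per million mapped reads (standard form, assumed)."""
    return hits / (length_kb * total_mapped / 1e6)


def copy_number_from_hits(hits, length_kb, read_len_kb, coverage):
    """Copy number as per-base depth over the TE divided by the library coverage."""
    return hits * read_len_kb / (length_kb * coverage)


def copy_number_from_rpkm(rpkm_value, total_mapped, read_len_kb, coverage):
    """Same estimate, rewritten in terms of the RPKM."""
    return rpkm_value * total_mapped * read_len_kb / (1e6 * coverage)


# Made-up numbers for illustration only.
hits, length_kb, total_mapped = 5000, 5.0, 2_000_000
read_len_kb, coverage = 0.1, 2.0

value = rpkm(hits, length_kb, total_mapped)
cn_direct = copy_number_from_hits(hits, length_kb, read_len_kb, coverage)
cn_via_rpkm = copy_number_from_rpkm(value, total_mapped, read_len_kb, coverage)
```

Both routes to the copy number agree by construction, which is a quick check that the two expressions are equivalent.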

Methodology
Suppose that two cultivars, A and B, differ in the coverage of both the RNA library and the DNA library. Let $x_{A,i}$ and $x_{B,i}$ be the expected number of siRNA hits on the $i$-th TE subfamily for cultivars A and B, respectively. Our null hypothesis then states that

$$\frac{x_{A,i}}{x_{B,i}} = \frac{covRNA_A\, \mathrm{CN}_{A,i}}{covRNA_B\, \mathrm{CN}_{B,i}} \tag{4}$$

where $\mathrm{CN}_{A,i}$ and $\mathrm{CN}_{B,i}$ are the estimated copy numbers of the $i$-th subfamily in each cultivar. That is to say, the number of hits is proportional to the number of copies of the TE and to the (unknown) coverage of the siRNA library. Note that the copy number estimate itself involves the coverage of the DNA library (see the previous section). Below we describe a way to estimate the ratio of the coverages of the siRNA libraries, an unknown constant that is independent of the TE subfamily.
For each TE, in order to find the expected number of hits, we have to take into account the constraint

$$x_{A,i} + x_{B,i} = S_i \tag{5}$$

where $S_i$ is the total number of siRNA hits on the $i$-th subfamily across cultivars.
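The expected values of $x_{A,i}$ and $x_{B,i}$ can be computed from the previous equations, which results in a simple lever rule:

$$x_{A,i} = S_i\, \frac{covRNA_A\, \mathrm{CN}_{A,i}}{covRNA_A\, \mathrm{CN}_{A,i} + covRNA_B\, \mathrm{CN}_{B,i}}, \qquad x_{B,i} = S_i\, \frac{covRNA_B\, \mathrm{CN}_{B,i}}{covRNA_A\, \mathrm{CN}_{A,i} + covRNA_B\, \mathrm{CN}_{B,i}} \tag{6}$$

Here $\mathrm{CN}_{A,i}$ and $\mathrm{CN}_{B,i}$ denote the estimated copy numbers of the $i$-th subfamily in each cultivar. To estimate the ratio $covRNA_A / covRNA_B$ we use an additional constraint: the sum of all the $x_{A,i}$ has to equal the total number of mapped siRNA hits on the transposons for cultivar A, $S_A$:

$$\sum_{i=1}^{N} x_{A,i} = S_A \tag{7}$$

where $N$ is the number of different TEs. Combining the previous equations and writing $r = covRNA_A / covRNA_B$, we obtain the following equation for the ratio:

$$\sum_{i=1}^{N} S_i\, \frac{r\, \mathrm{CN}_{A,i}}{r\, \mathrm{CN}_{A,i} + \mathrm{CN}_{B,i}} = S_A \tag{8}$$

There is an equivalent equation for cultivar B, but it is easy to prove that it gives the same solution, i.e. it is not independent.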
The previous equation is easy to solve numerically, which yields the ratio of the coverages of the RNA libraries and hence the expected values via Eq. (6).
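One way to carry out this numerical step is plain bisection. The sketch below is an illustration rather than the procedure actually used: it assumes the lever rule has the form x_A,i = S_i · r · CN_A,i / (r · CN_A,i + CN_B,i) with r = covRNA_A / covRNA_B, that S_i is the sum of the observed hits in both cultivars, and that all copy numbers are positive. Under those assumptions the sum of the expected hits for A increases monotonically with r, so the root is unique. Function names are hypothetical.

```python
def expected_hits(r, s_i, cn_a, cn_b):
    """Lever rule (assumed form): split the total hits s_i between cultivars A and B."""
    denom = r * cn_a + cn_b
    return s_i * r * cn_a / denom, s_i * cn_b / denom


def solve_coverage_ratio(obs_a, obs_b, cn_a, cn_b, iterations=200):
    """Find r such that the expected hits for A sum to the observed total for A."""
    s_a, s_b = sum(obs_a), sum(obs_b)
    if s_a <= 0 or s_b <= 0:
        raise ValueError("both cultivars need a positive total number of hits")
    totals = [a + b for a, b in zip(obs_a, obs_b)]

    def excess(r):
        # Sum of expected hits for A minus the observed total for A.
        expected = sum(expected_hits(r, s, ca, cb)[0]
                       for s, ca, cb in zip(totals, cn_a, cn_b))
        return expected - s_a

    lo, hi = 0.0, 1.0
    while excess(hi) < 0:          # expand the bracket until it contains the root
        hi *= 2.0
        if hi > 1e12:
            raise ValueError("could not bracket the coverage ratio")
    for _ in range(iterations):    # bisection on a monotonically increasing function
        mid = 0.5 * (lo + hi)
        if excess(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Made-up counts that are exactly consistent with r = 2.
cn_a, cn_b = [10.0, 20.0, 5.0], [10.0, 10.0, 10.0]
obs_a, obs_b = [200, 400, 100], [100, 100, 100]
ratio = solve_coverage_ratio(obs_a, obs_b, cn_a, cn_b)
```

Bisection is chosen here only because it needs no derivatives and cannot overshoot; any one-dimensional root finder would do.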
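Once we have the expected values for both cultivars, we can find the χ² statistic for each TE subfamily in the usual way, i.e.

$$\chi^2_i = \frac{(\tilde{x}_{A,i} - x_{A,i})^2}{x_{A,i}} + \frac{(\tilde{x}_{B,i} - x_{B,i})^2}{x_{B,i}} \tag{9}$$

where the tildes denote the observed variables. Since the sum $x_{A,i} + x_{B,i}$ is constrained, there is only one degree of freedom.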
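For a single TE subfamily, the test can be sketched as follows. The statistic is assumed to be the usual Pearson sum of (observed − expected)² / expected over the two cultivars. With one degree of freedom, the upper-tail probability of the χ² distribution equals erfc(√(χ²/2)), so no external library is needed. The function name and the example counts are hypothetical.

```python
import math


def chi_square_1df(obs_a, obs_b, exp_a, exp_b):
    """Pearson chi-square for one TE subfamily and its p-value with 1 degree of freedom."""
    stat = (obs_a - exp_a) ** 2 / exp_a + (obs_b - exp_b) ** 2 / exp_b
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Made-up counts: 120 and 80 observed hits against 100 expected in each cultivar.
stat, p_value = chi_square_1df(120, 80, 100.0, 100.0)
```

When many subfamilies are tested in this way, the resulting p-values would normally need a multiple-testing correction; that step is not covered here.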