EMK, FC, and KDM conceived and designed the experiments and wrote the paper. EMK and ST performed the experiments. EMK, ST, and KDM analyzed the data. EMK, ST, and FC contributed reagents/materials/analysis tools.

The authors have declared that no competing interests exist.

Insertions and deletions (indels) cause numerous genetic diseases and lead to pronounced evolutionary differences among genomes. The macaque sequences provide an opportunity to gain insights into the mechanisms generating these mutations on a genome-wide scale by establishing the polarity of indels occurring in the human lineage since its divergence from the chimpanzee. Here we apply novel regression techniques and multiscale analyses to demonstrate extensive regional variation in indel rates stemming from local fluctuations in divergence, GC content, male and female recombination rates, proximity to telomeres, and other genomic factors. We find that both replication and, surprisingly, recombination are significantly associated with the occurrence of small indels. Intriguingly, the relative inputs of replication versus recombination differ between insertions and deletions; thus, the two types of mutations are likely guided in part by distinct mechanisms. Namely, insertions are more strongly associated with factors linked to recombination, while deletions are mostly associated with replication-related features.

Despite the significance of insertions and deletions (indels) for human genetic disease [

Unlike for substitution rates (e.g., [

Here, we scrutinize patterns of neutral regional variation in rates of small indels across the human genome, and contrast the inferred molecular mechanisms contributing to the generation of insertions versus deletions. We identified small (less than or equal to 30 bp) insertions and (separately) deletions at neutrally evolving (see below) interspersed ancestral repeats (ARs) in the whole-genome human–chimpanzee comparison employing macaque sequences as an outgroup. Next, we applied the methods of multiple regression to rigorously examine genomic factors determining regional variation in rates of these mutations that occurred in the human lineage since the human–chimpanzee divergence.

We established a computational pipeline to detect small indels that occurred in the human lineage since its divergence from chimpanzee, and used the recently sequenced macaque genome [

Human-specific insertion and deletion rates are strikingly lower for chromosome X (insertions mean 1.2 × 10^{−4}, standard deviation (sd) 3 × 10^{−5}; deletions mean 1.9 × 10^{−4}, sd 4 × 10^{−5}) than for autosomes (insertions mean 1.6 × 10^{−4}, sd 3 × 10^{−5}; deletions mean 2.7 × 10^{−4}, sd 4 × 10^{−5}—see also ^{−16}, Kruskal-Wallis test over 1-Mb windows;

Autosomes (blue) and X chromosome (red) are indicated.

Similar to nucleotide substitutions rates [^{−5} for insertions and 4.5 × 10^{−5} for deletions;

(A) Insertion versus deletion rates. (B) Indel rates versus divergence. (C) Indel rates versus GC content. In all three instances, the quadratic fits (^{−15} for (A–C)) explained the data better than linear fits. Empty circles, chromosome X windows; filled circles, autosomal windows; purple circles, windows skewed toward insertions; green circles, windows skewed toward deletions; pink circles, windows skewed toward indels; blue circles, windows skewed toward nucleotide substitutions.

To infer the underlying molecular mechanisms contributing to variation in indel rates, we investigated various genomic features as predictors of insertion and deletion rates in 2,568 1-Mb nonoverlapping windows throughout the human genome (

Linear Regression Models for Human-Specific Insertion and Deletion Rates at 1-Mb Genomic Windows

The grey area represents 95% prediction intervals. Data points in red are windows on chromosome X.

In choosing a window size to perform our main analysis, we struck a balance among various important considerations. Larger windows increase accuracy in the computation of insertion and deletion rates, and thus reduce the error variability carried by these measurements. This fact is reflected in the share of variability explained by our regression models increasing steadily with the window size, for both indels (

RCVE (see

The choice of 1-Mb windows is also supported by autocorrelation considerations. Gaffney and Keightley [

Contrary to expectation [

The X chromosome/autosome indicator is one of the top predictors for both insertion and deletion rates. The lower indel rates for chromosome X and the higher rates for autosomes corroborate the importance of replication errors in generating indels [

Recombination can also contribute to the observed differences in indel rates between X and autosomes; despite similarity in average recombination rates between these two types of chromosomes in humans [

By contrast, location on a particular autosome is at best a minor determinant of these rates. Adding chromosomal labels other than X to the regressions leads to either slight or no increase in the total share of explained variability, and to inconsistent results at various scales that are difficult to interpret biologically (specifically, different autosomes appear significant at different window sizes;

A significant positive association between deletion or insertion rates and divergence (

The association between GC content and both insertion and deletion rates again emphasizes the role of replication in generating indels. Indeed, the curvilinear relationship that is observed by plotting indel rates against GC content (

While both insertions and deletions are associated with male recombination rates, insertions are additionally and more strongly associated with female recombination rates (

Puzzlingly, similar to nucleotide substitutions [

The incidence of SINEs appears to be strongly associated with both insertion and deletion rates, albeit in opposite ways (

Insertion rates rise with increasing content of poly(A/T) runs (

Fitting regressions for different window sizes allows us to investigate the scales at which each genomic feature is the most highly connected to indel rates (

In addition to considering different scales, we performed regressions (at the 1-Mb scale) for 1-bp human-specific indels, which constitute roughly 50% of our total dataset (

Despite the significance of the X chromosome/autosome indicator (

To further explore differences in the biological mechanisms contributing to indels, we identified and analyzed 1-Mb windows having extremely different insertion versus deletion rates. We selected 25 windows skewed toward insertions and 25 windows skewed toward deletions, each group constituting ∼1% of our original dataset (

Similarly, inspection of windows with extremely different indel rate versus divergence allows us to identify genomic factors having stronger association with either indels or nucleotide substitutions (

In a separate analysis, we investigated relationships between insertion or deletion rates and the proportion of a window occupied by so-called most conserved elements (i.e., content). Many of these elements are likely to be functional and include protein-coding exons, other transcribed regions, and conserved noncoding sequences potentially important for gene regulation and other cellular processes [

Lowess smooths are superimposed on each plot as visualization aids.

The above discussion of genomic factors suggestive of similarities and differences in the mutagenesis of insertions and deletions leads to the following conclusions. First, our regression analyses are consistent with the importance of replication for generating both indels [

Finally, the differences between genomic factors significantly associated with insertions and deletions suggest that the relative contributions of replication versus recombination are unequal for these two types of mutations. The trends we observe for deletions are closer to those reported for nucleotide substitutions in other studies (e.g., positive association with male recombination rates and negative association with SINE occurrence [

Differentiating between insertions and deletions using the macaque sequence as an outgroup has implied that nontrivial mechanistic differences exist between the two types of mutations. The importance of recombination and replication to indel formation conveyed here warrants evaluation in future studies. Conceivably, such studies will also allow us to discriminate between the roles of replication and repair. Our in-depth investigation of neutrally evolving indels provides important insights into indel mutagenesis, with its implications for understanding human genetic diseases. In addition, it will aid in the development of better gap modeling techniques, which are crucial for improving alignment methodology and thus for inferences on genome evolution.

Alignment methods and parameters are critical for the identification of indels. Here we use the human–chimpanzee–macaque (hg18–panTro2–rheMac2) three-way genome alignments that were produced by the MULTIZ algorithm [

If indel events are independent, intergap distances are expected to follow a geometric distribution [

MAF-formatted alignment blocks were restricted to human coordinates of ARs using Galaxy [

Custom PERL scripts (available upon request) were developed for the computational pipeline to identify and filter indels. A

Filtering of putative indels was further performed to remove potential false positives (

To investigate regional variation in indel rates, we divided the human genome (hg18) into nonoverlapping windows and estimated counts or content (fraction of bases of the window) for various genomic features to create a set of potential predictors (

The calculation of recombination rates is an exception to this rule, since different sources were used for different window sizes. For 1-, 5- and 10-Mb windows, sex-specific recombination rates were obtained from the UCSC Genome Browser deCODE data track [

Human–macaque divergence was used instead of human–chimpanzee divergence because of a strong effect of ancient polymorphisms on the latter [

In addition to various quantitative predictors obtained as counts and frequencies, we considered an indicator variable which labels each window as belonging to X (“1”) or autosomes (“0”)—this is also listed in

Windows were excluded from the analysis at two stages of filtering. First, windows were excluded if they lacked data due to low sequencing coverage (“N” content >50% of the window) or lacked sufficient aligned AR coverage (<20% of the window). Second, additional windows were excluded due to lack of recombination data and/or human–macaque divergence estimates. As the X chromosome is unique in having distinct ordered physical and evolutionary blocks or “strata” (based on divergence from Y; [

All computations were conducted using the R statistical package [^{2} and similar measures, variance inflation factors (VIF)); see [

For both insertion and deletion rates (separately), model selection was performed at the 1-Mb scale (and similarly at other scales), using the following approach. We started with the pool of predictors in ^{2}, Mallows' Cp selects subsets based on a balance between small mean square error (MSE) for the corresponding regressions, and parsimony (small number of terms).
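The subset-scoring step can be illustrated with a minimal Python sketch. The analyses themselves were run in R, so this is not the code used here. It assumes the usual definition Cp = SSE_p / MSE_full − n + 2p, where p counts the intercept, and it scores every subset exhaustively, which is only practical for a small pool of predictors. The predictor names and the simulated data are hypothetical.

```python
import random
from itertools import combinations


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def sse(columns, y):
    """Residual sum of squares for OLS of y on an intercept plus the given columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(r[a] * r[b] for r in design) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(design, y)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in design]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def mallows_cp(predictors, y):
    """Return {subset of predictor names: Cp} for every non-empty subset."""
    names = sorted(predictors)
    n = len(y)
    # MSE of the model containing the whole pool of candidate predictors.
    mse_full = sse([predictors[k] for k in names], y) / (n - len(names) - 1)
    scores = {}
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            p = size + 1  # number of fitted terms, including the intercept
            subset_sse = sse([predictors[k] for k in subset], y)
            scores[subset] = subset_sse / mse_full - n + 2 * p
    return scores


# Hypothetical toy data: the response depends on "gc" and "divergence" only.
rng = random.Random(0)
n_windows = 60
predictors = {name: [rng.random() for _ in range(n_windows)]
              for name in ("gc", "divergence", "sines")}
rate = [2.0 * g + 3.0 * d + rng.gauss(0.0, 0.05)
        for g, d in zip(predictors["gc"], predictors["divergence"])]

scores = mallows_cp(predictors, rate)
best = min(scores, key=scores.get)
print(best, round(scores[best], 2))
```

Subsets with small Cp combine a low residual error with few terms; by construction, the model containing the whole pool always scores Cp = p.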

Next, the regressions corresponding to the best subsets for insertion and deletion rates were further “pruned,” eliminating terms whose coefficients were not significant after a Bonferroni correction [^{2}, with inconsistency in the results when refitting the models for different window sizes (

While many of the quantitative predictors are correlated, our regression fits for both indels are not adversely affected by multicollinearity, as shown by the relatively low values of the VIFs in

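To assess the contribution of each individual predictor to the explanation of the total variability in the response, we use RCVE:

RCVE = (R_{full}^{2} − R_{reduced}^{2}) / R_{full}^{2} = (SSR_{full} − SSR_{reduced}) / SSR_{full}

Here R_{full}^{2} and SSR_{full} are the R^{2} (share of variability explained) and the regression sum of squares of the full model (includes all significant terms), while R_{reduced}^{2} and SSR_{reduced} are the corresponding quantities for the reduced model, which omits the predictor of interest. RCVE differs from the partial R^{2}: the formula for the latter has SSE_{reduced} (the residual sum of squares of the reduced model) in the denominator in place of SSR_{full}. We prefer RCVE to the partial R^{2} because it uses the same denominator for all predictors in the same model. We also evaluated the predictors in our models using the standard partial R^{2}, with very similar results and consistent conclusions (unpublished data).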
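As a small worked example, the Python sketch below contrasts RCVE with the partial R^{2}. It assumes that RCVE is the drop in R^{2} when one predictor is removed, divided by the R^{2} of the full model, and that the partial R^{2} is the drop in residual sum of squares divided by the residual sum of squares of the reduced model. All sums of squares are hypothetical numbers chosen for illustration.

```python
def rcve(r2_full, r2_reduced):
    """Relative contribution to variability explained for one predictor."""
    return (r2_full - r2_reduced) / r2_full


def partial_r2(sse_full, sse_reduced):
    """Standard partial R^2 for the predictor dropped from the reduced model."""
    return (sse_reduced - sse_full) / sse_reduced


# Hypothetical sums of squares: total, and regression SS with and without
# the predictor of interest.
sst = 100.0
ssr_full, ssr_reduced = 50.0, 40.0

rcve_from_r2 = rcve(ssr_full / sst, ssr_reduced / sst)
rcve_from_ssr = (ssr_full - ssr_reduced) / ssr_full  # same value: SST cancels
partial = partial_r2(sst - ssr_full, sst - ssr_reduced)

print(round(rcve_from_r2, 3), round(rcve_from_ssr, 3), round(partial, 3))
```

Because the RCVE denominator depends only on the full model, the values for different predictors in one model share a common scale, whereas the partial R^{2} denominator changes with the predictor that is removed.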

We also checked whether residuals from our final regression models presented troublesome spatial autocorrelations among adjacent windows on each chromosome. Diagnostic plots of the residuals' partial autocorrelation function (PACF) for various lags (here lags are measured in number of adjacent 1-Mb windows) showed no substantial evidence against the assumption of independent errors on which the regression fits rely (autocorrelation parameter values <0.2 do not violate the assumption of independent error terms); moreover, the partial autocorrelation in residuals drops substantially compared with the partial autocorrelation in the response (unpublished data).
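The residual check can be sketched in Python as follows. This is an illustration, not the R code used for the analysis: it estimates the sample PACF with the Durbin-Levinson recursion and flags lags whose absolute value reaches the 0.2 threshold mentioned above. The simulated series is hypothetical and deliberately autocorrelated at lag 1, so that the flagging step has something to report.

```python
import random


def autocorrelations(x, max_lag):
    """Sample autocorrelations rho_0 .. rho_max_lag of a series."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [sum(dev[i] * dev[i + k] for i in range(n - k)) / c0
            for k in range(max_lag + 1)]


def pacf(x, max_lag):
    """Partial autocorrelations for lags 1 .. max_lag (Durbin-Levinson)."""
    rho = autocorrelations(x, max_lag)
    phi_prev = []  # coefficients of the order k-1 autoregression
    out = []
    for k in range(1, max_lag + 1):
        if k == 1:
            phi_kk = rho[1]
            phi = [phi_kk]
        else:
            num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
            den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
            phi_kk = num / den
            phi = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j]
                   for j in range(k - 1)] + [phi_kk]
        out.append(phi_kk)
        phi_prev = phi
    return out


# Hypothetical residuals for consecutive windows on one chromosome,
# simulated as a first-order autoregressive series with coefficient 0.6.
rng = random.Random(1)
residuals = [0.0]
for _ in range(1999):
    residuals.append(0.6 * residuals[-1] + rng.gauss(0.0, 1.0))

values = pacf(residuals, 5)
flagged = [lag for lag, v in enumerate(values, start=1) if abs(v) >= 0.2]
print(flagged)
```

Residuals from a well-behaved fit would yield an empty list of flagged lags.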

To further explore differences between indels, as well as substitutions, we considered a number of features (some chosen among the predictors in our regression analysis, and some novel—e.g., most conserved element content obtained from the UCSC Genome browser [

We ranked all windows used in the 1-Mb regression analysis according to insertion and deletion rates separately. Next, we computed the difference between each window's ranks in terms of insertion and deletion rates, and selected windows in the ∼1% left and right tails of the distribution of rank differences. These two groups (25 1-Mb windows each) represent genomic locations extremely skewed toward deletions (versus insertions) and toward insertions (versus deletions), respectively. Note that this rank analysis is completely nonparametric and robust to the nature of the relationship between the two mutation types.
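The rank-difference selection can be sketched in Python as below. The sketch assumes the difference is taken as insertion rank minus deletion rank, so that the left tail is skewed toward deletions and the right tail toward insertions, and it assigns average ranks to ties. The function names and the toy rates are hypothetical.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def skewed_windows(insertion_rates, deletion_rates, tail_size):
    """Return window indices in the left and right tails of the rank differences."""
    diff = [a - b for a, b in zip(ranks(insertion_rates), ranks(deletion_rates))]
    order = sorted(range(len(diff)), key=lambda i: diff[i])
    deletion_skewed = order[:tail_size]    # insertion rank far below deletion rank
    insertion_skewed = order[-tail_size:]  # insertion rank far above deletion rank
    return deletion_skewed, insertion_skewed


# Hypothetical rates for six windows; the analysis above would use all 1-Mb
# windows and a tail size of 25.
insertion_rates = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
deletion_rates = [6.0, 2.0, 3.0, 4.0, 5.0, 1.0]
deletion_skewed, insertion_skewed = skewed_windows(insertion_rates, deletion_rates, 1)
print(deletion_skewed, insertion_skewed)
```

Working on ranks rather than raw rates is what makes the selection insensitive to the shape of the relationship between the two rates.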

Median values of some regression predictors and other variables (e.g., fraction of a window covered by most conserved elements) were calculated for the two groups. To test whether differences in medians between the two groups were significant, we used a randomization procedure. We randomly sampled (without replacement) two groups of 25 windows each, and computed the differences in medians between them for all variables considered. Repeating this 10,000 times allowed us to construct empirical null distributions for each difference in medians for variables of interest, and thus empirical

The same approach was used to identify windows extremely skewed toward indels (versus substitutions) and toward substitutions (versus indels), and to test for differences in medians for various variables between these two groups. The windows analyzed in this section were randomly distributed among and within chromosomes (i.e., did not cluster to specific regions in the genome).
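The randomization procedure for differences in medians can be sketched in Python as follows. Two details are assumptions of this sketch rather than statements about the analysis: the two random groups are taken to be disjoint, and the empirical value is computed as a two-sided tail proportion with an add-one correction. The variable values are simulated and hypothetical.

```python
import random
from statistics import median


def median_diff_null(values, group_size, n_iter, rng):
    """Null distribution of the difference in medians between two random groups."""
    null = []
    for _ in range(n_iter):
        # Draw both groups at once, without replacement, so they are disjoint.
        sample = rng.sample(values, 2 * group_size)
        null.append(median(sample[:group_size]) - median(sample[group_size:]))
    return null


def empirical_p(observed, null):
    """Two-sided empirical tail proportion with an add-one correction."""
    extreme = sum(1 for d in null if abs(d) >= abs(observed))
    return (extreme + 1) / (len(null) + 1)


# Hypothetical values of one genomic feature (e.g., GC content) in 500 windows.
rng = random.Random(42)
values = [rng.gauss(0.41, 0.04) for _ in range(500)]

null = median_diff_null(values, group_size=25, n_iter=10000, rng=rng)
observed = 0.03  # hypothetical difference in medians between the skewed groups
print(empirical_p(observed, null))
```

The same null distribution can be reused for every pair of groups of the same size drawn from the same set of windows.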

Plot (log_{10} scale) of intergap distance counts in human–chimpanzee–macaque alignments calculated in ARs after filtering. The data are shown for chromosome 1 only and are representative of the genome-wide distribution. The distribution follows closely the predicted geometric shape, with deviation only in the range of ≤4 bp.

In the box plots, edges correspond to quartiles and vertical dashed lines to the range. Notches represent standard deviations of the median. Nonoverlapping notches are evidence that the two medians differ.

Features were calculated as observed counts or contents (fraction of bases) in a window.

The RCVE is indicated for each predictor significant after Bonferroni correction for multiple tests. For 5-Mb windows, no autosomal labels were identified as significant predictors for both indels. For 10-Mb windows, only insertions had significant autosomal labels.

We are grateful to Webb Miller and Kate Rosenbloom for help in optimizing the alignment parameters, to Yogeshwar Kelkar and Ian Schenck for allowing us to use their codes, and to the Rhesus Macaque Genome Sequencing and Analysis Consortium for the macaque sequence data.

AR, ancestral repeat

indels, insertions and deletions

RCVE, relative contribution to variability explained

VIF, variance inflation factor