Mismatch and G-Stack Modulated Probe Signals on SNP Microarrays

Background
Single nucleotide polymorphism (SNP) arrays are important tools widely used for genotyping and copy number estimation. This technology utilizes the specific affinity of fragmented DNA for binding to surface-attached oligonucleotide DNA probes. We analyze the variability of the probe signals of Affymetrix GeneChip SNP arrays as a function of the probe sequence to identify relevant sequence motifs which potentially cause systematic biases of genotyping and copy number estimates.

Methodology/Principal Findings
The probe design of GeneChip SNP arrays enables us to disentangle different sources of intensity modulations such as the number of mismatches per duplex, matched and mismatched base pairings including nearest and next-nearest neighbors and their position along the probe sequence. The effect of probe sequence was estimated in terms of triple-motifs with central matches and mismatches which include all 256 combinations of possible base pairings. The probe/target interactions on the chip can be decomposed into nearest neighbor contributions which correlate well with free energy terms of DNA/DNA-interactions in solution. The effect of mismatches is about twice as large as that of canonical pairings. Runs of guanines (G) and the particular type of mismatched pairings formed in cross-allelic probe/target duplexes constitute sources of systematic biases of the probe signals with consequences for genotyping and copy number estimates. The poly-G effect seems to be related to the crowded arrangement of probes which facilitates complex formation of neighboring probes with at minimum three adjacent G's in their sequence.

Conclusions
The applied method of “triple-averaging” represents a model-free approach to estimate the mean intensity contributions of different sequence motifs which can be applied in calibration algorithms to correct signal values for sequence effects. Rules for appropriate sequence corrections are suggested.


Hybridization modes and base pairings for probe selection
1. Hybridization modes, probe attributes and interaction groups
Base pairings formed at the center position of the 25-meric probe sequence (mb, middle base) or at the SNP position (SNP), which is offset by δ base positions relative to the center position. The mb and SNP positions are consequently identical for δ=0.

Probe selection for triple-averaging
Standard triples (xBy) are selected according to the scheme shown in part a: the interaction mode of the center base of the triple is defined by the chosen hybridization mode, the probe attributes (type, offset) and the position of 'B' (the SNP or the middle base, mb) in the probe sequence. The interaction mode determines the base pairing formed by 'B' with the target according to one of the four Ab-groups, At, Aa, Ag, Ac (see the Tables above), and the total number of mismatches per probe/target duplex, #mm. Part b shows special selections of triples with one flanking mismatch or with tandem mismatches.
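The averaging step behind this selection can be illustrated with a minimal sketch. It assumes that the probes have already been filtered to a single interaction group and a single #mm value, that each probe is given as a (sequence, intensity) pair, and that logarithms are decadic; the function name `triple_average` and the data layout are hypothetical and not taken from the original analysis.

```python
import math
from collections import defaultdict


def triple_average(probes, position=12):
    """Mean log10 intensity per triple xBy centred at `position` (0-based).

    `probes` is an iterable of (sequence, intensity) pairs that were
    pre-selected for one interaction group and one #mm value (assumption).
    position=12 is the middle base of a 25-mer.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for seq, intensity in probes:
        # Skip probes without a valid triple or with non-positive intensity.
        if intensity <= 0 or not (1 <= position < len(seq) - 1):
            continue
        triple = seq[position - 1:position + 2]  # x, B, y
        sums[triple] += math.log10(intensity)
        counts[triple] += 1
    return {t: sums[t] / counts[t] for t in sums}


# Tiny illustrative input: two probes share the triple CGT.
example = [
    ("A" * 11 + "CGT" + "A" * 11, 100.0),
    ("A" * 11 + "CGT" + "A" * 11, 10000.0),
    ("A" * 11 + "TGA" + "A" * 11, 1000.0),
]
averages = triple_average(example)
```

For an SNP position offset by δ from the center, the same function could be called with `position=12 + δ`, provided the offset convention matches the probe annotation.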

'Hook' criteria for probe selection
Selection criteria that account for non-specific hybridization are chosen from the hook plot of the chip data (see refs. [5,6] and also the figure). Briefly, the intensities of each probe pair are transformed according to $\Delta = \langle \log(I_{PM}/I_{MM}) \rangle_{\text{allele-set}}$ and $\Sigma = 0.5\,\langle \log(I_{PM} \cdot I_{MM}) \rangle_{\text{allele-set}}$ (the angular brackets denote averaging over the respective allele set), plotted in Δ-versus-Σ coordinates and smoothed using a sliding window of ~500 data points. Probe sets with a relatively large contribution of non-specific hybridization, $x_{P,N} > 0.5$ (see Eq. (5)), are characterized by small coordinate values Σ and Δ. Both coordinates increase with decreasing $x_N$ and level off at a peak for vanishing contributions of non-specific binding, $x_{P,N} \approx 0$ (see the figure below). The logarithmic fraction of the probe intensity due to non-specific hybridization can be estimated from the coordinate differences with respect to the starting point of the hook curve [6] (Eq. (E1)), where the sum and the difference refer to P=PM (+) and MM (−), respectively. The fraction $x_{P,N}$ depends on the probe type, with $x_{PM,N} < x_{MM,N}$ for Σ=const. In practice, a threshold of $(\Sigma - \Sigma_{\mathrm{start}}) > 0.7$ is applied to obtain allele sets with an average non-specific intensity contribution of less than 20%, i.e. $\langle x_N \rangle_{\text{allele-set}} < 0.2$ with $\langle \log x_N \rangle_{\text{allele-set}} = 0.5\,\langle \log x_{PM,N} + \log x_{MM,N} \rangle_{\text{allele-set}}$. This implies that at least 80% of the intensity of the selected allele sets originates from either specific or cross-allelic hybridization.
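The relation between the hook coordinates and the non-specific fraction can be made explicit under two assumptions that are not stated in this section: decadic logarithms, and a non-specific intensity of each probe type that equals its value at the starting point of the hook curve. With $\log I_{PM} = \Sigma + \Delta/2$ and $\log I_{MM} = \Sigma - \Delta/2$, one form consistent with the statements above is the following sketch; it should not be read as the published Eq. (E1).

```latex
% Hedged reconstruction under the assumptions named in the lead-in.
\log x_{P,N} \approx -\left[ (\Sigma - \Sigma_{\mathrm{start}})
    \pm \tfrac{1}{2} (\Delta - \Delta_{\mathrm{start}}) \right],
\qquad P = \mathrm{PM}\ (+),\ \mathrm{MM}\ (-)

% Averaging over both probe types cancels the Delta term:
\langle \log x_N \rangle
  = \tfrac{1}{2} \left( \log x_{PM,N} + \log x_{MM,N} \right)
  \approx -(\Sigma - \Sigma_{\mathrm{start}})

% Threshold used for probe selection:
\Sigma - \Sigma_{\mathrm{start}} > 0.7
  \;\Rightarrow\; x_N < 10^{-0.7} \approx 0.2
```

This form reproduces both the ordering $x_{PM,N} < x_{MM,N}$ at constant Σ (because $\Delta > \Delta_{\mathrm{start}}$ along the hook) and the 20% limit quoted for the 0.7 threshold.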
Note that the hook plots obtained from SNP arrays lack the horizontal starting range that is typically observed for expression arrays as a characteristic signature of "absent" probes without complementary targets. Non-specific hybridization contributes less to the signal intensities of SNP arrays than to those of expression arrays, in agreement with previous results [9]. This difference can be rationalized in terms of the smaller heterogeneity of genomic DNA copies (in terms of sequence and fragment length) and especially of the smaller range of copy number variation compared with the range of variation of mRNA transcript concentrations. The latter can cover several orders of magnitude, whereas the former typically changes by less than a factor of ten.

Trivially, the strand direction does not affect the strength of the respective base pairings provided that sequence motifs from both the s- and the as-strands are considered in the same direction. In our analyses we therefore pool the probes which are assigned to the same interaction mode independently of their strand direction (d=s, as), assuming that the respective genotypes are properly assigned on both strands.

Figure: Classification of probe intensities according to their hybridization mode. So-called hook curves are plotted for homozygous-absent (ha) and homozygous-present (hp) probes referring to cross-allelic and allele-specific hybridization modes, respectively. The 'start' coordinates of the hook curve are given by the intersection of the extrapolated ha-hook with the abscissa. The intensity fraction per probe due to non-specific binding depends on the hook coordinates (see Eq. (E1)). The right vertical line refers to $(\Sigma - \Sigma_{\mathrm{start}}) > 0.7$. It was used as the threshold for probe selection to characterize the interaction modes upon specific (S) and cross-allelic (C) hybridization. Above this threshold, probe intensities are distorted, on average, by a contribution of non-specific hybridization of less than 20%. The fraction of non-specific binding differs slightly between the PM and MM probes, as indicated in the figure.
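The hook transformation and the threshold selection described above can be sketched as follows. The sketch assumes decadic logarithms, paired PM/MM intensities per allele set, and a start coordinate `sigma_start` that has been determined separately from the extrapolated ha-hook; all function names are hypothetical.

```python
import math


def hook_coordinates(pm, mm):
    """Return (sigma, delta) for one allele set of paired PM/MM intensities."""
    n = len(pm)
    delta = sum(math.log10(p / m) for p, m in zip(pm, mm)) / n
    sigma = 0.5 * sum(math.log10(p * m) for p, m in zip(pm, mm)) / n
    return sigma, delta


def smooth(points, window=500):
    """Moving average over (sigma, delta) points sorted by sigma."""
    pts = sorted(points)
    if not pts:
        return []
    w = min(window, len(pts))  # shrink the window for small inputs
    out = []
    for i in range(len(pts) - w + 1):
        chunk = pts[i:i + w]
        out.append((sum(s for s, _ in chunk) / w,
                    sum(d for _, d in chunk) / w))
    return out


def select_allele_sets(coords, sigma_start, threshold=0.7):
    """Indices of allele sets with (sigma - sigma_start) above the threshold."""
    return [i for i, (s, _) in enumerate(coords)
            if s - sigma_start > threshold]


# Illustrative values only.
coords = [hook_coordinates([1000.0, 1000.0], [100.0, 100.0]), (3.5, 1.2)]
selected = select_allele_sets(coords, sigma_start=2.6)
```

The plain moving average stands in for whatever smoother was actually used; only the window size of about 500 points is taken from the text.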

Background correction and saturation effects
The figure (panels a and b) shows triple-averaged mean intensities for all 64 standard triples with center pairings taken from the At-group (WC pairings) and from the Aa-group (self-complementary pairings, see also the next section). The data refer either to #mm=0 and 1 mismatches per duplex (At-group) or to #mm=1 and 2 (Aa-group). The mean intensity level decreases with increasing #mm, as discussed in the previous section. The different triples of each class give rise to considerable variability of the intensity values. The standard deviation of the whole set of 64 triples of the At-group is SD(logI)=0.041 and 0.045 for #mm=0 and 1, respectively (part a of the figure), but more than twice as large for the mismatched Aa- (SD=0.12; part b of the figure), Ag- (SD=0.13) and Ac-groups (SD=0.09) for #mm=1 (see also Table 1). Hence, mismatched pairings with adjacent WC pairs give rise to considerably larger variation of duplex stability than triples of WC pairs.

Figure: Triple-averaged probe intensities and background contribution. Panels a and b show the 64 triple-averaged log-intensities of the perfect match (At-group) and self-complementary mismatch (Aa-group) pairings. The data refer to different numbers of total mismatches per duplex (#mm, see the figure; the triples are sorted according to their central pairing Bb). These triple averages were correlated for #mm=0-versus-1 and #mm=1-versus-2 in panel c, where data for the mismatch groups Ag and Ac are also added. The data do not group in parallel with the diagonal owing to the residual background intensity. Taking the background into account predicts the grouping of the data along the thick theoretical curve, which was calculated using Eq. (E2) with g=11. This curve intersects the diagonal line at the background and saturation intensities, $\log I_{BG} = 2.85$ and $\log I_{sat} = 4.1$, respectively. Correction of the intensities for the optical background (curve "O") slightly improves the linear correlation between the intensities, especially for #mm=1-versus-2 (open symbols). Consideration of the non-specific background ($\log I_N = 2.6$) further improves the linear correlation but also inflates the variation of the data (see also curve "O+N"). Panel d shows the triple data of the Aa-group before (thin lines) and after (thick lines) background correction using Eq. (E3).
In general, one expects a similar base-specific effect independently of the total number of mismatches per duplex. To assess this assumption we correlate the triple-averaged log-intensities for #mm=k with those for #mm=k+1, i.e. for duplexes which differ by one mismatched pairing (see part c of the figure). Especially the triple data of the mismatched groups (Aa, Ag, Ac) do not group in parallel with the diagonal line. This behavior indicates poor correlation (solid symbols; see also part b of the figure, which shows the data for the Aa-group with #mm=1 and 2), in contrast to the data of the At-group (#mm=0, 1; part a of the figure).

The discussed intensities contain contributions due to the optical and non-specific background (see Eqs. (2) and (4)). Moreover, the intensities saturate at large transcript concentrations and/or binding constants $K_{duplex}(\#mm)$. Let us describe the probe intensities by the hyperbolic function of $K_{duplex}(\#mm)$ [23,57],

$$I(\#mm) = I_{BG} + I_{sat}\,\frac{c\,K_{duplex}(\#mm)}{1 + c\,K_{duplex}(\#mm)} \qquad \text{(E2)}$$

where $I_{sat}$ denotes the saturation intensity at strong binding, $c \cdot K_{duplex} \gg 1$, and c is the transcript concentration. Assuming that each mismatch changes the binding constant by a constant factor, $K_{duplex}(\#mm+1) = K_{duplex}(\#mm)/g$ (see right axis in Figure 2, panel b), and varying $c \cdot K_{duplex}(0)$ in the limits $0 < c \cdot K_{duplex}(0) < \infty$, we get the theoretical relation between the mean intensities of duplexes which differ by one mismatched pairing (see the curves in panel c of the figure). The theoretical curves intersect the diagonal line (y=x) at low and high intensities at $I = I_{BG}$ and $I = I_{sat}$, respectively, because Eq. (E2) assumes that background and saturation levels are not affected by the number of mismatches. Eq. (E2) predicts significant deviation from the linear relation between the intensities for #mm and #mm+1. The thick curve in panel c of the figure was calculated assuming a residual background intensity of $\log I_{BG} \approx 2.85$. It explains the lack of linear correlation between the experimental triple data for #mm=0-versus-1 and especially for #mm=1-versus-2.

The background used here comprises the optical and non-specific contributions according to Eq. (4). To estimate the optical background we simply select the 1% of probes with the smallest intensities on the array, calculate their log-intensity average ($\log I_O = 2.39$), and correct the intensities for this contribution, $I^{corr}_{O} = I - I_O$ (see open symbols in panel c of the figure). The dashed curve labeled "O" refers to these data, which still contain a contribution due to non-specific background intensity of about $\log I_N \approx 2.65$. Intensity data which are corrected for both contributions, $I^{corr}_{O+N} = I - I_{BG}$, are shown by the small crosses. The respective theoretical curve labeled "O+N" runs parallel with the diagonal line at decreasing intensities. The total background correction markedly inflates the variability of the data at small intensities. This effect is well known from microarray analyses as the consequence of log-transformed data diverging as the argument approaches zero. To avoid this trend it is common practice to confine the corrected data to a lower limit, for example by adding a small constant value to the corrected intensities. We also apply this modification using $(\log I_N - o)$ with o=0.6 instead of $\log I_N$.

So far we have estimated the mean optical and non-specific background levels which apply to all probes of the chip. The background contribution due to non-specific hybridization is governed by the binding reaction of non-specific transcripts (see Eq. (1)). It consequently depends on the probe sequence and is thus specific for each probe. We previously showed that non-specific hybridization is basically characterized by Watson-Crick pairing [18]. Final background correction of the triple-averaged intensities was therefore applied in a sequence-specific fashion using Eq. (E3) (see Eq. (8)).
This correction progressively reduces the mean intensity level for #mm=1 and #mm=2 (see Figure 2, part b and the figure above, part d). The triple-specific effect is almost negligible for #mm≤1 but it affects the results for #mm=2.
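The theoretical relation between the intensities of duplexes that differ by one mismatch can be traced numerically. The following minimal sketch assumes the hyperbolic form $I = I_{BG} + I_{sat}\,cK/(1+cK)$, decadic logarithms, and the parameter values quoted in the figure caption (g=11, $\log I_{BG}=2.85$, $\log I_{sat}=4.1$); the function names and the scanned range of $c \cdot K_{duplex}(0)$ are hypothetical choices.

```python
import math


def intensity(ck, i_bg, i_sat):
    """Hyperbolic binding isotherm plus constant background (assumed form)."""
    return i_bg + i_sat * ck / (1.0 + ck)


def mismatch_curve(log_i_bg=2.85, log_i_sat=4.1, g=11.0, n=50):
    """Pairs (logI(#mm), logI(#mm+1)) for c*K scanned over 1e-3 .. 1e4."""
    i_bg, i_sat = 10.0 ** log_i_bg, 10.0 ** log_i_sat
    curve = []
    for j in range(n):
        ck = 10.0 ** (-3.0 + 7.0 * j / (n - 1))
        x = math.log10(intensity(ck, i_bg, i_sat))      # #mm mismatches
        y = math.log10(intensity(ck / g, i_bg, i_sat))  # one more mismatch
        curve.append((x, y))
    return curve


curve = mismatch_curve()
# Largest vertical distance from the diagonal y = x along the curve.
max_gap = max(x - y for x, y in curve)
```

In this form the upper asymptote is $I_{BG} + I_{sat}$, which differs only slightly from $I_{sat}$ for the values used here. The curve approaches the diagonal at both ends and bows away from it in between, which is the non-linear behavior discussed above.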