Figure 1.
(A) A schematic drawing of the X chromosome delineating each evolutionary stratum [13,15].
(B) Strategy for statistical analysis and SVM training and classification.
Table 1.
Repeat Features Whose Distributions Are Significantly Different in All Three Evolutionary Stratifications when Comparing the Genomic Environments of Genes That Are Subject to Inactivation and Genes That Escape Inactivation
Figure 2.
Boxplots Representing the Content of Alu, L1, L2, MIR, and ACG/CGT 3-mers in 100-kb Windows Surrounding the Transcription Starts of Genes
For each boxplot, horizontal lines indicate the locations of the lower quartile, median, and upper quartile values. Notches represent a robust estimate of the uncertainty about the medians for box-to-box comparison. Boxes whose notches do not overlap indicate that the medians of the two groups differ at the 0.05 significance level. Clear boxes represent genes that escape inactivation, while shaded boxes represent genes subject to inactivation [12]. Analyses were done with all genes for which X inactivation status is known (n = 448) or with genes from the XAR region (n = 110). y-Axes in all plots represent the average content of each sequence feature in 100-kb windows surrounding the transcription starts of genes.
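The notch comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' plotting code: it assumes the common convention of placing notches at median ± 1.57 × IQR / √n, which the caption does not state explicitly, and the function names and sample values are hypothetical.

```python
import math
import statistics


def notch_interval(values):
    """Return (lower, median, upper) notch bounds for one group.

    Assumption: notches follow the median +/- 1.57 * IQR / sqrt(n)
    convention; the caption does not name the exact formula used.
    """
    n = len(values)
    # Inclusive quartiles; the middle cut point equals the sample median.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    half_width = 1.57 * (q3 - q1) / math.sqrt(n)
    return median - half_width, median, median + half_width


def notches_overlap(group_a, group_b):
    """True if the notch intervals of the two groups overlap."""
    lo_a, _, hi_a = notch_interval(group_a)
    lo_b, _, hi_b = notch_interval(group_b)
    return lo_a <= hi_b and lo_b <= hi_a


# Hypothetical feature contents for two groups of genes (not real data).
escape = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11]
subject = [20, 22, 21, 23, 22, 21, 20, 22, 23, 21]
print(notches_overlap(escape, subject))
```

Under this convention, non-overlapping notches correspond to the caption's statement that the two medians differ at the 0.05 significance level.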
Table 2.
Classification Accuracy for XAR Genes, XAR ESTs, and XCR Genes Using 5,596 Features from 50-kb and 100-kb Windows around Transcription Start Sites of the Genes
Figure 3.
Recursive Feature Reduction and Distributions of Consistent Features across the XAR Nonborder Genes
(A) The mean prediction accuracy and standard deviations (y-axis) for 100 recursive feature reduction iterations are shown for the indicated number of features (x-axis). Green represents the CV rate obtained using a randomly selected two-thirds of the XAR nonborder genes for each set of features. The prediction rates for escaping genes (blue) and subject genes (red) in the remaining one-third are also shown. Both escape and subject prediction rates begin declining when the feature set is reduced to fewer than 53 features.
(B) The content of each feature (y-axis) in specific windows around the transcription start sites for all 82 XAR nonborder genes (x-axis) is represented as a histogram. The first 36 genes on the x-axis escape X inactivation (shaded area), and the remaining 46 are subject to X inactivation. Features found to be consistently chosen during recursive feature reduction for the creation of accurate classifiers are L1 100 kb downstream, MLT1K 100 kb upstream, and MER33 100 kb upstream. For comparison, THE1B 50 kb upstream, a randomly distributed feature, is also shown.
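The evaluation scheme in panel (A), a random two-thirds/one-third split with separate prediction rates for escaping and subject genes, can be sketched as follows. This is only an outline of the bookkeeping: the SVM training and feature reduction steps are omitted, and the function names, label strings, and rounding of the split size are assumptions made for illustration.

```python
import random


def split_two_thirds(genes, rng):
    """Randomly split genes into a two-thirds set and a one-third set."""
    shuffled = list(genes)
    rng.shuffle(shuffled)
    # Assumption: the two-thirds size is rounded down.
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


def per_class_rate(true_labels, predicted_labels):
    """Return the fraction of correct predictions for each true class."""
    totals, correct = {}, {}
    for truth, predicted in zip(true_labels, predicted_labels):
        totals[truth] = totals.get(truth, 0) + 1
        if truth == predicted:
            correct[truth] = correct.get(truth, 0) + 1
    return {label: correct.get(label, 0) / totals[label] for label in totals}


# 82 XAR nonborder genes, represented here only by hypothetical indices.
training, held_out = split_two_thirds(range(82), random.Random(0))
print(len(training), len(held_out))
```

Repeating the split for each of the 100 iterations and averaging the per-class rates would yield the kind of mean and standard deviation curves the panel describes.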
Table 3.
Frequently Selected Features during 100 Independent Feature Selection Experiments Involving Random Subsets of XAR Nonborder Genes and Their Individual Classification Performance on XAR Nonborder Genes
Table 4.
Classification Accuracy for XAR Genes, XAR ESTs, and XCR Genes Using a Reduced Set of 12 Features and XAR Nonborder Genes
Figure 4.
The Significance of the 12 Selected Features
(A) The top three principal components (PC1–PC3) computed from all 5,596 features for 50-kb and 100-kb windows (left) and from the selected 12 features (right) are shown for the 82 nonborder genes, projected onto a 3-D graph. Escaping genes are represented as blue circles and subject genes as red circles.
(B) These histograms show the distribution of XAR leave-one-out CV and XCR prediction rates for SVM models constructed using 1,000 random 12-feature sets drawn from the 5,596 features for 50-kb and 100-kb windows. Black dots represent mean values, flanked by 95% confidence intervals denoted by error bars representing two standard deviations (2SD). Both the XAR CV rate and the XCR prediction rate achieved by the selected 12 features (black arrows) lie more than 2SD above the corresponding random-set means, and their p values calculated from these random trials are shown.
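The comparison in panel (B) amounts to placing one observed rate against a distribution of rates from random feature sets. The sketch below shows one way such a summary could be computed. The caption does not say how the p values were derived, so the empirical estimate used here (the share of random sets performing at least as well, with a +1 correction) is an assumption, as are the function name and the input values.

```python
import statistics


def summarize_null(random_rates, observed_rate):
    """Compare an observed rate with rates from random feature sets."""
    mean = statistics.mean(random_rates)
    sd = statistics.stdev(random_rates)  # sample standard deviation
    at_least_as_good = sum(1 for rate in random_rates if rate >= observed_rate)
    # Assumption: empirical p value with a +1 correction so it is never zero.
    p_value = (at_least_as_good + 1) / (len(random_rates) + 1)
    return {
        "mean": mean,
        "sd": sd,
        "upper_2sd": mean + 2 * sd,
        "exceeds_2sd": observed_rate > mean + 2 * sd,
        "p_value": p_value,
    }


# Hypothetical prediction rates from four random 12-feature sets.
summary = summarize_null([0.5, 0.6, 0.7, 0.6], observed_rate=0.9)
print(summary["exceeds_2sd"], summary["p_value"])
```

With 1,000 random sets, the smallest p value this estimate can report is 1/1,001.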
Figure 5.
The Distribution of SVM Prediction Probabilities for Genes with Known X Inactivation Status
These histograms summarize the prediction probabilities of genes that (A) are subject to inactivation (expressed in zero of nine somatic cell hybrids) or (B) escape inactivation (expressed in nine of nine hybrids) [12]. Genes from the XCR, XAR border genes, and nonborder XAR genes coupled with XAR ESTs are represented by different colors.
XCI, X chromosome inactivation.