Improved Discovery of Molecular Interactions in Genome-Scale Data with Adaptive Model-Based Normalization

doi:10.1371/journal.pone.0053930

Table 1.

Number of targets called by SAM after AD or median normalization.

Figure 1.

An example of the plot used to find k.

An example (for PUF3) of the plot used by the AD normalization method to find the number of genes to use for normalization (k). Each spot is an individual RNA. The y-axis shows the Mock minus IP enrichment values and the x-axis shows the rank of those same values. All mRNAs are plotted with black circles. The heat colors show the density of mRNAs that contain a PUF3 motif site in their 3′-UTR (no motif: black; with motif: from orange to yellow). The vertical section on the left (indicated with a blue dashed circle) corresponds to RNAs that are most enriched in the Mock relative to the IP. The vertical section on the right (indicated with a blue dashed circle) corresponds to RNAs that are most enriched in the IP relative to the Mock. The vertical dashed line indicates the k value chosen by the algorithm. All the RNAs to the left of this line were used to normalize this array. 91% (311/343) of the mRNAs containing a PUF3 3′-UTR motif site fall to the right of this line, suggesting that our algorithm successfully identified the primarily non-target (i.e. background) RNAs to use for normalization in this case.
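The ranking step described in this caption can be sketched in a few lines of Python. This is an illustrative sketch only, not the published AD normalization algorithm: the function names are hypothetical, the rank order (most Mock-enriched RNAs first) is inferred from the caption, and taking the median of the first k ranked values as the normalization constant is an assumption, since the caption does not say how the constant is computed or how k is chosen.

```python
import statistics

def ranked_differences(mock, ip):
    """Sort per-RNA Mock minus IP values so the most Mock-enriched RNAs come first."""
    return sorted((m - i for m, i in zip(mock, ip)), reverse=True)

def normalization_constant(ranked, k):
    """Hypothetical constant: median Mock minus IP value of the first k ranked RNAs.

    ASSUMPTION: the source does not state how the constant is derived from the
    k selected RNAs; the median is used here purely for illustration.
    """
    if not 1 <= k <= len(ranked):
        raise ValueError("k must be between 1 and the number of RNAs")
    return statistics.median(ranked[:k])

def normalize_ip(ip, constant):
    """Shift IP values so the selected background RNAs line up with the Mock."""
    return [value + constant for value in ip]

# Toy data: three background RNAs offset by 1.0 in the IP, two strongly enriched RNAs.
mock = [0.0, 0.1, -0.1, 0.0, 0.2]
ip = [1.0, 1.1, 0.9, 4.0, 5.2]

ranked = ranked_differences(mock, ip)
constant = normalization_constant(ranked, k=3)  # k supplied by hand in this sketch
normalized_ip = normalize_ip(ip, constant)
```

In this toy case the three background RNAs determine the constant, so the two enriched RNAs keep their large IP values after the shift instead of pulling the whole distribution down.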

Figure 2.

Simulated data illustrates a fundamental advantage of AD normalization.

(A) Histograms of simulated IP data for RBPs with an increasing number of targets (pink, red, and dark red lines) and the Mock IP data (gray line), normalized by AD normalization. (B) Same as A, except data was normalized by median normalization. Note how AD normalization properly aligned the portion of each distribution that contained the simulated background data, while median normalization did not, resulting in lower normalized IP enrichment values for the simulated RBP IP targets. The vertical dashed lines have been added to highlight the position of the (Mock) background distribution.
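The effect described in panel B can be reproduced with a small simulation. The sketch below is not the simulation used for the figure: the gene count, target counts, target shift, and function names are all assumptions chosen for illustration. It draws background values from a standard normal distribution, shifts a chosen number of "target" genes upward, applies median normalization to the whole array, and then reports where the median of the true background genes ends up.

```python
import random
import statistics

def simulate_ip(n_genes, n_targets, target_shift=3.0, seed=0):
    """Return (values, is_target): background ~ N(0, 1), targets shifted upward."""
    rng = random.Random(seed)
    values, is_target = [], []
    for index in range(n_genes):
        target = index < n_targets
        values.append(rng.gauss(0.0, 1.0) + (target_shift if target else 0.0))
        is_target.append(target)
    return values, is_target

def background_median_after_median_norm(values, is_target):
    """Median-normalize all values, then return the median of the background genes."""
    center = statistics.median(values)
    background = [v - center for v, t in zip(values, is_target) if not t]
    return statistics.median(background)

# Background offset for simulated RBPs with few, many, and very many targets.
offsets = {}
for n_targets in (100, 1000, 3000):
    values, is_target = simulate_ip(n_genes=6000, n_targets=n_targets)
    offsets[n_targets] = background_median_after_median_norm(values, is_target)
    print(n_targets, round(offsets[n_targets], 3))
```

Under these assumptions the background offset becomes more negative as the number of targets grows, which is the misalignment of the background distribution that the caption attributes to median normalization.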

Figure 3.

The AD normalization method properly aligns Mock and IP distributions for PAB1 and PUF3.

Applying the AD normalization method to IP data from the RBP PAB1 results in greater enrichment relative to the Mock. (A) PAB1 IP data (red line) compared to Mock IP data (black line), both normalized by median normalization. Median normalization results in the dubious situation where there are genes with more negative enrichment values in the PAB1 IP than in the Mock IP, as evidenced by the shift of the left-hand side of the PAB1 IP distribution relative to the left-hand side of the Mock IP distribution (highlighted in blue). (B) Same as in part A, except with PUF3 IP data. (C) Same as in part A, except with AD-normalized data. AD normalization yields a more plausible outcome in which there are no longer more genes with negative enrichment values in the PAB1 IP than in the Mock IP. (D) Same as in part C, except with PUF3 IP data, showing that AD normalization does not simply shift every distribution more than median normalization does; it properly aligns Mock and IP distributions for RBPs with many targets and with few targets.

Table 2.

The number of false targets identified from Mock-Mock comparisons after AD or median normalization.

Figure 4.

The AD normalization method is robust to different types of input data.

The method of identifying the normalization constant is robust to different types of input data. (A) A plot of the log base 2 enrichment values for the RBP IP (y-axis) and the Mock IP (x-axis) for the RBP PAB1. Each point represents a specific gene. (B) Same as A, except for the RBP SCD6. (C) Same as A, except for the RBP PUF3. (D) A plot of each Mock – IP value (y-axis) vs. its rank (x-axis) for the RBP PAB1. Each point represents the Mock value minus the IP value for a specific gene. This plot is used by the AD normalization method to select the number of genes to use for normalization (called k). (E) Same as D, except for the RBP SCD6. (F) Same as D, except for the RBP PUF3. (G) A plot of the normalization constant vs. the number of genes in the gene set used for normalization for the RBP PAB1 is shown in blue. The vertical blue dotted line indicates the number of genes to be used for the normalization, chosen by the AD normalization method from the above plot. The horizontal blue dotted line indicates the corresponding normalization constant. This plot is used to find the normalization constant for a given value of k. (H) Same as G, except for the RBP SCD6. (I) Same as G, except for the RBP PUF3. The examples shown here are from RBPs with widely varying numbers of targets (300–3,000), purified using different reagents, some amplified and some unamplified, by different experimenters over a span of more than 6 years, and using different microarray platforms. Despite these differences, for each sample there was a sufficiently large range of k for which could be modeled linearly.
