A Horizontal Alignment Tool for Numerical Trend Discovery in Sequence Data: Application to Protein Hydropathy
Figure 2
Empirically determined probability model for protein hydropathy.
A. Inverse Chi-Squared model for the distribution of observed scores. Distributions of Equation 4 scores for HePCaT alignments of length L = 100, obtained with parameters W = 5 residues, GapMax = 4 residues, and C = 0.4. Pairs of random sequences were generated, their Kyte-Doolittle amino acid hydropathies were averaged over a 15-residue window, and the resulting profiles were optimally aligned using HePCaT, as described in the text. In each case, the binned data were reasonably well fit by the Inverse Chi-Squared probability distribution function (PDF, Equation 5), as described in Methods and tabulated in Table 1.
B. Analytical parameters to estimate statistical significance. The PDF parameters ν and σ² were observed to vary smoothly as a function of HePCaT alignment length, allowing the parameters, and thus alignment significance, to be estimated analytically for arbitrary alignment length using Equations 6 and 7 and the parameters in Table 2. Discrete best-fit values of ν and σ² are given in Table 1. The displayed best-fit curves are y = 0.497609x (Hydropathy, ν) and y = 0.160379 − 1.04167 ln(x + 38.9045) (Hydropathy, σ²).
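The preprocessing and the fitted density named in this caption can be illustrated with a minimal Python sketch. It is not the HePCaT implementation. The function names are hypothetical, the random sequences are drawn with uniform residue frequencies (the caption does not state the composition), window averages are taken over full windows only with no edge padding, and the density is assumed to be the standard scaled inverse chi-squared form with parameters ν and σ², which may differ in detail from Equation 5.

```python
import math
import random

# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
ALPHABET = sorted(KYTE_DOOLITTLE)


def random_sequence(length, rng):
    """Random protein sequence. Assumption: uniform residue frequencies."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def hydropathy_profile(sequence, window=15):
    """Mean hydropathy over each full window (assumption: no edge padding)."""
    values = [KYTE_DOOLITTLE[residue] for residue in sequence]
    if len(values) < window:
        return []
    total = sum(values[:window])
    profile = [total / window]
    # Slide the window one residue at a time, updating the running sum.
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        profile.append(total / window)
    return profile


def scaled_inv_chi2_pdf(x, nu, sigma2):
    """Scaled inverse chi-squared density (assumed form of the fitted PDF).

    f(x) = (sigma2 * nu / 2)^(nu / 2) / Gamma(nu / 2)
           * exp(-nu * sigma2 / (2 x)) / x^(1 + nu / 2),  for x > 0.
    Requires nu > 0 and sigma2 > 0.
    """
    if x <= 0:
        return 0.0
    half_nu = nu / 2.0
    # Work in log space so large nu does not overflow the gamma function.
    log_pdf = (
        half_nu * math.log(sigma2 * half_nu)
        - math.lgamma(half_nu)
        - nu * sigma2 / (2.0 * x)
        - (1.0 + half_nu) * math.log(x)
    )
    return math.exp(log_pdf)


if __name__ == "__main__":
    rng = random.Random(0)
    profile = hydropathy_profile(random_sequence(100, rng), window=15)
    print(len(profile))
    print(scaled_inv_chi2_pdf(1.0, nu=2.0, sigma2=1.0))
```

With these conventions, a 100-residue sequence yields 100 − 15 + 1 = 86 window averages. Pairs of such profiles would then be passed to the alignment step, which is not reproduced here.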