Bayesian analysis of static light scattering data for globular proteins

Static light scattering is a popular physical chemistry technique that enables calculation of physical attributes such as the radius of gyration and the second virial coefficient for a macromolecule (e.g., a polymer or a protein) in solution. The second virial coefficient is a physical quantity that characterizes the magnitude and sign of pairwise interactions between particles, and hence is related to aggregation propensity, a property of considerable scientific and practical interest. Estimating the second virial coefficient from experimental data is challenging due both to the degree of precision required and the complexity of the error structure involved. In contrast to conventional approaches based on heuristic ordinary least squares estimates, Bayesian inference for the second virial coefficient allows explicit modeling of error processes, incorporation of prior information, and the ability to directly test competing physical models. Here, we introduce a fully Bayesian model for static light scattering experiments on small-particle systems, with joint inference for concentration, index of refraction, oligomer size, and the second virial coefficient. We apply our proposed model to study the aggregation behavior of hen egg-white lysozyme and human γS-crystallin using in-house experimental data. Based on these observations, we also perform a simulation study on the primary drivers of uncertainty in this family of experiments, showing in particular the potential for improved monitoring and control of concentration to aid inference.


Data Cleaning and Preparation
Here we describe procedures involved in cleaning raw SLS data and preparing it for use. These procedures were employed for the data cases examined in this work, but are also suggested as a practical workflow for handling data of this kind more generally. Although our emphasis is on data processing, some comments are also made regarding experimental procedure; it is helpful for the analyst to be aware of these issues both to avoid misinterpretation of experimental artifacts (e.g., spurious transient signals due to air bubbles) and to provide guidance to experimental collaborators regarding elements of procedure that may complicate subsequent data analysis.
To obtain the data used in this paper, we performed static light scattering experiments on aqueous solutions of hen egg white lysozyme and human γS-crystallin, using a combination of a Dawn HELEOS multi-angle light scattering (LS) detector and an Optilab rEX refractive index (RI) detector from Wyatt Technology (Santa Barbara, CA). Figure S1 shows a flowchart for an SLS experiment. Using batch-mode injections, prepared samples were injected through a flow cell using a syringe pump, starting with buffer only, followed by lysozyme samples (lowest to highest concentration), and ending with buffer only for baseline correction. Filters (0.1 µm) were used to enhance monodispersity as samples were injected through the flow cell. Batch-mode collection does introduce potential artifacts that cause some inconsistencies in the scattering intensities; these artifacts result from air bubbles or from back pressure generated by the injection of a new sample during data collection. In this section, we propose an automatic procedure for removing these common and unavoidable artifacts, which proves to greatly facilitate the SLS data preparation process.

Removal of Experimental Artifacts
As noted above, the primary sources of experimental artifacts for both light scattering and refractive index measurements are transient changes in signal due to changes in concentration (during the transition from one concentration to another) and/or the presence of air bubbles in the syringe or the flow cell. In practice, these manifest as large deviations in the input series that are easily noted on inspection of the raw data (see left panel of Figure 2 in the main text). To detect and remove these contaminated observations, we employ a local estimate of the absolute time derivative of the measured signal as a notion of measurement "stability." Specifically, let y_1, ..., y_m be a sequence of raw scattering or refractive index measurements at times t_1, ..., t_m. We then estimate the instability of measurement y_i by

    ŝ_i(y, t) = | (y_{i+1} − y_{i−1}) / (t_{i+1} − t_{i−1}) |,

where | · | indicates the absolute value. To trim transients from the data set, we remove all observations y_i such that

    ŝ_i(y, t) > Q_50(ŝ) + τ · IQR(ŝ),

where Q_50 is the 0.5 quantile (i.e., sample median), IQR is the inter-quartile range, and τ is a user-selected threshold. We typically select τ = 1, which removes all observations whose instability is estimated to be more than one IQR above the median for that time series; in the event that inspection of the trimmed series reveals some "flyaway" points remaining, smaller values of τ may be necessary. An example of the impact of trimming on raw data series is shown in the right panel of Figure 2 from the main text. As can be seen, the above criterion has successfully removed most experimental artifacts, leaving a stable interval of points associated with each concentration. We have found that this approach (with τ = 1) typically works quite well, with reductions of τ to 0.5 or 0 usually sufficing to remove artifacts in difficult cases.
Occasionally, especially dramatic transient fluctuations will yield small clusters of points between concentration levels that are not caught by ŝ directly (because the signal fluctuation temporarily "pauses"). In these cases, we have found that applying a mild moving-average smoothing to ŝ, using ŝ_i(y, t) ← (ŝ_{i−1}(y, t) + ŝ_i(y, t) + ŝ_{i+1}(y, t))/3 in place of ŝ_i(y, t), readily corrects the problem.
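For concreteness, the trimming rule above can be sketched as follows (a minimal Python sketch; we assume a central-difference instability estimate, and the function names are illustrative only):

```python
import numpy as np

def instability(y, t):
    """Central-difference estimate of the absolute time derivative |dy/dt|.

    Endpoints use one-sided differences so every observation gets a score.
    """
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    s = np.empty_like(y)
    s[1:-1] = np.abs((y[2:] - y[:-2]) / (t[2:] - t[:-2]))
    s[0] = np.abs((y[1] - y[0]) / (t[1] - t[0]))
    s[-1] = np.abs((y[-1] - y[-2]) / (t[-1] - t[-2]))
    return s

def trim_transients(y, t, tau=1.0, smooth=False):
    """Drop observations whose instability exceeds median + tau * IQR."""
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    s = instability(y, t)
    if smooth:
        # mild 3-point moving average of s, as described in the text
        # (boundary values are averaged against implicit zero padding)
        s = np.convolve(s, np.ones(3) / 3, mode="same")
    q50 = np.median(s)
    iqr = np.percentile(s, 75) - np.percentile(s, 25)
    keep = s <= q50 + tau * iqr
    return y[keep], t[keep], keep
```

Note that with a central difference, an isolated spike has near-zero estimated instability at the spike point itself (the "pause" phenomenon noted above); the 3-point smoothing option spreads its neighbors' instability onto it so that the full transient is removed.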

Preparation of Raw Measurements
After removing the experimental artifacts with the procedure described in Section 1.1, we propose to match the observations with the concentration levels using a clustering algorithm, and then to summarize these observations using a robust estimator (e.g., the median of the trimmed observations) to assign a single refractive index value and a single Rayleigh ratio value to each concentration level.

Clustering of Measurements by Concentration
Although manual identification of data points associated with each concentration condition is possible, we instead automate the process using a constrained agglomerative clustering algorithm. We initialize the algorithm by taking each set of raw measurements (post-filtering) to be an initial "cluster"; we observe that each such measurement is associated with a particular time point, which we exploit in the agglomeration process. In particular, we consider at each iteration the merger of each current cluster with its immediate temporal neighbors (i.e., the clusters immediately prior to and immediately following it in temporal order). The cost of merging two clusters is determined as follows. First, the merge penalty for two points is taken to be the median absolute difference between all measurements taken on those points (i.e., if scattering at k angles is observed, the median of the absolute differences over the k angles gives the partial penalty associated with placing these two points within the same cluster). The merge cost for two clusters is then defined to be the median of the pointwise merge penalties over all point pairs in the respective clusters. At each stage in the clustering process, the merge cost for each pair of temporally adjacent clusters is computed, and the lowest-cost merge is performed. This process is repeated until the number of clusters equals the number of design concentration conditions (plus the concentration-zero baselines immediately before and after the non-zero concentrations).
This algorithm is simple to implement and produces solutions with the desired properties of temporal contiguity (a basic feature of the experimental design), repeatability, and optimality with respect to merge cost. Use of the double median merge cost minimizes the impact of any unstable measurements not removed during the filtering stage on the clustering solution.
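The procedure above might be implemented along the following lines (a Python sketch; the function names are illustrative, and the repeated recomputation of adjacent-pair costs is kept for clarity rather than speed):

```python
import numpy as np

def merge_cost(A, B):
    """Double-median merge cost between two clusters.

    A, B: arrays of shape (n_a, k) and (n_b, k), where k is the number of
    measurement channels (e.g., scattering angles). The per-pair penalty is
    the median absolute difference over the k channels; the merge cost is
    the median of these penalties over all cross-cluster point pairs.
    """
    pair_penalty = np.median(np.abs(A[:, None, :] - B[None, :, :]), axis=2)
    return np.median(pair_penalty)

def cluster_by_concentration(X, n_clusters):
    """Temporally constrained agglomerative clustering.

    X: (m, k) array of filtered measurements in time order.
    Only temporally adjacent clusters may merge; the cheapest merge is
    performed repeatedly until n_clusters clusters remain. Returns a list
    of index lists, one per cluster, in temporal order.
    """
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        costs = [merge_cost(X[clusters[j]], X[clusters[j + 1]])
                 for j in range(len(clusters) - 1)]
        j = int(np.argmin(costs))
        clusters[j] = clusters[j] + clusters.pop(j + 1)
    return clusters
```

Because merges are restricted to temporally adjacent clusters, the returned clusters are always contiguous time segments, matching the experimental design.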

Initial Scattering and Refractive Index Measurements
Given the assignment of (filtered) raw scattering and refractive index measurements to concentration levels, we then obtain initial estimates of R(θ, c) and n_c by taking the sample median of the raw measurements in each cluster. Our choice of the sample median is motivated by robustness to remaining unstable measurements; since the number of raw observations per concentration is generally large (e.g., 50-300), efficiency is not a substantial concern. To obtain a robust initial estimate of the variability of each point estimate, we employ the nonparametric confidence intervals of Gibbons and Chakraborti (2011, eq. 5.4.10). (Note that, at this stage in our analysis, we seek an estimate of the variability in the measurement, not the underlying parameter. These estimates are used to set priors for measurement errors within the subsequent stage of analysis.)
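A distribution-free interval of this kind can be computed from binomial order statistics, in the spirit of the interval cited above (a Python sketch; the exact indexing convention of eq. 5.4.10 may differ slightly, and `median_ci` is an illustrative name):

```python
import numpy as np
from scipy.stats import binom

def median_ci(x, conf=0.95):
    """Distribution-free confidence interval for the population median.

    Chooses r so that P(Binomial(n, 1/2) <= r - 1) <= (1 - conf)/2; the
    order statistics X_(r) and X_(n-r+1) then bracket the median with
    coverage of at least `conf`. Returns (lower, upper, sample median).
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    alpha = 1.0 - conf
    r = max(int(binom.ppf(alpha / 2.0, n, 0.5)), 1)
    return x[r - 1], x[n - r], float(np.median(x))
```

Because the bounds are order statistics of the raw cluster measurements, the interval is robust to the same heavy-tailed contamination that motivates the use of the median itself.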

Additional Considerations for Practice
To date, we have found the above approach to work well in handling refractive index and light scattering data taken over a range of concentrations, on multiple instruments. Typically, little or no manual intervention is required. However, it is important for the analyst to visually inspect the clustering solution and initial estimates for both scattering and refractive index series to ensure that: (1) unstable observations were properly removed; (2) clustered observations are sensible (and do not, e.g., group together observations taken at different concentrations); and (3) initial scattering/refractive index measurements properly represent the central tendencies of measurements at their respective concentration levels. In our experience, problems with this process invariably result from poor filtering of unstable points. This can be rectified by modifying the choice of τ and/or applying moving-average smoothing to ŝ as described in Section 1.1.
Although our focus here is on data analysis, some considerations relating to experimental conditions should also be borne in mind. As we have emphasized above (and show in the main text), SLS experiments and resulting inference are more sensitive to concentration than many biophysical experiments, and hence precision in sample preparation and liquid handling is essential. We have also observed that, with multiangle detectors, it is not uncommon for some detectors to function poorly or to have subtle hardware defects such as loose collimators that induce "drift" in measurements over the course of the experiment (observable by inspection as large and often dynamic changes in signal relative to other detectors under simultaneous data acquisition). Inspection of raw and processed data for bad detectors (and their correction or removal) is hence recommended. Likewise, it is recommended that the linearity of the RI signal with respect to concentration be verified; we have observed that some refractometers intended for use with SLS experiments will give erroneous readings for proteins with anomalously high dn/dc values, especially at high concentrations, manifesting as large, unphysical, and non-monotone changes in measured RI when the RI exceeds a threshold value. Removal of RI measurements for these conditions is then essential for accurate analysis. Another important experimental parameter is the temperature, which may produce apparent differences in the refractive index data if uncontrolled (Abbate et al., 1978; Tan and Huang, 2015).

Model assessment for case study on Lysozyme
Tables S1 and S2 display the posterior predictive p-values for the Rayleigh ratio and ∆n readings. Most of the predictive p-values fall within the range of roughly 0.25 to 0.75, suggesting that the model provides a good fit to the observed data, especially for the ∆n readings. It is worth noting that the posterior predictive p-values for the Rayleigh ratio readings are closer to 0.50 for large concentration levels, and that the fit for two experiments (pH = 4.7, NaCl = 150 mM; pH = 6.9, NaCl = 250 mM) is slightly worse compared to the others. The latter may suggest the presence of unobserved factors associated with experimental procedure affecting these data points. However, the data are nevertheless quite compatible with what would be expected under the model.

Model assessment for case study on human γS-crystallin
Tables S3 and S4 show the posterior predictive p-values for models M_21, M_22, and M_12, aggregated over replicates, for the Rayleigh ratio and ∆n readings. These p-values are close to 0.5.