Sampling Enrichment toward Target Structures Using Hybrid Molecular Dynamics-Monte Carlo Simulations

Sampling enrichment toward a target state, an analogue of the improvement of sampling efficiency (SE), is critical in both the refinement of protein structures and the generation of near-native structure ensembles for the exploration of structure-function relationships. We developed a hybrid molecular dynamics (MD)-Monte Carlo (MC) approach to enrich the sampling toward the target structures. In this approach, the higher SE is achieved by perturbing the conventional MD simulations with a MC structure-acceptance judgment, which is based on the coincidence degree of small angle x-ray scattering (SAXS) intensity profiles between the simulation structures and the target structure. We found that the hybrid simulations could significantly improve SE by making the top-ranked models much closer to the target structures both in the secondary and tertiary structures. Specifically, for the 20 mono-residue peptides, when the initial structures had the root-mean-squared deviation (RMSD) from the target structure smaller than 7 Å, the hybrid MD-MC simulations afforded, on average, 0.83 Å and 1.73 Å in RMSD closer to the target than the parallel MD simulations at 310K and 370K, respectively. Meanwhile, the average SE values are also increased by 13.2% and 15.7%. The enrichment of sampling becomes more significant when the target states are gradually detectable in the MD-MC simulations in comparison with the parallel MD simulations, and provide >200% improvement in SE. We also performed a test of the hybrid MD-MC approach in the real protein system, the results showed that the SE for 3 out of 5 real proteins are improved. Overall, this work presents an efficient way of utilizing solution SAXS to improve protein structure prediction and refinement, as well as the generation of near native structures for function annotation.


Introduction
Biological functions of macromolecules can usually be understood in fine detail on the basis of their atomic structures. Experimental methods including nuclear magnetic resonance (NMR), Cryo-electron microscopy and X-ray crystallography as well as computational algorithms are vigorously developed to provide reliable atomic structures. However, despite the huge progresses that have been made in protein structure prediction and refinement in the last two decades [1,2], fundamental problems associated with simulating and scoring of protein conformations are still far from being properly resolved. In fact, the most recent Critical Assessment of Techniques for Protein Structure Prediction (CASP) experiment indicates that within the last ten years, marginal improvement has been achieved in the predictive accuracy of the overall backbone structure [2] while diverse efforts in protein structure refinement have brought some degree of improvement [3].
Among these efforts, molecular dynamics (MD) simulations equipped with classical force fields are extensively used for studies on mechanisms of protein functions [4,5] and for structure refinement [6,7]. However, MD simulations may lead to unsatisfactory results because of problems in conformational sampling [8], especially when the desired state is energetically unfavorable in the simulation force field [6,9]. Incorporating experimental information, such as NMR short-range atom pair restraints [10,11], Cyro-EM globular contour constraints [12,13] and small angle x-ray scattering (SAXS) shape and size restraints [14,15] into molecular simulations may help overcome this obstacle. In particular, SAXS has been increasingly integrated into molecular simulations owe to its intrinsic merits [16]. SAXS is a robust and manageable technique, which can be applied under near-physiological conditions with no strict limitations on temperature, buffer conditions and macromolecular mass or size [17]. The contemporary computational studies that utilize SAXS experimental data as pseudo-potential functions or scoring functions [18][19][20] mainly focus on (i) determining the structures of multi-domain proteins and multi-protein complexes on the basis of the atomic structures of their individual subunits [21][22][23], (ii) improving the accuracy in structure prediction and refinement [14,24,25] and (iii) predicting the conformational ensembles of multi-domain proteins and multi-protein complexes with flexible linkers and loops [15,[26][27][28][29]. Although these studies greatly facilitate the application of SAXS combined with molecular simulations, it is still not clear how much the SAXS delivered structural information can improve the sampling efficiency in molecular simulations.
In the context of structure-based function annotation, it seems more practical to consider conformational ensembles rather than a static structure determined through crystallography or other techniques. Proteins generally undergo conformational fluctuations on various timescales and amplitudes to perform their biological functions, such as signal transduction, transport and catalysis [30]. It is thus often desired to know a set of representative conformations rather than a single static structure in order to understand how proteins carry out their functions. Furthermore, it has been validated that the success of structure refinement is proportional to the population of native-like decoys [31,32]. Therefore, generating a set of structures with smaller root-mean-square deviation (RMSD) from the target than their initial conformation is of great practical significance. It raises the demand for an universal energy function or a general simulation approach, besides a system-dependent scoring function, to guide simulations toward native structures.
In this work, we develop a hybrid MD-MC simulation approach that incorporates SAXS information to enrich sampling toward a target state. The structural information contained in SAXS-derived data is transformed into a Monte Carlo (MC) pseudo-energy function that acts as a soft restraint in the conformational space search. The use of MD simulations guarantees that the conformational search is limited to energetically achievable states and the simulation structures are physically meaningful. The MC judgment integrated with the MD simulation is used to bias the sampling closer to the target state. We tested our method on 60 structures with 20 identical residues generated by residue substitution to three high-resolution fragments with typical secondary structures (e.g. α-helix, β-sheet and random coil). Under this case, the influence on sampling enrichment aroused from the relationship between target state and the preferable state of a sequence in a given force field that guides the MD simulations can be averaged out. It provides a way to evaluate the power for the integration of SAXS information in the enrichment of sampling toward a target state independent on the force field.

Methods
Our computational experiments include three parts: (i) Generation of native-like structures.
(ii) Forward MD simulations on these native-like structures performed to generate a pool of decoys. Based on these decoys, an optimal SAXS-derived pseudo-potential function for MC simulations is constructed and representative models are selected through clustering. These models are used as the initial structures in the backward simulations. (iii) Backward MD and hybrid MD-MC simulations are performed on the initial structures obtained in (ii). Using the native-like structures generated in (i) as target structures, sampling enrichment of these simulations are evaluated. Here, the terms "forward" and "backward" are used to represent the simulation directions initializing and refining towards the native-like structures, respectively. A flowchart of the experiment is presented in Fig 1, and each of the parts is described below.

Generation of native-like structures
Native-like structures are selected solely by their secondary structures. We select three structural fragments with typical secondary structures from PDB library: helix (PDB code 1VCS, residues 37-56), sheet (PDB code 1PIN, residues 9-28) and coil (PDB code 1GWP, residues 80-99). The secondary structures are determined using STRIDE [33] [sheet (E, B), helix (H, I, G), coil (C, T)]. The sequences then are substituted by one of the 20 natural amino acids to create mono-residue peptides using Mutator plugin in VMD [34]. These mono-residue peptides have different secondary structural preferences [35], and they may evolve into energetically favorable secondary structure through the MD simulations in a given force field. Therefore, they are good and rational choice to evaluate the overall performance of the hybrid MD-MC method, regardless of whether the target structure is energetically favorable and achievable in MD simulations. Although these mono-residue peptides are not representative for the large diversity of real proteins, the types of residues and secondary structures are comprehensive which make it provide a theoretically sound evaluation. These 60 (3 Ã 20) structures are relaxed to get native-like structures through three steps: (i) relocation of atoms and surrounding water molecules with 2000 iterations of a conjugate gradient energy minimization; (ii) equilibration at T = 310K through a 1 ns NVT-ensemble MD simulation; and (iii) equilibration at T = 310K with a 1 ns NPT-ensemble MD simulation. A harmonic potential with the force constant of 1.0 kcal/mol/Å 2 is applied to all non-hydrogen atoms to minimize the structure change in all the three steps. The final structures are selected as the native-like structures, which are next used as the initial structures in the forward simulations and as the target structures in the backward simulations.

Forward molecular dynamics simulations
Wide distribution of decoys from the forward simulations is important to select diverse and representative structures for the sampling enrichment study, and to construct a pseudo-energy function that can reliably guide simulations towards the target structure. The forward NVT MD simulations are carried out for 5 ns at 310K, 340K and 370K. In each simulation trajectory, structures are saved every 4 ps, which produce 1250 decoys in each trajectory. Thus, for each mono-residue peptide, a pool with 11250 (1250 Ã 3 Ã 3) decoys from 3 native-like structures at 3 simulation temperatures is collected.
The initial simulation configuration are prepared using VMD [34] through merging monoresidue peptides into a TIP3P water box [36] with the edge size of 13 Å. Additional sodium or chloride ions are added to neutralize the system. The MD simulations are carried out with the periodic boundary condition using NAMD v2.9 [37]. The multiple time stepping integration scheme [38] is used to accelerate electrostatic potential computation, and short-range nonbonded interactions are computed every step using a cutoff of 10 Å with a switch distance of 8 Å. Long-range electrostatic interactions are calculated using the particle-mash Ewald method with a grid spacing of 1 Å -1 by every 2 steps. The integration time step is of 2 fs with hydrogen atoms optimization using SHAKE [39,40]. Langevin dynamics for all non-hydrogen atoms is used to keep constant temperature, and the damping coefficient is 1 ps -1 . The Nose 0 -Hoover Langevin piston [41] with an interval of 200 fs and a damping timescale of 100 fs are used to maintain a constant pressure at 1 atm.

Clustering
The clustering of decoys from the forward MD simulations is performed using SPICKER [32] with the initial cut-off RMSD of 4 Å. The cut-off RMSD can be self-adjusted to satisfy the condition that the first and largest cluster (Top 1) cover 15%-70% of all input structures. The final cut-off RMSD is 4 Å for all mono-peptides except for the poly-Gly, which is 4.6 Å. For each mono-residue sequence, the center structures in the three most populated clusters (Top 3 models) are selected as the initial structures for the backward simulations.  The consecutive MC judgments are made every 4 ps in the MD simulation trajectory, and the decoys are saved after each MC judgment for results analysis. To make a MC judgment, the SAXS intensity profile for a given structure is computed using Fast-SAXS-pro [42]. The acceptance probability in the MC judgment is given by the Metropolis criterion [43], i.e., min{exp [-(E n -E n-1 ]/k B T), 1}. Here, T is the simulation temperature, k B is the Boltzmann constant, and E n is the pseudo-energy function for the structure at the n th MC iteration. The pseudo-energy function is taken to be proportional to the discrepancy in the scattering intensity profiles between the target structure and the n th structure in the hybrid simulation.

Backward simulations
We perform 20 ns MD and MD-MC simulation at 310 K and 370 K, starting from 60 (20 sequences and Top3) models obtained from the clustering procedure. Since the MD simulations are not biased towards any target structure, we only perform one MD simulation for each of the models (i.e. 60 simulation trajectories). In the case of the MD-MC simulation, where the SAXS-derived information about the target structure is incorporated into the MC pseudoenergy function, we run 60 MD-MC simulations toward three native-like target structures, which resulted in 180 simulation trajectories in total.
Additionally, to provide a solid statistical view on sampling enrichment and minimize the bias from over-sampling at the valley in the energy landscape, we randomly select 600 target structures and 600 initial structures for the MD-MC simulations. These supernumerary simulations are carried out at 370K for 1ns with 0.5ps time interval for the MC judgments in each trajectory.
It is set to 0.2 and 0.4, respectively, for 4ps and 0.5ps intervals in the MD-MC simulations, to keep the MC acceptance ratio of most simulation trajectories above 40% to ensure acceptable simulation efficiency. The discrepancy function χ is a measure of the discrepancy between the SAXS intensity profiles for a given decoy structure and the target structure. It may have alternative forms Here, q denotes the scattering vector, q min and q max are the boundaries of q which take the values of 0.005 Å -1 and 0.6 Å -1 respectively, in this work. N is the number of the data points in the scattering intensity profiles (N = 120), and c is an amplification factor for χ (c = 1000). I(q min ) target is an approximation of I(q = 0) target , which is used to normalize the scattering intensity profiles. I(q) decoy and I(q) target denote the scattering intensity profile for a decoy and the target structure, respectively, and n is an integer number among 0, 1, 2, 3 and 4. The increase of n can gradually accentuate the match of the intensity profiles corresponding to structures at smaller scales, and emphasize the structure information in three classic regions including Guinier [46] (shape and size), Debye [47] (correlation in scattering units) and Porod [48] (interface and surface). Then the pseudo-energy function to guide the MC simulation is selected from one out of the ten formulas as given by Eqs 1 and 2. The choice is made on the basis of the ranking correlation (Spearman coefficient) between χ and RMSD. Here it is worthy to note that the dataset in this work is selected to present general trend though the integration of SAXS intensity profiles into simulation and the energy function to guide MC simulation is expected to have overall correlations and sequence independent.

Sampling efficiency
Sampling efficiency (SE) for a given simulation trajectory is defined as the probability of finding a simulation decoy with RMSD from the target (R) smaller than RMSD of the initial structure from the target (R 1 ). It can be computed as follows Any simulation trajectory with the known target structure can thus be assigned a specific value of the actual SE. In addition to the actual SE, we also introduce the hypothetical SE, which is based on the assumption of even sampling of objects with spherical symmetry. In a simple geometrical model, the sampling range R 2 can be represented by the length of a line segment (1D), the radius of a circle (2D) or the radius of a sphere (3D) that encloses 95% of all simulation decoys. R 1 , on the other hand, is given in this simple picture by the distance between the points (in 1D, 2D or 3D) representing the initial and target structures. The hypothetical SE is then given by the ratio of the overlapped length (1D), area (2D) or volume (3D), as illustrated in Fig  3, to the overall sampling length (1D), area (2D) or volume (3D). The hypothetical SE is thus given by the following formulas The hypothetical SE is a continuous function of the ratio R 2 /R 1 . It takes the values between 0 and 0.5, which reflects the assumption about even sampling. The actual SE, on the other hand, takes the values between 0 and 1. Thus, if the actual SE is significantly higher or lower than the hypothetical SE, it indicates uneven or biased sampling.

Forward simulations
First of all, we attempt to verify the hypothesis of even sampling. Decoys from the forward MD simulations are superposed to their initial structures. The rotational vectors generated in this process are represented by discrete dots in Cartesian coordinate as shown in S1 Fig. The overall sphere-like distribution of these dots supports the assumption that the 180 simulation trajectories in the forward MD simulations of 20 mono-residue sequences with 3 types of secondary structure at 3 temperatures follow even sampling. We further cluster these decoys according to their sequences and analyze the three most populated clusters (Top3 clusters). The center structures in the Top3 clusters are selected as representative models and termed as Top3 models. The coverage of Top3 clusters (i.e. the fraction of all decoys in the Top3 clusters) and the average of RMSD of each two among the Top3 models are summarized in S1 Table. The representative models cover a broad region in the conformational space with the average of RMSD of any two structures among the Top3 models between 5.3 Å and 11.5 Å. The coverage of the Top3 clusters ranges from 36.24% to 77.79% for all 11,250 decoys generated from 9 simulation trajectories with identical sequence. These results indicate that the decoys generated in the forward MD simulations are broadly distributed. Therefore, those decoys could be used to construct the pseudo-energy function through SAXS intensity profiles, and the Top3 models are suitable to be the initial structures for the backward simulations.
To construct the energy function for the hybrid MD-MC simulations, the correlation of χ and RMSD is tested. The RMSD of decoys from their initial structures (RMSD I ) is calculated and then merged into 1.0 Å bins. The histogram of RMSD I is presented in Fig 4a. It indicates that the majority of decoys departs from their initial structures from 2 to 9 Å in RMSD and significantly enriches in space from 3 to 4 Å. We also observe that the distribution of decoys shift to broader conformation space with elevating temperatures, which is a reasonable deduction from that higher temperature make the energy barrier easier be crossed and so to allow broader sampling space [49].
For each decoy, the discrepancy in scattering intensity profiles (χ I ) in reference to its initial structure is calculated. The Spearman ranking correlation coefficients between χ I and RMSD I with different definitions of χ I are summarized in Table 1. The results indicate that the correlation decreases with increasing exponent n for both formulas given in Eqs 1 and 2. Eq 1 with n = 0 provides the highest correlation coefficient of 0.819, and so this definition is used later in this study. The distribution of χ I against RMSD I shown in boxplot is presented in Fig 4b and the label of boxplot is depicted in Fig 4d. It can be seen that the correlation between χ I and RMSD I is not monotonous, though they have overall positive correlation. There is a region from 8 to 11 Å, in which they are negatively correlated. This result clearly shows that the implement of SAXS information in simulations will not always improve structure prediction and refinement, which was also observed by other researchers [23]. Additionally, the contribution from large q values is also tested for q max = 0.3 Å -1 , and the results are summarized in S2 Table. The correlation coefficients for q max = 0.3 Å -1 showed the same dependence on the exponent n as that using q max = 0.6 Å -1 , and the highest correlation coefficient is 0.802. These results indicate the cutoff of large q range has minor impact for the match in structures and scattering intensity profiles.
Additionally, since the protein size quantified by the radius of gyration (R g ) is a central parameter that can be steadily obtained from SAXS data, we calculate R g of the decoys from their SAXS intensity profiles. The ratio R g /R gI which denotes the relative size change of decoys compared to their initial structures is presented in Fig 4c. The mean of R g /R gI has no obvious correlation with RMSD I when it is smaller than 7 Å, and shows a drastic increase when RMSD I > 8 Å. Visual inspection of the simulation structures indicates that this increase normally originates from the peptide unfolding and peptides behaving like random coils. Overall, the radius of gyration of the peptide is less sensitive to changes in RMSD than that of χ, and so we decide to keep χ as the only contributor to the pseudo-energy function. Here, it is worthy to note we found the Pearson correlation coefficient for the R g directly calculated from structures [50] and that using Guinier fitting from SAXS intensity profiles [51] is 0.922. It indicates SAXS profiles can steadily deliver the size information of structures within our dataset.

Backward simulations
The performance of the hybrid MD-MC simulations in sampling enrichment is firstly evaluated. The mean RMSD T and the fractions residues in the secondary structures which are in accord with the target structure, and the mean SE of trajectories in the MD-MC simulations are calculated and compared with those from the parallel MD simulations. These quantities as  Table 1. Spearman correlation coefficients between the discrepancy functions (χ) and RMSD. The correlations are estimated based on 22,5000 decoys generated in the forward MD simulations. The discrepancy function with ten different forms are for q max = 0.6 Å -1 .  Overall, these three parameters prove that incorporation of SAXS profile into the MD-MC hybrid simulations provide superior performance than the parallel MD simulations to enrich simulation decoys towards target structures. Fig 4b shows that the relationship between χ and RMSD is not straightforward. For this reason, we group the backward simulation trajectories according to the difference in RMSD between the initial and the target structures (R 1 ), and the ratio of sampling range (R 2 ) rescaled by R 1 (R 2 /R 1 ). The histogram of R 2 /R 1 and the actual SE in 180 trajectories of the backward simulations are presented in Fig 6. The corresponding mean values are summarized in Table 2.
Since the majority of trajectories are distributed in 0.5 < R 2 / R 1 < 2, and the number of trajectories in the group of R 2 /R 1 > 2 is too small to afford more reliable statistical averages and distributions, the corresponding statistics results of 600 trajectories are also given in S3 Fig and S3 Table. The logarithmic scale is used to highlight the steep decrease in sampling efficiency in the region of R 2 /R 1 >2. The actual SE for both MD-MC and MD simulations decreases with increasing R 2 /R 1 . Most trajectories exhibit the actual SE that has a consistent tendency to the hypothetical SE curves in 3D. The hypothetical SE curve can be regarded as the upper limit for the statistical mean of the actual SE when R 2 /R 1 >2, because the means of the actual SE are always lower than the hypothetical SE curves. While in the region of R 2 /R 1 < 2, the actual SE can fluctuate between 0 and 1, which indicate that this region is much affected by the bias of the classic force field. This is also reflected by the value of mean R 2 which is almost a constant regardless of whether the target can be detected in both MD-MC and MD backward simulations (S3 Table) Overall, with the increase of R 2 /R 1 , where target structures become gradually detectable in simulations, the improvement in SE becomes more prominent by hybrid simulations. When target structures can be entirely sampled in simulation trajectories, i.e., the region of R 2 /R 1 > 2, the improvement of SE can reach up to 200.0%.
In analogy to the classification of easy, medium and hard targets in protein structure prediction and refinement [52], according to the similarity between the initial and target structures (R 1 ), the structure similarity of simulation decoys in referent to their targets (RMSD T ) is computed and analyzed. Two groups are distinguished as easy targets for R 1 < 7 Å and hard targets for R 1 > 7 Å, to match the yield point in the χ vs. RMSD curve shown in Fig 4b. The comparison of SE is also listed in Table 2. The improvements due to the hybrid simulations are 118.9% and 4.5% for easy and hard targets, respectively. Besides the overall probability of sampling enrichment, also the reduction of RMSD T for the MD-MC hybrid simulations as compared to the parallel MD simulations, denoted here as dRMSD T , as a function of R 1 is shown in Fig 7. When R 1 < 7 Å, i.e. for easy targets, 49 and 45 out of 57 cases at 370K and 310K, respectively, have negative dRMSD T values. For hard targets, there are 71 and 55 out of 123 cases at 370K and 310K have negative dRMSD T values. These results indicate higher simulation temperature can help sample in a broad conformational space, and then can improve the sampling enrichment of the hybrid simulations toward target structures. The probability to bring the initial structures to the target structures is higher for easy targets (79%-86%) and relatively lower for hard targets (45%-58%), which agrees with the improvement in SE and the consensus in protein structure prediction and refinement.
Besides, we also compared our actual SE with the calculated values via the theoretical model proposed by Reva et al. [53]. They found that for globular proteins, the probability to generate a random conformation with matched compactness to a target within a given RMSD (R) follows a normal distribution Here, <R> and σ, dependent on the size of proteins, are the mean and standard deviation of the distribution of possible RMSD values, respectively. But, as a rational approximation, <R> and σ may be set to 3.333N 1/3 (N is the number of residues) and 2.0 Å [53,54]. Results in  To demonstrate how did the MD-MC hybrid simulation enrich sampling through the integration of SAXS profiles, we selected five typical simulation trajectories as shown in Fig 8. The associated parameters are listed in Table 3. There are two easy targets (poly-Asn and poly-Phe) and three hard targets (poly-Pro, poly-Ala and poly-Ser), and their locations are marked in Fig  7. For the energy function coupled with the MC simulation, χ T is efficiently minimized and equilibrated within in a relatively short simulation time. The fluctuations of RMSD T are smaller in the MD-MC simulations than that in the parallel MD simulations, which is consistent with the smaller R 2 in the former simulation reported before. Additionally, the closest simulation decoys to their target structures, the superposition of the two structures and the match in SAXS profiles are also presented. Generally, whether the incorporation of SAXS profiles into simulations can enrich sampling is majorly challenged by the energy landscape given by the MD force field for a particular sequence and the positions of the initial and target structures located in, by the temperature which affects the chances of trapping the simulation trajectories in local energy minima, and by the ratio of R 2 /R 1 reflecting the probability of the target structures to be sampled.
For poly-Asn, since the preferred structure in the MD force field is far from the target structures, the MC perturbations can only suppress the simulations deviate from the target, so the improvement of the hybrid simulation is limited with some extent in the refinement of the sheet secondary structure. While for poly-Phe, both the MD force field and the MC energy function have consistent preferences to the target structure, so the hybrid simulation affords remarkable improvement, especially in the straightening of helix. In both cases, the target structures are partially detectable with R 2 /R 1 > 1 in both backward simulations. Poly-Pro exhibits large conformational fluctuations in the MD simulations. The hybrid MD-MC simulation provides a globule constraint to stabilize it in a conformation close to the target structure. While poly-Ala has a strong preference to form helix guided by CHARMM force field, both the hybrid MD-MC and the parallel MD simulations reach consistent conformation. Because the   target structure is only marginally detectable with R 2 /R 1 close to 1 in both simulations, the input of SAXS information may disturb the dip of most favorable conformation. The last case is poly-Ser, unlike the above four cases where the MC energy function is dominant or competitive to the MD force field in the hybrid simulation, the MD force field overwhelm the MC judgment perturbation. The results show that the structure of poly-Ser keeps on significant fluctuations during the whole simulations and fails to converge close to the target structure. It raises the awareness of that χ and RMSD are not linearly correlated. Since the spherical average eliminates the one-to-one correspondence between three-dimensional structures and their one-dimensional scattering intensity profiles, degenerate structures and energy states originated from the complex energy landscape for protein folding are still the obstacles for the MD-MC hybrid simulation in providing ensured guidance for variant systems.

The hybrid MD-MC simulations on real proteins
In order to explore the applicability of our method for real protein system, we also carried out the hybrid MD-MC simulations at 370 K for 5 real proteins using calculated SAXS profiles, including bovine antimicrobial peptide (random coil), arenicin-2 (β-sheet), magainin 2 (αhelix), ubiquitin and cytochrome C with multiple secondary structures, whose structures were taken from PDB library as the target structures. The initial structures for three peptides are generated as before. For ubiquitin and cytochrome C, the first model from NMR structures (PDB code: 2LD9 and 1OCD) are regarded as their respective initial structures. Meanwhile, hybrid MD-MC simulations for ubiquitin and cytochrome C also were performed using the experimental SAXS profiles to evaluate the effects of hydration layer and experimental errors, which were download from Small Angle Scattering Biological Data Bank (SASBDB) [55] with the code of SASDAQ2 and SASDAB2, respectively. The simulation results for 5 proteins are listed in Table 4. Comparing to the parallel MD simulations, hybrid MD-MC simulations achieved higher SE and smaller mean RMSD T (denoted by the negative dRMSD T ) for bovine antimicrobial peptide, magainin 2 and ubiquitin, while there is not obvious improvement for cytochrome C and a decrease for arenicin-2. These are almost consistent with the cases in the 60 mono-residue peptides that the targets of bovine antimicrobial peptide, magainin 2 and ubiquitin belong to easy targets and are detectable due to R 2 /R 1 > 1, while the target for arenicin-2 belongs to hard target where R 2 /R 1 < 1. The minimal RMSD T for cytochrome C with 2.98 Å is similar to the result with 3.2 Å achieved by Zheng et al. [14] from the initial structure with R 1 6 Å, in which they adopted a coarse-grained model and kept the secondary structure rigid.
We also compared the structures of initial, target and hybrid MD-MC simulations and their SAXS profiles for ubiquitin and cytochrome C, as illustrated in    To use experimental SAXS profiles, the discrepancy function χ is defined as log IðqÞ cal À log IðqÞ exp À D offset dI log ðqÞ ð6Þ The new terms δI log (q) and Δ offset are the experimental errors and the offset between logI cal (q) and logI exp (q) at q = 0, respectively. In calculated SAXS profiles using the Fast-SAXS-pro, the contributions from hydration layer are considered by water molecules within a shell along protein surface with a thickness of 6 Å (about the sum of 3 Å thick of first hydration layer and 2.8 Å diameter of water) from all non-hydrogen atoms in the protein. The weighting factor w of 4% is used to account the contribution of the hydration layer. The simulation results are shown in Table 4. For ubiquitin and cytochrome C, the SAXS intensity profiles either from experiments or from direct calculation without the consideration of experimental errors and hydration layer are not distinguishable. It suggests that the discrepancy in SAXS intensity profiles is majorly contributed from the change in protein structures. At last, it is necessary to note that the selection of dataset in this work is to ensure the comprehensive in the types of residues and secondary structures, rather than a set of representative sequences and structures for real proteins. Therefore, this work only provides a theoretically sound evaluation on the performance for the integration SAXS information through the hybrid MD-MC simulations. A stringent test based on carefully selected dataset using similar protocols requires much more efforts and is still undergoing. Further, the merits of solution SAXS technique to characterize the structure of proteins in simulated fluids make it continuously receive more and more attentions. Advancement in the protocol to implement SAXS information in simulations and the rigorous evaluation of performance is still in demand.

Conclusions
In this work, we developed a hybrid MD-MC method that utilizes the low-resolution structural information contained in SAXS data for sampling enrichment. The MD-MC simulations, on average, could bring the initial structure closer to the target state than the unbiased MD simulations. A hypothetical curve of sampling efficiency (SE) against sampling range (R 2 /R 1 ) is proposed. Simulations of 600 trajectories showed a qualitative agreement between the actual and hypothetical SE against R 2 /R 1 . These results indicated that the chances of peptide structure refinement are not just related to similarity between the initial and the target structures, but also dominated by the sampling range in simulations. We found that the MD-MC method is most effective for easy targets with R 1 < 7 Å and when the target could be detected in the simulation trajectories. The improvement can have over 79% probability to reduce the RMSD to target structures and reach more than 200% in the enrich of SE. Higher simulation temperature can strengthen the superior of the MD-MC hybrid simulation comparing to the parallel MD simulations. Overall, this work presents a way of utilizing experimentally accessible information on target structures to improve protein structure refinement and function annotation.  Table. Spearman correlation coefficients between the discrepancy functions (χ) and RMSD. The correlations are estimated based on 22,5000 decoys generated in the forward MD simulations. The discrepancy function with ten different forms are for q max = 0.3 Å -1 . (DOC) S3 Table. Sampling efficiency as a function of R 2 /R 1 . The number of trajectories, the mean of R 2 , R 1 and SE are calculated based on 600 trajectories at 370K. Here, R 2 is the sampling range in simulations, R 1 is RMSD between the initial structure and the target structure, SE is the sampling efficiency of a simulation trajectory. (DOC)