Fig 1.
A typical control flow of the 3D EM map fitting algorithm developed in this work. The first step of a fitting procedure is the initial exhaustive search. Here one needs to define suitable scoring functions that are amenable to fast correlation computation via the chosen search scheme. We use FFT-based algorithms for the fast computation of non-uniform rigid-body correlations. The scoring functions may account for various structural aspects, such as the scattering potential or pockets in the molecular surface. The exhaustive search is followed by an information-driven reranking scheme, which may include, among others, the mutual information score or the skeleton-secondary structure score. The final output of the procedure is the best fit between the atomic structure and the 3D EM map.
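To illustrate why a correlation-type score makes the exhaustive search fast, the following is a minimal sketch of an FFT-based translational search on a uniform Cartesian grid. It is a simplification under stated assumptions and not the non-uniform rigid-body scheme used by PF2 fit: rotations are ignored, shifts are cyclic, and the function name `fft_translation_search` and the toy data are hypothetical.

```python
import numpy as np

def fft_translation_search(em_map, probe):
    """Return the cyclic shift of `probe` that maximizes its correlation with `em_map`.

    Illustrative only: uniform grid, translations only, periodic boundaries.
    """
    # Correlation theorem: corr[t] = sum_x em_map[x + t] * probe[x]
    corr = np.fft.ifftn(np.fft.fftn(em_map) * np.conj(np.fft.fftn(probe))).real
    best = np.unravel_index(np.argmax(corr), corr.shape)
    return tuple(int(i) for i in best), corr

# Toy data (hypothetical): a small random blob and a shifted copy acting as the "map".
rng = np.random.default_rng(0)
probe = np.zeros((16, 16, 16))
probe[2:5, 3:6, 1:4] = rng.random((3, 3, 3)) + 0.5
true_shift = (4, 7, 2)
em_map = np.roll(probe, true_shift, axis=(0, 1, 2))

shift, corr = fft_translation_search(em_map, probe)
print(shift)
```

All translations are scored with three FFTs instead of one explicit sum per shift, which is the property a scoring function must have to be usable in this kind of search.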
Fig 2.
Schematic of representations used in our algorithms.
(A) PDB schematic, showing the target volume VA and the complementary volume. (B) 3D EM map schematic, showing the target volume VB and the complementary volume. Detailed definitions can be found in the Materials and Methods section.
Fig 3.
Comparison of PF2 fit with other software in synthesized EM fitting at 3 Å.
A molecule is fitted into the synthetically generated EM map B with resolution 3 Å (transparent green). The top-ranked result A1 (red/yellow) is compared to the original PDB molecule A (blue). (A) Top-ranked result using PF2 fit–SE(3) with 8° uniform rotational sampling and 0.5 Å translational step size. RMSD ≈ 0.88 Å. (B) Top-ranked result using the Colores package; the "nopowell" option is turned on. RMSD ≈ 3.2 Å. (C) Top-ranked result using Colores with default options. RMSD ≈ 2.3 Å. The fitted PDB A1 is in yellow. (D) Top-ranked result using the ADP_EM package, with bandwidth L = 25. RMSD ≈ 0.94 Å.
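The RMSD values quoted above compare a fitted pose with the original pose of the same molecule. A minimal sketch of such a measure is given below, assuming a one-to-one atom correspondence and no additional superposition (the fitted pose is compared in place). The function name `rmsd` and the coordinates are hypothetical, and the exact atom selection used in the experiments is not specified here.

```python
import numpy as np

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two (N, 3) coordinate arrays, in input units."""
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("coordinate arrays must have the same shape")
    # Mean of squared per-atom distances, then square root.
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))

# Hypothetical example: every atom displaced by (0.3, 0.4, 0.0) Å, i.e. by 0.5 Å.
reference = np.array([[0.0, 0.0, 0.0],
                      [1.5, 0.0, 0.0],
                      [1.5, 1.5, 0.0],
                      [0.0, 1.5, 1.5]])
fitted = reference + np.array([0.3, 0.4, 0.0])
value = rmsd(reference, fitted)
print(round(value, 6))  # 0.5
```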
Fig 4.
Comparison of PF2 fit with other software in synthesized EM fitting at 10 Å.
(A) The synthetically generated 3D EM map is a Gaussian-blurred version of the PDB 7CAT (chains A and B), with resolution R = 10 Å, and random noise added to obtain a signal-to-noise ratio of unity. The PDB A (inset) is chain B of the same protein. The top-ranked result A1 (red/yellow) is compared to the original PDB molecule A (blue). (B) Top-ranked result using PF2 fit–SE(3) with 8° uniform rotational sampling and 0.5 Å translational step size has RMSD = 0.73 Å. (C) Top-ranked result using Colores with default options has RMSD = 1.096 Å. (D) Top-ranked result using the ADP_EM package, with bandwidth L = 25, has RMSD = 0.814 Å.
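The construction of a synthetic map (Gaussian blurring followed by additive noise at a prescribed signal-to-noise ratio) can be sketched as follows. This is an illustration under assumptions, not the exact generation procedure used here: the blur width is given directly in voxels because the mapping from resolution R to the Gaussian width depends on convention, SNR is taken as the ratio of signal variance to noise variance, and the names `gaussian_blur_fft` and `add_noise` are hypothetical.

```python
import numpy as np

def gaussian_blur_fft(density, sigma_vox):
    """Blur a 3D grid with a periodic Gaussian of standard deviation `sigma_vox` voxels."""
    freqs = np.meshgrid(*[np.fft.fftfreq(n) for n in density.shape], indexing="ij")
    k2 = sum(f ** 2 for f in freqs)
    # Fourier transform of a unit-area Gaussian: exp(-2 * pi^2 * sigma^2 * |k|^2).
    kernel_ft = np.exp(-2.0 * np.pi ** 2 * sigma_vox ** 2 * k2)
    return np.fft.ifftn(np.fft.fftn(density) * kernel_ft).real

def add_noise(density, snr, rng):
    """Add white Gaussian noise so that var(signal) / var(noise) equals `snr` (assumed definition)."""
    noise_std = np.sqrt(density.var() / snr)
    return density + rng.normal(0.0, noise_std, size=density.shape)

# Toy "molecule" (hypothetical): a few point atoms on a 32^3 grid.
rng = np.random.default_rng(1)
density = np.zeros((32, 32, 32))
for x, y, z in [(10, 12, 14), (16, 16, 16), (20, 15, 11), (13, 19, 18)]:
    density[x, y, z] = 1.0

blurred = gaussian_blur_fft(density, sigma_vox=2.0)
noisy = add_noise(blurred, snr=1.0, rng=rng)
measured_snr = blurred.var() / (noisy - blurred).var()
print(measured_snr)
```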
Table 1.
Average rank, rounded to the nearest integer, of best RMSD result returned by PF2 fit–SE(3) in the initial search stage for synthetic maps at different resolutions.
The figure in brackets in the second and third columns denotes the rank in the presence of noise at SNR = 1. See the section on "Datasets" for a list of PDBs used in this experiment. Note that even if the rank of the best RMSD is lower for SCCS in some cases, the actual RMSDs are generally lower, cf. Figs 5 and 6.
Fig 5.
Resolution robustness and comparison of scattering potential (SCCS) and Gaussian (GCCS) scores for synthesized data.
We plot the RMSD of the top-ranked result as a function of the resolution of the EM map used for the fit. See the section titled "Dataset" for a list of PDBs used in this experiment. (A) Average resolution-dependent RMSD of the top-ranked result returned by PF2 fit–SE(3) in the absence and presence of noise for the GCCS and the SCCS. (B) Average Z-Score for the ten top results in the absence of noise. Z-Scores in the presence of noise follow the same trend.
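For readers unfamiliar with the Z-Score used in panel (B), a minimal sketch follows. It assumes the common definition in which each score is standardized against the mean and standard deviation of all scores from the search; the precise definition used in this work is given in the main text and may differ. The function name `top_z_scores` and the sample scores are hypothetical.

```python
import numpy as np

def top_z_scores(scores, n_top=10):
    """Standardize all scores and return the `n_top` largest Z-Scores, highest first."""
    scores = np.asarray(scores, dtype=float)
    std = scores.std()
    if std == 0.0:
        raise ValueError("all scores are identical; Z-Scores are undefined")
    z = (scores - scores.mean()) / std
    return np.sort(z)[::-1][:n_top]

# Hypothetical scores: one pose stands out from four identical ones.
example = top_z_scores([0.0, 0.0, 0.0, 0.0, 10.0], n_top=2)
print(example)  # [ 2.  -0.5]
```

A large Z-Score for the top results indicates that they are well separated from the bulk of sampled poses.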
Fig 6.
Effect of complementary space scoring for synthesized data.
Using the complementary space scores from (A) Eq (8) and (B) Eq (9), with wcomp = 1 and wtarget = 1, we plot the RMSD as a function of the resolution of the EM map. See the section titled "Dataset" for a list of PDBs used in this experiment.
Table 2.
Results of applying PF2 fit–SE(3) on a selection of datasets from the cryoEM modeling challenge (Experiment 3) using both the GCCS and SCCS.
An error measure similar to ETR is provided as the number of residues excluded outside a given iso-contoured molecular surface. The SCCS yielded on average results that exclude 2–4 fewer residues than the GCCS.
Fig 7.
Speed-accuracy trade-offs in PF2 fit.
(A) The plot displays the average runtime (divided by 2000) using the GCCS scoring term, and the corresponding error (in RMSD) when PF2 fit is applied on the synthesized EM dataset. Notice that the runtime increases linearly with the number of samples in SO(3), but the average error is quite steady between 0.4 and 0.5 Å, except for the case when only 2000 samples were used. We believe that such robustness stems from the low discrepancy of the sampling. (B) We compared the average speeds of PF2 fit on the synthesized EM dataset with GCCS, SCCS and NCCS using the same expansion degree (L = 20). The plot shows that NCCS is faster than GCCS, especially when fewer samples are used. On the other hand, SCCS is marginally slower (around 0.1%) than GCCS.
Table 3.
Average rank of best RMSD result returned by PF2 fit–SE(3) after reranking.
In the initial stage GCCS was used. The figures in brackets denote the rank in the presence of noise at SNR = 1. We see a strong decrease in rank for the skeleton-secondary structure score with and without noise, while the mutual information score remains predictable across the range of resolutions. See the section on "Datasets" for a list of PDBs used to generate the synthetic maps used in this experiment. Note that even if the ranks of the best RMSD solution, on average across all experiments, show no improvement over Table 1 (mostly because GCCS already does an excellent job of ranking them), the ranks actually improved for several of the experiments (73/318 for MIS, and 5/318 for MIS). Please see Section "The performance of reranking increases with resolution" for details.
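As an illustration of the kind of quantity a mutual information score measures, the sketch below estimates the mutual information between two density maps from a joint histogram of their voxel values. This is a generic estimator offered under assumptions and not the MIS defined in this work: the binning, the function name `mutual_information`, and the test data are all hypothetical.

```python
import numpy as np

def mutual_information(map_a, map_b, bins=16):
    """Histogram estimate of the mutual information (in nats) between two equally sized maps."""
    joint, _, _ = np.histogram2d(np.ravel(map_a), np.ravel(map_b), bins=bins)
    p_ab = joint / joint.sum()
    p_a = p_ab.sum(axis=1, keepdims=True)   # marginal of map_a, shape (bins, 1)
    p_b = p_ab.sum(axis=0, keepdims=True)   # marginal of map_b, shape (1, bins)
    nz = p_ab > 0                           # skip empty cells to avoid log(0)
    return float(np.sum(p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz])))

# Hypothetical data: a map compared with itself and with an unrelated map.
rng = np.random.default_rng(2)
x = rng.random(4096)
y = rng.random(4096)
mi_self = mutual_information(x, x)
mi_indep = mutual_information(x, y)
print(mi_self, mi_indep)
```

A candidate pose whose simulated density shares more information with the experimental map receives a higher value, which is what makes such a score usable for reranking.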
Fig 8.
Comparison of PF2 fit with other software in subunit-assembly fitting.
Fitting the PDB molecule A (1GC1) to the EM map B of SIV at 20 Å (EMD5020), using the GCCS. Two different views of the molecules are given: (A) Results from PF2 fit. The ETR is 0.03. (B) Results from Colores with default options. The ETR is 0.1. (C) Results from ADP_EM with L = 25. The ETR is 0.08.
Fig 9.
Fitting the PDB molecule A (1AONb) to the GroEL 3D EM map B (EMD 1461) at 7.7 Å.
(A) Full 3D EM map B with segmented subunit Bs (inset, top). The molecule A is fitted into Bs using PF2 fit (inset, bottom). (B) Initial guess for rigid-body fit into B. (C) PF2 fit generates translational samples local to the initial guess to find the depicted correct result. Correctness is measured by deviation from the rigid-body fit in (A). The result has an RMSD of 0.3 Å from the fitting result in (A) and is ranked fourth in a run of PF2 fit with an angular resolution of 10°.
Fig 10.
Speed-accuracy trade-offs for NCCS.
NCCS is computed on a non-uniform grid based on the atom positions. If the grid is sparse, a lower-degree expansion of the spherical basis functions is expected to represent it sufficiently well. We applied NCCS with the expansion degree (L) varied between 5 and 20, on the synthesized EM dataset (blurred to 12 Å resolution) and using 30k samples in SO(3) space. The plot shows that the error decreases and the runtime increases with L. However, the change in runtime is more pronounced than the change in error; for example, the run is 35% faster for L = 5 while the error is only 5% higher than that of L = 20.