Preparing for the 3D data deluge: real time structural search at the scale of the PDB and beyond

Detection of protein structure similarity is a central challenge in structural bioinformatics. Comparisons are usually performed at the polypeptide chain level, however the functional form of a protein within the cell is often an oligomer. This fact, together with recent growth of oligomeric structures in the Protein Data Bank (PDB), demands more efficient approaches to oligomeric assembly alignment/retrieval. Traditional methods use atom level information, which can be complicated by the presence of topological permutations within a polypeptide chain and/or subunit rearrangements. These challenges can be overcome by comparing electron density maps directly. But, brute force alignment of electron density distributions is a compute intensive search problem. We developed a 3D Zernike moment normalization procedure to orient electron density maps and assess similarity with unprecedented speed. Similarity searching with this approach enables real-time retrieval of proteins/protein assemblies resembling a target, from PDB or user input, together with resulting alignments (http://shape.rcsb.org).


Introduction
Structure similarity searching within the growing PDB archive [1,2] revolutionized our understanding of protein evolution [3,4,5]. Over billions of years organisms in the natural world have generated stable, functionally useful three-dimensional protein shapes, which have been repeatedly reused on scales ranging from short structural motifs to oligomeric complexes. With more than 150,000 publicly-available PDB structures, efficient methods for detecting and quantifying protein structure similarity are essential.
Structure superposition tools were initially developed in the 1970s [6,7] and the first algorithms for general structural alignment came in the 1990s [8,9,10] , with more advanced methods appearing over the following decade [11,12,13,14]. As the PDB grew, efficient searching of the entire archive became both important and difficult. Archive wide retrieval was first addressed by the Dali server [15] and subsequently by PDBeFold [16] and TopSearch [17] (see Hasegawa and Holm [18] for a review of the field). With a few exceptions, these methods have focused on the task of aligning single polypeptide chains or parts thereof.
Protein functional units are, however, not necessarily confined to the boundaries of domains or individual chains. They are often oligomeric, sometimes with multiple distinct quaternary structures resulting in similar functional units. Today, approximately half of the structures in the PDB are oligomeric (data not shown). In the wake of the 3DEM "resolution revolution" the fraction of oligomeric structures represented in the archive is increasing year-on-year (data not shown).
The ever-increasing amount of structural data, combined with increasing complexity of the structures, requires development of faster, more accurate methods to process and classify structure similarity. Traditional comparison methods use atom level information, which can be complicated by the presence of topological permutations within a polypeptide chain and/or subunit rearrangement(s) within an oligomeric assembly. While solutions that address these problems exist [14,19,20,21], they are computationally expensive and will not necessarily scale with continued growth of the PDB.
Alternative approaches looking beyond purely atomic information have been explored. One utilizes geometric descriptors, e.g., interatomic distance distributions, yielding fast but less precise methods [22,23]. Other related methods compare surface shapes. This research community has coalesced around the SHREC 3D shape retrieval contest [24], which occasionally features a protein track (most recently in [25]). For surface descriptorbased protein structure analysis, 3D-Surfer [26] has implemented a fast shape comparison service with numerous applications [27]. Surface descriptions of proteins, however, completely disregard information contained in the density distribution beneath the surface. These challenges can be circumvented by aligning and comparing electron density maps. But, alignment of electron density distributions has proven to be a computationallyintensive search problem [28,29,30].
Herein, we exploit a 3D Zernike moment normalization procedure to implicitly orient electron density maps and assess similarity in moment space with unprecedented speed. The general approach was suggested in [31] but has not been applied to date. Our normalization procedure produces rotation-invariant features that retain information about the shape of the original object. Differences in features can be readily visualized. This approach to shape retrieval is highly performant, yielding structure alignments as byproducts of the normalization procedure. Since the method uses electron density volumes, it is agnostic with respect to topological differences in either the tertiary or quaternary structures.
Based on these principles, we have developed a search system that enables real-time retrieval of similar protein assemblies to a target assembly, obtained from the PDB or uploaded by a user, together with their alignment (http://shape.rcsb.org). An exhaustive search of 600,000+ bioassemblies and chains in the PDB requires less than one second on a single core of a typical CPU (e.g., Intel Core i7-7567U), without precomputed clustering or results caching. The BioZernike software library used for normalizations/alignments is open source and freely available (https://github.com/biocryst/ biozernike)

Canterakis Norms for Complete 3D Zernike Moment Invariants
As in [32], we use ζ-coding for 3D Zernike moment rotation. Briefly, given Cayley-Klein parameters a and b which define a rotation R(a, b): where a, b ∈ C and aa * + bb * = 1, and the modified 3D Zernike momentsΩ where the rotation can be expressed as: To obtain the rotated value for a particular Ω R m nl , we expand expression 5 and collect coefficients for ζ m . For example, Now we illustrate how to fix two degrees of rotational freedom by setting Ω R 2 22 = 0.
After substitution a b * → t we obtain a 4 th degree polynomial equation in t: that can be solved trivially using libraries of mathematical routines. (N.B. Both the coefficients of the polynomial and its roots are complex numbers.) Next, we fix one more degree of freedom by setting the imaginary part of another moment to 0. Let us choose which correspond to eight alternative rotations defined by matrix 1. (N.B. The form of equations 7 and 8 depends only on the indices l and m, and therefore can be derived analytically in advance for efficient computation at run-time.) Finally, the rotated 3D Zernike moments can be obtained from:

BioZernike descriptors
For the 3D Zernike moments calculation, the structure is upscaled if its average dimension is less than 50Å and downscaled if it is greater than 200Å, using variable volume grid width. Subsequently, for every representative atom a Gaussian density is placed into the volume that corresponds to the amino acid/nucleotide weight and spherically averaged size. Representative atoms are defined as Cα for amino acids and backbone phosphate groups for nucleotides. Zernike moments are calculated up to order 20. Canterakis = 0 if n is odd. As the absolute values of the alternative solutions are averaged in each case, the third degree of freedom is lost and choice of a particular {Ω m nl } = 0 has no effect. Every such CN yields a vector of size 946, as opposed to 121 parameters obtained for a 3DZD of order 20. The final CN-based descriptor is a concatenation of the CNs of chosen orders and has 3784 components.
For the vector of geometric features GEO, we calculate the distance distribution from the center of mass of the structure to all its representative atoms. Next, we include in the vector moments of this distribution: standard deviation, skewness, kurtosis, as well as 10 th , 20 th , ...90 th percentiles. In addition, we include the structure radius of gyration, nominal molecular weight, and standard deviation of the coordinates along the principal axes, corresponding to the dimensions of the structure. The final GEO descriptor has 17 components.

Distance function
For the distance function training, we prepared a CATH-based dataset as follows: A non-redundant subset of CATH domains with up to 40% sequence identity was obtained from www.cathdb.info. Homologous super-families with fewer than 20 members were removed. Structures within each superfamily were superposed with TM-align using an all-versus-all strategy. A representative domain was selected for each superfamily with the maximal median TM-score of alignments to other members of the superfamily. The dataset was subsequently pruned so that the domains within each superfamily align with the representative domain with TM-score at least 0.75. Finally, we selected super-families starting from the most populous ones, which have TM-score between their representative domains at most 0.45. This procedure yielded 2685 structures (divided among 151 families). All-versus-all comparisons produced a training set with 46572 'positive' (same super-family) and 3556698 'negative' (different super-families) data points.
We defined the distance function for the composite BioZernike feature vectors as: where g 1 and g 2 are the geometric feature vectors being compared, m 1 and m 2 are the CN-based feature vectors, and w g and w m are the respective weights. The weights were fitted to the training set using regularized logistic regression. 10-fold cross-validation was performed on the superfamily level. The regularization parameter that maximized the Matthews correlation coefficient of the predictions of the excluded data was used for the final training with the entire dataset. Learned weight coefficients were constrained to non-negative values, which led to sparse solutions (e.g., 1458 weights were non-zero after the final training on the CATH dataset).
Importantly, the obtained distance function is by no means definitive, but rather an illustration of a general approach. The procedure can (and should) be repeated with the problem-specific training sets, yielding appropriate functions based on the BioZernike descriptors.

Domains
The domain test set was prepared based on the independent ECOD subset using the same procedure as for the CATH-based training set. Additionally, if an F-group representative domain could be aligned to any CATH superfamily representative domain with TMscore 0.75 or more, the group was excluded from the test set. Ultimately, 761 domain structures (divided among 34 families) remained. All-versus-all comparisons resulted in 13603 'positive' and 275577 'negative' data points.

Assemblies
500 biological assemblies were randomly selected from all PDB entries such that no two assemblies have density correlation score [33] larger than 0.5, to ensure distinct shapes. Afterwards, normal mode analysis from the ProDy package [34] was used to sample 4 additional conformations of each assembly, resulting in 2500 total structures evenly split into 500 classes. As in the domain set evaluation, all-versus-all comparisons were assessed, yielding 5000 'positives' and 3118750 'negatives'.

Reference methods
3D-Surfer 3DZD descriptors were obtained directly from the web server using default parameters. During the course of this work, we discovered a bug in the original 3DZD library [35], which caused the invariants of the same order to be cumulative. For the sake of fairness, we corrected the descriptors obtained from the 3D-Surfer server for this bug and notified the server maintainers. The correction somewhat improved the discrimination power ( Figure S1). Euclidean distance was used as the scoring function.
As Omokage score is not available for use with arbitrary structures, we implemented the procedure described in [23]. First, all structures were converted to representative point-sets with Situs v3.1 [36]. Second, we calculated the iDR profiles and implemented the Omokage score according to [23]. The implementation was independently validated using 1000 random comparisons between structures selected from PDB30. Comparisons were scored with our implementation and the PDBj omokage-pairwise web service. As shown in Figure S2, scoring is consistent between the implementations and insignificant deviations are likely due to the different versions of the Situs package used.

Protein superposition and similarity assessment with complete 3D Zernike moment invariants
We achieve fast similarity computation using protein structural descriptors. Robust descriptors must capture information relevant to their intended use (e.g., binding sites for virtual drug screening, solvent-accessible surfaces for protein docking, structural organization for establishing functional or evolutionary relationships) while being inexpensive to compute, quick to compare with other descriptors, and readily interpretable. Most highthroughput structure analysis pipelines involve balancing the tradeoff between speed and accuracy of the underlying representation. 3D Zernike moments, derived by Canterakis in [31], allow decomposition of an arbitrary volumetric function f (x) into a set of parameters Ω m nl : where Z m nl (x) are 3D Zernike functions, N is the maximum order of the decomposition, n ∈ [0, N ], l ∈ [0, n] so that (n − l) is even, and m ∈ [−l, l].
These parameters are independent, insensitive to noise, and, importantly, embody a hierarchy of shape representation. The latter property is of particular significance, as it enables intuitive interpretation of the information content in the moments of certain order (Figure 1).
Limiting their use, 3D Zernike moments are not invariant under rotation. While special properties of the spherical harmonic functions can be exploited to align two sets of moments, the resulting procedure is slower than classical coordinate-based methods [30]. A popular software library [35] implemented the 'trivial' rotation invariant descriptors from 3D Zernike moments, i.e., norms of the vectors Ω nl = (Ω l nl , Ω l−1 nl , ..., Ω −l nl ). We will refer to these descriptors as 3DZDs (3-Dimensional Zernike Descriptors) for consistency with prior work [27]. While this approach is straightforward and has proven to be widely applicable [26,37], the information loss is obvious: every (2l + 1)-dimensional vector of parameters is reduced to a single invariant. These simpler 3DZD invariants are the base of 3D-surfer [26], the first widely available tool that made use of the Zernike moment decomposition for protein shape matching.
In his work Canterakis [32,31] did derive a special normalization of the 3D Zernike moments, making them rotationally invariant. However, we found that this normalization does not perform well for the shapes of proteins and macromolecular complexes due to the abundance of symmetric oligomeric arrangements (data not shown).
Here we generalize the approach of Canterakis by developing normalization routines with wider applicability. These routines yield complete, rotationally invariant 3D Zernike moments (referred to hereafter as Canterakis Norms or CNs). Conceptually, a CN rotates an object so that selected moments become equal to predefined values. As the SO(3) group has three degrees of freedom and Zernike moments are complex numbers, we set one moment and the imaginary part of another to zeros. This orients an object in a uniquely determined standard position (Figure 2a). In terms of computation, generation of a CN corresponds to solving a system of two polynomial equations, which can be done efficiently using standard libraries (see Methods).
CNs immediately give rise to a computationally inexpensive global structure alignment. Indeed, if the same moments are normalized to their standard values for two objects, their induced standard positions are likewise equivalent (Figure 2a). By normalizing moments from various orders, we obtain alternative alignments that may be more or less suitable for a particular application (Figure 2b). Moreover, since the alignment is performed not between any two objects, but from every object to its standard position, an arbitrary number of structures may be aligned in linear time (Figure 2c).
It has to be noted that the system of polynomial equations mentioned above has alternative solutions. While there is a theoretically sound approach to break this ambiguity described in [32] (fixing signs of the selected moments), in practice it is not universally applicable because we calculate an approximation of the Zernike functions on a discrete volumetric grid. Thus, if the selected moment's value is small at the outset, the gridding effect may lead to its sign changing randomly based on the initial orientation. Nevertheless, we found that for structure alignment, we can either choose the ambiguity-breaking rule at runtime, based on the observed moment magnitudes, or circumvent the problem entirely by testing all alternatives with negligible loss of performance.
Finally, but most importantly, the complete moments of objects oriented in the same standard position are comparable. This critical property underpins our novel search procedure presented below.

Search procedure and evaluation
Our search procedure is depicted schematically in Figure 3, and relies on the newly designed BioZernike descriptor, a composite descriptor based on several CNs augmented with geometric features.
The simpler 3DZD descriptors are usually calculated for an object surface, following the original implementation [35]. We reasoned that the density volume contains valuable information for protein structure similarity retrieval. Therefore, we use simulated volumes (Figure 3c) as the basis for 3D Zernike moments calculation. Thus our newly introduced system has two important differences with 3D-Surfer [26]: use of full volumetric data instead of surface only and use of the complete CN invariants instead of the simpler 3DZD invariants.
CNs of different orders may be more or less appropriate for various shapes. Moreover, resolving ambiguity for alternative solutions depends on a particular symmetry that an object may possess. To make the CN-based descriptor versatile while retaining performance, we used several CNs of orders 2 to 5 and then average the absolute values of the alternative solutions (Figure 3d).
The 3D Zernike moments are defined for objects scaled to a unit ball which loses size-related information. To compensate for this fact, we developed a geometry-based descriptor (GEO). It includes features that can be quickly obtained from the set of representative atoms, such as structure dimensions along its principal axes or statistical properties of the interatomic distance distribution (see Methods).
Together the CN-based and the GEO descriptors constitute what we term a BioZernike descriptor. In order to judge similarity of 2 structures, we developed a distance measure that compares their BioZernike descriptors. We hypothesized that the oftenused Euclidean distance is suboptimal choice for comparing 3D Zernike moments-based descriptors, because of the hierarchical structure of the representation (Figure 1). Instead, we have followed a machine learning approach to determine weights for the descriptor components using a training set. We used a non-redundant subset of CATH [4] families for this purpose (see Methods).
Retrieval of similar structures was evaluated on a non-redundant subset of ECOD [5] and on a set of biological assemblies with distinct density shapes. 3D-Surfer [26] and Omokage [23] were selected for benchmarking purposes, as both operate on similar principles and represent the current state of the art. 3D-Surfer uses 3DZD descriptors and Euclidean distance to compare them, considering only the solvent-accessible surface of a protein. Omokage scoring utilizes properties of the interatomic distance distribution function (similar to our GEO score).
As shown in Figure 4, 3DZD descriptors applied directly to the density volume retain significantly more information pertaining to similarities and differences within the protein domain families. Retrieval performance is further improved by 1) using the custom distance function obtained via the training set, 2) using CNs instead of 3DZDs, and 3) augmenting CNs with GEO scoring. (AUCs for all evaluated methods are listed in Table  1.)

BioZernike library
An important result of this study is an open-source, customizable library that implements all routines required to obtain a BioZernike descriptor starting from a protein structure https://github.com/biocryst/biozernike. It is written in Java language and can be integrated into any project written in a JVM-compatible language. For instance it can be used together with BioJava to take advantage of its comprehensive structural bioinformatics capabilities [19].
The BioZernike library includes structure-to-volume conversion based on the gmconvert program [33]. We implemented dynamic scaling of the volume grid size to make both speed and precision equally suitable for smaller proteins and larger macromolecular assemblies. 3D Zernike moments were implemented after [35], with further optimization for batch processing. In addition to the classical 3DZD invariants, the library contains routines for calculating CN-norms introduced here and their application to multiple structure alignments.
The library is developed and continuously validated for processing of large amounts of structural data, such as those at RCSB PDB, which leads to a highly optimized and efficient implementation. For example, calculating a full BioZernike descriptor for PDB ID 5J7V (the largest macromolecular assembly represented in the PDB archive at the time of writing; 8280 component homo-oligomer containing 5,340,600 amino acid residues) takes ∼10 seconds. For a more typical oligomeric PDB structure, such as PDB ID 4HHB (hemoglobin α 2 β 2 hetero-tetramer containing 574 residues) the processing time is ∼30 milliseconds. The full processing for the entire archive as of November 2019 (all assemblies and all polymeric chains) takes 7 hours using 6 parallel threads. The time needed for descriptors comparison and moment alignment is negligible. The performance of the BioZernike library is showcased on the website http://shape.rcsb.org. The library's comprehensive implementation, together with its flexibility and speed makes it especially useful for developers who wish to create novel applications for classification and comparison of structural data.

Similar quaternary structure but different subunits
A very interesting set of comparisons is that of folds that are conserved globally across evolution while the subcomponents have been re-arranged in different ways as a result of gene fusion or duplication. The different biological assemblies are thus very similar overall but have different stoichiometries.
A striking example of this property is the Macrophage migration inhibitory factor (MIF). These tautomerase enzymes are conserved across the entire tree of life ( Figure  5a). The assembly fold is composed of β − α − β motifs, with a central β barrel formed by the different subunits coming together around a 3-fold axis of symmetry. Overall the assembly has quasi D3 point group symmetry. In eukaryotes (human: 4GRO; mouse: 1MFI; nematode: 2OS5) and cyanobacteria (2XCZ Prochlorococcus marinus) the enzyme is trimeric with a pseudo 2-fold symmetry within the subunits. In some bacteria the enzyme is hexameric, with both A6 (2X4K Staphylococcus aureus) and A3B3 (3EJ9 Pseudomonas pavonaceae) systems known.
Another well-known case in this category is that of the DNA-clamps (overall quasi D6 symmetry, with A3 stoichiometry in archaea and eukaryotes and A2 in bacteria) [38], also detectable with our system. Domain swaps are a further example that belong to this category and represent a widespread phenomenon in structural biology [39]. These assemblies do not present different stoichiometries but rather conformational changes in the subunits that allow for domains to be swapped with partner subunits. Conservation of quaternary structures between swapped and non-swapped assemblies is also detectable with our BioZernike system.

Different quaternary structure but similar subunits
Another case we can easily study and find using the BioZernike system involves divergent quaternary structure assemblies composed by the identical or highly similar subunits.
For example, the TRAP (trp RNA-binding attenuation protein) proteins from Geobacillus stearothermophilus have been determined to have both A11 and A12 stoichiometries, respectively with C11 and C12 point group symmetries (Figure 5b). Since both the global shape (low order Zernike moments) and the details (high order Zernike moments) are matching well, this case does not represent a problem for our retrieval system.
Other well-known cases were revealed through a mutation or an environmental change (e.g., pH) that favored one assembly relative to the other. One case in which conformational changes in the individual subunits lead to very different quaternary structures is porphobilinogen synthase (PBGS) [40], which exists as both a hexamer (1PV8) and octamer (1E51) (Figure 5c). Here, a single point mutation F12L dramatically shifts the equilibrium from octamer to hexamer, which produces an altered pH-rate profile; the hexameric assembly is active only at basic pH. Proteins, like PBGS, that can come apart, change subunit conformation, and reassemble differently with functional consequences have been named morpheeins, and can reflect a dissociative allosteric mechanism [41].
Our search system makes possible automatic identification of such proteins. In this case, the search can be performed only on parts of an assembly, such as an individual chain or a domain. Moreover, a distance function can be designed specifically to focus on higher-order Zernike moments so that structural features of the sub-components are weighted more heavily than the overall shape of the assembly.

Discussion
Protein structure retrieval and alignment with volumetric data provides several advantages versus traditional atomic-data based systems. First, it avoids the challenge of dealing with chain topology (arrangement of secondary structure components along the polypeptide chain) important in many biological systems (e.g., in circular permutants). Second, it obviates the problem of quaternary structure topology, be they local changes as in domain swapping or global changes with subunits that merge or split while conserving the overall quaternary arrangement. A further advantage of our method is that it automatically solves the chain matching problem [21]. These advantages combined with speed, yield a system that compares favorably to quaternary structure search and alignment tools in terms of scalability [14,42,20,19]. Moreover, this method can be applied directly to the growing number of electric Coulomb potential maps obtained using 3D electron microscopy (3DEM) and available from the EMDB data resource (https://www.ebi.ac.uk/pdbe/emdb/). At the same time, volumetric data preserves information content far better than shape based on surface representation. Thus, as the benchmarks above demonstrate, there are clear advantages of our volume-based system when compared to surface based methods [26,27]. Trivial examples such as hollow viral capsid versus a full spherical protein assembly can readily exemplify the difference.
Another key to the success of our method is the fact that the newly derived Canterakis normalizations (CNs) preserve much more information than the widely used 3DZDs. This is clearly demonstrated by the alignments that are naturally obtained from CNs.
As shown in the benchmark, our system outperforms other fast descriptor-based search approaches in terms of both precision and sensitivity. At the same time, it allows real-time (milliseconds) PDB archive-wide retrieval without the need to resort to ad-hoc strategies for speeding up the calculation. Run times of currently available services (Dali, TopSearch, PDBeFold) are measured in seconds or minutes solely because of additional speed-up strategies like pre-clustering or parallelization. Our system's faster performance applies equally to user input atomic coordinates, adding only minimal overhead typically measured in milliseconds. Such a system constitutes a valuable tool for structural biologists, allowing for real-time hypotheses generation at the conclusion of a structure determination campaign.
Importantly, the speed and accuracy of this method opens up the possibility of automated structural classification at any level, an avenue that we shall explore in future work.
One interesting application is multiple structure profiles: using normalized complete moments to parameterize entire protein families. A related 'consensus shapes' notion was introduced in [30], however we feel that using descriptor profiles rather than simple averages is the key. Akin to multiple sequence profiles, it would enable more sensitive searching through profile-to-profile alignments. Moreover, shape variation can be easily studied and visualized for an ensemble of structures. Further applications include scalable multiple structure alignment, alignment of 3DEM maps, and automated structural model-to-electron density (or 3DEM) map fitting.  Figure 4.    '3DZD (our implementation)' corresponds to our implementation of the 3DZD descriptors that takes into account the whole density distribution, rather than the protein surface only.  Figure S1: (a) 3DZD library bug results in characteristic non-decreasing wave pattern (red) of the descriptor, often found in literature. The same descriptor without the bug is shown for comparison in blue. (b) Precision-recall curve using Euclidean distance on the test set improves with the corrected version of the descriptor. Figure S2: Omokage score validation. The score calculated by our implementation (axis X) is plotted versus the score obtained from the PDBj server (axis Y ) for 1000 random comparisons between structures in PDB.