Abstract
The Protein Data Bank (PDB) is one of the richest open‑source repositories in biology, housing over 242,000 macromolecular structural models alongside much of the experimental data that underpins these models. By systematically collecting, validating, and indexing these models, the PDB has accelerated structural biology discoveries, enabling researchers to compare new entries against a vast archive of solved structures and, more recently, powering protein structure prediction. Leveraging this wealth of data, structural bioinformatics has uncovered patterns, such as conserved protein folds, binding‑site features, or subtle conformational shifts among related proteins, that would be impossible to detect from any single structure. Through the democratization of structural data and open-source analytical tools, now amplified by the power of large language models, a broader community of researchers is equipped to drive new scientific discoveries using structural data. However, good structural bioinformatics requires understanding some of the nuances of the underlying experimental data, data encoding conventions, and quality control metrics that can affect a model’s precision, fit‑to‑data, and comparability. This knowledge, combined with developing good controls, statistics, and connections to other databases, is essential for drawing accurate and reliable conclusions from PDB data. Here, we outline 10 recommendations for structural bioinformatic analyses, crafted to pave the way for others to uncover exciting discoveries.
Author summary
Here, we provide a roadmap for turning the Protein Data Bank’s vast collection of protein structural models into reliable and valuable insights. We lay out 10 clear rules that help readers quality-control their data, choose fair comparison sets, and judge model quality so that results aren’t led astray by noise, bias, or overconfidence. The guide also shows how to connect structures to other databases. By highlighting best practices, such as utilizing re-refined models and being aware of common pitfalls, we guide users to leverage this rich data for enhanced biological insights. These guidelines will enable stronger, more reproducible structural analyses that accelerate drug discovery, illuminate disease mechanisms, and make open data broadly useful across the life sciences.
Citation: Wankowicz SA (2025) Ten rules for a structural bioinformatic analysis. PLoS Comput Biol 21(10): e1013094. https://doi.org/10.1371/journal.pcbi.1013094
Editor: Patricia M. Palagi, SIB Swiss Institute of Bioinformatics, SWITZERLAND
Received: April 30, 2025; Accepted: October 9, 2025; Published: October 31, 2025
Copyright: © 2025 Stephanie A. Wankowicz. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: No data involved.
Funding: The author(s) received no specific funding for this work.
Competing interests: The author has declared that no competing interests exist.
Introduction
In 1971, with only seven structures, the first, and still active, open-access database in biology was created: the Protein Data Bank (PDB) [1]. From these modest beginnings, the PDB has grown to include over 242,000 structures [2], and its systematic archiving of macromolecular models has reshaped the field, transforming our understanding of the relationship between structure and biological function and enabling advances from elucidating enzyme catalysis to the rational design of new therapeutics. Critically, when the Research Collaboratory for Structural Bioinformatics (RCSB) PDB was established, one of their first major undertakings was a large-scale remediation of legacy data, addressing inconsistent formats, incomplete metadata, and nonstandard nomenclature to enable systematic analysis and ensure that the archive could support future large-scale structural bioinformatics. The remediation effort, later extended by the wwPDB partners (PDBe, PDBj, and Biological Magnetic Resonance Bank [BMRB]), standardized chemical components, corrected errors, and transitioned the archive to the current, more robust mmCIF format [2–4]. These efforts laid the foundation for structural bioinformatics by ensuring that PDB data were reliable, interoperable, and machine-readable.
Today, the PDB is maintained as a single, global archive through the Worldwide Protein Data Bank (wwPDB) consortium, which coordinates deposition, validation, and dissemination of macromolecular structures. The consortium comprises regional data centers (RCSB PDB in the United States, PDBe in Europe, and PDBj in Japan), each providing unique portals, visualization tools, and database integrations tailored to their respective communities [5–7]. All sites share a unified deposition system, ensuring that structures are consistently validated and mirrored worldwide within 24 hours of release [8]. In addition, the Electron Microscopy Data Bank (EMDB), jointly maintained with the PDB, serves as the central repository for cryo-electron microscopy (cryo-EM) electron potential maps, enabling joint deposition of maps and models. Together, these interconnected entities provide a comprehensive and interoperable ecosystem that continues to accelerate discovery across the structural biology community.
The original vision of the PDB centered on the deposition of individual structures, where each entry told a story about a protein’s fold, function, or interaction. This one-structure, one-story mode of structural biology has yielded an enormous wealth of knowledge and fundamentally shaped our understanding of biology. But one of the greatest strengths of the PDB lies in its ability to uncover new biological insights beyond the scope of a single structure. This has spurred the field of structural bioinformatics, which, through examining patterns over tens to thousands of structures, has identified relationships of protein families and folds, the role of evolution driving protein structure and function, and information on macromolecular interactions and catalysis [9–16]. Structural bioinformatics also created the critical foundation for the protein structure prediction breakthrough [17–21]. By analyzing thousands of structures at once, you gain the statistical power to detect subtle variations that, when aggregated, have the potential to reveal robust patterns in allostery, ligand binding, macromolecular assembly, and catalysis [9,12]. These analyses can be incredibly powerful alone or in conjunction with other bioinformatic databases, prospective experiments, or theoretical models.
At the core of every PDB structural model lies the atom table, which records the atomic coordinates along with key attributes such as atom type, residue identity, B-factor (atomic displacement parameter), and occupancy. The adoption of the mmCIF format has provided a far richer and more extensible representation than the legacy PDB format. Unlike the fixed-column limitations of PDB files, mmCIF can accommodate the growth of structural biology, including new ligands with five-character identifiers and very large macromolecular assemblies that exceed the capacity of the original format [22]. Another key advantage of mmCIF is its ability to map these deposited coordinates to canonical protein sequences, enabling seamless integration with UniProt and related databases [23]. mmCIF also underpins emerging resources such as PDB-IHM [24], which supports the deposition of integrative and ensemble models, and it provides the extensibility needed for new schemas, including recent work on hierarchical representations of conformational and chemical heterogeneity [25].
The democratization of coding skills, facilitated by large language models, has enabled more users to delve into structural bioinformatics, build hypotheses, support experimental findings, or make independent discoveries. However, structural data is nuanced and can be challenging to work with; without proper quality control or control analyses, it can lead to inaccurate conclusions. Users should understand limitations, potential pitfalls, and caveats to use the data to its full potential.
The guidelines outlined here stem from the lessons and challenges I and others have encountered while performing structural bioinformatics projects, with many of the lessons being applicable to machine learning applications as well. Although predicted structures are increasingly valuable in bioinformatics research [26–28], this article emphasizes experimentally derived structures. While not every rule is universally applicable, consider each recommendation carefully and evaluate how it may apply, or be adapted, to your problem.
So, first things first, what question do you want to ask, or hypothesis do you want to test? For example, are you looking at the overall protein fold, or does your question require you to know the rotamer angles and only look at wild-type structures of a specific protein? Knowing the answers to these questions is critical for determining your selection criteria, statistical power, analysis, and controls, as outlined below.
Recommendation #1: Define your biological selection criteria
When starting a structural bioinformatics project, the first step is to define the biological criteria for your study. Consider the structures you need to answer your research question, whether it involves all lysozymes, a specific tyrosine kinase, or all enzymes. Additionally, you may want to further refine your dataset based on ligands. Small molecules such as glycerol or DMSO are often crystallographic additives, while other molecules may be native or synthetic ligands, leading to differences in how you want to classify each structure. Further, assess whether your protein is part of large complexes by examining the identities of other chains. Identical chains typically reflect symmetry-related protomers, whereas distinct macromolecules reveal a multi-protein complex (see more details in Recommendation #4).
Beyond structural selection, sequence-level considerations help remove redundancy and guide clustering and alignment analyses. A significant fraction of PDB entries corresponds to homologous proteins or multiple structures of the same protein. Depending on your question, you may want to filter based on sequence or structure. You can cluster by sequence, using MMseqs2 or CD-HIT [29,30], or by structure, using TM-score and CATH [31,32], selecting representatives from each cluster for your downstream analysis based on resolution, R-factors, or other metrics (see more in Recommendation #2). Tools like the PISCES server automate this by removing sequences above a chosen identity threshold and keeping the highest-quality structure from each group [33].
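For instance, one way to reduce redundancy is to cluster chain sequences with MMseqs2 and keep a single representative per cluster. The sketch below is a minimal example, assuming the `mmseqs` binary is installed and on your PATH and that `chains.fasta` (a hypothetical file) contains one record per chain; adjust the identity and coverage thresholds to your question.

```python
# Minimal sketch: cluster chain sequences at 30% identity with MMseqs2 and
# collect the members of each cluster. Assumes `mmseqs` is on PATH and that
# `chains.fasta` (hypothetical) holds one FASTA record per chain.
import subprocess
from pathlib import Path

def cluster_sequences(fasta: str, out_prefix: str = "clusters", min_id: float = 0.3):
    subprocess.run(
        ["mmseqs", "easy-cluster", fasta, out_prefix, "tmp",
         "--min-seq-id", str(min_id), "-c", "0.8"],
        check=True,
    )
    # easy-cluster writes <prefix>_cluster.tsv with representative<TAB>member pairs
    clusters = {}
    for line in Path(f"{out_prefix}_cluster.tsv").read_text().splitlines():
        rep, member = line.split("\t")
        clusters.setdefault(rep, []).append(member)
    return clusters

clusters = cluster_sequences("chains.fasta")
print(f"{len(clusters)} clusters at 30% identity")
```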
The RCSB PDB offers precomputed sequence clusters at several identity thresholds (from 100% down to 30%), computed with the MMseqs2 algorithm [30] over the modeled residues. Note that this does not include unmodeled residues, often in terminal or loop regions, which can impact this analysis. For sequence alignments, the PDBe supports multiple sequence alignment (MSA) using Clustal Omega [34] and allows retrieval of FASTA sequences for custom alignment. By leveraging the SIFTS database [35], you can map PDB entries onto CATH or SCOP structural hierarchies and UniProt sequence records, and select structures by fold, superfamily, or sequence-based functional annotation (see more in Recommendation #9) [23,31,36]. Additionally, you can use structural alignments, such as those performed with FATCAT, TM-align, CE, or Smith-Waterman 3D alignment, to provide insights into sequence and structural relationships [37–41]. These can be powerful for identifying proteins with similar shapes despite sequence differences. Many of these selections can be made using one of the three PDB APIs [5,7,42].
Recommendation #2: Determine how you will quality control your data
Beyond determining the biological and sequence selection criteria, it is crucial to consider the experimental data underlying structures to ensure a quality dataset. This begins with identifying the methods of structure determination, such as X-ray crystallography, cryo-EM, nuclear magnetic resonance (NMR), or neutron diffraction. Additional factors include resolution (not applicable for NMR), agreement with structure determination data, and stereochemical accuracy.
Resolution, the most common criterion for structural bioinformatics analysis, sets the theoretical limit on the precision of the structural model and is reported for all structures. High resolution, better than 2.5 Å, is essential for accurate side chain positioning, whereas lower resolution models can still yield valuable insights into overall fold and backbone conformation. In cryo-EM, however, resolution is estimated differently than in crystallography: it is typically calculated using the Fourier Shell Correlation (FSC) between two independently reconstructed half-maps [43]. The FSC curve reflects the degree of agreement between the two maps as a function of spatial frequency, and the resolution is conventionally reported at the point where the correlation falls below a given threshold (commonly 0.143). However, the FSC is not a direct measure of atomic detail in the same way that crystallographic resolution is, but rather a measure of the global similarity between two noisy reconstructions. Complicating matters further, cryo-EM maps often exhibit substantial variation in local resolution across the structure, meaning a single global resolution metric may not faithfully capture the interpretability of all regions of the map [44]. It is also important to note that in both X-ray and cryo-EM, while two structures can have the same resolution, they can be modeled to different levels of accuracy, necessitating the exploration of other metrics.
The PDB publishes global validation metrics, including knowledge-based assessments of atomic models, evaluations of the underlying experimental data, and measures of agreement between model and data, and provides detailed validation reports for each entry, following standards established by the X-ray, NMR, and cryo-EM validation task forces [45–49]. Even at high resolution, nearly all structures have a few local errors, but at lower resolutions, errors become more widespread. As structural models can vary widely in quality, these validation metrics are important for maintaining scientific reliability and for minimizing the risk of errors propagating into biological interpretation, drug design, or computational modeling.
Geometric metrics, such as Ramachandran outliers, are derived from tools such as MolProbity and PROCHECK [50–52]. In X-ray crystallography, R-values quantify the agreement between model and data, with higher values reflecting poorer fits; a value near or above 0.3 is commonly used as a threshold for poor quality [53]. Further, depending on your question, you may also want to examine the geometry or fit to real-space data of individual residues using tools such as MolProbity, Ringer, or real-space correlation coefficients [52,54–57].
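As a minimal illustration of filtering crystallographic entries by resolution and R-free, the sketch below uses the gemmi library to read the standard _refine items from a coordinate mmCIF file; the cutoffs and file name are placeholders, and some depositions omit these items entirely.

```python
# Minimal sketch: pull resolution and R-free from a coordinate mmCIF file with
# gemmi and apply simple cutoffs. Uses the standard _refine mmCIF items; some
# entries record them as '?' or '.' (or omit them), so guard for that.
import gemmi

def passes_qc(cif_path: str, max_res: float = 2.5, max_rfree: float = 0.30) -> bool:
    block = gemmi.cif.read(cif_path).sole_block()
    res = block.find_value("_refine.ls_d_res_high")
    rfree = block.find_value("_refine.ls_R_factor_R_free")
    if res in (None, "?", ".") or rfree in (None, "?", "."):
        return False  # missing metadata: fail here, or flag for manual review
    return float(res) <= max_res and float(rfree) <= max_rfree

print(passes_qc("5xyz.cif"))  # hypothetical file name
```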
In cryo-EM, there is an expanding set of approaches being introduced to evaluate the quality of deposited maps and models [58]. As discussed above, global measures such as FSC between half-maps remain the gold standard for estimating overall map resolution, while model–map FSC curves assess the consistency between the atomic model and experimental maps [47,59]. Increasingly, local validation has become critical. EMRinger evaluates the accuracy of side-chain placement by comparing electron potential peaks with expected rotamer positions, and Q-scores quantify how well the electron potential supports individual atoms and residues [60,61].
You should also be aware of unmodeled regions, often loops or termini. These can be identified by manually comparing the FASTA sequence, representing the input construct, against the sequence in the PDB structural model or with tools like Seqatoms [62]. One may decide to exclude proteins with large missing segments or model in missing loops (see Recommendation #3). The PDB-REDO team has developed an algorithm (Loopwhole) to help fill in many of these missing loops, but it is most effective when high-quality homologous structures are available and when the experimental electron density supports accurate grafting and refinement [63]. Models that include filled loops will be classified as “rebuilt” in the PDB-REDO database. If you include structures with unresolved regions, acknowledge this limitation and adjust your analysis accordingly (see more in Recommendation #8).
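One way to flag unmodeled regions programmatically is to compare the deposited entity sequence against the residues that actually carry coordinates. The sketch below is an assumption-laden example using gemmi’s label_seq numbering on a standard coordinate mmCIF; the file name is a placeholder, and multi-model (ensemble) entries would need extra handling.

```python
# Minimal sketch: report label_seq positions present in the deposited entity
# sequence (_entity_poly_seq) but absent from the atom table, per chain.
import gemmi

def missing_residues(cif_path: str) -> dict:
    st = gemmi.read_structure(cif_path)
    st.setup_entities()
    report = {}
    model = st[0]  # first model is enough for typical X-ray/cryo-EM entries
    for chain in model:
        polymer = chain.get_polymer()
        entity = st.get_entity_of(polymer)
        if entity is None or not entity.full_sequence:
            continue
        expected = set(range(1, len(entity.full_sequence) + 1))
        modeled = {res.label_seq for res in polymer if res.label_seq is not None}
        gaps = sorted(expected - modeled)
        if gaps:
            report[chain.name] = gaps
    return report

print(missing_residues("5xyz.cif"))  # hypothetical file name
```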
Beyond proteins, structures often include small molecules, nucleic acids, carbohydrates, or other molecules of varying quality [64]. Small molecule ligand quality is assessed by agreement with experimental data and geometric accuracy [65], with the latter being evaluated against Cambridge Structural Database reference structures [66]. Metals are also checked by CheckMyMetal, which evaluates metal coordination geometry, bond valency, and potential steric clashes [67]. For nucleic acids, PDB-REDO has introduced validation routines to assess the normality of Watson–Crick base-pair geometry, while DNATCO provides complementary validation of DNA and RNA backbone conformations [68,69].
Ultimately, determining the appropriate experimental selection criteria depends on your research question. For instance, if your research focuses on side chain positioning, higher resolution, lower R-values, and precise stereochemical validation are critical. Alternatively, if you are looking for information on the overall protein fold, a broader selection of structures may be acceptable.
Recommendation #3: Re-processing structural model data
Most structural bioinformatic approaches take information directly from coordinate (PDBx/mmCIF) files. In doing so, you inherit any errors or biases introduced by the original modelers. Where possible, it is recommended to use X-ray structural models from PDB-REDO [70–72], which has reanalyzed the majority of PDB structures with deposited experimental data (structure factors), providing uniform automated re-refinement combined with structure validation and difference-density peak analysis. Since the deposition of reflection data was only encouraged beginning in 1998 and became mandatory in 2008, older structures, whose experimental data were less frequently archived in the PDB, are underrepresented in PDB-REDO [73]. While many models without experimental data can be informative, they come with caveats due to different and older data processing pipelines.
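A common pattern is to prefer the PDB-REDO model when one exists and otherwise fall back to the deposited entry. The sketch below is a rough example; the URL layouts reflect current PDB-REDO and RCSB file-server conventions but should be treated as assumptions, so check pdb-redo.eu if downloads fail.

```python
# Minimal sketch: fetch the PDB-REDO re-refined model for an entry, falling back
# to the deposited coordinates if no PDB-REDO model is available.
import requests

def fetch_model(pdb_id: str) -> tuple[str, bytes]:
    pdb_id = pdb_id.lower()
    redo_url = f"https://pdb-redo.eu/db/{pdb_id}/{pdb_id}_final.cif"  # assumed layout
    r = requests.get(redo_url, timeout=30)
    if r.ok:
        return "pdb-redo", r.content
    # fall back to the deposited model from the RCSB file server
    r = requests.get(f"https://files.rcsb.org/download/{pdb_id}.cif", timeout=30)
    r.raise_for_status()
    return "pdb", r.content

source, data = fetch_model("1cbs")
print(source, len(data), "bytes")
```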
It is also possible to re-process all structures yourself [74,75]. If you are new to refinement, there are many tutorials to get you started [76,77]. Re-processing data can ensure that the experimental data are processed in the same way, or allow the application of a specific modeling modality or tool within a refinement program, such as multiconformer modeling, ensemble refinement, 3D variation analysis, or quantum refinement [78–80]. To reprocess your data, the experimental data must be available, such as MTZ files for X-ray crystallography; maps, half-maps, or particle stacks for cryo-EM from the EMDB [81,82]; or raw NMR data from the BMRB [81]. After re-processing, similar quality control metrics, as described in Recommendation #2, should be used to evaluate structures.
Recommendation #4: The PDB and structural models are weird and biased
The PDB is not a uniform sample of all proteins. Because high-resolution crystallography, which comprises the majority of the PDB, favors small, globular, soluble proteins, membrane and flexible or disordered proteins account for roughly 20%–30% of genes but make up less than 2% of PDB entries [83]. Moreover, publication bias further distorts the distribution of structural models, with drug targets, enzymes, and other high-value human proteins accounting for a disproportionate share of PDB entries. This accumulation of many structures of the same protein is becoming even more pronounced with the increase in fragment‐screening campaigns [84]. As a result, certain protein families dominate the PDB, artificially amplifying their characteristic features in any global analysis. You must consider these redundancies in your analysis, as discussed in Recommendation #2.
In addition to redundancy, it is also important to understand what structural unit is represented in a PDB file. The database distinguishes between the asymmetric unit, the crystallographic unit directly observed in the experiment, and the biological assembly, which represents the functional quaternary structure in vivo. The PDB provides separate mmCIF files for biological assemblies, which are either specified by the authors or inferred computationally by tools such as PISA [85]. For most biological analyses, the biological assembly is the appropriate choice, though it should be noted that approximately 20% of these assemblies may be incorrect, with ProtCID and ProtCAD databases being valuable for sorting true assemblies from crystallographic artifacts [86,87].
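To see the difference in practice, you can download both the asymmetric unit and the first biological assembly for the same entry and compare chain counts. The assembly file-name pattern below follows current RCSB conventions but is an assumption, and the entry ID is only a placeholder.

```python
# Minimal sketch: compare chain counts between the asymmetric unit and the
# first biological assembly of the same entry.
import gemmi
import requests

def chain_count(url: str) -> int:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    block = gemmi.cif.read_string(resp.text).sole_block()
    st = gemmi.make_structure_from_block(block)
    return len(list(st[0]))  # chains in the first model

pdb_id = "2hbs"  # placeholder entry ID
asym = chain_count(f"https://files.rcsb.org/download/{pdb_id}.cif")
bio = chain_count(f"https://files.rcsb.org/download/{pdb_id}-assembly1.cif")  # assumed pattern
print(f"asymmetric unit: {asym} chains; assembly 1: {bio} chains")
```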
Beyond the bias in which structural models exist in the PDB, the models themselves can be odd and biased. First, it is important to remember that PDB models are just models. They do not explain all the underlying experimental data and can vary depending on the processing pipeline (see Recommendation #3). For example, in X-ray crystallography, crystal contacts (nonbiological interactions between symmetry-related molecules within the crystal lattice) can artificially stabilize particular conformations or create interfaces that don’t exist in solution, potentially skewing structural bioinformatics analyses of protein dynamics, flexibility, and genuine interaction sites. We also previously showed that binding site residues are often better modeled than residues outside the binding site [88]. Further, regions of unmodeled residues can arise for many reasons, including resolution and subjective modeling, but automated refinement pipelines cannot correct all of them. All of these issues can lead to structures having different biases. In addition, structures often include unmodeled blobs of density, frequently corresponding to ligands.
Finally, all structural data contains extensive conformational and compositional heterogeneity modeled with varying accuracy and encoding [25]. These encodings include anisotropic B-factors, alternative atom locations (altlocs), and multiple models. Anisotropic B-factors describe the direction and magnitude of atomic displacement, while altlocs represent multiple conformations modeled for a single atom [78]. Multiple models, often used in ensemble structures, provide different plausible conformations that together capture the underlying structural variability [89]. While there are ways to encode some of these metrics more uniformly, some encodings cannot be interchanged. Additionally, most bioinformatics libraries, including Biopython, strip out much of this encoding, potentially introducing biases into downstream analyses [90]. To guard against these biases, it is essential to document any exclusions or alterations made to the data, ensuring accurate comparisons downstream.
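Before running a pipeline that may silently discard this information, it is worth counting how much heterogeneity a model actually encodes. The sketch below uses gemmi to tally atoms with alternate locations and anisotropic ADPs; the file name is a placeholder, and the attribute names reflect the current gemmi Python API.

```python
# Minimal sketch: count alternate conformations (altlocs) and anisotropic ADPs
# in a model before any downstream library strips them out.
import gemmi

def heterogeneity_summary(path: str) -> dict:
    st = gemmi.read_structure(path)
    total = altloc_atoms = aniso_atoms = 0
    for model in st:
        for chain in model:
            for residue in chain:
                for atom in residue:
                    total += 1
                    if atom.altloc:           # empty string means no altloc
                        altloc_atoms += 1
                    if atom.aniso.nonzero():  # anisotropic ADP tensor present
                        aniso_atoms += 1
    return {"atoms": total, "with_altloc": altloc_atoms, "anisotropic": aniso_atoms}

print(heterogeneity_summary("5xyz.cif"))  # hypothetical file name
```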
Recommendation #5: Consider your analysis’s sample size, statistics, overfitting, and uncertainty
After dataset selection and quality control comes the fun part: identifying what drives differences between structures. Descriptive bioinformatic analyses, such as cataloguing residue types and counts within binding pockets, are straightforward, but any comparative study requires careful attention to sample size and statistical power. Smaller groups demand larger effect sizes to achieve significance, and paired comparisons should employ paired statistical tests to account for within‐pair correlations. Equally important is judging whether observed differences, such as shifts in binding site residue rotamers or altered pocket volumes, are biologically meaningful [91].
When comparing two unpaired groups, choose parametric or nonparametric tests based on data distribution. Parametric tests assume normality, while nonparametric tests are more flexible when distributions are skewed (e.g., residue B-factor values or pocket volumes). For paired data, for example, wild-type vs. mutant, or bound vs. unbound structures, use paired t-tests or Wilcoxon signed-rank tests. Further, be wary of multiple hypothesis testing. Consider adjusting p-values using Bonferroni or false discovery rate corrections. You can also use resampling methods such as jackknife, bootstrap, or cross-validation to help estimate variability and confidence intervals. Applying well‐chosen controls helps guard against false positives and ensures that your findings reflect genuine structural phenomena rather than quirks of a particular dataset.
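As a concrete sketch of these choices, the example below runs a paired Wilcoxon test, applies a Benjamini–Hochberg correction, and computes a bootstrap confidence interval using SciPy and statsmodels. The arrays are synthetic placeholders standing in for per-structure metrics such as pocket volumes.

```python
# Minimal sketch: paired nonparametric test, multiple-testing correction, and a
# bootstrap confidence interval for the mean paired difference.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
apo = rng.normal(300, 20, size=40)        # hypothetical pocket volumes (apo)
holo = apo - rng.normal(5, 10, size=40)   # paired holo measurements

# Paired, nonparametric comparison
stat, p = stats.wilcoxon(apo, holo)

# Correct p-values if you run many such tests (only one here, for illustration)
reject, p_adj, _, _ = multipletests([p], method="fdr_bh")

# Bootstrap CI for the mean paired difference
diff = apo - holo
boot = stats.bootstrap((diff,), np.mean, confidence_level=0.95, random_state=0)
print(p_adj[0], boot.confidence_interval)
```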
Avoiding overfitting is equally critical, whether you are working in bioinformatics or machine learning. Where possible, never develop and validate hypotheses on the same data without independent testing. Splitting your dataset into training, validation, and test sets, or employing k-fold cross-validation, is recommended even when defining new structural descriptors or clustering algorithms. Further, consider how you partition your test set, whether by sequence similarity, structural features, or other criteria, to avoid overfitting or memorization [92].
Recommendation #6: Determine and apply the correct controls
Choosing the proper controls is one of a bioinformatic study’s most challenging and often overlooked aspects. Fortunately, the abundance of publicly available structural data makes incorporating negative and positive controls feasible. Controls must directly address the null hypothesis you wish to reject. Negative control datasets, where no effect is expected, are usually easier to define, while positive control datasets (those known to exhibit the effect) can be harder to assemble. For example, if you’re testing whether a novel structural motif alters protein function, you might compare your proteins of interest against a set of homologous structures that lack the motif. Differences that persist between the groups are more likely to stem from the motif than from background variation. You can also randomize specific features, such as residue type or solvent exposure, to break genuine signals, or selectively choose structures that should not display the phenomenon under study [12]. This strategy ensures that any detected signal isn’t merely an artifact of the overall distribution of structural features.
For example, consider a case where you argue that hydrophobic residues in binding sites are inherently less dynamic. Alternative explanations might include differences in solvent exposure, secondary‐structure context, or biases introduced by your dataset (for instance, selecting only certain CATH classes or ligand types). A robust negative control would examine hydrophobic residues outside binding pockets matched for solvent accessibility and local secondary structure. While it may be impossible to control every variable perfectly, assessing your metric across complementary subsets is critical for demonstrating that your findings reflect genuine biological effects rather than quirks of data selection.
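One simple way to implement such a randomization control is a label-permutation test: shuffle the binding-site labels many times and ask how often the shuffled difference is as large as the observed one. The values below are synthetic placeholders for a real per-residue metric and real binding-site annotations.

```python
# Minimal sketch: permutation test for the difference in a per-residue metric
# between binding-site and non-binding-site residues.
import numpy as np

rng = np.random.default_rng(1)
metric = rng.normal(size=500)        # hypothetical per-residue flexibility metric
in_site = rng.random(500) < 0.2      # hypothetical binding-site labels

observed = metric[in_site].mean() - metric[~in_site].mean()

null = []
for _ in range(10_000):
    shuffled = rng.permutation(in_site)   # break the genuine label-metric link
    null.append(metric[shuffled].mean() - metric[~shuffled].mean())

p_value = (np.sum(np.abs(null) >= abs(observed)) + 1) / (len(null) + 1)
print(f"observed difference {observed:.3f}, permutation p = {p_value:.4f}")
```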
Recommendation #7: Understand how metrics are compared across your structures
Without careful evaluation, comparison metrics can lead to incorrect conclusions. For instance, larger proteins naturally exhibit higher overall root mean square deviation (RMSD) values, a common metric for comparing the similarity of two structures. Normalizing RMSD by sequence length or reporting RMSD per residue can correct this. Many structure alignment tools, including DALI and TM-align, provide Z-scores indicating the likelihood that an observed similarity would occur by chance [32,93]. Alignment and comparison in torsion space also provide a powerful way to distinguish functionally relevant conformational states. Torsion-angle-based approaches preserve subtle, biologically meaningful differences that are often obscured in atomic coordinate space [94,95].
B-factors, also called temperature factors, atomic displacement parameters, or Debye–Waller factors, estimate each atom’s displacement parameter, combining thermal motion of the atom with static disorder from the crystal lattice [96]. Because they arise from the refinement process, B-factors are influenced by data resolution, model bias, occupancy, and lattice packing. As a result, high B-factors do not necessarily guarantee high flexibility in solution. To use them reliably, it’s best to normalize B-factors, for example, by Z-scoring within a structure, comparing structures with very similar crystallographic parameters [97,98], and, when possible, corroborating with another metric of flexibility. There are a plethora of other comparison metrics that can be used to compare groups or pairs of PDBs [93,99–101]. Understanding how these metrics are derived and how best to apply them to your analysis is essential to ensuring you use them properly and avoid introducing bias.
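A minimal sketch of per-structure normalization is shown below: Z-scoring Cα B-factors within each model using gemmi, so that values become roughly comparable across entries refined at different resolutions. The file name is a placeholder, and this does not replace corroboration with an independent flexibility metric.

```python
# Minimal sketch: Z-score Calpha B-factors within a single structure.
import gemmi
import numpy as np

def ca_bfactor_zscores(path: str) -> dict:
    st = gemmi.read_structure(path)
    model = st[0]
    keys, values = [], []
    for chain in model:
        for residue in chain.get_polymer():
            ca = residue.find_atom("CA", "*")   # '*' matches any altloc
            if ca is not None:
                keys.append((chain.name, residue.seqid.num))
                values.append(ca.b_iso)
    b = np.asarray(values)
    z = (b - b.mean()) / b.std()
    return dict(zip(keys, z))

zscores = ca_bfactor_zscores("5xyz.cif")  # hypothetical file name
```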
Recommendation #8: Appropriately connect and compare structures
When comparing two groups of structures, it is crucial to balance confounding variables to ensure that biological differences, rather than methodological or crystallographic artifacts, drive the observed differences. Differences in resolution, space group, unit‐cell parameters, data processing, and data collection parameters can lead to incorrect conclusions. Depending on the question, this can also include differences in local metrics such as MolProbity or validation scores [102]. Even reprocessing identical raw data with identical refinement settings can yield subtly different models because of stochasticity built into the refinement optimization process [74,75]. To minimize such artifacts, it is important to apply consistent processing pipelines (such as PDB-REDO) and, where possible, to match crystallographic parameters.
These controls become even more critical when looking at pairs of structures, such as ligand-bound versus apo or mutant versus wild type. In these analyses, you are often looking for subtle conformational changes, and you want to ensure they are not driven by nonbiological artifacts. We recommend pairing structures based on biological differences and ensuring that they have similar crystallographic properties. Some general guidelines include using datasets with resolutions within 0.3 Å, identical space groups, and unit cell dimensions that differ by no more than 10%. While these criteria are not always achievable, deviations can introduce artifacts: differences in crystal contacts or solvent volume may affect the conclusions you can draw.
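These guidelines are straightforward to encode as a filter when assembling pairs. The sketch below uses gemmi to compare resolution, space group, and unit-cell edges between two structures; the thresholds mirror the rough guidelines above, not hard rules, and the file names are placeholders.

```python
# Minimal sketch: check whether two crystal structures are roughly comparable
# (resolution within 0.3 A, identical space group, cell edges within 10%).
import gemmi

def comparable(path_a: str, path_b: str) -> bool:
    a, b = gemmi.read_structure(path_a), gemmi.read_structure(path_b)
    if abs(a.resolution - b.resolution) > 0.3:      # resolution may be 0 if unset
        return False
    if a.spacegroup_hm != b.spacegroup_hm:
        return False
    for ea, eb in zip((a.cell.a, a.cell.b, a.cell.c), (b.cell.a, b.cell.b, b.cell.c)):
        if abs(ea - eb) / max(ea, eb) > 0.10:
            return False
    return True

print(comparable("apo.cif", "holo.cif"))  # hypothetical file names
```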
In some cases, it is valuable to collect structures with diverse crystallographic properties from the same or closely related proteins. Such comparisons can provide insight into conformational heterogeneity and, in particular, are useful for studying loop conformations that crystal contacts may influence. By grouping structures into distinct crystal forms, one can analyze loop conformations across different crystallographic contexts and disentangle genuine biological flexibility from artifacts introduced during crystallization [103,104].
Additionally, for any comparison, you must determine how you will align and match structures across groups. For most comparisons, you will need to align structures, often based on alpha-carbon positions; however, other options include aligning the entire structure or taking the sequence into account. Global metrics, such as RMSD, allow you to ignore sequence or small length differences, but comparing specific sections of the protein or individual amino acids takes more care and thought. For example, you may want to compare how a specific loop varies among homologs. This will require aligning structures around that loop, or to all residues besides the loop, and also ensuring that crystal contacts are not driving these conclusions.
When comparing structures of the same protein, you can match residues using chain and residue IDs, but a standard numbering scheme is required. This can be done by manually renumbering chains and residues or by employing algorithms such as PDBrenum to map PDB residue numbers onto UniProt numbering, which also allows for integration with other databases (see Recommendation #9) [105]. If the structures are similar, you can also align them based on an MSA. Note that while the MSA will enable renumbering, a single residue number may still correspond to different residue types.
Additionally, it is worthwhile to see if existing databases or collections have the comparisons you want. For example, multiple databases pair apo-holo structures together, although depending on your question, you may want to further curate this database down based on crystallographic properties [106].
Recommendation #9: Connect your analysis to other databases or prospective experiments
By connecting PDB structures with other bioinformatics databases, you can enrich your analyses with sequence features, domain architectures, pathway contexts, and chemical insights, uncovering deeper relationships between structure, function, and activity. The PDBe API provides programmatic access to sequence, taxonomy, and functional annotations [5]. Family and domain classifications, including Pfam, SCOP, ECOD, and CATH [14,31,35,107,108], are accessible via SIFTS [35]. SIFTS also offers residue-level mappings between PDB structures and UniProt sequences, enabling the labeling of functional sites onto PDB structures [109]. This facilitates comparative analyses, such as examining conformational changes across a family or correlating structural motifs with functional annotations from Gene Ontology or InterPro [110,111]. PDBs can be connected to pathway and chemical databases such as KEGG and Reactome via UniProt [112,113]. PDBe-KB further consolidates annotations from multiple specialist resources, providing an integrated knowledge base that highlights functional and biological insights mapped onto PDB entries [114]. In addition, the 3D-Beacons network connects structural biology resources across multiple providers, ensuring consistent and federated access to experimental and computational models [115]. While these resources are highly complementary, they are not entirely overlapping, as each database captures different aspects of biological knowledge, and careful integration is often necessary to avoid redundancy or misinterpretation.
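For example, residue-range SIFTS mappings between a PDB entry and UniProt can be retrieved from the PDBe “mappings” API. The endpoint and JSON keys below reflect the current service but may change, so inspect the response if fields differ; the entry ID is only an example.

```python
# Minimal sketch: fetch SIFTS PDB-to-UniProt residue-range mappings from PDBe.
import requests

def uniprot_mappings(pdb_id: str):
    pdb_id = pdb_id.lower()
    url = f"https://www.ebi.ac.uk/pdbe/api/mappings/uniprot/{pdb_id}"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    out = []
    # JSON layout assumed: {pdb_id: {"UniProt": {accession: {"mappings": [...]}}}}
    for acc, entry in data[pdb_id]["UniProt"].items():
        for m in entry["mappings"]:
            out.append((acc, m["chain_id"], m["unp_start"], m["unp_end"]))
    return out

for acc, chain, start, end in uniprot_mappings("1cbs"):
    print(f"{acc}: chain {chain}, UniProt residues {start}-{end}")
```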
Additionally, many PDB structural models contain small molecules. PDBe provides excellent ligand pages and tools for analysis within the database [116,117]. Small molecule information can also be linked to existing databases: the PDB’s Chemical Component Dictionary assigns ligand IDs that can be cross-referenced with ChEMBL, PubChem, or DrugBank [118–120], and external databases such as PDBbind and BindingDB group chemical or binding information and link it back to PDB entries [121,122]. These databases enable easier retrieval of assay data, clinical information, or physicochemical properties of ligands. A growing number of ‘curated’ databases also cover protein-ligand interactions, post-translational modifications, and nucleic acid interaction sites, among many others [123–127]. You can then use the pre-calculated metrics or the curated PDB lists to calculate the metrics you are interested in.
Additionally, bioinformatics can serve as an excellent partner for hypothesis generation or for supporting prospective experiments. For example, structural bioinformatics can pinpoint the specific residue(s) to mutate to test a desired functional effect, or evaluate whether an experimentally derived hypothesis, such as a loop–domain interaction, holds across homologous structures and influences protein activity.
Recommendation #10: Visualize everything!
One of the best things about structural biology is visualizing what you are discovering. Looking at structures and the metrics you are using in PyMOL or Chimera is a powerful quality control tool for your bioinformatic analyses [128,129]. For example, after calculating a comparison metric between two structures, manually explore it in a visualization program. You can ask: Are you aligning the structures or residues correctly? Does the quantification of the metric make sense? Once you have confirmed that metrics are calculated correctly and you have results you want to show, PyMOL, Chimera, or Coot offer various representations for pieces of the molecule, the underlying experimental data, and distance measurements [128–130]. PyMOL can also load molecular dynamics trajectories to visualize conformational changes. ChimeraX’s plugin infrastructure efficiently handles larger structures.
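A short PyMOL session script can serve as this kind of sanity check: fetch two related entries, align them on Cα atoms, and color one by B-factor to see whether the computed metric looks sensible. This is a sketch to be run inside PyMOL (for example, `pymol -cq check.py`); the entry IDs are placeholders.

```python
# Minimal sketch of a PyMOL session script: align two entries and color by B-factor.
from pymol import cmd

cmd.fetch("1abc", type="cif")   # placeholder reference structure
cmd.fetch("2xyz", type="cif")   # placeholder comparison structure

# align() returns a list whose first element is the refined RMSD
rmsd = cmd.align("2xyz and name CA", "1abc and name CA")[0]
print(f"Calpha alignment RMSD: {rmsd:.2f} A")

# Color the reference by B-factor to eyeball flexible regions
cmd.spectrum("b", "blue_white_red", "1abc")
cmd.show_as("cartoon")
cmd.png("comparison.png", width=1200, height=900, dpi=150, ray=1)
```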
Discussion
Structural bioinformatics provides a robust framework for identifying patterns in macromolecular structures, integrating with other databases, supporting theoretical approaches, and informing prospective experiments [9,12,131]. For example, overlaying quantitative proteomics and large-scale sequence variation onto structural clusters enables the identification of regulatory hotspots and the prioritization of functionally relevant variants. Additionally, structural bioinformatics can be incredibly powerful in supporting or refuting hypotheses from prospective experiments. While we did not focus on this, AlphaFold or other structure prediction models can help fill gaps where experimental structures are absent [18–20], and are now expanding beyond proteins [18,132]. However, users must remain mindful of the “last-Ångstrom” problem: predicted models are often inaccurate at the level of precise measurements, such as molecular interactions and residue networks, and these predictions lack conformational ensembles [133,134].
Beyond single-structure analyses, statistical and integrative structural biology approaches can help merge structural models to detect new or more subtle changes in structures or structural ensembles. Further, while most of this article focused on how to detect subtle differences using bioinformatics, these tools can also be used to go the other way spatially, integrating cell-scale data to construct multiscale assemblies in their native contexts. This bridges atomistic observations to emergent cellular behaviors, closing the loop between structure, function, and phenotype [135].
Finally, many concepts presented in this paper should also be considered when doing machine learning on protein structures. While protein structure prediction has led to an explosion of machine learning algorithms and approaches applied to structural data, many issues that hinder bioinformatic analyses also arise when splitting datasets in machine learning [92,136,137]. In particular, researchers must carefully avoid information leakage by ensuring that homologous proteins, redundant structures, or closely related crystal forms are not distributed across training and test sets, as this can lead to overly optimistic performance estimates. Incorporating these principles into structural bioinformatics ensures that computational results remain reliable, reproducible, and ultimately informative for guiding experimental design.
References
- 1. Bernstein FC, Koetzle TF, Williams GJ, Meyer EF Jr, Brice MD, Rodgers JR, et al. The Protein Data Bank: a computer-based archival file for macromolecular structures. J Mol Biol. 1977;112(3):535–42. pmid:875032
- 2. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, et al. The Protein Data Bank. Nucleic Acids Res. 2000;28(1):235–42. pmid:10592235
- 3. Henrick K, Feng Z, Bluhm WF, Dimitropoulos D, Doreleijers JF, Dutta S, et al. Remediation of the protein data bank archive. Nucleic Acids Res. 2008;36(Database issue):D426–33. pmid:18073189
- 4. Berman H, Henrick K, Nakamura H, Markley JL. The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic Acids Res. 2007;35(Database issue):D301–3. pmid:17142228
- 5. Mir S, Alhroub Y, Anyango S, Armstrong DR, Berrisford JM, Clark AR, et al. PDBe: towards reusable data delivery infrastructure at protein data bank in Europe. Nucleic Acids Res. 2018;46(D1):D486–92. pmid:29126160
- 6. Rose Y, Duarte JM, Lowe R, Segura J, Bi C, Bhikadiya C, et al. RCSB Protein Data Bank: architectural advances towards integrated searching and efficient access to macromolecular structure data from the PDB archive. J Mol Biol. 2021;433(11):166704. pmid:33186584
- 7. Kinjo AR, Bekker G-J, Wako H, Endo S, Tsuchiya Y, Sato H, et al. New tools and functions in data-out activities at Protein Data Bank Japan (PDBj). Protein Sci. 2018;27(1):95–102. pmid:28815765
- 8. Young JY, Westbrook JD, Feng Z, Sala R, Peisach E, Oldfield TJ, et al. OneDep: unified wwPDB system for deposition, biocuration, and validation of macromolecular structures in the PDB archive. Structure. 2017;25(3):536–45. pmid:28190782
- 9. Du S, Kretsch RC, Parres-Gold J, Pieri E, Cruzeiro VWD, Zhu M, et al. Conformational ensembles reveal the origins of serine protease catalysis. Cold Spring Harbor Laboratory; 2024.
- 10. Modi V, Dunbrack RL Jr. Defining a new nomenclature for the structures of active and inactive kinases. Proc Natl Acad Sci U S A. 2019;116(14):6818–27. pmid:30867294
- 11. Brenner SE, Chothia C, Hubbard TJ. Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc Natl Acad Sci U S A. 1998;95(11):6073–8. pmid:9600919
- 12. Wankowicz SA, de Oliveira SH, Hogan DW, van den Bedem H, Fraser JS. Ligand binding remodels protein side-chain conformational heterogeneity. Elife. 2022;11:e74114. pmid:35312477
- 13. Finkelstein AV, Ptitsyn OB. Why do globular proteins fit the limited set of folding patterns?. Prog Biophys Mol Biol. 1987;50(3):171–90. pmid:3332386
- 14. Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995;247(4):536–40. pmid:7723011
- 15. Jones S, Thornton JM. Principles of protein-protein interactions. Proc Natl Acad Sci U S A. 1996;93(1):13–20. pmid:8552589
- 16. Jones DT, Taylor WR, Thornton JM. A new approach to protein fold recognition. Nature. 1992;358(6381):86–9. pmid:1614539
- 17. Alford RF, Leaver-Fay A, Jeliazkov JR, O’Meara MJ, DiMaio FP, Park H, et al. The rosetta all-atom energy function for macromolecular modeling and design. J Chem Theory Comput. 2017;13(6):3031–48. pmid:28430426
- 18. Abramson J, Adler J, Dunger J, Evans R, Green T, Pritzel A, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024;630(8016):493–500. pmid:38718835
- 19. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–9. pmid:34265844
- 20. Krishna R, Wang J, Ahern W, Sturmfels P, Venkatesh P, Kalvet I, et al. Generalized biomolecular modeling and design with RoseTTAFold All-Atom. Science. 2024;384(6693):eadl2528. pmid:38452047
- 21. Dunbrack RL Jr, Karplus M. Backbone-dependent rotamer library for proteins. Application to side-chain prediction. J Mol Biol. 1993;230(2):543–74. pmid:8464064
- 22. Westbrook JD, Fitzgerald PMD. The PDB format, mmCIF, and other data formats. Methods Biochem Anal. 2003;44:161–79.
- 23. UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 2015;43(Database issue):D204-12. pmid:25348405
- 24. Vallat B, Webb BM, Westbrook JD, Goddard TD, Hanke CA, Graziadei A, et al. IHMCIF: an extension of the PDBx/mmCIF data standard for integrative structure determination methods. J Mol Biol. 2024;436(17):168546. pmid:38508301
- 25. Wankowicz SA, Fraser JS. Comprehensive encoding of conformational and compositional protein structural ensembles through the mmCIF data structure. IUCrJ. 2024;11(Pt 4):494–501. pmid:38958015
- 26. Nomburg J, Doherty EE, Price N, Bellieny-Rabelo D, Zhu YK, Doudna JA. Birth of protein folds and functions in the virome. Nature. 2024;633(8030):710–7. pmid:39187718
- 27. Monzon V, Haft DH, Bateman A. Folding the unfoldable: using AlphaFold to explore spurious proteins. Bioinform Adv. 2022;2(1):vbab043. pmid:36699409
- 28. Osmanli Z, Falgarone T, Samadova T, Aldrian G, Leclercq J, Shahmuradov I, et al. The difference in structural states between canonical proteins and their isoforms established by proteome-wide bioinformatics analysis. Biomolecules. 2022;12(11):1610. pmid:36358962
- 29. Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150–2. pmid:23060610
- 30. Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017;35(11):1026–8. pmid:29035372
- 31. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM. CATH—a hierarchic classification of protein domain structures. Structure. 1997;5(8):1093–108. pmid:9309224
- 32. Zhang Y, Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 2005;33(7):2302–9. pmid:15849316
- 33. Wang G, Dunbrack RL Jr. PISCES: a protein sequence culling server. Bioinformatics. 2003;19(12):1589–91. pmid:12912846
- 34. Sievers F, Higgins DG. Clustal Omega for making accurate alignments of many protein sequences. Protein Sci. 2018;27(1):135–45. pmid:28884485
- 35. Velankar S, Dana JM, Jacobsen J, van Ginkel G, Gane PJ, Luo J, et al. SIFTS: structure integration with function, taxonomy and sequences resource. Nucleic Acids Res. 2013;41(Database issue):D483-9. pmid:23203869
- 36. Lo Conte L, Ailey B, Hubbard TJ, Brenner SE, Murzin AG, Chothia C. SCOP: a structural classification of proteins database. Nucleic Acids Res. 2000;28(1):257–9. pmid:10592240
- 37. Bittrich S, Segura J, Duarte JM, Burley SK, Rose Y. RCSB protein Data Bank: exploring protein 3D similarities via comprehensive structural alignments. Bioinformatics. 2024;40(6):btae370. pmid:38870521
- 38. Li Z, Jaroszewski L, Iyer M, Sedova M, Godzik A. FATCAT 2.0: towards a better understanding of the structural diversity of proteins. Nucleic Acids Res. 2020;48(W1):W60–4. pmid:32469061
- 39. Liu Z, Zhang C, Zhang Q, Zhang Y, Yu D-J. TM-search: an efficient and effective tool for protein structure database search. J Chem Inf Model. 2024;64(3):1043–9. pmid:38270339
- 40. Guda C, Lu S, Scheeff ED, Bourne PE, Shindyalov IN. CE-MC: a multiple protein structure alignment server. Nucleic Acids Res. 2004;32(Web Server issue):W100-3. pmid:15215359
- 41. Topham CM, Rouquier M, Tarrat N, André I. Adaptive Smith-Waterman residue match seeding for protein structural alignment. Proteins. 2013;81(10):1823–39. pmid:23720362
- 42. Piehl DW, Vallat B, Truong I, Morsy H, Bhatt R, Blaumann S, et al. rcsb-api: Python toolkit for streamlining access to RCSB Protein Data Bank APIs. J Mol Biol. 2025;437(15):168970. pmid:39894387
- 43. Rosenthal PB, Henderson R. Optimal determination of particle orientation, absolute hand, and contrast loss in single-particle electron cryomicroscopy. J Mol Biol. 2003;333(4):721–45. pmid:14568533
- 44. Kucukelbir A, Sigworth FJ, Tagare HD. Quantifying the local resolution of cryo-EM density maps. Nat Methods. 2014;11(1):63–5. pmid:24213166
- 45. Gore S, et al. Validation of structures in the Protein Data Bank. Structure. 2017;25:1916–27.
- 46. Montelione GT, et al. Recommendations of the wwPDB NMR validation task force. Structure. 2013;21:1563–70.
- 47. Henderson R, Sali A, Baker ML, Carragher B, Devkota B, Downing KH, et al. Outcome of the first electron microscopy validation task force meeting. Structure. 2012;20(2):205–14. pmid:22325770
- 48. Gutmanas A, Adams PD, Bardiaux B, Berman HM, Case DA, Fogh RH, et al. NMR Exchange Format: a unified and open standard for representation of NMR restraint data. Nat Struct Mol Biol. 2015;22(6):433–4. pmid:26036565
- 49. Read RJ, Adams PD, Arendall WB 3rd, Brunger AT, Emsley P, Joosten RP, et al. A new generation of crystallographic validation tools for the protein data bank. Structure. 2011;19(10):1395–412. pmid:22000512
- 50. Williams CJ, Headd JJ, Moriarty NW, Prisant MG, Videau LL, Deis LN, et al. MolProbity: more and better reference data for improved all-atom structure validation. Protein Sci. 2018;27(1):293–315. pmid:29067766
- 51. Laskowski RA, MacArthur MW, Moss DS, Thornton JM. PROCHECK: a program to check the stereochemical quality of protein structures. J Appl Crystallogr. 1993;26(2):283–91.
- 52. Meyder A, Nittinger E, Lange G, Klein R, Rarey M. Estimating electron density support for individual atoms and molecular fragments in X-ray structures. J Chem Inf Model. 2017;57(10):2437–47. pmid:28981269
- 53. Brünger AT. Free R value: a novel statistical quantity for assessing the accuracy of crystal structures. Nature. 1992;355(6359):472–5. pmid:18481394
- 54. Prisant MG, Williams CJ, Chen VB, Richardson JS, Richardson DC. New tools in MolProbity validation: CaBLAM for CryoEM backbone, UnDowser to rethink “waters,” and NGL Viewer to recapture online 3D graphics. Protein Sci. 2020;29(1):315–29. pmid:31724275
- 55. Lang PT, Ng H-L, Fraser JS, Corn JE, Echols N, Sales M, et al. Automated electron-density sampling reveals widespread conformational polymorphism in proteins. Protein Sci. 2010;19(7):1420–31. pmid:20499387
- 56. Tickle IJ. Statistical quality indicators for electron-density maps. Acta Crystallogr D Biol Crystallogr. 2012;68(Pt 4):454–67. pmid:22505266
- 57. Kleywegt GJ, Harris MR, Zou JY, Taylor TC, Wählby A, Jones TA. The Uppsala electron-density server. Acta Crystallogr D Biol Crystallogr. 2004;60(Pt 12 Pt 1):2240–9. pmid:15572777
- 58. Lander GC. Single particle cryo-EM map and model validation: It’s not crystal clear. Curr Opin Struct Biol. 2024;89:102918. pmid:39293191
- 59. Grigorieff N. Resolution measurement in structures derived from single particles. Acta Crystallogr D Biol Crystallogr. 2000;56(Pt 10):1270–7. pmid:10998623
- 60. Barad BA, Echols N, Wang RY-R, Cheng Y, DiMaio F, Adams PD, et al. EMRinger: side chain-directed model and map validation for 3D cryo-electron microscopy. Nat Methods. 2015;12(10):943–6. pmid:26280328
- 61. Pintilie G, Zhang K, Su Z, Li S, Schmid MF, Chiu W. Measurement of atom resolvability in cryo-EM maps with Q-scores. Nat Methods. 2020;17(3):328–34. pmid:32042190
- 62. Brandt BW, Heringa J, Leunissen JAM. SEQATOMS: a web tool for identifying missing regions in PDB in sequence context. Nucleic Acids Res. 2008;36(Web Server issue):W255-9. pmid:18463137
- 63. van Beusekom B, Joosten K, Hekkelman ML, Joosten RP, Perrakis A. Homology-based loop modeling yields more complete crystallographic protein structures. IUCrJ. 2018;5(Pt 5):585–94. pmid:30224962
- 64. Liebeschuetz JW. The good, the bad, and the twisted revisited: an analysis of ligand geometry in highly resolved protein-ligand X-ray structures. J Med Chem. 2021;64(11):7533–43. pmid:34060310
- 65. Shao C, Westbrook JD, Lu C, Bhikadiya C, Peisach E, Young JY, et al. Simplified quality assessment for small-molecule ligands in the Protein Data Bank. Structure. 2022;30(2):252-262.e4. pmid:35026162
- 66. Groom CR, Bruno IJ, Lightfoot MP, Ward SC. The Cambridge structural database. Acta Crystallogr B Struct Sci Cryst Eng Mater. 2016;72(Pt 2):171–9. pmid:27048719
- 67. Zheng H, Cooper DR, Porebski PJ, Shabalin IG, Handing KB, Minor W. CheckMyMetal: a macromolecular metal-binding validation tool. Acta Crystallogr D Struct Biol. 2017;73(Pt 3):223–33. pmid:28291757
- 68. de Vries I, Kwakman T, Lu XJ, Hekkelman ML, Deshpande M, Velankar S, et al. New restraints and validation approaches for nucleic acid structures in PDB-REDO. Acta Crystallogr D Struct Biol. 2021;77(Pt 9):1127–41. pmid:34473084
- 69. Černý J, Božíková P, Schneider B. DNATCO: assignment of DNA conformers at dnatco.org. Nucleic Acids Res. 2016;44(W1):W284-7. pmid:27150812
- 70. Joosten RP, Vriend G. PDB improvement starts with data deposition. Science. 2007;317(5835):195–6. pmid:17626865
- 71. Joosten RP, Womack T, Vriend G, Bricogne G. Re-refinement from deposited X-ray data can deliver improved models for most PDB entries. Acta Crystallogr D Biol Crystallogr. 2009;65(Pt 2):176–85. pmid:19171973
- 72. Joosten RP, Salzemann J, Bloch V, Stockinger H, Berglund A-C, Blanchet C, et al. PDB_REDO: automated re-refinement of X-ray structure models in the PDB. J Appl Crystallogr. 2009;42(Pt 3):376–84. pmid:22477769
- 73. Jiang J, Abola E, Sussman JL. Deposition of structure factors at the Protein Data Bank. Acta Crystallogr D Biol Crystallogr. 1999;55(Pt 1):4. pmid:10089388
- 74. Murshudov GN, Skubák P, Lebedev AA, Pannu NS, Steiner RA, Nicholls RA, et al. REFMAC5 for the refinement of macromolecular crystal structures. Acta Crystallogr D Biol Crystallogr. 2011;67(Pt 4):355–67. pmid:21460454
- 75. Adams PD, Afonine PV, Bunkóczi G, Chen VB, Davis IW, Echols N, et al. PHENIX: a comprehensive Python-based system for macromolecular structure solution. Acta Crystallogr D Biol Crystallogr. 2010;66(Pt 2):213–21. pmid:20124702
- 76. Shabalin IG, Porebski PJ, Minor W. Refining the macromolecular model—achieving the best agreement with the data from X-ray diffraction experiment. Crystallogr Rev. 2018;24(4):236–62. pmid:30416256
- 77. Du S, Wankowicz SA, Yabukarski F, Doukov T, Herschlag D, Fraser JS. Refinement of multiconformer ensemble models from multi-temperature X-ray diffraction data. Methods Enzymol. 2023;688:223–54. pmid:37748828
- 78. Wankowicz SA, Ravikumar A, Sharma S, Riley B, Raju A, Hogan DW, et al. Automated multiconformer model building for X-ray crystallography and cryo-EM. Elife. 2024;12:RP90606. pmid:38904665
- 79. Burnley BT, Afonine PV, Adams PD, Gros P. Modelling dynamics in protein crystal structures by ensemble refinement. Elife. 2012;1:e00311. pmid:23251785
- 80. Zubatyuk R, Biczysko M, Ranasinghe K, Moriarty NW, Gokcan H, Kruse H, et al. AQuaRef: machine learning accelerated quantum refinement of protein structures. bioRxiv. 2025;:2024.07.21.604493. pmid:39071315
- 81. Hoch JC, et al. Biological magnetic resonance data bank. Nucleic Acids Res. 2023;51:D368–76.
- 82. wwPDB Consortium. EMDB-the Electron Microscopy Data Bank. Nucleic Acids Res. 2024;52(D1):D456–65. pmid:37994703
- 83. Choy BC, Cater RJ, Mancia F, Pryor EE. A 10-year meta-analysis of membrane protein structural biology: detergents, membrane mimetics, and structure determination techniques. Biochim Biophys Acta Biomembr. 2021;1863:183533.
- 84. Erlanson D, Burley S, Fraser J, Fearon D, Kreitler D, Nonato MC, et al. Where to house big data on small fragments?. American Chemical Society (ACS); 2025.
- 85. Krissinel E, Henrick K. Inference of macromolecular assemblies from crystalline state. J Mol Biol. 2007;372(3):774–97. pmid:17681537
- 86. Xu Q, Dunbrack RL Jr. The protein common interface database (ProtCID)—a comprehensive database of interactions of homologous proteins in multiple crystal forms. Nucleic Acids Res. 2011;39(Database issue):D761–70. pmid:21036862
- 87. Xu Q, Dunbrack RL. The protein common assembly database (ProtCAD)-a comprehensive structural resource of protein complexes. Nucleic Acids Res. 2023;51(D1):D466–78. pmid:36300618
- 88. Wankowicz SA. Modeling bias toward binding sites in PDB structural models. Cold Spring Harbor Laboratory; 2024.
- 89. Woldeyes RA, Sivak DA, Fraser JS. E pluribus unum, no more: from one crystal, many conformations. Curr Opin Struct Biol. 2014;28:56–62. pmid:25113271
- 90. Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25(11):1422–3. pmid:19304878
- 91. Gaudreault F, Chartier M, Najmanovich R. Side-chain rotamer changes upon ligand binding: common, crucial, correlate with entropy and rearrange hydrogen bonding. Bioinformatics. 2012;28(18):i423–30. pmid:22962462
- 92. Buttenschoen M, Morris GM, Deane CM. PoseBusters: AI-based docking methods fail to generate physically valid poses or generalise to novel sequences. Chem Sci. 2023;15(9):3130–9. pmid:38425520
- 93. Holm L, Sander C. Dali: a network tool for protein structure comparison. Trends Biochem Sci. 1995;20(11):478–80. pmid:8578593
- 94. Ginn HM. Torsion angles to map and visualize the conformational space of a protein. Protein Sci. 2023;32(4):e4608. pmid:36840926
- 95. Nicholls RA, Fischer M, McNicholas S, Murshudov GN. Conformation-independent structural comparison of macromolecules with ProSMART. Acta Crystallogr D Biol Crystallogr. 2014;70(Pt 9):2487–99. pmid:25195761
- 96. Sun Z, Liu Q, Qu G, Feng Y, Reetz MT. Utility of B-factors in protein science: interpreting rigidity, flexibility, and internal motion and engineering thermostability. Chem Rev. 2019;119(3):1626–65. pmid:30698416
- 97. Ringe D, Petsko GA. Study of protein dynamics by X-ray diffraction. Methods Enzymol. 1986;131:389–433. pmid:3773767
- 98. Carugo O. B-factor accuracy in protein crystal structures. Acta Crystallogr D Struct Biol. 2022;78(Pt 1):69–74. pmid:34981763
- 99. Tyzack JD, Fernando L, Ribeiro AJM, Borkakoti N, Thornton JM. Ranking enzyme structures in the PDB by bound ligand similarity to biological substrates. Structure. 2018;26(4):565–571.e3. pmid:29551288
- 100. Yeturu K, Chandra N. PocketMatch: a new algorithm to compare binding sites in protein structures. BMC Bioinformatics. 2008;9:543. pmid:19091072
- 101. Meslamani J, Rognan D, Kellenberger E. sc-PDB: a database for identifying variations and multiplicity of “druggable” binding sites in proteins. Bioinformatics. 2011;27(9):1324–6. pmid:21398668
- 102. Chen VB, Arendall WB 3rd, Headd JJ, Keedy DA, Immormino RM, Kapral GJ, et al. MolProbity: all-atom structure validation for macromolecular crystallography. Acta Crystallogr D Biol Crystallogr. 2010;66(Pt 1):12–21. pmid:20057044
- 103. Creon A, Scheer TES, Reinke P, Mashhour AR, Günther S, Niebling S, et al. Statistical crystallography reveals an allosteric network in SARS-CoV-2 Mpro. Cold Spring Harbor Laboratory; 2025.
- 104. Yabukarski F, Doukov T, Pinney MM, Biel JT, Fraser JS, Herschlag D. Ensemble-function relationships to dissect mechanisms of enzyme catalysis. Sci Adv. 2022;8(41):eabn7738. pmid:36240280
- 105. Faezov B, Dunbrack RL Jr. PDBrenum: a webserver and program providing Protein Data Bank files renumbered according to their UniProt sequences. PLoS One. 2021;16(7):e0253411. pmid:34228733
- 106. Feidakis CP, Krivak R, Hoksza D, Novotny M. AHoJ-DB: a PDB-wide assignment of apo & holo relationships based on individual protein-ligand interactions. J Mol Biol. 2024;436(17):168545. pmid:38508305
- 107. Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, Sonnhammer ELL, et al. Pfam: the protein families database in 2021. Nucleic Acids Res. 2021;49(D1):D412–9. pmid:33125078
- 108. Cheng H, Schaeffer RD, Liao Y, Kinch LN, Pei J, Shi S, et al. ECOD: an evolutionary classification of protein domains. PLoS Comput Biol. 2014;10(12):e1003926. pmid:25474468
- 109. Choudhary P, Anyango S, Berrisford J, Tolchard J, Varadi M, Velankar S. Unified access to up-to-date residue-level annotations from UniProtKB and other biological databases for PDB data. Sci Data. 2023;10(1):204. pmid:37045837
- 110. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25(1):25–9. pmid:10802651
- 111. Paysan-Lafosse T, et al. InterPro in 2022. Nucleic Acids Res. 2023;51:D418–27.
- 112. Kanehisa M, Furumichi M, Tanabe M, Sato Y, Morishima K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 2017;45(D1):D353–61. pmid:27899662
- 113. Gillespie M, et al. The reactome pathway knowledgebase 2022. Nucleic Acids Res. 2022;50:D687–92.
- 114. PDBe-KB consortium. PDBe-KB: a community-driven resource for structural and functional annotations. Nucleic Acids Res. 2020;48:D344–53.
- 115. Varadi M, Nair S, Sillitoe I, Tauriello G, Anyango S, Bienert S, et al. 3D-Beacons: decreasing the gap between protein sequences and structures through a federated network of protein structure data resources. Gigascience. 2022;11:giac118. pmid:36448847
- 116. Choudhary P, Kunnakkattu IR, Nair S, Lawal DK, Pidruchna I, Afonso MQL, et al. PDBe tools for an in-depth analysis of small molecules in the Protein Data Bank. Protein Sci. 2025;34(4):e70084. pmid:40100137
- 117. Kunnakkattu IR, Choudhary P, Pravda L, Nadzirin N, Smart OS, Yuan Q, et al. PDBe CCDUtils: an RDKit-based toolkit for handling and analysing small molecules in the Protein Data Bank. J Cheminform. 2023;15(1):117. pmid:38042830
- 118. Gaulton A, et al. The ChEMBL database in 2017. Nucleic Acids Res. 2017;45:D945–54.
- 119. Knox C, Wilson M, Klinger CM, Franklin M, Oler E, Wilson A, et al. DrugBank 6.0: the DrugBank knowledgebase for 2024. Nucleic Acids Res. 2024;52(D1):D1265–75. pmid:37953279
- 120. Kim S, Chen J, Cheng T, Gindulyte A, He J, He S, et al. PubChem 2019 update: improved access to chemical data. Nucleic Acids Res. 2019;47(D1):D1102–9. pmid:30371825
- 121. Liu Z, Li Y, Han L, Li J, Liu J, Zhao Z, et al. PDB-wide collection of binding data: current status of the PDBbind database. Bioinformatics. 2015;31(3):405–12. pmid:25301850
- 122. Liu T, Hwang L, Burley SK, Nitsche CI, Southan C, Walters WP, et al. BindingDB in 2024: a FAIR knowledgebase of protein-small molecule binding data. Nucleic Acids Res. 2025;53(D1):D1633–44. pmid:39574417
- 123. Zhang C, Zhang X, Freddolino L, Zhang Y. BioLiP2: an updated structure database for biologically relevant ligand-protein interactions. Nucleic Acids Res. 2024;52(D1):D404–12. pmid:37522378
- 124. Joosten RP, te Beek TAH, Krieger E, Hekkelman ML, Hooft RWW, Schneider R, et al. A series of PDB related databases for everyday needs. Nucleic Acids Res. 2011;39(Database issue):D411–9. pmid:21071423
- 125. Li F, Fan C, Marquez-Lago TT, Leier A, Revote J, Jia C, et al. PRISMOID: a comprehensive 3D structure database for post-translational modifications and mutations with functional impact. Brief Bioinform. 2020;21(3):1069–79. pmid:31161204
- 126. Mitra R, Cohen AS, Tang WY, Hosseini H, Hong Y, Berman HM, et al. RNAproDB: a webserver and interactive database for analyzing protein-RNA interactions. J Mol Biol. 2025:169012. pmid:40126909
- 127. Jendele L, Krivak R, Skoda P, Novotny M, Hoksza D. PrankWeb: a web server for ligand binding site prediction and visualization. Nucleic Acids Res. 2019;47(W1):W345–9. pmid:31114880
- 128. Lill MA, Danielson ML. Computer-aided drug design platform using PyMOL. J Comput Aided Mol Des. 2011;25(1):13–9. pmid:21053052
- 129. Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, et al. UCSF Chimera—a visualization system for exploratory research and analysis. J Comput Chem. 2004;25(13):1605–12. pmid:15264254
- 130. Casañal A, Lohkamp B, Emsley P. Current developments in Coot for macromolecular model building of Electron Cryo-microscopy and Crystallographic Data. Protein Sci. 2020;29(4):1069–78. pmid:31730249
- 131. Lovell SC, Word JM, Richardson JS, Richardson DC. The penultimate rotamer library. Proteins. 2000;40(3):389–408. pmid:10861930
- 132. Hekkelman ML, de Vries I, Joosten RP, Perrakis A. AlphaFill: enriching AlphaFold models with ligands and cofactors. Nat Methods. 2023;20(2):205–13. pmid:36424442
- 133. Lyu N, Du S, Ma J, Herschlag D. An evaluation of biomolecular energetics learned by AlphaFold. Cold Spring Harbor Laboratory; 2025.
- 134. Wankowicz SA, Bonomi M. From possibility to precision in macromolecular ensemble prediction. 2025.
- 135. Young LN, Villa E. Bringing structure to cell biology with cryo-electron tomography. Annu Rev Biophys. 2023;52:573–95. pmid:37159298
- 136. Chakravarty D, Schafer JW, Chen EA, Thole JF, Ronish LA, Lee M, et al. AlphaFold predictions of fold-switched conformations are driven by structure memorization. Nat Commun. 2024;15(1):7296. pmid:39181864
- 137. Sellner MS, Lill MA, Smieško M. Quality matters: deep learning-based analysis of protein-ligand interactions with focus on avoiding bias. Cold Spring Harbor Laboratory; 2023.