Fig 1.
Growth of the Protein Data Bank archive.
(A) The largest asymmetric structure currently in the PDB, the HIV capsid (PDB ID 3J3Q), contains over 2.4 million atoms. (B) The number of depositions per year (obsoleted or superseded entries are excluded). (C) The average structure size (asymmetric unit size for crystallographic structures). (D) Electron microscopy structures have contributed ~10 million atoms per year over the past 3 years (1% of the archive).
Table 1.
Data categories described in MMTF format.
Fig 2.
Steps in the creation of an MMTF file from a PDBx/mmCIF file.
After parsing a PDBx/mmCIF file, DSSP secondary structure is calculated and bond information is added for all residues. Custom encoding strategies are applied to the different data categories to achieve a compact representation. These data are serialized in binary form and then further compressed with standard compression tools to create a compressed MMTF file.
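The last two steps of this pipeline, binary serialization followed by standard compression, can be sketched as follows. This is a minimal illustration, not the MMTF implementation: the caption does not name the serializer, so Python's standard `struct` and `gzip` modules stand in, and the `encoded_column` values are invented to resemble an already-encoded column of small integers.

```python
import gzip
import struct
from typing import List


def serialize_int16(values: List[int]) -> bytes:
    """Pack integers as big-endian 16-bit signed values (assumed layout)."""
    return struct.pack(f">{len(values)}h", *values)


def deserialize_int16(data: bytes) -> List[int]:
    """Inverse of serialize_int16."""
    return list(struct.unpack(f">{len(data) // 2}h", data))


# Hypothetical column that has already been custom encoded into small integers.
encoded_column = [(i * 37) % 500 - 250 for i in range(3000)]

binary = serialize_int16(encoded_column)   # step 1: binary serialization
compressed = gzip.compress(binary)         # step 2: standard compression

# Reading reverses the steps: decompress, then deserialize.
restored = deserialize_int16(gzip.decompress(compressed))

print(f"binary: {len(binary)} bytes, gzipped: {len(compressed)} bytes")
```

The round trip is lossless; how much gzip gains on top of the custom encoding depends on the data.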
Fig 3.
Dictionary entry for amino acid serine.
Fig 4.
Workflow for encoding columnar data within MMTF.
(A) Columnar data are first converted to integer arrays. Depending on the type of values in the array, three types of custom encoding are applied, targeting (1) repeated values, (2) sequential values, and (3) small differences between adjacent values. The encoded values are finally serialized as a byte array. (B) Example of encoding 2,000 occupancy values by integer encoding (x100) followed by run-length encoding. (C) Example of encoding 2,000 atom serial numbers by applying delta and run-length encoding. (D) Example of encoding atom coordinate values by integer encoding (x1,000), delta encoding, and recursive index encoding into a 16-bit signed integer array. Here, the value 32,867 exceeds the maximum value (32,767) for a 16-bit signed integer. Therefore, recursive index encoding decomposes this value into two numbers, 32,767 and 100, that sum to the original value. All subsequent values are within range and are represented directly by their values, 2,001 and 1,053.
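The three examples in panels (B) to (D) can be reproduced with a short Python sketch. The function names are illustrative and not taken from any MMTF library, and the three coordinate values are invented so that their deltas match the numbers quoted in the caption (32,867, 2,001, and 1,053). Treating a stored value equal to the 16-bit limit as a continuation marker is an assumption about how recursive index encoding terminates.

```python
from typing import List

INT16_MAX = 32767
INT16_MIN = -32768


def integer_encode(values: List[float], factor: int) -> List[int]:
    """Turn floats into integers by scaling with a fixed factor and rounding."""
    return [round(v * factor) for v in values]


def run_length_encode(values: List[int]) -> List[int]:
    """Collapse runs of equal values into flattened (value, count) pairs."""
    encoded: List[int] = []
    for v in values:
        if encoded and encoded[-2] == v:
            encoded[-1] += 1          # extend the current run
        else:
            encoded.extend([v, 1])    # start a new run
    return encoded


def delta_encode(values: List[int]) -> List[int]:
    """Keep the first value, then store each difference to its predecessor."""
    return [v - prev for prev, v in zip([0] + values[:-1], values)]


def recursive_index_encode(values: List[int]) -> List[int]:
    """Split out-of-range values into 16-bit pieces that sum to the original."""
    encoded: List[int] = []
    for v in values:
        while v >= INT16_MAX:         # limit value marks "more to come"
            encoded.append(INT16_MAX)
            v -= INT16_MAX
        while v <= INT16_MIN:
            encoded.append(INT16_MIN)
            v -= INT16_MIN
        encoded.append(v)
    return encoded


# (B) 2,000 identical occupancy values: integer encoding (x100), then run-length.
occupancy = run_length_encode(integer_encode([1.00] * 2000, 100))

# (C) 2,000 sequential atom serial numbers: delta, then run-length.
serials = run_length_encode(delta_encode(list(range(1, 2001))))

# (D) Coordinates: integer encoding (x1,000), delta, recursive index.
coords = [32.867, 34.868, 35.921]     # invented values for illustration
packed = recursive_index_encode(delta_encode(integer_encode(coords, 1000)))

print(occupancy)  # [100, 2000]
print(serials)    # [1, 2000]
print(packed)     # [32767, 100, 2001, 1053]
```

In cases (B) and (C), 2,000 input values shrink to two integers; in case (D), every stored value fits in 16 bits at the cost of one extra element for the out-of-range delta.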
Fig 5.
Data structure of a custom-encoded record in MMTF.
A Codec Type describes the columnar encoding strategy. A Codec may combine several encoding strategies. For example, coordinate data are encoded by a Codec that combines integer encoding, delta encoding, and recursive index encoding. Data Length represents the number of values that have been encoded. The Codec Parameter depends on the Codec; for coordinate encoding, it is the divisor used to convert integers back to floating point numbers.
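A reader for such a record might look like the sketch below. The caption names the three header fields but not their binary layout, so the layout here is an assumption: three 4-byte big-endian signed integers followed by a payload of 16-bit signed integers. The codec number 10 in the example is likewise illustrative.

```python
import struct
from typing import List, NamedTuple

INT16_MAX = 32767
INT16_MIN = -32768


class EncodedRecord(NamedTuple):
    codec_type: int       # which combination of encodings was applied
    data_length: int      # number of values after decoding
    codec_parameter: int  # for coordinates: divisor back to floating point
    payload: bytes        # the encoded values


def parse_record(raw: bytes) -> EncodedRecord:
    """Split a raw record into header fields and payload (assumed layout)."""
    if len(raw) < 12:
        raise ValueError("record is shorter than the 12-byte header")
    codec_type, data_length, codec_parameter = struct.unpack(">iii", raw[:12])
    return EncodedRecord(codec_type, data_length, codec_parameter, raw[12:])


def decode_coordinates(record: EncodedRecord) -> List[float]:
    """Undo recursive index, delta, and integer encoding, in that order."""
    count = len(record.payload) // 2
    packed = struct.unpack(f">{count}h", record.payload)

    # Recursive index decoding: sum pieces until a non-limit value ends them.
    deltas, acc = [], 0
    for v in packed:
        acc += v
        if v not in (INT16_MAX, INT16_MIN):
            deltas.append(acc)
            acc = 0

    # Delta decoding: running sum restores the integer coordinates.
    ints, total = [], 0
    for d in deltas:
        total += d
        ints.append(total)

    # Integer decoding: divide by the codec parameter.
    return [i / record.codec_parameter for i in ints]


# Example record: header (codec 10, 3 values, divisor 1000) plus 4 packed values.
raw = struct.pack(">iii", 10, 3, 1000) + struct.pack(">4h", 32767, 100, 2001, 1053)
record = parse_record(raw)
print(decode_coordinates(record))  # [32.867, 34.868, 35.921]
```

Note that Data Length (3) differs from the number of packed values (4), because recursive index encoding added one element.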
Table 2.
MMTF file types.
Fig 6.
Third-party software integration through MMTF APIs and web services.
The PDB archive can be accessed in MMTF format through RESTful web services. APIs available in common programming languages provide efficient access to the MMTF data. Third-party applications then access the data through the language-specific APIs.
Fig 7.
Comparison of the gzipped file sizes for the PDB archive (~127,000 entries) in PDBx/mmCIF, PDB, and MMTF formats as of March 2017.
About 500 large structures (> 99,999 atoms or > 62 chains) cannot be represented in the PDB format. They are, however, available as split PDB files (.tar.gz files) that take up about 2.7 GB, which is included in the reported PDB file size. For MMTF, we report the size of the all-atom representation (MMTF-full) and the reduced representation (MMTF-reduced).
Fig 8.
Comparison of BioJava load time for the PDB archive using different file formats.
Load time for the PDB archive (~127,000 entries) using the gzip-compressed PDBx/mmCIF, PDB, and MMTF formats. For MMTF, we report the load time for individual gzipped files as well as the load time for uncompressed Hadoop Sequence Files containing MMTF records in the full (all-atom, MMTF-full) and the reduced (MMTF-reduced) formats. For PDB file loading, about 500 large structures that cannot be represented in the PDB format (> 99,999 atoms or > 62 chains) were excluded.
Fig 9.
Comparison of the average load times for different file formats using three software libraries in three programming languages on a set of 1,000 random PDB entries.
Fig 10.
Comparison of the average load times per structure using the MMTF format for three structure sizes.
The benchmarks contain 100 structures each around the 25th, 50th, and 75th percentiles of the PDB size distribution: Q25 (2,309–2,313 atoms), Q50 (4,054–4,063 atoms), Q75 (7,862–7,885 atoms).
Table 3.
Average load time for the large PDB entry 3J3Q, which contains about 2.4 million atoms.
Fig 11.
Traversal of the structure hierarchy using the MMTF API.
These code snippets in (A) Java, (B) JavaScript, and (C) Python demonstrate how to load and decode an MMTF file (PDB ID 4CUP) from http://mmtf.rcsb.org and then traverse the hierarchical data structure (Models -> Chains -> Groups -> Atoms). The code shown here loops through the Model and Chain hierarchy. For each model, the model index is printed, and for each chain, the chainId, chainName, and number of groups (residues) are printed. The group and atom data can be traversed in an analogous fashion.
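The Model and Chain loop described here can be sketched in Python without network access. The dictionary below is a made-up stand-in for a decoded structure, not the output of the MMTF API, and its field names (`chainsPerModel`, `groupsPerChain`, `chainIdList`, `chainNameList`) are assumptions chosen to mirror the chainId, chainName, and group counts named in the caption. The sketch assumes the hierarchy is stored as flat per-chain arrays, so a running chain index is carried across models.

```python
from typing import Dict, List

# Hypothetical decoded structure: 2 models with 2 chains each.
toy_structure: Dict[str, list] = {
    "chainsPerModel": [2, 2],
    "chainIdList": ["A", "B", "A", "B"],
    "chainNameList": ["A", "A", "A", "A"],
    "groupsPerChain": [120, 35, 120, 35],
}


def describe_hierarchy(structure: Dict[str, list]) -> List[str]:
    """Walk Models -> Chains and collect one line per model and per chain."""
    lines: List[str] = []
    chain_index = 0  # position in the flat per-chain arrays
    for model_index, chain_count in enumerate(structure["chainsPerModel"]):
        lines.append(f"model {model_index}")
        for _ in range(chain_count):
            chain_id = structure["chainIdList"][chain_index]
            chain_name = structure["chainNameList"][chain_index]
            group_count = structure["groupsPerChain"][chain_index]
            lines.append(
                f"  chainId={chain_id} chainName={chain_name} groups={group_count}"
            )
            chain_index += 1
    return lines


for line in describe_hierarchy(toy_structure):
    print(line)
```

Groups and atoms could be walked the same way, with one running index per level.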
Table 4.
Applications that support the MMTF file format.
Fig 12.
Main applications of the MMTF file format.
(A) MMTF enables fast transfer and parsing with low client-side overhead for high-performance visualization in web-based viewers, particularly on mobile devices. (B) MMTF can be represented in “Big Data” formats, and its small size enables high-performance, in-memory analysis of the entire PDB archive using Big Data frameworks for parallel processing.