A Systematic Analysis of the Structures of Heterologously Expressed Proteins and Those from Their Native Hosts in the RCSB PDB Archive

Recombinant expression of proteins has become an indispensable tool in modern research. The large yields of recombinantly expressed proteins accelerate their structural and functional characterization. Nevertheless, some reports in the literature indicate that recombinant proteins can differ in structure and function from their native counterparts. More than 100,000 structures (from both recombinant and native sources) are now publicly available in the Protein Data Bank (PDB) archive, which makes it possible to investigate whether the RCSB PDB archive contains any proteins that have identical sequences but different structures. In this paper, we present the results of a systematic comparative study of the 3D structures of identical naturally purified versus recombinantly expressed proteins. The structural data and sequence information of the proteins were mined from the RCSB PDB archive. The combinatorial extension (CE), FATCAT-flexible and TM-align methods were employed to align the protein structures. The root-mean-square distance (RMSD), TM-score, P-value, Z-score, secondary structural elements and hydrogen bonds were used to assess structural similarity. A thorough analysis of the PDB archive yielded 517 pairs of native and recombinant proteins with identical sequences. No pair of proteins with the same sequence had a significantly different structural fold, which supports the hypothesis that proteins expressed in a heterologous host usually fold correctly into their native forms.


Introduction
Obtaining satisfactory yields of proteins for structure determination and functional characterization using recombinant DNA technologies is routine practice [1][2][3][4]. Protein production from natural materials requires a large quantity of the source organism, and only a small amount of protein can be obtained. When a new project requires purified proteins, the first consideration is usually how to obtain them in recombinant form. The ability to harvest sufficient quantities of the desired protein by recombinant technology makes such proteins widely available for biochemical characterization [5], commercial application [6] and industrial processes [7].
When using recombinant DNA technology, it is commonly accepted that eukaryotic expression systems are preferable for overexpressing the desired protein because they provide the correct post-translational machinery and molecular chaperones [8]. In practice, however, not all recombinant proteins are obtained from eukaryotic expression systems. For example, the well-established prokaryotic expression system Escherichia coli is used as a protein factory and has become the most popular expression platform because of its low cost, easy transformation and fermentation, and high protein yields [9]. Expression systems that differ from the native environment may result in differences in the structure and function of the target proteins [10].
The literature contains examples in which recombinant proteins exhibited differences in structure and function compared with their native forms [11]. For example, the crystal structures of native yeast fumarase (NY-fumarase) and its recombinant form (RY-fumarase) were determined independently by two separate laboratories. A comparison of the two crystal structures (both in space group P 42 21 2) was carried out. Apart from a point mutation that probably resulted from a PCR error, no significant conformational changes were observed in or around the mutated region. However, a somewhat large difference between the two crystal structures was observed in the D3 domains of the NY- and RY-fumarases, between residues Pro439 and Pro485 near the C-terminus. The most significant difference was found around residues K456 and G457 [11]. This result suggested that differences do exist between the naturally purified and recombinantly expressed structures.
Another report [12] unequivocally demonstrated conformational differences between native and recombinant horseradish peroxidase using tritium planigraphy data. The results showed that the recombinant enzyme is more compactly folded and more hydrophobic than the native one. A study of another enzyme, prolidase [13], showed in contrast that the recombinant form may not be completely folded. The recombinantly expressed prolidase had a higher specific activity and was slightly less thermostable than the native enzyme. This may be because the recombinantly expressed enzyme is not completely folded, and the additional flexibility may lead to enhanced catalytic activity. This conclusion is in accordance with a study of ovalbumin [14], in which circular dichroism revealed that the recombinant protein had a slightly less compact structure than its native form.
A difference in function between native and recombinant proteins is a good indication of a difference in structure, and such differences can also be found in the literature. For instance, when the efficacy of mannose-terminated glucocerebrosidase from native and recombinant sources was compared, the formation of IgG antibody was greater with the native source (40%) than with the recombinant source (20%) [15]. Another report showed that native hirudin had more pronounced effects on the expression of vascular endothelial growth factor (VEGF) and on random skin flap survival than recombinant hirudin in a venous congested rat model [16]. A study of fungal laccases also showed functional differences: the enzyme affinity and the redox potential were decreased in the recombinant source [17]. Such examples can also be found in the field of recombinant drugs. Recombinant human erythropoietin (rHuEPO) has been used successfully to treat the anaemia of chronic renal failure for decades. Between 1998 and 2001, however, 21 patients treated with rHuEPO were found to have developed neutralizing anti-erythropoietin antibodies. After withdrawal of the rHuEPO therapy, the antibodies decreased slowly in all cases, which indicates that the problem was related to the rHuEPO treatment. A comparison of native endogenous erythropoietin and rHuEPO revealed minimal differences in glycosylation and a slight difference in the sialic acid composition of the oligosaccharide groups, resulting in a functional difference from the native protein [18].
Although differences in conformation and function between recombinant and native proteins have been reported in the literature, the majority of studies still show that the differences are most likely negligible [19]. Structural biology has advanced greatly in recent years [9], and the RCSB PDB archive currently holds more than 100,000 macromolecular structures. The structural information already available in the archive offers a good opportunity for a systematic investigation that can shed light on differentially expressed and purified protein structures with identical amino acid sequences. To conduct the investigation, we mined the RCSB PDB archive for proteins with identical amino acid sequences from both native and recombinant sources and then compared the structures to see whether they are identical when their sequences are identical. The results showed that, in the RCSB PDB archive, the structures of proteins with the same sequence from the two different sources are virtually the same, which provides evidence to support the commonly held assumption that proteins expressed in a heterologous host usually fold correctly into their native forms.

The Data Source
To determine whether there are any differences between the 3D structure of a native protein (isolated from a native source) and its recombinant form (obtained from a genetically manipulated source), we performed a comparison using the structural information in the existing RCSB PDB archive [20]. The comparison was carried out as follows. First, the 3D structures of the native source protein and the recombinant source protein were downloaded from the RCSB PDB archive. Because the 3D structures of the same protein determined by X-ray diffraction (XRD), electron microscopy (EM) and nuclear magnetic resonance (NMR) are not exactly the same [21], we compared only the structures obtained by XRD (89.1% of the protein structures were determined by the X-ray method). Each downloaded RCSB PDB file contained only one chain. Second, the released structures from both the native and recombinant sources were extracted based on the following criteria: the structure name and chain length must be the same; the compared protein must be longer than 40 residues (because the RMSD100 formula only applies to alignments of structures that include more than 40 residues); the protein sequence identity must be 100% (extra residues at the beginning or end are tolerated, but in that case only the common fragment of both chains was considered); and the structure resolutions should be as close as possible. With this strategy, we downloaded the structures of 85% of the native source proteins and 75% of the recombinant source proteins, but in the end only 517 pairs of proteins met the criteria and were used for the structural comparison.
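The selection criteria above can be illustrated with a short Python sketch. This is not the script used in the study: the `Chain` record, its field names and the helper functions are hypothetical, and the simple containment test is only one assumed way of finding the common fragment when one chain carries extra terminal residues.

```python
from dataclasses import dataclass

MIN_LENGTH = 40  # the RMSD100 normalization is only applied above 40 residues


@dataclass
class Chain:
    """Hypothetical record for one downloaded chain."""
    pdb_id: str        # entry and chain identifier, e.g. "2R8S.H"
    name: str          # structure name as deposited
    sequence: str      # one-letter amino acid sequence
    resolution: float  # crystallographic resolution in angstroms
    source: str        # "native" or "recombinant"


def common_fragment(seq_a, seq_b):
    """Return the shared fragment if one sequence contains the other
    (extra terminal residues are tolerated), otherwise None."""
    short, long_ = sorted((seq_a, seq_b), key=len)
    return short if short in long_ else None


def is_valid_pair(a, b):
    """Apply the name, source, identity and length criteria to two chains."""
    if a.name != b.name or {a.source, b.source} != {"native", "recombinant"}:
        return False
    fragment = common_fragment(a.sequence, b.sequence)
    return fragment is not None and len(fragment) > MIN_LENGTH


def best_partner(native, candidates):
    """Among valid recombinant partners, pick the one closest in resolution."""
    valid = [c for c in candidates if is_valid_pair(native, c)]
    if not valid:
        return None
    return min(valid, key=lambda c: abs(c.resolution - native.resolution))
```

The containment test does not cover the case in which both chains have different overhangs at opposite ends; a real pipeline would need a proper pairwise alignment for that.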

Protein Structure Alignment
Many excellent servers are available for protein structure comparison, including CE [22], FATCAT-flexible [23], TM-align [24], DALI [25], VAST [26], STRUCTAL [27] and DeepAlign [28]; more information and the details of each method can be found in systematic reviews [29,30]. In the current work, only three widely used methods (CE, FATCAT-flexible and TM-align) were employed to align the protein structures.
Each of the three methods has its own online task submission server, and the latest released version of each method can also be downloaded and run locally. When comparing large volumes of data, it is better to install the downloaded code together with a local copy of the RCSB PDB to increase speed and save time. These servers are also integrated into the Sequence & Structure Alignment module of the RCSB PDB website (http://www.rcsb.org/pdb/secondary.do?p=v2/secondary/analyze.jsp#Sequence) [31]. Given that we did not use a large amount of data, we completed our structural alignment work online within the RCSB PDB website [32]. A script was written to batch submit the structure pairs for alignment and to collect the structural similarity estimators from the output pages.

Secondary Structural Element Alignment
The secondary structures were also used to compare the structures from the native and recombinant sources. For this purpose, the widely used DSSP method was used to assign the protein secondary structure [33,34]. We considered only three types of backbone conformation: helix (3₁₀ helices, α-helices and π-helices), sheet (β-sheet and β-bridge) and loop (any other type). Not all of the structure pairs were subjected to secondary structural element alignment; only those pairs with higher mean RMSD100 values and lower TM-scores were analyzed.
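The three-state reduction described above can be sketched in Python as follows. DSSP assigns one-letter codes per residue (H, G, I for α-, 3₁₀- and π-helices, E and B for β-strand and β-bridge, T and S for turn and bend, and a blank for everything else); the function names and the use of "-" for loop are illustrative choices, not part of DSSP.

```python
HELIX_CODES = set("HGI")  # alpha-helix, 3-10 helix, pi-helix
SHEET_CODES = set("EB")   # extended strand, isolated beta-bridge


def reduce_dssp(dssp_codes):
    """Collapse a string of DSSP codes to three states: H, E and '-' (loop)."""
    reduced = []
    for code in dssp_codes:
        if code in HELIX_CODES:
            reduced.append("H")
        elif code in SHEET_CODES:
            reduced.append("E")
        else:
            reduced.append("-")  # turns, bends, blanks and any other code
    return "".join(reduced)


def composition(three_state):
    """Return the fraction of residues in each of the three states."""
    if not three_state:
        raise ValueError("empty secondary structure string")
    total = len(three_state)
    return {label: three_state.count(label) / total for label in "HE-"}
```

For example, `reduce_dssp("HHGGIIEEBTS -")` yields `"HHHHHHEEE----"`.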

Hydrogen Bonding Calculation
Because hydrogen bonding energy is more essential to the stabilization of the protein structure than any other backbone-backbone interaction force, we calculated the number of backbone-backbone hydrogen bonds and their energies for all of the protein structures from native and recombinant sources. The hydrogen bonding energies were calculated by the DSSP program. As defined by the DSSP method, a hydrogen bond exists when the bond energy is below -0.5 kcal/mol. We calculated all of the hydrogen bonding energies and the numbers of hydrogen bonds in all of the native and recombinant structures compared.
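As a rough illustration of the criterion, the Python sketch below evaluates the electrostatic hydrogen bond energy of the Kabsch and Sander (DSSP) model for one backbone C=O group and one backbone N-H group. The function and argument names are hypothetical, coordinates are assumed to be (x, y, z) tuples in ångströms, and the amide hydrogen position is assumed to be supplied, since crystal structures usually lack hydrogen atoms and DSSP places them itself.

```python
import math

Q1 = 0.42            # partial charge on C and O, in electron charge units
Q2 = 0.20            # partial charge on N and H, in electron charge units
F = 332.0            # factor giving kcal/mol when distances are in angstroms
HBOND_CUTOFF = -0.5  # kcal/mol; a bond is counted below this energy


def hbond_energy(c, o, n, h):
    """Electrostatic energy (kcal/mol) between a C=O group and an N-H group."""
    r_on = math.dist(o, n)
    r_ch = math.dist(c, h)
    r_oh = math.dist(o, h)
    r_cn = math.dist(c, n)
    return Q1 * Q2 * F * (1 / r_on + 1 / r_ch - 1 / r_oh - 1 / r_cn)


def is_hbond(c, o, n, h):
    """Apply the -0.5 kcal/mol cut-off used by DSSP."""
    return hbond_energy(c, o, n, h) < HBOND_CUTOFF
```

For an idealized linear arrangement with an O···N distance of 2.9 Å, this expression gives an energy close to -3 kcal/mol, which is well below the cut-off.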

Backbone O-Atom and Backbone N-Atom Hydrogen Bonds Contact Matrix
The hydrogen bonds of the overall aligned structures were analyzed. The contact matrices of the hydrogen bonds between the backbone O-atoms and backbone N-atoms of some selected structure pairs were also plotted. These parameters can be retrieved from the WHAT IF Web Interface [35]. This server can calculate the contacts between all of the atoms in a submitted RCSB PDB file. In this work, we set the contact distance to 1 Å. The results returned all of the contacts (backbone to backbone, backbone to side chain and side chain to side chain). Only backbone to backbone contacts were considered, and the other contacts were disregarded.

The Data Set
More recombinant source structures than native ones were deposited in the RCSB PDB archive, and the released structures were sorted by species. To obtain all possible pairs of structures from both native and recombinant sources, we first downloaded the native source data and then downloaded the corresponding recombinant source data. Because not every protein in the archive had both a recombinant source structure and a native source structure, we downloaded 85% of the native source structures and 75% of the recombinant source structures from the archive. We then filtered the data using the screening criteria mentioned in the Materials and Methods section and collected 517 pairs of structures that met our criteria. Of these, the recombinant proteins of 336 pairs (65%) were expressed in a prokaryotic host (Escherichia coli) and those of 118 pairs (23%) were obtained from a eukaryotic host. For some entries in the PDB archive, the expression host information is not available; the details are listed in S1 Dataset. These pairs of structures were submitted to online servers for structural comparison using the CE, FATCAT-flexible and TM-align methods. Lastly, we obtained the structural similarity estimators for the data analysis in the next step.

Global Comparison
When analyzing structural similarities, it is essential to obtain quantitative estimators, and every structural alignment method has its own. For the TM-align method, the TM-score is used as the similarity estimator, and the CE method employs the Z-score. The P-value is the similarity estimator of the FATCAT-flexible method, and the RMSD is an estimator common to all three methods. The details of these estimators are given below.
TM-Align Estimator: TM-Score. The TM-align method employs the TM-score as a structural similarity estimator. It is normalized so that the score does not depend on the structure size. The TM-score has standard thresholds. A TM-score = 1 means that the two compared structures are identical, while a TM-score > 0.5 indicates that the two compared structures have a similar fold. A TM-score < 0.17 implies that the structural similarity of the two structures is random [24]. Fig 1A shows the distribution of the TM-scores of the 517 pairs of structures. There was no pair in which the TM-score was < 0.17. Most (510 pairs, 98.6% of the total number) of the TM-scores were > 0.82. From the distribution of the TM-scores, we concluded that there was no clear structural difference between the structures from the native and recombinant sources. However, the TM-score is meant to address similarity among distant homologs, and TM-scores below 0.9 and above 0.5 identify similarly folded regions but can also reflect important differences. To be prudent, the pairs of identical proteins with scores below 0.82 were therefore evaluated further. The details are listed in Table 1, and the structural superpositions of those structure pairs were plotted. Fig 2 shows the structural superposition of the pair (2R8S.H and 3IVK.A) whose TM-score (0.648) is the lowest among all pairs. The figure shows that one domain deviates markedly from its counterpart while the other domain matches well. Apparently, the flexibility of the loop that connects the two domains results in the different conformations. When the two domains were superposed separately, they matched each other very well, indicating that although their relative positions are different, their folds are unchanged. In addition, the crystals obtained from the different sources have different space groups (C 1 2 1 for the native source and C 2 2 21 for the recombinant source). Furthermore, the crystal growth details are also different (the data are shown in S1 Table), which is consistent with the conclusion that protein conformers can be shifted by crystal packing arrangements when different crystallization conditions lead to different space groups [36]. The structural superpositions of the other six pairs are shown in S1-S4 Figs, and the conclusion is the same.
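For readers unfamiliar with the TM-score, the Python sketch below evaluates the standard TM-score expression for one fixed superposition, given the distances between aligned residue pairs. TM-align itself searches for the superposition that maximizes this value, which is not attempted here; the function names are illustrative.

```python
def d0(target_length):
    """Length-dependent distance scale (angstroms) of the TM-score."""
    if target_length < 19:
        # the scale is not positive for very short chains
        raise ValueError("target_length is too short for this expression")
    return 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, target_length):
    """TM-score of one superposition.

    distances: distances in angstroms between aligned residue pairs.
    target_length: length of the chain used for normalization.
    """
    scale = d0(target_length)
    total = sum(1.0 / (1.0 + (d / scale) ** 2) for d in distances)
    return total / target_length
```

Because each aligned pair contributes at most 1 and the sum is divided by the chain length, a perfect superposition of every residue gives 1, and residues that are unaligned or far apart contribute little. This is why a rigid shift of one domain relative to another lowers the score even when each domain on its own is unchanged.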
CE Estimator: Z-Score. The CE method employs the Z-score to assess structural similarity. In this method, a Z-score > 3.5 indicates that the two compared structures are significantly similar, and typically proteins with a similar fold have a Z-score of 3.5 or better. When the Z-score is < 2, the similarity of the compared structures is considered to lack statistical significance. Z-scores depend on protein size: smaller structures give smaller Z-scores. Fig 1B shows that, for the pairs analyzed, no pair had a Z-score below 3.5; all of the Z-scores were above 4.25. Thus, judging from the Z-score, we concluded that there was no clear structural difference between the two types of structures.
FATCAT-Flexible Estimator: P-Value. The P-value is the estimator used to assess structural similarity in the FATCAT-flexible method. The smaller the P-value, the higher the structural similarity. According to the FATCAT-flexible method, a P-value < 0.05 means that the two compared structures are significantly similar. The distribution of the P-values of the 517 pairs of structures is shown in Fig 1C, in which no P-value exceeds 0.05. Judging from the P-value, we concluded that there was no clear structural difference between the two sources.
The Common Estimator: RMSD and RMSD100. When comparing the global level of similarity of identical proteins from the native and recombinant sources, the root-mean-square distance (RMSD) and RMSD100 are commonly employed.
The RMSD is a measure of the average distance between the aligned atoms of two superposed proteins. The three structural alignment methods (CE, FATCAT-flexible and TM-align) all employ the RMSD as a structural similarity estimator. In general, smaller RMSD values are associated with protein structure pairs that have greater similarity. However, no exact RMSD cut-off has been reported for deciding how small an RMSD must be to prove that the compared structures are similar. RMSD values depend on several parameters, including (i) the crystallographic resolution of the compared protein structures; (ii) the length of the compared proteins and the fitted region of the aligned structures; and (iii) the definition of the RMSD in the different alignment algorithms. The RMSD is higher when one of the two compared crystal structures has a high resolution and the other a low resolution than when both crystal structures have very high resolutions [37]. Additionally, the length of the aligned protein chain plays an important role in the RMSD value. For example, two proteins with an RMSD of 2 Å are considered similar when the number of aligned Cα atoms is over 150, while the same value calculated between two Asp-His-Ser structures may occur by coincidence [38].
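As a minimal sketch, the RMSD over a set of aligned atom pairs can be computed in Python as shown below. The coordinates are assumed to be already superposed by the alignment program, so no optimal rotation is searched for here, and each method may define the set of aligned atoms differently.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square distance between two equally long lists of
    (x, y, z) coordinates that have already been superposed."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))
```

Because the squared distances are averaged over the aligned atoms only, the value depends on which and how many residues an algorithm chooses to align, which is one reason the three methods report different RMSDs for the same pair.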
As expected, the RMSDs calculated by the different alignment methods were not exactly the same, and the differences between them were not negligible. To minimize this bias, we normalized the RMSD to RMSD100, and the RMSD100 was used to analyse the similarity between the native and recombinant structures.
RMSD100 is the RMSD normalized to 100 residues to minimize protein size biases [39]. The RMSD100 is defined by equation (1):

$$\mathrm{RMSD}_{100} = \frac{\mathrm{RMSD}}{1 + \ln\sqrt{N/100}} \qquad (1)$$

where N is the number of amino acid residues and the RMSD is the value calculated by the different methods (CE, FATCAT-flexible and TM-align). Table 2. The RMSD values obtained from the FATCAT-flexible method tended to be slightly smaller, but the mean value was larger. This is unsurprising, because the FATCAT-flexible method superposes the equivalent residues in a "flexible" mode and is therefore better optimized than the "rigid" mode of the TM-align and CE methods. When the RMSD was normalized to RMSD100, the mean value of the FATCAT-flexible method decreased by only 0.05 Å compared with the mean RMSD, while the mean values of the CE and TM-align methods decreased much more. For the CE method, the mean RMSD was 0.73 Å and the mean RMSD100 was 0.58 Å. For the TM-align method, the mean RMSD was 0.75 Å and the mean RMSD100 was 0.59 Å. Overall, the RMSD100 values showed little difference among the TM-align, FATCAT-flexible and CE methods, which shows that normalizing the RMSD to the RMSD100 can minimize protein size biases. For this reason, the mean of the RMSD100 values derived from the TM-align, FATCAT-flexible and CE RMSDs was calculated and used for further analysis. The results showed that only 1% (6/517) of the compared structure pairs had an RMSD100 value of more than 2 Å. For these pairs, we could not conclude from the RMSD100 alone whether a structural difference exists between the structures of proteins obtained from native and recombinant sources. Therefore, we calculated the secondary structural elements of the six pairs of structures that had RMSD100 values of more than 2 Å. The results are listed in Table 2.
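The normalization can be illustrated with a small Python function. It assumes the commonly used form RMSD100 = RMSD / (1 + ln sqrt(N/100)), where N is the number of residues; the function name and the explicit length check are illustrative and follow the 40-residue limit stated in the selection criteria.

```python
import math


def rmsd100(rmsd, n_residues):
    """Normalize an RMSD (angstroms) to the value expected for 100 residues.

    Assumes RMSD100 = RMSD / (1 + ln(sqrt(N / 100))), which is only
    meaningful for alignments longer than 40 residues.
    """
    if n_residues <= 40:
        raise ValueError("RMSD100 applies only to more than 40 residues")
    return rmsd / (1.0 + math.log(math.sqrt(n_residues / 100.0)))
```

With this form, a chain of exactly 100 residues keeps its RMSD unchanged, chains longer than 100 residues are scaled down, and chains shorter than 100 residues are scaled up.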
Secondary Structural Elements. As shown by the protein alignment estimators described above (RMSD100, P-value, TM-score and Z-score), there is no distinct discrepancy between the native source proteins and the recombinant source proteins. To be prudent, the secondary structural element content was also analyzed to compare the native and recombinant structures. For this purpose, the widely used DSSP method was used to assign the secondary structure. We considered only three types of backbone conformation: helix (3₁₀ helices, α-helices and π-helices), sheet (β-sheet and β-bridge) and loop (any other type). Not all of the compared structures were subjected to secondary structure alignment; only the structure pairs with RMSD100 values of more than 2 Å and TM-scores of less than 0.8 were analyzed. The results are shown in Table 3. Table 3 shows that the secondary structural element content of the listed structure pairs differs and that there is no trend among them. Three pairs (1WDN.A-1GGG.A, 2AVY.U-3UOQ.U and 2R8S.H-3IVK.A) were listed repeatedly. Because of this, their detailed secondary structural elements were analyzed by the DSSP method. The results are shown in Fig 4, Fig 5 and Fig 6. Fig 4 shows that the native source structure 2AVY.U is a predominantly α-helical protein, while the recombinant source structure 3UOQ.U forms a β-sheet at positions two and three. The other positions are also α-helices, but they are located at different sites compared with the native source. Overall, the two structures show some differences (61% common fragments). Fig 5 shows that the overall conformation fits well, with an overall structural similarity of 94%. Apart from the very beginning and the end, the recombinant source structure lacks three residues, which are marked with asterisks. In addition, the native source structure has more α-helices, while the same residues form a loop in the recombinant structure, which indicates that the native structure is much more stable than the recombinant one. Fig 6 shows that the percentage of overall common elements is 87.2%. Although we can observe some differences, the common fragments are the same. The differences may be due to other reasons. One possible reason is that the X-ray diffraction resolutions differ between the sources. Another may be the processing by the software that calculated the final results. At this point, we also cannot exclude the possibility that the structures of those pairs are truly different.

Local Details
Hydrogen bond. To further analyze whether those three pairs of structures (1WDN.A-1GGG.A, 2AVY.U-3UOQ.U and 2R8S.H-3IVK.A) are different, we calculated the number of backbone-backbone hydrogen bonds in all of the protein structures determined from native and recombinant sources, because hydrogen bonding energy contributes more to stabilizing the protein structure than any other backbone-backbone interaction force. The DSSP program was employed to calculate the backbone-backbone hydrogen bonding energies.
The numbers of backbone-backbone hydrogen bonds are plotted against the DSSP H-bonding energy at the different cut-offs given by the DSSP program in Fig 7. Fig 7A shows the plot for all of the compared structure pairs (517 pairs). The overall H-bonding energy distributions of the native source and recombinant source structures coincide well. In detail, the number of H-bonds in the native source at cut-offs from -3.5 kcal/mol to -2.5 kcal/mol is slightly higher than that in the recombinant source, while that from -2.0 kcal/mol to -1.0 kcal/mol is much lower, suggesting that the structure from the native source is much more stable than the recombinant one, because a strong hydrogen bond has an energy of approximately -3 kcal/mol [40]. Fig 7B shows the same plot for the three structure pairs mentioned above (1WDN.A-1GGG.A, 2AVY.U-3UOQ.U and 2R8S.H-3IVK.A). The result is the same as that in Fig 7A. Overall, the hydrogen bond numbers plotted against the hydrogen bonding energy for the native and recombinant sources fit well with each other, which means that there is no significant difference between the native and recombinant sources.
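The counting behind a plot of this kind can be sketched in Python as follows. The sketch assumes that a list of backbone-backbone hydrogen bond energies (kcal/mol) has already been extracted from the DSSP output, and that each cut-off counts the bonds at or below that energy; whether the published plot uses cumulative counts or separate bins is not stated, so this choice and the function name are assumptions.

```python
def count_hbonds_by_cutoff(energies, cutoffs):
    """For each energy cut-off, count the hydrogen bonds whose energy is
    at or below it (more negative energies mean stronger bonds)."""
    return {
        cutoff: sum(1 for energy in energies if energy <= cutoff)
        for cutoff in cutoffs
    }
```

Computing the same counts for the native and the recombinant structure of a pair and plotting them side by side gives a simple view of whether one structure has more strong hydrogen bonds than the other.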
The overall hydrogen bond numbers and hydrogen bond energies from the native and recombinant sources fit well, but the secondary structural elements seem to show a slight difference. We plotted the contact matrices for the hydrogen bonds between the backbone O-atoms and the backbone N-atoms for two pairs of structures, 1WDN.A-1GGG.A (Fig 8A) and 2R8S.H-3IVK.A (Fig 8B). The numbers of hydrogen bonds in 2R8S.H and 3IVK.A were not the same: the native structure has more hydrogen bonds, but all of the positions of the hydrogen bonds in the recombinant structure 3IVK.A fit well with those of the native one. Differences in the number of hydrogen bonds between structures of the same protein solved by different methods have also been reported: when the same protein structure is solved by X-ray crystallography and by NMR, the number of hydrogen bonds identified in the X-ray structure is greater than that in the NMR structure [21]. From Fig 7C and 7D, it can be observed that the contact matrices for the backbone O-atom to backbone N-atom hydrogen bonds of 1CSR.A-1CSI.A and 1BZ0.B-1BZ1.B match 100%. Thus, we can conclude that there is no significant difference between the native and recombinant sources.
Fig 6. Comparison of the different secondary structural element fragments of 2R8S.H (native source) and 3IVK.A (recombinant source). Helices (3₁₀ helices, α-helices and π-helices) are labelled H and colored red; E stands for a sheet (β-sheet and β-bridge) and is colored green. The short dash represents a loop (any other type).

Discussion
The growth and improvement of the RCSB PDB have substantially enriched the collection of protein structures from both native and recombinant sources over recent decades. As of Jun. 17, 2016, a total of 119,635 structures had been deposited. The large number of structures determined from both native and recombinant sources makes structure comparison possible on a large scale. In this study, we searched the RCSB PDB archive for all proteins that had structural data from both native and recombinant sources and compared their 3D structures to determine whether there are any notable differences between them. We did not find any protein pairs that show a notable difference in protein fold; only conformational shifts in loop regions were found for some pairs. This result indicates that, in all of the compared cases, the recombinant proteins folded correctly into their native forms.
As expected, and in principle, the structures of the same protein obtained from native and recombinant sources should share the same fold and conformation within appropriate "errors". If they do not match each other, there should be reasons that account for the difference. An earlier study carried out an extensive analysis of the structural differences within pairs of crystal and NMR structures of the same protein [41]. Structural superposition and the distributions of atomic positions relative to a mean structure were employed to analyze the differences. The results showed that the backbone RMSD of the crystal structure is larger than the average RMSD of the NMR ensemble, and that the observed structural differences, which arise from variability in loops, are likely associated with physical factors (structure determination protocols, structure quality and structure determination conditions), methodological factors (the methodology used to determine the structural models), or a combination of the two. Other studies have also addressed this issue. Crystal and NMR structures of the same protein were compared using both global features (RMSD and secondary structural elements) and local details (hydrogen bonding), and the results suggested that conformational differences arise in loop regions because of differences in crystal packing and in the mathematical treatment of the experimental results [21,42].
In our current work, a thorough examination showed that the total number of proteins with structures from both native and recombinant sources in the RCSB PDB archive is small. Only 517 proteins (1,034 structures) in the current RCSB PDB archive met the criteria (mentioned in Materials and Methods). Compared with the total number of structures already deposited in the RCSB PDB archive (119,635 structures on Jun. 17, 2016), 1,034 is only the tip of the iceberg. Among this small number of structures, we did not find any notable differences between the structures of the recombinant and native proteins. After comparing the native and recombinant structures with different structure alignment software, we found that most structure pairs superposed well according to the cut-offs of the similarity estimators, but some structure pairs fell beyond the cut-offs (e.g., TM-score < 0.82 and RMSD100 > 2 Å). To further examine what differences exist between these pairs, we used the secondary structural elements and hydrogen bonding for comparison. The results showed slight differences in their secondary structural elements and numbers of hydrogen bonds, but the distributions of the hydrogen bonding energy and the positions of the hydrogen bonds coincided well. On further examining the diffraction quality, we found that the larger the difference in resolution between the native and recombinant structures, the larger the differences in their secondary structural elements and hydrogen bond numbers. Thus, we attributed the differences to the poor diffraction quality of the compared structure pairs. This is consistent with the conclusion that the difference is larger when one of the two compared crystal structures has a high resolution and the other a low resolution than when both crystal structures have very high resolutions [37].
For more detail, we plotted the structural superpositions of the structure pairs beyond the similarity estimator cutoffs in Fig 2 and S1-S4 Figs. The structures clearly deviate in the loop regions that connect the domains. The flexibility of these loops leads to different conformations and hence to "larger" RMSD100 values and "lower" TM-scores. When the domains are superposed separately, they align very well and the TM-scores are high. These observations show that variation in the loops between domains accounts for the conformational discrepancy.
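The effect of a flexible inter-domain loop on a global score can be illustrated with a toy model. The sketch below is a two-dimensional illustration under stated assumptions, not the CE, FATCAT or TM-Align procedure used in this study: two rigid point sets stand in for domains, and the second is rotated about a hypothetical hinge point. All coordinates, the hinge position and the 60 degree angle are invented for illustration.

```python
import math


def superposed_rmsd(a, b):
    """RMSD between equal-length 2D point lists after optimal rigid superposition."""
    n = len(a)
    ax, ay = sum(p[0] for p in a) / n, sum(p[1] for p in a) / n
    bx, by = sum(p[0] for p in b) / n, sum(p[1] for p in b) / n
    # Remove the translation by centering both sets on their centroids.
    ca = [(x - ax, y - ay) for x, y in a]
    cb = [(x - bx, y - by) for x, y in b]
    # Closed-form optimal rotation angle in 2D (least-squares fit of a onto b).
    dot = sum(xa * xb + ya * yb for (xa, ya), (xb, yb) in zip(ca, cb))
    cross = sum(xa * yb - ya * xb for (xa, ya), (xb, yb) in zip(ca, cb))
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    sq = 0.0
    for (xa, ya), (xb, yb) in zip(ca, cb):
        rx, ry = c * xa - s * ya, s * xa + c * ya
        sq += (rx - xb) ** 2 + (ry - yb) ** 2
    return math.sqrt(sq / n)


def rotate_about(points, pivot, angle_deg):
    """Rotate points rigidly about a pivot, mimicking a hinge motion."""
    t = math.radians(angle_deg)
    c, s = math.cos(t), math.sin(t)
    px, py = pivot
    return [(px + c * (x - px) - s * (y - py),
             py + s * (x - px) + c * (y - py)) for x, y in points]


# Two rigid "domains" (small grids of points); values are illustrative only.
domain1 = [(float(i), float(j)) for i in range(5) for j in range(3)]
domain2_a = [(x + 8.0, y) for x, y in domain1]
# Second conformation: domain 2 swings 60 degrees about a hinge between the domains.
domain2_b = rotate_about(domain2_a, pivot=(6.0, 1.0), angle_deg=60.0)

global_rmsd = superposed_rmsd(domain1 + domain2_a, domain1 + domain2_b)
domain1_rmsd = superposed_rmsd(domain1, domain1)
domain2_rmsd = superposed_rmsd(domain2_a, domain2_b)
```

In this toy case each domain superposes onto its counterpart with an RMSD of zero to within rounding, whereas the global RMSD is clearly nonzero, which mirrors the behavior of the multi-domain pairs described above.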
It is well known that proteins populate multiple conformational sub-states in their native environment; crystallization selects only the lowest-energy conformers for the molecular arrangement in the lattice. Crystal structures of the same protein in different crystal forms have been found to show significant differences [43,44]. In our study, 7 structure pairs have a TM-score smaller than 0.82. All of these pairs were obtained from crystals grown under different crystallization conditions (S1 Table), which resulted in different space groups (Table 1). We therefore concluded that the conformational differences are caused by the different crystallization environments. We also noticed that, among the 517 structure pairs, 48.5% (252 pairs) have different space groups (S1 Dataset), implying that most structure pairs still match well even when their space groups differ.
Ligand binding is central to many essential protein functions (enzyme catalysis, drug action and receptor activation). Protein conformation generally shifts between the unbound and bound states, or between states bound to two different ligands [45,46]. Ligand binding should therefore be taken into account when discussing conformational shifts. Our results show that 3 of the 7 low TM-score (< 0.82) structure pairs contain bound ligands, indicating that ligand binding is an alternative explanation for the conformational shift.
Furthermore, post-translational modifications (PTMs) are known to be essential for diversifying protein functions [47]. Even a minimal modification can cause local distortions of protein structure [48]. PTMs must therefore be considered when comparing structural differences. As of Jun. 17, 2016, 37,782 nonredundant proteins with experimentally verified PTM sites had been deposited in dbPTM (http://dbptm.mbc.nctu.edu.tw/download.php). Unfortunately, none of the 517 structure pairs in our study has experimentally verified PTM sites, so we could not investigate the effects of PTMs on structural differences. Nevertheless, we cannot exclude the possibility of conformational shifts induced by PTMs.
Last but not least, in macromolecular crystallography a protein is usually assumed to adopt a single structure unless there is sufficient evidence for alternative conformations. As a result, different conformers of the same protein may be deposited in the RCSB PDB by chance. It has been shown that a majority of proteins in the RCSB PDB have multiple conformational states [49]. It has therefore been suggested that an ensemble of models, rather than a single model, is better suited to represent a protein crystal structure [50]. Statistically, only 25% of high-resolution structures in the RCSB PDB represent a single conformer; the remaining 75% have at least 2 conformations, with an RMSD above 0.6 Å [49]. The deposition of different conformations of the same protein in the RCSB PDB is thus another possible factor.
Many studies have shown that differences between recombinant and native proteins do exist. Eukaryotic proteins produced in a prokaryotic expression host unquestionably lack the post-translational modifications. However, in this work, the examination of 517 pairs of native and recombinant structures revealed no major structural differences. Although there are "exceptions", their corresponding individual domains superpose very well, and the structural deviation is mainly caused by the flexible loops connecting the individual domains. This deviation can be regarded as a conformational shift, which can result from poor diffraction resolution, different space groups, different crystallization conditions, the presence or absence of bound ligands and possible PTMs. The absence of major differences may have two explanations. One (probably the most important) is that most recombinant proteins do fold correctly in various expression systems. The other is that solving the three-dimensional structure of a target protein by X-ray crystallography usually requires an existing template for structure modeling; based on this template, the two structures will be almost the same after modeling. These two reasons would certainly reduce the possibility of having too many duplicate structures in the RCSB PDB database.
The structures compared in this study represent only a small portion (approximately 1%) of the whole RCSB PDB archive, and results obtained from such a small sample may not represent all possible cases. Although we did not find any pair showing notable differences between the recombinant and native structures, exceptions have been reported. Thus, we cannot exclude the possibility that a recombinant protein may fold differently from its native form. It may be that structural distortions of recombinant proteins do occur in some cases, but that such proteins fail at the expression, purification or crystallization stage and so are never observed. Also, even if a different result were obtained, it would rarely be reported without strong proof. Finally, when crystallographic structural data are deposited into the RCSB PDB, the deposition usually follows the constraints of the existing software tools; if the raw structural data are complex, some of the associated uncertainty information may be discarded. It has been suggested that only minor procedural modifications would be required at the level of deposition, and that results derived from the raw data would then be more convincing [50]. Judging from our current results (no compared protein pair showed a clear structural difference), we conclude that, for the structures compared, the recombinant expression systems introduce no structural bias relative to the native environment.

Conclusions
With the rapid development of genetic engineering technology and its convenience for characterizing protein structure and function, an increasing number of proteins have been obtained using various expression systems. Because there are reports suggesting that the recombinant structure of a protein may differ from its native one, we cannot exclude the possibility that a recombinant protein differs structurally from its native form. Although misfolded exceptions exist and some proteins have different 3D structures despite identical sequences, it is unclear how prevalent these exceptions are. Therefore, in this study, we employed three widely used protein structure alignment methods (CE, FATCAT-flexible and TM-Align) for a global comparison of the structures of identical proteins obtained from native or recombinant sources. Structure similarity was assessed by RMSD, TM-score, P-value and Z-score for global comparison. The secondary structural elements and hydrogen bonds were then used to probe the local details of the compared structures. A total of 517 pairs of native and recombinant protein structures were culled from the RCSB PDB archive, and the structures of each pair were compared one by one. The alignment results showed no significant difference in the 3D structures of the proteins in any of the compared pairs. No example of a structural difference between native and recombinant proteins was found in the existing RCSB PDB archive, which supports the commonly held assumption that expression in a heterologous host usually does not influence the structure and function of the target protein.
Supporting Information
S1 Dataset. The dataset used in the manuscript.
S1 Table. The crystal growth details of the structure pairs with TM-score < 0.82. (PDF)