Principal component analysis of coronaviruses reveals their diversity and seasonal and pandemic potential

Coronaviruses and influenza viruses have similarities and differences. In order to comprehensively compare them, their genome sequencing data were examined by principal component analysis. Coronaviruses had fewer variations than a subclass of influenza viruses. In addition, differences among coronaviruses that infect a variety of hosts were also small. These characteristics may have facilitated the infection of different hosts. Although many of the coronaviruses were conservative, those repeatedly found among humans showed annual changes. If SARS-CoV-2 changes its genome like the Influenza H type, it will repeatedly spread every few years. In addition, the coronavirus family has many other candidates for new pandemics.

I am afraid there was a misinterpretation. HCoVs are conservative (Fig. 3A) and, with the current data, the evolutionary rate of SARS-CoV-2 cannot be estimated. Coronaviruses have not evolved rapidly when compared with influenza (Fig. 1, S2A Fig.). HCoVs are more conservative than flu viruses (Fig. 3A). The annual change is very small (Fig. 3B) in comparison with that of influenza H1N1. To clarify this, I added a Supporting figure (S2G Fig.) where the difference in the evolutionary rate is very clear.
>12. Line 128: Please revise this sentence for aptness. Human outbreak strains must be replaced with more appropriate scientific terminology, for eg., Human CoVs of epidemic potential or epidemic human CoVs.
I replaced the term accordingly.
I replaced the wording accordingly.
>14. The scientific aptness for usage of words needs to be carefully proof-read throughout the manuscript.
Unfortunately, English is not my first language and so, as was stated, the manuscript was professionally edited for language before submission. I have ordered a further round of English editing. I hope that this improves the manuscript.
>15. Lines 135-136: Can this statement be supported by any other epidemiological study (studies based on respiratory clinical specimens and sero-surveillance) reference?
Unfortunately, no. The differences in the viruses are not detectable by ordinary clinical or serum analysis. Only DNA sequencing can reveal differences. However, for a long time, we used the wrong methods for sequence analysis; this might have obscured many obvious pieces of evidence. I refer to a new figure added in the Supporting Information.
>16. Lines 138-139: Kindly re-frame the sentence for better clarity as this seems one of the important observations of the study. For eg., what "variety" is being referred to here?
I have added explanations to clarify what was selected. I wish to thank the reviewer for the comment. I found that the European sample (Normandy, France) was reported in 2017, but it was collected in 2002. I corrected Fig. 2A and 2B accordingly.
>17. Even though the data is depicted in figures, the text should maintain clarity of thoughts of the authors.
I have ordered a further round of English editing. I hope that this improves the manuscript.
>18. Line 147: contrastive to be replaced by "in contrast".
I replaced the wording accordingly.
>19. The manuscript needs to be extensively revised for usage of English language in order to aptly express the important scientific findings of this work.
I have ordered a further round of English editing. I hope that this improves the manuscript.
>20. Line 152: "Emvecovirus" to be corrected "Embecovirus".
I replaced the wording accordingly.
>21. Lines 159-162: Important statement, but requires revision for clarity. For instance, it can be "The spike protein mediated infection by CoVs…more adaptable…"
I replaced the wording accordingly.
>22. Line 176-177: "Adaptation to the genetic system of a new host may alter codon usage and several amino acids" needs reference.
I added a reference.
>24. Line 188: "variable residues during three decades" Which residues are being talked about and the time span? If possible, elaborate a bit more.
I added an explanation for the residues and why those residues seem to have changed.
>25. Line 190: "infectibility" or infectivity?
I wish to thank the reviewer. It was infectibility.
>26. Line 191: "influenza A did." To be replaced with "influenza A virus did."
I replaced the wording accordingly.
>27. Line 198: "The ORF lengths for the influenza virus are within a certain range..." What is this range? Add a reference also. This comparison of ORFs must be further expanded for better understanding of the readers.
I have added a reference and explanations.
28. Line 203: "Spike protein… and may cause antibody-dependent enhancement". Why and how? Must be elaborated here with proper references.
I wish to thank the reviewer for the comment. Here, I cited a wrong reference. I corrected the reference and added a new one related to the phenomenon "antibody dependent enhancement."
29. The comparison of coronavirus and influenza viruses still requires to be addressed in a more extensive manner in relation to the present findings. For instance, lines 198-203: can an example of influenza virus protein be taken here for comparison?
I wish to thank the reviewer for this comment. Instead of taking a protein as an example, I explained how all proteins changed. In addition, I added information on the length of the ORFs to make the explanations tangible.
30. Lines 224-228: How can this issue be addressed. Elaborate a bit more and if possible include this in a separate section for conclusion.
I added an explanation for the methodology and explained the real classification of coronaviruses.
Reviewer #2:
1. Title of manuscript is not very apt for the content of manuscript. I would suggest to select more apt title.
I added the term "seasonal". The current title should be suitable for the summary and the concluding paragraph newly placed at the end of the discussion section.
2. Line 33-"Rather, those repeatedly found among humans showed annual changes". Cite a reference for this sentence.
I have cited a reference here.
3. Line 44-" a 2010-2015 study in China reported that 2.3% and 30% of patients were positive for coronavirus and influenza virus, respectively [6]; a similar ratio was found in another large study [6]." Please provide correct referencing for larger study mentioned in the sentence.
I wish to thank the reviewer for this comment. I apologise for the incorrect citation. I corrected it.
4. Full forms of abbreviations used are missing.
I have corrected this accordingly.
5. Materials and methods need extensive revision to indicate type of sample used, which all sequences were compared and analyzed. Software and tools used to compare and conclude the findings.
All the samples were examined using the same method, as has been written. I added an explanation at the beginning of the paragraph.
6. Line 104-110 sentences need to be reframed and structured for clarity of readers. It is not clear as a sentence talks about influenza and suddenly next sentence seems to be talking about coronavirus.
Throughout the Results section, what was discussed was coronavirus. Examples of influenza were used for comparisons with coronavirus. To clarify this, I inserted the term "coronavirus" in every sentence that contains the names of the subclasses in the Results section.
7. Line 114 "Similarly to other RNA viruses [5], many indels were observed, especially in some smaller ORFs" replace similarly with similar.
I replaced the wording accordingly.
8. Many long sentences have been used. For the clarity of readers reframe sentences.
I checked the length of all sentences. Some were compound sentences. Although they are not harder to understand, I changed those into single sentences (including sentences separated by semicolons). Some had a non-restrictive relative clause. Those are difficult to separate and should not cause reading difficulty. The average number of words per sentence was 15.37. This is within the range (12-17) that is asked for scientific writing (https://www.aje.com/arc/editing-tip-sentence-length/). I have ordered a further round of English editing; I hope this improves the readability.
9. Line 128 "Each of the human-outbreak strains had similar ones in bats or camels, with minor differences (Fig. 1 and S1 Table)". Please correct it grammatically
I altered the wording.
10. The word contrarily has been used many a times. Use other forms of the word.
The manuscript used the word "contrarily" three times. I have used "in contrast" four times, so maybe this should be avoided. I replaced two instances with "on the other hand" and "to the contrary."
11. Line 158-These characteristics corroborate the assessment that "coronaviruses can apparently breach cell type, tissue, and host species barriers with relative ease." Relative to what?
The authors of this article did not specify it. However, they compared coronaviruses with influenza or Ebola viruses. So, I added "other viruses."
12. Authors must clearly mention why they are comparing influenza and Coronavirus. How that may help in curbing present pandemic.
I added some information about influenza and quoted a phrase of Sun Tzu.
13. Figures should be discussed extensively in the result and discussion section.
I deleted the part of the Introduction section that briefly introduced the manuscript; this removed explanations about the Figures. Some text in the Materials and Methods only gives cues for understanding, so it is not used for discussion. I believe that this helps with readability.
14. Results need extensive discussion as at many places information is missing or is not very clear.
I have added some Supporting Figures. I hope that this will improve clarity.
15. Thorough revision of English language and sentence reformation is required.
I requested a professional re-editing of the manuscript. I hope that this will improve the English.
16. Reference number 15 and 17 need formatting
I deleted the stray "_" characters from both of them.

The corona and influenza viruses have similarities and differences in infectivity, spreadability, and symptomatology. These differences are based on their genomes, which are important for estimating how SCoV2 will act in humans.

[...] Supporting material (S1-S3 data).

To prepare a comprehensive data set for SCoV2, 2796 full-length sequences were obtained from the Global Initiative on Sharing All Influenza Data (GISAID) database [24] and added to those used for Fig. 1. Some records were preliminary and contained many uncertain bases (designated by "N"), which may be counted as indels. To cancel such artefacts, the corresponding regions were replaced with the average data, which removes the contribution of those bases from the results of the PCA. The list of subjected sequences is available in S3 data.
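The replacement of uncertain bases described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that aligned sequences are encoded as one indicator column per symbol and position; the function names `one_hot` and `mask_uncertain`, the toy sequences, and the encoding itself are hypothetical and are not taken from the manuscript. The point it shows is that an entry replaced by the column mean becomes zero after centring, so it no longer contributes to the PCA.

```python
import numpy as np

SYMBOLS = "ACGT-"  # assumption: '-' (indel) is encoded like any other base


def one_hot(aligned):
    """Encode equal-length aligned sequences as (n_samples, length, 5).

    Positions holding the uncertain base 'N' are marked with NaN.
    """
    n, length = len(aligned), len(aligned[0])
    x = np.zeros((n, length, len(SYMBOLS)))
    for i, seq in enumerate(aligned):
        for j, ch in enumerate(seq):
            if ch == "N":
                x[i, j, :] = np.nan
            else:
                x[i, j, SYMBOLS.index(ch)] = 1.0
    return x


def mask_uncertain(x):
    """Replace NaN entries with the mean of the known entries in that column."""
    col_mean = np.nanmean(x, axis=0)
    return np.where(np.isnan(x), col_mean, x)


# Hypothetical toy alignment: the second sequence has an uncertain base.
aligned = ["ACGT", "ACNT", "AC-T", "ATGT"]
x = mask_uncertain(one_hot(aligned))

# Centring on the mean makes the replaced entries exactly zero.
centred = x - x.mean(axis=0)
```

In this toy alignment the "N" in the second sequence sits at the column mean after replacement, so its centred value is zero and it adds nothing to any principal component.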

The coronaviruses separated into distinct classes (Fig. 1), which could be further divided into subclasses (S1 Fig.). For example, SARS-CoV and SCoV2 belong to different subclasses of Sarbecovirus (S1 Fig. and S1 Table).

The origin of the graph (0, 0) coincides with the mean data. The variation magnitude, estimated by the mean distance, was 0.11. This is much smaller than those of single subclasses of influenza A virus (S2 Fig.), such as H1 or H9. The value has been scaled so it has a kind of generality; actually, the mean distance was not significantly altered by artificial reductions of sample numbers or sequence length (not shown).

Among the classes of the coronaviruses, Gammacoronavirus and Deltacoronavirus, which are close to the origin of the graph, were mainly found in bird samples (Fig. 1 and S1 Table). These [...] affected by focusing on the indels or on the rest of the sequences (S4 Fig.). Therefore, indels were not given extra weight in this study; they were treated as a base or a residue. Note that some small ORFs, such as the envelope and nucleocapsid, are conservative and lack indels.

The values of PCs were not significantly affected by the hosts (Fig. 1). For example, differences between bird and swine coronaviruses in Deltacoronaviruses were small (S1 Table, PC18). This is in contrast to influenza viruses, which were separated among different hosts. For example, in influenza H1N1, the waterfowl class is near the centre, with three swine groups around it, and two human groups further apart (S2 Fig.) [5]. For coronaviruses, those that are more distant from Nobecovirus seem to infect larger animals, but this rule is not absolute (S1 Table).

Each of the human epidemic coronaviruses had similar viruses in bats or camels, although there were minor differences (Fig. 1 and S1 Table). In the SARS-CoV spike protein, no amino acid residue was unique to humans. This is partially because our knowledge about the viruses has increased after the efforts to screen for likely viruses in wild animals [2,25-27]. Only 35 out of 2412 residues were different between SARS-CoV and similar bat viruses, and many of these were not conserved among the bat samples (S2 Table). The situation was the same with SCoV2, which presented 34 unique amino acid residues (S2 Table); naturally, this uniqueness will be reduced after further research.

The yearly occurrence of influenza A H1N1 and HCoVs is very different, since only one H1N1 variety spreads worldwide yearly (S2G Fig.) [5], while several OC43 variants appear even within a single country (Fig. 3A, S5, and S3 Table). H1 variants will never return in the subsequent [...]

The coronavirus and influenza classes were fairly different in two aspects: although the groups were clearly separated, the differences did not match those of the hosts in coronavirus, and the divergence magnitude among coronaviruses was much lower than that of a subclass of the [...] conservative, and hence would be ideal targets for vaccines (Fig. 2). HCoVs are more conservative than influenza (Fig. 3A and S2G), but they also show annual changes (Fig. 3B) [...]

S1 Table. PC for samples of whole genome sequences of coronaviruses.

[...] Figure 1 [11]. Some viruses may have a much higher infectivity and cause outbreaks: e.g., [...]CoV [13,14], [...] [15,16], and SARS-CoV-2 (SCoV2) [1,3,17,18]. The former two cause severe symptoms, while the latter varies from asymptomatic to critical.

The coronaviruses separated into distinct classes (Fig. 1), which could be further divided into subclasses (S1 Fig.). For example, SARS-CoV and SCoV2 belong to different subclasses of Sarbecovirus (S1 Fig. and S1 Table).

The variation magnitude, estimated by the mean distance, was 0.11. This is much smaller than those of single subclasses of influenza A virus (S2 Fig.), such as H1 or H9. The value has been scaled so it has a kind of generality; actually, the mean distance was not significantly altered by artificial reductions of sample numbers or sequence length (not shown).
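As a rough illustration of the quantities mentioned above, the following Python sketch centres a numeric data matrix so that the origin coincides with the mean sample, obtains PC scores by singular value decomposition, and computes a mean distance of the samples from the origin. The scaling by the square root of the number of columns is only an assumption made here, intended to keep the value comparable across sequence lengths; the manuscript's own scaling is not specified in this excerpt, and the function names and random toy data are hypothetical.

```python
import numpy as np


def pca_scores(x):
    """Centre a (n_samples, n_features) matrix and return PC scores via SVD."""
    centred = x - x.mean(axis=0)  # the origin now coincides with the mean data
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u * s  # row i holds the PC scores of sample i


def mean_distance(scores, n_features):
    """Mean Euclidean distance of samples from the origin.

    Dividing by sqrt(n_features) is an assumed scaling, not the authors'.
    """
    return np.linalg.norm(scores, axis=1).mean() / np.sqrt(n_features)


# Hypothetical toy data: 6 samples described by 20 binary columns.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(6, 20)).astype(float)

scores = pca_scores(x)
d = mean_distance(scores, n_features=x.shape[1])
```

Because the scores are only a rotation of the centred data, the distance of each sample from the origin is the same whether it is measured on the centred columns or on the full set of PCs.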

Among the classes of the coronaviruses, Gammacoronavirus and Deltacoronavirus, which are close to the origin of the graph, were mainly found in bird samples (Fig. 1 and S1 Table).

Strains of human coronaviruses (HCoV) belong to Embecovirus, Alphacoronavirus, and Duvinacovirus (Fig. 1, blue). On the other hand, the strains of recent major epidemics were [...] Fig.). Therefore, indels were not given extra weight in this study; they were treated as a base or a residue. Note that some small ORFs, such as the envelope and nucleocapsid, are conservative and lack indels.

The values of PCs were not significantly affected by the hosts (Fig. 1). For example, differences between bird and swine coronaviruses in Deltacoronaviruses were small (S1 [...] waterfowl is located near the centre, with three swine groups located around it, and two human groups further apart (S2 Fig.) [5]. For coronaviruses, those that are more distant from Nobecovirus seem to infect larger animals, but this rule is not absolute (S1 Table).

Each of the human epidemic coronaviruses had similar viruses in bats or camels, although there were minor differences (Fig. 1 and S1 Table). In the SARS-CoV spike protein, no amino acid residue was unique to humans.

This is partially because our knowledge about the viruses has increased after the efforts to screen for likely viruses in wild animals [2,25-27]. Only 35 out of 2412 residues were different between SARS-CoV and similar bat viruses, and many of these were not conserved among the bat samples (S2 Table). The situation was the same with SCoV2, which presented 34 unique amino acid residues (S2 Table); naturally, this uniqueness will be reduced after further research.

The yearly occurrence of influenza A H1N1 and HCoVs is very different, since only one H1N1 variety spreads worldwide yearly (S2G Fig.) [5], while several OC43 variants appear even within a single country (Fig. 3A, S5, and S3 Table). H1 variants will never return in the subsequent seasons, whereas OC43 varieties appear repeatedly for a decade. However, by concentrating solely on one variety, the annual alterations became obvious (Fig. 3B). As an example, here a variety with a high PC1 value was selected. It was found in 1985-2000 in the USA and in 2002 in France (Fig. 3A). The speed of alteration is much slower than that of influenza H1N1 (S2G Fig.); as the PC values are scaled, the magnitudes can be compared directly.

[...] Deltacoronavirus and Embecovirus were exchanged (Fig. 4B). Additionally, the position of [...] (Fig. 4C). These drastic changes are difficult to explain without shifts.

The coronavirus and influenza classes were fairly different in two aspects: although the groups were clearly separated, the differences did not match those of the [...] transferred to other countries, and mutate further (S6 Fig.).
These changes may help acclimatisation to humans (c); however, they may also relate to the herd immunity (b) and/or lower lethality (d).

[...] 2 and S3). This protein might be too short to form a variable structure, making it a good target for herd immunity. These conservative ORFs might be suitable vaccine targets. In contrast, the spike protein tends to change (S3 Fig.), and may cause antibody-dependent enhancement ( [...] other than the spike protein will help with understanding the immune mechanisms.

Many bat coronaviruses seemed to be able to infect humans. The bat and human viruses are similar (Fig. 1) [...]