dbGSRV: A manually curated database of genetic susceptibility to respiratory virus

Human genetics has been proposed to play an essential role in inter-individual differences in the occurrence and outcomes of respiratory virus infection. To systematically understand human genetic contributions to respiratory virus infection, we developed dbGSRV, a manually curated database that integrates host genetic susceptibility and severity studies of respiratory viruses scattered across the literature in PubMed. At present, dbGSRV contains 1932 records of genetic association studies relating 1010 unique variants and seven respiratory viruses, manually curated from 168 published articles. Users can access the records by quick searching, batch searching, advanced searching and browsing. Reference information, infection status, population information, mutation information and disease relationship are provided for each record, along with hyperlinks to public databases so that users can conveniently access more information. In addition, a visual overview of the topological network relationship between respiratory viruses and associated genes is provided. Therefore, dbGSRV offers a convenient resource for researchers to browse and retrieve genetic associations with respiratory viruses, which may inspire future studies and provide new insights into our understanding and treatment of respiratory virus infection. Database URL: http://www.ehbio.com/dbGSRV/front/


Introduction
Respiratory viruses are viruses that enter through the respiratory tract and proliferate in respiratory mucosal epithelial cells, causing local infection in the respiratory tract or lesions in other organs [1]. Common human respiratory viruses include respiratory syncytial virus (RSV), rhinovirus, influenza virus, parainfluenza virus, human metapneumovirus, coronavirus, adenovirus and others [2]. Respiratory virus infection is one of the leading causes of human mortality and morbidity, posing a constant threat to public health and causing significant economic losses [3][4][5]. Of note, as of 25 August 2021, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a novel coronavirus that emerged in 2019 [6][7][8], had spread all over the world and caused over 213 million infections and 4.4 million deaths (https://covid19.who.int/).
Human responses to respiratory virus exposure range from no infection through asymptomatic, mild, moderate and severe disease to fatal outcome. The wide variations in susceptibility and severity are attributed not only to the different transmissibility and virulence of different virus strains, but also to host factors such as age, sex, premature birth, pregnancy, obesity and comorbidity [9][10][11]. Among host factors, host genetic background has attracted increasing attention in recent years [12,13]. Adoption, twin and heritability studies provided the first line of evidence [14][15][16], followed in recent years by candidate-gene studies, genome-wide association studies (GWAS), whole exome sequencing (WES) and whole genome sequencing (WGS) [17]. These studies revealed that human genetic variants play an important role in susceptibility to and severity of infection by altering the expression or function of genes, especially genes involved in the viral life cycle and in the host inflammatory and immune response [18][19][20][21]. These genetic association studies may help dissect the underlying mechanisms of viral pathogenesis and host antiviral defense and may contribute to future clinical risk prediction models, allowing individuals to be stratified according to risk so that those at high risk can be prioritized for immunization [22]. Although great progress has been made in this area, there is no database that systematically collects, formats, annotates, stores and displays studies of human susceptibility to and severity of respiratory virus infection. Searching for and reading related papers scattered across PubMed is time-consuming, which hinders convenient access to useful information.
Therefore, we present the first database of Genetic Susceptibility to Respiratory Virus (dbGSRV), which integrates published genetic studies relating to susceptibility and severity in respiratory virus infection. It contains 1932 records of genetic association studies relating 1010 unique variants and seven respiratory viruses, manually curated from 168 published articles. Comprehensive information about references, infection, samples, mutations and their relationships is available at http://www.ehbio.com/dbGSRV/front/. We anticipate that this resource will be a useful tool for researchers to query and retrieve genetic association studies of respiratory viruses.

Publications collection
We searched PubMed for publications that describe genetic associations with susceptibility to or severity of respiratory virus infections, using the keywords 'variant', 'polymorphism' and 'susceptibility' combined with the names of specific respiratory viruses: 'adenovirus', 'bocavirus', 'influenza', 'measles', 'MERS', 'metapneumovirus', 'mumps', 'parainfluenza virus', 'respiratory syncytial virus', 'rhinovirus', 'rubella', 'SARS' and 'SARS-CoV-2'. We selected these viruses because they are the most common respiratory viruses that infect humans. The search results were manually examined to retain only English-language publications that study associations between human single nucleotide variants (SNVs), multiple nucleotide variants (MNVs) or indels and susceptibility to or severity of a specified respiratory virus in case-control studies. Meta-analyses and studies that did not test which virus the cases were infected with were excluded. For the retained publications, we collected publication information including paper title, first and corresponding author, year and journal of publication and PubMed Unique Identifier (PMID).
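As an illustration of how the keyword combinations above could be enumerated, the following minimal sketch pairs each topic keyword with each virus name. The keyword and virus lists are taken from the text; the Boolean form of the query (a simple AND of one keyword and one virus name) and the function name `build_queries` are assumptions, since the exact search syntax is not specified.

```python
from itertools import product

# Keywords and virus names quoted from the search description.
TOPIC_KEYWORDS = ["variant", "polymorphism", "susceptibility"]
VIRUS_NAMES = [
    "adenovirus", "bocavirus", "influenza", "measles", "MERS",
    "metapneumovirus", "mumps", "parainfluenza virus",
    "respiratory syncytial virus", "rhinovirus", "rubella",
    "SARS", "SARS-CoV-2",
]


def build_queries(keywords, viruses):
    """Return one query string per (keyword, virus) pair.

    Assumption: each keyword is ANDed with each virus name; the
    actual query syntax used for the curation is not given.
    """
    return [f'"{kw}" AND "{virus}"' for kw, virus in product(keywords, viruses)]


queries = build_queries(TOPIC_KEYWORDS, VIRUS_NAMES)
print(len(queries))  # 3 keywords x 13 virus names
print(queries[0])
```

Each string would still need to be submitted to PubMed and the hits screened manually, as described above.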

Data extraction, standardization and annotation
We defined one record of a genetic association study based on the virus type, case-control sample and variant. For each record, respiratory virus type information was extracted from the full text of the paper. Virus subtype was also extracted if specified. The number, country, ethnicity and clinical severity information of samples were collected. Based on the ethnicity, we determined the 1000 Genomes superpopulation to which the sample population belonged. If the sample belonged to multiple superpopulations, it was marked as 'Mixed'. Regardless of whether the variant was associated with virus susceptibility, all the studied variants mentioned in the main text were included, to provide a more comprehensive and unbiased scope of genetic association studies.
For the sake of uniformity, the name, reference allele and alternate allele of each variant were based on the dbSNP database. Many early publications did not provide the dbSNP rs ID of the variant. We manually annotated the rs IDs of these variants by genomic mapping. The original names of these variants in the publication were also included in the database as the old name. The genomic position of the variants was annotated based on the hg38 human genome. Annotation of variants relative to genes was based on the following order: exon, 5' UTR, 3' UTR, intron, promoter (within 2kb upstream), upstream (2-5kb upstream), downstream (within 2kb downstream).
The alternate allele frequencies of cases and controls, statistical method, odds ratio (OR), 95% confidence interval (CI) and p value for the allele association were extracted from the full text or supplementary materials. If only genotype frequency was given in the paper, the alternate allele frequency was calculated manually. For the p value, '> 0.05%' was marked if the paper did not give a specific value but stated that there was no statistically significant difference or association. The allele, genotype and haplotype association results were each classified into one of the following four categories: 'severity', 'susceptibility', 'no association' and 'NA'. If at least one of the allele, genotype or haplotype association results reported 'severity' or 'susceptibility', then the overall association status was determined as 'severity' or 'susceptibility'; otherwise the overall association status was determined as 'no association'. Additional noteworthy information about sample, variant and disease association was included in notes.
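The record definition and the 'Mixed' rule can be sketched as follows. This is a minimal illustration, not the curation code: the field names in `RecordKey` and the function `assign_superpopulation` are hypothetical, and the mapping from a reported ethnicity to a superpopulation code is assumed to have been done beforehand by the curator. The five codes are the standard 1000 Genomes superpopulations.

```python
from typing import Iterable, NamedTuple

# The five 1000 Genomes superpopulation codes.
SUPERPOPULATIONS = {"AFR", "AMR", "EAS", "EUR", "SAS"}


class RecordKey(NamedTuple):
    """One record = one virus type + one case-control sample + one variant."""
    virus: str
    sample_id: str  # hypothetical identifier for a case-control sample
    variant: str    # e.g. a dbSNP rs ID


def assign_superpopulation(codes: Iterable[str]) -> str:
    """Collapse the superpopulation codes of one sample into a single label.

    Returns the code itself if all ethnicities map to the same
    superpopulation, and 'Mixed' if they map to more than one.
    """
    unique = set(codes)
    if not unique:
        raise ValueError("at least one superpopulation code is required")
    unknown = unique - SUPERPOPULATIONS
    if unknown:
        raise ValueError(f"unknown superpopulation code(s): {sorted(unknown)}")
    return unique.pop() if len(unique) == 1 else "Mixed"
```

Using a tuple-like key means that the same variant studied in two different samples, or for two different viruses, yields two distinct records.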
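One way to apply the annotation order above is to treat it as a priority list and keep the first category that applies to a variant. The sketch below makes that assumption explicit; the function names are hypothetical, and the handling of the exact 2 kb and 5 kb boundaries (inclusive upper bounds) is also an assumption, since the text does not state it.

```python
# Annotation order quoted from the text; earlier entries take precedence
# (reading the order as a priority is an assumption of this sketch).
ANNOTATION_ORDER = [
    "exon", "5' UTR", "3' UTR", "intron",
    "promoter", "upstream", "downstream",
]
RANK = {name: i for i, name in enumerate(ANNOTATION_ORDER)}


def flank_category(distance_bp, side):
    """Classify a variant outside a gene by its distance to the gene.

    side is 'upstream' or 'downstream'. Returns None when the variant
    lies outside every flanking window.
    """
    if distance_bp <= 0:
        raise ValueError("distance_bp must be positive")
    if side == "upstream":
        if distance_bp <= 2000:
            return "promoter"   # within 2 kb upstream
        if distance_bp <= 5000:
            return "upstream"   # 2-5 kb upstream
        return None
    if side == "downstream":
        return "downstream" if distance_bp <= 2000 else None
    raise ValueError("side must be 'upstream' or 'downstream'")


def pick_annotation(candidates):
    """Return the highest-priority annotation among the candidates."""
    return min(candidates, key=RANK.__getitem__)


print(pick_annotation(["intron", "3' UTR"]))
print(flank_category(3000, "upstream"))
```

A variant that is intronic in one transcript and in the 3' UTR of another would be annotated as 3' UTR under this reading.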
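The two calculations described above can be sketched as follows. The allele frequency function assumes a biallelic variant with diploid genotype counts. For the overall status, the text does not say what happens if one result reports 'severity' and another 'susceptibility'; this sketch simply returns the first positive result in the order allele, genotype, haplotype, which is an assumption. Function names are hypothetical and the counts in the example are invented.

```python
def alt_allele_frequency(n_ref_ref, n_ref_alt, n_alt_alt):
    """Alternate allele frequency from genotype counts (biallelic, diploid).

    Each heterozygote carries one alternate allele and each
    alternate homozygote carries two.
    """
    total_alleles = 2 * (n_ref_ref + n_ref_alt + n_alt_alt)
    if total_alleles == 0:
        raise ValueError("genotype counts sum to zero")
    return (n_ref_alt + 2 * n_alt_alt) / total_alleles


POSITIVE = ("severity", "susceptibility")


def overall_status(allele, genotype, haplotype):
    """Combine the three per-test categories into one overall status.

    Any positive result makes the overall status positive; otherwise
    the status is 'no association' (including when all three are 'NA').
    Assumption: the first positive result found wins.
    """
    for result in (allele, genotype, haplotype):
        if result in POSITIVE:
            return result
    return "no association"


# Invented genotype counts: 50 ref/ref, 40 ref/alt, 10 alt/alt.
print(alt_allele_frequency(50, 40, 10))
print(overall_status("no association", "severity", "NA"))
```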

Database implementation and data analysis
The dbGSRV database was implemented as a web application using JavaScript and HTML for front-end development. The core JavaScript libraries used were Vue.js (https://vuejs.org/) as the main front-end framework, vis.js (https://visjs.org) for the Network viewer and plotly.js (https://plotly.com/) for lollipop charts. The high-level web framework Django (https://www.djangoproject.com/) was used for back-end data preprocessing and data analysis. The global search function was based on the Elasticsearch module. The open-source data management system MySQL was used for storing and accessing tabular data.
Gene Ontology (GO) and pathway analysis of associated genes were conducted using the Database for Annotation, Visualization and Integrated Discovery (DAVID, https://david.ncifcrf.gov/).

Web interface
The dbGSRV database comprises six pages, including Home page, Browse page, Batch Search page, Advanced Search page, Network page and Help page.
On the Home page, users can find a brief introduction, the update log of dbGSRV and a quick search box (Fig 1A). The quick search box allows users to search genetic association records based on virus name, variant rs ID, gene name or genomic region. The Batch Search page allows users to search multiple viruses, variants, genes or genomic loci either by entering keywords in the text box or by uploading a txt file (Fig 1B). On the Advanced Search page, users can search by logical combination of more keywords (Fig 1C). The Browse page permits users to browse all records by virus, annotation or study type (Fig 1D).
The search results are presented as pie charts and a table (Fig 1D). The pie charts display the number and proportion of each subgroup for virus type, variant position relative to genes and study type, respectively, while the table contains the basic information of each record, including virus type, variant information (rs ID in the dbSNP database, position in the hg38 genome and position relative to genes), study type, sample size and association status. Clicking a subgroup in a pie chart restricts the results to that subgroup, and clicking the same subgroup again restores the full results.
In addition, the table provides several features. First, users can further filter the results in the table by typing terms in the 'filter' box at the top-right of the table. Second, by clicking the icon to the right of the 'filter' box, users can change the columns displayed in the table. The following three columns can also be added: ID (unique ID of each record), Year (the year in which the paper was published) and PMID (the PMID of the paper in the PubMed database). Third, each column can be sorted in ascending or descending order by clicking the triangle on the right side of the column header. Fourth, for each record, clicking the Variant dbSNP, Position, Gene or PMID column takes users to the corresponding page in the dbSNP, UCSC, GeneCards or PubMed database, respectively. Clicking 'More' opens the Details page of the record, which provides more detailed information about the reference, infection, population, mutation and disease relationship (Fig 1E).
The Network page provides a visual overview of the topological relationship between respiratory viruses and associated genes (Fig 2). Nodes represent respiratory viruses and genes. A respiratory virus node and a gene node are linked by an edge if at least one variant in the gene is reported to be associated with the susceptibility to or severity of the virus. By default, the network shows all the respiratory viruses in the database. Users can select a set of specific viruses and submit it to generate a new network for these viruses. The attributes of nodes (such as size, shape, background color, label font size and label color) and the overall layout of the network can be edited. The picture can be exported in SVG format for use in publications, and the data used to generate the network can be downloaded as an Excel file.
Finally, dbGSRV provides a detailed tutorial on the use of the database on the Help page.
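The edge rule of the Network page can be sketched as follows. This is a minimal illustration under stated assumptions, not the site's implementation: the record field names ('virus', 'gene', 'status'), the function names and the toy records are all hypothetical, and a record is treated as a positive association when its overall status is 'severity' or 'susceptibility'.

```python
from collections import defaultdict

POSITIVE = {"severity", "susceptibility"}


def build_edges(records, viruses=None):
    """Return the set of (virus, gene) edges of the virus-gene network.

    An edge exists if at least one record reports a positive association
    between a variant of the gene and the virus. If `viruses` is given,
    only those viruses are kept, mimicking a user-selected subset.
    """
    edges = set()
    for rec in records:
        if rec["status"] not in POSITIVE or not rec["gene"]:
            continue
        if viruses is not None and rec["virus"] not in viruses:
            continue
        edges.add((rec["virus"], rec["gene"]))
    return edges


def viruses_per_gene(edges):
    """Map each gene to the set of viruses it is linked to."""
    by_gene = defaultdict(set)
    for virus, gene in edges:
        by_gene[gene].add(virus)
    return dict(by_gene)


# Toy records invented for illustration only (not taken from dbGSRV).
records = [
    {"virus": "VirusA", "gene": "GENE1", "status": "severity"},
    {"virus": "VirusA", "gene": "GENE1", "status": "susceptibility"},
    {"virus": "VirusB", "gene": "GENE1", "status": "severity"},
    {"virus": "VirusB", "gene": "GENE2", "status": "no association"},
]
edges = build_edges(records)
print(sorted(edges))
print(viruses_per_gene(edges))
```

Because edges are stored in a set, several positive records for the same virus and gene still produce a single edge, and genes linked to several viruses can be read directly from `viruses_per_gene`.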

Database statistics
For a more comprehensive and unbiased understanding of the genetic association studies with respiratory viruses, the database includes not only positive results of association but also negative results reported in the main text of the paper. In total, dbGSRV contains 1932 records of genetic association studies relating 1010 unique variants and seven respiratory viruses, manually curated from 168 published articles. The distribution of records by virus type is shown in Fig 3A. A majority of the records relate to variants residing in the introns (35.7%) and exons (34.3%) of genes (Fig 3B). As for study strategy, most of the records are curated from candidate-gene studies (62.4%) (Fig 3C). It is worth noting that 610 records report positive genetic associations between 249 unique variants of 159 genes and respiratory virus infection, mostly based on allele frequency (Fig 3D). Among the positive associations, 149 records are related to susceptibility to infection while 461 records are related to severity.
The Network page in the database provides a visual overview of the topological relationship between respiratory viruses and associated genes as shown in Fig 2. Influenza virus, RSV and SARS-CoV-2 have the most associated genes, which is in accordance with these three respiratory viruses having the most study records. On the other hand, a couple of genes are associated with multiple respiratory viruses. Particularly, TNF gene, which is a key mediator of the inflammatory response and is critical for host defense against a wide variety of pathogenic microbes [23], is associated with the greatest number of respiratory viruses.

GO and pathway analysis of associated genes
We performed GO and pathway analysis of associated genes using the Database for Annotation, Visualization and Integrated Discovery (DAVID). The top 10 significantly enriched GO terms and pathways are shown in Tables 1 and 2, respectively. GO analysis revealed that 'immune response' and 'inflammatory response' were the top enriched terms. In addition, other enriched terms such as 'positive regulation of T cell proliferation', 'positive regulation of interferon-gamma production', 'regulation of complement activation', 'type I interferon signaling pathway', 'cytokine activity' and 'positive regulation of inflammatory response' are also related to immune response and inflammatory response, highlighting the central role of these processes against respiratory virus infection [24]. Notably, inflammatory response is a double-edged sword in respiratory virus infection [25]. On one hand, inflammatory response promotes immune response against infection. On the other hand, a 'cytokine storm' triggered by inflammatory response may worsen the severity of respiratory virus infection [26].
Pathway analysis revealed a significant enrichment for pathways directly related to pathogens and autoimmune diseases such as 'Influenza A', 'Inflammatory bowel disease (IBD)', 'Herpes simplex infection', 'Measles', 'Rheumatoid arthritis' and 'Leishmaniasis'. Three enriched pathways were related to cytokines: 'Cytokine-cytokine receptor interaction', 'Jak-STAT signaling pathway', which is the downstream signaling pathway of the cytokine interferon [27], and 'Cytokine Network'. In addition, 'Toll-like receptor signaling pathway', which is essential for viral sensing and triggering the downstream immune response [28], was also enriched.
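To give an intuition for what 'significantly enriched' means here, the sketch below computes a plain hypergeometric over-representation p value for one term. DAVID itself reports a modified Fisher's exact test (the EASE score), so this is a simpler stand-in rather than the calculation behind Tables 1 and 2. The function name and the numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """Probability of seeing at least k annotated genes by chance.

    N: number of genes in the background
    K: background genes annotated with the term
    n: number of genes in the query list
    k: query genes annotated with the term
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(n, K)):
        raise ValueError("inconsistent counts")
    total = comb(N, n)
    # Sum the upper tail P(X = i) for i = k .. min(n, K).
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


# Invented toy numbers: 10 background genes, 4 annotated,
# 3 genes in the query list, 2 of them annotated.
print(hypergeom_pvalue(2, 3, 4, 10))
```

In practice such p values are computed for every term and then corrected for multiple testing before the top terms are reported.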

Discussion
To our knowledge, dbGSRV is the first manually curated database containing comprehensive human genetic association information for respiratory viruses. Several of its features are worth noting.
First, dbGSRV contains 1932 records of genetic association studies relating 1010 unique variants and seven respiratory viruses with a user-friendly interface, which is convenient for researchers to browse and retrieve the data, access more information in public databases through hyperlinks and visualize the network of respiratory viruses and associated genes. Users can use the database to familiarize themselves with respiratory virus genetics, explore genes implicated in virus infection, select variants for further confirmation and check whether significant variants discovered in their studies have been reported previously.
PLOS ONE
Second, inconsistent results are frequently found in replication studies of genetic association [29,30], so recording only positive association results may be misleading. To provide a more comprehensive and unbiased scope of genetic association studies, we included not only records of positive association results in the database, but also records of negative results mentioned in the main text of the references. Additionally, as conflicting genetic association results might result from factors such as ethnicity, sample size, allele frequency and analysis method [31], we also collected information such as the number, ethnicity and alternate allele frequencies of cases and controls, study type, statistical method, OR, 95% CI and p value for the allele association if available, in order to help users assess and compare different study results accurately and comprehensively.
Third, users should recognize that there are certain limitations in the database. Among all kinds of common respiratory viruses, only seven have been studied for human genetic susceptibility or severity and are thus included in the database, and a majority of the studies are candidate-gene studies. This might lead to a biased representation of certain variants and associations in the database. Therefore, more studies, especially GWAS, on a wider range of respiratory viruses are anticipated in the future, and dbGSRV will be updated about once a year according to newly available data. In addition, we find that many genetic association studies of respiratory viruses have limited sample size and statistical power, which might be compensated for by meta-analysis [32]. Therefore, we plan to include meta-analyses in future updates of the database.
In summary, dbGSRV will be a convenient resource for researchers to query and retrieve genetic associations with respiratory viruses, which may inspire future studies and provide new insights into our understanding and treatment of respiratory virus infection.