A domain-knowledge modeling of hospital-acquired infection risk in Healthcare personnel from retrospective observational data: A case study for COVID-19

  • Phat K. Huynh,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliations Department of Industrial and Management Systems Engineering, University of South Florida, Tampa, FL, United States of America, Department of Industrial and Manufacturing Engineering, North Dakota State University, Fargo, North Dakota, United States of America

  • Arveity R. Setty,

    Roles Conceptualization, Investigation, Validation, Visualization, Writing – review & editing

    Affiliations University of North Dakota, Fargo, North Dakota, United States of America, Sanford Hospital, Fargo, North Dakota, United States of America

  • Quan M. Tran,

    Roles Conceptualization, Methodology, Writing – review & editing

    Affiliation Department of Biological Sciences, University of Notre Dame, Notre Dame, Indiana, United States of America

  • Om P. Yadav,

    Roles Conceptualization, Methodology, Validation, Writing – review & editing

    Affiliation Department of Industrial and Systems Engineering, North Carolina A&T State University, Greensboro, North Carolina, United States of America

  • Nita Yodo,

    Roles Conceptualization, Writing – review & editing

    Affiliation Department of Industrial and Manufacturing Engineering, North Dakota State University, Fargo, North Dakota, United States of America

  • Trung Q. Le

    Roles Conceptualization, Funding acquisition, Investigation, Methodology, Supervision, Validation, Visualization, Writing – review & editing

    Affiliations Department of Industrial and Management Systems Engineering, University of South Florida, Tampa, FL, United States of America, Department of Industrial and Manufacturing Engineering, North Dakota State University, Fargo, North Dakota, United States of America



Hospital-acquired infections of communicable viral diseases (CVDs) pose a tremendous challenge to healthcare workers globally. Healthcare personnel (HCP) face a persistent risk of viral infection and, consequently, higher rates of morbidity and mortality.

Materials and methods

We proposed a domain-knowledge-driven infection risk model to quantify the individual HCP and the population-level risks. For individual-level risk estimation, a time-variant infection risk model is proposed to capture the transmission dynamics of CVDs. At the population level, the infection risk is estimated using a Bayesian network model constructed from three feature sets: individual-level factors, engineering control factors, and administrative control factors. For model validation, we investigated the case study of the coronavirus disease 2019 (COVID-19), in which the individual-level and population-level infection risk models were applied. The data were collected from various sources such as COVID-19 transmission databases, health surveys/questionnaires from medical centers, U.S. Department of Labor databases, and cross-sectional studies.


Regarding the individual-level risk model, the variance-based sensitivity analysis indicated that the uncertainty in the estimated risk was attributed to two variables: the number of close contacts and the viral transmission probability. Next, the disease transmission probability was computed using a multivariate logistic regression applied to cross-sectional HCP data from the UK, with a 10-fold cross-validation accuracy of 78.23%. Combined with the previous result, we further validated the individual infection risk model by considering six occupations in the U.S. Department of Labor O*Net database. The occupation-specific risk evaluation suggested that registered nurses, medical assistants, and respiratory therapists were the highest-risk occupations. For the population-level risk model validation, the infection risk in Texas and California was estimated, and the infection risk in Texas was lower than that in California. This can be explained by California’s higher patient load for each HCP per day and lower personal protective equipment (PPE) sufficiency level.


The accurate estimation of infection risk at both the individual and population levels using our domain-knowledge-driven infection risk model will significantly enhance PPE allocation, safety plans for HCP, and hospital staffing strategies.

1. Introduction

Nosocomial infections (i.e., hospital-acquired infections) of communicable viral diseases (CVDs) (e.g., influenza virus, hepatitis A virus, and rotavirus infections) have posed huge challenges to public health organizations. Healthcare personnel (HCP) experience the highest risk [1–3] because of direct or indirect contact with infected patients and virus-contaminated surfaces. Subsequently, these HCP may spread the virus to non-infectious patients, coworkers, and their family members. Although there has been an increasing number of hospital outbreaks of CVDs over the last decade, current containment and preventive measures in hospital settings usually overlook asymptomatic individuals and “super spreader” events [4, 5]. Hence, a quantitative estimation of the infection risk in HCP is critical to mitigate and subsequently prevent nosocomial infections in hospitals. Furthermore, a precise measure of HCP infection risk is also important to address the epidemiological issues in hospital settings and provide information for personal protective equipment (PPE) allocation, safety plans for HCP, and staffing strategies.

Modeling nosocomial HCP infections in hospitals has been based on mathematical models to qualitatively capture the dynamics of CVDs and the effects of different control measures [6, 7] at the population level. One traditional model of disease spread is the compartmental SEIR (Susceptible-Exposed-Infected-Recovered) epidemiological model [8]. It divides a population into four different compartments or sub-groups (susceptible, exposed, infected, and recovered individuals) and employs deterministic ordinary differential equations to model the spread of a CVD. In the literature, there are many variants of this model (e.g., SIS, SIRD, MSIR, and MSEIR models). These models consider the population as homogeneous without individual interactions (e.g., between patients and HCP); therefore, they fail to capture the individual contact process and the effects of individual risk and protective factors [9]. In other words, only the patient flow dynamics are captured at the population level, but the individual-level information is neglected. To address the “homogeneous population” assumption and better capture transmission dynamics, spatial epidemic models [10, 11] and metapopulation models [12, 13] have been introduced, which can be considered as “heterogeneous models”. The spatial modeling approach incorporates the spatial information of disease occurrences and transmission behaviors to estimate the probability of infection transmission from an infectious to a susceptible individual at a certain distance. The metapopulation models integrate the compartmental models and spatial epidemic models, which allows us to simulate the behaviors of a large population given a well-defined spatial distribution. In this approach, the entire population is separated into two different subpopulations with coupled disease transmission dynamics based on the subpopulation characteristics, and each subpopulation is mixed homogeneously.
In addition, network models [14, 15] have been proposed to model the contact network elaborately without assuming that all individuals interact with each other in a regular random pattern [16]. The contact network concept is founded on mathematical graph theory, in which a graph has two fundamental components: vertices and edges. In the context of epidemiological modeling, a vertex can represent an individual, a subpopulation, or even an entire country; an edge is the link between vertices, which captures the disease transmissions and individual interactions. Despite the tremendous efforts in advancing the population-based mathematical epidemiological model, the advanced models, such as “heterogeneous models” and network models, are still limited in terms of modeling and characterizing the individuals’ movements and interactions.

To overcome the limitations of the population-based models, individual-based models have been proposed to fully capture the complexity of individual behaviors and interactions with other individuals. First, complex systems approaches using cellular automata (CA) theory have been proposed to model location-specific dynamics of susceptible populations and the probabilistic nature of disease transmission [17, 18], which take into account the individual movements of the populations. However, the major drawback of CA models is their insufficiency in characterizing the spatial-temporal information of individuals’ movements and interactions [19]. Agent-based modeling (ABM) was proposed to address the limitations of CA models by accounting for the individual-level movement of disease carriers and the contact network of people [20]. Although the ABM approach can capture the spread of a CVD in a spatial region (e.g., a hospital) over time and estimate the risk of viral infection, it requires a large amount of information on individuals’ movements and incurs a high computational cost. Moreover, individuals’ movements are highly restricted in hospital settings, especially for patients who have positive test results for infectious diseases.

Statistical models have been used as an alternative to mathematical models to quantify the population-averaged effects of protective or risk factors on the time-variant infection risk of HCP. Statistical models capture the disease transmission dynamics within the hospital, HCP-related risk factors of infection, and other patients and HCP as sources of infection [21]. Two classes of statistical models, namely measures of association and statistical survival analysis, have been proposed to estimate HCP infection risk. The measure-of-association approaches quantify the relationship between the exposed and diseased HCP groups by using the adjusted odds ratio (aOR), risk difference (RD), and relative risk (RR) as the risk measures [2, 22–25]. To capture the changes in HCP characteristics and infection risk over time, survival analysis models are used to estimate the HCP infection risk and the expected duration of time until a viral infection occurs [26, 27]. Although time-dependent variables have been considered in the survival analysis models, the stochastic nature of epidemiological dynamics and individual interactions has not been investigated.

To overcome the above research gaps, this paper proposes a probabilistic domain-knowledge-driven model of the infection risk of CVDs for HCP, which is a hybrid multi-scale approach that combines mechanistic modeling and statistical modeling approaches. The term “domain-knowledge-driven model” [28] refers to a class of statistical or machine learning models that leverage expert knowledge and embed it into the models to enhance their performance, understanding, and validity. The domain knowledge we incorporated in our proposed probabilistic framework includes modes of CVD transmission, significant risk and protective factors of the infection risk, and disease transmission dynamics through patient compartments. The proposed model was formulated for the infection risk estimation at both individual and population levels with respect to three modes of transmission: 1) direct contact of susceptible HCP with other infectious individuals including patients and coworkers, 2) airborne viruses, and 3) contaminated equipment and surfaces. The individual-level risk model was built based on the population grouping in the SEIR model with the consideration of the time-varying confounders to capture the dynamical contagious disease transmission mechanism. At the population level, three subsets of features, which are introduced in Sub-section 2.2, were constructed and represented by a Bayesian network [29], from which the probability of transmission from patients to HCP was estimated. The main contributions of this paper are 1) a novel probabilistic model that characterizes the dynamics of the disease transmission in HCP over time and 2) a domain-knowledge-driven risk analysis model that quantifies both the individual-level and population-level infection risk.
The remainder of the manuscript is organized as follows: Section 2 discusses the model formulation and model validation; the results with sensitivity analysis and the case study on COVID-19 are presented in Section 3; and the discussion and conclusions are provided in Sections 4 and 5.

2. Materials and methods

The proposed framework consists of two sub-models: (1) an individual-level infection risk model that quantifies the risk of infection of an HCP, and (2) a population-level model that estimates the infection risk under working conditions at a medical facility. The output from the first sub-model serves as an input for the estimation of the population infection risk in the second model. Other inputs, such as engineering control and administrative factors, were also considered in the estimation of population risk.

2.1. Individual-level infection risk model

The individual infection risk model aims to quantify the potential risk of infection associated with a healthcare worker subject to nosocomial infection, whose job functions require working in proximity to patients. The proposed individual-level infection risk model is formulated using the population grouping approach in the compartmental SEIR model [8], in which the population is divided into different compartments (i.e., Susceptible (S), Exposed (E), Infectious (I), or Recovered (R)). However, susceptible (S) and recovered (R) individuals cannot transmit the virus during the length of a hospital stay [8]; hence, we do not consider these compartments in our model. Moreover, we do not assume that recovered patients acquire immunity to reinfection when released from isolation. HCP coworkers who contract a virus have also been shown to contribute significantly to virus spread within the healthcare setting [26, 30]. To capture the virus transmission mechanism, the healthcare worker group (HW) is added to model HCP-HCP transmission, and the infectious individuals are further classified into two sub-groups: the infection-confirmed (IC) group and the infection-suspected (IS) group. Infection-confirmed individuals are those who have lab-confirmed infections (e.g., individuals who have tested positive for COVID-19 using the polymerase chain reaction (PCR) test), and the infection-suspected group includes individuals who are suspected to have the virus infection because they developed symptoms but have not been tested for the infectious disease. In total, four groups (E, IC, IS, HW) are considered to model the individual HCP infection risk. The potential infection risk of HCP j at location i (e.g., a hospital) over the period from t1 to t2 is the cumulative risk of viral infection after contact with patients and contaminated surfaces.
We also count the exposed cases, infection-confirmed cases, infection-suspected cases, and colleagues with whom HCP j has had contact over the period (t1:t2); this period is abbreviated as (∙) in the notation.

An HCP j is assumed to have independent close contacts with an individual k. Next, we denote as the probability of viral transmission from individual k to the HCP j, with X∈{E, IC, IS, HW} being the compartment indicator of person k. Here, if the probability is constant, the viral transmission mechanism is modelled as a binomial process [31], and there are binomial processes in total. The sequence of contacts of HCP j ordered by time will be superscripted by person index k(m) and compartment index X(m) as follows: (1) where X(m) = {E, IC, IS, HW}, m is the temporal order of close contacts from which the HCP j contracts the virus, if the HCP j contracts the virus at the mth close contact, otherwise. As a result, the risk , is estimated as: (2) where |C(∙)| is the total number of contacts and means all previous m−1 contacts are the failed transmissions. Given the assumption of independent close contacts, Eq (2) can be expressed as: (3)
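Under the independence assumption, the step from Eq (2) to Eq (3) can be restated as follows. This is a sketch in assumed notation: p_m stands for the transmission probability at the mth close contact and R for the cumulative individual risk, and neither symbol is taken from the original equations. The probability that the first successful transmission occurs at contact m is p_m times the probability that all earlier contacts failed, and summing over m telescopes into one minus the probability that every contact fails:

```latex
\[
R \;=\; \sum_{m=1}^{|C^{(\cdot)}|} p_m \prod_{l=1}^{m-1} \left(1 - p_l\right)
  \;=\; 1 - \prod_{m=1}^{|C^{(\cdot)}|} \left(1 - p_m\right)
\]
```

This form agrees with the sensitivity results in Sub-section 3.1.1: the product does not depend on the order of the contacts, and three probability levels of 0.01, 0.05, and 0.1 give a mean risk of about 0.1038 for two contacts.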

The expectation and variance of are further investigated and presented in the S1 File. If we denote and as the patient admission time and the recovery time of an individual k with whom the HCP j has close contacts, the time interval is the virus exposure period for the HCP j with the person k. Therefore, can be reduced to . If is time-invariant, a logistic regression model is established to estimate the probability as: (4) (5) where is the indicator variable ( means that HCP j has contracted the virus via the contact with person k(r) and if HCP j has failed to contract the virus), Z is the covariate vector including the factors influencing the response and β is the coefficient vector. If varies over time, the constant assumption is relaxed by considering the cumulative distribution function that describes the probability of infection up to time t: , in which T is the infection time and h(t) is the hazard function. We assume h(t) = 0 over the time of no close contacts. Hence, is: (6) (7) where is the time period of the mth close contact with person k(m), and hm(t) is the cumulative infection time distribution function for the mth close contact. The probability and hm(t) depend on various factors including HCP- dependent features, patient-dependent features, patient-HCP interactions, HCP-HCP interactions, and healthcare facilities’ conditions.
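The two ways of obtaining a per-contact transmission probability described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the coefficient layout, and the midpoint-rule integration are assumptions. The hazard-based version uses the standard survival-analysis relation P(T ≤ t) = 1 − exp(−∫ h(u) du) over the contact interval.

```python
import math


def logistic_transmission_probability(beta0, beta, z):
    """Time-invariant case: logistic model on a covariate vector z.

    beta0 (intercept) and beta (coefficients) are hypothetical inputs.
    """
    linear = beta0 + sum(b * x for b, x in zip(beta, z))
    return 1.0 / (1.0 + math.exp(-linear))


def hazard_transmission_probability(hazard, t_start, t_end, steps=1000):
    """Time-varying case: 1 - exp(-integral of the hazard over the contact).

    hazard is any callable h(t) >= 0; the integral is approximated with
    the midpoint rule (an illustrative choice).
    """
    dt = (t_end - t_start) / steps
    integral = sum(hazard(t_start + (i + 0.5) * dt) for i in range(steps)) * dt
    return 1.0 - math.exp(-integral)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(logistic_transmission_probability(-3.0, [0.8, 1.2], [1, 0]))
    print(hazard_transmission_probability(lambda t: 0.1, 0.0, 2.0))
```

Outside of contact intervals the hazard is taken to be zero, so only the periods of close contact contribute to the integral.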

2.2. Population risk indicator model

The population risk indicator quantifies the potential viral infection risk associated with a hospital/clinic over the time period [t1:t2]. The population risk, annotated as , is interpreted as the probability that an HCP contracts the disease under working conditions at place i given the information about the individual-level infection risk of all HCP at place i and the external factors. At this level, external factors from engineering and administrative controls within the hospital are considered. Those are the factors that affect the population-level infection risk apart from the individual-level risk. Representative examples of engineering controls are high-efficiency air, ventilation rates at the workplace, and infection isolation rooms for aerosol generating procedures. Administrative controls include formal HCP training regarding PPE availability level, training on risk factors and resources to promote personal hygiene. The is computed using logistic function as: (8) where is the vector of individual infection risk estimates of a total number of nHCP HCP, τ is the scaling parameter, F = {Fi} is the vector of engineering control and administrative control factors. We denote f(∙) as the abbreviated notation for the function of and F. When the working restriction policy is applied to a certain HCP, which forces him/her to be self-isolated at home, his/her individual risk will not be considered in Eq (8). The function f(∙) can be simply formulated as a linear regression model such that: (9) where α, wi, and b are the model parameters. Alternatively, the population risk is estimated using a Bayesian network when we have access to the domain knowledge that describes the relationships between the control factors and the infection risk at the population level and individual level. Here, the Bayesian network model [32] is employed to incorporate the domain knowledge that influences the virus spread. 
The network is formulated based on three subsets of factors from the literature that affect the risk of infection including 1) individual-level factors, 2) engineering control factors, and 3) administrative control factors (see Fig 1). Individual-level factors include patient characteristics (e.g., time from exposure to symptom onset, clinical severity of patients), HCP-dependent factors (e.g., PPE sufficiency level, close contacts with patients, exposure level to infection, working hours per week), and intervention-related risks (e.g., endotracheal intubation, high flow nasal cannula (HFNC)). External factors include engineering control factors (e.g., ventilation rates, airborne infection isolation rooms) and administrative control factors (e.g., formal HCP training on PPE and disease risk factors, resources to promote personal hygiene). These factors are annotated as ILF, ECF, and ACF, respectively. Hence, using the chain rule of the Bayesian network [33], the risk is (10) where P(∙) is the probability function, and is the indicator variable ( indicates that an HCP contracts the disease and if they do not).
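The logistic formulation with a linear f(∙) described above can be sketched as follows. The exact functional form is an assumption: here the individual risks are aggregated by their mean, the scaling parameter τ divides the linear predictor, and all function and parameter names are illustrative rather than taken from Eqs (8) and (9).

```python
import math


def population_risk(individual_risks, control_factors, alpha, weights, bias, tau=1.0):
    """Logistic population-level risk from a linear predictor (illustrative).

    individual_risks: risk estimates of the HCP currently working on site;
        HCP under a working restriction are simply left out of this list.
    control_factors: engineering/administrative control factor values F_i.
    alpha, weights, bias: linear-model parameters (alpha, w_i, b).
    tau: scaling parameter, assumed here to divide the linear predictor.
    """
    mean_risk = sum(individual_risks) / len(individual_risks)
    f = alpha * mean_risk + sum(w * x for w, x in zip(weights, control_factors)) + bias
    return 1.0 / (1.0 + math.exp(-f / tau))


if __name__ == "__main__":
    # Hypothetical inputs: two control factors with protective (negative) weights.
    print(population_risk([0.10, 0.22, 0.15], [1.0, 0.5], 4.0, [-0.8, -0.4], -1.0))
```

The logistic link keeps the output in (0, 1), so it can be read as a probability regardless of the scale of the control factors.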

Fig 1. Illustration of the contributions of the individual factors and external factors to the estimation of the infection risk of HCP in our model formulation.

The infection risk at both the individual and population levels can be estimated based on a Bayesian network formulation which has 4 main nodes, namely the individual-level risk, the population-level risk, the individual-level factors, and the external factors. The individual-level factors (ILF) include patient characteristics, HCP characteristics, and intervention-related risks, whereas the external factors consist of engineering control factors (ECF) and administrative control factors (ACF).

3. Results and COVID-19 case study

3.1. Sensitivity analysis using simulated data

The variance-based sensitivity analysis was utilized to study the uncertainty of HCP’s potential infection risk output caused by the variance of the input variables.

3.1.1. The measure of sensitivity of the infection risk to the transmission probability and close contact sequence.

The dependence of the infection risk on the probability of viral transmission and close contact sequence for an HCP was analyzed. The risk values for different numbers of close contacts |C(∙)| were estimated by Eq (3). For illustration, the results for |C(∙)| = 2 and |C(∙)| = 3 are shown in Fig 2.

Fig 2. Sensitivity analysis of the impact of probability of viral transmission and the number of close contacts on the individual infection risk.

We estimated values of the infection risk for the synthesized data with three levels of the transmission probability: Plow = 0.01, Pmedium = 0.05, Phigh = 0.1. Panel (a): the estimated risk for |C(∙)| = 2, i.e., two close contacts; therefore, there are n = 3^2 = 9 possible contact sequences with different combinations of probability levels, and those combinations are encoded in the form X1X2…Xn, where X1, X2,…,Xn∈{0,1,2}, which correspond to the low, medium, and high levels of the transmission probability. The mean level of the risk (green dash-dotted line) and its standard deviation (purple dash-dotted lines) are also plotted. Panel (b): the results for |C(∙)| = 3 with n = 3^3 = 27 possible contact sequences.

According to the results, the mean level (± standard deviation) of the infection risk for |C(∙)| = 2 was 0.1038±0.0523, which was lower than that for |C(∙)| = 3 at 0.1516±0.0583. Both the mean and the standard deviation of the individual risk increased as the number of contacts increased. In addition, the estimated risk was not influenced by the time order of the close contacts, e.g., the risk was the same for the three sequences 011, 101, and 110, where 0 and 1 are the encoded values for Plow and Pmedium, respectively. This follows from the assumption of temporal independence between close contacts. However, the risk increased when the probability for any contact rose to a higher value; hence, the probabilities collectively contributed to the value of the risk.
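The enumeration behind these figures can be sketched in a few lines. The sketch assumes that the risk for a sequence is one minus the product of the per-contact escape probabilities (the independent-contact form) and that the reported spread is the sample standard deviation over all equally weighted sequences; the function names are illustrative.

```python
from itertools import product
from statistics import mean, stdev

# Low, medium, and high transmission probabilities used in the text.
LEVELS = (0.01, 0.05, 0.10)


def cumulative_risk(probabilities):
    """Risk of at least one successful transmission over independent contacts."""
    no_infection = 1.0
    for p in probabilities:
        no_infection *= 1.0 - p
    return 1.0 - no_infection


def risk_summary(n_contacts):
    """Mean and sample standard deviation over all 3**n_contacts sequences."""
    risks = [cumulative_risk(seq) for seq in product(LEVELS, repeat=n_contacts)]
    return mean(risks), stdev(risks)


if __name__ == "__main__":
    for n in (2, 3):
        m, s = risk_summary(n)
        print(f"|C| = {n}: mean = {m:.4f}, sd = {s:.4f}")
```

Under these assumptions the sketch gives 0.1038±0.0523 for two contacts and 0.1516±0.0583 for three, matching the values above, and reordering a sequence leaves the risk unchanged because the product is commutative.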

3.1.2. Response surface of the mean and variance of the infection risk.

The measure of sensitivity of the potential infection risk of the HCP j at the place i over time (t1: t2) was investigated. We considered the mean level and the variance of the infection risk over all contact sequences given the number of close contacts |C(∙)|. Next, we defined two levels of the transmission probability: Plow∈(0,0.5] and Phigh = Plow+0.3, and derived the response surfaces of the mean and the variance with respect to two inputs Plow and |C(∙)|. As shown in Fig 3, the response surface of the mean showed that a high probability of successful viral transmission will result in an extremely high value of the mean risk, e.g., when |C(∙)| is only 3, Plow = 0.3, and Phigh = 0.7.

Fig 3. Response surfaces of the mean and variance of the infection risk with respect to two input variables: Viral transmission probability and number of close contacts.

(a): the response surface of the mean infection risk subject to the change of Plow and total number of close contacts |C(∙)|∈[1,12]. A data set was synthesized with two levels of the transmission probability: Plow∈(0,0.5] and Phigh = Plow+0.3, where the expectation is the mean level of the infection risk over all possible contact sequences C(∙), which are the combinations of Plow and Phigh in the sequence of length |C(∙)|. Data tips at 3 values of Plow: 0.05, 0.2, 0.5 were created to indicate the cut-off values of |C(∙)| when the mean risk was significantly high. Similarly, (b) shows the response surface of the variance of the infection risk over all possible sequences subject to the change of Plow and |C(∙)|. Three data tips at Plow = {0.05, 0.2, 0.5} were included to show the threshold of |C(∙)| at which the variance was sufficiently low.
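With only two probability levels and independent contacts, the mean and variance over all equally weighted sequences have a simple closed form, which makes a response surface cheap to tabulate. The sketch below rests on those assumptions (independent-contact risk, equal weighting of the 2^n sequences, population rather than sample variance); the function names and the `delta` parameter are illustrative. The brute-force enumeration is included only as a cross-check.

```python
from itertools import product
from statistics import mean, pvariance


def mean_and_variance(p_low, n_contacts, delta=0.3):
    """Closed-form mean and population variance of the risk.

    With q = 1 - p, the escape probability of a sequence is a product of
    independent, equally likely q values, so E[prod] = E[q]**n and
    E[prod**2] = E[q**2]**n.
    """
    p_high = p_low + delta
    m1 = 1.0 - (p_low + p_high) / 2.0                           # E[q]
    m2 = ((1.0 - p_low) ** 2 + (1.0 - p_high) ** 2) / 2.0       # E[q^2]
    return 1.0 - m1 ** n_contacts, m2 ** n_contacts - m1 ** (2 * n_contacts)


def brute_force(p_low, n_contacts, delta=0.3):
    """Same quantities by enumerating all 2**n_contacts sequences."""
    risks = []
    for seq in product((p_low, p_low + delta), repeat=n_contacts):
        escape = 1.0
        for p in seq:
            escape *= 1.0 - p
        risks.append(1.0 - escape)
    return mean(risks), pvariance(risks)


if __name__ == "__main__":
    for p_low in (0.05, 0.2, 0.5):
        for n in (1, 3, 6, 12):
            m, v = mean_and_variance(p_low, n)
            print(f"P_low={p_low:.2f} |C|={n:2d} mean={m:.4f} var={v:.6f}")
```

The closed form also shows why the variance eventually shrinks as the number of contacts grows: both terms decay geometrically while the mean risk approaches one.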

3.2. Model validation using COVID-19 case study

3.2.1. Case study description.

Data sets of HCPs with COVID-19 were used to validate the proposed model. Access to these data sources can be provided upon request or via the cited references. The validation was performed on three main components: the viral transmission probability model, the individual-level infection risk model, and the population-level risk model. The HCP’s occupational risk of COVID-19 infection, interim guidance regarding risk assessment and universal PPE policy issued by the CDC [41], and the risk factors for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) transmission in hospital settings from previous studies were also included to develop the model for the case study.

The major factors resulting in high risk for HCPs are 1) exposure to COVID-19 patients without using appropriate PPE, 2) involvement in aerosol-generating procedures and the interventions performed by physicians or nurses, and 3) contact with patients and colleagues during the incubation period. Many studies suggested that there is a significant association between PPE use and infection risk and that masks are the most consistent contributing measure to reduce the risk [34, 35]. A similar association was observed for other PPE, such as gowns, gloves, and eye protection. Other exposures and treatment practices (e.g., intubation involvement, patient care, or having contact with secretions) were found to be linked with increased infection risk for HCPs [36, 37]. Finally, given the implementation of a universal PPE policy, the high risk of infection among HCP also arises from contact with asymptomatic patients and colleagues who are in the early phase of viral infection [24].

Different regression models, including logistic regression, log-binomial, and Poisson, were used with the defined risk measures to estimate the COVID-19 infection risk among HCP groups [23–25, 38–45]. Statistical survival analysis models were also used to estimate the HCP’s risk of contracting SARS-CoV-2 and the expected duration of time until viral infection occurs. Shah et al. [27] modeled hospital admission of healthcare workers with COVID-19 using Cox regression and conditional logistic regression. Long Nguyen et al. [26] assessed the COVID-19 infection risk among healthcare workers in contrast to the general community by examining the effect of PPE on risk. They also used a Cox proportional hazards model to calculate multivariate-adjusted hazard ratios (HRs) of a positive test. However, the major limitations of these models are: 1) the individual-specific characteristics, e.g., occupation [46], type of PPE used, experience level, and exposure duration to COVID-19 patients, are not considered [26, 27], and 2) the simple formalism of the models without time-varying stochastic transmissions oversimplifies the complex contagion mechanism of SARS-CoV-2.

3.2.2. Data description.

Data collected from multiple sources, namely COVID-19 transmission databases, health surveys/questionnaires, U.S. Department of Labor databases, and a cross-sectional study of UK-based healthcare workers, are summarized in Table 1.

Table 1. Sources of databases information including source, nation, updated time, and owner.

3.2.3. Model variable selection.

Variables from recent findings on SARS-CoV-2, as introduced in Sub-section 3.2.1, were used to select the features. The validation was performed on three main components: the viral transmission probability model, the individual-level infection risk model, and the population-level risk model. Regarding the viral transmission probability model, we included the following covariates in the model: Age, Cancer, Resp, Obes, Smoker, Allied_prof, Dental_staff, Doctor, Pub_trans, C_contact, AGP, PPE_train, Lacked_PPE, Cont_wo_PPE, and Imp_PPE. These are significant factors suggested by the original cross-sectional study [54]. The description of these variables is summarized in S1 Table. To validate the individual-level infection risk model, the U.S. Department of Labor O*Net database was employed to quantify the risk score for healthcare-related occupations, where virus exposure time and duration and the working environment were considered. For the population-level risk model, the PPE sufficiency level, regional infection risk, and the hospitalization data of HCP were selected to estimate the population-level infection risk in California and Texas medical centers [49, 50] and to implement a surrogate method for model validation. These variables are also described in S1 Table.

3.2.4. Model validation of viral transmission probability estimation using multivariate logistic regression.

To validate the logistic regression introduced in Sub-section 2.1, we considered different protective and risk factors for COVID-19 in the data set of UK-based healthcare workers [54] and modelled the association between these covariates and the COVID-19 infection status using multivariable logistic regression. The data set provides 6,263 responses in which a composite outcome was present in 1,806 (29.4%) HCP, of whom 49 (0.8%) were admitted to hospitals, 459 (7.5%) tested positive for SARS-CoV-2, and 1,776 (28.9%) self-isolated. The covariates included in the model were reported in Sub-section 3.2.3. The estimated coefficients with their standard errors (SEs) and their statistical significance indicated by p-value are shown in Table 2.

Table 2. Estimated coefficients and their statistical significance for the multivariate logistic regression model.

According to the table, the most significant variables (p-value < 0.001) that influence the disease transmission probability are Age, Obes, Allied_prof, Dental_staff, Pub_trans, C_contact, AGP, Lacked_PPE, and Cont_wo_PPE. The model goodness-of-fit was further assessed by the Akaike information criterion (AIC) and 10-fold cross-validation. The AIC value for the above model was 7317.70, lower than the 7449.75 obtained for the null model. The 10-fold cross-validation accuracy was 78.23%, which showed that the performance on test data was relatively good.

3.2.5. Model validation of the individual-level infection risk.

To validate the infection risk model at the individual level, six occupations were considered using the U.S. Department of Labor O*Net database [51]. We also introduced a new variable, the occupation-specific risk score (ORS), to account for the differences in infection risk among occupations. The score was computed as: (11) where max {Nhours} is the maximum working hours per week among the 6 occupations, and ϕ is the scaling parameter. The variables CO, PP, EI, and Nhours are described in S1 Table. Because of the limited longitudinal data, our strategy was to validate the individual infection risk model using hypothesized scenarios of different occupational settings. In particular, we made four main assumptions: 1) the individual risk is the same for every individual working under the same conditions (e.g., same occupation), 2) all patients are confirmed cases, i.e., there is only one compartment IC, 3) the probabilities of viral transmission from all patients are the same for each occupation, and 4) the estimated probability of viral transmission for confirmed infectious patients is equal to ORS/max{ORS}, where max{ORS} is the maximum ORS score among the 6 occupations, which guarantees that this probability does not exceed 1. This is the surrogate approach for approximating the transmission probability defined in Eq (5) in the scenario of limited individual-level data. Consequently, Eq (3) is reduced to: (12)

Lastly, the total number of contacts |C(∙)| was fixed to be 5 and the value ϕ was set to 20. Next, the risk was estimated using Eq (12), and the results are summarized in Table 3.
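The surrogate normalization in assumption 4 can be sketched in a few lines of Python. This is a minimal illustration only: the function name `surrogate_transmission_probability` and the ORS values are hypothetical, the scores are assumed to be non-negative, and the numbers are not those behind Table 3.

```python
from typing import Dict


def surrogate_transmission_probability(ors_scores: Dict[str, float]) -> Dict[str, float]:
    """Scale each ORS by the maximum ORS so every value lies in [0, 1].

    Assumes all scores are non-negative and at least one is positive.
    """
    if any(score < 0 for score in ors_scores.values()):
        raise ValueError("ORS values are assumed to be non-negative.")
    max_ors = max(ors_scores.values())
    if max_ors <= 0:
        raise ValueError("At least one ORS must be positive.")
    return {job: score / max_ors for job, score in ors_scores.items()}


# Hypothetical ORS values for illustration only.
example_ors = {"occupation_A": 12.0, "occupation_B": 9.0, "occupation_C": 3.0}
probabilities = surrogate_transmission_probability(example_ors)
for job, prob in probabilities.items():
    print(f"{job}: {prob:.2f}")
```

Dividing by the maximum score keeps the ordering of occupations intact while mapping the highest-scoring occupation to 1.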

Table 3. Estimated individual-level infection risk for six different occupational settings.

The results of the individual-level model indicated a strong positive association between the estimated risk and the virus transmission probability. The three occupations with the highest risk were registered nurses, medical assistants, and respiratory therapists, with associated values of 0.2262, 0.2119, and 0.1575, respectively, which were relatively high for |C(∙)| = 5.

3.2.6. Model validation of the population-level infection risk.

The population-level infection risk was validated against the total number of confirmed COVID-19 cases among HCP reported to the CDC. The number of positive COVID-19 cases among HCP in the US up to April 9, 2020, is presented in Fig 4.

Fig 4. Daily number of laboratory-confirmed positive COVID-19 cases by date of symptom onset of health care personnel and non-health care personnel (N = 43968) in the US from February 12 to April 9, 2020 [47].

According to Fig 4, there was a strong association between the number of positive cases among non-HCP and the number of cases among HCP by date of symptom onset. In addition, the risk of infection among HCP was closely related to the total number of positive tests among HCP and the patient loads that HCP needed to handle. At the population level, we used the following selected features: SOHtime, CS, PPESL, and ORS. These features are described in S1 Table. Based on Eq (10), population-level risk estimation was reduced to a regression equation with equal weights assigned to each variable as: (13) where is the expected value of over the distribution of the variable X and Value(X) is the value set of X, is estimated as: (14)

The population-level infection risk model was validated using the COVID-19 data from health centers in Texas, California and other relevant sources as presented in Sub-section 3.2.2 and Table 1. The accessible HCP COVID-19 data of Texas and California were PPE sufficiency level, the total number of hospitalizations, and the percentage of ICU beds available. So, we assumed the distributions and the expected value of over the other variables to be the same for both states. The expected values of was computed using Eq (14) (see Table 4).

Table 4. Estimated value and distribution of the selected features used in two case studies to estimate the infection risk in Texas and California.

In Table 4, was estimated using information on PPE shortages in health centers in Texas and HCP surveys in California. The value of was estimated by averaging the values of over all occupations. The estimated infection risks for Texas and California were 0.0084 and 0.0132, respectively.
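The two operations used above, taking an expectation over the value set of a discrete variable and averaging a quantity over occupations, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the distribution and per-occupation values are made-up placeholders rather than entries from Table 4, and the sketch does not reproduce Eq (13) or Eq (14).

```python
from typing import Dict, List


def expected_value(distribution: Dict[float, float]) -> float:
    """Return the sum over x of x * P(X = x) for a discrete distribution."""
    total_prob = sum(distribution.values())
    if abs(total_prob - 1.0) > 1e-9:
        raise ValueError("Probabilities must sum to 1.")
    return sum(x * p for x, p in distribution.items())


def mean_over_occupations(values: List[float]) -> float:
    """Equal-weight average of a per-occupation quantity."""
    if not values:
        raise ValueError("At least one value is required.")
    return sum(values) / len(values)


# Hypothetical discrete distribution: value -> probability.
ppe_level_distribution = {0.5: 0.2, 1.0: 0.8}
expected_ppe_level = expected_value(ppe_level_distribution)

# Hypothetical per-occupation values.
occupation_values = [0.2, 0.4, 0.6, 0.8]
average_value = mean_over_occupations(occupation_values)

print(f"Expected value: {expected_ppe_level:.3f}")
print(f"Mean over occupations: {average_value:.3f}")
```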

4. Discussion

Hospital-acquired infections of communicable viral diseases pose a challenge to healthcare workers globally. HCP face a persistent risk of hospital-acquired infection and, consequently, higher rates of morbidity and mortality. Therefore, mitigating and preventing nosocomial infections in hospitals is an urgent and important task in order to lower the risk of HCP contracting CVDs, guarantee adequate availability of PPE, and develop well-informed strategies to protect healthcare workers from contracting CVDs. In this paper, we developed a probabilistic model that characterizes the dynamics of disease transmission in HCP over time, in which the domain-knowledge-driven risk analysis framework can quantify both the individual-level and population-level infection risk. We validated the model at both levels using two main approaches, namely a variance-based sensitivity analysis using simulated data and the COVID-19 case study. The sensitivity analysis indicated that the uncertainty in the HCP infection risk is attributed to two variables: the number of close contacts and the viral transmission probability. The COVID-19 case study showed that the occupations with the highest risk are registered nurses, medical assistants, and respiratory therapists. In addition, the results identified significant risk and protective factors for COVID-19 transmission to HCP.

In our sensitivity analysis, we focused on only two key variables: the viral transmission probability and the number of close contacts between HCP and patients. Specifically, the sensitivity of the infection risk to those input variables was measured by the amount of output variance caused by changing the inputs. We divided the analysis into two parts: 1) the sensitivity of the infection risk to the viral transmission probability and the close contact sequence, and 2) the response surface of the mean and variance of the infection risk with respect to |C(∙)| and Plow. The results revealed that the infection risk increases significantly when the viral transmission probability and the number of contacts increase. In addition, the results of the second part indicated that the mean infection risk quickly converged to one as |C(∙)|→∞, and that the convergence was faster when Plow took higher values. Based on the response surface of the variance, higher values of Plow and |C(∙)| lead to a lower variance; however, the effect of Plow is more pronounced than that of |C(∙)|. The variance approached 0 as |C(∙)|→∞ and dropped to nearly 0 after only four close contacts when Plow = 0.5.
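The qualitative behavior described above, with risk rising toward one as contacts accumulate, can be illustrated with a minimal Python sketch. It assumes independent contacts that each carry the same transmission probability p, so the cumulative risk after n contacts is 1 - (1 - p)^n. This generic independent-exposure form is an assumption made for illustration; it is not necessarily the exact form of Eq (3) or Eq (12) and does not reproduce Table 3. The function name `cumulative_infection_risk` is hypothetical.

```python
def cumulative_infection_risk(p_contact: float, n_contacts: int) -> float:
    """Risk of at least one transmission over n independent, identical contacts.

    Assumed form: 1 - (1 - p)^n.
    """
    if not 0.0 <= p_contact <= 1.0:
        raise ValueError("p_contact must lie in [0, 1].")
    if n_contacts < 0:
        raise ValueError("n_contacts must be non-negative.")
    return 1.0 - (1.0 - p_contact) ** n_contacts


# Cumulative risk for 0 to 5 contacts at two illustrative per-contact probabilities.
for p in (0.1, 0.5):
    risks = [cumulative_infection_risk(p, n) for n in range(6)]
    print(f"p = {p}: " + ", ".join(f"{r:.4f}" for r in risks))
```

With p = 0.5, the cumulative risk under this assumption already exceeds 0.9 after four contacts, and a higher per-contact probability speeds up the approach to one.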

After performing the sensitivity analysis, the logistic regression for estimating the viral transmission probability was validated using a cross-sectional observational study of UK-based healthcare workers. Based on the coefficient estimates in the fitted multivariate logistic regression model, Age, Smoker, Allied_prof, Dental_staff, AGP, PPE_train, and Imp_PPE were protective factors, whereas Cancer, Resp, Obes, Doctor, Lacked_PPE, cont_wo_PPE, Pub_trans, and C_contact were risk factors. Surprisingly, advanced age, being a smoker or ex-smoker within one year, and having regular exposure to aerosol-generating procedures performed on COVID-19 patients decreased the infection risk. These results seem counter-intuitive at first, but they likely reflect confounding, because it was shown that HCP working directly with suspected or confirmed COVID-19 patients tended to be more cautious and self-aware in clinical environments [55]. They therefore had sufficient self-protection and took containment measures, whereas healthcare workers in non-communicable viral disease departments, who were potentially exposed to contagious viruses, did not have sufficient training in using PPE and handling infectious diseases and lacked access to PPE and isolation equipment [56]. The model nevertheless has several limitations. First, because we did not have access to information on HCP contact with patients and coworkers, we treated the estimated viral transmission probability as a measure averaged over all individuals. Second, the data were gathered using surveys and questionnaires, which are subject to selection and recall bias. Third, the use of a composite outcome (including HCP with COVID-19 symptoms, HCP being exposed to risk factors, and lab-confirmed HCP infections) may have resulted in overestimation or underestimation of the infection risk.

We validated the individual-level infection risk model, implemented the model using the two-parameter regressive equation, and estimated the individual risk for six occupations. The results depend strongly on the pre-defined parameters, which can be estimated in healthcare settings when data are available. It was shown that healthcare workers and nurses are frequently in close contact with COVID-19 patients, which increases the risk of acquiring the SARS-CoV-2 virus [57]. Because HCP can acquire infection through various pathways apart from direct patient care, such as exposure to colleagues, family members, or people in the community, the time-varying risk estimation in the model can inform decisions on screening HCP for COVID-19 before workplace entry. The individual risk model can be made more specific to better capture the transmission dynamics, e.g., by incorporating the quantification of indoor airborne infection risks using a probabilistic framework [58]. In addition, we do not assume that recovered patients acquire immunity to reinfection when released from isolation. That is, even patients who have fully recovered from COVID-19 (or any communicable viral disease in general) and returned to the population remain at risk of reinfection with the same strain or other strains. However, reinfection with the same strain is very rare. Hence, HCP who have recovered from the disease might be infected again, although they retain some degree of immunity against subsequent infection. Therefore, we can consider adding a new group of patients to our model called “reinfected patients”. Moreover, the same idea can be applied to the vaccinated population, in which people have vaccine-induced immunity.

For model validation at the population level, we considered two case studies to estimate the risk of infection of HCP in the states of Texas and California. Both states have a high number of lab-confirmed SARS-CoV-2 patients. The average numbers of hospitalizations in Texas and California were 16843 cases/day and 4219 cases/day, respectively. However, the infection risk in Texas was 0.0084, which was lower than the risk in California (0.0132). This was mainly due to the difference in patient load for each HCP per day and the two states’ PPE sufficiency levels. From Table 4, the average PPE sufficiency level in California was only 0.744 as opposed to 0.9355 in Texas, and the average percentage of ICU beds available per 100,000 people in Texas was significantly higher than that in California, which implies heavier patient loads in California. The model also made some important assumptions: 1) close contacts with COVID-19 patients are independent and there is no viral transmission among HCP, and 2) protective/risk factors are well-defined and sufficient to estimate the risk of infection.

5. Conclusion and future work

The paper proposed a time-variant infection risk analysis model to characterize the dynamics of the disease infection risk in HCP over time and a domain-knowledge-driven infection risk framework to quantify the complexities of HCP’s risk of CVDs in healthcare settings. The infection risk analysis model for HCP was estimated at both individual and population levels. The individual-level risk model was built on the population grouping concept of the well-established epidemiological SEIR model, with consideration of time-varying confounders to capture the dynamic transmission mechanism of contagious diseases. At the population level, three subsets of features were constructed and represented by a Bayesian network, from which the probability of viral transmission from patients to HCP was estimated. To validate our methods, we incorporated data from multiple sources in the US, the UK, and Taiwan for the COVID-19 case study, which contain information about potential factors that affect the COVID-19 transmission mechanism, together with domain knowledge of similar contagious diseases such as SARS and MERS from relevant studies, to estimate the risk of COVID-19 infection of HCP. For individual-level risk estimation, the model was founded on the SEIR compartmental model and developed into an occupation-specific and individualized infection risk model. As a result, the model can capture the infection risk varying over time under the control of those individual time-varying confounders, and it is also able to account for the intrinsic stochastic transmission mechanisms. At the population level, the Bayesian network formalism can accommodate the limited data scenario, and it can update the parameters when more data are available. The results from the two case studies are interpretable at the population level and showed that the infection risk in California is higher than in Texas because of heavier patient loads and a shortage of PPE.
The major limitation of the CDC’s interim guideline for risk assessment, namely that it is inadequate for quantifying individualized infection risk in HCP, has been addressed by our model. The model would support PPE allocation and safety plans for HCP and enhance crisis-level staffing strategies in facilities with staffing shortages. Longitudinal experimental designs are required to collect more COVID-19 data among HCP to validate the proposed model properly. Future work would involve: 1) validating the model assumptions when sufficient data are available, 2) modifying and reformulating the model if the assumptions are violated (e.g., the independence assumption and a new vaccinated or “reinfected” population), and 3) validating the model with other related case studies of communicable viral diseases.

Summary table

What was already known on the topic

  • Hospital-acquired infections of communicable viral diseases pose a challenge to healthcare workers globally.
  • Healthcare personnel (HCP) face a persistent risk of hospital-acquired infection and, consequently, higher rates of morbidity and mortality.
  • Mitigating and preventing nosocomial infections in hospitals is an urgent and important task to lower the risk of contracting CVDs for HCP.
  • Previous mathematical models and statistical methods to model the nosocomial infection risk in HCP fail to capture the time-dependent disease transmission processes and the effects of individual risk and protective factors.

What this study added to our knowledge

  • Our proposed probabilistic model characterizes the dynamics of the disease transmission in HCP over time.
  • The knowledge-driven risk analysis framework can quantify both the individual-level and population-level infection risk.
  • The sensitivity analyses indicated that the uncertainty in the HCP infection risk is attributed to 2 variables: the number of close contacts and the viral transmission probability.
  • The COVID-19 case study showed that the occupations with the highest risk are registered nurses, medical assistants, and respiratory therapists.
  • The results indicated that age, smoking status, occupation, PPE use, using public transport, close contacts with patients, and having regular exposure to aerosol-generating procedures are significant factors.

Supporting information

S1 Table. Characteristics of the selected features and their associated databases.


S1 File. The expectation and variance of the potential individual risk.



  1. Iversen K., et al., Risk of COVID-19 in health-care workers in Denmark: an observational cohort study. The Lancet Infectious Diseases, 2020. 20(12): p. 1401–1408.
  2. Mutambudzi M., et al., Occupation and risk of severe COVID-19: prospective cohort study of 120 075 UK Biobank participants. Occupational and Environmental Medicine, 2020. pmid:33298533
  3. Baker M.G., Peckham T.K., and Seixas N.S., Estimating the burden of United States workers exposed to infection or disease: a key factor in containing risk of COVID-19 infection. PloS one, 2020. 15(4): p. e0232452. pmid:32343747
  4. McDougal A.N., et al., Outbreak of coronavirus disease 2019 (COVID-19) among operating room staff of a tertiary referral center: An epidemiologic and environmental investigation. Infection Control & Hospital Epidemiology, 2021: p. 1–7.
  5. Khatib A.N., et al., Navigating the risks of flying during COVID-19: a review for safe air travel. Journal of travel medicine, 2020. 27(8): p. taaa212.
  6. Cooper B.S., Confronting models with data. Journal of Hospital Infection, 2007. 65: p. 88–92.
  7. Grundmann H. and Hellriegel B., Mathematical modelling: a tool for hospital infection control. The Lancet infectious diseases, 2006. 6(1): p. 39–45.
  8. Brauer F., Compartmental models in epidemiology, in Mathematical epidemiology. 2008, Springer. p. 19–79.
  9. Di Stefano B., Fuks H., and Lawniczak A.T. Object-oriented implementation of CA/LGCA modelling applied to the spread of epidemics. in 2000 Canadian Conference on Electrical and Computer Engineering. Conference Proceedings. Navigating to a New Era (Cat. No. 00TH8492). 2000. IEEE.
  10. Lawson A.B., Statistical methods in spatial epidemiology. 2013: John Wiley & Sons.
  11. Lessler J., et al., Trends in the mechanistic and dynamic modeling of infectious diseases. Current Epidemiology Reports, 2016. 3(3): p. 212–222. pmid:32226711
  12. Anderson R.M., The population dynamics of infectious diseases: theory and applications. 2013: Springer.
  13. Banos A., et al., The importance of being hybrid for spatial epidemic models: a multi-scale approach. Systems, 2015. 3(4): p. 309–329.
  14. Lanzas C. and Chen S., Complex system modelling for veterinary epidemiology. Preventive veterinary medicine, 2015. 118(2–3): p. 207–214. pmid:25449734
  15. Martínez‐López B., Perez A., and Sánchez‐Vizcaíno J., Social network analysis. Review of general concepts and use in preventive veterinary medicine. Transboundary and emerging diseases, 2009. 56(4): p. 109–120. pmid:19341388
  16. Bansal S., Grenfell B.T., and Meyers L.A., When individual behaviour matters: homogeneous and network models in epidemiology. Journal of the Royal Society Interface, 2007. 4(16): p. 879–891.
  17. Sirakoulis G.C., Karafyllidis I., and Thanailakis A., A cellular automaton model for the effects of population movement and vaccination on epidemic propagation. Ecological Modelling, 2000. 133(3): p. 209–223.
  18. Zhen J. and Quan-Xing L., A cellular automata model of epidemics of a heterogeneous susceptibility. Chinese Physics, 2006. 15(6): p. 1248.
  19. Casalicchio E., Galli E., and Tucci S., Agent-based modelling of interdependent critical infrastructures. International Journal of System of Systems Engineering, 2010. 2(1): p. 60–75.
  20. Perez L. and Dragicevic S., An agent-based approach for modeling dynamics of contagious disease spread. International journal of health geographics, 2009. 8(1): p. 1–17.
  21. Voirin N., et al., A multiplicative hazard regression model to assess the risk of disease transmission at hospital during community epidemics. BMC medical research methodology, 2011. 11(1): p. 1–8. pmid:21507247
  22. Chu D.K., et al., Physical distancing, face masks, and eye protection to prevent person-to-person transmission of SARS-CoV-2 and COVID-19: a systematic review and meta-analysis. The lancet, 2020. 395(10242): p. 1973–1987.
  23. Wang Q., et al., Epidemiological characteristics of COVID-19 in medical staff members of neurosurgery departments in Hubei province: a multicentre descriptive study. medRxiv, 2020.
  24. Eyre D.W., et al., Differential occupational risks to healthcare workers from SARS-CoV-2 observed during a prospective observational study. Elife, 2020. 9: p. e60675. pmid:32820721
  25. Ki H.K., et al., Risk of transmission via medical employees and importance of routine infection-prevention policy in a nosocomial outbreak of Middle East respiratory syndrome (MERS): a descriptive analysis from a tertiary care hospital in South Korea. BMC pulmonary medicine, 2019. 19(1): p. 1–12.
  26. Nguyen L.H., et al., Risk of COVID-19 among front-line health-care workers and the general community: a prospective cohort study. The Lancet Public Health, 2020. 5(9): p. e475–e483.
  27. Shah A.S., et al., Risk of hospital admission with coronavirus disease 2019 in healthcare workers and their households: nationwide linkage cohort study. bmj, 2020. 371.
  28. Gupta H.D. and Sheng V.S. A Roadmap to Domain Knowledge Integration in Machine Learning. in 2020 IEEE International Conference on Knowledge Graph (ICKG). 2020. IEEE.
  29. Friedman N., Geiger D., and Goldszmidt M., Bayesian network classifiers. Machine learning, 1997. 29(2–3): p. 131–163.
  30. Chou R., et al., Epidemiology of and risk factors for coronavirus infection in health care workers: a living rapid review. Annals of internal medicine, 2020. 173(2): p. 120–136.
  31. Kallenberg O., Random measures, theory and applications. Vol. 1. 2017: Springer.
  32. Huynh P.K., et al., Probabilistic domain-knowledge modeling of disorder pathogenesis for dynamics forecasting of acute onset. Artificial Intelligence in Medicine, 2021. 115: p. 102056.
  33. Jensen F.V., Bayesian networks. Wiley Interdisciplinary Reviews: Computational Statistics, 2009. 1(3): p. 307–315.
  34. Mo Y., et al., Transmission of community-and hospital-acquired SARS-CoV-2 in hospital settings in the UK: A cohort study. PLoS medicine, 2021. 18(10): p. e1003816. pmid:34637439
  35. Lan F.-Y., et al., COVID-19 symptoms predictive of healthcare workers’ SARS-CoV-2 PCR results. PloS one, 2020. 15(6): p. e0235460. pmid:32589687
  36. Chen Q., Allot A., and Lu Z., Keep up with the latest coronavirus research. Nature, 2020. 579(7798): p. 193–193. pmid:32157233
  37. Raboud J., et al., Risk factors for SARS transmission from patients requiring intubation: a multicentre investigation in Toronto, Canada. PLoS One, 2010. 5(5): p. e10717. pmid:20502660
  38. Caputo K.M., et al., Intubation of SARS patients: infection and perspectives of healthcare workers. Canadian Journal of Anesthesia, 2006. 53(2): p. 122.
  39. Chen W.-Q., et al., Which preventive measures might protect health care workers from SARS? BMC Public Health, 2009. 9(1): p. 1–8. pmid:19284644
  40. Le Dang Ha S.A.B., et al., Lack of SARS transmission among public hospital workers, Vietnam. Emerging infectious diseases, 2004. 10(2): p. 265.
  41. Alraddadi B.M., et al., Risk factors for Middle East respiratory syndrome coronavirus infection among healthcare personnel. Emerging infectious diseases, 2016. 22(11): p. 1915.
  42. Hall A.J., et al., Health care worker contact with MERS patient, Saudi Arabia. Emerging infectious diseases, 2014. 20(12): p. 2148.
  43. Bai Y., et al., SARS-CoV-2 infection in health care workers: a retrospective analysis and a model study. medRxiv, 2020.
  44. Heinzerling A., et al., Transmission of COVID-19 to health care personnel during exposures to a hospitalized patient—Solano County, California, February 2020. 2020.
  45. Mutambudzi M., et al., Occupation and risk of severe COVID-19: prospective cohort study of 120 075 UK Biobank participants. Occupational and Environmental Medicine, 2021. 78(5): p. 307–314.
  46. Lan F.-Y., et al., Work-related COVID-19 transmission in six Asian countries/areas: a follow-up study. PloS one, 2020. 15(5): p. e0233588. pmid:32428031
  47. COVID T.C., Characteristics of Health Care Personnel with COVID-19-United States, February 12-April 9, 2020. 2020.
  48. Cheng H.-Y., et al., Contact tracing assessment of COVID-19 transmission dynamics in Taiwan and risk at different exposure periods before and after symptom onset. JAMA internal medicine, 2020.
  49. California Health Care Foundation, California COVID-19 Health Surveys: Data and Charts. April 1, 2020; Available from:
  50. Health Resources & Services Administration, Texas Health Center COVID-19 Survey Summary Report. Oct 7th, 2020; Available from:
  51. O*Net database. Nov 16th, 2020; Available from:
  52. Garg S., Hospitalization rates and characteristics of patients hospitalized with laboratory-confirmed coronavirus disease 2019—COVID-NET, 14 States, March 1–30, 2020. MMWR. Morbidity and mortality weekly report, 2020. 69. pmid:32298251
  53. Texas COVID-19 Data. Apr 29th, 2021; Available from:
  54. Kua J., et al., healthcareCOVID: a national cross-sectional observational study identifying risk factors for developing suspected or confirmed COVID-19 in UK healthcare workers. PeerJ, 2021. 9: p. e10891. pmid:33604201
  55. Du Q., et al., Nosocomial infection of COVID‑19: A new challenge for healthcare professionals. International Journal of Molecular Medicine, 2021. 47(4): p. 1–1.
  56. McMichael T.M., et al., Epidemiology of Covid-19 in a long-term care facility in King County, Washington. New England Journal of Medicine, 2020. 382(21): p. 2005–2011. pmid:32220208
  57. Hughes M.M., et al., Update: characteristics of health care personnel with COVID-19—United States, February 12–July 16, 2020. Morbidity and Mortality Weekly Report, 2020. 69(38): p. 1364. pmid:32970661
  58. Liao C.M., Chang C.F., and Liang H.M., A probabilistic transmission dynamic model to assess indoor airborne infection risks. Risk Analysis: An International Journal, 2005. 25(5): p. 1097–1107.