High variability in transmission of SARS-CoV-2 within households and implications for control

Background Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) poses a high risk of transmission in close-contact indoor settings, which may include households. Prior studies have found a wide range of household secondary attack rates and may contain biases due to simplifying assumptions about transmission variability and test accuracy. Methods We compiled serological SARS-CoV-2 antibody test data and prior SARS-CoV-2 test reporting from members of 9,224 Utah households. We paired these data with a probabilistic model of household importation and transmission. We calculated a maximum likelihood estimate of the importation probability, mean and variability of household transmission probability, and sensitivity and specificity of test data. Given our household transmission estimates, we estimated the threshold of non-household transmission required for epidemic growth in the population. Results We estimated that individuals in our study households had a 0.41% (95% CI 0.32%– 0.51%) chance of acquiring SARS-CoV-2 infection outside their household. Our household secondary attack rate estimate was 36% (27%– 48%), substantially higher than the crude estimate of 16% unadjusted for imperfect serological test specificity and other factors. We found evidence for high variability in individual transmissibility, with higher probability of no transmissions or many transmissions compared to standard models. With household transmission at our estimates, the average number of non-household transmissions per case must be kept below 0.41 (0.33–0.52) to avoid continued growth of the pandemic in Utah. Conclusions Our findings suggest that crude estimates of household secondary attack rate based on serology data without accounting for false positive tests may underestimate the true average transmissibility, even when test specificity is high. Our finding of potential high variability (overdispersion) in transmissibility of infected individuals is consistent with characterizing SARS-CoV-2 transmission being largely driven by superspreading from a minority of infected individuals. Mitigation efforts targeting large households and other locations where many people congregate indoors might curb continued spread of the virus.


Introduction
Since its emergence in 2019, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus responsible for COVID-19, has spread rapidly, causing severe morbidity, mortality, and disruption to daily life. As public health officials continue grappling with reducing community spread, it is of increased importance to understand transmission risk in different locations where people mix. Transmission within households may be especially important, given the mounting evidence that indoor environments with close, sustained contact are especially high risk for SARS-CoV-2 transmission [1][2][3]. Furthermore, with substantial observed decreases in mobility during the pandemic [4], individuals likely are spending a greater proportion of time at home, thus increasing the importance of understanding within-household transmission. Likewise, isolation and quarantine measures recommended to help control COVID-19 frequently occur within homes, increasing risk to susceptible household members [5].
Data collected from members of households with at least one person infected with SARS-CoV-2 have revealed a wide range of within-household transmission estimates. One systematic review and meta-analysis [6] found 24 studies with household data conducted from January-March 2020, mostly in China, with secondary attack rate estimates ranging from 5% to 90% in the individual studies; pooling these data led to an average secondary attack rate estimate of 27% (95% CI: 21%-32%). Another published review and meta-analysis of more recent data found 22 studies on the secondary attack rate in households, including estimates ranging from 4% to 32% [7]. Pooling these studies, the review found an average secondary attack rate of 17.1% (95% CI: 13.7%-21.2%). Another review and meta-analysis found 40 household studies with individual study estimates ranging from 4% to 45% [8]. Their pooled analysis found that the household-based secondary attack rate for all household contacts was 19.0% (95% CI: 14.9 -23.1%). Data from households in the U.S. [9][10][11][12] produced secondary attack rate estimates from 11% to 53%.
Most household studies generated data by first identifying index household cases via active or passive surveillance followed by monitoring and testing specimens from their household contacts using PCR or other methods that detect presence of the virus. These studies may exhibit bias if mild or asymptomatic cases were less likely to be identified as an index household case. By contrast, data for the presence of antibodies among household members provide information on the distribution of final sizes of household outbreaks no longer in progress and in which some or none of the cases were identified at the time. We are aware of only 3 studies that used serological antibody data to estimate household transmission, using data from Spain [13], Brazil [14], and Switzerland [15].
In addition to average transmission rates, heterogeneity and variability in SARS-CoV-2 transmission have also been quantified. The amount of individual-level variation in the number of secondary infections can affect final outbreak size [16]. Large variation (i.e., overdispersion) indicates the presence of superspreading by a minority of individuals who transmit to a disproportionately large number of others [17]. Better understanding of superspreading individuals and locations can greatly enhance efficient targeting of transmission control strategies [18]. Backward contact tracing can efficiently trace sources of acquisition to high-transmission individuals and circumstances when superspreading is present [19], and efforts that target similar circumstances for transmission prevention can have disproportionate benefits [20,21].
Studies have quantified the variability in the number of SARS-CoV-2 transmissions from infected individuals using the dispersion parameter k, governing the variance of a negative binomially distributed offspring distribution [22][23][24][25][26]. Those studies estimated high overdispersion (low values of k) similar to what was observed during the first SARS-CoV outbreak in 2003 [17]. These estimates were derived from data on transmissions, including superspreading events, occurring in a variety of locations both inside and outside of households. Regarding household transmission specifically, Madewell et al. [8] showed preliminary evidence of overdispersion in household data, with more households than expected experiencing extremes of transmission (i.e., either no transmission or many transmissions) from an introduced case.
In this study, we combine SARS-CoV-2 data from serological antibody tests and selfreported prior tests to estimate within-household transmission of COVID-19 in Utah. Previously published secondary attack rate estimates are largely based on crude formulae which ignore the probabilities of multiple members of a household acquiring infection from the community, multiple generations of transmission within the household (i.e. secondary, tertiary, etc. transmissions), and imperfect test sensitivity and specificity. We addressed these limitations by extending previous models of final household outbreak size distributions [27] to develop a novel probabilistic model of household importation and household transmission combined with test sensitivity and specificity. Our model also quantifies variability in household transmission and the potential extent of overdispersion, to shed light on superspreading phenomena and the implications of household transmission for population-level controllability of COVID-19.

Data collection from Utah households
Details of our data collection process are described elsewhere [28]. Briefly, the Utah Health & Economic Recovery Outreach project involved selecting households in several counties in Utah by population sampling designed to form a set of households by which average community seroprevalence could be assessed. Any member of selected households could participate in a survey that included questions about prior SARS-CoV-2 test results (see Supplementary Methods in S1 File for wording of relevant survey questions). Adult household members could fill out surveys on behalf of children of any age in the household. Survey participants age 12 or older could additionally opt to provide serological samples for COVID-19 antibody testing. Serum specimens were analyzed using the Abbott SARS-CoV-2 IgG assay performed on an Abbott Architect i2000 instrument (Abbott Laboratories), with methodology and criteria for a positive antibody result defined according to the manufacturer's instructions. Data included in this analysis were collected between May 4 and August 15, 2020.
The University of Utah Institutional Review Board reviewed the surveillance project that produced the data analyzed in this manuscript and determined it as non-research public health surveillance, waived the requirement for documented consent, and determined that use of these data for analysis to understand the dynamics of SARS-CoV-2 transmission was exempt from further review (IRB_00132598). Individuals were informed of the project procedures and that participation was voluntary. Participants provided their agreement to participate and were given the chance to opt out of having their data used for future research. The data were analyzed anonymously for this manuscript.
The data are represented as follows. For each household in the dataset, we captured the following 7 values from the data: • n: total number of people in household Those surveyed participants who reported no prior positive test result includes both those who had never been tested and those who had been tested but received no positive results. We did not have sufficient information to properly distinguish those two groups, nor to determine the circumstances of any prior negative tests that might affect the inferred probability of true prior infection.
Each of the C unique combinations of the above 7 values found at least once in the dataset was indexed as a vector y i : We tallied the number of households for which each y i occurred in the frequency elements f i , and represented the entire dataset by the vector y = (y 1 ,. . .,y C ,f 1 ,. . .f C ).
The dataset y and all codes, written in R version 4.0.3, used for analyses described in the following sections are posted and publicly available at https://github.com/damontoth/ householdTransmission.

Total household infection size model
Here we derive the probabilities M kn for the probability that k out of n total household members ended up infected. If k members of a size-n household were infected, that means that n−k members escaped being infected by a non-household member (called "community" acquisitions) and escaped being infected by any of the n infected within the household. Thus, our model for M kn combines both probabilities and does not depend on the order of occurrence of household transmissions and subsequent community acquisitions after the initial one, as in similar prior formulations [27]. Also following prior formulations, we assume that active infections were not present in the households at the time of antibody data collection (i.e., that household outbreaks had reached final size). Accounting for the timing of recent household importations, transmissions, and development of detectable antibodies during an ongoing household outbreak would significantly complicate the model equations and would likely have little effect on our overall results, given that the prevalence of active infections at the time of data collection was very low [28].
The M kn values depend on 3 parameters. The parameter p c is the average per-capita probability of community acquisition, p h is the mean transmission probability from an infected person to a fellow household member, and d h is the dispersion parameter characterizing variability in transmissibility across infected individuals, with no assumed correlation among members of the same household.
For a given household size n�2, the formula for M kn is: For households of size n = 1, note that the expression involving the household transmission parameters does not apply and we have M 01 = 1−p c and M 11 = p c .
The probability that a household of size n had 0 infections: M 0n = (1−p c ) n , is the probability that none of the household members acquired infection from the community and does not depend on the household transmission variables because no household transmissions were possible without a community introduction. For the final number of household infections to be nonzero, there must be at least one community acquisition, which may be followed by within-household transmissions. The n i ! ðp c Þ i ð1 À p c Þ nÀ i expression is the binomial probability that i out of the n household members had a community acquisition, and the function T xyz is the probability that x already infected household members lead to a total of y transmissions to z susceptible household members. In other words, T xyz is the probability that the final outbreak size is x+y, given that x household members are already infected in a house with z susceptible members. For efficiency of computation, the T xyz values are calculated in order of increasing values of y, i.e. T x0z for each relevant x and z value are calculated first, then the T x1z values, then T x2z . This allows the use of T xyz values for lower values of y to be used in the formula (see S1 File for details): Within the T xyz formula, the function H xyz is the probability that x infected household members transmit infection directly to y out of z fellow household members who are susceptible. The H xyz values are calculated in order of increasing values of x for efficient computation (see S1 File): Finally, the function F yz (p, d) is the probability mass function of the beta-binomial distribution for y successes out of z trials, parameterized by a mean success probability p and a dispersion parameter d. When d is finite and nonzero, F yz is derived from the binomial distribution with success probability that is a beta-distributed random variable with parameters α = dp, β = d (1−p), with decreasing variance as d increases. We also make use of the boundary cases d = 0 and d!1. In the limit d!1, holding p constant, F yz becomes the binomial distribution with constant success probability p (S1 File). In the maximal variance limit, d!0, with p held constant, F yz becomes an "all-or-nothing" distribution where y = z successes occur with probability p and to y = 0 successes occur with probability 1−p (S1 File): The function B is the beta function. We use F yz within the formula for H 1yz to quantify the distribution of household transmissions directly from a single infected household member, where y is the number of transmissions, z is the number of susceptible household members, p = p h , and d = d h .
The above formulas are derived in the S1 File. Elements of this model appear in other publications. Longini and Koopman [27] derived a formula for M kn for the model with no variability among households or individuals, equivalent to our model with d h !1. While they provided a more efficient formula that takes advantage of the properties of that special case, we confirmed that our calculation scheme above reproduces the results of their formula. Becker [29] published explicit formulas for the final size of household outbreaks after a single introduction to households up to size 5 using the beta-binomial chain model, equivalent to our T xyz for x = 1 and z up to 4. We confirmed that our scheme for calculating T xyz produces the same results as their example formulas for arbitrary values of p h and d h .

Likelihood model
We sought to use our data to simultaneously estimate the 3 parameters (p c , p h , d h ) using maximum likelihood estimation (MLE). However, applying the M kn formula directly to our data would be problematic because the true number of infections k in each household are not known with certainty. The data include two sources of COVID-19 test information by which prior infection status of a portion of individual household members can be probabilistically inferred: antibody test results and surveys in which participants could report results of a prior test.
Antibody test results are subject to imperfect sensitivity and specificity due to false negative tests and false positive tests, respectively. To account for these, we added two additional parameters to be estimated by the MLE: ϕ A , the probability that an antibody-tested person with a prior infection tested positive for antibodies, and π A , the probability that an antibodytested person with no prior infection tested negative for antibodies.
Prior test results for SARS-CoV-2 reported on the survey also do not perfectly identify those with prior infections. To quantify this imperfection, we introduced two more parameters to be estimated by the MLE: ϕ V , the probability that a surveyed person with a prior infection reported receiving a positive test for the virus, and π V , the probability that a surveyed person with no prior infection did not report receiving a positive test.
Some household members received a survey but no antibody test and other members received neither. The M kn formula depends on the total household size n, which for many households includes individuals with missing data. For households with at least one but not all members infected (1�k�n−1) and in which less than n member were full participants, the likelihood formula required the probability that different portions of the k infected members were among those who were antibody tested or surveyed only. To arrive at our formula, we assumed that the antibody-tested and surveyed-only portion of a household were a random sample of household members with respect to their prior infection status. I.e., we assumed that those individuals in a participating household with and without prior infections were equally likely to participate in the study and equally likely to agree to antibody testing.
In all we have 7 variables to be estimated by MLE, encapsulated in the following vector θ: The log likelihood of the dataset y described in Section 2.1 with variable set θ is then To present the formula for Lðy i jθÞ, the likelihood of a particular y i , we first define the following quantities calculated from the core elements of y i listed in Section 2.1: In the formula, the function A quantifies the probability of observing the given set of test result combinations among antibody-tested people (a PPi , a PNi , a NPi , a NNi ), given that (u, v, w, x) of them had a prior infection, respectively. E.g., u is the number of the a PPi household member who had an infection (true positives), v is the number of the a PNi household members who had an infection (true positive by prior test and false negative by antibody test), w is the number of the a NPi household members who had an infection (true positive by antibody test and did not report a prior positive test), and x is the number of the a NNi household members who had an infection (false negative by antibody test and did not report a prior positive test). The formula for A is The function f m (r; p) is the probability mass function for the multinomial distribution, where the number of trials is the sum of the elements of r, which are the number of infected or uninfected antibody-tested people who received each of the four possible test result combinations. The vector p contains the probability of each of the four test result combinations given that the person was infected (for p = p I ) or uninfected (for p = p U ): The first element of p I , ϕ V ϕ A , is the probability that an antibody-tested person with a prior infection reported a prior positive test (with probability ϕ V ) and also had a positive antibody test result (with probability ϕ A ). Note that ϕ A represents the sensitivity of the antibody test, but ϕ V includes both the sensitivity of the prior test and the probability that an infected person actually sought and received a SARS-CoV-2 test during the period of infection in which detectable virus was present and reported that positive test on our survey. Elements 2-4 of p I are the probabilities that an antibody-tested, prior infected person reported a prior positive test but tested negative for antibodies, did not report a prior positive test and tested positive for antibodies, and did not report a prior positive test and tested negative for antibodies, respectively. The elements of p U are the corresponding probabilities for individuals with no prior infection.
The function S quantifies the probability of the survey-only data (s Pi , s Ni ) given that y of the s Pi individuals had a prior infection and z of the s Ni individuals had a prior infection: The function f b (q;r,p) is the probability mass function for the binomial distribution, for q successes given that there were r independent trials with probability p for success of each trial.
The function H(k, k a , k s ) in the likelihood equation is the probability that, when k of n i individuals in the household were infected, k a infected individuals were among the a i individuals antibody tested and k s infected individuals were among the s i individuals surveyed but not antibody tested: is the probability mass function of the hypergeometric distribution for the number b of infected people selecting to be antibody-tested or surveyed-only, given that there were c infected people and d uninfected people available for selection in the household, and e people were tested or surveyed-only. These terms account for individuals in households who received neither an antibody test nor a survey, who may have included infected individuals. Our use of the hypergeometric distribution led from our assumption that, if some members of the household had a prior infection and others didn't, the antibody-tested / surveyed individuals were a random sample from the household with respect to their prior infection status.

Likelihood optimization and uncertainty
We maximized the log likelihood over the 7 unknown parameters (p c , p h , d h , ϕ V , ϕ A , π V , π A ) using the observations (n, a, s, a PP , a PN , a NP , s P ) for each household, to produce the MLE: The log likelihood maximization was performed using the "optim" function in R. We derived approximate confidence interval boundaries for an individual parameter θ i using the likelihood ratio test, using the statistic 2 logðLðŷÞ=LðyÞÞ, where θ consists of θ i freely varying and the other 6 elements of θ held at their optimal value. We defined a 95% confidence interval boundary where θ i produces a value for this statistic equal to the 95 th percentile of the chi-squared distribution with 1 degree of freedom. We also plotted 2-dimensional confidence region boundaries for each of the 21 possible (θ i , θ j ) parameter pairs by allowing each pair to vary freely together while holding the other 5 at their optimal values. We calculated the boundary in the (θ i , θ j ) parameter plane where the likelihood ratio statistic equals the 95 th percentile of the chi-squared distribution with 2 degrees of freedom. To calculate P-values at which certain fixed parameter values could be rejected in favor of the MLE, we used the chi-squared distribution with degrees of freedom equal to the number of fixed parameters.
Additionally, we developed a simulation model to produce synthetic data sets on which to test our likelihood model. We ran the simulation for the same number of households with the same sizes and participation rates for survey and antibody testing as in the actual data (fixed values of n, a, and s for each household). We randomized importations to households and simulated transmissions using the MLE values of the three epidemiological parameters p c , p h , and d h , randomized survey and antibody test results using the MLE sensitivity and specificity values, and maximized the likelihood against the simulated data. We repeated this process for 500 simulated data sets and recorded the median estimated value of each variable, for comparison against the MLE value that generated the data. We also used the 500 sets of simulation-based estimates as a parametric bootstrap to generate 95% confidence estimates for each variable, for comparison against the intervals generated from the likelihood ratio test.
We also compared the performance of our 7-parameter model against simpler models with fewer free parameters, including models assuming d h = 0 and d h !1, models fixing the sensitivity and specificity of the antibody test according to independent data from the test manufacturer (ϕ A = 89.3% and/or π A =99.6%) [32], and models assuming perfect specificity of prior test result reporting or antibody testing (π V = 100% or π A = 100%). We tested each of these assumptions individually and in various combinations, resulting in models ranging from 3 to 6 optimized parameters. We compared the models using the Akaike information criterion, which aims to balance goodness of fit with model simplicity.
Finally, we tested an alternate model that allows the community acquisition probability to vary by household, such that some households may have a higher per-capita acquisition rate than others applied to each household member. To quantify this probability in the alternate model, we employed the beta-binomial distribution for the number of community acquisitions in a household of a given size (see Supplementary Methods in S1 File).

Household transmission variability
We quantified the implications of our household transmission variability estimates by calculating the probability of transmission extremes, compared to those produced by the classic binomial transmission model (d h = 1). Specifically, we calculated the probability that an initially infected individual transmits to no one or everyone in households of sizes from 2 to 10. For households of size n, the probability of no transmissions from the index infection is F 0,n−1 (p h , d h ) and the probability the index person transmits directly to the entire household is We also calculated an example of a dynamic transmission model that produces a distribution of household transmission probabilities close to that produced by our MLE beta distribution, using the method of moments. Specifically, if an infected person's duration of infectiousness is assumed to be fixed and transmissibility to a housemate is modeled as a gamma distribution with shape parameter k, then we solve for the value k that produces the same mean and variance for the transmission probability as that of the beta distribution with mean p h and dispersion d h (Supplementary Methods in S1 File). We solved for k using our MLEp h andd h values, and we derived a confidence interval for k using the pairs of (p h , d h ) estimates from our parametric bootstrap analysis.

Within-household reproduction number
We calculated the within-household reproduction number R h , defined as the expected number of household transmissions directly from a community acquirer with all fellow household members susceptible: where μ and σ 2 are the mean and variance of the household size distribution, and p h is the secondary attack rate as determined by our MLE. This equation for R h is derived as in Ball et al. [30] and detailed in the S1 File.
Additionally, we derived an alternate household reproduction number R � h defined as the expected total number of transmissions in the household of an infected person who acquired infection in the community and has no initially non-susceptible housemates. This differs from R h in that it counts all potential downstream transmissions in the household stemming from the index community acquirer. The formula for R � h , derived in the S1 File, is where h i is the fraction of all households that are size i.
To investigate the implications of household transmission for population-wide transmission control, we use a threshold condition delineating subcritical and supercritical transmission in the population. Supercritical transmission occurs when R c ðR � h þ 1Þ > 1, where R c is the average number of community (non-household) transmissions from an infected person. We derive this formula in the S1 File, following Ball et al. [30]. We estimated R h , R � h , and the threshold value for R c by applying our MLE estimates of p h and d h to the above formulas and their confidence intervals by applying the (p h , d h ) pairs from each parametric bootstrap estimate.

Data summary
We compiled data from 9,383 households (Fig 1). Of these, we retained 9,224 (98.3%) for use in the MLE. The 159 excluded households were removed because the household size was unknown (51) or the reported household size was less than the number of people tested or surveyed in the house (108). In the 9,224 retained households, there were 28,321 (3.07 per household) reported household members, 13,998 (1.52 per household) people who were both surveyed and antibody tested, and another 5,249 (0.57 per household) who were surveyed but not antibody tested. The households in the data were located in 7 of the 29 counties in Utah; the 22 excluded counties account for <14% of Utah's total population S1 Table in S1 File.
Of the 13,998 antibody tests in the retained households, 178 (1.27%) were positive. Of those 178 people with a positive antibody test, 58 (32.6%) reported receiving a prior positive test. Of the 19,247 people who were antibody tested or surveyed only, 119 (0.62%) reported receiving a prior positive test. This broke down to 0.53% (75 / 13,998) for those who were antibody tested and 0.84% (44 / 5,249) for those who were surveyed but not antibody tested. The rate of testing positive for antibodies among those reporting a prior positive test was 77.3% (58 / 75). The interval between the reported prior positive test date and the antibody test date did not exhibit a strong correlation to the fraction of testing antibody positive, other than perhaps the 3 individuals reporting a very recent (less than 1 week) positive test all testing negative for antibodies (S2 Table in S1 File). The rate of survey participants agreeing to antibody testing was lower for those who reported a prior positivetest compared to those who did not: 63.0% (75 of 119) vs. 72.8% (13,923 of 19,128), a small but statistically significant (P < 0.01) difference in proportion.
Of the retained households, 193 (2.1%) had at least one household member who either tested positive for antibodies or reported a prior positive test. There were 159 households with exactly 1 positive member (by either antibody test or reported prior test or both), 26 households with 2 positives, 6 with 3 positives, 1 with 4 positives, and 1 with 6 positives. In all, there were C = 273 unique y i vectors representing household data described in section 2.1.
The crude secondary attack rate measure derived from antibody testing only (fraction of antibody-tested housemates of antibody-positive household members who were also antibody positive) was 14.9% (29 / 194). The crude secondary attack rate estimate from reported prior test data only (fraction of surveyed housemates of people reporting a prior positive test who also reported a prior positive test) was 23.0% (31 / 135). When combining both types of data, the crude secondary attack rate estimate (fraction of surveyed / tested housemates of any antibody-positive or reported-prior-positive person who were positive by either or both measures) was 15.6% (46 / 295).
We tallied demographic statistics of the set of surveyed individuals (S3 Table in S1 File). The distribution of reported ages skewed older than Utah's overall population age distribution, and females were slightly overrepresented (52.0%). The distribution of surveyed individuals' race, Hispanic origin, and education level also differed from the overall Utah and U.S. distributions.

Maximum likelihood estimates
Our MLE procedure produced simultaneous estimates for all 7 parameters (Table 1). The MLE for p c , the per-person community acquisition probability from outside the household, was 0.41% (0.32%-0.51%). For within household transmission probability, the MLE produced an average secondary attack rate estimate p h = 36% (27%-48%). The MLE for the dispersion  [27], could be rejected with P = 0.001 (Table 2). Our MLE result for ϕ V , the probability that a surveyed person with a prior infection reported a prior positive test, was 72% (62%-82%). The ϕ V value can be interpreted as the case ascertainment fraction, i.e. fraction of individuals with SARS-CoV infections who were identified with a positive test during their infection. Our result may be high compared to other areas of the U.S.: one study estimated that less than 60% of symptomatic cases in the U.S. were identified during February-June 2020 [31]. Our finding may reflect unusually successful case ascertainment efforts in Utah during the Spring and early Summer of 2020, perhaps partly owing to slower emergence compared to other regions.
For π V , the probability that a surveyed person with no prior infection reported no prior positive test, the MLE was 99.94% (99.88%-99.98%). This result is consistent with the low probability of false positives among viral tests, which to our knowledge were exclusively PCRbased in Utah prior to our data collection. It is possible that some false positives in our survey data occurred by erroneous reporting, i.e. survey respondents reporting a prior positive test that did not occur, rather than via errors in testing procedure. Even though our MLE for this parameter was in excess of 99.9%, we found that an alternate model assuming π V = 100% produced notably different estimates of some of the other parameters (Table 2), which suggests that studies producing epidemiological estimates relying on a 100% viral test specificity assumption should test robustness of conclusions to small deviations from that assumption.
For ϕ A , the probability that a prior-infected person's antibody test was positive, the MLE was 86% (75%-93%), similar to the test manufacturer's finding that 109 of 122 (89.3%) PCRpositive subjects were positive for antibodies [32]. Assuming ϕ A = 89.3% directly and optimizing the other 6 parameters produces a slightly better AIC than the full 7-parameter model ( Table 2). The manufacturer's results included only symptomatic subjects and were highly dependent on the number of days post-symptom onset at which the serological sample was taken. Because the symptom histories of the antibody-tested people in our data are largely uncertain, it is difficult to determine how consistent our result is with the manufacturer's data.
The MLE for π A , the probability that an antibody-tested person with no prior infection tested negative for antibodies, was 99.3% (99.2%-99.5%), which is within the uncertainty range of the test manufacturer's estimate of 99.6% (99.0%-99.9%) based on 4 positive tests from 997 samples collected prior to September 2019 [32]. Models assuming the manufacturer's point specificity estimate of 99.6%, produced inferior AIC to the full model (Table 2). When we ran our MLE under the assumption of perfect specificity (no false positives) for the antibody test (π A = 100%), the result for secondary attack rate reduced from 36% to 18%, which is closer to the crude estimate described in Section 3.1, and the results for community acquisition probability increased from 0.4% to 1.2% (Table 2). Thus, our model suggests that allowing for false positives can shift the attribution of infections toward household transmissions and away from acquisitions outside the household. We also found that assuming perfect specificity of the antibody test dramatically reduced the estimate of ϕ V from 72% to 37% (Table 2), which suggests that ignoring false positives in serology data could cause an underestimate of the case ascertainment rate if the serology data are used for that purpose.
When optimizing the likelihood equation against 500 synthetic data sets simulated using the MLE variable assumptions, the median estimates of each parameter were very close to the MLE values ( Table 1). The confidence intervals derived from these bootstrap estimates were similar to those derived from the likelihood ratio test, though the bootstrap intervals were somewhat wider for the three parameters governing importation and transmission. Likewise, the likelihood ratio-based intervals reported in Table 1 expanded modestly when we calculated 2-dimensional confidence regions based on each pair of estimated parameters, with most regions exhibiting close to symmetric shapes around the MLE (Supplemental Figures in S1  File). Notably, the 95% confidence regions involving the transmission dispersion parameter d h can extend to the high-variability boundary d h = 0, a result that is also reflected by the fact that the MLE for the model with fixed d h = 0 cannot be rejected with high confidence (P = 0.16) ( Table 2). However, none of the alternate models assuming d h = 0 produced a superior AIC to the full model (Table 2).
Our alternate model that employed a beta-binomial distribution for the number of household acquisitions, using a new dispersion parameter d c estimated as an additional variable in the MLE, found d c = 2.1 (0.89-7.5), with somewhat altered estimates of the other parameters (S4 Table in S1 File) compared to those in Table 1. However, the log likelihood of the model in Table 1, which is equivalent to the alternate model with d c = 1, is sufficiently close to that of the alternate model that d c = 1 cannot be rejected by the likelihood ratio test and neither model is favored by the Akaike information criterion. If overdispersion in household community acquisitions does occur, the uncertainty ranges of the transmission variables p h and d h become large (see Supplementary Results in S1 File).

Household transmission variability
We quantified the implications of our key finding of high transmission variability within households of persons infected with COVID-19 by calculating the probability of transmission extremes. Compared to our overall MLE, the classic binomial transmission model (d h = 1) produced a similar average secondary attack rate estimate of p h = 32% (24%-41%). However, the binomial model produces substantially lower probabilities that an infected individual transmits to no one or everyone in larger households (Table 3).
For example, our MLE model estimates that an infected member of an 8-member household would have a 46% (22%-70%) chance of transmitting to no one, but a 20% (3%-50%) chance of transmitting infection directly to all 7 housemates. By contrast, the no-variability binomial model estimate would be substantially lower for each extreme: 7% (3%-14%) chance of transmitting to no one and 0.03% (0.005%-0.2%) chance of transmitting to everyone (Table 3).
We calculated an example of a dynamic transmission model that would produce the same mean and variance of a person's transmission probability to a household member that is produced by our MLE beta distribution. If an infected person's duration of infectiousness is assumed to be fixed and transmissibility to a housemate is modeled as a gamma distribution 45% (20%-70%) 5% (2%-11%) 20% (2%-50%) 0.01% (0.001%-0.08%) with shape k, then k = 0.18 (95% CI 0-0.7) when the mean and variance are matched, regardless of the infectious duration (Supplementary Methods in S1 File). This estimate of k is comparable to the dispersion parameter k of the negative binomial distribution commonly used to characterize overall variability in the number of transmissions from individuals, which can be derived from the Poisson distribution with a mean that is gamma-distributed with shape parameter k [17]. Our estimate of k is similar to point estimates for SARS-CoV-2 of k = 0.1 [22], k = 0.25 [23], and k = 0.33 [24].

Within-household reproduction numbers
Our estimate of the household reproduction number R h , the expected number of household transmissions from a community acquirer with no other infected fellow household members, depends on our estimate of p h and the mean μ and variance σ 2 of the household size distribution. From our data we found μ = 3.07 and σ 2 = 3.12, so our estimate is R h = 1.12 (0.78-1.56).
Our estimate of the alternate household reproduction number R � h , the expected total number of transmissions in the household of a community acquirer, is R � h = 1.45 (0.94-2.05). The supercritical threshold for R c , the average number of non-household transmissions by an infected individual, is approximated by 1=ðR � h þ 1Þ (see Methods section 2.6 and S1 File). Using our estimate for R � h in Utah, this formula suggests that R c must be kept below approximately 0.41 (0.33-0.52) to avoid increasing growth of COVID-19 infections in the population.

Discussion
The key findings of our analyses stem from our simultaneous estimation of the average and variability of SARS-CoV-2 household transmission, household importation, and test data accuracy. Our novel combination of those interacting features within our model revealed two important epidemiological insights. First, we found that accounting for test error, especially the specificity of the serological antibody test, produced a substantially higher estimate for the household secondary attack rate. Second, we found evidence of substantial variability of transmissibility within households, which has important implications for understanding broad transmission patterns and mitigation strategies.
An important implication of the first finding is that assuming perfect test accuracy may be a source of underestimation for the household secondary attack rate in other studies. Our maximum likelihood estimate was 35% (27%-48%), which is higher than recent pooled estimates of 17-19% from the most recent meta-analyses of worldwide household studies [7,8]. These and other published studies have generally estimated the secondary attack rate by a simple calculation of the fraction of tests that were positive among household contacts of known cases. When we applied that calculation to our combined data, we found a crude secondary attack rate estimate of 15.6%. We traced the major source of this substantial underestimate to the assumption of perfect test specificity inherent in the crude formula.
Our second major finding of overdispersion of household transmission stemmed from our use of the beta-binomial distribution to quantify the number of household transmissions from infected individuals. We quantified individual-level variability in transmissibility using a dispersion parameter d h , and the optimal value occurred at low dispersion (high variability; d h = 0.43). The more commonly used binomial model, a special case of our model at minimal variability (d h !1), was rejected, suggesting that transmission patterns are not well captured by that simplifying assumption.
Our dispersion parameter estimate is not directly comparable to another commonly used dispersion parameter, often named k, that characterizes variability in the total number of transmissions (whether household or not) from each infected person as a parameter of the negative binomial distribution [17]. We converted d h to k in the context of simple model in which the only source of variability is a person's transmissibility per unit time in contact with others, finding k = 0.18, similar to other published results for SARS-CoV-2. This similarity perhaps suggests that variability in infectivity per time is a major driver of overall transmission variability for SARS-CoV-2. This could be consistent with findings that viral shedding is highly variable by individuals with SARS-CoV-2 infections, both during asymptomatic and symptomatic phases of disease, suggesting that heterogeneous transmissibility may be largely explained by overdispersion in levels of viral shedding by individuals [33]. However, other studies suggest that SARS-CoV-2 transmission overdispersion in the wider population beyond households may be less driven by biological heterogeneity and more by heterogeneous social contact behavior [34].
The level of within-household transmission variability captured by the parameter d h affects the contribution of household transmission toward threshold levels of overall transmission. Threshold conditions are often expressed using a reproduction number (R), the average number of transmissions from each infected person. The average number of household transmissions directly from an initially infected household member (R h ) is independent of d h , but d h does affect the average number of household transmissions in the next generation, i.e. by someone who acquired infection from a housemate. When transmission variability is higher, the household transmission potential of a household acquirer is lower, reducing to zero in the "all-or-nothing" limit d h = 0. To capture this effect, we introduced an alternate reproduction number R � h , which is the average number of total household transmissions after the initial introduction, when final household outbreak size has been reached.
Neither R h >1 nor R � h > 1 are sufficient threshold conditions for sustained transmission in a community, which requires some level of between-household transmission to be maintained. Given our estimate of R � h = 1.45 (0.94-2.05), we can estimate the critical value of R c , the average number of non-household community transmission that would push transmission for the population above the supercritical threshold for a growing epidemic, with the threshold condition R c > 1=ðR � h þ 1Þ. Thus, we estimate that R c must be kept below approximately 0.41 (0.33-0.52) to avoid continued case growth in Utah if household transmission continues to be well characterized by our model. As this result depended on the average household size in our data, it is notable that Utah has the highest state-average household size in the United States. The average household size in Utah is 3.1, about 20% higher than the national average household size. Thus, our R h estimate may be high compared to other locations. A lower value of R h would lead to a higher threshold value for R c .The potential contribution of interventions to reduce household transmission may also be important. Using the terms defined above, if R c <1 but R c ðR � h þ 1Þ > 1, then overall transmission is above-threshold but could be pushed belowthreshold by reducing household transmission alone, such that R � h < 1=R c À 1. Methods to reduce household transmission might include increased used of at-home testing to earlier detect potential asymptomatic or pre-symptomatic transmitters, paired with increased use of masks, disinfectants, and/or distancing within homes of an infectious person [35].
This study has several limitations. Our estimate of high household transmission variability may not be robust to alternate assumptions for the way community acquisition risk varies by household. For example, some households could have been comprised of families with both parents working essential jobs during Spring/Summer 2020, with children attending in-person day care or camps, thus placing the entire household at much higher risk of community acquisition compared to households working / caring for children at home. Also, households could have high collective community acquisition probability via attending multi-household gatherings of extended family or other social groups. In these ways, households conceivably could vary considerably in their infection numbers for reasons that don't involve within-household transmission.
We tested the implications of this alternate possibility for household variability in our model by allowing variability in community acquisition by household using an additional dispersion parameter to the MLE model (Supplementary Results in S1 File). Interestingly, the MLE for the transmission dispersion parameter d h still occurred at high variability in household transmission (d h = 0.21) under this alternate model. Furthermore, the improvement in likelihood was not substantial, such that the more complicated model would not be favored by the likelihood ratio test nor the Akaike information criterion. However, larger uncertainty ranges under the alternate model suggest that we may not be able to definitively rule out the possibility that variability in community acquisition risk by household plays a substantial role in explaining overall variability in household infection numbers.
It is also possible that household transmission variability could be driven by properties of households such as contact behavior, underlying health composition of household members, physical properties of the domicile such as size and ventilation, or other properties that could increase transmission risk of all household members together. Possible variability in personto-person transmission probability by household, rather than by individual, is not accounted for in our model. Using a beta distribution for this probability across different households to arrive at an alternate final size distribution would require integrating the beta distribution over the full final size distribution equations produced by the binomial-chain model, which would be complicated for larger households. Alternatively, one could model a functional relationship between observed properties of a household in the dataset and its average transmission probability, while retaining dispersion occurring at the individual level. We have not attempted this with our data; we suspect that the sample size of outbreaks in households with a given feature would not be large enough to draw meaningful conclusions, but this could be an important direction of future work enhanced by a larger dataset.
Another limitation lies in our potentially inaccurate assumptions used to quantify the probability of prior infections among those with missing data within participating households. Most non-participating individuals within participating households were children under 12, who were not offered antibody tests. Older participants could fill out surveys on behalf of children of any age, including reporting of prior positive tests, but participation in that option was low. Thus, our assumption that non-participants had equal community acquisition rates, susceptibility to acquisition from another household member, and transmissibility to other household members compared to study participants would be violated if children were substantially different from adults in one or more of those quantities. Our assumption is consistent with studies finding similar transmission rates to and from children compared to adults. In a study of COVID-19 clusters linked to day care centers within our study area in Utah [36], 42% of the cases occurred in children, who represented 60% of the people with epidemiological contacts to the facilities. The infected children (median age 7) transmitted infection to at least 26% of their non-facility contacts, close to our household estimate. Another study found that children under 10 in China were as likely to be infected as adults [37]. However, other studies suggest that children may be less likely to acquire infection than adults [38], and one study found very low household secondary attack from infected children in South Korea [39]. A study similar to ours found lower rates of importation and household acquisition among children aged 5-9 compared to older groups, although confidence intervals overlapped [15]. If substantial differences existed between children under 12 and our study participants, one or more of our estimates could be biased.
In addition, many eligible participants older than 12 chose not to participate, either declining the serological antibody test only (but still filling out a survey) or declining to participate at all. Comparing full participants to survey-only participants, we found that participants reporting a prior positive SARS-CoV-2 test were less likely to agree to antibody testing, though the difference was not large (63.0% vs. 72.8%). It is unknown whether a prior confirmed or suspected infection affected eligible household members' decision to agree or decline to fill out a survey. The full set of surveyed participants had different distributions of reported age, sex, race, Hispanic origin, and education level compared to the wider population, and future work could assess the implications of those differences for extrapolating COVID-19 risk to other households.
We also have not adjusted for potential biases related to non-participation rates of entire households that were selected and approached for inclusion in the study. Our data collection included a complicated sampling design across several different strata, and weights were introduced partly to account for different rates of nonresponse across the different strata. For simplicity we ignored these details and sampling weights for the analysis presented here. Also, while the 7 included Utah counties represent >86% of the state population, there may be important differences in households from the 22 excluded counties. Thus, households with higher COVID-19 risk may be overrepresented or underrepresented in our data relative to their frequency in the broader population of households in Utah.
Other potential limitations due to simplifying assumptions could be addressed in future work by relaxing those assumptions, such as assuming different prior infection probabilities for those who reported prior negative tests vs. those who reported never being tested and including the probability of active infections at the time of serological testing. Although these potential limitations, which also exist for other analyses of household transmission from serological data [13][14][15], remain in our analysis, we believe our model has addressed other limitations of existing models that may be more substantial. Our improvements to household secondary attack rate estimates, including factoring out non-household community acquisitions and tertiary transmissions, inclusion of overdispersion estimates, and careful consideration of the impact of imperfect test sensitivity and specificity, have produced improved insights into this important measure. While the likelihood equations resulting from our model are somewhat complicated, we have demonstrated that the complications introduced by including 7 unknown model parameters are justified by systematically comparing its performance against simpler models, using criteria that seek to balance goodness of fit with simplicity. Furthermore, we have provided full mathematical specification and computational code for reproducibility. The ability to explicitly calculate the likelihood for our model is an advantage for optimization speed and further mathematical analysis, and extensions to the epidemiological household model can readily be simulated to explore potential improvements.
In conclusion, we found evidence of a relatively high secondary attack rate and high overdispersion in transmission of SARS-COV-2 in Utah households during a time when overall community prevalence was low. Other published household secondary attack rates may be underestimated without accounting for imperfect test sensitivity and specificity. Controllability of the virus may depend on mitigating transmission from a minority of highly infectious individuals in large households and other household-like locations where several people congregate indoors for extended periods.