## Abstract

Contagious statistical distributions are a valuable resource for managing contagion by means of *k*–connected chains of distributions. Binomial, hypergeometric, Pólya and uniform distributions with the same values for all parameters except sample size *n* are known to be strongly associated. This paper describes how the relationship can be obtained via factorial moments, simplifying the process by including novel elements. We describe the properties of these distributions and provide examples of their real–world application, and then define a chain of *k*–connected distributions, which generalises the relationship among samples of any size for a given population and the Pólya urn model.

**Citation:** García–García V, Martel–Escobar M, Vázquez–Polo F (2022) Contagious statistical distributions: *k*-connections and applications in infectious disease environments. PLoS ONE 17(5): e0268810. https://doi.org/10.1371/journal.pone.0268810

**Editor:** Ivan Kryven, Utrecht University, NETHERLANDS

**Received:** April 11, 2021; **Accepted:** May 6, 2022; **Published:** May 27, 2022

**Copyright:** © 2022 García–García et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability:** All relevant data are within the paper.

**Funding:** This work was partially funded by grant ECO2017–85577–P (Ministerio de Economía, Industria y Competitividad. Agencia Estatal de Investigación, Spain) and the 2014–2020 ERDF Operational Programme and the Dpt. of Economy, Knowledge, Business and University of the Regional Government of Andalusia under grant FEDER–UCA18–107519.

**Competing interests:** The authors have declared that no competing interests exist.

## Introduction

A large body of literature has been generated regarding mathematical models of epidemiology. These models usually consider the population under study to be clustered as follows: persons born with passive immunity (denoted by M), those without passive immunity and hence susceptible (S), those who are infected but not infectious (E), those who are capable of transmitting the infection, and hence infectious (I), and those who have a permanent infection–acquired immunity, and hence recovered (R). Different epidemiology models are classified according to which of these clusters are considered: MSEIR, SEIR, SIR, etc.

Among the main parameters included in these models are the basic reproduction number of an epidemic, *R*_{0} (that is, the expected number of secondary cases produced by a primary case during their infectious period, within a completely susceptible population), and the degree of herd immunity (the fraction of immune individuals within the population beyond which the epidemic can no longer grow). A brief historical summary of mathematical models in epidemiology can be found in Hethcote [1].

Knowledge and understanding in this area have advanced rapidly, and many empirical studies have improved upon the classical models by including significant features such as the effects of heterogeneity and correlations, household effects, network–driven contagion and mobility models. Sun et al. [2], in a study based on Chinese data, considered the role of transmission heterogeneities, which are driven by demography, behaviour and interventions. Kawagoe et al. [3] examined the question of infectious disease dynamics in heterogeneous populations and the role played by “superspreaders”. Aleta et al. [4] studied the effects of testing, quarantine and contact tracing, and Huber et al. [5] proposed a tracing strategy to optimise the cost/effect balance. Chang et al. [6] successfully developed a SEIR model which used mobile phone geolocation data.

Most of the mathematical models proposed require certain assumptions about the dynamics of infectious disease. For instance, a common and sometimes unrealistic assumption is that there is the same probability of any infectious individual infecting any susceptible one, a relation that is termed homogeneity. Britton et al. [7] showed that the contrary situation, that of population heterogeneity, can have a considerable impact on disease–induced immunity because the proportion of infected individuals in groups with higher contact rates is greater than that in groups with lower contact rates. Hébert–Dufresne et al. [8] showed, using random network theory to predict the size of an epidemic, that without data on the heterogeneity in secondary infections (which are needed to estimate its cumulative distribution function) the size of the outbreak remains highly uncertain.

Seeking to avoid the above assumption, various improvements to the models have been suggested. For instance, Neipel et al. [9] generalised the SIR model taking into account generic effects of heterogeneity on the population’s degree of susceptibility to infection. Introducing a new parameter, that of a power–law exponent of the susceptibility distribution at small susceptibilities, Neipel et al. showed that the class of gamma distributions acts as an attractor of the dynamics, making it possible to identify generic effects of population heterogeneity.

Another common assumption which may be unrealistic is the “law of large numbers” (LLN), meaning that the population size is large enough to accurately describe random dynamics with asymptotic elements, such as limit probability distributions. However, many situations of infectious disease spread originate within a closed environment (school classrooms, for instance) with a small population size, where the LLN assumption does not hold. In this circumstance, attempting to forecast the behaviour of infection dynamics by means of the classical models would be quite misleading. Brooks et al. [10] developed a stochastic transmission model of infection spread on university campuses, based on realistic mixing patterns, and evaluated various infection mitigation strategies. Mayberry et al. [11] presented dynamic random graph techniques for modelling small population outbreaks, allowing different interaction rates among students. These authors analysed Monte Carlo simulations, assuming a beta negative binomial distribution, to determine the effects of different transmission rates and of diverse vaccination strategies on the dynamics of a hypothetical outbreak of influenza. With respect to the COVID–19 pandemic, several guidelines on appropriate antigen–testing strategies have been developed. For instance, the National Academies of Sciences, Engineering, and Medicine [12] provided a general guide for colleges and universities in the USA, and Nixon et al. [13] described how the University of Bristol (UK) developed CONQUEST, a tool to record and analyse data on COVID–19.

In this paper, we consider contagious statistical distributions and theoretical tools which could be applied to certain scenarios of infection, such as a closed environment with several rooms. In addition, we present techniques showing how complementary information on the statistical behaviour of infectious disease spread can be obtained.

Contagious statistical distributions are valuable tools in the epidemiology of communicable diseases. They enhance our understanding of the presence of contagions in, for example, a confined space. In a more general scenario, assume a system with *n* components. Each component could have a different workload and hence a different probability of failing. The variable considered is the number of failure events. Now assume a different number of system components and a different workload for each component; nevertheless, the overall proportion of failing components remains unaltered. The question then arises: when *X*^{(n)} is the number of failures when the system contains *n* components, what relationship exists (if any) among the probability distributions of *X*^{(n)} for *n* = 1, 2, …?

Such a scenario is most commonly modelled using a classical binomial distribution, in which each component has the same probability of failing. However, this means there may be a high variance, and therefore a large statistical error in the estimates obtained. A second concern is the implicit assumption that the binomial probability mass function (pmf) resembles a Gaussian curve as *n* increases, producing a certain symmetry, unimodality, etc. This assumption is not always valid. Finally, independence cannot always be assumed.

The Pólya urn (contagious) model described by Eggenberger and Pólya [14] models the above situation by considering an urn which initially contains *W* white balls (cases) and *R* red ones (others). One ball is sampled at random and returned to the urn with *c* additional balls of the same colour. After this procedure has been applied to *n* samples, the variable, *X*^{(n)}, which counts the number of white balls sampled is said to be Pólya distributed, and is denoted by *X*^{(n)} ∼ P(*W*, *R*, *c*, *n*). Its pmf, i.e. the probability that, after *n* draws, *w* white balls (representing cases of infection) and *n* − *w* = *r* red balls (representing individuals free of infection) have been drawn, is given by
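$$\Pr\left(X^{(n)} = w\right) = \binom{n}{w}\,\frac{\prod_{i=0}^{w-1}(p + i\delta)\,\prod_{j=0}^{r-1}(q + j\delta)}{\prod_{k=0}^{n-1}(1 + k\delta)}, \quad w = 0, \dots, n, \tag{1}$$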
where *p* = *W*/(*W* + *R*), *q* = 1 − *p*, and *δ* = *c*/(*W* + *R*), subject to the following feasibility conditions:

- *W* + *R* − 1 ≥ (1 − *n*)*c*, to have a feasible set of parameters, and
- min{*W* − 1, *R* − 1} ≥ (1 − *n*)*c*, to have a distribution in which the rank is complete.

Particular cases are:

- Let *np* = *h*, *nδ* = *d*, and *n* → ∞, with *h* and *d* remaining finite; then (1) has the limiting form
  $$\Pr(X = w) = \binom{h/d + w - 1}{w}\left(\frac{1}{1+d}\right)^{h/d}\left(\frac{d}{1+d}\right)^{w}, \quad w = 0, 1, 2, \dots,$$
  which corresponds to a negative binomial distribution.
- For *c* = 0, (1) reduces to a classical binomial model, Bin(*n*, *p*).
- If negative values are allowed for *c*, then for *c* = −1, P(*W*, *R*, −1, *n*) reduces to the hypergeometric model, H(*W* + *R*, *W*, *n*), of sampling without replacement.
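The two finite particular cases can be checked numerically. The following is a minimal sketch, assuming the product form of the Pólya pmf in terms of the urn counts (each factor of (1) multiplied through by *W* + *R*); the function name `polya_pmf` and the parameter values are illustrative and do not come from the paper.

```python
from fractions import Fraction
from math import comb


def polya_pmf(W, R, c, n):
    """Return [Pr(X = w) for w = 0..n] for the Polya urn P(W, R, c, n).

    Exact rational arithmetic. c may be negative as long as the urn is
    never emptied before the n-th draw (feasibility conditions above).
    """
    # Urn sizes seen before each of the n draws: W+R, W+R+c, ...
    den = Fraction(1)
    for k in range(n):
        den *= W + R + k * c

    pmf = []
    for w in range(n + 1):
        num = Fraction(1)
        for i in range(w):          # white counts seen before each white draw
            num *= W + i * c
        for j in range(n - w):      # red counts seen before each red draw
            num *= R + j * c
        pmf.append(comb(n, w) * num / den)
    return pmf


# c = 0: sampling with replacement, i.e. Bin(n, p) with p = W / (W + R)
binomial_case = polya_pmf(3, 7, 0, 4)
# c = -1: sampling without replacement, i.e. H(W + R, W, n)
hypergeometric_case = polya_pmf(3, 7, -1, 4)
```

Exact fractions are used so that the comparison with the binomial and hypergeometric pmfs is an equality rather than a floating-point approximation.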

Recent studies of Pólya urn models include Kotz et al. [15], Mahmoud [16], Chen and Wei [17] and Chen and Kuba [18].

Ollero and Ramos [19] showed that the Pólya distributions (allowing any feasible integer value for *c*) are equivalent to Poisson–Binomial models. The Poisson–Binomial model describes the number of successes, *X*^{(n)}, in *n* Bernoulli independent trials, each of which has the probability of success *p*_{i}, *i* = 1, …, *n*. The model is denoted by *X*^{(n)} ∼ PB(**p**), where **p** = (*p*_{1}, …, *p*_{n}), and its pmf is given by
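$$\Pr\left(X^{(n)} = k\right) = \left[\prod_{i=1}^{n}(1 - p_i)\right] \sum_{i_1 < \cdots < i_k}\ \prod_{j=1}^{k} \operatorname{logit}(p_{i_j}), \quad k = 0, \dots, n, \tag{2}$$
where logit(*p*_{i}) = *p*_{i}/(1 − *p*_{i}) (that is, the odds of success in trial *i*), for *i* = 1, …, *n*, and the summation is over all possible combinations of different *i*_{1}, …, *i*_{k} from {1, …, *n*}. Clearly, the mean of the random variable *X*^{(n)} is given by
$$E\left[X^{(n)}\right] = \sum_{i=1}^{n} p_i.$$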

The main result from [19] is quite surprising: it means that the number of successes in *n* dependent Bernoulli trials can be described as the number of successes in independent ones.
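As an illustration of the Poisson–Binomial pmf, the sketch below computes it in two ways: by adding one independent trial at a time, and by the closed form built from the quantities *p*_{i}/(1 − *p*_{i}) that the text denotes logit(*p*_{i}). The function names and the probabilities used are hypothetical; the closed form is assumed to be the product of the failure probabilities times the sum, over all *k*-subsets of trials, of the products of those quantities.

```python
from itertools import combinations
from math import prod


def pb_pmf_dp(p):
    """Poisson-Binomial pmf, built by adding one independent trial at a time."""
    pmf = [1.0]                          # zero trials: X = 0 with certainty
    for pi in p:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1 - pi)    # this trial fails
            nxt[k + 1] += mass * pi      # this trial succeeds
        pmf = nxt
    return pmf


def pb_pmf_odds(p):
    """Same pmf from the closed form with the odds p_i / (1 - p_i).

    Requires every p_i < 1. The cost grows as 2**n, so this version is
    only meant for small n.
    """
    base = prod(1 - pi for pi in p)
    odds = [pi / (1 - pi) for pi in p]
    return [base * sum(prod(subset) for subset in combinations(odds, k))
            for k in range(len(p) + 1)]


probs = [0.1, 0.4, 0.7]                  # hypothetical success probabilities
pmf = pb_pmf_dp(probs)
mean = sum(k * mass for k, mass in enumerate(pmf))   # agrees with sum(probs)
```

The trial-by-trial version is the one to prefer in practice, since its cost is quadratic in *n*.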

The Poisson–Binomial distribution has received relatively little research attention in recent years, mainly due to the absence of assumptions regarding its parameters: the Poisson binomial family contains quite different distributions, with quite different properties, and *n* parameters are required to model a random variable which can take *n* + 1 values. Among the few more or less recent papers on these distributions, theoretical results have been reported by Schlemm [20] and a goodness of fit test was proposed by Acharya and Daskalakis [21]. In addition, some work on approximation, by different methods, has been done by Neammanee [22, 23] and Barbour [24], Skipper [25], Butler and Stephens [26] and Novak [27]. Studies related to the computation of probabilities include Hong [28] and Barrett and Gray [29]. Analyses in which the model has proven useful are described in Chen and Liu [30], Tejada and den Dekker [31] and Rosenman and Viswanathan [32]. An excellent review of the most recent progress related to the Poisson–Binomial distribution is Tang and Tang [33].

Let us assume that not only Poisson–Binomial models but any finite distribution might best fit the data. If the cdf of each *X*^{(n)} is denoted by *F*^{(n)} then we wish to find chains of distributions of the form {*F*^{(n)}: *n* = 0, …, *M*}, where *M* could be infinity or an integer upper bound to the chain, and where all these distributions share certain regularity conditions. However, these conditions cannot be defined in a simple way.

A chain of finite distributions has a relationship called *k*–connection, meaning there exists a strong relationship among the respective factorial moments, which can be viewed as a regular pattern of behaviour within a contagious environment. This, in turn, implies the existence of proportionality in the expected means, variances, etc., thus providing us with an instrument to manage the behaviour of the number of infections taking place in an environment as its population increases.

By means of this relationship, a model for the number of successes in *n* = *n*_{0} trials can not only be described by a given distribution *F*^{(n)}, but can also facilitate a chain of *k*–connected distributions for any other feasible sample size. When the relationship among the models within a chain is assumed as part of the model, it can be tested or estimated from samples of different sizes.

The aim of this paper is to describe and/or characterise families of discrete distributions parametrised by a sample size. These distributions are used to model contagion via a relationship that we term *k*−connectedness. We show that this relationship can be presented in a natural way, among many well–known families of discrete distributions, such as the chains {Bin(*n*, *p*): *n* ≥ 0}, {P(*W*, *R*, −1, *n*): *n* = 0, …, *M*}.

What is this relationship useful for? Theorem 1 shows that chains of connected distributions are feasible statistical models for estimating the proportion of infected individuals in a population, given samples of varying sizes from this population. In other words, sample observations would commonly be used, jointly, with different sample sizes to estimate a unique or common value for the probability of success, *p*. Thus, data from different distributions within a chain of connected distributions can be jointly used for inference. This powerful possibility is proven to be feasible within any given chain of *k*−connected distributions.

The rest of this paper is organised as follows. Next, we present the theoretical elements considered and describe their main properties: the connecting function of a finite random variable or distribution; the *k*–connection relationship; the chains of connected distributions; and the chain–generating sequence. In addition, we provide a triangular table to represent a chain generating sequence. Finally, some subsets of well–known families of finite distributions are shown to be chains of connected distributions. In the Estimation section, we then illustrate a practical application of these elements, with a real–world example of their use, showing that samples from different distributions belonging to the same chain of *k*–connected ones can be used jointly for estimation. A simulation study is also performed to evaluate the errors in the estimation process. Finally, we summarise the main conclusions drawn.

## Chain of distributions

In this section, we define and study some auxiliary elements to simplify the definition of a chain of *k*–connected distributions. Instead of addressing this relationship by means of factorial moments, we do so using a characteristic function of the distributions, termed the connecting function. To facilitate the detection and management of a chain of *k*–connected distributions, we also define the chain generating sequence, i.e. the sequence of real numbers that characterises a given chain of *k*–connected distributions. Some classical (but previously unknown) chains and their generating sequences are also shown.

**Definition 1** *Let X*^{(n)} *be a random variable with support in the integer interval* [0, *n*]. *The function*
(3) *is then termed the connecting function of X*^{(n)}.

Expression (3) can be rewritten in terms of the probability generating function (pgf) (4)

**Example 1** *For a binomial random variable X*^{(n)} ∼ *Bin*(*n*, *p*) *the connecting function is given by*

**Proposition 1** *The connecting function of a Poisson–Binomial distributed variable*, *X*^{(n)} ∼ *PB*(*p*_{1}, …, *p*_{n}) *is given by*

*Proof*. Given that the well–known Poisson–Binomial probability generating function can be expressed as
the proof follows immediately from (4).

**Corollary 1** *The connecting function of a random variable X*^{(n)} *with support on the integer interval* [0, *n*] *is a polynomial with real roots iff X*^{(n)} *is a Poisson–Binomial distributed variable*.

*Proof*. To prove this, it only has to be noticed that any real root of is inside the real interval [0, 1].

The connecting function is no more than a particular probability generating function. Nevertheless, it is a useful means of presenting the natural concept of chain of connected distributions, which to our knowledge has not been addressed before. In this understanding, we first introduce the concept of *k*−connection and then go on to prove that it is the common internal relationship of certain particular sets of discrete probability distributions.

**Definition 2** *Let X*^{(n)} *and X*^{(n+k)} *be random variables with respective connecting functions*
*and*
. *Both variables and their respective distributions are said to be k*–*connected if*

When a pair of random variables, *X*^{(n)} and *X*^{(n+1)}, are 1–connected, they are said to be connected.

For instance, in the binomial distributions Bin(*n*, *p*) and Bin(*n* + 1, *p*), we have that and so Bin(*n*, *p*) and Bin(*n* + 1, *p*) are 1–connected. Analogously, Bin(*n*, *p*) and Bin(*n* + 2, *p*) are 2–connected distributions, and so on. The same outcomes are obtained in most classical finite models, such as Pólya distributions and discrete uniform distributions.

In the following, we use the standard Pochhammer notation for the falling and rising factorials: and .

The following properties are straightforwardly proven.

**Proposition 2** *Let X*^{(n)} *and X*^{(n+1)} *be connected random variables with respective connecting functions*
*and*
*Let h* ∈ {*n*, *n* + 1}. *Then*:

- *The connecting function can also be written as*
- *For i* = 0, …, *n* − 1, *this verifies*
  - (*n* + 1 − *i*)[*μ*]_{i}(*X*^{(n+1)}) = (*n* + 1)[*μ*]_{i}(*X*^{(n)}).
  - (*n* + 1) Pr(*X*^{(n)} = *i*) = (*n* + 1 − *i*) Pr(*X*^{(n+1)} = *i*) + (*i* + 1) Pr(*X*^{(n+1)} = *i* + 1).

*Proof*. Parts 1 and 2 are straightforward from (3). To prove 3, denote by *f*_{h,i} = Pr(*X*^{(h)} = *i*), for *i* = 0, …, *h*. Then, expanding (*z* − 1)^{i} in (3) we have

Part b in 4 follows from and taking into account that The remaining properties are obtained immediately.

It is obvious that if *X*^{(n)}, *X*^{(n+k)} are *k*–connected and *X*^{(n+k)}, *X*^{(n+k+h)} are *h*–connected, then *X*^{(n)}, *X*^{(n+k+h)} are *k* + *h*–connected. Accordingly, this can be considered a sequence of consecutively connected variables, meaning that any pair of them are *k*–connected.

Notice that item 3 in Proposition 2 gives a recurrence relationship among the distributions. This relationship is verified by the well–known subfamilies of discrete distributions which are applied to *n*–sampling from a given population.
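The recurrence in Proposition 2, (*n* + 1) Pr(*X*^{(n)} = *i*) = (*n* + 1 − *i*) Pr(*X*^{(n+1)} = *i*) + (*i* + 1) Pr(*X*^{(n+1)} = *i* + 1), can be read as a rule that produces the pmf for sample size *n* from the pmf for sample size *n* + 1. The sketch below applies it to a binomial pmf and to an arbitrary pmf, and checks the stated proportionality of the factorial moments. The helper names are illustrative, not taken from the paper.

```python
from fractions import Fraction
from math import comb


def step_down(f_next):
    """Given the pmf of X^(n+1) on {0..n+1}, return the pmf on {0..n}
    obtained from the recurrence quoted above."""
    n1 = len(f_next) - 1                      # this is n + 1
    return [((n1 - i) * f_next[i] + (i + 1) * f_next[i + 1]) / n1
            for i in range(n1)]


def binom_pmf(n, p):
    return [comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(n + 1)]


def factorial_moment(pmf, i):
    """E[X (X-1) ... (X-i+1)] for a pmf on {0..len(pmf)-1}."""
    total = 0
    for x, mass in enumerate(pmf):
        term = mass
        for j in range(i):
            term *= x - j
        total += term
    return total


p = Fraction(1, 3)
from_bin6 = step_down(binom_pmf(6, p))        # coincides with Bin(5, 1/3)

# The same rule applied to an arbitrary pmf on {0..6}
f6 = [Fraction(k, 28) for k in range(1, 8)]
f5 = step_down(f6)
```

Applying `step_down` repeatedly to any pmf on {0, …, *n*} ends at the degenerate distribution on {0}.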

**Definition 3** *A set of random variables X*^{(0)}, *X*^{(1)}, … *such that any pair of them are k*–*connected for the appropriate k is said to be a chain of connected distributions*.

A chain of connected distributions can be finite or infinite, depending on its nature, and its first element is a degenerate random variable which takes a null value with full probability. In an example below, we demonstrate that binomial chains contain one distribution for each sample size. However, a chain of hypergeometric distributions only contains a finite number of distributions, as a sample drawn without replacement cannot be larger than the population. Given a discrete distribution *F*^{(n)} on {0, …, *n*}, it is easily proven that there exists a chain of connected distributions {*F*^{(k)}: *k* = 0, …, *n*} which contains *F*^{(n)}. The question to be solved is whether there exists an additional distribution *F*^{(n+1)} that would extend the chain.

Any chain of connected distributions is characterised by a sequence of real numbers, which also shows whether or not the chain can be extended. Such sequences are termed chain generating sequences. To define them, the finite difference operator of a sequence of numbers is denoted as
$$\Delta^{0} a_k = a_k, \qquad \Delta^{j} a_k = \Delta^{j-1} a_{k+1} - \Delta^{j-1} a_k, \quad j \geq 1. \tag{5}$$

The following properties of this operator are evident:

**Lemma 1** *Given a sequence of real numbers A* = {*a*_{k}: *k* = 0, …, *n*}, *then*
$$\Delta^{j} a_k = \sum_{i=0}^{j} (-1)^{j-i} \binom{j}{i} a_{k+i}, \quad k + j \leq n. \tag{6}$$

**Definition 4** *Let A* = {*a*_{k}: *k* = 0, …, *n*} *be a sequence of real numbers that verifies*:

- *a*_{0} = 1
- (−1)^{j}Δ^{j}*a*_{k} ≥ 0, for *k* + *j* ≤ *n*.

*Then, A is termed a chain generating sequence (cgs)*.

It can be easily proven that each element of a cgs lies within the real interval [0, 1], and that *a*_{k} ≥ *a*_{k + 1}, for any *k* = 0, …, *n* − 1.
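Definition 4 can be checked mechanically. The sketch below assumes that Δ is the usual forward difference, Δ*a*_{k} = *a*_{k+1} − *a*_{k}, which is the convention compatible with *a*_{k} ≥ *a*_{k+1}. The function names are illustrative.

```python
from fractions import Fraction


def signed_differences(a):
    """Return rows d[j][k] = (-1)**j * Delta**j a_k, assuming the forward
    difference Delta a_k = a_{k+1} - a_k."""
    rows = [list(a)]
    while len(rows[-1]) > 1:
        prev = rows[-1]
        # (-1)^(j+1) Delta^(j+1) a_k = d[j][k] - d[j][k+1]
        rows.append([prev[k] - prev[k + 1] for k in range(len(prev) - 1)])
    return rows


def is_cgs(a):
    """Definition 4: a_0 = 1 and every signed difference is non-negative."""
    return a[0] == 1 and all(v >= 0
                             for row in signed_differences(a)
                             for v in row)


geometric = [Fraction(1, 2) ** k for k in range(6)]     # a_k = p^k with p = 1/2
harmonic = [Fraction(1, k + 1) for k in range(6)]       # a_k = 1 / (k + 1)
increasing = [1, Fraction(1, 2), Fraction(3, 4)]        # violates a_k >= a_{k+1}
```

The rows for *j* = 0 and *j* = 1 give *a*_{k} ≥ 0 and *a*_{k} ≥ *a*_{k+1}; together with *a*_{0} = 1, this is why every element lies in [0, 1].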

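**Lemma 2** *Let X*^{(n−1)} *and X*^{(n)} *be random variables, and let*
$$f_{n-1,i} = \Pr\left(X^{(n-1)} = i\right), \qquad f_{n,j} = \Pr\left(X^{(n)} = j\right),$$
*for i* = 0, …, *n* − 1 *and j* = 0, …, *n*. *Then, X*^{(n−1)} *and X*^{(n)} *are connected iff it is verified that*:
$$f_{n-1,i} = \frac{(n-i)\, f_{n,i} + (i+1)\, f_{n,i+1}}{n}$$
*for all i* = 0, …, *n* − 1.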

*Proof*. Following part b in 4 from Proposition 2, we obtain one direction of the iff. To prove the opposite direction, and for *h* ∈ {*n* − 1, *n*}, the polynomial expression of is given by
Thus,
(7)
After identifying the terms in the polynomial expression of we obtain that is equivalent to verifying (7), and so the proof is complete.

**Theorem 1** *Let A* = {*a*_{k}: *k* = 0, …, *N*} *be a cgs, where N could be infinite. Consider the set of vectors* **f**_{k} = (*f*_{k,0}, …, *f*_{k,k}), *where*
$$f_{k,i} = \binom{k}{i} (-1)^{k-i} \Delta^{k-i} a_{i}, \quad i = 0, \dots, k. \tag{8}$$
*Then, the set of random variables* {*X*^{(k)}: *k* = 0, …, *N*} *such that*
$$\Pr\left(X^{(k)} = i\right) = f_{k,i}, \quad i = 0, \dots, k,$$
*is a chain of connected distributions, which we term the chain of connected distributions generated from A*.

*Proof*. Proceed recursively. For *k* = 1, *f*_{0,0} = *a*_{0} = 1, and *f*_{1,0} = (−1)^{1}Δ^{1}*a*_{0} = 1 − *a*_{1} ≥ 0, *f*_{1,1} = Δ^{0}*a*_{1} = *a*_{1}. Notice that *f*_{0,0} = (*f*_{1,0} + *f*_{1,1})/1, and apply Lemma 2. Now, if the result is true for *k* − 1, we search for a pmf **f**_{k} = (*f*_{k,0}, …, *f*_{k,k}) which is connected to **f**_{k−1} = (*f*_{k−1,0}, …, *f*_{k−1,k−1}). From Lemma 2, the following linear system must be solved:
$$k\, f_{k-1,i} = (k-i)\, f_{k,i} + (i+1)\, f_{k,i+1}, \quad i = 0, \dots, k-1, \qquad \sum_{i=0}^{k} f_{k,i} = 1.$$
Notice that condition $\sum_{i=0}^{k} f_{k,i} = 1$ is redundant (it follows by adding the *k* equations), and that the system has infinitely many solutions. A particular solution can be found by first taking *f*_{k,k} = (−1)^{0}Δ^{0}*a*_{k} = *a*_{k}. After some straightforward, if tedious, algebra, the proof is complete.
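Theorem 1 is constructive, so a chain can be generated directly from a cgs. The following is a minimal sketch, assuming that (8) has the form *f*_{k,i} = C(*k*, *i*) (−1)^{k−i} Δ^{k−i} *a*_{i} with Δ the forward difference; this form agrees with the values *f*_{1,0} = 1 − *a*_{1}, *f*_{1,1} = *a*_{1} and *f*_{k,k} = *a*_{k} quoted in the proof. The function name `chain_from_cgs` is illustrative.

```python
from fractions import Fraction
from math import comb


def chain_from_cgs(a):
    """Return [f_0, f_1, ..., f_N], where f_k is the pmf on {0..k} with
    f_k[i] = C(k, i) * (-1)^(k-i) * Delta^(k-i) a_i."""
    # d[j][i] = (-1)^j Delta^j a_i, built by repeated differencing
    d = [list(a)]
    while len(d[-1]) > 1:
        prev = d[-1]
        d.append([prev[i] - prev[i + 1] for i in range(len(prev) - 1)])
    return [[comb(k, i) * d[k - i][i] for i in range(k + 1)]
            for k in range(len(a))]


p = Fraction(1, 4)
binomial_chain = chain_from_cgs([p ** n for n in range(5)])            # a_n = p^n
uniform_chain = chain_from_cgs([Fraction(1, n + 1) for n in range(5)])  # a_n = 1/(n+1)
```

With *a*_{n} = *p*^{n} the *k*-th element is the Bin(*k*, *p*) pmf, and with *a*_{n} = 1/(*n* + 1) it is the discrete uniform pmf on {0, …, *k*}.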

Given any finite distribution *F*^{(n)} within the integer interval [0, *n*] it is simple to obtain a chain that contains *F*^{(n)}, and to determine whether another distribution *F*^{(n+1)} could be added to the chain. The necessary procedure, which somewhat resembles Pascal’s triangle, only requires the use of (8) and (5), as shown in the following example.

**Example 2** *Let X*^{(2)} *be a random variable with a pmf given by*
, *and where*
*for j* = 0, 1, 2. *Then, from* (8) *we have*
*Thus*,
*Now, using* (5) *and from bottom to top, we can easily derive the following triangle*:

*From this triangle, the pmf’s of X*^{(k)}, *k* = 0, 1, *are also found, again from* (8):
*We now wish to find a feasible value for a*_{3} = *x*. *In order to preserve the condition of cgs, the entries in the additional row must maintain the sign of each column*:
*To conclude*, *x* = 0 *leads to* , *which is a uniform distribution on* {1, 2}.
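The triangle procedure can be sketched in a few lines. The pmf (1/6, 2/3, 1/6) used below is an illustrative assumption, chosen because it leads to the conclusion stated in the example (the only feasible extension is *x* = 0, giving a uniform distribution on {1, 2}); it is not claimed to be the pmf printed in the paper. The sketch also assumes the forward difference for Δ and the identity *a*_{i} = E[C(*X*, *i*)]/C(*n*, *i*), which follows from the proportionality of the factorial moments.

```python
from fractions import Fraction
from math import comb


def cgs_from_pmf(f):
    """a_i = E[C(X, i)] / C(n, i) for a pmf f on {0..n}; note a_n = f[n]."""
    n = len(f) - 1
    return [sum(comb(x, i) * f[x] for x in range(n + 1)) / comb(n, i)
            for i in range(n + 1)]


def signed_triangle(a):
    """rows[j][k] = (-1)^j Delta^j a_k (forward difference assumed)."""
    rows = [list(a)]
    while len(rows[-1]) > 1:
        prev = rows[-1]
        rows.append([prev[k] - prev[k + 1] for k in range(len(prev) - 1)])
    return rows


def extension_range(a):
    """Interval (lo, hi) of feasible values x for one more term of the cgs.

    The new entries of the triangle are c_j + (-1)^j x, where c_j is the
    entry computed with x = 0, so each sign condition bounds x on one side.
    The cgs cannot be extended when lo > hi.
    """
    diag = [row[-1] for row in signed_triangle(list(a) + [Fraction(0)])]
    lo = max(-c for j, c in enumerate(diag) if j % 2 == 0)
    hi = min(c for j, c in enumerate(diag) if j % 2 == 1)
    return lo, hi


def pmf_from_cgs(a, k):
    rows = signed_triangle(a)
    return [comb(k, i) * rows[k - i][i] for i in range(k + 1)]


f2 = [Fraction(1, 6), Fraction(2, 3), Fraction(1, 6)]   # assumed pmf of X^(2)
a = cgs_from_pmf(f2)
lo, hi = extension_range(a)
f3 = pmf_from_cgs(a + [lo], 3)                          # pmf of the added X^(3)
```

Because each new entry depends on *x* with coefficient +1 or −1, the feasible set is always an interval, a single point or empty.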

For any unknown cgs {1, *a*_{1}, …, *a*_{n}}, the entries of a generic triangle table are easily found as linear functions of the *a*_{i}, by using (6) and (8). Moreover, the last row gives a system of inequalities, which can have exactly one solution, no solution or infinitely many solutions. Accordingly, a cgs can be enlarged with a unique feasible additional number, or with any value within a range, or not at all.

The nature of the *k*–connection among distributions is illustrated by the following set of results. The proof of each one reduces to simple checking. The parameter notation is shown in the Introduction section.

**Proposition 3** *Let p be a real number within the real interval* [0, 1]. *Then, the sequence* {*a*_{n} = *p*^{n}: *n* ≥ 0} *is a cgs and the chain of distributions generated is the set of classical binomial distributions* {*Bin*(*n*, *p*): *n* = 0, 1, 2, …}.

*Proof*. The proof is immediate from (8).

**Proposition 4** *Given M* ≥ *N* > 0 *integers, the sequence*
$$a_n = \prod_{i=0}^{n-1} \frac{N - i}{M - i}, \quad n = 1, \dots, M, \qquad a_0 = 1,$$
*is a cgs and the chain of distributions generated is the set of hypergeometric distributions* {H(*M*, *N*, *n*): *n* = 0, 1, 2, …, *M*}.

*Proof*. The proof is immediate from (8).

The latter result has an interesting meaning, namely that in sampling without replacement, higher values of *n* lead to null probabilities for some extreme values of *X*^{(n)}, meaning that its support set is not actually the integer interval [0, *n*], but a subset within it.

**Proposition 5** *Given W* > 0, *R* > 0, *c* > 0 *integers, then the sequence*
$$a_n = \prod_{i=0}^{n-1} \frac{W + ic}{W + R + ic}$$
*for n* ≥ 1 *and a*_{0} = 1 *is a cgs and the chain generated is the set of Pólya distributions* {*P*(*W*, *R*, *c*, *n*): *n* = 0, 1, 2, …}.

*Proof*. The proof is immediate from (8).

**Proposition 6** *The sequence a*_{n} = (*n* + 1)^{−1}, *for n* ≥ 0 *is a cgs and the chain of distributions generated is the family of discrete uniform distributions in the integer intervals* [0, *n*].

*Proof*. The proof is immediate from (8).
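The hypergeometric and Pólya cases can be checked numerically in the same way. The sketch below assumes that in each case *a*_{n} = Pr(*X*^{(n)} = *n*), the probability that all *n* sampled units are cases, and that (8) has the form used in the code; all function names and parameter values are illustrative.

```python
from fractions import Fraction
from math import comb


def chain_from_cgs(a):
    """f_k[i] = C(k, i) * (-1)^(k-i) * Delta^(k-i) a_i (assumed form of (8))."""
    d = [list(a)]
    while len(d[-1]) > 1:
        prev = d[-1]
        d.append([prev[i] - prev[i + 1] for i in range(len(prev) - 1)])
    return [[comb(k, i) * d[k - i][i] for i in range(k + 1)]
            for k in range(len(a))]


def rising(x, m, c):
    """x (x + c) (x + 2c) ... (x + (m - 1) c), as an exact fraction."""
    out = Fraction(1)
    for i in range(m):
        out *= x + i * c
    return out


def hyper_pmf(M, N, n):
    return [Fraction(comb(N, i) * comb(M - N, n - i), comb(M, n))
            for i in range(n + 1)]


def polya_pmf(W, R, c, n):
    return [comb(n, w) * rising(W, w, c) * rising(R, n - w, c) / rising(W + R, n, c)
            for w in range(n + 1)]


M, N = 10, 4                       # population size and number of cases
hyper_cgs = [rising(N, n, -1) / rising(M, n, -1) for n in range(M + 1)]
hyper_chain = chain_from_cgs(hyper_cgs)

W, R, c = 2, 3, 1                  # a contagious urn (c > 0)
polya_cgs = [rising(W, n, c) / rising(W + R, n, c) for n in range(7)]
polya_chain = chain_from_cgs(polya_cgs)
```

In the hypergeometric chain the terms *a*_{n} vanish for *n* > *N*, which is how the generating sequence encodes the reduced support for large samples.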

From the previous results, it seems that the *k*–connection of finite distributions is an essential relationship, which is present in a natural way, although largely unnoticed. Accordingly, it seems credible that certain apparently unrelated distributions may actually present a similar relationship, for example, Poisson–Binomial distributions with no common individual probabilities of success. On the other hand, any more or less arbitrary finite distribution can be connected to a chain.

## Estimation

Consider the following scenario. In a given country or city, the presence of infectious disease is noticed, and planners wish to model the number of contagious persons in a classroom, waiting room or similar. An initial approach to this task might be to create a binomial model, whereby the parameter to be estimated would be the proportion of contagious persons, *p*, within the total population in the environment. This parameter could be estimated from samples of rooms containing any number of persons. If any model other than a binomial one were considered, the existence of different-sized rooms (i.e. capacities) could make it difficult or even impossible to conduct a joint estimation.

On the one hand, the possibility of making joint use of different room capacities is helpful, as the number of persons within each room is another random variable in itself. But on the other hand, the binomial model requires the (uncomfortable) condition of independence among the numbers present in each room.

Given these circumstances, a helpful, relaxed condition could be to consider that the numbers of contagious persons in rooms of different sizes are *k*−connected random variables. The only implications of this assumption are the regularity conditions given by the proportionality of the factorial moments for each room size. The advantage of this assumption is that it reduces the problem of estimation to that of finding a cgs, {*a*_{n}: *n* = 0, …, *M*}, and making joint use of all data, with no constraints on room sizes.

Consider the following notation. There are *N* rooms; inside each room *k*_{j} people are meeting, where *j* = 1, …, *N* and each *k*_{j} ∈ {1, …, *M*}, and where *M* is the highest number of persons observed in a room. The number of persons infected after each meeting is given by Then, we denote by
the number of cases where *X*^{(k)} = *i*, that is, the number of rooms with *k* persons attending and where *i* of them are infected after their meeting. We also denote
that is, the number of rooms with *k* persons present. Then, .

The problem to be addressed is then to estimate where

If no assumptions were made for the models, there would be *M*(*M* − 1)/2 values to be estimated, {**f**_{k}: *k* = 0, …, *M*}. But if we assume that those **f**_{k} are *k*–connected pmf’s, the problem reduces to that of estimating the *M* unknown values of their cgs, {*a*_{k}: *k* = 0, …, *M*}, where *a*_{0} = 1.

To do so, estimates for **f**_{k} can be found by solving the following program:
(9)

In this case, the best results are obtained by the quadratic norm ‖*x*‖ = *x*^{2}.

Then, by using (6) and (8), each can be written as a linear function of {*a*_{k}: *k* = 0, …, *M*}. These are well–known, and no comments are needed in this paper about convergence (from the law of large numbers) or the chi-square goodness–of–fit test for each pmf, **f**_{k}.
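To illustrate how rooms of different sizes can be pooled, the sketch below uses a simple method-of-moments estimator rather than program (9): since *k*–connection implies E[C(*X*^{(k)}, *i*)] = C(*k*, *i*) *a*_{i} for every room size *k* ≥ *i*, the ratio of the pooled sums is an unbiased estimate of *a*_{i}. This is a swapped-in technique, not the authors' procedure. The three-attendee counts are those quoted in Example 3; the rooms with four and five attendees are hypothetical, as are all names in the code.

```python
from fractions import Fraction
from math import comb


def pooled_cgs_estimate(rooms, M):
    """rooms: list of (k, x) = (attendees, infected).

    Returns [a_0, ..., a_M] with a_i = sum C(x, i) / sum C(k, i), both sums
    taken over the rooms with k >= i, or None when no room informs a_i.
    """
    estimate = []
    for i in range(M + 1):
        num = sum(comb(x, i) for k, x in rooms if k >= i)
        den = sum(comb(k, i) for k, x in rooms if k >= i)
        estimate.append(Fraction(num, den) if den else None)
    return estimate


# Rooms with three attendees: 1, 2, 3 and 2 meetings ended with
# 0, 1, 2 and 3 persons infected, respectively.
rooms_of_three = [(3, 0)] + [(3, 1)] * 2 + [(3, 2)] * 3 + [(3, 3)] * 2
# Hypothetical rooms with four and five attendees (not from the paper).
hypothetical_rooms = [(4, 1), (4, 2), (4, 2), (5, 2), (5, 3)]

a_hat_three = pooled_cgs_estimate(rooms_of_three, 3)
a_hat_all = pooled_cgs_estimate(rooms_of_three + hypothetical_rooms, 5)
```

With a single room size the estimate reproduces the empirical pmf of that size. With several sizes the pooled values need not satisfy the sign conditions of a cgs, which is one reason for fitting under those constraints, as program (9) does.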

The following example illustrates how even with a sparse dataset, estimation is feasible.

**Example 3** *Assume an infectious disease outbreak and the known existence of meetings at which some of those present have been infected. At each meeting, there are three, four or five attendees. Consider eight samples, as shown in* Table 1:

*Here, the values* *(Infected) from rooms with the same number of attendees, k*_{j} = 3, 4, 5 *(Attendees) are shown in each row. For rooms with three attendees, only one meeting concluded with no persons infected; two meetings concluded with one infected, three with two infected and two with three infected*.

*These data can also be presented as follows*:
*Then, the unknown pmf’s can be written as*

*The programs were solved using Wolfram Mathematica ^{©}. The results obtained (rounded to 2 decimals) are shown in* Table 2.

*Notice that each respective cgs is found on the main diagonal of the corresponding table*.

*The estimate for the connecting function of* **f**_{5} *is given by*

*Its three real roots are*, *z*_{1} = 0.358869, *z*_{2} = 0.489898, *z*_{3} = 0.902781, *and so the estimate of* **f**_{5} *is a Poisson–Binomial pmf*.

## A simulation experiment

Suppose that, in a contagion situation within several rooms, as in the Estimation section, the probability distributions of *X*^{(k)} (number of infected persons after a meeting with *k* attendees) are *k*–connected hypergeometric, meaning that *X*^{(k)} ∼ H(*M*, *N*, *k*), where *M* and *N* are fixed values for all values of *k* considered, *M* ≥ *k*, and *k* is the number of attendees in each room.

A simulation study was conducted to evaluate the estimation errors, considering three hypergeometric chains H(100, 20, *n*), H(100, 50, *n*) and H(100, 80, *n*). For each chain, only the data from *n* = 5, 10, 20 were simulated.

Two scenarios are considered:

- Scenario 1: Sample sizes were 35, 35, 30 for the respective values of *n* = 5, 10, 20. This first scenario is an almost egalitarian one, while the second would be cheaper (in the example given in the above section about meetings in an epidemic situation).
- Scenario 2: Sample sizes were 50, 25, 25 for the respective values of *n* = 5, 10, 20.

A total of 1000 simulations of each case were performed and the results obtained are shown in Tables 3 to 8. In each table, the exact values for the cgs (*a*_{i}) and the last pmf, Pr(*X*^{(20)} = *i*), are shown beside the respective mean squared error (mse) for the estimations. In Tables 3 and 4, the simulated data correspond to H(100, 20, *n*); in Tables 5 and 6, to H(100, 50, *n*); and in Tables 7 and 8, to H(100, 80, *n*), where each pair of tables corresponds to Scenario 1 and 2, respectively.
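The design of this experiment can be reproduced in outline as follows. This is a sketch under stated assumptions and does not regenerate Tables 3 to 8: it uses a pooled method-of-moments estimate of the cgs instead of program (9), only the first few terms *a*_{i}, and fewer replications. The function names, the seed and the number of replications are illustrative.

```python
import random
from math import comb


def exact_cgs(M, N, upto):
    """a_i = prod_{j<i} (N - j) / (M - j) for the chain H(M, N, n)."""
    a = [1.0]
    for j in range(upto):
        a.append(a[-1] * (N - j) / (M - j))
    return a


def simulate_rooms(M, N, design, rng):
    """design maps room size n to the number of rooms of that size. Each room
    draws n persons without replacement from M persons, N of them infected."""
    population = [1] * N + [0] * (M - N)
    return [(n, sum(rng.sample(population, n)))
            for n, count in design.items() for _ in range(count)]


def pooled_estimate(rooms, i):
    """Moment estimate of a_i pooled over all rooms with at least i persons."""
    num = sum(comb(x, i) for k, x in rooms if k >= i)
    den = sum(comb(k, i) for k, x in rooms if k >= i)
    return num / den


def mse_of_cgs(M, N, design, upto, reps, seed=0):
    """Mean squared error of the estimates of a_0..a_upto over reps runs."""
    rng = random.Random(seed)
    truth = exact_cgs(M, N, upto)
    squared = [0.0] * (upto + 1)
    for _ in range(reps):
        rooms = simulate_rooms(M, N, design, rng)
        for i in range(upto + 1):
            squared[i] += (pooled_estimate(rooms, i) - truth[i]) ** 2
    return [s / reps for s in squared]


scenario_1 = {5: 35, 10: 35, 20: 30}
scenario_2 = {5: 50, 10: 25, 20: 25}

mse_1 = mse_of_cgs(100, 20, scenario_1, upto=3, reps=200)
mse_2 = mse_of_cgs(100, 20, scenario_2, upto=3, reps=200)
```

Fixing the seed makes a run reproducible; the other two chains are obtained by changing the second argument to 50 or 80.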

## Conclusions

In the epidemiology of communicable diseases, it is essential to analyse and control the spread of infection, which can provoke severe problems not only for public health but in many other areas (for example, a major outbreak may force schools and universities to close). In this respect, statistical procedures such as contagious statistical distributions can be a useful means of studying and controlling the situation. In this paper, we describe the development of procedures directly linked to the modelling of contagion in a closed environment through *k*–connected chains of distributions.

A chain of *k*–connected distributions contains a single probability distribution for each integer interval [0, *n*] as its support set, where *n* = 0, …, *N*, and where *N* could be infinity. These distributions are closely related, as verified by various well–known models for sampling within a given population and for other families of finite distributions.

Conversely, any probability distribution with a support set in the integer interval [0, *N*] belongs to a chain, which could be enlarged with other distributions with support sets [0, *N* + 1], [0, *N* + 2], …

A major application of this result is as a means of estimating the probabilities of a set of distributions from sparse data, as we show in an example. This example also illustrates how the approach described can be used to obtain a contagious model which contains the Pólya distribution as a particular case. When the hypothesis of *k*–connection among a set of finite distributions is accepted, the pmf of each one can be estimated with data from some of them. Therefore, the *k*–connection might be considered not only a generalisation of the relationship among sampling distributions from a given population, in which different sample sizes can be jointly used to estimate the common probability of success, *p*, but also a generalisation of the Pólya contagious model, as both can be obtained as particular cases of *k*–connected chains.

## Acknowledgments

The authors thank the Academic Editor and three referees for their suggestions and comments, which have greatly contributed to improving the original manuscript.

## References

- 1. Hethcote HW. The mathematics of infectious diseases. SIAM Review. 2000; 42: 599–653.
- 2. Sun K, Wang W, Gao L, Wang Y, Luo K, Ren L, et al. Transmission heterogeneities, kinetics, and controllability of SARS–CoV–2. Science 2021; 371: 254. pmid:33234698
- 3. Kawagoe K, Rychnovsky M, Chang S, Huber G, Li LM, Miller J, et al. Epidemic dynamic in inhomogeneous populations and the role of superspreaders. Phys Rev Res. 2021; 3: 033283.
- 4. Aleta A, Martín–Corral D, Pastore A, Ajelli M, Litvinova M, Chinazzi M, et al. Modelling the impact of testing, contact tracing and household quarantine on second waves of COVID–19. Nat. Hum. Behav. 2020; 4: 964–971.
- 5. Huber G, Kamb M, Kawagoe K, Li LM, McGeever A, Miller J, et al. A minimal model for household–based testing and tracing in epidemics. Phys Biol. 2021; 18: 045002. pmid:33434891
- 6. Chang S, Pierson E, Koh PW, Gerardin J, Redbird B, Grusky D, et al. Mobility network models of COVID–19 explain inequities and inform reopening. Nature 2020; 589: 82–87. pmid:33171481
- 7. Britton T, Ball F, Trapman P. A mathematical model reveals the influence of population heterogeneity on herd immunity to SARS–CoV–2. Science 2020; 369: 846–849. pmid:32576668
- 8. Hébert–Dufresne L, Althouse BM, Scarpino S, Allard A. Beyond *R*_{0}: heterogeneity in secondary infections and probabilistic epidemic forecasting. J. R. Soc. Interface 2020; 17: 20200393. pmid:33143594
- 9. Neipel J, Bauermann J, Bo S, Harmon T, Jülicher F. Power–law population heterogeneity governs epidemic waves. PLoS ONE 2020; 15 (10): e0239678. pmid:33052918
- 10. Brooks–Pollock E, Christensen H, Trickey A, Hemani G, Nixon E, Thomas AC, et al. High COVID–19 transmission potential associated with re–opening universities can be mitigated with layered interventions. Nature Comm. 2021; 12: 5017 pmid:34404780
- 11. Mayberry J, Nattestad M, Tuttle A. The structure of an outbreak on College Campus. Math. Mag. 2021; 94: 83–89.
- 12. National Academies of Sciences, Engineering, and Medicine. COVID–19 testing strategies for Colleges and Universities. Washington, DC: The National Academies Press.
- 13. Nixon E, Trickey A, Christensen H, Finn A, Thomas A, Relton C, et al. Contacts and behaviours of university students during the COVID–19 pandemic at the start of the 2020/21 academic year. Scientific Reports 2021; 11: 11728. pmid:34083593
- 14. Eggenberger F, Pólya G. Über die Statistik verketteter Vorgänge. Zeitschrift für Angewandte Mathematik und Mechanik. 1923; 1: 279–289.
- 15. Kotz S, Mahmoud H, Robert P. On generalized Pólya urn models. Stat & Prob Let. 2000; 49 (2): 163–173.
- 16. Mahmoud HM. Pólya urn models and connections to random trees: a review. J Iranian Stat Soc. 2003; 2 (1): 53–114.
- 17. Chen M-R, Wei C-Z. A new urn model. J App Prob. 2005; 42: 964–976.
- 18. Chen M-R, Kuba M. On generalized Pólya urn models. J Appl Prob. 2013; 50: 1169–1186.
- 19. Ollero J, Ramos HM. Description of a subfamily of the discrete Pearson system as generalized-binomial distributions. J Italian Stat Soc. 1995; 4: 235–249.
- 20. Schlemm E. The Kearns–Saul inequality for Bernoulli and Poisson–binomial distributions. J Th Prob. 2016; 29: 48–62.
- 21. Acharya J, Daskalakis C. Testing Poisson binomial distributions. In: Proceedings of the 2015 Annual ACM-SIAM Symposium on Discrete Algorithms; 2015. pp. 1829–1840.
- 22. Neammanee K. A nonuniform bound for the approximation of Poisson binomial by Poisson distribution. Int J Math & Math Sc. 2003. pii: ID 619382.
- 23. Neammanee K. A refinement of normal approximation to Poisson binomial. Int J Math & Math Sc. 2005. pii: ID 679348.
- 24. Barbour AD. Multivariate Poisson–binomial approximation using Stein’s method. In: Barbour AD, Chen LHY, editors. Stein’s method and applications. Singapore: Institute for Mathematical Sciences, National University of Singapore; 2005. pp. 131–142.
- 25. Skipper M. A Pólya approximation to the Poisson–Binomial law. J Appl. Prob. 2012; 49: 745–757.
- 26. Butler K, Stephens MA. The distribution of a sum of independent binomial random variables. Methodol. Comput. Appl. Prob. 2017; 19: 557–571.
- 27. Novak SY. Poisson approximation. Probab. Surv. 2019; 16: 228–276.
- 28. Hong Y. On computing the distribution function for the Poisson binomial distribution. Comp Stat & Data An. 2013; 59: 41–51.
- 29. Barrett BE, Gray JB. Efficient computation for the Poisson binomial distribution. Comp Stat. 2014; 29: 1469–1479.
- 30. Chen SX, Liu JS. Statistical applications of the Poisson–Binomial and conditional Bernoulli distributions. Stat Sinica. 1997; 7: 875–892.
- 31. Tejada A, den Dekker AJ. The role of the Poisson’s binomial distribution in the analysis of TEM images. Ultramicroscopy. 2011; 111 (11): 1553–1556. pmid:21939620
- 32. Rosenman E, Viswanathan N. Using Poisson binomial GLMs to reveal voter preferences. arXiv:1802.01053v1 [Preprint] 2018 [cited 2018 Feb 4] Available from: https://arxiv.org/abs/1802.01053v1
- 33. Tang W, Tang F. The Poisson binomial distribution: old and new. arXiv. 2019; arXiv:1908.10024v1