Large and finite sample properties of a maximum-likelihood estimator for multiplicity of infection

Reliable measures of transmission intensities can be incorporated into metrics for monitoring disease-control interventions. Genetic (molecular) measures like multiplicity of infection (MOI) have several advantages compared with traditional measures, e.g., R0. Here, we investigate the properties of a maximum-likelihood approach to estimate MOI and pathogen-lineage frequencies. By verifying regularity conditions, we prove asymptotic unbiasedness, consistency, and efficiency of the estimator. Finite-sample properties concerning bias and variance are evaluated over a comprehensive parameter range in a systematic simulation study. Moreover, the estimator's sensitivity to model violations is studied. The estimator performs well for realistic sample sizes and parameter ranges. In particular, the lineage-frequency estimates are almost unbiased independently of sample size. The MOI estimate's bias vanishes with increasing sample size, but might be substantial if sample size is too small. The estimator's variance matrix agrees well with the Cramér-Rao lower bound, even for small sample sizes. The numerical and analytical results of this study can be used for study design. This is exemplified by a malaria data set from Venezuela. It is shown how the results can be used to determine the sample size necessary to achieve certain performance goals. An implementation of the likelihood method and a simulation algorithm for study design, implemented as an R script, are available as S1 File, alongside documentation (S2 File) and example data (S3 File).


Without loss of generality let $\lambda_1 < \lambda_2$.

Here, we derive the lineages' prevalences and the probability of observing irregular data, i.e., $N_k = N$ for at least one $k$ or $\sum_{k=1}^n N_k = N$. Recall that a blood sample is represented by a 0-1 vector $\boldsymbol{i} = (i_1, \dots, i_n)$. The probability of observing $\boldsymbol{i}$ is denoted by $Q_{\boldsymbol{i}}$ and given by (S1).

First, we derive the prevalence of lineage $j$, i.e., the probability of observing lineage $j$ in a blood sample. Note that by using (S1),
$$1 = \sum_{\boldsymbol{i}\in\{0,1\}^n\setminus\{\boldsymbol{0}\}} Q_{\boldsymbol{i}} = \sum_{\boldsymbol{i}:\, i_n=0} Q_{\boldsymbol{i}} + \sum_{\boldsymbol{i}:\, i_n=1} Q_{\boldsymbol{i}},$$
where the first sum is the probability that lineage $n$ is not observed, i.e., $\frac{e^{\lambda(1-p_n)}-1}{e^{\lambda}-1}$. Hence,
$$\sum_{\boldsymbol{i}\in\{0,1\}^n\setminus\{\boldsymbol{0}\}:\, i_n=1} Q_{\boldsymbol{i}} = \frac{e^{\lambda}(e^{\lambda p_n}-1)}{e^{\lambda p_n}(e^{\lambda}-1)},$$
and by replacing $n$ with $j$ we obtain the probability that lineage $j$ is observed in a blood sample as
$$q_j = \frac{e^{\lambda}(e^{\lambda p_j}-1)}{e^{\lambda p_j}(e^{\lambda}-1)}. \tag{S2}$$
Note that $p_j$ is the frequency of lineage $j$ in the population of infective agents and differs from the prevalence $q_j$. However, in the limit $\lambda \to 0$ these probabilities coincide, as is easily seen by applying l'Hôpital's rule. This is not surprising, because in this limit every host is infected by exactly one lineage.

Next, we show by induction that the probability of observing lineages $j_1, \dots, j_k$ together in a blood sample is given by
$$q_{\{j_1,\dots,j_k\}} = \sum_{T \subseteq \{j_1,\dots,j_k\}} (-1)^{|T|}\, \frac{e^{\lambda\left(1-\sum_{t\in T} p_t\right)}-1}{e^{\lambda}-1}. \tag{S3}$$
This probability corresponds to the prevalence of the combination of lineages $j_1, \dots, j_k$. If $k = 1$, formulas (S3) and (S2) obviously coincide. By relabelling, it suffices to show that (S3) holds for lineages $n-k, \dots, n$. Assuming $q_{\{n-k+1,\dots,n\}}$ is given by (S3), the induction step follows by conditioning on whether lineage $n-k$ is also observed, which proves (S3).

Now, the probability that $N_k = N$ for at least one $k$ is found by a standard inclusion-exclusion argument and is given by
$$P\Big(\bigcup_{k=1}^n \{N_k = N\}\Big) = \sum_{\emptyset \neq S \subseteq \{1,\dots,n\}} (-1)^{|S|+1}\, q_S^N,$$
since $N_k = N$ for all $k \in S$ exactly if every one of the $N$ samples contains the combination $S$, which happens with probability $q_S^N$.

Next, we derive the probability of observing only single infections. A single infection with lineage $k$ is represented by the standard basis vector $\boldsymbol{e}_k$. The probability of a single infection is therefore $Q_{\boldsymbol{e}_1} + \dots + Q_{\boldsymbol{e}_n}$. Hence, the probability of observing only single infections is given by $\big(\sum_{k=1}^n Q_{\boldsymbol{e}_k}\big)^N$.

Summarizing, the probability that $N_k = N$ for at least one $k$ or $\sum_{k=1}^n N_k = N$ is given by
$$q = \sum_{\emptyset \neq S \subseteq \{1,\dots,n\}} (-1)^{|S|+1}\, q_S^N + \Big(\sum_{k=1}^n Q_{\boldsymbol{e}_k}\Big)^N - \sum_{j=1}^n Q_{\boldsymbol{e}_j}^N,$$
where the probabilities that all samples are infected only by lineage $j$, i.e., $Q_{\boldsymbol{e}_j}^N$, are subtracted because they are accounted for in both irregular cases, i.e., $N$ single infections with the same lineage ($N_j = N$ for exactly one $j$, implying $\sum_{k=1}^n N_k = N$).

Clearly, the probability $q$ vanishes as $N \to \infty$. However, if $N$ and $\lambda$ are small and the lineage frequencies are very skewed, it can be rather large.
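The probability $q$ of irregular data can be evaluated numerically. The sketch below implements the inclusion-exclusion formula as we read it (Python; names are ours), illustrating that $q$ is non-negligible for small $N$ and $\lambda$ with skewed frequencies and shrinks as $N$ grows.

```python
import math
from itertools import combinations

def q_irregular(lam, p, N):
    """P(N_k = N for some k, or all N samples are single infections)."""
    n = len(p)
    denom = math.exp(lam) - 1
    # prevalence of a combination S by inclusion-exclusion (eq. S3)
    def q_set(S):
        return sum((-1)**len(T) * (math.exp(lam * (1 - sum(p[t] for t in T))) - 1)
                   for r in range(len(S) + 1) for T in combinations(S, r)) / denom
    # P(N_k = N for at least one k), inclusion-exclusion over lineage sets
    p_all = sum((-1)**(r + 1) * q_set(S)**N
                for r in range(1, n + 1) for S in combinations(range(n), r))
    Q_single = [(math.exp(lam * pk) - 1) / denom for pk in p]  # Q_{e_k}
    # add P(only single infections), subtract the doubly counted cases
    return p_all + sum(Q_single)**N - sum(Q**N for Q in Q_single)

print(q_irregular(0.5, [0.7, 0.2, 0.1], 10))  # substantial for small N
print(q_irregular(0.5, [0.7, 0.2, 0.1], 40))  # much smaller for larger N
```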

Next, we present the proof of Result 1.

As shown in [1], unique confidence points $\underline{\lambda} < \hat{\lambda} < \overline{\lambda}$ exist, implying $\hat{\lambda} = \lambda_{\min}$ or $\hat{\lambda} = \lambda_{\max}$. Assume $\hat{\lambda} < \lambda_{\min}$. Then $\hat{\lambda} = \lambda_{\max}$ is impossible, because the uniqueness of $\lambda > \hat{\lambda}$ maximizing the profile likelihood subject to the constraint $L(\hat{\lambda}, \hat{\boldsymbol{p}}) - L(\lambda, \boldsymbol{p}) = c$ would imply $L(\hat{\lambda}, \hat{\boldsymbol{p}}) - \max_{\boldsymbol{p}} L(\lambda_{\min}, \boldsymbol{p}) < c$. The latter is a contradiction to $(\hat{\lambda}, \hat{\boldsymbol{p}})$ being the point where the log-likelihood function attains its global maximum on $\tilde{\Theta}$. Thus, $\hat{\lambda} = \lambda_{\min}$. The same reasoning yields $\hat{\lambda} = \lambda_{\max}$ if $\hat{\lambda} > \lambda_{\max}$.

In the case that only one lineage is found in the data, i.e., $\sum_{k=1}^n N_k = N$ and $N_j = N$ for some $j$, it is immediately clear from the log-likelihood function that any estimate of the form $(\lambda, \boldsymbol{e}_j)$ is equally likely. Hence, we further assume that at least two lineages are present in the data.

Next, we consider $\lambda$ as a fixed constant and maximize the log-likelihood function (eq. 3 in the main text) over the simplex. It is more convenient to introduce a nuisance parameter $\beta$ and maximize the function $\Lambda$. The equations $\frac{\partial \Lambda}{\partial p_k} = 0$ yield
$$\frac{N_k \lambda e^{\lambda p_k}}{e^{\lambda p_k} - 1} = \beta \quad \text{for all } k$$
(cf. eq. S17b in Appendix D below). Solving this equation with respect to $p_k$ yields
$$p_k = \frac{1}{\lambda} \log \frac{\beta}{\beta - N_k \lambda},$$
which is in essence eq. 6 in the main text. Substituting this into the constraint $\sum_{k=1}^n p_k = 1$ yields the equation $f(\beta) = 0$ for $\beta > \beta_0 := \lambda \max_k N_k$. Thus, $f$ is strictly monotonically increasing and consequently $f(\beta) = 0$ for exactly one $\beta > \beta_0$. This solution is found by applying Newton's method, which yields exactly eq. 6b in the main text for $\lambda = \lambda_{\min}$. Choosing $\lambda = \lambda_{\max}$ yields the rest of the proof.

The first- and second-order derivatives of the log-likelihood function are given by (S6). Clearly, the third-order derivatives exist for any admissible parameter value $\boldsymbol{\vartheta} \in \tilde{\Theta}$ and can be calculated analogously. Obviously, these are linear in $N$ and the $N_k$'s.
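The inner maximization step (fixing $\lambda$, solving $f(\beta) = 0$, and recovering $\hat{p}_k = \frac{1}{\lambda}\log\frac{\beta}{\beta - N_k\lambda}$) can be sketched as follows. This is an illustration, not the S1 File implementation: it uses bisection instead of Newton's method for robustness, exploiting the strict monotonicity of $f$, and the form of $f$ written below is our reading of the constraint $\sum_k p_k = 1$.

```python
import math

def profile_p(lam, N_counts, tol=1e-12):
    """For fixed lam, maximize the log-likelihood over the simplex:
    p_k = log(beta / (beta - N_k lam)) / lam, where beta > beta_0 = lam * max N_k
    solves f(beta) = lam - sum_k log(beta / (beta - N_k lam)) = 0."""
    def f(beta):
        return lam - sum(math.log(beta / (beta - Nk * lam)) for Nk in N_counts)
    lo = lam * max(N_counts) * (1 + 1e-12)        # f -> -inf as beta -> beta_0
    hi = lam * (max(N_counts) + sum(N_counts))    # initial guess; grow until f > 0
    while f(hi) <= 0:
        hi *= 2
    while hi - lo > tol * hi:                      # bisection (f is increasing)
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    return [math.log(beta / (beta - Nk * lam)) / lam for Nk in N_counts]

p_hat = profile_p(1.2, [37, 21, 9])
print(p_hat, sum(p_hat))  # frequencies on the simplex
```

By construction, the returned frequencies are positive and sum to one up to the root-finding tolerance, and more frequently observed lineages receive larger frequency estimates.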

Therefore, by the extreme value theorem we obtain:

Remark 1 The third-order derivatives of the log-likelihood function are uniformly bounded on any compact subset $\Theta' \subseteq \tilde{\Theta}$.

Notably, $\tilde{\Theta}$ is not compact, and the third-order derivatives are indeed not bounded on $\tilde{\Theta}$. In order to derive the information matrix and prove its properties, we need the following lemma.

Proof. Clearly, (S7a) is equivalent to (S2). The second identity is a special case of (S3). Denoting a 0-1 vector by $\boldsymbol{i} = (i_1, \dots, i_n)$ and by $n_{\boldsymbol{i}}$ the number of blood samples corresponding to $\boldsymbol{i}$, we clearly have $N_k = \sum_{\boldsymbol{i}\in\{0,1\}^n\setminus\{\boldsymbol{0}\}} i_k n_{\boldsymbol{i}}$. Moreover, the $n_{\boldsymbol{i}}$ are multinomially distributed with parameters $N$ and $(Q_{\boldsymbol{i}})_{\boldsymbol{i}\in\{0,1\}^n\setminus\{\boldsymbol{0}\}}$, so that $\mathbb{E} N_k = N q_k$ according to the first identity.

From the multinomial distribution of the $n_{\boldsymbol{i}}$, the mixed moments $\mathbb{E}(N_k N_\ell)$ for $k \neq \ell$ follow analogously, finishing the proof.
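The identity $\mathbb{E} N_k = N q_k$ (our reading of the lemma's first identity) can be checked by Monte Carlo simulation of the conditional Poisson model. A Python sketch with our own helper names:

```python
import math
import random

random.seed(7)

def poisson(lam):
    """Knuth's multiplication method; adequate for small lam."""
    L, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= random.random()
        if prod < L:
            return k
        k += 1

def draw_lineage(p):
    """Draw one lineage index from the frequency vector p."""
    u, acc = random.random(), 0.0
    for k, pk in enumerate(p):
        acc += pk
        if u < acc:
            return k
    return len(p) - 1

def prevalence(lam, pk):
    """Closed form (S2) for the prevalence q_k."""
    return (math.exp(lam) * (math.exp(lam * pk) - 1)
            / (math.exp(lam * pk) * (math.exp(lam) - 1)))

lam, p, N = 1.0, [0.5, 0.3, 0.2], 200_000
counts = [0] * len(p)
for _ in range(N):
    m = 0
    while m == 0:            # condition on at least one infection
        m = poisson(lam)
    for k in {draw_lineage(p) for _ in range(m)}:
        counts[k] += 1       # lineage k observed in this sample
print([round(c / N, 3) for c in counts])      # empirical N_k / N
print([round(prevalence(lam, pk), 3) for pk in p])  # theoretical q_k
```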

Lemma 1 enables us to derive the information matrix immediately from (S6); it is presented in Result 2 in the main text. Moreover, we are now able to prove an important regularity condition.

Theorem 1 The Fisher information matrix satisfies
$$I_N(\boldsymbol{\vartheta}) = -\mathbb{E}\left(\frac{\partial^2 L}{\partial \boldsymbol{\vartheta}\, \partial \boldsymbol{\vartheta}^T}\right).$$

Hence, Lemma 1 yields the expectations of the squared scores. Straightforward calculation gives the second-order derivatives, and by using Lemma 1 once more we obtain their expectations. Comparing both expressions finishes the proof.

To prove positive definiteness of the information matrix, we need the following 78 lemma first.
Hence, $g(p) < 0$ for $0 < p < 1$. Therefore, $\frac{\partial f}{\partial x} < 0$ is shown, which finishes the proof.

Factoring out common terms from the columns for $k = 1, \dots, n-1$ gives an expression for $\lambda^2 a^{n-1} \det I$, where $g = -\frac{e^{\lambda p_n}-1}{e^{\lambda}-1}$. Subtracting $\frac{e^{\lambda p_k}-1}{e^{\lambda p_n}-1}$ times column $k+1$ from the last column yields a simplified expression for $\lambda^2 a^{n-1} \det I$ in terms of a quantity $h$.

To derive the Fisher information, and especially its properties, it was convenient to reduce the parameter space $\Theta$ to the lower-dimensional parameter space $\tilde{\Theta}$ by eliminating a redundant parameter. However, inverting the information matrix given by Result 2 is cumbersome. To derive the Cramér-Rao lower bound, it is more convenient to embed the parameter space $\Theta$ into a higher-dimensional space by introducing a nuisance parameter. Namely, consider the log-likelihood function
$$\Lambda = \Lambda(\boldsymbol{p}, \lambda, \beta) = \Lambda(\boldsymbol{p}, \lambda, \beta \mid \boldsymbol{X}) = L(\lambda, \boldsymbol{p} \mid \boldsymbol{X}) + \beta\Big(1 - \sum_{k=1}^n p_k\Big).$$
Note that $\Lambda(\boldsymbol{p}, \lambda, \beta) = L(\lambda, \boldsymbol{p} \mid \boldsymbol{X})$, since the term that has been added vanishes for any admissible set of parameters. The advantage of $\Lambda$ is that $\boldsymbol{p}$ can be regarded as an element of $\mathbb{R}^n$ rather than of the $(n-1)$-dimensional simplex when calculating the derivatives. However, the parameter space itself does not change. Moreover, $\beta$ is regarded as a scalar, although we impose that the true value is given by $\beta = \frac{N \lambda e^{\lambda}}{e^{\lambda}-1}$.

Straightforward calculation yields
$$\frac{\partial \Lambda}{\partial \lambda} = \sum_{k=1}^n \frac{N_k p_k e^{\lambda p_k}}{e^{\lambda p_k}-1} - \frac{N e^{\lambda}}{e^{\lambda}-1}, \qquad \frac{\partial \Lambda}{\partial p_k} = \frac{N_k \lambda e^{\lambda p_k}}{e^{\lambda p_k}-1} - \beta, \qquad \frac{\partial \Lambda}{\partial \beta} = 1 - \sum_{k=1}^n p_k. \tag{S17}$$
The second derivatives become
$$\frac{\partial^2 \Lambda}{\partial p_k^2} = -\frac{N_k \lambda^2 e^{\lambda p_k}}{(e^{\lambda p_k}-1)^2}, \qquad \frac{\partial^2 \Lambda}{\partial p_k \partial p_l} = 0 \ (k \neq l), \qquad \frac{\partial^2 \Lambda}{\partial p_k \partial \lambda} = \frac{N_k e^{\lambda p_k}(e^{\lambda p_k}-1-\lambda p_k)}{(e^{\lambda p_k}-1)^2},$$
$$\frac{\partial^2 \Lambda}{\partial \lambda^2} = -\sum_{k=1}^n \frac{N_k p_k^2 e^{\lambda p_k}}{(e^{\lambda p_k}-1)^2} + \frac{N e^{\lambda}}{(e^{\lambda}-1)^2}, \qquad \frac{\partial^2 \Lambda}{\partial p_k \partial \beta} = -1, \qquad \frac{\partial^2 \Lambda}{\partial \lambda \partial \beta} = \frac{\partial^2 \Lambda}{\partial \beta^2} = 0. \tag{S18}$$
Application of Lemma 1 yields the entries of the information matrix. Hence, the Fisher information has the structure
$$\tilde{I}_N = \begin{pmatrix} A & B \\ B^T & D \end{pmatrix},$$
where $A$ is the $2 \times 2$ block corresponding to $(\lambda, \beta)$ and $D = \mathrm{diag}(d_1, \dots, d_n)$ corresponds to $\boldsymbol{p}$. The inverse Fisher information can be derived using a blockwise inversion formula, namely
$$\tilde{I}_N^{-1} = \begin{pmatrix} S^{-1} & -S^{-1} B D^{-1} \\ -D^{-1} B^T S^{-1} & D^{-1} + D^{-1} B^T S^{-1} B D^{-1} \end{pmatrix}, \qquad S = A - B D^{-1} B^T.$$
The formula applies whenever $d_i \neq 0$ for all $i$ and the $2 \times 2$ matrix $A - BD^{-1}B^T$ is invertible. Blockwise inversion is particularly simple here because $D$ is diagonal. This is the reward of introducing the nuisance parameter $\beta$: without $\beta$, $D$ would not be diagonal and inverting the Fisher information would be more involved. The entries of the inverse Fisher information $\tilde{I}_N^{-1}$ then follow, where we do not need to calculate its last row and column because these correspond to the nuisance parameter $\beta$. After some algebraic manipulation we obtain the entries for $i, j = 1, \dots, n$ with $i \neq j$.
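The blockwise inversion with a diagonal $D$-block can be illustrated generically. The sketch below inverts a symmetric matrix $\begin{pmatrix} A & B \\ B^T & D \end{pmatrix}$ via the Schur complement $S = A - BD^{-1}B^T$ and compares the result with direct inversion; this is a generic numerical check, not the specific Fisher information of the model.

```python
import numpy as np

rng = np.random.default_rng(0)

def blockwise_inverse(A, B, D_diag):
    """Invert M = [[A, B], [B^T, D]] with D diagonal, via the Schur complement
    S = A - B D^{-1} B^T (valid when all d_i != 0 and S is invertible)."""
    Dinv = np.diag(1.0 / D_diag)
    S = A - B @ Dinv @ B.T
    Sinv = np.linalg.inv(S)
    top = np.hstack([Sinv, -Sinv @ B @ Dinv])
    bottom = np.hstack([-Dinv @ B.T @ Sinv, Dinv + Dinv @ B.T @ Sinv @ B @ Dinv])
    return np.vstack([top, bottom])

# random symmetric test case with the required block structure
k, n = 2, 4
A = rng.normal(size=(k, k)); A = A @ A.T + 10 * np.eye(k)
B = rng.normal(size=(k, n))
d = rng.uniform(1.0, 2.0, size=n)
M = np.block([[A, B], [B.T, np.diag(d)]])
print(np.allclose(blockwise_inverse(A, B, d), np.linalg.inv(M)))
```

Because $D$ is diagonal, only the small $2 \times 2$ Schur complement needs an actual matrix inversion, mirroring the simplification the nuisance parameter $\beta$ buys in the text.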

MOI is the average number of super-infections, which, assuming the conditional Poisson distribution, equals
$$\psi = \frac{\lambda}{1 - e^{-\lambda}}.$$
Under this reparametrization, the second-order derivatives involving $\lambda$ change accordingly, while the remaining second derivatives remain unchanged as in (S18). Note that the score function has mean zero and particularly satisfies $\mathbb{E}\frac{\partial \Lambda}{\partial \lambda} = 0$. (That the score function has mean zero can be seen directly from (S17) and Lemma 1. Here it is important to keep in mind that $\boldsymbol{p}$ is an element of the simplex and $\beta = \frac{N \lambda e^{\lambda}}{e^{\lambda}-1}$ for the true parameter and the MLE.) This fact and the inverse function theorem yield the transformed information. The Fisher information for the parameters $(\boldsymbol{p}, \psi, \beta)$ is denoted by $\tilde{J}_N$. Clearly,
$$f'(\lambda) = \frac{\partial \psi}{\partial \lambda} = \frac{e^{\lambda}(e^{\lambda} - \lambda - 1)}{(e^{\lambda}-1)^2},$$
so that the entries of $\tilde{J}_N^{-1}$ follow for $i, j = 1, \dots, n$ with $i \neq j$, where $C$ is given by (S22e).

To investigate the method's sensitivity to the true parameters, we proceeded as follows. For a fixed set of parameters $\boldsymbol{\vartheta} = (\lambda, \boldsymbol{p})$ we simulated $K$ data sets $\boldsymbol{X}^{(1)}, \dots, \boldsymbol{X}^{(K)}$ of sample size $N$ as described below. For each data set $\boldsymbol{X}^{(k)}$, we derived the MLE $\hat{\boldsymbol{\vartheta}}^{(k)}$. MOI is the average number of infecting lineages and is given by the mean of the MOI distribution of the respective model. Moreover, conditional on being infected by $m$ parasites, the infecting lineages are drawn from a multinomial distribution with parameters $m$ and $\boldsymbol{p}$, i.e.,
$$P(m_1, \dots, m_n \mid m) = \binom{m}{m_1, \dots, m_n} \prod_{j=1}^n p_j^{m_j},$$
where $m_j \in \{0, \dots, m\}$ is the number of times the host is infected with lineage $j$. The first aim is to measure the performance of the MLE for the Poisson parameter $\lambda$, or rather for MOI, $\psi = \frac{\lambda}{1-e^{-\lambda}}$. (This is the correct parameter only for the conditional Poisson model, not for the alternative models used to simulate data.) We refer to a combination of parameters and a given model as a parameter set. For each parameter set, $K = 10\,000$ data sets $\boldsymbol{X}^{(k)}$ were randomly generated as described above. For each data set the MLE $\hat{\lambda}^{(k)}$ is derived.
We then derived the empirical mean ($E_K \hat{\lambda}$) and variance ($\mathrm{Var}_K \hat{\lambda}$) of the $\hat{\lambda}^{(k)}$'s, as well as the 0.5%, 2.5%, 5%, 95%, 97.5%, and 99.5% empirical quantiles. As the number of lineages increases, it becomes increasingly tedious to look at measures for single frequencies rather than at a single quantity. Therefore, we also calculated the distance $d(\boldsymbol{p}, \hat{\boldsymbol{p}})$ between the true and estimated parameters. For each set of parameters we then obtained the empirical means and variances of the distance, i.e., $E_K d(\boldsymbol{p}, \hat{\boldsymbol{p}})$ and $\mathrm{Var}_K d(\boldsymbol{p}, \hat{\boldsymbol{p}})$. Since $\boldsymbol{p}$ is multidimensional, it is difficult to decide on an appropriate distance. Because we do not want to commit to a "best" distance measure, we used several distance measures (Euclidean norm, 1-norm, supremum norm, Jensen-Shannon divergence, Hellinger and Bhattacharyya distances). Qualitatively, they all yielded the same results.
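For reference, the distance measures mentioned above can be computed from their standard definitions. A Python transcription (the study's own computations are in the R script of S1 File):

```python
import math

def distances(p, q):
    """Standard distances/divergences between two frequency vectors p and q."""
    eucl = math.sqrt(sum((a - b)**2 for a, b in zip(p, q)))
    one = sum(abs(a - b) for a, b in zip(p, q))
    sup = max(abs(a - b) for a, b in zip(p, q))
    hell = math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b))**2
                               for a, b in zip(p, q)))
    bhat = -math.log(sum(math.sqrt(a * b) for a, b in zip(p, q)))
    m = [(a + b) / 2 for a, b in zip(p, q)]  # mixture for Jensen-Shannon
    def kl(u, v):
        return sum(a * math.log(a / b) for a, b in zip(u, v) if a > 0)
    js = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return {"euclidean": eucl, "1-norm": one, "sup": sup,
            "hellinger": hell, "bhattacharyya": bhat, "jensen-shannon": js}

print(distances([0.5, 0.3, 0.2], [0.45, 0.35, 0.2]))
```

All six measures vanish when $\hat{\boldsymbol{p}} = \boldsymbol{p}$, and the norms satisfy the usual ordering (supremum norm $\le$ Euclidean norm $\le$ 1-norm), consistent with the qualitative agreement reported above.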