coiaf: Directly estimating complexity of infection with allele frequencies

In malaria, individuals are often infected with different parasite strains. The complexity of infection (COI) is defined as the number of genetically distinct parasite strains in an individual. Changes in the mean COI in a population have been shown to be informative of changes in transmission intensity with a number of probabilistic likelihood and Bayesian models now developed to estimate the COI. However, rapid, direct measures based on heterozygosity or FwS do not properly represent the COI. In this work, we present two new methods that use easily calculated measures to directly estimate the COI from allele frequency data. Using a simulation framework, we show that our methods are computationally efficient and comparably accurate to current approaches in the literature. Through a sensitivity analysis, we characterize how the distribution of parasite densities, the assumed sequencing depth, and the number of sampled loci impact the bias and accuracy of our two methods. Using our developed methods, we further estimate the COI globally from Plasmodium falciparum sequencing data and compare the results against the literature. We show significant differences in the estimated COI globally between continents and a weak relationship between malaria prevalence and COI.

We define the within-sample allele frequency (WSAF) as a vector, w alt , of length l composed of the frequencies of the alternate allele, i.e., the non-3D7 reference, at each locus for a single individual infection. Likewise, we define the population-level allele frequency (PLAF) as a vector, p alt , of length l composed of the frequencies of the alternate allele at each locus across a population. The PLAF may be represented as the mean of the WSAF across a population of size n, i.e., at locus i, p alt i = 1 n n j=1 w alt ij . As introduced in the main text, the population-level minor allele frequency (PLMAF) is defined as a vector, p, of length l composed of the frequencies of the minor allele at each locus across a population. Namely p = (p 1 , . . . , p l ), where p i ∈ [0, 0.5]. The within-sample minor allele frequency (WSMAF) is defined as a vector, w, of length l composed of the frequencies of the population-level minor allele at each locus for a single individual infection. Given the WSAF and the PLAF, the WSMAF and the PLMAF at a particular locus i may be expressed as follows: Let us assume there are l biallelic loci for a given individual. We define the PLMAF and WSMAF as in the main text and Appendix A.

B.1 Number of strains containing the minor allele
Let each genetically distinct haplotype in a sample be referred to as a "strain" (note that this does not necessarily imply any difference in phenotype between strains). Let k be the (unknown) number of strains in a sample, i.e., the complexity of infection (COI) of the sample. We first derive an expression for the number of strains that carry the minor allele. At locus i, let C i = (C i,1 , . . . , C i,k ) be a vector of Bernoulli random variables, where C i,j = 1 if strain j carries the minor allele, and C i,j = 0 otherwise. Let the random variable N i represent the number of strains that carry the minor allele, such that N i ∈ {0, . . . , k} and N i = k j=1 C i,j . If we make the strong assumption that parasite strains within a mixture are unrelated, then the genetic material of each strain in a sample can be considered independently drawn from the PLMAF. Our assumption implies that P(C i,j = 1) = p i , ∀j . ( As a result, N i will follow a Binomial distribution with a probability of success (i.e., drawing the minor allele) defined by Eq (2): We can determine the probability of strain j carrying the minor allele given that n i strains carry the minor allele. Given n i , we know that there must be n i values of the vector C i that equal one and k − n i that equal zero. As a result, we can express the probability that a given C i,j is equal to one as follows:

B.2 Variant Method: probability a locus is heterozygous
In the Variant Method, we examine all loci and wish to identify the probability of locus i being heterozygous. Recall that heterozygous sites contain two distinct alleles: the major and minor alleles. We define a Bernoulli random variable, V i , which takes the value of one if a site is heterozygous and zero otherwise. Therefore, the probability that locus i is heterozygous, written as P(V i = 1), is equal to one minus the probability that a locus is homozygous, i.e., all strains contribute to the minor allele, or all strains contribute to the major allele. Using Eq (3), we write, B.3 Frequency Method: expected within-sample minor allele frequency As before, suppose our sample contains k distinct parasite strains. For the Frequency Method, we assume that our genotyping method produces integer counts of each strain rather than simply recording the presence or absence of a strain. Let s i = (s i,1 , . . . , s i,k ) represent the proportion of each strain at locus i, such that s i,j ≥ 0 , ∀j and k j=1 s i,j = 1 . The within-sample minor allele frequency (WSMAF) at locus i, denoted by the random variable W i , is defined as the sum of all strain proportions that carry the minor allele. We can, therefore, write, It follows from the linearity of expectation that By definition, as C i,j follows a Bernoulli distribution (Eq (2)), the expected value is equal to p i . Moreover, as the sum over all strain proportions, s i,j , is equal to 1, We arrive at the result that the expectation of the WSMAF is equal to the PLMAF. It follows that the mean of W i over all individuals in a region is an unbiased estimator of p i (subject to the caveats above). Notably, this result stands regardless of any assumptions about the distribution of strain proportions. However, this will be a biased estimator of p i if s i,j are not independent of C i,j . For instance, if C i is associated with drug resistance and sample collection was biased to reveal this non-independence, i.e., samples were collected from individuals after drug treatment, increasing resistant parasite strains, our estimator would differ between the treated and untreated malaria populations. This finding is further explored in Appendix C. We may also derive the expectation of W i given that locus i is heterozygous (i.e., V i = 1). We first find the expectation of W i given that n i strains carry the minor allele using the linearity of conditional expectation. Applying the same logic as when deriving Eq (8), we know that the expected value of Eq (4) is equal to ni k , by definition, and that the sum of all strain proportions is equal to 1. Therefore, We then leverage the law of total expectations: We note that if V i = 1, this implies that 0 < N i < k. As a result, Substituting in Eqs (3), (5), and (9), we find that We note that the sum in Eq (12) represents the expected value of the binomial distribution (equal to kp i ) less the case where n i = 0 (equal to 0) or n i = k (equal to kp k i ). Therefore, after some algebra, we obtain Note that the Variant and Frequency Methods depend on the COI and the PLMAF. However, the Variant Method identifies the probability of a locus being heterozygous, denoted as P(V i = 1). In contrast, the Frequency Method identifies the expected value of the WSMAF given a site is heterozygous, denoted as Appendix C Additional mathematical considerations C.1 An unbiased estimator of the PLMAF In Appendix B.3, we derived the expression and commented that the mean of W i over all individuals in a region serves as an unbiased estimator of p i . In particular, regardless of any assumptions about the distribution of strain proportions and COIs for a set of samples, the mean WSMAF will equal the PLMAF on average. We note that this assumes no sequencing error and that sample collection is independent of strain proportions. To further illustrate that this result stands regardless of the distribution of COIs, we simulated the WSMAF for several samples with different COIs and compared the mean WSMAF over all individuals to the true PLMAF. We first generated a random set of samples with different COIs. For each sample, we found C i by sampling from a Bernoulli distribution and s by assuming that strain proportions were sampled from a symmetric Dirichlet distribution with shape parameter α, without loss of generality. Lastly, we determined the WSMAF for each sample by using Eq (6) and found the mean WSMAF. Simulations were run with different numbers of samples, different COI values, and different values of the shape parameter. The results are illustrated in Fig A, confirming that the true PLMAF is recovered by averaging the population observed WSMAF. For each set of parameters, data was simulated 50 times. Each point represents the mean observed PLMAF, and the bars represent the 95% quantile range. The true PLMAF is indicated by a dashed horizontal black line.
In Appendix B, Eq (5), we derived the probability that a locus is heterozygous, which relied on the assumption that the genetic material contributing to the locus is independently drawn from the PLMAF. This is the same assumption that the THE REAL McCOIL assumes, as does the earlier model, COIL, on which it is based. This is because dependence between loci, i.e., linkage disequilibrium (LD), violates the assumption that each locus solely depends on allele frequencies and can thus be described with a Bernoulli process. If loci are not independent, the probability distribution describing the number of strains containing the minor allele at a locus can no longer be described using a Binomial distribution as in Appendix B, Eq (3). If any locus is in LD, the number of strains that carry the minor allele will also depend on the degree of LD and the genotype of each strain at neighboring sites in LD. For example, if two loci are correlated due to LD, and if one locus returns a polymorphic call, the other locus is more likely to be polymorphic. However, the opposite is true: if one locus is homozygous, the probability that the other locus will also be homozygous increases. Consequently, with very large numbers of loci, these antagonistic effects will cancel each other out, and our estimator will remain unbiased.
The variance in our estimator, on the other hand, will increase as we do not account for the fact that non-independent loci are less informative than independent loci. In Fig B, we compare the distribution of point estimates for two data sets across several COIs. Data were sampled from 1,000 independent loci to generate the first data set. To generate the second data set, we first sampled from 500 independent loci and subsequently duplicated these data points-yielding a set of 1,000 dependent loci. We observe that the variance in estimated COI is smaller when sampling from the set of independent loci.
This same finding also extends to the Frequency Method as follows. The expectation of the Correlated Binomial Distribution is also equal to np. Consequently, we can make the same substitution in creating Eq (13), replacing the sum in Eq (13) with np. The denominator in Eq (13) is given by Eq (5), which we have previously detailed will not impact the bias of our method but will impact the variance of our estimator. varying COIs from 2 to 10. Data was simulated 10 times for each COI, and 1,000 bootstrap replicates were run to generate a distribution of point estimates. The first data set, "Independent Loci" (shaded blue), was sampled from 1,000 independent loci. To generate the second data set, "Dependent Loci" (shaded red), we first sampled from 500 independent loci. These 500 loci were then duplicated to yield 1,000 dependent loci. The title of each subplot, denoted on the rightmost side of the figure, indicates the COI that data were sampled with.

Appendix D Accounting for sequence error
During sequencing, errors may arise for several reasons. The major next-generation sequencing technologies use a sequencing-by-synthesis method where each DNA molecule in a sample is amplified individually (e.g., spots on a chip) prior to sequencing. Early errors in this amplification process can lead to called errors in the sequence. Moreover, errors may occur during the actual sequencing. To provide a template for the synthesis of the complementary strand, many copies of the same DNA molecule are on each spot. On each synthesis cycle, a new nucleotide is added to all of the growing DNA strands. However, there is a small chance that a nucleotide will not be added or more than one nucleotide will be added to some of the strands within a spot. These strands are considered "out-of-phase." As the number of cycles increases, the proportion of these "out-of-phase" strands increases, which leads to mistakes in the final read-out.
To address these errors, our software accounts for sequencing errors. The user has two options to do so. The first option allows the user to specify the amount of error believed to be present in their samples, giving complete control over the level of error assumed to the user. For instance, the user may specify a 5% sequencing error. Note that this is the default approach taken in our software, and our package, by default, assumes a 1% sequencing error.
The second option uses an algorithm to infer the sequencing error from the distribution of the input data. In particular, we focus on data with a low PLMAF and compare the true number of data points with a WSMAF greater than zero to the expected number of points with a WSMAF greater than zero. Consider an arbitrary data point (p i , w i ). We define a partition over the range of p i constituent of sets P 1 , . . . , In doing so, we group our data into N bins. We first count the number of loci in the first bin. We define the PLMAF for this partition as follows: Given the number of loci in the first partition and the PLMAF, we can compute the expected number of loci containing the minor allele. Recall that sampling the minor allele at a locus can be expressed using a Binomial distribution with one trial and a probability of success (drawing the minor allele) equal to the PLMAF at that locus. Therefore, to find the expected number of loci with the minor allele, we can multiply the number of loci with the probability of a locus being the variant allele. As the PLMAF tends towards zero, we expect very few sites to have the minor allele. Therefore, if we remove the expected number of loci from the true number of loci with a minor allele, all the remaining loci will likely result from sequencing error. By examining the distribution of the WSMAF of these points, we can estimate the amount of sequence error present in our sample. If this sequence error is small, we enforce a minimum sequencing error of 1%. Following the determination of the level of sequencing error, ϵ, thought to be present in the data, our algorithms account for the error as follows. Note that we assume that sequencing error affects the probability of correctly sampling both the minor and the major alleles. Therefore, loci at which 0 < WSMAF ≤ ϵ and loci at which 1 > 1 − WSMAF ≥ 1 − ϵ are assumed to have been incorrectly sampled and are denoted as homozygous loci instead of heterozygous loci.
To provide a confidence interval for our COI estimates, we use the boot package in [2]. To do so, we randomly sample with replacement from our set of data points (D : {(p i , w i , d i ), i = 1, . . . , l}) multiple times and determine the estimated COI for each replicate. From this distribution, we then determine the 95% confidence interval. The default settings for this process sample with replacement 100 times (see Appendix F for additional information on this default value). Furthermore, as bootstrapping may be a computationally expensive technique, we have provided options for users to parallelize the process using multiple CPUs. An illustration of the distribution resulting from bootstrapping on a set of simulated data is shown in Fig C. Fig C. Bootstrapped confidence intervals across a range of COIs. The kernel density estimate of the distribution is plotted for simulated data. Data were sampled with 1,000 loci from a range of COIs from two to ten.

Appendix F Convergence of bootstrap replicates
As discussed in Appendix E, we leverage bootstrapping to provide users with a confidence interval for COI estimates. To do this, we sample with replacement multiple times and determine the estimated COI for each replicate. When estimating the confidence interval using this strategy, it is important to have a sufficiently large number of replicates such that increasing the number will not meaningfully alter the variance in COI estimates. In Fig D, we explore how changing the number of replicates alters the standard error when estimating the COI on simulated data.

Fig D.
Standard error in estimates across replicates. Data were simulated from a simple simulation with 1,000 loci and a COI ranging from 2 to 10. For each simulation, the COI and confidence interval was estimated, and the standard error was plotted along the y-axis for an increasing number of replicates (x-axis).
Fig D illustrates that the variance decreases as the number of replicates increases. We note that the variance with 100 replicates is largely comparable to the variance with 1,000 replicates. Given this, we set the default number of replicates to 100 in our software package.

Appendix G Simulator details
To test our mathematical methods, we created a simulator that generates synthetic sequencing data for an individual in a given population. The individual is assigned a COI value, which is used to simulate the number of sequence reads mapped to the reference and alternative alleles for the biallelic SNPs considered. These are then used to derive the within-sample frequency of the population-level minor allele, w i , at each locus.
We first define the number of loci simulated, l. We assume that the distribution of reference allele frequencies for locus i, R i , is described by a Beta distribution with shape parameters α and β: where Γ is the Gamma function. In our simulations, we assume α = 1 and β = 5 and draw l values of R i . In Eq (5) and (13), we have defined relationships with respect to the frequency of the minor allele, p i . Recall that p i ∈ [0, 0.5]. Thus, we define To simulate w i for each individual, we first assign the COI, k. Next, we simulate the genotype at locus i, drawing k values from a Binomial distribution with probability p i , i.e., the probability of having the minor allele at locus i is equal to the population-level frequency of the minor allele. This yields the matrix, G ∈ R k×l , with k rows and l columns defining the phased haplotype of each strain, where G i,j = 1 indicates that strain j at locus i has the minor allele.
To determine the number of sequence reads related to each strain, we first draw the proportion of each strain, S = (S 1 , . . . , S k ), in an individual with k strains, which we assume is described by a Dirichlet distribution with concentration parameters α 1 , . . . , α k . The "true" WSMAF of the minor allele at locus i, denoted τ i , is thus given by the sum of the strain proportions for strains with the minor allele. We can express this using matrix multiplication: In simulations with no assumed sequence error, the number of sequence reads with the minor allele at each locus is given by sampling from a Binomial distribution c times with probability τ , where c is the assumed number of sequence reads, i.e., the read depth or coverage. However, in simulations with sequence error, we perturb the "true" WSMAF by assuming that a number of sequence reads are incorrectly sequenced. An incorrectly sequenced locus with the major allele will yield the minor allele and vice versa. In our simulations, we assume a fixed sequence error, ϵ, such that the probability of correctly sequencing the minor allele is 1 − ϵ, and the probability of incorrectly sequencing the minor allele is ϵ. Therefore, we can represent the probability of sequencing the minor allele, now denoted as τ ϵ to indicate the added error, as In simulations with relatedness assumed, we sample loci from other strains in the mixed infection with a probability equal to the relatedness simulated. Lastly, we may account for overdispersion-additional unexpected variability in our data-and sample the number of sequence reads with the minor allele at each locus from a Beta-binomial distribution rather than a Binomial distribution. Dividing the number of instances of the minor allele by the coverage at each locus results in the simulated WSMAF.

G.1 Simulator parameters
As previously mentioned, in our simulator, we assume that a Beta distribution describes the distribution of the reference allele frequencies for a given locus. In addition, we account for overdispersion in parasite densities. In Fig E, we illustrate how these two parameters impact the probability that strains in mixed infection will contribute to the observed sequence reads.

Appendix I Benchmarks
To benchmark our software, we used the bench package, available in , which tracks execution time, memory allocation, and garbage collection [3]. We conducted internal comparisons, evaluating the speed of each of our estimation and solution methods. For all internal comparisons, we simulated data with a variable number of loci: 100, 1,000, 5,000, 10,000, and a variable complexity of infection: 1, 5, 10, 15, 20, 25. We then compared the speed of our methods by running each method 100 times and plotting the time for each iteration to run. Our results are shown in Fig F and G. There was little difference between our two estimation methods and two solutions methods. To evaluate whether differences between methods were statistically significant, we utilized the Wilcoxon signed-rank test, a nonparametric hypothesis test [5]. In both the Variant Method and the Frequency Method, the discrete solution method was significantly faster than the continuous solution method (p-value: < 0.001 in both cases). Of the Variant and the Frequency Methods, the Variant Method was significantly faster than the Frequency Method (p-value: < 0.001).
We additionally ran benchmarks to test our software package against the current "state-of-the-art" method, THE REAL McCOIL. All benchmarks were run ten times to reduce variance between runs. We simulated data with a COI from 1 to 10 and compared the speed of our different estimation methods to the speed of THE REAL McCOIL. The THE REAL McCOIL was run with a burn-in period of 2, 000 followed by 5, 000 sampling iterations. These numbers were chosen by simulating a complex data set, running the THE REAL McCOIL on it with different parameter values, and examining the convergence between Monte Carlo Markov chains [6]. We found that even with a complex data set, convergence was met with a burn-in period of 2, 000 followed by 5, 000 sampling iterations. Fig H shows the median run time across the ten runs across a varying number of loci and samples.
Comparing our various estimation methods to THE REAL McCOIL, we observe that the speed of the point estimates generated by the discrete and continuous Variant Methods of coiaf remains constant even as the number of loci increases. All other evaluated estimation methods increase in a seemingly linear fashion. Importantly, the rate at which the run time of our methods increases is substantially lower than the rate at which the run time of THE REAL McCOIL increases. Depending on which estimation method the user chooses, our methods will be faster than THE REAL McCOIL for analyses evaluating greater than 1,000 loci.  We note that for a fair comparison of our bootstrapped methods and THE REAL McCOIL, it is important to illustrate that the confidence intervals between the two software packages are comparable. Fig I shows the distribution of point estimates for the two packages across COI values ranging from 2 to 10. We observe that there is little difference between the two distributions.   Table B. Sampling information for each region. The information for each region includes the number of samples studied and the number of loci examined. Moreover, the average coverage per sample is provided.

Appendix K Running THE REAL McCOIL on real data
To compare our method against the "gold standard" for estimating the COI, THE REAL McCOIL, we first had to estimate the COI using THE REAL McCOIL [7]. The total number of variants across the genome is computationally untractable for THE REAL McCOIL and is unsuitable for the algorithm given that many SNPs will be in linkage disequilibrium (LD), the non-random association of alleles with one another at different loci in a given population. In response, we first identified a common set of SNPs such that the minor allele frequency of variants was greater than 0.005 globally and within each of the 24 identified regions. From this set of SNPs, we selected subsets of SNPs that were not in LD by first calculating the genetic autocorrelation [8] among SNPs by genetic distance and filtering to sets of SNPs with an autocorrelation less than 0.8. For each of the 24 regions, five unique sets of LD-filtered variants were created. The COI was estimated for each set using THE REAL McCOIL (v2) categorical method. Heterozygous loci were determined using the genotype called during variant calling. Default priors were assigned for each parameter used by THE REAL McCOIL, with a maximum observable COI equal to 25 and sequencing measurement error estimated along with COI and allele frequencies. We performed 5 repetitions for each sample, with a burn-in period of 1, 000 followed by 5, 000 sampling iterations and using standard methodology to confirm convergence between Monte Carlo Markov chains [6].  Method is compared across a COI of 2-20. Data was simulated 100 times with 1,000 loci. No error was introduced into our simulations, and no sequencing error was assumed. In all A subplots, the discrete estimation is shown, and in all B subplots, the continuous estimation is shown. In both cases, the distribution of the estimated COI as the coverage changes from 50 to 1,000 is shown. In A.1-A.6, dot plots are used where point size indicates density. In B.1-B.6, a boxplot shows the interquartile range, with whiskers indicating the range and outliers indicated by dots. The mean absolute error is shown in A.7 and B.7, with the 95% confidence interval computed from 1,000 bootstrap repetitions at each COI indicated with error bars. Method is compared across a COI of 2-20. Data was simulated 100 times with 1,000 loci. No error was introduced into our simulations, and no sequencing error was assumed. In all A subplots, the discrete estimation is shown, and in all B subplots, the continuous estimation is shown. In both cases, the distribution of the estimated COI as the coverage changes from 50 to 1,000 is shown. In A.1-A.6, dot plots are used where point size indicates density. In B.1-B.6, a boxplot shows the interquartile range, with whiskers indicating the range and outliers indicated by dots. The mean absolute error is shown in A.7 and B.7, with the 95% confidence interval computed from 1,000 bootstrap repetitions at each COI indicated with error bars.    Method is compared across a COI of 2-20. Data was simulated 100 times from 1,000 loci with a coverage of 200. No error was introduced into our simulations, and no sequencing error was assumed. In all A subplots, the discrete estimation is shown, and in all B subplots, the continuous estimation is shown. In both cases, the distribution of estimated COI as the value of alpha changes is plotted. In A.1-A.12, dot plots are used where point size indicates density. In B.1-B.12, a boxplot shows the interquartile range, with whiskers indicating the range and outliers indicated by dots. The mean absolute error is shown in A.13 and B.13, with the 95% confidence interval computed from 1,000 bootstrap repetitions at each COI indicated with error bars.   is compared across a COI of 2-20. Data was simulated 100 times from 1,000 loci with a coverage of 200. A varying error rate, known as epsilon, was introduced into our simulations. This parameter indicates the probability that a single read is miscalled as the other allele. The assumed sequence error, i.e., the error we believe exists in our simulated data, was set to 0.01. In all A subplots, the discrete estimation is shown, and in all B subplots, the continuous estimation is shown. In both cases, the distribution of estimated COI as epsilon changes is plotted.    Method is compared to THE REAL McCOIL across a COI of 2-20. Data were simulated for each value of the COI 10 times from 1,000 loci with a coverage of 200. No error was introduced into our simulations, and no sequencing error was assumed. The subplots show the estimation methods' performance as the level of relatedness varies. In all subplots, the median value of the COI was plotted for each estimation method. The vertical bars indicate the 95% quantile range. The color of each point and its corresponding error bar corresponds to each of the three estimation methods.     L.2 Clustering real data samples Fig AD. Silhouette plot. The x-axis represents the number of clusters the data was split into. The y-axis indicates the average silhouette score, a measure of cluster validity and strength. The silhouette score can provide a measure to select the optimal number of clusters [9].    Comparison between estimation methods and DEploidIBD effective Ke. We compare the estimated COI for all samples to the measured DEploidIBD effective Ke metric, a measure of the complexity of infection and relative parasite density of each strain in that mixed infection, such that Ke = 1/ (w 2 i ), where wi is the proportion of the ith strain [10]. The x-axis indicates the effective Ke. The y-axis indicates the estimated COI. We compare various methods using coiaf and THE REAL McCOIL. Note that each band of data in the Discrete Frequency Method, the Discrete Variant Method, and THE REAL McCOIL corresponds to the integer value of the COI-noise in the y-direction has been added for interpretability. Fig AJ. Comparison between estimation methods and malaria prevalence. We examine the relationship between COI and malaria prevalence. The x-axis indicates the prevalence of malaria, which was obtained from the Malaria Atlas Project [11][12][13]. The y-axis indicates the estimated COI. We compare various methods using coiaf and THE REAL McCOIL. Note that each band of data in the Discrete Variant Method, the Discrete Frequency Method, and THE REAL McCOIL corresponds to the integer value of the COI-noise in the y-direction has been added for interpretability.