Informed sequential pooling approach to detect SARS-CoV-2 infection

The alarming spread of the pandemic coronavirus disease 2019 (COVID-19) caused by the SARS-CoV-2 virus requires several measures to reduce the risk of contagion. Every successful strategy in controlling the SARS-CoV-2 infection depends on timely diagnosis, which should include testing of asymptomatic carriers. Consequently, increasing the throughput for clinical laboratories for the purposes of conducting large-scale diagnostic testing is urgently needed. Here we support the hypothesis that standard diagnostic protocol for SARS-CoV-2 virus could be conveniently applied to pooled samples obtained from different subjects. We suggest that a two-step sequential pooling procedure could identify positive subjects, ensuring at the same time significant benefits of cost and time. The simulation data presented herein were used to assess the efficiency, in terms of number of required tests, both for random assignment of the subjects to the pools and for situations in which epidemiological and clinical data are used to create "informed" pools. Different scenarios were simulated to measure the effect of different pool sizes and different values for virus frequency. Our results allow for a customization of the pooling strategy according to the specific characteristics of the cohort being tested.


Introduction
The pandemic coronavirus disease 2019 (COVID-19) is caused by the SARS-CoV-2 virus. The infection is predominantly transmitted through large droplets and by contact with infected surfaces or fomites. The alarming spread of the infection and the severe clinical disease that it may cause have led to the implementation of several measures to reduce the risk of contagion. Active case detection, rapid case isolation, and contact quarantine, as well as rigorous application of infection control practices, are successful strategies in controlling SARS-CoV-2 infection outbreaks. The success of these strategies relies first of all on viral diagnosis. The current overloading of laboratories causes a cascading delay in all virus containment procedures, with potentially dramatic consequences for prevention of the infection.
In most countries, testing for COVID-19 is mainly restricted to people with symptoms. However, a large percentage of asymptomatic subjects is estimated to exist [1]. Asymptomatic spread has likely driven the silent growth of the SARS-CoV-2 epidemic, which emerged only when health systems began to collapse. Asymptomatic cases play a significant role in infection transmission, also considering that transmission through inanimate surfaces is less frequent than previously recognised [2]. It is therefore essential that the degree to which asymptomatic individuals affect viral diffusion be evaluated [3]. Tracing contacts of known positive cases, travel bans, and social distancing are the main strategies for reducing the risk of contagion by asymptomatic subjects. A widespread testing strategy to screen asymptomatic subjects could be useful in reducing transmission of SARS-CoV-2, but this approach is highly challenging, taking into account the amount of work, time, and cost that it would entail.
For this reason, we propose here a pre-screening strategy which should increase the capacity for clinical laboratories to conduct large-scale diagnostic testing, enough to screen a significant portion of the asymptomatic population.
SARS-CoV-2 is an enveloped virus containing a single strand of positive-sense RNA, and its diagnostic protocol is an RT-PCR assay, as previously described [4,5]. Briefly, SARS-CoV-2 has been detected from a variety of upper and lower respiratory sources including throat, nasal, nasopharyngeal (NP), sputum, and bronchial fluid [6,7]. Oropharyngeal (OP) and NP swabs are the most frequently used samples. The sampling is carried out using two distinct swabs, which can be inserted in the same test tube containing the viral transport medium to increase the yield for RT-PCR analysis [8]. Recent studies have shown that SARS-CoV-2 detection can also be correctly applied to saliva, with the advantage of an easier sample collection [9]. Total RNA is extracted and SARS-CoV-2 target genes are simultaneously amplified and tested during the quantitative RT-PCR assay.
Recently, Hogan et al. (2020) [10] performed a retrospective study on SARS-CoV-2 based on sample pooling. The roots of this idea go back to Dorfman [11]. "Pooling" means that swab samples taken from different subjects can be combined before the RNA extraction phase. These authors used 2888 samples from nasopharyngeal and bronchoalveolar lavages that were collected between January 1, 2020, and February 26, 2020, from subjects who had not been tested for SARS-CoV-2. Nine or ten samples were pooled, and screening was performed by RT-PCR. A total of 292 pools were screened and the confirmed positivity rate for SARS-CoV-2 was 0.07% (2/2888). The aim of pooling is to reduce the number of test kits used, significantly shortening the time and costs of analysis.
A single positive sample can be properly detected in pools of up to 32 or 64 samples, using standard kits and protocols, with a possible slight increase in the PCR cycle threshold [12]. Some pre-prints claim that it is also technically possible to detect a single positive sample in larger pool sizes, with samples of up to 100, 120, or even 1000 [13]. However, the possibility of a single positive sample escaping detection in such large pools, especially if the viral load is low, must be taken into account. This could happen especially for samples at the initial or final phase of infection, regardless of whether the patient is symptomatic or not [14]. Various proposals have already been made to reduce the risk of increased false-negative results due to pooling, such as increasing the capacity of the extraction protocol [15]. Moreover, it must be considered that a small reduction in sensitivity should be conveniently balanced by the possibility of screening more people more often. By significantly reducing the number of analyses, pooling offers the possibility of increasing the frequency of monitoring, which is probably the most important factor to achieve an effective surveillance strategy [16].
However, the suitability of pool size does not depend only on the limit of sensitivity of the RT-PCR, but must also be set on the basis of statistical evaluations that are the subject of this publication.
The basic concepts for understanding the pooling strategy are simple: 1) only a pool made up of all negative samples will give a negative result for the pool analysis; and 2) a single positive sample within a pool makes the result of the pool analysis positive. If the pool is positive, it is necessary to proceed to individual testing for the purposes of identifying true positives (TP) and false positives (FP; i.e., a negative subject whose swab has been mixed with at least one positive swab). As all individual samples in a negative pool are considered as true negative (TN), the pooling approach significantly reduces time and cost when a large proportion of pools tests negative. However, it is clear that the effectiveness of pooling is inversely proportional to the frequency of the virus in the selected cohort and, as we will demonstrate more precisely in the results section, this approach can be inefficient or even counter-productive if the presence of the virus is high.
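The dependence of pooling efficiency on virus frequency can be made concrete with a short calculation. The sketch below is not taken from the paper: it assumes independent samples, an assay that makes no errors, and a simple one-step scheme (one test per pool, then individual retesting of every sample in a positive pool). The function name `one_step_expected_ratio` is illustrative only.

```python
def one_step_expected_ratio(vf: float, s: int) -> float:
    """Expected number of tests per subject for one-step pooling.

    Assumptions (not from the paper): samples are independent, the assay
    makes no errors, and every sample in a positive pool is retested.
    """
    if s <= 0 or not 0.0 <= vf <= 1.0:
        raise ValueError("need s > 0 and 0 <= vf <= 1")
    p_pool_negative = (1.0 - vf) ** s  # all s samples in the pool are negative
    # 1/s pool tests per subject, plus one retest per subject in a positive pool
    return 1.0 / s + (1.0 - p_pool_negative)


if __name__ == "__main__":
    for vf in (0.01, 0.05, 0.10, 0.30):
        ratios = {s: round(one_step_expected_ratio(vf, s), 2) for s in (5, 12, 24)}
        print(f"vf={vf:.2f}: {ratios}")
```

A ratio below 1 means pooling saves tests compared with individual testing; under these assumptions the ratio rises above 1 once the virus frequency is high enough that almost every pool is positive.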
The aim of this paper is to i) propose a two-step sequential pooling strategy; ii) identify the variables for which the pooling method can be more or less effective; and iii) develop strategies to further improve this approach. We began by identifying the main variables to be included in our model. The first and perhaps most important variable, as already mentioned, is the frequency of the virus. Unfortunately, this information is not known a priori, but it can be estimated. The second variable is the effectiveness of the clinical and epidemiological criteria that are adopted to create the pools, compared to an analysis in which these pools are created randomly. The third variable is the size of the pool. We have taken into consideration a wide range of scenarios to adjust the variables to give the best results using fewer tests.
As certain strategies have the potential to improve the pooling approach, we compared alternative methods of pool creation and evaluated their different performance in relation to the variables described above. Our data suggest that a pre-screening strategy based on the use of a sequential informed pooling approach ensures that, in the most favourable conditions with low virus frequency, the number of required tests can drop to 20% of those required for individual testing. Higher virus frequencies still make sequential pooling efficient, provided that pool size is decreased and/or reliable epidemiological and clinical data are used for pool creation.

Methods
The volume of samples initially collected from an individual must be enough for both pooled and individual follow-up testing. Alternatively, subjects requiring validation will be subjected to a new swab. This may be the most convenient choice only for very low viral frequency, as few repetitions of the tests would be expected. No patients were recruited specifically for this study.
The sequential pooling workflow follows these steps:
A. Assume that samples are arranged on a grid. A portion of each sample is collected to create a homogeneous pool following each row: "horizontal" pooling (pool H).
B. Perform the RT-PCR analysis of the H pools, each of size s. Based on these results, all negative pools can be excluded from further investigation, as they solely contain samples from TN subjects. Should all pools test negative, the procedure is complete.
C. Using samples not excluded in step B, create the vertical pools (pool V) following each column of the grid and perform RT-PCR analysis of the V pools (vertical pooling). The V pools will have the same size s, but their composition will be different from that of H pools, even if step B did not exclude any pool. Again, all negative pools can be excluded from further investigation, as they only contain samples from TN subjects.
D. Perform individual validation tests on all the swabs from subjects not excluded in steps B and C.
Informed sequential pooling follows the same procedure as sequential pooling, with the only difference being that a score for the probability of being infected is associated with each subject in order to tag the subject as "suspected positive" or "suspected negative." The standard diagnostic protocol has so far been applied in an emergency situation in which priority has been given especially to those who manifested symptoms. In this model, however, independence among samples is assumed, and other possible correlations are neglected (e.g. family members or co-workers should preferably be pooled together). The aim is to include subjects with higher scores in the same pool, avoiding their random spreading in the matrix. This score would be assigned by having each subject complete a dedicated online questionnaire consisting of a few multiple choice questions. Results would be processed automatically, without being time consuming. The score is calculated on the basis of clinical and epidemiological criteria that have already been associated with a higher risk of acquiring COVID-19 [17]. For instance, susceptibility seems to be strongly associated with age and biological sex [18-20], suggesting that these simple criteria may play an important role in pool assignment.
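The two pooling rounds and the final individual validation described above can be sketched as a small function that counts the tests spent in each phase for one cohort. This is an illustration under stated assumptions, not the authors' implementation: the assay is taken to be error-free, the cohort size is a multiple of s, and the vertical pools are built by reading the retained rows column by column and regrouping the samples into pools of size s, which is one possible reading of the grid layout. The name `sequential_pooling` is hypothetical.

```python
def sequential_pooling(status, s):
    """Count the tests used by two-round pooling plus individual validation.

    status: list of bools (True = positive sample), ordered row by row.
    s: pool size; len(status) must be a multiple of s in this sketch.
    """
    n = len(status)
    if s <= 0 or n % s != 0:
        raise ValueError("len(status) must be a multiple of a positive s")

    # Horizontal pools: each row of the grid is one pool of s sample indices
    h_pools = [list(range(i, i + s)) for i in range(0, n, s)]
    positive_rows = [pool for pool in h_pools if any(status[i] for i in pool)]

    # Vertical pools (assumed layout): walk the retained rows column by
    # column and regroup the samples into pools of size s
    column_major = [row[c] for c in range(s) for row in positive_rows]
    v_pools = [column_major[i:i + s] for i in range(0, len(column_major), s)]
    positive_v = [pool for pool in v_pools if any(status[i] for i in pool)]

    # Individual validation of every sample that is still under suspicion
    to_validate = sorted(i for pool in positive_v for i in pool)

    return {
        "H": len(h_pools),
        "V": len(v_pools),
        "individual": len(to_validate),
        "total": len(h_pools) + len(v_pools) + len(to_validate),
    }


if __name__ == "__main__":
    cohort = [False] * 30
    cohort[3] = cohort[17] = True  # two positives in a cohort of N = 30
    print(sequential_pooling(cohort, s=5))
```

With this layout, a single positive row produces a vertical pool identical to the horizontal one, so a practical implementation would probably skip that redundant test; the sketch counts it to keep the logic uniform.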
Figs 1 and 2 show a simple graphic representation of the sequential pooling and informed sequential pooling approach, respectively. For the purposes of facilitating visual representation, we have chosen a test cohort of dimension N equal to 30.
In Fig 2, the upper panel shows a hypothetical scenario for which all positive subjects are grouped in the first pool. This result can be obtained if the information available to classify subjects as "suspect positive" or "suspect negative" is optimal. In the lower panel, we show another scenario for which clinical and epidemiological information allowed a grouping of the positive subjects, which is only partially correct. However, in this case, the informed approach is still useful for improving the efficiency of the method compared to random pool creation.

Results
In order to assess the advantage of this two-step sequential pooling strategy in comparison with a standard approach in which each subject's swab is tested separately, we performed simulations under different conditions. Results were obtained with Wolfram Mathematica 12.1. The simulated analysis was based on an assumed group of N = 600 subjects. The size s of each
As a first step, we examined the performance of this strategy without using prior information about the subjects, that is, by creating pools completely at random. To do this, after setting s and vf, we performed 5,000 simulations and recorded the ratio between the total number of swab tests required, T, and N. For two-step sequential pooling, T includes both the H and V pools required in steps B and C and the validation tests required in step D on all the swabs from subjects not previously excluded. Since N tests must be performed without a pooling strategy, the ratio T/N measures the efficacy of the proposed procedure. The smaller the value of this ratio, the smaller the number of required tests. Conversely, values close to 1 (or even above 1) would represent a useless (or a counter-productive) strategy. Table 1 and Fig 3 show the results for s equal to 5, 12, and 24 (the entire set of plots is available in the S1 Fig). The curves plotted represent the 1st, 25th, 50th (median), 75th, and 99th percentiles of T/N obtained in the set of 5,000 simulations, for different vf values. In particular, given that the actual number of required tests depends on the random assignment of samples to pools, the 1st and 99th percentiles give an idea of the range of T/N between "favorable" and "unfavorable" assignments to the pools. The spread between the 25th and 75th percentiles, which is always very small in Fig 3, represents the central half of the simulations (after excluding the 25% most "favorable" and the 25% most "unfavorable" ones). As the pool size increases, we notice that the curves are less linear and the spread between the 1st and 99th percentiles increases. For very small pools (s = 3) with a low virus frequency, the number of tests required in this approach is about 40% of the number of tests required when testing each subject separately.
As the value of vf increases, the number of tests grows slowly and pooling remains efficient (T/N < 1) even if 25% of the subjects in the group are positive. Conversely, if we use larger pools (s = 24), the number of tests could drop to 20% for low virus frequency. However, the number of tests would increase faster as vf grows, and the procedure would be efficient only up to about 10% of positive subjects in the analysed cohort. In summary, the linear path of small pools ensures efficiency even for larger vf, but the nonlinear path observed for larger pools makes them efficient for populations with a low virus presence.
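A simulation of this kind can be sketched in a few lines. The code below is a hedged Python approximation rather than the Mathematica code used for the reported results: it assumes an error-free assay, exactly round(vf·N) positive subjects shuffled at random in each run, and vertical pools formed by reading the retained rows column by column in groups of s. The function names are illustrative, and the percentiles it prints are not expected to reproduce Table 1 exactly.

```python
import random
import statistics


def count_two_step_tests(status, s):
    """Total tests for one cohort: H pools, V pools, individual validation."""
    rows = [status[i:i + s] for i in range(0, len(status), s)]
    tests = len(rows)                         # one test per horizontal pool
    kept = [row for row in rows if any(row)]  # rows that tested positive
    if not kept:
        return tests
    # Assumed layout: read retained rows column by column, regroup in size s
    column_major = [row[c] for c in range(s) for row in kept]
    v_pools = [column_major[i:i + s] for i in range(0, len(column_major), s)]
    tests += len(v_pools)                     # one test per vertical pool
    # Individual validation for every sample in a positive vertical pool
    tests += sum(len(pool) for pool in v_pools if any(pool))
    return tests


def simulate_ratio_percentiles(n=600, s=12, vf=0.05, n_sims=5000, seed=0):
    """Percentiles (1, 25, 50, 75, 99) of T/N over random pool assignments."""
    if n % s != 0:
        raise ValueError("n must be a multiple of s in this sketch")
    rng = random.Random(seed)
    n_pos = round(vf * n)                     # assumption: fixed positive count
    ratios = []
    for _ in range(n_sims):
        status = [True] * n_pos + [False] * (n - n_pos)
        rng.shuffle(status)                   # random assignment to pools
        ratios.append(count_two_step_tests(status, s) / n)
    cuts = statistics.quantiles(ratios, n=100)  # 99 percentile cut points
    return {p: cuts[p - 1] for p in (1, 25, 50, 75, 99)}


if __name__ == "__main__":
    for vf in (0.01, 0.05, 0.10, 0.20):
        print(vf, simulate_ratio_percentiles(vf=vf, n_sims=500))
```

Fixing the seed keeps runs reproducible; drawing each subject's status independently with probability vf would be an equally defensible choice and would widen the spread between the extreme percentiles.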
As mentioned in the introduction, simple pooling was recently proposed for SARS-CoV-2 detection by Hogan et al. [10]. Their study does not provide general efficiency results beyond their specific application, in which pools of sizes 9 and 10 were used and a very small vf was reported (their value is even smaller than the smallest virus frequency assessed in our simulations). Their pooling strategy was originally proposed by Dorfman [11] and is characterized by a single pooling step followed by the validation phase. Conversely, the approach proposed here adds a further step to the Dorfman scheme, since two pooling steps have to be performed before the validation phase. This is an important point to consider when planning a pooling strategy, because this further step requires more time and organizational complexity within the laboratory. It is thus important to assess whether and under which conditions this increase in time and complexity generates an improvement in terms of efficiency. Fig 4 shows a comparison of a simple one-step pooling strategy with our two-step sequential procedure for different vf and s values. In this figure, the 25th, 50th (median), and 75th percentiles of T/N are shown. For very small pools (s = 5), the two strategies are almost equivalent. However, as soon as s is slightly increased to sensible values (ranging from 8 to 20), the sequential two-step pooling shows a better performance up to vf = 0.15. For bigger pools (s = 24, 30), we observe the same result up to vf around 0.10. For higher vf, both pooling strategies are counter-productive, as highlighted above for sequential pooling.

PLOS ONE

All of the previous results have been obtained assuming a completely random assignment of subjects to the pools. Often, however, clinical and epidemiological data about the subjects are available. If we could use these data to concentrate a portion of the positive subjects in the same horizontal pools, we would increase efficiency due to a higher number of negative pools at step B. In order to assess the savings of such an "informed pooling creation," we extended our simulations to different settings. We may conceive a scenario in which, prior to the test, we identify a certain number of subjects, say x, that we expect to be positive (according to epidemiological criteria). We create x/s horizontal pools, each of size s, with those subjects. The remaining (N-x) subjects are assigned to the remaining (N-x)/s horizontal pools. Should epidemiological criteria be perfect, all the x subjects would turn out to be true positive and thus the first x/s pools would be positive. At the same time, all the (N-x) subjects without prior indication of an infection, with perfect epidemiological criteria, would be true negative and thus their (N-x)/s horizontal pools would yield a negative result. Of course, such an assumption is unrealistic, and we expect that some of the x subjects suspected to be positive are true negative and also that some of the (N-x) subjects suspected to be negative are true positive.

Let us denote by α the fraction of the vf·N true positive subjects in the population that are correctly assigned to the initial pools. The remaining (1-α) fraction is undetected and it is wrongly assigned to the second part of the pools. Criteria with perfect performance in prior detection of positive subjects would result in α = 1. In addition, let us denote by β the fraction of the (1-vf)·N true negative subjects in the population that are correctly assigned to the final pools. The remaining (1-β) fraction is wrongly assigned to the first part of the pools. Criteria with perfect performance in prior detection of negative subjects would result in β = 1. For the same settings analysed in the random creation of the pools (N = 600, vf from 0.01 to 0.30, and s from 2 to 300), we explored the performance of the sequential procedure for different values of α and β. In particular, we allowed α and β to vary in the set {0.5, 0.6, 0.7, 0.8}. When both α and β are equal to 0.5, criteria are essentially unreliable and our situation is equivalent to the random assignment setting discussed above. Tables 2-4). Our aim is to compare the results of the number of tests required when swabs are randomly assigned to the pools with the number of tests required for different α and β values.
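The definitions of α and β imply that the number of subjects placed in the initial pools is x = α·vf·N + (1-β)·(1-vf)·N. A minimal sketch of how such an informed ordering of the cohort could be generated is given below. The rounding of the counts, the shuffling within each part, and the function name `informed_layout` are assumptions made for illustration and are not taken from the paper.

```python
import random


def informed_layout(n, vf, alpha, beta, seed=0):
    """Order a cohort so that suspected-positive subjects come first.

    Returns (status, x): status is a list of bools (True = truly positive)
    and x is the number of subjects tagged as suspected positive.
    """
    rng = random.Random(seed)
    n_pos = round(vf * n)                      # true positives in the cohort
    n_neg = n - n_pos                          # true negatives in the cohort
    pos_flagged = round(alpha * n_pos)         # positives correctly suspected
    neg_flagged = n_neg - round(beta * n_neg)  # negatives wrongly suspected

    first_part = [True] * pos_flagged + [False] * neg_flagged
    second_part = ([True] * (n_pos - pos_flagged)
                   + [False] * (n_neg - neg_flagged))
    rng.shuffle(first_part)                    # random order inside each part
    rng.shuffle(second_part)
    return first_part + second_part, len(first_part)


if __name__ == "__main__":
    status, x = informed_layout(n=600, vf=0.10, alpha=0.8, beta=0.8)
    print("suspected positive:", x)
    print("true positives among them:", sum(status[:x]))
    print("true positives missed:", sum(status[x:]))
```

Cutting the returned list into consecutive groups of size s then yields the horizontal pools. With α = β = 0.5, the share of true positives in the first part equals vf, which matches the statement that this setting is equivalent to random assignment.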
As mentioned above, we started with α and β equal to 0.5, because this is substantially equivalent to uninformative prior criteria. As α and/or β increase, we observe that the number of required tests decreases, and this decrease is larger when the virus frequency is greater. When vf is below 5%, random pooling and informed pooling are almost equivalent. With a low vf, however, sequential random pooling was already highly efficient, substantially decreasing the number of tests with respect to separate individual tests. For larger vf, the curves corresponding to random assignment and informed pooling separate more and more. This implies that reliable informed pooling increases the performance of the pooling exactly when the situation is less favourable. For example, with a pool size equal to 12 and random assignment, the median of T/N is equal to 1 when vf ≈ 0.18 (making the application of random pooling questionable). Conversely, if informed pooling is performed with α = β = 0.8, at the same vf, the median of T/N is approximately equal to 0.73. With α = β = 0.8, pooling is still efficient (T/N < 1) even if the virus frequency approaches 30%. In summary, reliable informed pooling makes the performance path much more linear than we observed for random pooling, even if we use larger pools. That is, larger pools, besides providing substantial savings for low vf, ensure efficiency even for larger vf if epidemiological criteria provide reliable information.


Discussion
Every successful strategy for controlling the SARS-CoV-2 infection depends on timely diagnosis. Hence, there is an urgent need for systematic population screening on a massive scale. Currently, around the world, there is a plethora of different scenarios depending on the spread of infection. Transmission of SARS-CoV-2 has a high degree of heterogeneity across diverse environments, and even within a single country, there are categories with different contagion risks; for each category, the optimal monitoring frequency must be determined to prevent outbreaks. Moreover, in this variegated context, there are completely different economic situations, and the pooling strategy can become truly attractive for countries with fewer resources. The study by Hogan et al. [10] is certainly an excellent starting point for the evaluation of an alternative approach to individual analysis of swab samples for the RT-PCR based diagnosis of SARS-CoV-2, but some additional considerations are needed.
First, it must be highlighted that in the study by Hogan et al. [10], 292 pools of 9 or 10 samples were created and two positive cases in a collection of 2888 samples were found. The one-step pooling method gave excellent results because the frequency of the virus in the analyzed samples was extremely low (0.07%). Second, if it were possible to roughly estimate the frequency of the virus in the collection as being lower than 5%, our data suggest increasing the pool size. Using a pool size of 24, for example, the screening of the 2888 samples would need about 120 tests instead of 292. The most difficult samples to detect are those from patients who are in the early or late stage of infection, because of the lower viral load [14]. Furthermore, these samples risk going undetected as false negatives if the pooling procedure causes dilution or fractionation of the positive specimen. There are ongoing studies attempting to determine the ideal protocol and possible technical solutions to reduce this risk [21-23]. However, it should also be stressed that the pooling method is so efficient that it could also allow for an increase in the frequency of serial testing and a timed monitoring, an effective strategy especially for subjects with a higher risk of infection.
Our most straightforward result is that the sequential pooling approach is more efficient than the one-step pooling method. In addition, the informed version of sequential pooling can further improve its performance, in particular for larger pools and moderate to large virus frequency. Table 5 broadly describes practical suggestions to decide the pool size, s, according to rough assumptions about the virus frequency, both for random and informed sequential pooling. Larger pools ensure a significant reduction in the number of tests when vf is small. Smaller pools may be a conservative approach when dealing with cohorts with heavier exposure. Finally, indications are also given to avoid the use of pooling when virus frequency is higher and random pooling would result in a waste of resources, since too many pools are expected to yield a positive result.

Table 5. Summary of practical indications for pooling creation.

vf below 10%
Random sequential pooling: If we can assume a vf below 10%, pools with sizes ranging from 10 to 15 can provide relevant savings in the number of tests. With vf below 5%, even stronger savings can be obtained with pool sizes increased to 20 or 25.
Informed sequential pooling (α, β ≥ 0.7): If we can assume a vf below 10%, very large pools, with sizes ranging from 20 to 25, could substantially reduce the number of tests. Pools with sizes equal to 30 or 40 are a good strategy with vf below 5%.

vf between 10% and 20%
Random sequential pooling: For situations in which vf may reach 10%-20% of the cohort, there can still be a moderate reduction in the number of tests, with pools of size between 5 and 8.
Informed sequential pooling (α, β ≥ 0.7): For situations in which vf may reach 10%-20% of the cohort, we can still have a relevant reduction in the number of tests, with pools of size between 12 and 20.

vf above 20%
Random sequential pooling: For situations in which there is the risk of a vf value above 20% of the cohort, pooling strategies should be avoided.
Informed sequential pooling (α, β ≥ 0.7): For situations in which there is the risk of a vf value above 20% of the cohort, a moderate reduction can be attained with pool sizes about 12, to be further reduced to 8