Data processing of qualitative results from an interlaboratory comparison for the detection of “Flavescence dorée” phytoplasma: How the use of statistics can improve the reliability of the method validation process in plant pathology

A working group established in the framework of the EUPHRESCO European collaborative project aimed to compare and validate diagnostic protocols for the detection of “Flavescence dorée” (FD) phytoplasma in grapevines. Seven molecular protocols were compared in an interlaboratory test performance study where each laboratory had to analyze the same panel of samples consisting of DNA extracts prepared by the organizing laboratory. The tested molecular methods consisted of universal and group-specific real-time and end-point nested PCR tests. Different statistical approaches were applied to this collaborative study. Firstly, there was the standard statistical approach, which consists of analyzing samples known to be positive and samples known to be negative and reporting the proportion of false-positive and false-negative results to calculate diagnostic specificity and diagnostic sensitivity, respectively. This approach was supplemented by the calculation of repeatability and reproducibility for qualitative methods based on the notions of accordance and concordance. Other new approaches were also implemented, based, on the one hand, on the probability of detection model, and, on the other hand, on Bayes’ theorem. These various statistical approaches are complementary and give consistent results. Their combination, and in particular the introduction of new statistical approaches, gives overall information on the performance and limitations of the different methods, and is particularly useful for selecting the most appropriate detection scheme with regard to the prevalence of the pathogen. Three real-time PCR protocols (methods M4, M5 and M6, developed respectively by Hren (2007), by Pelletier (2009) and with oligonucleotides under patent) achieved the highest levels of performance for FD phytoplasma detection. This paper also addresses the issue of indeterminate results and the identification of outlier results.
The statistical tools presented in this paper and their combination can be applied to many other studies concerning plant pathogens and other disciplines that use qualitative detection methods.


Introduction
The use of analytical methods capable of producing reliable analytical results is a prerequisite to the effective control of quarantine plant pathogens. Consequently, it is relevant to evaluate and test the methods to define how valid and reliable the produced results are for an intended purpose, i.e. to validate the methods.
Validation is defined in the ISO/IEC 17025 standard [1] as "the confirmation by examination and the provision of objective evidence that the particular requirements for a specific intended use are fulfilled". A rigorous validation process includes both single laboratory validation and an interlaboratory test performance study (TPS), the latter also being referred to as a ring test or collaborative trial [2].
An interlaboratory test performance study is considered to be a more reliable indicator of how a test will perform in other laboratories because it requires testing of the method in multiple laboratories, by different analysts using different reagents, supplies and equipment and working in different laboratory environments. Interlaboratory test performance studies are an essential part of the validation process for analytical methods. The purposes of an interlaboratory test performance study are to determine the performance of one or more tests among laboratories, to estimate the reproducibility of the test(s), and, if several tests are included, to establish their comparability. In this context the use of statistics, both for the design of interlaboratory test performance studies and the processing of the participants' results, is essential to make sure that differences observed between tests have a high probability of being real and not due to random effects.
As a French National Reference Laboratory for plant pathology, the ANSES Plant Health Laboratory organizes interlaboratory test performance studies in order to ensure that the methods used by officially approved French laboratories (certified by government services) are capable of producing reliable analytical results for the detection of plant pathogens. In this context, the ANSES Plant Health Laboratory participated in a working group established in the framework of EUPHRESCO (EUropean PHytosanitary RESearch Coordination, a phytosanitary European Research Area Network (ERA-NET) project) which aimed to compare and validate diagnostic protocols for the detection of "Flavescence dorée" (FD) phytoplasma in grapevines. FD is one of the main grapevine diseases in Europe included in the European legislation as a quarantine pest (Directive 2000/29/EC). Seven methods based on conventional and real-time PCR for the detection of this phytoplasma were selected and subject to interlaboratory trials performed in 14 European laboratories.
Interlaboratory test performance studies in plant pathology have a number of notable features including the processing of qualitative results. This paper focuses on different statistical tools that can be used to process the results of an interlaboratory test performance study and presents their application to the collaborative study on the detection of FD phytoplasma. The use of statistics in data processing helps to distinguish between differences that are due to chance and those that probably indicate a real effect of the evaluated methods and consequently ensures the reliability of results and the related decision-making process. Standard statistical approaches based on the PM7/76 (2) [3], PM7/98 (2) [4] and PM7/122(1) [2] standards and on the ISO 16140 standard [5] first made it possible to determine traditional performance
In detail, the 15 positive samples included 11 grapevine samples positive for FD phytoplasma and four samples positive for other phytoplasmas of the 16SrV group. The nine negative samples included four samples of healthy Vitis vinifera and five samples contaminated by phytoplasmas from groups other than the 16SrV group.

Evaluation of analytical specificity
Details on the samples used at each step of the TPS are provided in Tables 2 and 3. Preliminary tests aiming to verify the homogeneity and stability of the different samples were successfully performed (data not shown).
Using these samples, the TPS participants were asked to run the different methods under the normal working conditions of their laboratory and to handle the samples in the same manner as those routinely analyzed in the laboratory. PCR reagents and controls (positive controls, negative controls) were not provided by the organizer.
Data analysis. All statistical tests were performed using the R statistical software (version 3.3.1; R Development Core Team, Vienna, Austria). Test results were considered significant when the calculated p-value was lower than 0.05.

Performance criteria
First stage of the evaluation: Analytical specificity. In reference to the PM7/98(2) standard [4], analytical specificity (ASP) was defined as the degree of correspondence between the responses obtained by the evaluated method and the expected theoretical results (samples' real status), and was assayed using two criteria: diagnostic sensitivity (DSE), i.e. the ability of the method to detect the target when it is present in the sample, and diagnostic specificity (DSP), i.e. the ability of the method not to detect the target when it is absent from the sample. Some laboratories reported indeterminate results (i.e. the operator was unable to determine the status of the sample). The equality of the number of indeterminate results between methods on the one hand and between laboratories on the other hand was tested using Fisher's exact test and the Cochran-Mantel-Haenszel test. The ISO 16140 standard [5] stipulates that collaborative studies should be based on data from laboratories with high competence for the techniques that are being compared. Consequently, the results of a laboratory were excluded (considered as outliers) for a given method when three conditions were met together: the statistical analysis showed a significant difference between the number of indeterminate results obtained by this laboratory and by the others; this laboratory accounted for more than 50% of the indeterminate results obtained for the method; and its indeterminate results represented more than 50% of the expected negative or positive results from the panel of samples (i.e. number of indeterminate results ≥ 5 for negative samples or ≥ 8 for positive samples).
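The three-part exclusion rule for indeterminate results can be written as a small predicate. This is only a sketch: the function name and argument names are hypothetical, the p-value is assumed to come from a separate Fisher or Cochran-Mantel-Haenszel test, and the panel sizes (9 negative and 15 positive samples) are taken from the description of the sample panel.

```python
def is_indeterminate_outlier(p_value, lab_indet_neg, lab_indet_pos,
                             method_indet_total, n_neg=9, n_pos=15, alpha=0.05):
    """Return True when all three exclusion conditions hold for one laboratory.

    p_value            -- p-value of the between-laboratory test (computed elsewhere)
    lab_indet_neg/pos  -- indeterminate results of this laboratory on negative/positive samples
    method_indet_total -- indeterminate results of all laboratories for the method
    """
    lab_indet = lab_indet_neg + lab_indet_pos
    # Condition 1: the laboratory differs significantly from the others.
    significant = p_value < alpha
    # Condition 2: it accounts for more than half of the method's indeterminate results.
    dominates_method = method_indet_total > 0 and lab_indet / method_indet_total > 0.5
    # Condition 3: more than half of its expected negative (or positive) results are indeterminate.
    dominates_panel = lab_indet_neg / n_neg > 0.5 or lab_indet_pos / n_pos > 0.5
    return significant and dominates_method and dominates_panel


# Illustrative call: 6 indeterminate results on negative samples out of 11 for the method.
print(is_indeterminate_outlier(0.038, 6, 0, 11))
```

Passing the p-value in as an argument keeps the rule itself separate from the choice of statistical test.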
Several statistical procedures used in this paper cannot take into account missing data, so, once outliers were eliminated, we tested two scenarios: (H1) the laboratory hypothetically made the right decision for the indeterminate results in relation to the samples' real status (i.e. the indeterminate results were counted as positive for positive samples and negative for negative samples) and (H2) the opposite. This made it possible to estimate an interval that includes the real values of the parameters. However, to keep the presentation of data concise, the first scenario, which better reflects reality, is favored.
The total numbers of true positives (TP, a positive result is obtained when a positive result is expected), true negatives (TN, a negative result is obtained when a negative result is expected), false positives (FP, a positive result is obtained when a negative result is expected) and false negatives (FN, a negative result is obtained when a positive result is expected) were determined for each laboratory and each method. As explained for indeterminate results, collaborative studies should be based on data from laboratories with high competence for the techniques that are being compared. Consequently, the results of a laboratory were excluded for a given
Diagnostic specificity (DSP) was calculated using the results obtained for the negative samples (samples b, c, d, e, h, k, q, r, v, as listed in Table 2) and was defined as the TN/N- ratio, where N- refers to the total number of tests for negative samples. Diagnostic sensitivity (DSE) was calculated for positive samples (samples a, f, g, i, j, l, m, n, o, p, s, t, u, w, x, as listed in Table 2) and was defined as the TP/N+ ratio, where N+ refers to the total number of tests for positive samples.
Analytical specificity (ASP) was evaluated for all of the results by calculating the ratio of the sum of the number of positive and negative agreements between a method and the expected theoretical results for the number of tested samples. Confidence intervals (95%) were calculated for ASP, DSE and DSP criteria using the Wilson score method with no continuity correction [19]. Tests on the equality of ASP, DSE and DSP between methods were performed using Fisher's exact test.
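As an illustration of these definitions, the sketch below computes DSE, DSP and ASP from the four counts and a Wilson score interval without continuity correction. The function names and the counts in the example call are hypothetical and are not taken from the study data.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval (no continuity correction) for a proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


def performance_criteria(tp, fn, tn, fp):
    """DSE = TP/N+, DSP = TN/N-, ASP = (TP + TN) / (N+ + N-)."""
    n_pos = tp + fn   # N+: tests performed on positive samples
    n_neg = tn + fp   # N-: tests performed on negative samples
    return {
        "DSE": tp / n_pos,
        "DSP": tn / n_neg,
        "ASP": (tp + tn) / (n_pos + n_neg),
    }


# Hypothetical counts, for illustration only.
criteria = performance_criteria(tp=90, fn=10, tn=45, fp=5)
dse_low, dse_high = wilson_interval(90, 100)
print(criteria, round(dse_low, 3), round(dse_high, 3))
```

The Wilson interval stays within [0, 1] and behaves sensibly for proportions near 0 or 1, which is the usual reason for preferring it to the simple normal approximation with small panels.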
For this second stage of the evaluation, no indeterminate results were reported by the participating laboratories. Moreover, the sub-group of five laboratories (P1, P2, P7, P12 and P14) selected for this second stage of the evaluation had already demonstrated its competence in FD detection (routine use of end-point PCR and real-time PCR protocols, and no outlier results identified during the first stage of the evaluation). Consequently, all results were included in the data analysis.
Lastly, it is worth noting that only two out of five laboratories could provide results with method Ma, whereas results were provided for all other methods by the whole sub-group of five laboratories. As a result, for method Ma, the performance criteria were estimated with poor precision in comparison to other methods, and for some statistical tests no results are presented because there were insufficient data to ensure a reliable result.
According to the PM 7/76(2) standard [3], analytical sensitivity is the lowest amount of target that can be reliably detected.
For each method, a probability of detection was calculated per dilution level as $x / n$, where x corresponds to the number of positive results and n corresponds to the number of results obtained for a given dilution level. This calculated probability was compared to the theoretical detection level of 95% using the exact binomial test. For a given method, the highest dilution level (i.e. the lowest amount of target) for which no significant difference from the theoretical detection level of 95% was identified was considered as the dilution level that can be reliably detected. To provide an overview of this criterion, an overall probability of detection ("overall" ASE) was calculated per method. Confidence intervals (95%) were calculated for ASE using the Wilson score method with no continuity correction [19]. Tests on the equality of ASE between methods were performed using Fisher's exact test.
To determine variability within a laboratory and between laboratories, other criteria were evaluated from the same samples and results: repeatability, reproducibility and concordance odds ratio.
Repeatability is the level of agreement between replicates of a sample tested under the same conditions according to the PM 7/76(2) standard [3]. Considering qualitative data, it was estimated by calculating accordance (DA) as recommended in the ISO 16140 standard [5] (i.e., the probability of finding the same result from two identical test portions analyzed in the same laboratory, under repeatability conditions) according to the following equation: $\mathrm{DA} = \dfrac{pr\,(pr - 1) + nr\,(nr - 1)}{N\,(N - 1)}$, where pr and nr are the number of positive and negative responses, respectively, and N is the total number of responses.
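The comparison of an observed detection rate with the 95% level can be sketched as follows. The two-sided p-value convention used here (summing the probabilities of all outcomes no more likely than the observed one, as R's `binom.test` does) is an assumption about how the test was applied; the function names and the example counts are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k positive results out of n for detection probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def exact_binomial_p_value(x, n, p0=0.95):
    """Two-sided exact binomial p-value for H0: probability of detection = p0."""
    observed = binom_pmf(x, n, p0)
    tolerance = observed * (1 + 1e-7)  # small relative slack for floating-point ties
    total = sum(binom_pmf(k, n, p0) for k in range(n + 1)
                if binom_pmf(k, n, p0) <= tolerance)
    return min(1.0, total)


def reliably_detected(x, n, p0=0.95, alpha=0.05):
    """True when x/n is not significantly different from the p0 detection level."""
    return exact_binomial_p_value(x, n, p0) >= alpha


# Hypothetical dilution levels: (positive results, total results).
for level, (x, n) in {"D1": (20, 20), "D2": (18, 20), "D3": (10, 20)}.items():
    print(level, x / n, reliably_detected(x, n))
```

With only a few results per dilution level the test has little power, so a non-significant outcome should be read together with the confidence interval of the proportion.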
Reproducibility is the ability of a test to provide consistent results when applied to aliquots of the same sample tested under different conditions (time, persons, equipment, location etc.), according to the PM 7/76(2) standard [3]. Considering qualitative data, it was estimated by calculating concordance (CO) as recommended in the ISO 16140 standard [5] (i.e., the percentage chance of finding the same result for two identical samples analyzed in two different laboratories). Concordance was calculated by taking each replicate in turn from each participating laboratory and pairing it with the identical results from all other laboratories. Concordance was the percentage of pairings giving the same result among all possible pairings of data. If concordance is smaller than accordance, two identical samples are more likely to give the same result if they are analyzed by the same laboratory than if they are analyzed by different ones, suggesting that performance varies between laboratories or that the method is not robust enough to reproduce the same results under different laboratory conditions. As the magnitude of qualitative repeatability and reproducibility is strongly dependent on the level of accuracy, the ISO 16140 standard [5] recommends calculating the concordance odds ratio (COR), defined as $\mathrm{COR} = \dfrac{\mathrm{DA} \times (1 - \mathrm{CO})}{\mathrm{CO} \times (1 - \mathrm{DA})}$, to assess the degree of interlaboratory variation. The larger the odds ratio, the more predominant the interlaboratory variation. For COR values above 1.00, Fisher's exact test was used to evaluate the statistical significance of the variation between laboratories. Confidence intervals (95%) for accordance and concordance were calculated with the basic bootstrap method [20] using R statistical software ("boot" package). To keep the presentation of data concise, these confidence intervals are given only for overall results.
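The pairing logic behind accordance, concordance and COR can be sketched for a single sample as follows. The data layout (a dictionary mapping each laboratory to its list of 0/1 replicate results), the function names and the averaging of accordance over laboratories are assumptions made for illustration; the example results are hypothetical.

```python
from itertools import combinations


def accordance(results_by_lab):
    """Mean, over laboratories, of the share of within-laboratory pairs that agree."""
    values = []
    for results in results_by_lab.values():
        n = len(results)
        k = sum(results)  # number of positive (1) responses
        values.append((k * (k - 1) + (n - k) * (n - k - 1)) / (n * (n - 1)))
    return sum(values) / len(values)


def concordance(results_by_lab):
    """Share of agreeing pairs among all pairs of results from two different laboratories."""
    same = total = 0
    for lab_a, lab_b in combinations(results_by_lab, 2):
        for res_a in results_by_lab[lab_a]:
            for res_b in results_by_lab[lab_b]:
                total += 1
                same += int(res_a == res_b)
    return same / total


def concordance_odds_ratio(da, co):
    """COR = DA (1 - CO) / (CO (1 - DA)); undefined when DA = 1 or CO = 0."""
    if da >= 1 or co <= 0:
        raise ValueError("COR is undefined for DA = 1 or CO = 0")
    return da * (1 - co) / (co * (1 - da))


# Hypothetical replicate results (1 = positive, 0 = negative) for one sample.
data = {"lab A": [1, 1, 1], "lab B": [1, 1, 0], "lab C": [0, 0, 0]}
da, co = accordance(data), concordance(data)
print(round(da, 3), round(co, 3), round(concordance_odds_ratio(da, co), 2))
```

In this toy data set each laboratory is internally fairly consistent while the laboratories disagree with one another, so accordance exceeds concordance and the COR is well above 1.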
Confidence intervals (95%) for COR values were calculated with Woolf's logit method [21], in which n is the total number of possible interlaboratory pairings for the same sample, DA is accordance, CO is concordance and z is the 1-α/2 point of the standard normal distribution (z = 1.96 for a 95% confidence interval, i.e. with a risk (α) of 5%).
The approach of the PM 7/76(2) [3] and ISO 16140 [5] standards was supplemented by another approach based on the probability of detection (POD) model. This model aims to harmonize the statistical concepts and parameters between quantitative and qualitative method validation [22]. The POD model provides a tool for the graphical representation of response curves for qualitative methods. In addition, the model enables comparisons between methods, and provides calculations of repeatability, reproducibility, and laboratory effects from collaborative study data. POD characterizes the method response with respect to concentration as a continuous variable. As described previously (see the paragraph on analytical sensitivity), for each method, a probability of detection was calculated per dilution level. For an interlaboratory trial, the POD value is called LPOD. Confidence intervals (95%) were calculated. The POD (or LPOD) values of two methods can be compared through their difference at a given analyte concentration. The statistical significance of the difference in POD values (termed dPOD or dLPOD) is determined by its confidence interval. If it includes zero, then the difference between the methods being compared is not significant.
Method variances from collaborative validation studies were modeled as $s_R^2 = s_r^2 + s_L^2$, where $s_R$ is the standard deviation of reproducibility, $s_r$ is the standard deviation of repeatability, and $s_L$ is the laboratory effect. The estimation method for qualitative POD variance used the analysis of variance (ANOVA) model as defined for the quantitative model given in ISO 5725-1 [23] and used the same ANOVA calculation methods, except that, instead of a quantitative result, each result was coded as 1 for a positive response and 0 for a negative response. All calculations made in this paper concerning the POD model were based on LaBudde's recommendations [24] and are detailed in the publication by Wehling [22].
Variance component estimation via ANOVA with an additive model was possible because the number of replicates analyzed by each laboratory per dilution level was greater than 12 [22].
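The decision rule for dPOD (significant when the confidence interval excludes zero) can be illustrated with a simple sketch. The interval used here is the Newcombe hybrid score interval built from two Wilson intervals, which treats the two proportions as independent; this is an assumption chosen for illustration and not necessarily the interval used for dLPOD in this study, which may also account for the laboratory effect. Function names and counts are hypothetical.

```python
import math


def wilson_interval(x, n, z=1.96):
    """Wilson score interval (no continuity correction) for x positives out of n."""
    p = x / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def dpod_interval(x1, n1, x2, n2, z=1.96):
    """Difference POD1 - POD2 with a Newcombe hybrid score confidence interval."""
    p1, p2 = x1 / n1, x2 / n2
    l1, u1 = wilson_interval(x1, n1, z)
    l2, u2 = wilson_interval(x2, n2, z)
    d = p1 - p2
    lower = d - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return d, lower, upper


def significantly_different(x1, n1, x2, n2, z=1.96):
    """True when the confidence interval of the difference excludes zero."""
    _, lower, upper = dpod_interval(x1, n1, x2, n2, z)
    return lower > 0 or upper < 0


# Hypothetical counts at one dilution level: reference method vs. candidate method.
print(dpod_interval(58, 60, 20, 60), significantly_different(58, 60, 20, 60))
```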
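A minimal sketch of the variance component estimation on 0/1-coded results is given below, assuming a balanced design (the same number of replicates in every laboratory) and the usual one-way random-effects ANOVA estimators; negative estimates of the between-laboratory variance are truncated to zero. The function name, data layout and example data are hypothetical.

```python
import math


def pod_variance_components(results_by_lab):
    """Estimate s_r, s_L and s_R from 0/1 results with a balanced one-way ANOVA."""
    labs = list(results_by_lab.values())
    n_labs, n_rep = len(labs), len(labs[0])
    if any(len(results) != n_rep for results in labs):
        raise ValueError("this sketch assumes the same number of replicates per laboratory")
    lab_means = [sum(results) / n_rep for results in labs]
    grand_mean = sum(lab_means) / n_labs
    # Within-laboratory mean square: repeatability variance.
    ms_within = sum((x - m) ** 2 for results, m in zip(labs, lab_means)
                    for x in results) / (n_labs * (n_rep - 1))
    # Between-laboratory mean square.
    ms_between = n_rep * sum((m - grand_mean) ** 2 for m in lab_means) / (n_labs - 1)
    var_r = ms_within
    var_l = max(0.0, (ms_between - ms_within) / n_rep)  # laboratory effect, floored at 0
    return {"s_r": math.sqrt(var_r),
            "s_L": math.sqrt(var_l),
            "s_R": math.sqrt(var_r + var_l)}


# Hypothetical results at one dilution level (1 = positive, 0 = negative).
example = {"P1": [1, 1, 1, 0], "P2": [1, 1, 0, 0], "P7": [0, 0, 0, 1]}
print(pod_variance_components(example))
```

By construction the returned values satisfy s_R² = s_r² + s_L², so a laboratory effect close to zero means that reproducibility is driven almost entirely by within-laboratory variation.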
Making full use of all TPS outcomes: Bayesian approach. From the results obtained by the laboratories during the first stage (to keep the presentation of data concise, indeterminate results obtained during the first stage of the evaluation were considered under scenario H1 only) and the second stage of the evaluation, likelihood ratios (LR) were calculated using the following formulas:
Positive LR: $\mathrm{LR}^{+} = \dfrac{SE}{1 - SP}$, where SE is the proportion of positive results obtained from positive samples and SP is the proportion of negative results obtained from negative samples, i.e. the probability of a positive test result for a target sample divided by the probability of a positive result for a non-target sample.
Negative LR: $\mathrm{LR}^{-} = \dfrac{1 - SE}{SP}$, with the same notation as the previous formula, i.e. the probability of a negative test result for a target sample divided by the probability of a negative test result for a non-target sample.
The LR indicates how much a given diagnostic test result will raise or lower the pretest probability of the disease in question and is a useful tool for assessing the effectiveness of a diagnostic test [25]. The interpretation of the LRs of a diagnostic test was established as follows [25]: LRs of > 10 or < 0.1 generate large and often conclusive changes from pre- to post-test probability, LRs of 5 to 10 or 0.1 to 0.2 generate moderate shifts in pre- to post-test probability, LRs of 2 to 5 or 0.2 to 0.5 generate small (but sometimes important) changes in probability and LRs of 1 to 2 or 0.5 to 1 alter probability to a small (and rarely important) degree.
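Confidence intervals (95%) for the LRs were calculated as $\exp\left(\ln\dfrac{p_1}{p_2} \pm z\sqrt{\dfrac{1 - p_1}{p_1 n_1} + \dfrac{1 - p_2}{p_2 n_2}}\right)$, where for LR+, p1 = SE, p2 = 1-SP, p1n1 = true positives and p2n2 = false positives and for LR-, p1 = 1-SE, p2 = SP, p1n1 = false negatives and p2n2 = true negatives. z is the 1-α/2 point of the standard normal distribution (z = 1.96 for a 95% confidence interval).
Bayes' theorem was further used to translate the information given by the likelihood ratios into a probability of disease. Post-test probabilities were simulated over a range (0.01%-99%) of pre-test probabilities using Bayes' theorem as follows [27,28]: pre-test probability = prevalence (defined as the proportion of grapevine plants infected by FD phytoplasma in a particular population of plants at a specific time); pre-test odds = pre-test probability / (1 - pre-test probability); post-test odds = pre-test odds × LR; post-test probability = post-test odds / (1 + post-test odds). To combine the results of two methods, the post-test odds were calculated as follows [28]: post-test odds = pre-test odds × LR of the first method × LR of the second method.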
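The chain from sensitivity and specificity to a post-test probability can be sketched as follows. The confidence interval uses the usual log-method form, which is assumed here to match the interval described in terms of p1, p2, n1 and n2; the odds form of Bayes' theorem is standard, and multiplying the LRs of two methods assumes that their results are conditionally independent. Function names and the SE/SP values in the example are hypothetical.

```python
import math


def likelihood_ratios(se, sp):
    """Return (LR+, LR-) from sensitivity SE and specificity SP."""
    return se / (1 - sp), (1 - se) / sp


def lr_confidence_interval(p1, n1, p2, n2, z=1.96):
    """Log-method interval for the ratio p1/p2 (assumed form of the LR interval)."""
    ratio = p1 / p2
    se_log = math.sqrt((1 - p1) / (p1 * n1) + (1 - p2) / (p2 * n2))
    return ratio * math.exp(-z * se_log), ratio * math.exp(z * se_log)


def post_test_probability(pre_test_probability, *likelihood_ratio_values):
    """Apply one or more likelihood ratios to a pre-test probability (prevalence)."""
    odds = pre_test_probability / (1 - pre_test_probability)
    for lr in likelihood_ratio_values:
        odds *= lr  # post-test odds = pre-test odds x LR (x LR of a second test)
    return odds / (1 + odds)


# Hypothetical method with SE = 0.90 and SP = 0.95, each estimated from 100 tests.
lr_pos, lr_neg = likelihood_ratios(0.90, 0.95)
print(round(lr_pos, 2), round(lr_neg, 3))
print(lr_confidence_interval(0.90, 100, 0.05, 100))   # interval for LR+
print(round(post_test_probability(0.50, lr_pos), 3))  # after a positive result
print(round(post_test_probability(0.50, lr_neg), 3))  # after a negative result
```

Working in odds makes the combination of two tests a simple product, which is why the same function accepts any number of likelihood ratios.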

Analytical specificity
Results submitted by the different laboratories for the first stage of the evaluation are available in S5 Table and were used to calculate the rates of indeterminate results presented in Table 1 and the performance criteria presented in Table 4. All these results are commented on in the following sub-sections.
Indeterminate results. Depending on the method, the rate of indeterminate results (Table 1) ranged from 2.08% (method Ma) to 6.48% (method M6). Using Fisher's exact test, no significant differences in the rate of indeterminate results were identified between methods for the overall results or when considering only positive or negative results (p-value respectively of 0.264, 0. By contrast, significant differences in the rate of indeterminate results were identified between laboratories for the overall results and also when considering only positive or negative results (p-values of $3.9 \times 10^{-3}$, $3.4 \times 10^{-4}$ and $1.7 \times 10^{-3}$ for overall results, positive results and negative results, respectively). The same conclusion was reached using the Cochran-Mantel-Haenszel test: the p-value became significant (0.038) between laboratories when method M5 was introduced in the calculation, which was due to the number of indeterminate results produced by laboratory P9 with method M5. More than 50% of the indeterminate results obtained for this method were recorded from this laboratory (6/11). In addition, all indeterminate results produced by laboratory P9 were obtained from negative samples, so indeterminate results represented more than 50% of expected negative results (6/9). Consequently, the results of laboratory P9 for method M5 were excluded from the analysis. From a technical point of view, it appeared that laboratory P9 was not able to determine the cut-off value for method M5, and some late Ct values were observed in two of the wells for the samples declared as indeterminate.
Outlier results. Expected results were obtained for all positive and negative controls in each laboratory for each method. Thus, no data were excluded from the statistical analysis for this reason.
Regardless of how the indeterminate results were counted (scenario H1 or H2), the results of laboratory P6 for method M2 (FP = 9, which represented, depending on the scenario for indeterminate results, 45% to 53% of the number of FPs for the method) and the results of laboratory P5 for
The results of laboratory P9 for method M5 should also be excluded from the analysis, but only for scenario H2 (FP = 7, which represents 41% of the number of FPs for the method). This result is due to the fact that this laboratory produced a large share of the indeterminate results.
Consequently, the results of laboratory P6 for method M2, the results of laboratory P5 for method Ma and the results of laboratory P9 for method M5 were excluded from the analysis. Although no clear technical explanation was found for the results of laboratory P6 for method M2 or for the results of laboratory P9 for method M5 (possible contamination of samples during the analyses), laboratory P5 had not implemented the RFLP analysis recommended for method Ma.
Diagnostic sensitivity and diagnostic specificity: The two components of analytical specificity. The performance criteria assessed in the first stage of the method evaluation (analytical specificity, diagnostic sensitivity and diagnostic specificity) are summarized in Table 4. It is worth noting that many false-negative results were obtained from sample "w" (more than 25% of laboratories obtained false-negative or indeterminate results from this sample, whatever the method used, and this percentage increased to more than 50% for methods M2, Ma, M3, M4 and M6). The presence of natural inhibitors was suspected for this sample.
Diagnostic specificity ranged from 88.9% (scenario H2) to 94.4% (scenario H1) for M4 and from 84% (H2) to 95.1% (H1) for M6. Using Fisher's exact test, the DSP results for methods M4 and M6 under scenario H1 were not significantly different from the results for methods Ma, M2, M1 and M5, but were significantly better than the results obtained with method M3 under this same scenario.
Moreover, only the DSP results for methods M4 and M6 under scenario H1 did not differ significantly from the theoretically expected results (p = 0.059 and p = 0.120 respectively).
The low DSP performance of method M3 (ranging from 61.9% to 68.3% depending on the scenario) was partly due to the positive detection of the two samples "e" (Ca. P. fraxini -16SrVII) and "r" (Western X grapevine -16SrIII) by, respectively, four and six of the seven laboratories that implemented this test. Method M3 targets the 16S rDNA of phytoplasmas of the 16SrV group, but this region of the genome is conserved among phytoplasmas. The method is therefore not fully specific, even though the primer affinity is lower for phytoplasmas of the 16SrVII and 16SrIII groups than for phytoplasmas of the 16SrV group.

Analytical sensitivity
Results submitted by the different laboratories for the second stage of the evaluation are available in S6 Table and were used to calculate the performance criteria presented in Tables 5-7 and represented in Fig 1. Fig 1 provides a graphical representation of the probability of detection of the different methods based on the dilution level of the target. Graphically, the results of method M2 are well below those of the other methods, whereas method M5 appears to be the best method regardless of the dilution level.
We can note that some results seem to be inconsistent with the serial dilution: method M1 (dilution D1), method Ma (D1), method M4 (D1) and method M3 (D1 and D2). Although there were differences between laboratories, no evidence of outlier results could be identified (the lower probability of detection could not be related to one laboratory in particular), so all data were included in the statistical analysis. The presence of natural inhibitors in the samples could explain these unexpected results at low dilution levels, just as for sample "w" in the evaluation of diagnostic sensitivity (it can be noted that sample "w" was used to produce the "C" serial dilution). The inhibition occurring at low dilutions could be removed at high dilution levels (by the dilution of inhibitors).
The analytical sensitivity results for the different methods are summarized in Table 5. The best results were obtained for method M5, for which the target could be reliably detected up to the D4 dilution (no significant difference from the theoretical detection level of 95% according to the exact binomial test). For method M6, this level corresponded to the D3 dilution. For methods M4, M3 and Ma, results were less interpretable because of the inconsistent results at the first dilution levels described above. However, method M4 presented reliable detection of the target for dilution levels D2 and D3. By contrast, results for method M2 were significantly different from the theoretical detection level of 95% for all dilution levels. The overall ASE confirmed these results: using Fisher's exact test, the overall ASE of method M2 (32.3%) was significantly lower than for other methods.

Repeatability, reproducibility and odds ratio
Overall repeatability (Table 5) was above 90% for methods M5, M2, Ma and M4, whereas overall reproducibility was above 90% only for method M5. While repeatability remained good for all methods (greater than 80%), the results for reproducibility were very poor for some methods (52.1% and 57.9% for M2 and Ma respectively). The concordance odds ratio was not significantly different from 1.00 only for methods M5 and M6 (Fisher's exact test), meaning that no significant differences between laboratories were identified concerning the results obtained from these two methods. Similar results were obtained when results per sample were considered (Table 6). In the case of method M4, even though the overall COR was significantly different from 1.00, it is worth noting that when results per sample were considered, only COR results for samples corresponding to the D1 dilution were significant. No significant differences between laboratories were identified concerning the results obtained from samples corresponding to dilutions D2 to D5 for method M4. For other methods, the overall COR was significant and, in detail, differences between laboratories were identified for many samples and dilution levels. Greater differences between laboratories were identified for methods M2 and Ma. In particular, the case of method M2 is intriguing regarding the significance of the difference between laboratories: three laboratories presented an overall probability of detection close to 5% whereas the other two laboratories presented a probability of detection close to 70%. The case of method Ma, even if it should be put into perspective given the very small number of laboratories able to produce results with this method, already reflects problems in the reproducibility of results between laboratories.
This is broadly confirmed from a technical point of view by the fact that more than 50% of laboratories could not provide results with this method during this second stage of the evaluation.

Probability of detection model
Statistical parameters of the POD model applied to the different methods are shown in Table 7. For the calculations of dLPOD values, method M5 was taken as the reference method (in view of the previous results), and pairwise comparisons were made between M5 and each of the other methods.
The results for repeatability standard deviation, reproducibility standard deviation and laboratory effect were consistent with previous results concerning repeatability, reproducibility and COR. The laboratory effect was particularly high for method M2 (0.169 to 0.529, depending on the dilution level), whereas it was very low (equal or close to zero) for methods M5 and M6, and for method M4 for all dilutions except for D1. The p-value associated with the Fisher-Snedecor test in the ANOVA performed from binary results obtained by each laboratory indicated a significant difference between laboratories for all dilution levels for methods M2 and M3, a significant difference for three dilution levels (D1, D2 and D5) for method M1, and a significant difference for only one dilution level for methods Ma (D1), M4 (D1) and M6 (D2). No significant differences between laboratories were identified for method M5.
Lastly, using the dLPOD to compare the responses of methods with respect to the target concentration, no significant differences were identified between methods M4 and M5, or between methods M6 and M5 (except for the highest dilution, D5). Significant differences were identified for all dilution levels for methods M1 and M2, for three dilution levels (D2, D3 and D4) for method Ma and for two dilution levels (D1 and D2) for method M3.

Bayesian approach
Likelihood ratios are shown in Table 8. The LR+ values from methods M6, Ma and M4 (respectively 18.30, 17.90 and 16.11) are high (> 10), indicating that these methods generate a large change from pre- to post-test probability. The reliability of a positive test result is therefore higher for these methods than for methods M5, M1 and M2 (moderate change: LR+ values between 5 and 10) and more particularly than for M3 (small change: LR+ = 2.63 < 5). The LR- of M5 is very close to zero (equal to 0.06 < 0.1), indicating that this method generates a large change from pre- to post-test probability. The reliability of a negative test result is therefore much higher for this method than for methods M6 and M4 (moderate change: LR- values between 0.1 and 0.2) and more particularly than for methods Ma, M1, M3 and M2 (small change: LR- > 0.2).
The likelihood ratio can be combined with the prevalence of infection to determine the post-test probability of infection. Fig 2 illustrates the post-test probabilities of FD phytoplasma (i.e. after a test result) as a function of the pre-test probabilities for each evaluated method and also for the combination of the two most reliable methods (methods M5 and M6). In this graph, the effect of the test result is described by two curves, one for a positive result and the other for a negative one (Lamb, 2007), making it possible to calculate the post-test probability of infection with a positive or negative result depending on the prevalence of the FD phytoplasma in the studied population. For example, in a population with a prevalence of 50%, the probability of a tested individual really being infected after a positive result is higher than 90% for methods Ma, M6 and M4; it is between 80% and 90% for methods M5, M1 and M2 and only 73.2% for M3. After a negative result, there is only 5.5% probability that the grapevine plant is infected by the FD phytoplasma when tested with method M5. This probability increases to 9.2% and 10.0% for methods M6 and M4 respectively, but remains low for these methods. Conversely, relatively high probabilities of infection are found for samples tested negative with Ma, M1, M3 and particularly M2 (35.4%).
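As a worked illustration of how the curves in Fig 2 are obtained, the following Python sketch converts a pre-test probability into a post-test probability through the odds form of Bayes' theorem. The function name is hypothetical, and the inputs are the rounded LR+ values quoted in the text, so the percentages it returns may differ slightly from those read from the figure.

```python
def post_test_probability(prevalence, likelihood_ratio):
    """Post-test probability of infection, for a prevalence in [0, 1)."""
    pre_test_odds = prevalence / (1.0 - prevalence)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1.0 + post_test_odds)


# Rounded LR+ values quoted in the text
lr_pos = {"M6": 18.30, "Ma": 17.90, "M4": 16.11, "M3": 2.63}

# Probability of infection after a positive result at 50% prevalence
for method, lr in lr_pos.items():
    probability = post_test_probability(0.50, lr)
    print(f"{method}: {100 * probability:.1f}%")
```

At a prevalence of 50% the pre-test odds equal 1, so the post-test odds are simply the likelihood ratio; this is why the ranking of methods at that prevalence mirrors the ranking of their LR values.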
The Bayesian approach can be used to choose the most appropriate detection scheme for a particular epidemiological situation. If disease prevalence is low, as is usually the case for FD phytoplasma detection, one of the three real-time PCR methods (M4, M5 or M6) is the most convenient test for routine FD phytoplasma assessment. However, when the FD phytoplasma-free status needs to be accurately assessed (e.g. for the production of healthy plants and the grapevine certification scheme), it would be appropriate to use two detection tests both based on real-time PCR (e.g. methods M5 and M6). The probability of infection of a plant with a positive result obtained with two detection tests is higher than the probability of infection of a plant with a positive result obtained by only one detection test. Similarly, the accuracy of a negative result is very high when the analysis is performed by both detection tests. For example, the post-test probability of infection is lower than 1% if a negative result is obtained both with method M5 and method M6 from a grapevine plant sampled in a population presenting up to 63% prevalence of infection (vs. 14% if method M5 is used alone). This suggests that the combination of methods could minimize the risk of releasing infected material when the two test results are negative, which is particularly important for the certification of grapevine plants. Similarly, the post-test probability of infection is higher than 90% with a positive result obtained both with method M5 and method M6 from a grapevine plant sampled in a population with at least 6% prevalence (vs. at least 30% prevalence if method M6 is used alone). It can be important to guarantee a positive result through the use of two tests if grubbing up and destruction decisions with major economic consequences are taken on the basis of these analysis results.
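The effect of combining two tests can be illustrated with the short Python sketch below. It rests on two assumptions that are not stated in the text: the two tests are treated as conditionally independent, so that their likelihood ratios multiply, and the LR- values of M5 and M6 are back-calculated from the post-test probabilities quoted at 50% prevalence (5.5% and 9.2%), where the post-test odds equal the likelihood ratio. All function names are hypothetical.

```python
def lr_from_post_test_at_half(post_test_prob):
    """At 50% prevalence the pre-test odds are 1, so LR = post-test odds."""
    return post_test_prob / (1.0 - post_test_prob)


def max_prevalence_below(lr_neg, threshold):
    """Highest prevalence at which the post-test probability of infection
    after a negative result is still equal to `threshold`."""
    target_odds = threshold / (1.0 - threshold)
    pre_test_odds = target_odds / lr_neg
    return pre_test_odds / (1.0 + pre_test_odds)


# Approximate LR- values back-calculated from the text (assumption)
lr_neg_m5 = lr_from_post_test_at_half(0.055)
lr_neg_m6 = lr_from_post_test_at_half(0.092)

# Assumption: conditional independence, so the likelihood ratios multiply
lr_neg_combined = lr_neg_m5 * lr_neg_m6

limit_m5 = max_prevalence_below(lr_neg_m5, 0.01)
limit_combined = max_prevalence_below(lr_neg_combined, 0.01)
print(f"M5 alone: {100 * limit_m5:.1f}%")
print(f"M5 and M6: {100 * limit_combined:.1f}%")
```

Under these assumptions the sketch gives prevalence limits of about 63% for the combination and between 14% and 15% for M5 alone, in line with the figures quoted above. If the two tests were positively correlated, the combined likelihood ratio would be less extreme than the simple product.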
Pre-test probability (prevalence) was defined as the proportion of plants infected by FD phytoplasma in a particular population at a specific time. Post-test probability was calculated as follows:

$$\text{post-test probability} = \frac{\text{post-test odds}}{1 + \text{post-test odds}}, \quad \text{where} \quad \text{post-test odds} = \frac{\text{pre-test probability}}{1 - \text{pre-test probability}} \times \text{likelihood ratio}.$$

For each method, the solid line represents the post-test probabilities of FD phytoplasma infection after a positive test result for different prevalence rates. The broken line represents the post-test probabilities of FD phytoplasma infection after a negative test result for different prevalence rates.
During the first stage of the evaluation, based on the analysis of a variety of target and non-target samples for the determination of analytical specificity, the best results were obtained with methods M4 and M6. The results obtained with methods M4 and M6 were not significantly different from those of M5, but were significantly better than the results obtained with the other methods.
During the second stage of the evaluation, based on the analysis of serial dilutions of target samples for the determination of analytical sensitivity, repeatability, reproducibility, and the POD model, the best results were obtained with method M5, for which the target can be reliably detected up to the dilution of 1.1 × 10⁻³, with overall repeatability and reproducibility higher than 90%. The concordance odds ratio was not significantly different from 1.00 only for methods M5 and M6, meaning that no significant differences between laboratories were identified concerning the results obtained from these two methods, whereas significant differences were identified for all other methods (for M4, however, only at the first dilution level). Using the POD model, no significant differences were identified between M4 and M5, or between M6 and M5 (except for the highest dilution, 3.7 × 10⁻⁴). Significant differences were identified for at least two dilution levels (and for up to all dilution levels) for the other methods.
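To make the link between the concordance odds ratio (COR) and the laboratory effect concrete, the Python sketch below computes accordance, concordance and COR by direct enumeration of result pairs. It follows the usual pairwise definitions (accordance: agreement between two results from the same laboratory; concordance: agreement between two results from different laboratories) and is not necessarily the exact computation used in this study. The function names and the data set (three laboratories, four replicates each) are hypothetical.

```python
from itertools import combinations


def accordance(results_by_lab):
    """Mean over laboratories of the fraction of within-lab pairs that agree."""
    per_lab = []
    for results in results_by_lab:
        pairs = list(combinations(results, 2))
        per_lab.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(per_lab) / len(per_lab)


def concordance(results_by_lab):
    """Fraction of between-lab pairs of results that agree."""
    agree = 0
    total = 0
    for lab_a, lab_b in combinations(results_by_lab, 2):
        for a in lab_a:
            for b in lab_b:
                agree += int(a == b)
                total += 1
    return agree / total


def concordance_odds_ratio(acc, conc):
    """COR; undefined when accordance is 1 or concordance is 0."""
    return acc * (1.0 - conc) / (conc * (1.0 - acc))


# Hypothetical binary results (1 = positive, 0 = negative) for one dilution level
results = [
    [1, 1, 1, 1],  # laboratory 1
    [1, 1, 1, 0],  # laboratory 2
    [1, 1, 0, 0],  # laboratory 3
]

acc = accordance(results)
conc = concordance(results)
cor = concordance_odds_ratio(acc, conc)
print(acc, conc, cor)
```

A COR close to 1 means that two results from different laboratories agree about as often as two results from the same laboratory; values above 1 point to between-laboratory variation. Whether a departure from 1 is significant has to be assessed with a statistical test, which this sketch does not perform.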
Lastly, the Bayesian approach helps summarize all these results and choose the most appropriate detection scheme depending on the epidemiological context. The graphical representation of post-test probabilities of FD phytoplasma as a function of pre-test probabilities for each evaluated method demonstrates the relevance of methods M4 and M6 for the positive predictive value (i.e. confidence in the positive test result), and the relevance of method M5 for the negative predictive value (i.e. confidence in the negative test result). The combination of methods M5 and M6 minimizes the risk of releasing infected material when the two test results are negative, which is particularly important for the certification of grapevine plants. Similarly, it can be important to guarantee a positive result through the use of two tests if grubbing up and destruction decisions with major economic consequences are taken on the basis of these analysis results.

Data processing of qualitative interlaboratory test performance study results
This paper underlines the usefulness of interlaboratory trials for method validation. These studies are essential to establish the reliability and compatibility of test results. They enable the evaluation of performance criteria such as reproducibility, which are essential to evaluate how transferable the method is among laboratories performing routine analyses. For plant pathology, collaborative studies remain rare [29], although they are recommended by different regional and international organizations in plant health such as the European and Mediterranean Plant Protection Organization.
The use of statistics in the data processing of interlaboratory collaborative studies is essential to identify significant differences in performance criteria between tests, i.e. to ensure that the differences observed between tests have a high probability of being real and not due to chance.
However the situation can be more complex. For example, when comparing two methods, we can come to the conclusion that the first method shows better performance than the second one for a criterion (e.g. analytical specificity) but poorer performance for another criterion (e.g. analytical sensitivity). Thus, in this type of situation, it is very difficult to draw a conclusion as to the overall performance of the two methods. This case precisely demonstrates the relevance of new statistical approaches such as the probability of detection model and the Bayesian approach. Supplementing the traditional approach, these new statistical approaches provide an overview of method performance. They are essential tools to reliably compare methods in their entirety (including the diversity of contamination levels, the diversity of the target, the diversity of sample matrices, etc.). Graphical representations are used to summarize and communicate the results. In the case of the POD model, the graph (Fig 1) summarizes the rate of positive results as a function of the target concentration. Anyone can see at a glance that methods M5 and M6 have the best performance overall when considering different contamination levels; however it remains important to refer to the calculated values (e.g. Table 7) to evaluate the significance of these differences. In the case of the Bayesian approach, the graph (Fig 2) summarizes method performance, including both the results for positive and negative samples and simulating different epidemiological situations. It is very useful to quickly identify the most appropriate methods according to the epidemiological context. For example, in situations of high prevalence, method M5 appears as the most appropriate method to confirm a negative result, whereas in situations of low prevalence, methods M6, M5 and Ma appear as the most appropriate methods to confirm a positive result. 
This graph also provides users with an overview of method performance irrespective of the prevalence and the type of sample (positive or negative). For a given method, the closer to the vertical and horizontal axes the solid (and respectively the dotted) curves are, the higher the overall method performance is. Thus, it is easy to identify methods M4, M5 and M6 as being the most effective in all situations. Lastly, the graph illustrates the relevance of combining two different methods (e.g. M5 and M6).
The contribution of the POD model and Bayesian approaches for qualitative methods can be compared to that of the accuracy profile, which is now widely used for quantitative analytical methods [30,31].
However, this paper also illustrates the difficulty in plant pathology of designing a TPS perfectly in line with theoretical statistical rules. Currently available guidelines recommend a minimum of ten valid laboratory data sets (and not fewer than eight). This number can be difficult to meet in plant pathology for several reasons, including the lack of availability of reference materials and the small number of laboratories competent for a given pathogen. Consequently, interlaboratory test performance studies are very often driven by collaborative projects such as EUPHRESCO, which involve different partners at a regional or an international scale (European scale for EUPHRESCO), making it possible to meet the required number of competent laboratories. However, the downside is that these TPSs usually include a variety of methods to be evaluated, each partner legitimately wishing to include the method it routinely uses in the TPS. Thus, the need for a consensus in collaborative projects generally leads to a wide variety of methods being included, making it difficult for all the participants to implement all the methods.
For example, in the TPS for FD phytoplasma detection, each method was implemented by five to 14 laboratories during the first step of the evaluation. This created disparities in the precision with which the methods were assessed. In this paper, we decided to process all the data while flagging this disparity in precision with all due caveats, bearing in mind that the non-significance of a statistical test does not mean the absence of differences, but rather that no differences were identified.
This also raises the question of the competence of laboratories participating in a TPS. The more methods in the TPS, the more laboratories may be led to implement methods with which they are not familiar. Hence the importance of establishing upstream criteria to identify outliers. The issue of indeterminate results is also important to consider because the impact can be very different from one method to another. Not integrating indeterminate results in calculations can lead to bias in the results. In this paper, significant differences in the number of indeterminate results were identified between laboratories but not between methods, and therefore the impact of the presence of indeterminate results on the method assessment was low. A number of issues that arise at the data processing stage should be anticipated in the design of the TPS. When designing a TPS, special effort should be made upstream to define how data will be used and processed and to consequently define the appropriate experimental design (number of participants, number of evaluated methods in line with the number of participants, number of target and non-target samples, number of repetitions, number of dilution (or concentration) levels, diversity/representativeness of samples, etc.). However, this need for design framing should not be a pretext to include only artificially contaminated samples. This collaborative study, with the example of sample "w", underlines the fact that it is essential to include in the sample selection, when possible, naturally contaminated samples, which match as closely as practicable the type of samples encountered in routine testing, to assess the tests in "real" conditions.
In conclusion, this paper underlines the usefulness of statistics to increase the reliability of validation data. The statistical tools presented in this paper, some standard and some new in plant health, are deliberately exhaustive to provide useful guidelines for research staff wanting to validate new detection tests through interlaboratory collaborative studies. They can be applied to many other studies concerning plant pathogens and other disciplines that use qualitative detection methods.
Supporting information S1