A Bayesian reanalysis of the effects of hydroxychloroquine and azithromycin on viral carriage in patients with COVID-19

Gautret and colleagues reported the results of a non-randomised case series which examined the effects of hydroxychloroquine and azithromycin on viral load in the upper respiratory tract of Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) patients. The authors reported that hydroxychloroquine (HCQ) had significant virus reducing effects, and that dual treatment of both HCQ and azithromycin further enhanced virus reduction. In light of criticisms regarding how patients were excluded from analyses, we reanalysed the original data to interrogate the main claims of the paper. We applied Bayesian statistics to assess the robustness of the original paper’s claims by testing four variants of the data: 1) The original data; 2) Data including patients who deteriorated; 3) Data including patients who deteriorated with exclusion of untested patients in the comparison group; 4) Data that includes patients who deteriorated with the assumption that untested patients were negative. To ask if HCQ monotherapy was effective, we performed an A/B test for a model which assumes a positive effect, compared to a model of no effect. We found that the statistical evidence was highly sensitive to these data variants. Statistical evidence for the positive effect model ranged from strong for the original data (BF+0 ~11), to moderate when including patients who deteriorated (BF+0 ~4.35), to anecdotal when excluding untested patients (BF+0 ~2), and to anecdotal negative evidence if untested patients were assumed positive (BF+0 ~0.6). The fact that the patient inclusions and exclusions are not well justified nor adequately reported raises substantial uncertainty about the interpretation of the evidence obtained from the original paper.


Introduction
In March 2020, at the beginning of the corona virus pandemic, Gautret and colleagues reported the results of a non-randomised open-label case series which examined the effects of HCQ and azithromycin (AZ) on viral load in the upper respiratory tract of SARS-CoV-2 patients. The authors reported that HCQ had virus reducing effects, and that dual treatment of both HCQ and AZ further enhanced virus reduction. At the time, these data triggered urgent speculation whether these drugs should be considered as candidates for the treatment of severe COVID-19. However, questions were quickly raised regarding the study's data integrity, statistical analyses, and experimental design. In light of these questions, we reanalysed the original data by performing a multiverse analysis in which we systematically varied assumptions regarding the inclusion of patients in the analysis, and their clinical status. The Bayesian methods we apply have several advantages in this context. Firstly, in contrast to the frequentist methods applied in the original paper, Bayesian statistics allows for a direct evaluation of the evidence for and against competing hypotheses. In other words, we can go beyond accepting or rejecting the null hypothesis of no treatment effect, by assessing how strong is the evidence provided by the data, for one hypothesis over and another. Secondly, by performing a multiverse analysis in which the data depends on the inclusion assumptions made, we can explore how the strength of evidence for the clinical efficacy of the HCQ treatments depends on which subjects were included or excluded, and what their clinical status was.

Experimental methods
The experimental details are reported in the original published paper [1] which we henceforth refer to as the original paper.

Treatment groups
For brevity we deviate from the nomenclature of the original paper. HCQ mono refers to treatment with only HCQ. HCQ +AZ refers to treatment with both AZ and HCQ. HCQ group refers to all patients treated with either HCQ mono or HCQ +AZ . Comparison group refer to those patients not receiving either treatment. Note that these are erroneously referred to as controls in the original paper. Note that the statistical analysis file available for this reanalysis paper refers to the comparison group as controls.

Data
The experimental details are reported in the original paper. Raw data was not available at the time of writing but was transcribed from Table 1 of the original paper. We assess the robustness of the original paper's claims by testing four variants of the data, which vary assumptions pertaining to deteriorated and untested patients: (1) Data orig is the data as originally reported. This is the original data, as reported in the original paper.
(2) Data det includes deteriorated patients. It is questionable to exclude several patients who could not complete the treatment because their condition deteriorated. This could introduce a selection bias that inflates the effect of the treatment. We therefore modified the original data as follows: of the HCQ mono group, six were originally described as being excluded. One patient died, three deteriorated into intensive care, one patient stopped because of nausea and one left the hospital. These can be considered counterfactual cases that are necessary to entertain for a conservative estimate of the effects of HCQ mono . In the following, we add the patients who died or entered intensive care to the positive test cases for day 6. This is tabulated in Fig 1A. We exclude both the patients who stopped treatment due to nausea, and the one patient who left the hospital, due to the ambiguity of their cases. This means that four cases are added to the HCQ mono that tested positive for SARS-CoV-2.

PLOS ONE
(3) Data xcon includes deteriorated patients and excludes the untested patients. On day 6 there were 5 patients who were untested, even though the day 6 test outcome was the primary outcome. The untested patients were assumed to be positive in the original paper, and for this data variant we simply exclude them. This can be motivated by the fact that the tests have some level of stochasticity, as can be seen from the fact that there are a total of 9 transitions from negative tests on a given day, followed by positive tests the next day. For 3 of the 5 patients, they tested positive on day 5, and for 2 of the patients the tests were not performed either on day 4 or day 5. Hence it is not known with any certainty what the test outcomes would have been in the five untested patients had they been tested on day 6. This is especially problematic since all 5 of these untested patients belonged to the comparison group. For this reason it is important to analyse data that excludes these untested patients.
(4) Data negcon includes the deteriorated patients and assumes untested patients test negative. Given the problem with untested patients, we perform an analysis to evaluate what would happen to the results had these patients been tested and they were negative, rather than positive as assumed in the original data. This is included as a data variant not because it is the most likely case, but because it is the most conservative possible outcome, given the uncertainty of the reported data.

Statistical analysis
Bayesian statistical analyses of the data were performed in JASP (version 0.11, jasp-stats.org). We note that caution should be taken with reanalyses of this data set because there are some discrepancies between different pre-prints and published versions. The analysis file, including the raw data we transcribed from the original paper, and all materials, are available on the Open Science Framework at https://osf.io/5dgmx/. All references to strength of evidence refer to standard conventions for the evidentiary support of Bayes factors (BF) such that 1-3 is classed as anecdotal, 3-10 as moderate, 10-30 as strong, and 30-100 as very strong [2]. For Bayes factors below 1, the reciprocal can be taken to obtain the strength of evidence in the opposite direction. Our initial analyses attempted to reproduce the findings of the original paper using the same data. We then performed the same analyses again, but with modified assumptions for how to treat excluded patients and untested patients. This can be considered a form of sensitivity analysis, of the sort recommended by a statistical review of the original paper [3]. Unless otherwise stated, we focus on the primary outcome of viral carriage on day 6 relative to inclusion point into the study.

Main effect of HCQ mono on viral carriage reduction
The original paper compared HCQ group , which is a composite of two groups (HCQ mono and HCQ +AZ ) with different drug treatments, to the comparison group. A more appropriate test for this question would be HCQ mono versus comparison group. Here we perform the test HCQ mono versus comparison group, and assess its sensitivity to the variants of the data under different assumptions regarding deteriorated and untested patients. Fig 1A shows the number and proportion of patients testing positive for SARS-CoV-2 grouped by HCQ mono or comparison group.
Here we quantify the degree to which HCQ mono reduces viral carriage of SARS-CoV-2 (viral carriage hence). We conducted a Bayesian A/B test [4,5] that considered three rival models. The first model is a null model H 0 which states that the viral carriage in HCQ mono is equal to that of the comparison group. This entails that the log odds ratio ψ, for viral carriage reduction is equal to 0. The second model is a positive effect model H + which predicts that the effect of HCQ mono exceeds that of the comparison group, and is thus indicative of a beneficial effect of HCQ mono on viral carriage. Under this model, ψ is assigned a positive-only truncated normal prior distribution N + (μ,σ). The third model is a negative effect model Hthat predicts that the effect of HCQ mono is smaller than that of the comparison group, which would indicate a harmful effect of HCQ mono on viral carriage. Under this model ψ is assigned a negative-only truncated normal prior distribution N -(μ,σ). For all models in this paper, we perform a default analysis in which the parameters of the normal distribution are set such that μ = 0 and σ = 1. The results are tabulated in Fig 1B, 1D, 1F and 1H for each of the four data variants. The Bayes factors that we report indicate how likely the data is under each model. Thus for each data variant one can use these factors to find which model finds most support from the data. For all data variants, the prior probabilities of each model were the same, namely that H 0 is assigned a probability of 0.5, H + and Hare each assigned a probability of 0.25. Given these priors, for each of the three models, the corresponding posterior model probabilities are computed P (Model | data) as can be seen in the tables of Fig 1B, 1D, 1F and 1H. For all data variants (except Data negcon ) it is the positive effect model H + that receives most support from the data. For Data original the evidence is strong (BF +0 = 10.57) meaning that the data is approximately 11 times more likely under H + than under H 0 . For Data det the evidence is moderate when including the deteriorated patients (BF +0~4 .35), and for Data xcon the evidence is anecdotal when excluding untested patients (BF +0~2 ). For Data negcon there was anecdotal evidence against the positive effect model if untested patients were assumed negative (BF +0~0 .6). As is evident from this, the strength of the evidence for the positive effect of HCQ mono over the comparison group is highly sensitive to the assumptions regarding what to do with the deteriorated or untested patients. The more conservative the assumptions that were made, the lower the strength of the evidence for the viral reduction effect of the HCQ treatment. In other words, different choices in the pre-processing of the data can sway the evidence from strong evidence for a positive effect of HCQ mono to anecdotal evidence against such an effect.
This analysis provides interval estimates that were missing from the original report, which allow us to assess the size of the odds ratios, and their plausible ranges under different assumptions about the data. This sensitivity to assumptions is expressed in the credibility intervals for the odds ratios of the treatments. For Data original the 95% credibility interval for the odds ratio has a lower bound of~1.02 and an upper bound of~13. For the Data negcon however the same intervals run from a lower bound of~0.33 to an upper bound of~3.2.   its own. We again conducted an A/B test that considered three rival models. The first model is a null model H 0 which states that the viral carriage in HCQ mono is equal to that of the HCQ +AZ , thus offering no clinical benefit or harm, in terms of viral carriage. This entails that the log odds ratio ψ, for viral carriage reduction is equal to 0. The second model is a positive effect model H + which predicts that the effect of HCQ +AZ exceeds that of the HCQ mono , and is thus indicative of a beneficial effect of adding AZ to HCQ to reduce viral carriage. Under this model ψ is assigned a positive-only normal prior distribution N + (μ,σ). The third model is a negative effect model Hthat predicts that the effect of HCQ +AZ is smaller than that of HCQ mono , which would indicate a harmful effect of adding AZ to HCQ in terms of viral carriage. Under this model ψ is assigned a negative-only normal prior distribution N -(μ,σ).

Main effect of combined treatment of AZ and HCQ on viral carriage reduction
The results are tabulated in Fig 2 for each of the two data variants, Data original and Data det . As with the previous model, for both data variants, the prior probabilities of each model were the same, namely that H 0 is assigned a probability of 0.5, H + and Hare each assigned a probability of 0.25. Given these priors, for each of the three models the corresponding posterior model probabilities are computed P(Model | data), as can be seen in the tables of Fig 2B and  2D. For both data variants it is the positive effect model H + that recieves most support from the data. Note that only two data variants are computed since the comparison group is not part of this test.
For Data original the evidence is anecdotal (BF +0 = 2.776) meaning that the data is approximately 3 times more likely under H + than under H 0 . This level of evidence is sometimes referred to as "barely worth mentioning" [2]. This would appear to temper the conclusions of the original paper, which inferred that this was a clinically important result, and one that was central to the medical recommendations of the paper. Given the priors described above, the corresponding posterior model probabilities would be H 0 (0.391), H + (0.542) and H -(0.067). As can be seen in Fig 2C the 95% credibility interval for the odds ratio has a lower bound of 0.54 and an upper bound of~11. This indicates that there is also large uncertainty in the size of the positive clinical effect. The lower bound represents a 46% reduced chance of viral clearance having been improved by adding AZ to HCQ. The upper bound of this estimate represents a 1000% improved chance. The large uncertainty in this estimate of the odds ratio is due to the small sample size obtained in the original findings, in which the HCQ +AZ group had only 6 members.
For Data det the evidence is moderate when including the deteriorated patients (BF +0 5.812), meaning that the data is approximately 6 times more likely under H + than under H 0 . This demonstrates that the more conservative exclusion criteria actually increases the strength of evidence for the superiority of HCQ +AZ over HCQ mono . This is because including the deteriorated patients negatively impacts on the proportion of negative tests for the HCQ mono group but not the HCQ +AZ group. If the null model H 0 were assigned prior probability 0.5 and the H + and Hwere each assigned a probability of 0.25, the corresponding posterior model probabilities would be H 0 (0.248), H + (0.719) and H -(0.033). As shown in Fig 2E the 95% credibility interval for the odds ratio has a lower bound of~0.77 and an upper bound of~14, indicating a large uncertainty in the positive clinical effect. The lower bound represents a 23% reduced chance of viral clearance having been improved by adding AZ to HCQ. The upper bound of this estimate represents a 1300% improved chance. Again, the large uncertainty in this estimate is due to the small sample size obtained in the original findings, in which the HCQ +AZ group had only 6 members.
As is evident from this analysis, the strength of the evidence for the positive effect of HCQ +AZ over HCQ mono is sensitive to the assumptions regarding what to do with the deteriorated patients. Different choices change the evidence, from anecdotal based on the original data, to moderate under more conservative exclusion criteria. This analysis provides interval estimates that were missing from the original report, which importantly allow assessment of the odds ratios, and their plausible ranges under different assumptions about the data. For both data variants there is large uncertainty in the odds ratios, ranging from moderate reductions in the chance of improvement, up to very large chances of improvement.
The strength of evidence for all statistical comparisons for all data variants is shown in Fig  3. Note that we have focused on the comparisons of HCQ mono versus comparison group, and HCQ +AZ vs. HCQ mono , because these answer the questions set out in the original paper. We computed two other comparisons for completeness, HCQ group versus the comparison group, and HCQ +AZ versus the comparison group. Focus on these last two tests were downgraded, as because the first test aggregates two different treatments, and the second test confounds the effect of AZ. The full analysis details are available in the supplementary materials. As can be seen for both of these additional tests, the evidence is sensitive to the assumptions pertaining to the inclusion of deteriorated patients as well as to the status of untested patients.

Summary
Using a complementary (Bayesian) statistical framework, we evaluated the strength of the statistical evidence for the main claims of Gautret et al., and we asked how robust this evidence is to different assumptions about how to treat deteriorated and untested patients. Though we were able to qualitatively reproduce a positive effect of HCQ on viral load reduction, and a further improvement by adding AZ, the strength of the evidence was highly sensitive to variations in these assumptions. We discussed these in detail and provided a broader context for evaluating the quality of evidence offered by the original paper. In the original paper, the main test for the effect of HCQ was performed by comparing a group of two different treatments (monotherapy HCQ mono and the combination therapy HCQ +AZ ) against the comparison group. This test does not directly answer the question of what is the clinical effect of HCQ on upper respiratory tract SARS-CoV-2 viral load reduction, and to answer this one needs to compute the effect of HCQ mono against the comparison group. It is regrettable that this test was not reported because it is the test that is necessary to evaluate the effect of HCQ on viral reduction. Performing a Bayesian A/B test, we found that for the original data, there was strong statistical evidence for the positive effect of HCQ mono improving the chances of viral reduction when compared to the comparison group. However, we found that the level of evidence drops down to moderate evidence when including the deteriorated patients, and it drops further to

PLOS ONE
anecdotal evidence when excluding the patients that were not tested on the day of the primary outcome (day 6). For context, anecdotal evidence is generally considered 'barely worth mentioning' [2]. We were able to qualitatively reproduce the finding of an improvement of HCQ +AZ over HCQ momo . However, although this finding was statistically significant in the original finding, our reanalysis revealed only anecdotal evidence for the positive effect of HCQ +AZ over HCQ momo . However, when we included the deteriorated patients into the analysis, this evidence increased to moderate. We also performed another test, which is to compare HCQ +AZ against the comparison group. It should be noted that this test is not the most relevant test because it varies two drugs at the same time. Nevertheless, the statistical evidence for the positive effect of the combined treatment over no treatment is very strong for the original data, and drops down to moderate when excluding the untested patients (Fig 3).

Statistical considerations
Common to both of these sets of analyses is the fact that they are highly sensitive to the assumptions that are made about the exclusion of patients, and of the test status of untested patients. That these assumptions are not well justified, nor adequately reported gives reason to be cautious in the interpretation of the evidence obtained from the original paper. Furthermore, this uncertainty is generally reflected in the credibility interval estimates for the odds ratios, which can range from moderate decreases or small increases in the chances of improvement, through to very large chances of improvement. These uncertainties stem from the small sample sizes for each subgroup. Indeed, the original paper has been criticised for being underpowered due to its relatively small sample size (16 comparison group, 20 treated). The Bayesian inference framework is a natural way to incorporate sample size in the analysis. Small sample sizes will lead to more variance and hence less certainty on the estimated model parameters, and as such typically do not provide compelling evidence; as sample size increases, the evidence will generally become stronger, either in favor of H0 (when there is no discernable effect in the data) or in favor of H1 (when the data do show an effect). Thus, as sample size grows the evidence typically grows as well, in a smooth fashion. We argue that, from the perspective of statistical evidence, this criticism of being underpowered is less relevant once the data are observed. Firstly, such criticisms, though commonly espoused, should be made with reference to the effect size they are underpowered for. Small sample studies can be well powered for detecting large effect sizes. More importantly, although estimating power is useful in planning experiments, it can be misleading when making inferences from observed data [6]. In this reanalysis we rely on Bayes factors, which are an extension of likelihood ratios beyond point hypotheses. These methods of inference do not average over hypothetical replications of an experiment, but instead condition on the data that were actually observed. For instance, the fact that a small sample can reveal strong evidence for an effect indicates that the effect size could be relatively large. In this way, Bayes factors rationally quantify the evidence that a particular dataset provides for or against the null, or any other hypothesis. Recourse to claims about the power of an experiment can be displaced by considering the strength of the evidence for one model over other models. This is clinically important because the strength of this evidence is not apparent from the statistical reporting of the original paper (which only reported p-values). Put simply, the findings of the original paper cannot be dismissed solely on the basis of being "underpowered".

Negative effect hypotheses
In this paper we assigned prior probabilities of 0.5 to H 0 and 0.25 each to H + and H -We assign probability mass to both the negative and positive hypotheses because when testing a drug it is important to know whether it negatively or positively impacts on clinical outcomes. A different model which only considered only H 0 and H + which, say, assigned prior probability of 0.5 to both hypotheses would have assigned higher posterior probabilities to the H + than those reported here. However, these different priors on the hypotheses would not change any of the Bayes factors reported, because such priors cannot influence the Bayes factor. This is one of the reasons we have emphasised the interpretation of the Bayes factor rather than the posterior probabilities of the hypotheses.

Experimental design and pre-registered protocol
The most fundamental problem with the original paper is that there was no randomisation of the treatment, which means it is vulnerable to differences in baseline risk between the subgroups. In the original paper, the treatment groups are confounded by several variables including whether or not they met the exclusion criteria, which centre implemented treatment, and differences in consent (the comparison group were composed of those that met the exclusion criteria or did not consent to treatment). For a full statistical review of these considerations see Dahler et al [3]. Most importantly, the comparison between HCQ +AZ and HCQ mono is confounded by the unreported clinical reasons for which the physicians decided to add the AZ treatment to some patients but not to others. If these reasons were important enough to warrant different treatment, then they are important enough to impact on the comparability between the two groups. Whilst we refrain from making formal inferences, it is relevant to note that the HCQ group patients were older than the comparison group patients (median age 51.5 and 32.5 years respectively, Table 1). It is also worth mentioning, that the comparison group included five cases aged 16 or younger, which should again warrant caution when comparing outcomes between groups. We briefly comment on the existence of putative deviations from the pre-registered protocol, available at https://www.clinicaltrialsregister.eu/ctr-search/ trial/2020-000890-25/FR. Outcomes specified in advance included evaluation of upper respiratory tract viral carriage at 1,7, and 14 days, and yet the primary outcome reported in the paper was on day 6. This has been interpreted by some as outcome switching, however we would, in the absence of further information, suggest the possibility that this is an issue of how the days are numbered, whether one starts counting from zero or one. The day 14 outcome was presumably not included such that the report could be published 7 days earlier, which is defensible given the urgency of the pandemic at the time of writing. The secondary outcomes registered in the protocol are not adequately reported or analysed. Finally, the raw data tables changed between different versions of the preprint and the published paper, and thus questions can be asked about data integrity. Clearly, accommodation must be made for the speed at which the original report was published, and the conditions under which the data were presumably collected. The integrity of the reanalysis presented here is explicitly predicated on the assumption that all these possible deviations and data integrity issues can be adequately resolved. Good clinical practice inspection for the sake of patient safety and data transparency would help to resolve such issues.

Measurement of viral load
It is important to note that the PCR based test uses a threshold of 35 cycles (CT) to distinguish between PCR positive and PCR negative, some PCR positive patients in particular in the HCQ treatment group show CT numbers that are quite close to this threshold indicating that the status might be somewhat ambiguous during the test. Furthermore, a number of patients are later tested positive after being tested negative (occurring a total 9 times in 8 patients) which may further question the use of a hard threshold on the number of cycles. For these reasons using duplicate sample analysis and confirmatory tests and eventually developing quantitative PCR tests for assessment of treatment effects would be recommended for future studies. Also note, that the current recommendation for a FDA-emergency approved test is that negative PCR results do not preclude presence of SARS-CoV-2 infection and recommend that such results be accompanied by clinical observations, patient history, and epidemiological information. Finally, it is important to determine whether SARS-CoV-2 virus nucleic acid detected by PCR is replication competent or not. At the time of writing, detailed clinical outcome data was not available, precluding any analysis relevant to a clinical outcome other than change from a positive to a negative PCR-based test.

Clinical safety
While the viral load measurements were noisy, showing multiple reversals between test outcomes, there is greater certainty around other clinical outcomes such as the 4 patients whose condition seriously deteriorated. It is important to stress that all of these belonged to the HCQ mono group, a fact that did not adequately temper the central claims of the original paper regarding the clinical potential of HCQ. Another way to state this would be that, though there is varying degrees of evidence for an effect of HCQ on viral load, it is known with greater certainty that all of the deteriorations occurred in the HCQ treatment group. Greater weight should be placed on this fact, when stating the possible clinical benefits of HCQ in the treatment of Covid-19.

Conclusions
We find that computing the appropriate statistical tests for the effect of HCQ on viral load reduction, yields results that are highly sensitive to the assumptions about which patients are included and how. While this evidence is strong for the assumptions made by the original paper, for more conservative assumptions, the evidence is substantially weaker than originally reported. Performing the same analysis approach to the question of whether AZ improves HCQ treatment, we find moderate statistical evidence for a positive effect. Whether this is a meaningful comparison however is questionable, based on the fact that it is confounded by undisclosed clinical decision making, that lead to some being treated with AZ and others not. To be clear, our analysis does not resolve the uncertainties that follow from the original experimental design, nor does it address concerns that have been raised about the study's data integrity. The only way to resolve these will be via the randomised controlled trials (RCT) that are already underway.