Diagnosing fraudulent baseline data in clinical trials

The first table in many articles reporting results of a randomized clinical trial compares baseline factors across arms. Results that appear inconsistent with chance arouse suspicion and, in at least one case, led to an accusation and subsequent confirmation of data falsification. We confirm theoretically the results of simulation analyses showing that inconsistency with chance is extremely difficult to prove in the absence of any information about correlations between baseline covariates. We offer a reasonable diagnostic to trigger further investigation.


Introduction
In clinical trials, baseline variables are used to: 1) document that the trial recruited its target population, 2) summarize the natural history of the disease in the control arm, and 3) adjust the treatment effect for baseline differences in prognostic factors. Because baseline variables are measured before randomization, any differences between arms are attributable to chance. That is, the null hypothesis of no treatment effect should hold marginally for each baseline variable. For each continuous baseline variable compared using a continuous test statistic, the marginal distribution of its p-value should be uniform if the assumptions underlying the test (e.g., the data are normally distributed) are satisfied.
Clinical trialists sometimes go one step further and assume that p-values should behave like independent uniform deviates. Seeing appreciably more or fewer than the expected one in 20 statistically significant differences at $\alpha = 0.05$ arouses suspicion. In one case uncovered by Bolland et al. [1], that suspicion led to an accusation and later verification of data falsification. A randomized study in dogs uncovered by Carlisle et al. [2] also appeared to show implausibly little variability of baseline covariates across arms. It and other publications by the same authors were retracted.
Betensky and Chiou [3] and Bland [4] use simulation to show that, in practice, p-values for baseline variables in clinical trials frequently do not behave like independent uniforms for several reasons: 1) the assumptions underlying a test may not be satisfied (e.g., skewed data do not fit the normality assumption), 2) the covariate may be binary, in which case even marginal uniformity of p-values does not hold exactly, and 3) many baseline covariates are correlated, so their p-values are also correlated. The authors conclude that interpretation of standard tests of uniformity applied to p-values is problematic. A natural question is whether a legitimate case for data falsification can be made based solely on p-values reported in a baseline table (i.e., with no information presented on correlations between baseline covariates).
We propose a statistical test based on the sum of squared z-scores of baseline covariates that can be used to determine whether further investigation of fraud is warranted. This article complements the simulation results in [3][4] with theoretical results showing the difficulty of actually proving fraud. The problem is that the distribution of the test statistic depends critically on the correlations between the z-scores comparing arms on baseline covariates. Naively treating these z-scores as independent rejects the null hypothesis of no fraud too often when true correlation is present. On the other hand, using the worst-case correlation matrix leads to an extremely conservative test that virtually never rejects the null hypothesis. We characterize correlation matrices that ought to be conservative, but not so conservative as to be useless.

Weighted combination of iid $\chi^2_1$ random variables
Let $P_i$ be the one-tailed p-value for testing whether treatment observations tend to be larger than control observations for the $i$th continuous baseline covariate, $i = 1, \ldots, k$. Assume that the corresponding test statistic has a continuous distribution and its underlying distributional assumptions are satisfied. In the absence of data falsification, the $P_i$ are dependent uniform $(0, 1)$ random variables. Betensky and Chiou [3] evaluate the impact of correlation on chi-squared and Kolmogorov-Smirnov statistics of uniformity of the $P_i$. We consider instead a test specifically targeting too little deviation of actual results from expected results, an indication of possible data falsification.
We begin by transforming the dependent uniforms $P_i$ to dependent standard normals $Z_i$ by $Z_i = F^{-1}(1 - P_i)$, where $F^{-1}$ is the inverse of the standard normal distribution function. Although the $Z_i$ need not have a multivariate normal distribution, they will be approximately multivariate normal if the test statistics are asymptotically sums of iid random variables and the sample size dwarfs the number of baseline covariates. Given that our intent is to show the difficulty of proving data falsification even under the best circumstances, we assume that the $Z_i$ are exactly multivariate normal with mean vector $(0, 0, \ldots, 0)$ and nonnegative-definite covariance matrix $S$ whose diagonal elements are 1. That is, the $Z_i$ are correlated (unless $S$ is the identity matrix) but marginally standard normal.
We will suspect data falsification if the $Z_i$ are too close to their expected value of 0; i.e., there is too much balance between arms. The sign of the $Z_i$ is not important, so we will be suspicious if $Z_i^2$ is very small for multiple baseline variables. A natural way to combine the $Z_i^2$ is through

$$L^2 = \sum_{i=1}^{k} Z_i^2,$$

the squared length of the vector $Z$ of z-statistics for the $k$ baseline covariates. If the $Z_i$ were iid, we could determine whether $L^2$ is "too small" by comparing its value to the $\alpha$th percentile of a chi-squared distribution with $k$ degrees of freedom. But the $Z_i$ are dependent, so we must derive the distribution of $L^2$ under a marginal standard normal assumption, but with arbitrary correlation matrix $S$. We derive this distribution using standard results in linear algebra (see, for example, [5]) as follows.

Distribution of $L^2$
We can find an orthonormal basis $e_1, \ldots, e_k$ of eigenvectors of $S$ and form the orthogonal matrix $\Gamma$ whose columns are $e_1, \ldots, e_k$. Then $\Gamma^T S \Gamma = D$, a diagonal matrix whose diagonal entries $\lambda_1, \ldots, \lambda_k$ are the eigenvalues of $S$, which are all nonnegative real numbers because $S$ is a nonnegative-definite, symmetric matrix. Therefore,

$$S = \Gamma D \Gamma^T. \tag{1}$$

Let $S^{1/2}$ denote the matrix $\Gamma D^{1/2} \Gamma^T$, where $D^{1/2}$ is the diagonal matrix whose diagonal elements are the square roots of those of $D$. Then $Z$ has the same distribution as $S^{1/2} d$, where $d$ is a vector of $k$ iid standard normals, because $\mathrm{cov}(S^{1/2} d) = S^{1/2} I S^{1/2} = S$. It follows that

$$L^2 = \|Z\|^2 \overset{d}{=} (S^{1/2} d)^T (S^{1/2} d) = d^T \Gamma D \Gamma^T d = (\Gamma^T d)^T D (\Gamma^T d) = \sum_{i=1}^{k} \lambda_i Z_i^{*2}, \tag{2}$$

where $Z^* = \Gamma^T d$. Also, $\mathrm{cov}(\Gamma^T d) = \Gamma^T I \Gamma = I$, so the distribution of $Z^*$ is that of $k$ iid standard normals. Eq (2) implies that $L^2$ is a weighted sum of squares of iid standard normals. The weights $\lambda_i$ are the eigenvalues of $S$, which sum to $k$. We interpret these eigenvalues in terms of variances of linear combinations of the $Z_i$ as follows. The vector $c$ maximizing the variance of the linear combination $c \cdot Z$, subject to $\|c\|^2 = 1$, is $c = e^{(k)}$, the eigenvector associated with the largest eigenvalue, $\lambda_{(k)}$, of $S$. We can view $c \cdot Z$ as the projection of $Z$ onto the axis defined by $c$ (Fig 1). The variance of $e^{(k)} \cdot Z$ is $\lambda_{(k)}$. The second largest eigenvalue, $\lambda_{(k-1)}$, is the maximum variance of linear combinations $c \cdot Z$ such that 1) $\|c\|^2 = 1$, and 2) $c$ is orthogonal to $e^{(k)}$. The variance of $e^{(k-1)} \cdot Z$ is $\lambda_{(k-1)}$.
Continuing in this fashion, the smallest eigenvalue, $\lambda_{(1)}$, is the maximum variance of linear combinations $c \cdot Z$ such that 1) $\|c\|^2 = 1$, and 2) $c$ is orthogonal to each of $e^{(k)}, e^{(k-1)}, \ldots, e^{(2)}$.
The variance of $e^{(1)} \cdot Z$ is $\lambda_{(1)}$. If there are a few very large eigenvalues and the rest are close to 0, the $Z_i$ are highly correlated. On the other hand, if the eigenvalues are all of similar size, the $Z_i$ are close to being uncorrelated. Summary:
1. The distribution of $L^2$ under correlation matrix $S$ for $Z$ is a weighted sum of iid chi-squared random variables with 1 degree of freedom.
2. The weights are the eigenvalues of $S$, which sum to $k$.
3. If all eigenvalues are 1, the $Z_i$ are iid and $L^2 \sim \chi^2_k$.
4. If $k - 1$ eigenvalues are 0, the $Z_i$ are maximally correlated and $L^2 \sim k \chi^2_1$.
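Representation (2) also makes the null distribution of $L^2$ easy to simulate for any eigenvalue vector. The sketch below (our illustration, not part of the original analysis) checks summary points 3 and 4.

```python
# Minimal sketch: simulate the null distribution of L^2 using Eq (2),
# L^2 = sum_i lambda_i * Z*_i^2 with the Z*_i iid standard normal.
import numpy as np

rng = np.random.default_rng(0)

def simulate_L2(eigenvalues, n_sim=100_000):
    """Draws from the null distribution of L^2 given the eigenvalues of S."""
    lam = np.asarray(eigenvalues, dtype=float)
    z_star_sq = rng.standard_normal((n_sim, lam.size)) ** 2  # iid chi2_1 variates
    return z_star_sq @ lam                                   # weighted sum, Eq (2)

k = 10
iid_case = simulate_L2(np.ones(k))           # all eigenvalues 1: L^2 ~ chi2_k
max_corr = simulate_L2([k] + [0] * (k - 1))  # one eigenvalue k:  L^2 ~ k * chi2_1

# Both have mean k, but very different spread (quantified in the next section).
print(iid_case.mean(), max_corr.mean())  # both approx. 10
print(np.quantile(iid_case, 0.05))       # approx. 3.94, the chi2_10 5th percentile
print(np.quantile(max_corr, 0.05))       # approx. 0.04, i.e., 10 times the chi2_1 5th percentile
```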

Peakedness as a function of $\boldsymbol{\lambda}$
We have characterized the distribution of the test statistic $L^2$ in the absence of fraud and under an assumed correlation matrix $S$. Next we investigate the peakedness of this distribution. If the distribution is peaked, then a small value of $L^2$ suggests possible fraud, as small values would be quite unlikely otherwise. On the other hand, if the distribution of $L^2$ is very dispersed, then small values of $L^2$ might be common even under the null hypothesis of no fraud.
It is critical, therefore, to determine limits on the peakedness of the null distribution to properly interpret evidence engendered by small values of $L^2$. There are different ways to quantify peakedness, but perhaps the simplest is based on the variance.

Minimum and maximum variance.
We consider next how the variance of $L^2$ depends on the eigenvalues of the correlation matrix of $Z$. This will be important for evaluating how misleading it can be to treat p-values for baseline covariates as if they were independent. The mean and variance of $L^2$ are

$$E(L^2) = \sum_{i=1}^{k} \lambda_i E(Z_i^{*2}) = \sum_{i=1}^{k} \lambda_i = k, \tag{3}$$

$$\mathrm{var}(L^2) = \sum_{i=1}^{k} \lambda_i^2 \,\mathrm{var}(Z_i^{*2}) = 2 \sum_{i=1}^{k} \lambda_i^2. \tag{4}$$

Thus, the mean of $L^2$ does not depend on the correlation matrix $S$ for the baseline z-scores, but the variance of $L^2$ does. For this reason, it is important to find the minimum and maximum values $V_{\min}$ and $V_{\max}$ of $\mathrm{var}(L^2)$ and determine which correlation matrices yield those extreme values. Because the $\lambda_i$ sum to $k$, their mean is $\bar{\lambda} = (1/k) \sum \lambda_i = 1$, and

$$\mathrm{var}(L^2) = 2 \sum_{i=1}^{k} \lambda_i^2 = 2 \left\{ \sum_{i=1}^{k} (\lambda_i - 1)^2 + k \right\} \geq 2k, \tag{5}$$

with equality if and only if every $\lambda_i = 1$. In other words, the independence case, $S = I$, produces the smallest variance, $V_{\min} = 2k$, of $L^2$. In a sense, this produces the greatest peakedness for the null distribution of $L^2$. To see the serious implications of this fact, suppose that the observed value of $L^2$ is small. If we wrongly assume that the z-scores for baseline covariates are independent, then we will be using the minimum possible variance of $L^2$ to determine whether $L^2$ is implausibly small. Consequently, the observed $L^2$ value might be many assumed standard deviations away from its mean value of $k$. The resulting p-value will be tiny, and the level of evidence supporting data falsification will be greatly overstated if the true correlation matrix $S$ is far from the identity matrix corresponding to independent z-scores.
To avoid inflating the probability of erroneously suspecting data falsification, we could instead assume the correlation matrix $S$ yielding the largest variance of $L^2$. It can be shown that the $\boldsymbol{\lambda}$ maximizing the variance of $L^2$ assigns value $k$ to one of the $\lambda_i$ and 0 to the remaining $\lambda_i$s. In that case, $\mathrm{var}(L^2) = 2k^2$. In summary, the smallest and largest values of $\mathrm{var}(L^2)$ are:

$$V_{\min} = 2k, \text{ attained at } \boldsymbol{\lambda} = (1, \ldots, 1), \tag{6}$$

$$V_{\max} = 2k^2, \text{ attained at } \boldsymbol{\lambda} = (0, \ldots, 0, k). \tag{7}$$

We will see that using $V_{\max}$ is almost always too conservative to be useful. Therefore, we want to select conservative values of $V$ that are not so conservative that they are useless. To see how to do this, notice that the vectors $(1, \ldots, 1)$ and $(0, \ldots, 0, k)$ in (6) and (7) are at opposite ends of a certain spectrum. Imagine two different communities, each with $k$ luxury cars divided among $k$ people. In one community, everyone has 1 luxury car, and in the other community, one person has all $k$ luxury cars. Vectors $(1, \ldots, 1)$ and $(0, \ldots, 0, k)$ correspond to these least and most polarized distributions. This concept can be formalized as follows. A vector $y$ is said to majorize another vector $x$, written $x \preceq y$, if the ordered values $x_{(1)} \leq \ldots \leq x_{(k)}$ and $y_{(1)} \leq \ldots \leq y_{(k)}$ satisfy

$$\sum_{i=j}^{k} x_{(i)} \leq \sum_{i=j}^{k} y_{(i)} \quad \text{for } j = 1, \ldots, k,$$

with equality when $j = 1$. In other words, $y$ is more polarized (the rich are richer and the poor are poorer) than $x$. The smallest and largest vectors, in terms of majorization, among nonnegative vectors with sum $k$ are $(1, \ldots, 1)$ and $(0, \ldots, 0, k)$, respectively. The generalization of the ordering of variances in (6) and (7) is as follows.
Theorem 2.1. $V = \mathrm{var}(L^2)$ increases as the vector $\boldsymbol{\lambda}$ of eigenvalues of $S$ becomes more polarized (i.e., increases in the majorization ordering).
The proof follows from D.2 on page 101 of [6]. Therefore, computing the null distribution of $L^2$ assuming that $\boldsymbol{\lambda}$ is one of the larger (although not the largest possible) vectors in the majorization ordering should be conservative, but not prohibitively so.
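Because $\mathrm{var}(L^2) = 2 \sum \lambda_i^2$ by Eq (4), Theorem 2.1 is easy to illustrate numerically. The sketch below (ours; the chain of eigenvalue vectors is hypothetical) evaluates the variance along a chain that increases in the majorization ordering.

```python
# Sketch: var(L^2) = 2 * sum(lambda_i^2) along a majorization-increasing
# chain of eigenvalue vectors, each summing to k = 4.
chain = [
    (1.0, 1.0, 1.0, 1.0),  # least polarized: independence, var = 2k = 8
    (0.5, 0.5, 1.5, 1.5),
    (0.0, 1.0, 1.0, 2.0),
    (0.0, 0.0, 1.0, 3.0),
    (0.0, 0.0, 0.0, 4.0),  # most polarized: perfect correlation, var = 2k^2 = 32
]
for lam in chain:
    print(lam, 2 * sum(x * x for x in lam))  # prints 8, 10, 12, 20, 32
```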
Our treatment in this section implicitly assumed that the distribution of $L^2$ is approximately normal for large $k$, which is reasonable for many $\boldsymbol{\lambda}$ because $L^2$ is a linear combination of iid chi-squared random variables with 1 degree of freedom. However, for certain extreme vectors $\boldsymbol{\lambda}$ such as $(0, \ldots, 0, k)$, $L^2$ is not approximately normal. We would like to show, without invoking asymptotic normality, that the distribution of $L^2$ has fatter tails as $\boldsymbol{\lambda}$ becomes more polarized (increases in the majorization ordering). We defer discussion of this technical and difficult topic to the appendix.

Simple $S$s allowing exact calculation
For any given critical value $C$, we can compute $P(L^2 \leq C)$ analytically, without a normal approximation, for certain classes of correlation matrices $S$. Equivalently, we can think in terms of the eigenvalues of $S$, which, as we have seen, can be interpreted as variances of projections of $Z$ onto the directions defined by the eigenvectors of $S$. Suppose the total variance is spread equally among a small number of directions. Then all but a few eigenvalues are 0, and the remainder all have the same value. For instance, with only 1 direction, all but one eigenvalue is 0, and the nonzero eigenvalue is $k$. This is the most extreme correlation matrix, in which all $Z_i^2$ are identical. More generally, if all variability is focused equally in $j$ directions, then each of the $j$ nonzero eigenvalues has value $k/j$. In that case, expression (2) becomes $k(X_j / j)$, where $X_j$ has a chi-squared distribution with $j$ degrees of freedom. The probability of a type 1 error is

$$P(L^2 \leq C) = P\{k(X_j / j) \leq C\} = G_j(C/k), \tag{8}$$

where $G_j$ is the distribution function of $1/j$ times a chi-squared random variable with $j$ degrees of freedom. The appendix shows that $G_j$ has fatter tails as $j$ decreases. Therefore, the distribution of $L^2$ has fatter tails if the total variability of $Z$ is spread equally over a smaller number of directions.
Another relatively simple class of correlation matrices has the same correlation $\rho$ for all pairs of z-scores. It can be shown that all but one eigenvalue is $1 - \rho$, and the remaining eigenvalue is $1 + (k - 1)\rho$. In that case, expression (2) can be written as

$$L^2 = (1 - \rho) X_{k-1} + \{1 + (k - 1)\rho\} X_1, \tag{9}$$

where $X_{k-1}$ and $X_1$ are independent chi-squared random variables with $k - 1$ and 1 degree of freedom, respectively. Let $H_j$ and $h_j$ denote the chi-squared distribution and density function with $j$ degrees of freedom, $j = 1, \ldots, k$. From (9), the type 1 error rate is

$$P(L^2 \leq C) = \int_0^{C/\{1 + (k-1)\rho\}} H_{k-1}\!\left[ \frac{C - \{1 + (k-1)\rho\} x}{1 - \rho} \right] h_1(x)\, dx. \tag{10}$$

Table 1 uses Eqs (8) and (10) to compute the inflation of the type 1 error rate when one erroneously assumes that the z-scores comparing baseline covariates across arms are independent, when the true $S$ is either equicorrelated or has all variability focused in a few directions. For the rows labeled by directions, the true $S$ corresponds to total variability of z-scores divided equally among 1, 2, or 3 directions. For rows labeled by $\rho$, the z-scores for baseline comparisons all have the same pairwise correlation $\rho$. For example, the "1 direction" row shows that if critical value $C$ is computed assuming the $Z_i$ are independent, but the truth is that all variability of the $Z_i$ is focused in only 1 direction, the actual type 1 error rate is 47.0 percent or 62.3 percent for $k = 10$ or $k = 100$, respectively. On the other hand, if the true $S$ has the same pairwise correlation $\rho = 0.50$ for all pairs, the true type 1 error rate is 15.1 percent or 54.1 percent for $k = 10$ or $k = 100$, respectively. The probability of falsely becoming suspicious increases as the $Z_i$ become more correlated.

Table 1. Probability of suspecting fraud (that is, $L^2 \leq C$) when $C$ is determined assuming the $Z_i$ are independent. The true $S$ either has all variability focused equally in 1, 2, or 3 directions, or each off-diagonal element has value $\rho$.
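As an illustration of how Eqs (8) and (10) generate these entries, the following sketch (ours; it relies on scipy's chi-squared distribution and numerical quadrature) recomputes three of the Table 1 values.

```python
# Sketch of the Table 1 calculation: the critical value C assumes independent
# z-scores, but the true correlation structure is either "all variability in
# j directions" (Eq 8) or "common pairwise correlation rho" (Eq 10).
from scipy.stats import chi2
from scipy.integrate import quad

def alpha_directions(k, j, alpha=0.05):
    """Eq (8): true P(L^2 <= C) when L^2 = k * chi2_j / j."""
    C = chi2.ppf(alpha, k)
    return chi2.cdf(j * C / k, j)

def alpha_equicorrelated(k, rho, alpha=0.05):
    """Eq (10): true P(L^2 <= C) when L^2 = (1-rho)*X_{k-1} + (1+(k-1)rho)*X_1."""
    C = chi2.ppf(alpha, k)
    a = 1 + (k - 1) * rho
    integrand = lambda x: chi2.cdf((C - a * x) / (1 - rho), k - 1) * chi2.pdf(x, 1)
    return quad(integrand, 0, C / a)[0]

print(alpha_directions(10, 1))        # approx. 0.470 (Table 1, "1 direction", k = 10)
print(alpha_directions(100, 1))       # approx. 0.623 (k = 100)
print(alpha_equicorrelated(10, 0.5))  # approx. 0.151 (rho = 0.50, k = 10)
```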
On the other hand, if one assumes perfect correlation and sets the critical value using $V_{\max}$, the test becomes extraordinarily conservative. Table 2 shows that when $k = 25$, the actual type 1 error rate of the $V_{\max}$ test when the z-scores have common correlation $\rho = 0.75$ is $7.9 \times 10^{-20}$ instead of 0.05. In other words, if we want to protect against the most drastic correlation matrix, $S_{ij} = 1$ for all $i$ and $j$, the test becomes incredibly conservative even if the true correlation matrix still has unrealistically high correlation. Likewise, even if the true correlation matrix has all variance focused in only 3 directions, the $V_{\max}$ test has an ultraconservative type 1 error rate of 0.0003. Remember that the $L^2$ test is being used as a diagnostic to see whether further investigation is warranted. Further investigation would provide an estimate of $S$ that could be used to compute the true distribution of $L^2$, resulting in a much more accurate test. Thus, a reasonable option for the diagnostic test is to make a conservative assumption, such as that all correlations are 0.75 or all variability is focused equally in only 3 directions. This is almost guaranteed to overstate the degree of correlation in a real clinical trial.

Table 2. Probability of suspecting fraud (that is, $L^2 \leq C$) when $C$ is determined assuming $S_{ij} = 1$ for all $i$, $j$ (i.e., perfect correlation). The true $S$ either has all variability focused equally in 1, 2, or 3 directions, or each off-diagonal element has value $\rho$.
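A similar sketch illustrates the conservatism reported in Table 2. We show only the 3-directions entry; the equicorrelated entry, on the order of $10^{-20}$, requires higher-precision integration than plain quadrature.

```python
# Sketch of the Table 2 calculation: the critical value now assumes perfect
# correlation (L^2 ~ k * chi2_1), and we evaluate the true type 1 error rate
# when all variability is actually focused equally in 3 directions (Eq 8).
from scipy.stats import chi2

k = 25
C = k * chi2.ppf(0.05, 1)      # V_max critical value: 5th percentile of k * chi2_1

j = 3                          # true S: variability spread equally over 3 directions
print(chi2.cdf(j * C / k, j))  # approx. 0.0003, far below the nominal 0.05
```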

Example.
Fujii et al. [7] randomized 24 dogs to one of three doses of midazolam to evaluate the effect of midazolam on contractility of the diaphragm. Table 3 shows baseline means and standard deviations in each of the three arms for each of 8 continuous variables. There appears to be little variability across arms. We apply our test to dose groups 1 and 2. For each variable, compute a one-tailed p-value $P_i$ using an unpaired t-statistic with the alternative hypothesis that group 2 has a higher mean than group 1. Then convert each p-value to a z-score by $Z_i = F^{-1}(1 - P_i)$, and compute the test statistic $L^2 = \sum_{i=1}^{8} Z_i^2$. We find that $L^2 = 0.2556$. If we erroneously assume independence of baseline covariates, and therefore of z-statistics, the p-value is $P(\chi^2_8 \leq 0.2556) \approx 10^{-5}$. This overstates the strength of evidence for fraud. On the other hand, assuming perfect correlation between z-scores for baseline covariates almost certainly understates the evidence for fraud. That p-value is $P(8 \chi^2_1 \leq 0.2556) \approx 0.14$. We feel confident that a real randomized experiment would not result in all correlations being 0.9. Therefore, making the assumption of a common $\rho$ of 0.9 should still be highly conservative. The p-value using (10) with $\rho = 0.9$ is 0.0055. In other words, even under what we feel is an unrealistically large degree of correlation, namely 0.9 for all pairs, the evidence for fraud certainly seems sufficient to warrant further investigation.
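The three p-value calculations in this example can be sketched as follows, taking the reported $L^2 = 0.2556$ as given (the raw data of Table 3 are not reproduced here).

```python
# Sketch of the example's three p-values for L^2 = 0.2556 with k = 8.
from scipy.stats import chi2
from scipy.integrate import quad

L2, k = 0.2556, 8

# 1) Independence assumption: L^2 ~ chi2_8 (overstates the evidence).
print(chi2.cdf(L2, k))  # approx. 1e-5

# 2) Perfect correlation: L^2 ~ 8 * chi2_1 (understates the evidence).
print(chi2.cdf(L2 / k, 1))  # approx. 0.14

# 3) Common correlation rho = 0.9, via Eq (10): a conservative middle ground.
rho = 0.9
a = 1 + (k - 1) * rho
integrand = lambda x: chi2.cdf((L2 - a * x) / (1 - rho), k - 1) * chi2.pdf(x, 1)
print(quad(integrand, 0, L2 / a)[0])  # approx. 0.0055
```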

Number of significant z-scores
We have focused on $L^2$ as a test statistic for detecting cheating, but other goodness-of-fit statistics, such as those considered by Betensky and Chiou [3] and Bland [4], behave similarly. A particularly simple statistic is the number $J$ of statistically significant z-scores. We might be suspicious if the number of continuous baseline covariates is large and none results in a statistically significant difference between arms. Suppose these z-scores are equicorrelated with nonnegative correlation $\rho$. Then $Z_1, \ldots, Z_k$ have the same distribution as $X + \epsilon_1, \ldots, X + \epsilon_k$, where $X$ and the $\epsilon_i$ are mutually independent normal random variables with zero means and variances $\rho$ and $1 - \rho$, respectively. Conditional on $X$, the $Z_i$ are iid, so each is statistically significant with the same conditional probability $p(X)$, and $J$ given $X$ has a binomial distribution with parameters $k$ and $p(X)$. The specific distribution function $F(p)$ for the random variable $P = p(X)$ is unimportant. The important fact is that $P$ has mean $\alpha$, as the following calculation shows:

$$E(P) = E\{p(X)\} = E\{P(Z_1 \text{ significant} \mid X)\} = P(Z_1 \text{ significant}) = \alpha.$$

Accordingly, the distribution of $J$ is a mixed binomial:

$$P(J = j) = \int_0^1 \binom{k}{j} p^j (1 - p)^{k - j} f(p)\, dp, \quad j = 0, \ldots, k, \tag{12}$$

where $f(p)$ is the density function corresponding to distribution function $F(p)$. Note that $\int p f(p)\, dp = \alpha$.
Under independence, the number of significant $Z_i$ has an ordinary binomial distribution with parameters $k$ and $\alpha$. Let $J$ and $J_0$ denote random variables from the mixed binomial (12) and the unmixed binomial bin$(k, \alpha)$, respectively. To see that extreme results are more likely for $J$ than for $J_0$, note that by Jensen's inequality, for $k > 1$,

$$P(J = 0) = E\{(1 - P)^k\} \geq \{1 - E(P)\}^k = (1 - \alpha)^k = P(J_0 = 0).$$

More generally, Shaked [8] has shown that a mixed binomial has fatter tails than an ordinary binomial with the same mean. This explains the common phenomenon of observing what appear to be too few or too many statistically significant baseline differences in clinical trials. Therefore, if one falsely assumes that the $Z_i$ are independent, the chance of falsely suspecting fraud will be inflated.
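A short simulation illustrates the inflation. The common correlation $\rho = 0.5$ and two-sided $\alpha = 0.05$ tests below are hypothetical choices for illustration.

```python
# Sketch: with equicorrelated z-scores, the count J of significant baseline
# tests is a mixed binomial whose tails are fatter than those of bin(k, alpha).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
k, rho, alpha, n_sim = 20, 0.5, 0.05, 200_000

# Z_i = X + eps_i with var(X) = rho and var(eps_i) = 1 - rho.
X = rng.normal(0.0, np.sqrt(rho), size=(n_sim, 1))
eps = rng.normal(0.0, np.sqrt(1 - rho), size=(n_sim, k))
J = (np.abs(X + eps) > norm.ppf(1 - alpha / 2)).sum(axis=1)

print((J == 0).mean())   # noticeably larger than the independence value below
print((1 - alpha) ** k)  # P(J_0 = 0) = 0.358 under independence
```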

Discussion
We have proposed a diagnostic for detecting suspiciously low between-arm variability in baseline covariates in clinical trials. The test statistic $L^2$, the squared length of the vector $Z$ of z-scores measuring balance in baseline covariates, has a distribution that depends on the correlation matrix $S$ of $Z$ only through its eigenvalues. We confirm analytically for $L^2$ what is demonstrated through simulation in [3][4] for similar goodness-of-fit tests applied to baseline covariates in clinical trials when no information about correlations is available. Assuming independence between covariates (and, therefore, between the $Z_i$) results in an unacceptably high probability of falsely suspecting fraud. In fact, the distribution of $L^2$ has the thinnest tails when one falsely assumes that z-scores for baseline covariates are independent, and the fattest tails when one assumes the most extreme possible correlation. We draw two conclusions: 1) one should never conclude fraud solely because $L^2$ is unusually small under the independence assumption, and 2) to feel confident that the $Z_i$ are too small to have occurred by chance, $L^2$ must be unusually small even assuming unrealistically high correlation. Assuming perfect correlation produces a test that virtually never triggers further investigation. Therefore, we suggest using a practical upper bound on the correlation, such as all correlations equal to 0.75 or all variability focused equally in only 3 directions. The final verdict will almost always be based on the totality of evidence. The case made by Bolland et al. [1] was based on numerous trials by the same authors that contained warning signs such as suspiciously fast enrollment and few deaths and dropouts despite recruiting older patients with substantial comorbidity. Our test statistic is a useful diagnostic that can be used in conjunction with other evidence to bolster the case for data falsification.

Appendix A: Fat-tailed distributions
The argument in Section 2.3 that $L^2$ should have fatter tails as $\boldsymbol{\lambda}$ becomes more polarized was based on approximate normality of $L^2$ for large $k$. But if $\boldsymbol{\lambda} = (0, \ldots, 0, k)$, $L^2$ has the distribution of $k$ times a chi-squared variate with 1 degree of freedom, which is not normal. This section addresses whether $L^2$ has fatter tails as $\boldsymbol{\lambda}$ becomes more polarized even without the approximate normality assumption.
The first problem lies in defining "fatter tails." This is easy for normal distributions or other distributions symmetric about their mean: the same mean and larger variance implies fatter tails. For asymmetric distributions, such as those of linear combinations of chi-squared random variables, we must use a different definition. One possibility is the following.
Definition A.1. Distribution function $F_2$ has fatter tails than distribution function $F_1$ if there is a point $x^*$ such that $F_2(x) \geq F_1(x)$ for all $x \leq x^*$ and $1 - F_2(x) \geq 1 - F_1(x)$ for all $x > x^*$. In other words, the left tail $F_2(x)$ is at least as large as the left tail $F_1(x)$ for all $x \leq x^*$, and the right tail $1 - F_2(x)$ is at least as large as the right tail $1 - F_1(x)$ for all $x > x^*$. Another way of expressing this fact is that if $F_1 - F_2$ has any sign changes, then there is exactly one, and the sequence of signs is $-, +$ as $x$ increases.
Bock et al. [9] conjectured, but did not prove, the following.

Conjecture A.1 (Bock et al. [9]). If $Z^*_1, \ldots, Z^*_k$ are iid standard normals and $\boldsymbol{\lambda}_1 \preceq \boldsymbol{\lambda}_2$, then the distribution of $\boldsymbol{\lambda}_2 \cdot Z^{*2}$ has fatter tails, in the sense of Definition A.1, than the distribution of $\boldsymbol{\lambda}_1 \cdot Z^{*2}$.

Theorem 1 of Roosta-Khorasani and Szekely [10] is closely related, but it shows only that large values are more likely for $\boldsymbol{\lambda}_2 \cdot Z^{*2}$ than for $\boldsymbol{\lambda}_1 \cdot Z^{*2}$. We are interested in the opposite tail as well, namely that very small values are more likely for $\boldsymbol{\lambda}_2 \cdot Z^{*2}$ than for $\boldsymbol{\lambda}_1 \cdot Z^{*2}$.
Although we have been unable to prove Conjecture A.1 in complete generality, empirical evidence and heuristic arguments support its veracity. For example, we investigated the $k = 2$ case using an extensive grid of possible values of $\boldsymbol{\lambda}_1$ and $\boldsymbol{\lambda}_2$ and computing the distributions of $\boldsymbol{\lambda}_1 \cdot Z^{*2}$ and $\boldsymbol{\lambda}_2 \cdot Z^{*2}$ through numerical integration. For $k = 3$, we repeatedly generated random vectors $\boldsymbol{\lambda}_1$ and $\boldsymbol{\lambda}_2$ from a simplex in such a way that $\boldsymbol{\lambda}_1 \preceq \boldsymbol{\lambda}_2$, and used simulation to compute the distributions of $\boldsymbol{\lambda}_1 \cdot Z^{*2}$ and $\boldsymbol{\lambda}_2 \cdot Z^{*2}$. Further details are available from the authors.
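The sketch below conveys the flavor of such a check for a single hypothetical majorization-ordered pair with $k = 3$ (ours, not the authors' grid search).

```python
# Sketch: compare lower-tail probabilities of lambda1 . Z*^2 and
# lambda2 . Z*^2 for one hypothetical pair with lambda1 majorized by
# lambda2 (both sum to k = 3). Conjecture A.1 predicts the second
# probability is at least the first at every cutoff.
import numpy as np

rng = np.random.default_rng(2)
lam1 = np.array([0.5, 1.0, 1.5])    # less polarized
lam2 = np.array([0.25, 0.25, 2.5])  # more polarized: lam1 <= lam2 in majorization

z_sq = rng.standard_normal((500_000, 3)) ** 2  # iid chi2_1 triples
L2_1, L2_2 = z_sq @ lam1, z_sq @ lam2

for c in (0.1, 0.25, 0.5):
    print(c, (L2_1 <= c).mean(), (L2_2 <= c).mean())
```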
One special case of Conjecture A.1 arises when all of the variability of $Z$ is concentrated equally in each of $j$ directions. In that case, the distribution of $L^2$ is that of $(k/j)\, v \cdot Z^{*2}$, where $v$ contains $j$ ones and $k - j$ zeros. Since $v \cdot Z^{*2}$ has a chi-squared distribution with $j$ degrees of freedom, $L^2$ is distributed as $k(\chi^2_j / j)$, where $\chi^2_j$ denotes a chi-squared random variable with $j$ degrees of freedom. Thus, Conjecture A.1 says that $\chi^2_j / j$ has fatter tails as $j$ decreases. Although we are unable to prove Conjecture A.1 in general, we prove this special case at the end of the appendix.
Another important special case of Conjecture A.1 is when all pairs $(Z_i, Z_j)$ have the same correlation $\rho \geq 0$. In that case, the eigenvalue vector is $(1 - \rho, \ldots, 1 - \rho, 1 + (k - 1)\rho)$, which increases in the majorization ordering as $\rho$ increases. The conjecture implies that $L^2$ has fatter tails as $\rho$ increases. Eq (9) shows that $L^2/k = \tau (\chi^2_1 / 1) + (1 - \tau)\{\chi^2_{k-1} / (k - 1)\}$, where $\tau = \{1 + (k - 1)\rho\}/k$. Therefore, Conjecture A.1 applied to the special case of equicorrelated $Z_i$s is equivalent to $\tau (\chi^2_1 / 1) + (1 - \tau)\{\chi^2_{k-1} / (k - 1)\}$ having fatter tails as $\tau$ increases from $1/k$ to 1. It should be noted that one could define fatness of tails of a distribution function in ways other than Definition A.1. For example, suppose that $E\{\psi(X)\} \leq E\{\psi(Y)\}$ for every convex function $\psi$ such that the expectations exist. Then not only is the variance of $X$ no greater than that of $Y$, but the same is true for the fourth central moment, the sixth central moment, etc. This is one way of formulating the idea that the distribution function of $Y$ has fatter tails than that of $X$.
It is very easy to prove, using the alternative definition above, that assuming the $Z_i$ are independent produces the thinnest-tailed distribution of $L^2$, and assuming maximal correlation produces the fattest-tailed distribution. This is demonstrated in Theorem A.2.

Theorem A.2. For every convex function $\psi$ for which the expectations exist,

$$E\{\psi(\chi^2_k)\} \leq E\{\psi(L^2)\} \leq E\{\psi(k \chi^2_1)\}.$$

The right inequality follows from the definition of convex function: for any $u_1, \ldots, u_k$ and nonnegative $w_1, \ldots, w_k$ with $\sum w_i = 1$, $\psi(\sum w_i u_i) \leq \sum w_i \psi(u_i)$. Set $w_i = \lambda_i / k$, $i = 1, \ldots, k$. Then

$$\psi(L^2) = \psi\left( \sum_{i=1}^{k} \frac{\lambda_i}{k}\, k Z_i^{*2} \right) \leq \sum_{i=1}^{k} \frac{\lambda_i}{k}\, \psi(k Z_i^{*2}),$$

and taking expectations gives $E\{\psi(L^2)\} \leq E\{\psi(k \chi^2_1)\}$ because each $k Z_i^{*2}$ is distributed as $k \chi^2_1$. The left inequality follows similarly: $(1, \ldots, 1)$ is the average of the permutations of $\boldsymbol{\lambda}$, so convexity gives $\psi(\sum_i Z_i^{*2}) \leq (1/k!) \sum_{\pi} \psi(\sum_i \lambda_{\pi(i)} Z_i^{*2})$, and each permuted sum has the same distribution as $L^2$.

We conclude by proving the special case of Conjecture A.1 promised earlier: $\chi^2_j / j$ has fatter tails, in the sense of Definition A.1, as $j$ decreases. Let $f_j$ denote the density of $k(\chi^2_j / j)$, and let $h(x) = f_{j-1}(x)/f_j(x)$, which is a positive constant times $g(x) = \exp\{x/(2k)\}/x^{1/2}$. The derivative of $g(x)$ is negative for $0 < x < k$ and positive for $x > k$. Thus, $g(x)$ decreases for $x < k$ and increases for $x > k$. Moreover, $g(x)$ tends to $+\infty$ as either $x \downarrow 0$ or $x \to \infty$. These facts also hold for $h(x)$, which is just a positive constant times $g(x)$. It follows that the number $m_1$ of $x$ such that $h(x) = 1$ is either 0, 1, or 2. But $m_1$ cannot be 0 or 1, because that would imply that $f_{j-1}(x) > f_j(x)$ for all $x$, or for all but one $x$, contradicting the fact that $f_{j-1}(x)$ and $f_j(x)$ both integrate to 1. Therefore, $h(x) = 1$ for exactly two values, $x = x_1$ and $x = x_2$, $x_1 < x_2$. Let $F_{j-1}(x)$ and $F_j(x)$ be the distribution functions corresponding to the densities $f_{j-1}(x)$ and $f_j(x)$. Because $f_{j-1}(x) > f_j(x)$ for $x < x_1$ and $x > x_2$, $F_{j-1}(x) > F_j(x)$ for $x < x_1$ and $F_{j-1}(x) < F_j(x)$ for $x > x_2$. Because $F_{j-1}$ and $F_j$ are continuous, there must be a point $x^*$ for which $F_{j-1}(x^*) = F_j(x^*)$. Necessarily, $x_1 < x^* < x_2$. But $f_{j-1}(x) < f_j(x)$ for $x \in (x_1, x_2)$, so if $F_{j-1}(x^*) = F_j(x^*)$, then $F_{j-1}(x) < F_j(x)$ for all $x \in (x^*, x_2)$. But we have already established that $F_{j-1}(x) < F_j(x)$ for $x > x_2$, so $F_{j-1}(x) < F_j(x)$ for $x > x^*$. Putting these facts together, we have established that

$$F_{j-1}(x) \geq F_j(x) \text{ for } x \leq x^* \quad \text{and} \quad 1 - F_{j-1}(x) \geq 1 - F_j(x) \text{ for } x > x^*;$$

that is, the distribution of $k\{\chi^2_{j-1}/(j-1)\}$ has fatter tails than that of $k(\chi^2_j / j)$ in the sense of Definition A.1. This completes the proof.

Author Contributions
Methodology: Michael A. Proschan, Pamela A. Shaw.