Testing for goodness rather than lack of fit of an X–chromosomal SNP to the Hardy-Weinberg model

The problem of checking whether the genotype distribution obtained for a diallelic marker is compatible with the Hardy-Weinberg equilibrium (HWE) condition also arises for loci on the X chromosome. In this case, the possible genotypes depend on the sex of the individual: for females, the genotype distribution is trinomial, as for an autosomal locus, whereas a binomial proportion is observed for males. As in genetic association studies with autosomal SNPs, interest is typically in establishing approximate compatibility of the observed genotype frequencies with HWE. This requires replacing traditional methods tailored for detecting lack of fit to the model with an equivalence testing procedure derived by treating approximate compatibility with the model as the alternative hypothesis. The test constructed here is based on an upper confidence bound for a simple-to-interpret combined measure of distance between the true and the HWE-conforming genotype distributions in female and male subjects. A particular focus of the paper is on the derivation of the asymptotic distribution of the test statistic under null alternatives, which is not of the usual Gaussian form. A closed sample size formula is also provided and shown to behave satisfactorily in terms of the approximation error.


Introduction
The Hardy-Weinberg law is one of the key concepts in genetic epidemiology [1]. Departure from Hardy-Weinberg equilibrium (HWE) can be caused by factors such as inbreeding, assortative mating, selection or migration [2]. The effect of these factors on HWE can be expected to be small in most human populations, although selection may play an important role in infectious diseases [2]. Another reason is population stratification, which causes a deficit of heterozygotes. Population stratification can be controlled for by methods such as genomic control (for a detailed overview see, e.g., [3]). The presence of copy number variations generally leads to an excess of heterozygotes. Finally, deviation from HWE may simply be caused by genotyping errors. We have previously argued that deviations from HWE should be investigated only in controls for case-control studies and in the entire cohort in cohort studies [2]. For autosomal loci, several "how to guides" have been published for assessing deviation from HWE [4,5]. These approaches are commonly used as part of the regular quality control in genome-wide association studies and meta-analyses.
For testing deviation from HWE for X-chromosomal markers, no such guidelines are available although testing for HWE is used for quality control on the X-chromosome as well [6]. The complicating factor for assessing deviation from HWE is that males are hemizygous, thus have only one allele on X-chromosomal markers outside of the pseudoautosomal regions, while females have two alleles as on autosomes. Some software packages therefore ignore male subjects and conduct a test for HWE in females only [5]. However, this reduces the sample size and results in a loss of power. Furthermore, an X-chromosomal marker can only be in HWE if the allele frequencies are equal in males and females. If males are neglected, deviation from HWE cannot be thoroughly investigated. Other software packages ignore the difference between autosomal and X-chromosomal markers (see the genetics package in R). As described by [5], these tests are potentially misleading due to coding the genotype of a hemizygous male either as AA or aa as in the standard data format.
The problem of HWE testing on the X chromosome has caught attention quite recently. For example, Graffelman and Weir [7] proposed four frequentist tests for diallelic markers using data from both males and females. An implementation of these procedures is available in an R package called 'Hardy-Weinberg' [8]. A Bayesian HWE testing procedure has also been proposed [9]. Other tests are described in the work by Wang et al. [10] and You et al. [11], and an extension to multiallelic markers by Graffelman and Weir [12]. Zheng et al. investigated the impact of deviations from HWE on the properties of association tests for X-chromosomal SNPs [13].
The usual strategy to protect oneself against the distorting effects entailed in violations of the HWE condition consists of filtering out markers that do not conform with HWE prior to the conduct of genetic association tests. For autosomal SNPs, i.e., diallelic genetic markers located on an autosome, the traditional statistical procedure to assess HWE is the standard Pearson χ²-test. Unfortunately, any testing procedure of this type fails to serve the purpose of confirming the compatibility of the marker with the model. Actually, the conventional χ²-test is tailored for establishing lack rather than goodness of fit, since the statement that the distribution underlying the data is in agreement with HWE plays the role of the null rather than the alternative hypothesis. A significant test result thus indicates incompatibility of the observed data with the model. A well-established way around this basic difficulty inherent in the logic of significance testing is to reformulate and solve the problem of HWE assessment as what is called in biostatistics a problem of equivalence testing (for a systematic treatise on this still fairly fast developing area in statistical methodology, see [14]).
This change of the basic inferential paradigm has been successfully exploited by [15,16] and [17] for the case of autosomal SNPs. The equivalence tests derived there are tests for goodness rather than lack of fit, in the sense that they allow one to control the risk of erroneously deciding in favor of the hypothesis that the populations underlying the samples under evaluation are "essentially" compatible with the HWE model. In this phrase, "essentially" means that the deviations between model and truth, if existing at all, are small enough to be treated as marginal and thus irrelevant. Inverting a traditional lack-of-fit test, i.e., deciding for the new alternative hypothesis of equivalence between the actual and an HWE-conforming population if the test yields a p-value above the conventional significance level of 5%, fails to provide control over the type-I error. In the equivalence setting, the latter consists in incorrectly rejecting the null hypothesis of relevant deviations from the model. The actual size of this risk depends strongly on the order of magnitude of the sample size: for small sample sizes, it can become as large as 95%, whereas for huge sample sizes it approaches zero so that the procedure becomes extremely conservative. The goodness-of-fit test for HWE at autosomal markers constructed by [15] is an exact, uniformly most powerful unbiased (UMPU) procedure based on the conditional distribution of the observed number $X_2$, say, of heterozygotes given the total number $S$ of A-alleles (with A denoting the allele of minor frequency). It rejects if $X_2$ falls in the interior of some interval whose endpoints depend in a fairly complicated manner on the value of $S$ and the significance level (defined as the maximum acceptable probability of incorrectly rejecting the null hypothesis of relevant disequilibrium).
In a subsequent paper [17], we were able to show that without substantial loss of power, the exact UMPU test can be replaced with a computationally much simpler approach based on confidence intervals for a function of the population genotype frequencies providing a natural measure of the amount of disequilibrium (the definition of this parametric function will be made precise below in the first subsection of Materials and Methods).
The aim of the present paper is to extend the confidence-limit-based approach to testing for approximate compatibility of the distribution of a given SNP with the HWE model to the case of X-chromosomal loci. The Materials and Methods (M&M) section, which is the core part of the paper, goes far beyond the description of routine methods of data analysis. It focuses on a rigorous derivation of the newly proposed testing procedure and on the formal machinery required for investigating basic properties of the method and for planning genetic association studies in which the compatibility of sex-linked markers with HWE has to be ensured. It starts with a formally precise description of the equivalence testing approach to HWE assessment for diallelic markers and an extension of the hypothesis formulation to the case that the population under assessment consists of a mixture of allele pairs and single alleles. The proposed way of measuring the amount of disequilibrium jointly for females and males is to define for the two subpopulations separate measures $\Delta_f$ and $\Delta_m$ of the distance of the underlying distribution from the model and to combine these by calculating the ordinary Euclidean distance of $(\Delta_f, \Delta_m)$ from the origin of the plane. In Subsection 2 of M&M, we study the asymptotic distribution of the natural estimator of the Euclidean distance of $(\Delta_f, \Delta_m)$ from 0, obtained by substituting the observed relative genotype and allele frequencies for the theoretical frequencies $(\pi_1, \pi_2, \pi_3)$ and $p_Y$, respectively. This provides the mathematical basis for the computation of an upper confidence bound to $\Delta := \sqrt{\Delta_f^2 + \Delta_m^2}$ and for the corresponding testing procedure, which decides in favor of goodness of fit if this bound falls below the prespecified equivalence margin.
In Subsection 3 of M&M, we derive an expression for the exact rejection probability of the goodness-of-fit test under any parameter configuration and establish approximate formulas for the power against different types of alternatives, focusing on so-called null alternatives specifying perfect coincidence with the HWE model. In the latter case, which is the most interesting one for applications, the asymptotic distribution of the test statistic is no longer Gaussian and must be established separately by means of a non-standard construction. The Results section starts with an investigation of the level and power of the goodness-of-fit test, which is inherently an asymptotic procedure, in finite samples. Subsequently, the new method is compared to the combined χ²-test for lack of fit proposed by [7], both for real data taken from a GWAS on venous thrombosis and for simulated data sets. The assessment of the approximate methods of power calculation and the associated sample-size formulas for the new test is again done by means of exact numerical computation.

Mathematical notation and formulation of the testing problem
The first goodness-of-fit testing procedure made available for purposes of HWE assessment in genetic association studies involving diallelic markers ([15]) was constructed by solving the equivalence problem
$$H:\ \theta \le 4/(1+\delta^\circ)\ \text{ or }\ \theta \ge 4(1+\delta^\circ) \quad\text{versus}\quad K:\ 4/(1+\delta^\circ) < \theta < 4(1+\delta^\circ). \tag{1}$$
In (1), $\delta^\circ$ stands for a fixed positive constant, to be chosen a priori, defining the equivalence range for the function
$$\theta = \frac{\pi_2^2}{\pi_1 \pi_3} \tag{2}$$
of the true proportions $\pi_1$, $\pi_2$, and $\pi_3$ of the possible genotypes AA, AB, and BB at the selected locus in the underlying population. The adequacy of the hypotheses formulation (1) for the purpose of establishing goodness rather than lack of fit of an autosomal SNP to HWE is ensured by the following facts:
1. $\theta/4 - 1$ has the same sign as $\pi_2 - 2(\sqrt{\pi_1} - \pi_1)$, and any genotype distribution with parameter $(\pi_1, \pi_2, \pi_3)$ is in perfect HWE if and only if $(\pi_1, \pi_2)$ is a point on the graph of the function
$$\pi_2 = 2(\sqrt{\pi_1} - \pi_1), \quad 0 < \pi_1 < 1. \tag{3}$$
2. For any $0 < \delta^\circ < 1$, there holds the relationship
$$4/(1+\delta^\circ) < \theta < 4(1+\delta^\circ) \iff \pi_{l;\delta^\circ}(\pi_1) < \pi_2 < \pi_{u;\delta^\circ}(\pi_1), \tag{4}$$
where the region bounded by the curves
$$\pi_{l;\delta^\circ}(\pi_1) = 2\Big(\sqrt{(1+\delta^\circ)^{-1}\,\pi_1\,\big(1 - \delta^\circ \pi_1/(1+\delta^\circ)\big)} - \pi_1/(1+\delta^\circ)\Big), \quad 0 < \pi_1 < 1, \tag{5}$$
$$\pi_{u;\delta^\circ}(\pi_1) = 2\Big(\sqrt{(1+\delta^\circ)\,\pi_1\,(1 + \delta^\circ \pi_1)} - (1+\delta^\circ)\,\pi_1\Big), \quad 0 < \pi_1 < 1, \tag{6}$$
encloses the HWE curve (3).
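The geometry described in facts 1 and 2 can be checked numerically. The following is a minimal sketch, assuming that the two boundary curves are the solutions of $\theta = 4/(1+\delta^\circ)$ and $\theta = 4(1+\delta^\circ)$ for $\pi_2$ with $\pi_3 = 1 - \pi_1 - \pi_2$; the function names `theta`, `hwe_curve`, `lower_curve` and `upper_curve` are hypothetical and chosen only for this illustration.

```python
import math


def theta(pi1, pi2):
    """theta = pi2^2 / (pi1 * pi3), with pi3 = 1 - pi1 - pi2."""
    pi3 = 1.0 - pi1 - pi2
    return pi2 ** 2 / (pi1 * pi3)


def hwe_curve(pi1):
    """Heterozygote frequency under perfect HWE for a given pi1."""
    return 2.0 * (math.sqrt(pi1) - pi1)


def lower_curve(pi1, delta0):
    """Solution of theta = 4 / (1 + delta0) for pi2."""
    a = pi1 / (1.0 + delta0)
    return 2.0 * (math.sqrt(a * (1.0 - delta0 * a)) - a)


def upper_curve(pi1, delta0):
    """Solution of theta = 4 * (1 + delta0) for pi2."""
    b = (1.0 + delta0) * pi1
    return 2.0 * (math.sqrt(b * (1.0 + delta0 * pi1)) - b)


if __name__ == "__main__":
    delta0 = 0.96
    for pi1 in (0.05, 0.25, 0.5, 0.8):
        lo, mid, hi = lower_curve(pi1, delta0), hwe_curve(pi1), upper_curve(pi1, delta0)
        print(f"pi1={pi1:.2f}  lower={lo:.4f}  HWE={mid:.4f}  upper={hi:.4f}")
```

On the HWE curve the value of $\theta$ is 4, and on the two boundary curves it equals the endpoints of the equivalence range, which is what the checks below rely on.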
The family $\{\mathcal{M}(\pi_1, \pi_2, \pi_3) \mid 0 < \pi_1, \pi_2, \pi_3 < 1,\ \pi_1 + \pi_2 + \pi_3 = 1\}$ of all trinomial distributions is readily seen to be an exponential family with parameters $\theta$ (as defined in Eq (2)) and $\vartheta = \pi_1/\pi_3$. This fact is the starting point for the construction of the optimal (precisely: uniformly most powerful unbiased) solution carried out by [15]. The practical implementation of the UMPU test requires acquaintance with advanced statistical software (in R, the programs gofhwex and gofhwex_1s of the package EQUIVNONINF [18] can be used). Since this might restrict the suitability of the method for routine use in the analysis of large-scale genetic association studies, in a more recent paper [17], we developed an asymptotic testing procedure for the same problem as a more user-friendly alternative. The latter relies on the principle of confidence interval inclusion, which was introduced by [19] into the field of bioequivalence assessment and can easily be shown (cf. [14], § 7.1) to be a special case of the intersection-union principle (IUP) as formulated by [20]. Despite its conceptual and computational simplicity (a pocket calculator suffices), the IUP-based asymptotic test for (1) turns out to produce rejection regions which, for the sample sizes commonly available for genetic association studies, do not differ by more than a practically negligible amount from the critical region of the exact UMPU test for the same setting and specifications. Fig 1 illustrates the confidence interval inclusion rule for the case that a sample of size n = 200 is available from a genotype distribution of an autosomal SNP and the choice $\delta^\circ = 0.96$ for the constant determining the equivalence bounds to $\theta$ considered acceptable for a SNP in sufficiently good agreement with HWE.
Using de Finetti's coordinate transformation $(\pi_1, \pi_2) \mapsto (\pi_1 + \pi_2/2, \pi_2)$, the graph shows the rejection region of the test obtained by checking an asymptotic 95%-confidence interval for $\theta$ for inclusion in the equivalence interval specified under the alternative hypothesis $K$ of (1). As shown by [17], the choice $\delta^\circ = 0.96$ can be justified by the fact that the corresponding equivalence margin is the smallest one for which the sample size required to attain a power of 90% against the null alternative of perfect agreement with HWE in a test at nominal level α = 0.05 does not exceed 3,000, provided a SNP with minor allele frequency satisfying 0.1 ≤ MAF ≤ 0.5 has to be evaluated.
If A and B are the two possible alleles for a SNP at an X-locus, generalizing the alternative hypothesis of (1) in a natural way leads to replacing $K$ by the statement that the values taken in the underlying subpopulations, i.e., female and male subjects, by the following two distance measures are both "sufficiently small":
$\Delta_f$ = distance among females between the parameter $\theta$ of (2), or some suitable transform of it, and its value under perfect HWE;
$\Delta_m$ = distance between the true distribution of A-alleles among males, which is binomial with parameter $p_Y$, say, and a binomial distribution having the allele frequency holding for females under perfect HWE as its parameter.
Regarding the female subpopulation, we adopt the confidence-limit based approach to constructing a goodness-of-fit test for HWE assessment with autosomal SNPs in a 1:1 manner. Conceptually, this version of a test for equivalence of the genotype distribution of a diallelic marker under analysis with HWE relies on the following fact: a measure of distance from the model which combines straightforward biological interpretability with mathematical convenience can be based on the difference between half the square root of the parameter $\theta$ as defined above in Eq (2), and unity. Actually, 1 is the value of $\omega := (1/2)\sqrt{\theta} = \pi_2/(2\sqrt{\pi_1 \pi_3})$ in a population being in perfect HWE. With a view to symmetry of the distribution of its natural point estimator, we prefer to replace the parameter $\omega$, which we proposed to term relative excess heterozygosity (REH), by its logarithm and to measure in the subpopulation of females the degree of disequilibrium in terms of the distance of $\log \omega$ from zero, i.e., $|\log \omega|$. Accordingly, we define the first component of an aggregate measure of disequilibrium combining the characteristics of the genotype distributions for both gender strata to be given by
$$\Delta_f = |\log \omega| = \Big|\log\frac{\pi_2}{2\sqrt{\pi_1 \pi_3}}\Big|, \tag{7}$$
assuming throughout that the $\pi_j$ denote the genotype frequencies in the subpopulation of females (with the subscript f being omitted for brevity).
As explained above, the other component, $\Delta_m$, must be defined as a function of $(\pi_1, \pi_2, \pi_3)$ and $p_Y$, the true frequency of the allele A of interest in the male subpopulation. In order to make this definition suitable for the present purpose, $\Delta_m$ has to be a reasonable measure of distance between two binomial distributions with parameters $p_1 = \pi_1 + \pi_2/2$ and $p_2 = p_Y$. The literature on equivalence testing methods for clinical trials contains several different proposals for choosing such a measure. As has been argued by [14] (see also [21]), a particularly well suited definition is based on the log odds ratio between $p_1$ and $p_2$, which in the present context leads to setting
$$\Delta_m = \Big|\log\frac{(\pi_1 + \pi_2/2)\,(1 - p_Y)}{(\pi_2/2 + \pi_3)\,p_Y}\Big|. \tag{8}$$
Thus, as an aggregate criterion of approximate compatibility of an X-chromosomal SNP with HWE to be satisfied under the alternative hypothesis of the test to be subsequently derived, we use the condition
$$\Delta := \sqrt{\Delta_f^2 + \Delta_m^2} < \varepsilon. \tag{9}$$
Denoting the signed versions of $\Delta_f$ and $\Delta_m$ by $\Delta^*_f$ and $\Delta^*_m$, respectively, the set of all combinations of subpopulation genotype and allele frequencies satisfying (9) obviously corresponds to a circular disc of radius ε in the parameter space of $(\Delta^*_f, \Delta^*_m)$ centered at the origin. Hence, it seems reasonable to choose ε to be the radius of the smallest circle which contains a square with edges of length equal to twice a suitable common margin to $\Delta_f$ and $\Delta_m$. In testing for equivalence of two binomial distributions with respect to the log odds ratio, a well-established specification of the equivalence margin is $\varepsilon_{LOR} = \log(12/8) \approx 0.41$ (for the rationale behind this recommendation cf. [14], § 1.6). Furthermore, the margin which has been proposed by [17] for $\omega^2 = \theta/4$, the square of the REH, is 1.96, corresponding to $\log(1.4) \approx 0.34$ for $\Delta_f$. Since these margins are not identical, we propose to take the tighter one as a basis for specifying the margin
$$\varepsilon = \sqrt{2}\,\log(1.4) \approx 0.4758.$$
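The two distance components and the aggregate criterion can be written down in a few lines. The sketch below assumes the definitions $\Delta^*_f = \log(\pi_2/(2\sqrt{\pi_1\pi_3}))$ and $\Delta^*_m$ = log odds ratio between $p_1 = \pi_1 + \pi_2/2$ and $p_Y$, together with the margin $\varepsilon = \sqrt{2}\log(1.4)$; the function names are hypothetical.

```python
import math

# Radius of the smallest circle containing the square [-log 1.4, log 1.4]^2.
EPSILON = math.sqrt(2.0) * math.log(1.4)


def delta_f_signed(pi1, pi2, pi3):
    """Signed female component: log of the relative excess heterozygosity."""
    return math.log(pi2 / (2.0 * math.sqrt(pi1 * pi3)))


def delta_m_signed(pi1, pi2, pi3, p_y):
    """Signed male component: log odds ratio between p1 = pi1 + pi2/2 and p_y."""
    p1 = pi1 + pi2 / 2.0
    return math.log(p1 / (1.0 - p1)) - math.log(p_y / (1.0 - p_y))


def combined_distance(pi1, pi2, pi3, p_y):
    """Euclidean distance of (Delta_f*, Delta_m*) from the origin."""
    return math.hypot(delta_f_signed(pi1, pi2, pi3),
                      delta_m_signed(pi1, pi2, pi3, p_y))


if __name__ == "__main__":
    # Perfect HWE with allele frequency 0.3 in both sexes: distance is zero.
    print(combined_distance(0.09, 0.42, 0.49, 0.3))
    # Females in HWE, male odds smaller by the factor 1.4: distance is log(1.4).
    print(combined_distance(0.25, 0.5, 0.25, 1.0 / 2.4), EPSILON)
```

A configuration in which only one component sits at its individual margin $\log(1.4)$ is still inside the disc; both components must be at the margin simultaneously to reach the boundary.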

Interval estimation and testing procedure
As a pivotal quantity for inference about our distance measure $\Delta$ we consider the plug-in point estimator obtained by replacing the population frequencies involved with the homologous empirical proportions $\hat\pi_j = X_j/n_1$ ($j = 1, 2, 3$) and $\hat p_Y = Y/n_2$, where $(X_1, X_2, X_3)$ and $Y$ are assumed to belong to independent samples of sizes $n_1$ [females] and $n_2$ [males] from a multinomial distribution with parameters $(\pi_1, \pi_2, \pi_3)$ and a binomial distribution with parameter $p_Y$, respectively. Recalling (7) and (8), this leads to the expression
$$\hat\Delta = \sqrt{\Big(\log\frac{\hat\pi_2}{2\sqrt{\hat\pi_1\hat\pi_3}}\Big)^2 + \Big(\log\frac{(\hat\pi_1 + \hat\pi_2/2)\,(1 - \hat p_Y)}{(\hat\pi_2/2 + \hat\pi_3)\,\hat p_Y}\Big)^2}.$$
As usual in an asymptotic treatment of inferential procedures for two-sample settings, all statements about convergence in distribution of variables being functions of the $\hat\pi_j$ and $\hat p_Y$ will hold under the assumption that the relative sample sizes $n_1/N$ and $n_2/N$ tend to nondegenerate limits $\lambda$ and $1 - \lambda$, say, as the total sample size $N = n_1 + n_2$ increases to infinity. The basic properties of the multinomial family and the independence of $(X_1, X_2, X_3)$ and $Y$ ensure that the limiting distribution of $\sqrt{N}\,\big((\hat\pi_1, \hat\pi_2, \hat\pi_3, \hat p_Y, 1 - \hat p_Y) - (\pi_1, \pi_2, \pi_3, p_Y, 1 - p_Y)\big)$ is multivariate normal with expected value zero and (singular) covariance matrix
$$\Sigma_\lambda = \begin{pmatrix} \lambda^{-1}\big(\mathrm{diag}(\pi_1, \pi_2, \pi_3) - \pi\pi^\top\big) & 0 \\ 0 & (1-\lambda)^{-1}\, p_Y (1 - p_Y) \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix} \end{pmatrix}, \quad \pi = (\pi_1, \pi_2, \pi_3)^\top.$$
Weak convergence of $\sqrt{N}\,\big((\hat\pi_1, \hat\pi_2, \hat\pi_3, \hat p_Y, 1 - \hat p_Y) - (\pi_1, \pi_2, \pi_3, p_Y, 1 - p_Y)\big)$ to a centered Gaussian distribution with the above covariance structure is the starting point for establishing Proposition 1, which gives the joint limiting distribution of the signed plug-in estimators $(\hat\Delta^*_f, \hat\Delta^*_m)$ with asymptotic variances $\sigma^2_{f,\lambda}$ and $\sigma^2_{m,\lambda}$ (for details of a proof see S1 Appendix).
The plug-in estimator of the joint distance measure $\Delta$ to be eventually used for HWE assessment can be written as $\hat\Delta = \sqrt{\hat\Delta_f^{*2} + \hat\Delta_m^{*2}}$. Hence, except for suitable centering and rescaling by means of $\sqrt{N}$, it has a limiting normal distribution whose variance is a weighted average of $\sigma^2_{f,\lambda}$ and $\sigma^2_{m,\lambda}$. Precisely speaking, the following holds.
PROPOSITION 2. Assume that for at least one subgroup $G \in \{f, m\}$, the true value of $\Delta^*_G$ does not vanish. Then, as $N \to \infty$, $\sqrt{N}(\hat\Delta - \Delta)$ converges in law to a normally distributed variable with expectation zero and variance given by
$$\tau^2_\lambda = \frac{\Delta_f^2\,\sigma^2_{f,\lambda} + \Delta_m^2\,\sigma^2_{m,\lambda}}{\Delta_f^2 + \Delta_m^2}. \tag{15}$$
Proof. The result follows directly from Proposition 1 by means of the delta method (cf. [22], § 14.6).
Obviously, $\Delta_f^2$, $\Delta_m^2$, $\sigma^2_{f,\lambda}$ and $\sigma^2_{m,\lambda}$ are all continuous functions of $(\pi_1, \pi_2, \pi_3, p_Y)$, so that the same holds true for the asymptotic variance $\tau^2_\lambda$ of $\sqrt{N}\hat\Delta$. Since the relative frequencies $(\hat\pi_1, \hat\pi_2, \hat\pi_3, \hat p_Y)$ are consistent for the corresponding population frequencies, this implies that plugging in the former in all terms appearing on the right-hand side of Eq (15) and replacing the limiting relative size $\lambda$ of the sample of females with the actual proportion $n_1/(n_1 + n_2)$ yields a consistent estimator of $\tau^2_\lambda$. Consistency of this estimator, denoted $\hat\tau^2_{n_1,n_2}$ in the sequel, allows us to infer from Proposition 2 that there holds
$$\sqrt{N}\,(\hat\Delta - \Delta)/\hat\tau_{n_1,n_2} \xrightarrow{\ \mathcal{L}\ } \mathcal{N}(0, 1) \quad\text{as } N \to \infty. \tag{16}$$
The testing problem which we are interested in reads in formal terms
$$H:\ \Delta \ge \varepsilon \quad\text{versus}\quad K:\ \Delta < \varepsilon, \tag{17}$$
and it can be solved by checking an upper confidence bound to the target parameter $\Delta$ for non-exceedance of the equivalence margin ε. By (16), an upper confidence limit to $\Delta$ at asymptotic level $1 - \alpha$ is given by
$$\hat\Delta + z_{1-\alpha}\,\hat\tau_{n_1,n_2}/\sqrt{N}, \quad N = n_1 + n_2. \tag{18}$$
Finally, as the critical region of the corresponding test at asymptotic level α for (17), we obtain
$$\big\{\hat\Delta + z_{1-\alpha}\,\hat\tau_{n_1,n_2}/\sqrt{N} < \varepsilon\big\}. \tag{19}$$
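The asymptotic variances of the two signed components can be retraced with the delta method. The following is a sketch derived here from the multinomial and binomial covariance structure described above; it is not a transcription of the paper's Eqs (13) and (14), and the symbols $a$, $b$ and $\Sigma_\pi$ are introduced only for this illustration.

```latex
\begin{align*}
\Sigma_\pi &= \operatorname{diag}(\pi_1,\pi_2,\pi_3) - \pi\pi^\top, \qquad p_1 = \pi_1 + \pi_2/2,\\[4pt]
\Delta^*_f &= \log\pi_2 - \log 2 - \tfrac12(\log\pi_1 + \log\pi_3),
  & a &= \nabla_\pi \Delta^*_f = \Big(-\tfrac{1}{2\pi_1},\ \tfrac{1}{\pi_2},\ -\tfrac{1}{2\pi_3}\Big)^{\!\top},\\
\sigma^2_{f,\lambda} &= \lambda^{-1}\, a^\top \Sigma_\pi\, a
  = \lambda^{-1}\Big[\tfrac{1}{4\pi_1} + \tfrac{1}{\pi_2} + \tfrac{1}{4\pi_3}\Big]
  && \text{(because } a^\top\pi = 0\text{)},\\[4pt]
\Delta^*_m &= \log\frac{p_1}{1-p_1} - \log\frac{p_Y}{1-p_Y},
  & b &= \nabla_\pi \log\frac{p_1}{1-p_1} = \frac{(1,\ \tfrac12,\ 0)^\top}{p_1(1-p_1)},\\
\sigma^2_{m,\lambda} &= \lambda^{-1}\,\frac{\pi_1 + \pi_2/4 - p_1^2}{p_1^2(1-p_1)^2}
  + (1-\lambda)^{-1}\,\frac{1}{p_Y(1-p_Y)},\\[4pt]
a^\top \Sigma_\pi\, b &= \sum_j a_j b_j \pi_j - (a^\top\pi)(b^\top\pi)
  = \frac{-\tfrac12 + \tfrac12}{p_1(1-p_1)} - 0 = 0 .
\end{align*}
```

Under these assumptions the two components are asymptotically uncorrelated, which is consistent with the variance of the combined estimator being a weighted average of $\sigma^2_{f,\lambda}$ and $\sigma^2_{m,\lambda}$.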
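The decision rule described above, rejecting the null hypothesis of relevant deviations when the upper confidence bound for $\Delta$ falls below ε, can be sketched as follows. This is a minimal illustration, not the authors' SAS/IML implementation: the variance expressions are the delta-method formulas $\sigma^2_f = \lambda^{-1}[1/(4\pi_1) + 1/\pi_2 + 1/(4\pi_3)]$ and $\sigma^2_m = \lambda^{-1}(\pi_1 + \pi_2/4 - p_1^2)/(p_1(1-p_1))^2 + (1-\lambda)^{-1}/(p_Y(1-p_Y))$, which are assumed here rather than taken from the text, and the handling of the degenerate case $\hat\Delta = 0$ (using the larger of the two variances) is likewise an assumption. The function name and return format are hypothetical.

```python
import math
from statistics import NormalDist


def x_hwe_equivalence_test(x1, x2, x3, y, n2, alpha=0.05,
                           eps=math.sqrt(2.0) * math.log(1.4)):
    """Return (delta_hat, upper_bound, reject) for female genotype counts
    (x1, x2, x3) = (AA, AB, BB) and y A-alleles among n2 males."""
    if min(x1, x2, x3) < 1 or not 0 < y < n2:
        raise ValueError("all genotype counts and both male allele counts must be positive")
    n1 = x1 + x2 + x3
    n_total = n1 + n2
    lam = n1 / n_total                      # observed proportion of females
    pi1, pi2, pi3 = x1 / n1, x2 / n1, x3 / n1
    p_y = y / n2
    p1 = pi1 + pi2 / 2.0                    # female allele frequency

    # Signed distance components and their Euclidean norm.
    d_f = math.log(pi2 / (2.0 * math.sqrt(pi1 * pi3)))
    d_m = math.log(p1 / (1.0 - p1)) - math.log(p_y / (1.0 - p_y))
    delta_hat = math.hypot(d_f, d_m)

    # Delta-method variances of sqrt(N) times each component (assumed formulas).
    var_f = (0.25 / pi1 + 1.0 / pi2 + 0.25 / pi3) / lam
    var_m = ((pi1 + pi2 / 4.0 - p1 ** 2) / (lam * (p1 * (1.0 - p1)) ** 2)
             + 1.0 / ((1.0 - lam) * p_y * (1.0 - p_y)))

    if delta_hat > 0.0:
        tau2 = (d_f ** 2 * var_f + d_m ** 2 * var_m) / delta_hat ** 2
    else:
        tau2 = max(var_f, var_m)            # assumption: conservative fallback

    z = NormalDist().inv_cdf(1.0 - alpha)
    upper = delta_hat + z * math.sqrt(tau2 / n_total)
    return delta_hat, upper, upper < eps


if __name__ == "__main__":
    print(x_hwe_equivalence_test(260, 480, 260, 520, 1000))
    print(x_hwe_equivalence_test(400, 200, 400, 500, 1000))
```

The first call uses counts close to HWE and the second a strong heterozygote deficit, so only the first should end in a decision for approximate compatibility.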

Exact and approximate methods of computing rejection probabilities and sample-size planning
The rejection probability of the test with critical region (19) under arbitrary parameter configurations is accessible to exact numerical computation. Exploiting the assumed independence of $(X_1, X_2, X_3)$ and $Y$, we can write:
$$P_{(\pi_1,\pi_2,\pi_3,p_Y)}\big[\hat\Delta + z_{1-\alpha}\hat\tau_{n_1,n_2}/\sqrt{N} < \varepsilon\big] = \sum_{x_1=0}^{n_1}\ \sum_{x_2=0}^{n_1-x_1}\ \sum_{y=0}^{n_2} I_{(0,\infty)}\big(\varepsilon - \hat\Delta(x_1,x_2,y) - z_{1-\alpha}\hat\tau_{n_1,n_2}(x_1,x_2,y)/\sqrt{N}\big)\; m(x_1,x_2;\,n_1,\pi_1,\pi_2)\; b(y;\,n_2,p_Y), \tag{20}$$
where $I_{(0,\infty)}(\cdot)$ denotes the indicator of the positive real half-line, and $m(\cdot)$ and $b(\cdot)$ denote the probability mass functions of the trinomial distribution of $(X_1, X_2, X_3)$ and the binomial distribution of $Y$, respectively. Evaluation of the triple sum appearing on the right-hand side of this equation by means of the SAS/IML script we made available for that purpose is fast enough for keeping execution time within reasonable limits even for sample sizes exceeding 1,000.
Planning a study under a non-null alternative. Under any alternative $(\pi^*_1, \pi^*_2, \pi^*_3, p^*_Y)$, say, for which the assumptions of Proposition 2 are satisfied, an approximate formula for the sample size required in order to guarantee that the power does not fall short of some prespecified target value $1 - \beta$, say, is readily obtained. According to that result, the rejection probability of the test using the critical region defined in (19) under an arbitrary parameter configuration with $\Delta > 0$ is asymptotically equal to $\Phi\big[z_\alpha - \sqrt{N}(\Delta - \varepsilon)/\tau_\lambda\big]$ as $N \to \infty$ and $n_1/N \to \lambda$. In terms of $\Delta$, our testing problem is one of one-sided equivalence or, as one would put it in the language of the methodology of clinical trials, of non-superiority. In the literature on asymptotic testing procedures for non-inferiority problems (cf. [23]), it is recommended to approximate the power of an asymptotic test with critical region $\{\sqrt{N}(T_N - \theta_0)/\sigma_0 > z_{1-\alpha}\}$ by computing the probability that the data fall in this region from a normal distribution with variance $\sigma_1^2$ rather than $\sigma_0^2$, where $\sigma_1^2$ denotes the limiting variance of $\sqrt{N}\,T_N$ under the selected alternative $\theta = \theta_1 > \theta_0$.
Adapting this approach in the straightforward way to the setting we are dealing with, and denoting the distance of $(\Delta^*_f, \Delta^*_m)$ from zero under the selected alternative by $\Delta^*$, leads to
$$\mathrm{POW}_\varepsilon(\Delta^*; \lambda, N) \approx \Phi\big[\big(\tilde\tau_\lambda z_\alpha - \sqrt{N}(\Delta^* - \varepsilon)\big)/\tau^*_\lambda\big]. \tag{21}$$
In this approximate equation, $\tau^{*2}_\lambda$ has to be computed by evaluating (13)-(15) with $(\pi_1, \pi_2, \pi_3, p_Y) = (\pi^*_1, \pi^*_2, \pi^*_3, p^*_Y)$, and in order to determine $\tilde\tau^2_\lambda$, the same formulas have to be applied with some $(\tilde\pi_1, \tilde\pi_2, \tilde\pi_3, \tilde p_Y)$ such that the corresponding point in the parameter space of $(\Delta^*_f, \Delta^*_m)$ lies on the circle of radius ε around the origin. For definiteness, we propose to choose […] The final step required for transforming (21) into the desired sample size formula consists of specifying the power $1 - \beta$ one wants to attain and solving the equation $\Phi\big[\big(\tilde\tau_\lambda z_\alpha - \sqrt{N}(\Delta^* - \varepsilon)\big)/\tau^*_\lambda\big] = 1 - \beta$ for $N$, which yields after a little algebra the expression
$$N = \Big[\frac{\tilde\tau_\lambda\, z_{1-\alpha} + \tau^*_\lambda\, z_{1-\beta}}{\varepsilon - \Delta^*}\Big]^2. \tag{22}$$
The case of null alternatives. Despite the often unsatisfactory accuracy provided by formula (22) for sample size planning under non-null alternatives, its derivability from standard weak convergence results is obvious. In contrast, for the power of the test with critical region (19) under alternatives under which the true value of $\Delta$ is zero, no useful approximation by means of a simple Gaussian distribution exists. An approach which will turn out to solve the problem in a very satisfactory way is based on the following concept.
DEFINITION 1. Let $Z_1, \ldots, Z_k$ be mutually independent with $Z_j \sim \mathcal{N}(0, c_j^2)$, where $c_1 = 1$ and $c_j$ denotes an arbitrary positive constant for all $j = 2, \ldots, k$. Then, the distribution of $Q := \sqrt{\sum_{j=1}^{k} Z_j^2}$ is called an extended χ-distribution with $k$ degrees of freedom and parameter $c = (c_1, \ldots, c_k)$. Its cdf (cumulative distribution function) will be written $Q_c(\cdot)$.
Although $Q_c(\cdot)$ is not a known statistical function for which the packages provide predefined routines (except, of course, for the standard χ-distribution corresponding to the special case $c_j = 1$ for all $j = 1, \ldots, k$), it is not difficult to find a representation which can serve as a basis for an easy-to-implement algorithm for numerical computation. In the special case $k = 2$, where we drop the subscript from the only non-unity component of $c$, we can rely on the following result.
LEMMA 1. For arbitrarily fixed $c > 0$ and any $q \in \mathbb{R}_+$, there holds
$$Q_c(q) = 2\int_0^q \Big[2\,\Phi\big(\sqrt{q^2 - z^2}\,/\,c\big) - 1\Big]\,\phi(z)\,dz, \tag{23}$$
with $\phi(\cdot)$ and $\Phi(\cdot)$ denoting, as usual, the standard normal density and cdf, respectively.
The key computational tool required for working with the distribution function $Q_c(\cdot)$ in practice is an efficient procedure for the evaluation of the integral appearing on the right-hand side of Eq (23). The SAS/IML script we developed for that purpose uses Gauss-Legendre 96-point quadrature and partitioning of the range of integration into 10 subintervals. Even when numerical integration is done at that high level of accuracy, the algorithm is still fast enough to enable also exact computation of the corresponding quantile function $Q_c^{-1}(\cdot)$. The relevance of the distribution function $Q_c(\cdot)$ for finding an approximation to the power of our test for goodness of fit to HWE becomes obvious from
PROPOSITION 3. Let $P^{(N)}_*(\cdot)$ denote the joint distribution of $(X_1, X_2, X_3, Y)$ under some fixed parameter configuration $(\pi^*_1, \pi^*_2, \pi^*_3, p^*_Y)$ with $\Delta^*_f = \Delta^*_m = 0$. Then, there holds for every $d > 0$
$$\lim_{N\to\infty} P^{(N)}_*\big[\sqrt{N}\,\hat\Delta/\sigma_{f,\lambda} \le d\big] = Q_{\sigma_{m,\lambda}/\sigma_{f,\lambda}}(d). \tag{24}$$
Proof. From the definition of $\hat\Delta$, it is immediately clear that, denoting the Euclidean distance of any point $(z_1, z_2)$ in the plane from the origin by $q(z_1, z_2)$, we can write
$$\sqrt{N}\,\hat\Delta/\sigma_{f,\lambda} = q\big(\sqrt{N}\,\hat\Delta^*_f/\sigma_{f,\lambda},\ (\sigma_{m,\lambda}/\sigma_{f,\lambda})\,\sqrt{N}\,\hat\Delta^*_m/\sigma_{m,\lambda}\big) \tag{25}$$
and
$$\big(\sqrt{N}\,\hat\Delta^*_f/\sigma_{f,\lambda},\ (\sigma_{m,\lambda}/\sigma_{f,\lambda})\,\sqrt{N}\,\hat\Delta^*_m/\sigma_{m,\lambda}\big) \xrightarrow{\ \mathcal{L}\ } (Z_1, Z_2) \quad\text{as } N \to \infty, \tag{26}$$
where $(Z_1, Z_2)$ are as assumed in Definition 1 with $k = 2$, $c_2 = \sigma_{m,\lambda}/\sigma_{f,\lambda}$. Since $q(\cdot, \cdot)$ is continuous, the mapping theorem for weakly convergent sequences of probability measures (see, e.g., [24], p. 379) allows us to conclude from (26) that we also have
$$q\big(\sqrt{N}\,\hat\Delta^*_f/\sigma_{f,\lambda},\ (\sigma_{m,\lambda}/\sigma_{f,\lambda})\,\sqrt{N}\,\hat\Delta^*_m/\sigma_{m,\lambda}\big) \xrightarrow{\ \mathcal{L}\ } q(Z_1, Z_2) \sim Q_{\sigma_{m,\lambda}/\sigma_{f,\lambda}}(\cdot) \quad\text{as } N \to \infty,$$
which in view of (25) completes the proof.
The steps to be taken in order to exploit Proposition 3 for approximating the probability of the event $\{\hat\Delta + z_{1-\alpha}\hat\tau_{n_1,n_2}/\sqrt{N} < \varepsilon\}$ under a fixed null alternative $(\pi^*_1, \pi^*_2, \pi^*_3, p^*_Y)$ are analogous to those which lead from Proposition 2 to the power approximation (21) for the case of non-null alternatives.
First of all, we replace the empirical asymptotic standard error $\hat\tau_{n_1,n_2}$ of $\sqrt{N}\hat\Delta$ with $\tilde\tau_\lambda$, i.e., the square root of the theoretical asymptotic variance of $\sqrt{N}\hat\Delta$ computed at a point $(\tilde\pi_1, \tilde\pi_2, \tilde\pi_3, \tilde p_Y)$ on the boundary of the equivalence circle in the parameter space of $(\Delta^*_f, \Delta^*_m)$ being conjugate to $(\pi^*_1, \pi^*_2, \pi^*_3, p^*_Y)$ in the sense made explicit above (see the paragraph following Eq 19). Making this substitution reduces the problem of power computation against null alternatives to that of calculating
$$P^{(N)}_*\big[\hat\Delta + z_{1-\alpha}\,\tilde\tau_\lambda/\sqrt{N} < \varepsilon\big].$$
By definition (recall Eq 13), $\tilde\tau^2_\lambda$ is a weighted mean of $\tilde\sigma^2_{f,\lambda}$ and $\tilde\sigma^2_{m,\lambda}$ […] as the desired null-alternative analogue of (22) […]
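For $k = 2$, the cdf of $Q = \sqrt{Z_1^2 + Z_2^2}$ with $Z_1 \sim \mathcal{N}(0,1)$ and $Z_2 \sim \mathcal{N}(0,c^2)$ can be obtained by conditioning on $Z_1$, which gives the one-dimensional integral $2\int_0^q [2\Phi(\sqrt{q^2-z^2}/c) - 1]\,\phi(z)\,dz$. The sketch below evaluates this integral numerically. It deliberately swaps in composite Simpson integration after the substitution $z = q\sin t$ (which removes the square-root singularity at $z = q$) instead of Gauss-Legendre quadrature, and it adds a bisection-based quantile function; the function names and the number of subintervals are assumptions made for illustration.

```python
import math


def _phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def extended_chi_cdf(q, c, n_intervals=400):
    """P[sqrt(Z1^2 + Z2^2) <= q] for Z1 ~ N(0,1), Z2 ~ N(0,c^2), independent."""
    if q <= 0.0:
        return 0.0
    if n_intervals % 2:
        n_intervals += 1                     # Simpson's rule needs an even count

    def integrand(t):
        z, w = q * math.sin(t), q * math.cos(t)
        return 2.0 * _phi(z) * (2.0 * _Phi(w / c) - 1.0) * w

    h = (math.pi / 2.0) / n_intervals
    total = integrand(0.0) + integrand(math.pi / 2.0)
    for i in range(1, n_intervals):
        total += (4.0 if i % 2 else 2.0) * integrand(i * h)
    return total * h / 3.0


def extended_chi_quantile(p, c, tol=1e-10):
    """Invert the cdf by bisection (0 < p < 1)."""
    lo, hi = 0.0, 1.0
    while extended_chi_cdf(hi, c) < p:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if extended_chi_cdf(mid, c) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # For c = 1 the distribution is the standard chi with 2 df (Rayleigh).
    print(extended_chi_cdf(1.0, 1.0), 1.0 - math.exp(-0.5))
    print(extended_chi_quantile(0.9, 2.0))
```

For $c = 1$ the result can be compared with the closed form $1 - \exp(-q^2/2)$, which provides a simple check of the quadrature.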
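For the non-null case treated earlier in this subsection, the power approximation $\Phi[(\tilde\tau_\lambda z_\alpha - \sqrt{N}(\Delta^* - \varepsilon))/\tau^*_\lambda]$ and the sample size obtained by solving it for $N$, namely $N = [(\tilde\tau_\lambda z_{1-\alpha} + \tau^*_\lambda z_{1-\beta})/(\varepsilon - \Delta^*)]^2$, can be sketched as follows. The function names are hypothetical, and the two standard errors `tau_star` and `tau_tilde` are taken as inputs because their computation depends on the chosen alternative and its conjugate boundary point; the numerical values in the usage example are illustrative only.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def approx_power(n_total, delta_star, tau_star, tau_tilde, eps, alpha=0.05):
    """Normal approximation to the power under a non-null alternative."""
    z_alpha = _STD_NORMAL.inv_cdf(alpha)     # lower alpha-quantile (negative)
    arg = (tau_tilde * z_alpha - math.sqrt(n_total) * (delta_star - eps)) / tau_star
    return _STD_NORMAL.cdf(arg)


def sample_size_non_null(delta_star, tau_star, tau_tilde, eps,
                         alpha=0.05, power=0.90):
    """Smallest integer N whose approximate power reaches the target."""
    if not 0.0 <= delta_star < eps:
        raise ValueError("the alternative must lie inside the equivalence circle")
    z_a = _STD_NORMAL.inv_cdf(1.0 - alpha)
    z_b = _STD_NORMAL.inv_cdf(power)
    return math.ceil(((tau_tilde * z_a + tau_star * z_b) / (eps - delta_star)) ** 2)


if __name__ == "__main__":
    eps = math.sqrt(2.0) * math.log(1.4)
    n = sample_size_non_null(0.2, 3.0, 3.2, eps)   # illustrative inputs
    print(n, approx_power(n, 0.2, 3.0, 3.2, eps))
```

Because the result is rounded up, the approximate power at the returned $N$ is at least the target and falls below it at $N - 1$.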

Small-sample properties of the proposed test for goodness of fit
A first basic question to answer is whether the procedure maintains the nominal significance level when performed with samples of sizes commonly available in genetic association studies. The results shown in Tables 1 and 2 give the exact rejection probabilities at a selection of points in the parameter space lying on the common boundary of the hypotheses we are interested in. The position of these points in the $(\Delta^*_f, \Delta^*_m)$-plane is shown in Fig 2. In the constellations covered by Tables 1 and 2 and in many other instances, we found no single case of anti-conservative behavior. On the other hand, it becomes obvious from the entries in the tables that the convergence of the rejection probability under the null hypothesis of relevant deviations from HWE to the nominal significance level is comparatively slow. Even for settings with sample sizes of more than 1000 in both subgroups, the absolute difference by which the rejection probability under $H_0$ falls below the nominal level of 5% can still be larger than 1%.
In testing for goodness rather than lack of fit of empirical distributions to some probability model, the specific alternatives of primary interest are usually those satisfying the null hypothesis of the corresponding test for lack of fit. In the present case, under such null alternatives, there is only a single free parameter left, namely, the common frequency of allele A in females and males. As becomes obvious from the results shown in Table 3, the power of the proposed test against null alternatives is highly sensitive to changes in this parameter. For an allele frequency of 50%, 400 observations from each subpopulation are sufficient to increase the power above 95%. In contrast, for alleles occurring at a frequency of only 10% both in females and males, the sample size per group must be a bit more than three times as large if one wants to rule out that the power drops below 75%. With regard to power, null alternatives are obviously the most favorable parameter configurations, and perfect fit of the distributions underlying the data to the model is of course a limiting case which will hardly occur in reality. Given anything else, it has to be expected that the power drops quite rapidly when the point in the parameter plane of $(\Delta^*_f, \Delta^*_m)$ corresponding to the specific alternative of interest is shifted from the origin towards the boundary of the equivalence circle. In order to obtain insight into the speed of this decline in power, we calculated the exact power of the test at nominal level α = 0.05 attained at 9 equidistant points on the segment between 0 and $\varepsilon = \sqrt{2}\,\log(1.4)$ on the main diagonal for samples of size 800 each. From the results of these calculations which are shown in Table 4, one can see that increasing
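Exact rejection probabilities of this kind are obtained by summing the joint probability of all female genotype counts and male allele counts for which the test rejects. The sketch below is a plain-Python illustration of such a triple sum, not the authors' SAS/IML script. It assumes the delta-method variance formulas $\sigma^2_f = \lambda^{-1}[1/(4\pi_1) + 1/\pi_2 + 1/(4\pi_3)]$ and $\sigma^2_m = \lambda^{-1}(\pi_1 + \pi_2/4 - p_1^2)/(p_1(1-p_1))^2 + (1-\lambda)^{-1}/(p_Y(1-p_Y))$, treats samples with a zero count (where the logarithms are undefined) as non-rejections, and uses the larger of the two variances when the estimated distance is exactly zero; all three choices are assumptions.

```python
import math
from statistics import NormalDist


def exact_rejection_probability(n1, n2, pi, p_y, alpha=0.05,
                                eps=math.sqrt(2.0) * math.log(1.4)):
    """Sum the joint pmf over all (x1, x2, y) for which the test rejects."""
    pi1, pi2, pi3 = pi
    n_total = n1 + n2
    lam = n1 / n_total
    z = NormalDist().inv_cdf(1.0 - alpha)

    # Male part: binomial probability, logit and variance term for each y.
    males = []
    for y in range(1, n2):                   # y = 0 or n2: treated as no rejection
        q = y / n2
        log_prob = (math.lgamma(n2 + 1) - math.lgamma(y + 1) - math.lgamma(n2 - y + 1)
                    + y * math.log(p_y) + (n2 - y) * math.log(1.0 - p_y))
        males.append((math.exp(log_prob), math.log(q / (1.0 - q)),
                      1.0 / ((1.0 - lam) * q * (1.0 - q))))

    total = 0.0
    for x1 in range(1, n1 - 1):
        for x2 in range(1, n1 - x1):
            x3 = n1 - x1 - x2                # always >= 1 here
            f1, f2, f3 = x1 / n1, x2 / n1, x3 / n1
            d_f = math.log(f2 / (2.0 * math.sqrt(f1 * f3)))
            if abs(d_f) >= eps:              # bound cannot fall below eps
                continue
            p1 = f1 + f2 / 2.0
            logit_f = math.log(p1 / (1.0 - p1))
            var_f = (0.25 / f1 + 1.0 / f2 + 0.25 / f3) / lam
            var_mf = (f1 + f2 / 4.0 - p1 ** 2) / (lam * (p1 * (1.0 - p1)) ** 2)
            inner = 0.0
            for prob_y, logit_m, var_y in males:
                d_m = logit_f - logit_m
                d2 = d_f * d_f + d_m * d_m
                if d2 >= eps * eps:
                    continue
                var_m = var_mf + var_y
                tau2 = ((d_f * d_f * var_f + d_m * d_m * var_m) / d2
                        if d2 > 0.0 else max(var_f, var_m))
                if math.sqrt(d2) + z * math.sqrt(tau2 / n_total) < eps:
                    inner += prob_y
            if inner > 0.0:
                log_prob_x = (math.lgamma(n1 + 1) - math.lgamma(x1 + 1)
                              - math.lgamma(x2 + 1) - math.lgamma(x3 + 1)
                              + x1 * math.log(pi1) + x2 * math.log(pi2)
                              + x3 * math.log(pi3))
                total += math.exp(log_prob_x) * inner
    return total


if __name__ == "__main__":
    # Power against a null alternative (perfect HWE, allele frequency 0.5).
    print(exact_rejection_probability(100, 100, (0.25, 0.5, 0.25), 0.5))
```

The nested loops scale roughly with $n_1^2 n_2$, so this pure-Python version is only practical for moderate sample sizes; the small example uses 100 subjects per sex.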

Illustrating examples
To illustrate our method, we use the same data as Graffelman and Weir [7]. They illustrated the application of their combined χ 2 -test for lack of fit of an X-chromosomal SNP to HWE. The genotype and allele frequencies observed in these examples were extracted from the publicly accessible [25] database of the GENEVA venous thrombosis project, a genomewide association study (GWAS) performed in 2010/11 with the objective to identify genetic variants associated with venous thromboembolism (VTE). The subjects recruited for the project were 1300 VTE cases and 1300 unrelated controls, frequency-matched on 5 elementary criteria. The observed genotype and allele counts obtained in the GENEVA project for four X-chromosomal SNPs (indexed here for brevity by integer numbers) analyzed by [7] are the entries in the left-hand columns of Table 5, which additionally shows the values of the basic estimators required for carrying out the goodness-of-fit test derived in this paper. Except for rs12010339, the upper 95% confidence bound to the combined distance measure Δ falls below the proposed numerical value of the equivalence limit ε to Δ so that 3 out of the 4 SNPs under consideration pass the check for approximate compatibility with HWE. The only setting for which there is full coincidence in terms of the qualitative conclusions between our procedure and the lackof-fit test proposed by [7] is that of rs5968922: with these data, the latter gives a (2-sided) pvalue of 100% and thus clearly indicates nonexistence of deviations from HWE. In the other cases, a well-judged synoptic interpretation of the results of both testing procedures requires to take into account that a small p-value of a test tailored for detecting differences in no way rules out that an equivalence test carried out with the same data likewise leads to a positive decision. 
This follows from the fact that the indifference zone corresponding to the alternative hypothesis of an equivalence problem consists of points which also belong to the alternative to the classical null hypothesis of perfect coincidence with the model. Thus, there are parameter constellations under which both tests may have moderate or even high power (in the setting of Fig 2, this holds true for all interior points of the circular disc with radius $\varepsilon = \sqrt{2}\,\log(1.4)$).
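The geometry of the decision rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the combined distance Δ is taken to be the Euclidean norm of the two sex-specific distances (one reading of the term "equivalence circle"), the margin is taken as √2·log(1.4), and all function names as well as the numerical bound in the usage lines are hypothetical and not taken from Table 5.

```python
import math

# Equivalence margin, assumed here to be eps = sqrt(2) * log(1.4).
EPS = math.sqrt(2.0) * math.log(1.4)


def combined_distance(delta_f, delta_m):
    """Euclidean norm of the two sex-specific distances (assumed form of Delta)."""
    return math.hypot(delta_f, delta_m)


def inside_equivalence_disc(delta_f, delta_m, eps=EPS):
    """True if the parameter point lies strictly inside the disc of radius eps."""
    return combined_distance(delta_f, delta_m) < eps


def passes_hwe_check(upper_conf_bound, eps=EPS):
    """Conclude approximate compatibility with HWE if the upper bound is below eps."""
    return upper_conf_bound < eps


def diagonal_point(distance):
    """Point on the main diagonal whose distance from the origin equals `distance`."""
    coord = distance / math.sqrt(2.0)
    return (coord, coord)


if __name__ == "__main__":
    print(round(EPS, 4))           # numerical value of the margin
    print(diagonal_point(EPS))     # boundary point on the main diagonal
    print(passes_hwe_check(0.30))  # hypothetical upper confidence bound
```

With this choice of margin, the boundary point on the main diagonal has both coordinates equal to log(1.4), so that each sex-specific distance is individually bounded by log(1.4) there.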
In order to have a broader basis for comparing the new testing procedure with the inverted traditional χ²-test for lack of fit, we generated by simulation 100,000 samples of varying sizes consisting of genotype distributions observed at an X-chromosomal SNP in a population with pre-specified parameters, and computed the rejection rates of both procedures. For the first half of these simulations, the parameters were chosen as in the 7th horizontal block of Table 1, studying the behavior of the tests under a configuration belonging to the null hypothesis of relevant deviations from HWE. In the first and second of these simulation experiments, the number of genotyped subjects per sample was 100 and 1200, respectively, for both females and males. The rejection rates obtained with these data are shown in the upper half of Table 6. The other part of the simulation experiments, whose results are also summarized in Table 6, was run to compare both procedures in terms of the power against null alternatives, generating the data under the parameter configuration appearing in the middle block of Table 3. Not surprisingly, the outcome of comparisons of that kind depends strongly on the sample size: for small sample sizes, inverting the lack-of-fit test in the naive way entails gross exceedances of the target significance level, whereas in large samples, the same procedure becomes grossly overconservative. In the latter case, the power falls distinctly below that of the correct goodness-of-fit test; in the former, it provides a strong pseudo-advantage in power. Another inherent feature of the inverted lack-of-fit test that becomes conspicuous from the entries in Table 6 is that its power against null alternatives is constant (except for minor deviations due to the large-sample approximations involved) rather than increasing in the sample sizes.
Thus, it is lacking a property to be required of any statistical decision procedure which merits being called a test of significance.
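The constancy of the power of a naively inverted lack-of-fit test under exact HWE can be reproduced with a small Monte Carlo sketch. Several simplifications are assumed here and should not be mistaken for the setup behind Table 6: the sketch uses the classical 1-df χ²-statistic for a trinomial sample (females only) rather than the combined test of [7], it takes "naive inversion" to mean declaring goodness of fit whenever the lack-of-fit test does not reject at the 5% level, and the allele frequency, number of replications and seed are arbitrary choices.

```python
import random

CHI2_CRIT_1DF = 3.841  # 95% quantile of the chi-square distribution with 1 df


def hwe_chi2(n_aa, n_ab, n_bb):
    """Classical 1-df chi-square statistic for HWE in a trinomial sample."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # estimated frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 0.0  # monomorphic sample: no evidence against HWE
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def naive_inverted_rate(n, allele_freq, reps, rng):
    """Share of simulated samples in which the lack-of-fit test does NOT reject."""
    p, q = allele_freq, 1.0 - allele_freq
    weights = (p * p, 2 * p * q, q * q)  # genotype probabilities under exact HWE
    hits = 0
    for _ in range(reps):
        draws = rng.choices((0, 1, 2), weights=weights, k=n)
        counts = [draws.count(g) for g in (0, 1, 2)]
        if hwe_chi2(*counts) <= CHI2_CRIT_1DF:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    rng = random.Random(2019)
    for n in (100, 1200):
        print(n, naive_inverted_rate(n, 0.5, 1000, rng))
```

Under exact HWE the χ²-statistic has approximately the same distribution for every sample size, so the simulated rate should stay close to 95% for both values of n instead of increasing with n.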

Sample size calculation for the test for goodness of fit to HWE
The sample sizes shown in the second and third columns from the right of Table 7 are obtained by applying formula (22) to a selection of specific non-null alternatives, again for a nominal significance level of 5% and with the equivalence margin ε chosen as proposed in Subsection 2 of M&M. Comparing the exact power attained with these approximate sample sizes, which is shown in the right-most column of the table, with the target power of 80% reveals that the accuracy of the approximation is acceptable for settings for which it has to be expected that the number of male subjects is a multiple of the size of the sample of females. In strongly unbalanced cases of the complementary kind, the approximation error becomes much too large for formula (22) to be useful for real applications. Even when $n_2$ has to be much larger than $n_1$, using (22) for sample size planning of a study where interest is in controlling the power against a non-null alternative leads to marked underestimation of the exact numbers of subjects. Evaluation of the accuracy provided by formula (28) was performed along the same lines as for formula (22): for a selection of null alternatives $(\pi^{\circ}_1, \pi^{\circ}_2, \pi^{\circ}_3, \pi^{\circ}_Y)$ and values of the proportion λ of females among all subjects to be recruited, the target power was compared with the exact power attained with the sample sizes required according to the approximation formula. Overall, the results of these comparisons, which are shown in Table 8, are distinctly more satisfactory than those obtained with formula (22) for alternatives which, in terms of the distance measure Δ, fall in between zero and the equivalence margin ε. Except for the low-power settings with 1 − β = 0.60, which are of limited relevance for real applications, the maximum of the absolute difference between exact and target power taken over all parameter configurations covered by the table is less than 3%.
More often than not, the solution obtained by means of the formula turns out conservative, in the sense of (slightly) overestimating the sample sizes effectively required.

Discussion
It was demonstrated over a decade ago that autosomal SNPs could be tested for HWE in a way that is logically adapted to the needs of genetic association studies. It has never been explicitly challenged that this requires treating goodness rather than lack of fit to the model as the hypothesis to be established. The equivalence test to be performed for establishing goodness of fit has been made available both as an exact optimal procedure [15] and as an asymptotic procedure that is particularly attractive for practitioners because of its computational simplicity [17]. Nevertheless, the process of revising the practice of genetic association studies by switching from lack-of-fit to goodness-of-fit testing in the HWE-related part of preliminary data analysis has taken place only hesitantly up to now. The problem of extending HWE testing to X-chromosomal SNPs has been addressed in the literature only recently, and the authors of the pertinent papers [7,10,11,13] adopt the traditional perspective of treating the statement that the distribution underlying the data satisfies the model as the null hypothesis.
As is generally the case in the derivation of equivalence testing methods, we had to start with making precise the notion of "sufficient closeness" between the true and the HWE-conforming joint distribution of the genotype frequencies for females and the allele frequency in the subpopulation of males through defining a suitable distance measure. This was done in two steps: First, we introduced separate distance measures for the trinomial genotype distribution among females and the binomial distribution of the count of the allele of interest (denoted by A) among male subjects. Considering the female subpopulation only, the problem of measuring the amount of disequilibrium is the same as in the case of an autosomal diallelic marker. In the existing literature on the latter, convincing arguments can be found for looking at the deviation of the relative excess heterozygosity (REH), defined as 1/2 times the frequency of heterozygotes over the geometric mean of the population frequencies of both homozygotic [...]
Table 7. Sample sizes approximated by means of formula (22) and exact power effectively attained with them against selected non-null alternatives of the form considered in Table 3
[...] shape. This rectangle needs to be neither a square nor centered about the origin so that, in principle, 4 margins have to be specified numerically, which considerably complicates the process of finding a consensus about how to make the testing problem fully precise. Insisting nevertheless on testing for equivalence in the sense that $-\varepsilon'_1 < \log \mathrm{REH} < \varepsilon'_2$ and $-\varepsilon''_1 < \log \mathrm{OR} < \varepsilon''_2$ hold raises a problem for which an asymptotic solution is comparatively easy to derive exploiting the results of Section 3. The construction of such a test can be carried out through combining separate tests for equivalence in terms of log REH and log OR by means of the intersection-union principle.
The details of this construction as well as an analysis of basic properties of the resulting procedure are left to a future publication.
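Pending that publication, the general shape of an intersection-union construction can be illustrated with a generic sketch. Everything specific in it is an assumption made for illustration: each component test is taken to be a standard Wald-type interval inclusion rule based on an asymptotically normal estimator, the point estimates and standard errors in the usage lines are hypothetical inputs rather than the estimators of this paper, and the margins are simply set to log(1.4) on both sides.

```python
import math
from statistics import NormalDist


def log_reh(pi_aa, pi_ab, pi_bb):
    """log REH, with REH = pi_ab / (2 * sqrt(pi_aa * pi_bb))."""
    return math.log(pi_ab / (2.0 * math.sqrt(pi_aa * pi_bb)))


def interval_inclusion(estimate, std_err, lower_margin, upper_margin, alpha=0.05):
    """Wald-type equivalence test for -lower_margin < theta < upper_margin.

    Rejects the null hypothesis of non-equivalence if the two-sided
    (1 - 2*alpha) confidence interval lies strictly inside the range.
    """
    z = NormalDist().inv_cdf(1.0 - alpha)
    lower = estimate - z * std_err
    upper = estimate + z * std_err
    return lower > -lower_margin and upper < upper_margin


def intersection_union(reh_test_rejects, or_test_rejects):
    """Declare joint equivalence only if both level-alpha component tests reject."""
    return reh_test_rejects and or_test_rejects


if __name__ == "__main__":
    margin = math.log(1.4)  # illustrative symmetric margin
    # Hypothetical estimates and standard errors for log REH and log OR.
    reh_ok = interval_inclusion(0.05, 0.06, margin, margin)
    or_ok = interval_inclusion(-0.10, 0.08, margin, margin)
    print(intersection_union(reh_ok, or_ok))
```

Because the overall null hypothesis is the union of the two component null hypotheses, no multiplicity adjustment of the component levels is needed under the intersection-union principle: each component test is carried out at the full level α.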
From a technical perspective, the most innovative result of the paper is the derivation of an approximation to the rejection probability at the boundary of the range of the parameter of interest of a test based on a statistic shown to be asymptotically normal at any interior point of the parameter space. The corresponding sample size formula provides reasonable numerical accuracy and involves as the only non-elementary ingredient the inverse of a distribution function which can easily be computed by means of standard tools of numerical analysis. For the implementation of the formula, a SAS/IML and an R script are available for download from the website hosting supporting materials (see S1 Programs).