Abstract
The distribution-free approach to the construction of age-dependent reference centiles, which was originally published by this author in 1995 and has since been applied in a multitude of large-scale studies, has never been investigated from a sample-size planning perspective. In the present paper, this gap is filled using the precision criterion introduced by Jennen-Steinmetz and Wellek (2005) for the estimation of reference centiles for quantitative diagnostic markers that are independent of other variables, and extended by Jennen-Steinmetz (2014) to the study of age-dependent markers. In the age-dependent case, that criterion does not admit an exact representation as a function of the sample size, even when interest is in estimating a one-sided reference limit. Hence, all sample-size results presented here are based on Monte Carlo simulation. The computations cover a broad range of conditional distributions of the marker at given age, including both symmetric and positively skewed distributions. For the relationship between the conditional standard deviation and age, a linear function with different slopes was assumed. Except for the most extreme settings investigated, the sample sizes shown in the tables summarizing our numerical results do not exceed the order of magnitude which was available for a recent, potentially very influential reference-value study of basic parameters making up the normal fetal growth profile. Furthermore, our results suggest that in terms of sample-size requirements, the distribution-free approach of Wellek & Merz (1995) to the construction of age-dependent reference ranges is typically a good bit more efficient than reference-range determination by means of quantile regression.
Citation: Wellek S (2025) An investigation into the statistical precision attainable with a distribution-free method of constructing age-dependent reference centiles. PLoS One 20(8): e0330330. https://doi.org/10.1371/journal.pone.0330330
Editor: Tadashi Ito, Aichi Prefectural Mikawa Aoitori Medical and Rehabilitation Center for Developmental Disabilities, JAPAN
Received: September 12, 2024; Accepted: July 25, 2025; Published: August 14, 2025
Copyright: © 2025 Stefan Wellek. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the manuscript.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
By definition (see, e.g., [1–6]), reference limits for quantitative diagnostic markers are quantiles of the distribution followed by the respective random variable in the population of individuals free of the disease to be detected. Since the latter is not accessible to observation as a whole, reference limits have in practice to be estimated from samples taken from that population. In the simplest case of a marker which does not depend on additional variables like age or body weight, the sample consists of n values that are realizations of independent, identically distributed random variables $Y_1, \dots, Y_n$, say, with continuous cumulative distribution function (cdf) $F$. The parametric estimates of the limits of a two-sided reference region with coverage (i.e., probability content) q are given by

$$L_n = \bar{Y} - z_{(1+q)/2}\, S, \qquad U_n = \bar{Y} + z_{(1+q)/2}\, S,$$

where $\bar{Y}$ and $S$ denote, respectively, the mean and standard deviation of the $Y_i$'s, and $z_{(1+q)/2}$ the upper $100(1-q)/2$ percentage point of the standard normal distribution. Nonparametrically, the estimated reference limits $L_n$ and $U_n$ are obtained by determining the order statistics of rank $r_n((1-q)/2)$ and $r_n((1+q)/2)$, defined as the smallest integer exceeding $n(1-q)/2$ and $n(1+q)/2$, respectively. Despite the simplicity of these estimators, a useful criterion of their statistical precision was missing until the mid-2000s, where "useful" means in particular that it can be expressed in terms of a quantity which depends in a sufficiently regular way on the sample size n. The starting point of the approach developed by Jennen-Steinmetz & Wellek [7] for filling this gap is the fact that the crucial property of any estimator of a reference limit is the coverage it provides of the distribution of the diagnostic marker Y in the underlying population of non-diseased individuals. Neither of the observed coverages $F(L_n)$ and $F(U_n)$ attained by estimating the individual limits of the reference interval should deviate from its target value $(1-q)/2$ and $(1+q)/2$, respectively, in either direction by more than some small tolerances $\delta_1$ and $\delta_2$. More precisely, the criterion requires that, when only one of the two limits is thought to be of diagnostic relevance, the respective double inequality

$$(1-q)/2 - \delta_1 \le F(L_n) \le (1-q)/2 + \delta_1$$

or

$$(1+q)/2 - \delta_2 \le F(U_n) \le (1+q)/2 + \delta_2$$

holds with high probability β, termed "confidence probability" in the sequel and to be computed under the assumption that the true distribution of Y is in fact F. In the two-sided case, i.e., when both "unusually" small and large values of Y are considered as possible disease indicators, β is defined to denote the probability of the joint event

$$\big\{(1-q)/2 - \delta_1 \le F(L_n) \le (1-q)/2 + \delta_1\big\} \cap \big\{(1+q)/2 - \delta_2 \le F(U_n) \le (1+q)/2 + \delta_2\big\}.$$
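For the nonparametric estimator in the age-independent case, the one-sided confidence probability can be evaluated without simulation, because the coverage $F(Y_{(r)})$ of an order statistic of rank r from a continuous distribution follows a Beta(r, n − r + 1) distribution. The following is a minimal sketch of that computation, not the procedure of [7]: the function names, the rank rule (smallest integer exceeding np) and the search over multiples of 10 are assumptions made for illustration.

```python
from math import exp, floor, lgamma, log

def rank(n, p):
    """Smallest integer exceeding n*p, kept within 1..n (assumed rank rule)."""
    # round() guards against floating-point noise such as 99.99999999999999
    return min(max(floor(round(n * p, 9)) + 1, 1), n)

def binom_cdf(k, n, x):
    """P[Bin(n, x) <= k] for 0 < x < 1, summed in log space for stability."""
    if k < 0:
        return 0.0
    total = 0.0
    for j in range(0, min(k, n) + 1):
        total += exp(lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1)
                     + j * log(x) + (n - j) * log(1.0 - x))
    return min(total, 1.0)

def conf_prob_one_sided(n, p, delta):
    """P[p - delta <= F(Y_(r)) <= p + delta] for the order statistic of rank r."""
    r = rank(n, p)

    def coverage_cdf(x):
        # P[F(Y_(r)) <= x] = P[Bin(n, x) >= r] = 1 - P[Bin(n, x) <= r - 1]
        if x <= 0.0:
            return 0.0
        if x >= 1.0:
            return 1.0
        return 1.0 - binom_cdf(r - 1, n, x)

    return coverage_cdf(p + delta) - coverage_cdf(p - delta)

def smallest_n(p, delta, beta=0.90, step=10, n_max=20000):
    """First multiple of `step` at which the confidence probability reaches beta."""
    for n in range(step, n_max + 1, step):
        if conf_prob_one_sided(n, p, delta) >= beta:
            return n
    return None

# Lower limit of a two-sided 90% reference interval: target coverage (1 - q)/2 = 0.05
n_required = smallest_n(p=0.05, delta=0.02)
```

Because the rank is an integer, the confidence probability is not strictly monotone in n, so the value returned by `smallest_n` is the first grid point meeting the criterion rather than a guaranteed threshold for all larger n.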
Constructing reference centiles for an age-dependent quantitative diagnostic marker Y requires that the endpoints $L_n$, $U_n$ of a single interval be replaced with a pair $L_n(t)$, $U_n(t)$ of functions of time to be computed from a sample of pairs $(t_1, Y_1), \dots, (t_n, Y_n)$, with $t_i$ denoting, for each $i = 1, \dots, n$, the age at which $Y_i$ was measured. Roughly speaking, the distribution-free approach developed by Wellek & Merz [3] consists of determining $L_n(t)$, $U_n(t)$ in such a way that the band in the (t,y)-plane with the curves corresponding to these functions as boundaries contains (almost) exactly a proportion of 100q percent of the data points $(t_i, Y_i)$, with equal proportions of $100(1-q)/2$ percent falling above the upper and below the lower boundary curve, respectively. The approach has proved in numerous applications to be a worthwhile alternative to the parametric methods proposed and refined by a number of authors (see [1,2,8–12]). Classing the procedure as distribution-free is justified by the fact that for each t, the conditional distribution function $F_t$ of Y is assumed to be of the form

$$F_t(y) = F_0\big((y - \mu(t))/\sigma(t)\big),$$

where $\mu(\cdot)$ and $\sigma(\cdot)$ are continuous functions of time (age) and the baseline distribution function $F_0$ is likewise assumed continuous but otherwise left fully unspecified. In the version that is the object of the numerical investigations presented in this paper, the distribution-free construction of reference bands involves the following major steps.
- STEP 1: Fitting by ordinary least-squares estimation a regression function $\hat{\mu}(\cdot)$ to the data $(t_1, Y_1), \dots, (t_n, Y_n)$ which describes the relationship in the mean between Y and t and yields a curve to be used as a pivotal line for the reference band.
- STEP 2: Replacing the theoretical scale-factor function $\sigma(\cdot)$ by an estimate $\hat{\sigma}(\cdot)$ obtained through fitting a regression line to the absolute residuals $|Y_i - \hat{\mu}(t_i)|$, $i = 1, \dots, n$.
- STEP 3: Computing for the sample $\big(Y_i - \hat{\mu}(t_i)\big)/\hat{\sigma}(t_i)$, $i = 1, \dots, n$, of signed scaled residuals the order statistics of the same ranks $r_n((1-q)/2)$ and $r_n((1+q)/2)$ as used for nonparametric estimation of the reference limits for an age-independent diagnostic marker.
- STEP 4: Computing the boundaries of the reference band as

$$L_n(t) = \hat{\mu}(t) + k_l(n,q)\,\hat{\sigma}(t), \qquad U_n(t) = \hat{\mu}(t) + k_u(n,q)\,\hat{\sigma}(t), \tag{3}$$

where $k_l(n,q)$ and $k_u(n,q)$ are the order statistics obtained in STEP 3. (For a proper understanding of these equations, it is important to note that the coefficients $k_l(n,q)$ and $k_u(n,q)$ have opposite signs.)
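The four steps can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the author's SAS/IML implementation: a polynomial is used for the mean curve (the paper's Scenario I uses a nonlinear growth model instead), the rank rule is taken to be the smallest integer exceeding np, and the names `fit_reference_band`, `rank` and the simulated data are hypothetical.

```python
import numpy as np

def rank(n, p):
    """Smallest integer exceeding n*p, kept within 1..n (assumed rank rule)."""
    return int(min(max(np.floor(round(n * p, 9)) + 1, 1), n))

def fit_reference_band(t, y, q=0.90, degree=2):
    """Distribution-free reference band with two-sided target coverage q."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    n = y.size
    # STEP 1: least-squares fit of the mean curve (polynomial, by assumption)
    mu_coef = np.polyfit(t, y, degree)
    resid = y - np.polyval(mu_coef, t)
    # STEP 2: regression line fitted to the absolute residuals
    sig_coef = np.polyfit(t, np.abs(resid), 1)
    # STEP 3: order statistics of the signed scaled residuals
    scaled = np.sort(resid / np.polyval(sig_coef, t))
    k_l = scaled[rank(n, (1 - q) / 2) - 1]
    k_u = scaled[rank(n, (1 + q) / 2) - 1]

    # STEP 4: boundaries mu_hat(t) + k * sigma_hat(t)
    def band(t_new):
        t_new = np.asarray(t_new, dtype=float)
        m = np.polyval(mu_coef, t_new)
        s = np.polyval(sig_coef, t_new)
        return m + k_l * s, m + k_u * s

    return band, k_l, k_u

# Hypothetical example data: linear mean, dispersion increasing linearly with age
rng = np.random.default_rng(0)
t = rng.integers(10, 42, size=2000)
y = 5.0 + 2.5 * t + (1.0 + 0.1 * t) * rng.standard_normal(2000)
band, k_l, k_u = fit_reference_band(t, y)
lower, upper = band(t)
inside = np.mean((y >= lower) & (y <= upper))  # empirical coverage of the band
```

By construction, the empirical proportion of data points inside the band is close to q, and `k_l` and `k_u` come out with opposite signs as long as the scaled residuals are centered near zero.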
The availability of a theoretically well-grounded method of sample-size planning is a basic desideratum also for studies aiming to establish reference values for diagnostic markers depending on patient’s age. Actually, dependence of a diagnostic marker on age (or some other quantitative covariate of the continuous type) does not alter the fact that reference limits are unsuitable for use in medical practice unless they are based on a study having included a sufficiently large number of individuals being free of the disease to be diagnosed. Relying on a sample of smaller size than necessary increases the risk of giving rise to a diagnostic test of a specificity which deviates unacceptably far from its target value (typically set at 95%). Including more individuals than necessary has likewise to be avoided, due both to ethical reasons and costs. The criterion proposed here for sample-size planning is conceptually analogous to the power in the context of a clinical trial to be analyzed by means of a test of significance for the hypothesis of primary interest.
A flow-chart allowing for a quick glance over the organization of the body of the paper is shown in Fig 1.
2 Methods
2.1 Extending the concept of confidence probability to distribution-free estimates of age-dependent reference centiles
Conceptually, extending the definition of confidence probability made explicit in the Introduction for the age-independent case to a precision criterion for reference bands constructed by means of the proposed distribution-free method is largely straightforward. One starts by evaluating at fixed time t the local precision of the interval $[L_n(t), \infty)$ and $(-\infty, U_n(t)]$, respectively, taken as a reference range for the conditional distribution of Y given T = t. Subsequently, the least favorable of these local precision values is determined and used as a global precision criterion for the estimated limit(s) of the reference region as a whole. Considering once more the one-sided case first, the global confidence probabilities $\beta_L$ and $\beta_U$, say, are given by

$$\beta_L = \min_{t \in \mathcal{T}} P_{\mathbf{t}}\big[(1-q)/2 - \delta_1 \le F_t(L_n(t)) \le (1-q)/2 + \delta_1\big] \tag{4.a}$$

and

$$\beta_U = \min_{t \in \mathcal{T}} P_{\mathbf{t}}\big[(1+q)/2 - \delta_2 \le F_t(U_n(t)) \le (1+q)/2 + \delta_2\big]. \tag{4.b}$$

In the above equations, $\delta_1$ and $\delta_2$ are the tolerances introduced for the age-independent case, $P_{\mathbf{t}}$ denotes the joint distribution (conditional on $\mathbf{t} = (t_1, \dots, t_n)$) of the sample $(Y_1, \dots, Y_n)$ from which $L_n$ and $U_n$ are to be calculated, and $\mathcal{T}$ the (finite) set of points in time at which measurements of Y will be taken. In other words, the global confidence probability is defined by (4.a) and (4.b) as the smallest local confidence probability attained by $L_n(t)$ and $U_n(t)$, respectively, in the conditional distribution of Y at any age t envisaged according to the time schedule of the study for taking measurements of Y. If interest is in assessing simultaneously the global precision to be attributed to both the lower and the upper boundary of the reference band, the sample size has to be chosen large enough to yield a sufficiently large value of the confidence probability $\beta_{(L,U)}$ as defined by

$$\beta_{(L,U)} = \min_{t \in \mathcal{T}} P_{\mathbf{t}}\big[(1-q)/2 - \delta_1 \le F_t(L_n(t)) \le (1-q)/2 + \delta_1,\ (1+q)/2 - \delta_2 \le F_t(U_n(t)) \le (1+q)/2 + \delta_2\big]. \tag{5}$$
2.2 Simulation procedure for sample-size calculation based on the confidence-probability criteria of Sect 2.1
Even for fixed t, the distribution of the statistics $F_t(L_n(t))$, $F_t(U_n(t))$ admits explicit representations only when the set $\mathcal{T}$ of measurement times consists of just a single point, i.e., in the age-independent case. In view of this, Monte Carlo simulation rather than exact integration with respect to the joint distribution of the sample is used throughout for computing the confidence probabilities of (4.a), (4.b) and (5). The simulation scenarios to which the results reported in Sect 3 relate were designed to reflect basic features of two typical studies of reference centiles for quantities monitored in prenatal diagnostics.
- I. The first scenario mimics some part of the study of Merz et al. [13] devoted to establishing age-dependent reference limits for quantities making up the fetal growth profile. The selected diagnostic marker is the biparietal diameter of the fetus for which the data and reference limits shown in Fig 2 were found in that study.
For a proper understanding of this and all subsequent graphs, it is important to realize that due to limitations in measurement accuracy, the points of the cloud representing the raw data do not correspond throughout to individuals but small groups of varying size. Therefore, they do not allow for precise verification of the coverage proportion attained by the reference band. Nevertheless, the approximate symmetry of the conditional distributions obtained at different (gestational) ages becomes fairly conspicuous.
The simulation scenario corresponding to the dataset shown in Fig 2 was designed to meet the following specifications.
- i). Timing of measurements: $\mathcal{T}$ = set of gestational ages [weeks] between 10 and 41 inclusive; $\mathbf{t}$ = vector of measurement times, fixed for all runs of a given Monte Carlo experiment and generated by drawing n values from a discretized beta distribution with a pair of shape parameters on the set $\mathcal{T}$. (In the special case where both parameters equal 1, $\mathbf{t}$ corresponds to a fixed uniform design.)
- ii). Modelling the regression function $\mu(\cdot)$: The growth model proposed and extensively applied to data from prenatal ultrasound by Wellek & Merz [14] was used. This model assumes that there holds
where
are unknown parameters to be estimated using the data, and
and
, respectively, denotes the range of t (set equal to [10,41] under the specifications made in (i)) and the cdf of a beta distribution with parameters a and b. In generating the data, it was assumed that the estimates a = 1.1494, b = 1.5265, c = 84.0244, d = 0.1992 obtained in the study from which Fig 2 is taken, are the true values of the parameters of the growth model.
- iii). Modelling the dispersion function $\sigma(\cdot)$: Throughout, $\sigma(\cdot)$ was chosen to be a linear function of $t$, with intercept and slope to be determined by regressing on t the absolute residuals obtained through fitting $\mu(\cdot)$ to the respective simulated sample $(t_1, Y_1), \dots, (t_n, Y_n)$.
- iv). Specifying the baseline cdf $F_0$ of the conditional distribution of $Y$ at age t: The following basic forms of the age-specific distributions were investigated:
- a). Gaussian
- b). Laplace
- c). Log-normal
- d). Gamma
.
Note. All these distributions have mean zero.
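The design of measurement times and the mean-zero baseline distributions can be sketched as follows. This is a rough illustration only: the discretization rule for the beta distribution, the shape parameters of the log-normal and gamma distributions, and the function names `design_times` and `centered_errors` are assumptions that are not taken from the paper.

```python
import numpy as np

def design_times(n, shape, ages=np.arange(10, 42), rng=None):
    """Draw n measurement times from a discretized symmetric beta distribution.

    shape == 1 gives a uniform design, shape > 1 an umbrella-shaped one,
    shape < 1 a U-shaped one. Discretization by equal-width bins is assumed.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.beta(shape, shape, size=n)
    idx = np.minimum((u * len(ages)).astype(int), len(ages) - 1)
    return ages[idx]

def centered_errors(kind, size, rng):
    """Draw errors with mean zero from one of the four baseline families."""
    if kind == "gaussian":
        return rng.standard_normal(size)
    if kind == "laplace":
        return rng.laplace(0.0, 1.0, size)
    if kind == "lognormal":
        s = 0.5  # hypothetical log-scale standard deviation
        return rng.lognormal(0.0, s, size) - np.exp(s**2 / 2)  # subtract the mean
    if kind == "gamma":
        k = 2.0  # hypothetical shape parameter
        return rng.gamma(k, 1.0, size) - k  # subtract the mean
    raise ValueError(f"unknown distribution: {kind}")

rng = np.random.default_rng(1)
t_uniform = design_times(1000, 1.0, rng=rng)
t_umbrella = design_times(1000, 3.0, rng=rng)
errors = {kind: centered_errors(kind, 200_000, rng)
          for kind in ("gaussian", "laplace", "lognormal", "gamma")}
```

Centering is done by subtracting the theoretical mean, so the skewed families keep their right-hand tail while satisfying the mean-zero requirement stated in the note.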
Figs 3, 4, 5, 6, and 7 show simulated data sets consisting of observations generated from different versions of Scenario I, each together with the estimated reference band of 2-sided target coverage q = 0.90.
The population from which the simulated dataset shown in Fig 3 has been taken is a particularly regular case within the scope of Scenario I: All conditional distributions satisfy a model which differs from a classical linear model only by the nonlinear form of the regression function. Furthermore, the distribution of age is uniform over the range [10,41] (weeks).
The only, although crucial, difference between the model underlying Fig 4 and that underlying Fig 3 is that this time the distribution of gestational age is umbrella-shaped instead of uniform.
In the example of Fig 5, the conditional distributions are likewise the same as in Fig 2. However, sampling was done in a way ensuring that both extreme tails of the distribution of age are distinctly heavier than a central interval of the same length.
Under the log-normal model, the variance is proportional to the mean implying that the reference band shown in Fig 6 increases in vertical width from left to right.
Comparing Fig 7 with the previously shown diagram, one finds it corroborated that both the lognormal and the gamma distributions are skewed to the right. With the parameters set to the values chosen in both examples, this skewness is distinctly more marked for the gamma as compared to the lognormal model.
- II. The other scenario was constructed referring to the reference band presented by Merz & Pashaj [15] for the frontal fetal facial angle (FFFA) as measured by 3D ultrasound. The crucial difference as compared with Scenario I concerns the form of the regression function $\mu(\cdot)$
which is now assumed to be a polynomial of 4th degree. For its value at age t we write
. In the simulations using Scenario II, the data were generated setting
. (These are the estimates obtained by Merz & Pashaj). The range
of ages t of gestation [days] at which measurements of FFFA are taken in that study is assumed to be
The solid curves are the limits of the estimated reference band of 2-sided target coverage q = 0.90.
As software, SAS/IML was used throughout, generating separate single data streams for all repetitions of the same Monte Carlo experiment. The target of all simulation studies performed was the sample size n required for ensuring that the confidence probability in one of the definitions given in (4.a), (4.b) and (5) does not fall short of 0.90. Throughout, the tolerances $\delta_1$ and $\delta_2$ upon which the confidence probabilities depend were chosen to be 0.02. In view of the inherent limitations regarding the exactness of simulated values of the probabilities used in the present context as criteria of statistical precision, all sample sizes have been determined as multiples of 10. For scenarios with a polynomial regression of Y on t, the results were compared to those obtained by Jennen-Steinmetz [16] for reference regions constructed by means of quantile regression in the sense of Koenker [17].
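The simulation logic described in this section can be sketched in Python as follows. This is a simplified illustration rather than a port of the SAS/IML programs: it assumes a Gaussian baseline distribution, a uniform design on the ages 10 to 41, a quadratic mean curve with hypothetical coefficients, and a hypothetical linear dispersion function; the function names are likewise made up for this sketch.

```python
import numpy as np
from math import erf, sqrt

def std_normal_cdf(x):
    """Standard normal cdf, evaluated elementwise without SciPy."""
    return 0.5 * (1.0 + np.vectorize(erf)(np.asarray(x) / sqrt(2.0)))

def simulate_conf_probs(n, q=0.90, delta=0.02, slope=0.0, n_rep=1000, seed=0):
    """Monte Carlo estimates of the global confidence probabilities (4.a), (4.b), (5)."""
    rng = np.random.default_rng(seed)
    ages = np.arange(10.0, 42.0)                  # grid of ages 10..41
    t = rng.choice(ages, size=n)                  # design drawn once, then held fixed
    mu = lambda s: 5.0 + 2.5 * s - 0.01 * s**2    # hypothetical true mean curve
    sigma = lambda s: 1.0 + slope * (s - 10.0)    # hypothetical linear dispersion
    p_l, p_u = (1 - q) / 2, (1 + q) / 2
    r_l = min(int(np.floor(round(n * p_l, 9))) + 1, n)   # assumed rank rule
    r_u = min(int(np.floor(round(n * p_u, 9))) + 1, n)
    X, XA = np.vander(t, 3), np.vander(ages, 3)   # quadratic model for the mean
    S, SA = np.vander(t, 2), np.vander(ages, 2)   # linear model for |residuals|
    hit_l = np.empty((n_rep, ages.size), dtype=bool)
    hit_u = np.empty_like(hit_l)
    for rep in range(n_rep):
        y = mu(t) + sigma(t) * rng.standard_normal(n)
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        resid = y - X @ b
        g = np.linalg.lstsq(S, np.abs(resid), rcond=None)[0]
        z = np.sort(resid / (S @ g))
        k_l, k_u = z[r_l - 1], z[r_u - 1]
        m_hat, s_hat = XA @ b, SA @ g
        # true conditional coverage of the estimated limits at every age on the grid
        cov_l = std_normal_cdf((m_hat + k_l * s_hat - mu(ages)) / sigma(ages))
        cov_u = std_normal_cdf((m_hat + k_u * s_hat - mu(ages)) / sigma(ages))
        hit_l[rep] = np.abs(cov_l - p_l) <= delta
        hit_u[rep] = np.abs(cov_u - p_u) <= delta
    return {
        "beta_L": hit_l.mean(axis=0).min(),       # least favorable age, lower limit
        "beta_U": hit_u.mean(axis=0).min(),       # least favorable age, upper limit
        "beta_LU": (hit_l & hit_u).mean(axis=0).min(),
    }

small = simulate_conf_probs(200, n_rep=200)
large = simulate_conf_probs(2000, n_rep=200)
```

A sample-size search in the spirit of the text would call `simulate_conf_probs` on a grid of multiples of 10 and report the smallest n whose simulated confidence probability reaches 0.90; with the small number of replications used here, the results are only rough.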
3 Results
A complete account of the sample sizes calculated for Scenario I with a variety of specifications of the parameters involved is given in Table 1.
The results show that the sample-size requirements for a reference-value study to be performed in this scenario depend heavily on the form of the conditional distributions, the dispersion function, and the pattern of measurement times. The most obvious and consistent relationship between one of these properties of the study design and the required sample size n has to be attributed to the slope of the dispersion function. All else being equal, n is a monotonically increasing function of this slope. The steepness of this increase is considerable indeed. The factor by which the sample size has to be multiplied when a constant dispersion function is replaced with a linear function of slope 0.2450 depends on all other parameters varied across the simulation experiments and may be as large as about 3.5. Effects which are fully uniform in all other parameters result from varying the pattern of measurement times: When the common value of the two parameters of the beta distribution of measurement times is chosen as a number >1.0, the values of T are taken from a discretized symmetric unimodal density, implying that observations at ages in the tails of the distribution of T are scarce, which leads to increased values of n. Using designs with a U-shaped distribution of T, i.e., letting both beta parameters fall below unity so that the distribution of T exhibits heavy tails, reduces the sample sizes below those required for a uniform design.
For right-skewed conditional distributions, the sample size required for estimating at a given level of precision the limit of a one-sided reference region is considerably larger when the estimand is a lower bound as compared with the corresponding right-hand reference limit. In that case, the amount of skewness markedly increases the sample sizes required for estimating lower reference limits. In contrast, the sample sizes needed for estimating upper reference limits tend to be smaller for distributions with positive skewness as compared with symmetric distributions. In symmetric cases, nL and nU coincide except for simulation errors, which is in accordance with intuition. Furthermore, the sample sizes n(L,U) required for studies aiming to estimate a two-sided reference region are a good bit larger than the sample sizes needed for estimating one-sided reference limits, exceeding the latter by up to 63% [Laplacian distribution for a uniform design with slope 0.2450 of the dispersion function]. In contrast, when the distributional shape exhibits skewness on the right, n(L,U) is only slightly larger than the sample size nL for the corresponding left-hand estimation problem.
Even between distributions of the symmetric type, there are marked differences: With Gaussian conditional distributions, the sample-size requirements are uniformly considerably larger as compared with their Laplacian counterparts. Some of the differences exceed the 100% -bound [ umbrella-shaped design with
and slope 0.2450 of
]. The comparisons between the lognormal and the gamma distribution do not reveal a uniform pattern: There are constellations for which a study of an age-dependent lognormally distributed diagnostic marker requires larger sample sizes than a gamma-distributed marker, as well as instances where this difference has the opposite sign.
The differences between homologous entries in Tables 1 and 2 are moderate. This is not surprising since the regression models according to which data generation is carried out in Scenarios I and II are of comparable complexity. In terms of the dimension of the unknown parameters to be estimated from the data, the two cases differ just by unity. Computationally, the second scenario is a good bit easier to handle than Scenario I since in the latter the estimates of the parameters of the regression function must be determined by means of an iterative algorithm instead of explicit formulae.
4 Discussion
A natural competitor to the distribution-free method of constructing age-dependent reference centiles investigated in this paper is quantile regression (QR) as introduced by Koenker [17].
The paper by Jennen-Steinmetz [16] on sample-size determination for studies aiming to establish reference centiles for age-dependent quantitative diagnostic markers also covers the QR-based approach, restricting the model used for describing the dependence of the marker on time to polynomials of second degree at most. Clearly, in any given setting and based on the same criterion of precision, estimating a reference centile curve when the true underlying regression on age is a quadratic polynomial requires considerably lower sample sizes as compared with a marker depending on time according to polynomial regression of degree 4.
Furthermore, in Scenario II, for a uniform design, i.e., with both parameters of the beta distribution of measurement times equal to 1, choosing the slope of the dispersion function $\sigma(\cdot)$ to be equal to 0.0000, 0.2620 and 0.7860 [cf. Column 4 of Table 2] is equivalent to setting the corresponding parameter in the tables of Jennen-Steinmetz [16] equal to 0, 1 and 3, respectively. Thus, the entries in Columns 10 and 12 of Line 12 of Jennen-Steinmetz' Table 2 can be directly compared to the values of nL and nU shown in the upper three lines of our Table 2. Although they hold for a setting with quadratic rather than polynomial regression of degree 4, the former are substantially larger than the latter (
vs <2200 and
vs <3000 for a $\sigma(\cdot)$ with slope 0.0000 and 0.2620, respectively) as long as the conditional distributions are symmetric. Even when the baseline distribution $F_0$ is markedly right-skewed, the sample sizes required for our approach exceed those for the QR approach only when interest is in estimating a lower reference limit, whereas for estimation of an upper limit the sign of the difference between the respective sample sizes is reversed. Overall, these comparisons admit the conclusion that the estimation procedure studied here is in many cases considerably more efficient than quantile regression. In addition, the QR approach suffers from the serious disadvantage that it produces reference bands whose upper boundary may differ in shape from its lower counterpart.
Finally, it is important to note that the majority of entries in Table 1 do not exceed 10,000. This is the order of magnitude of the sample size which was available for a recent, potentially highly influential reference-value study of basic parameters making up the normal fetal growth profile (Merz et al. [13]). (Some parts of the results of a previous version were included in the Mother's Passport delivered by the GBA [Joint Federal Committee of Germany] to every pregnant woman as a tool for monitoring the course of pregnancy.) This suggests that even under marked asymmetry and/or heteroscedasticity of the underlying conditional distributions, the age-dependent reference centiles published in that source satisfy fairly rigorous criteria of estimation precision.
All numerical results obtained in the simulation study hold under the restriction that the dispersion function is linear in age t. Admitting more complicated models for $\sigma(\cdot)$ is a conceptually straightforward generalization but has to be expected to lead to sample sizes which are still larger than those obtained under the assumption of linearity. Fortunately, extensive practical experience has shown that in real studies of age-dependent diagnostic markers, marked deviations from dispersion linearity rarely occur. The adequacy of the assumption can readily be checked by analyzing the absolute residuals calculated for the nonlinear regression model for the means. The regression of these residuals on age should be approximately linear.
As holds true for any numerical investigation using Monte Carlo methods, the results obtained in this paper are not fully exact but overlaid by (small) simulation errors. Evidently, in the present context the relevant target of an analysis of these errors is the standard error of the simulated confidence probability (defined as in Equations (4.a), (4.b) and (5)) attainable with one of the sample sizes n to be read from Table 1 or Table 2. Since the aim was to determine n as large as necessary for ensuring that under the respective specifications the confidence probability reaches the value 90% at least, and the number of replications was set at Nrep = 10,000 for the majority of the Monte Carlo experiments behind Tables 1 and 2, an upper bound to the estimated Monte Carlo standard error of the simulated value of $\beta_L$, $\beta_U$, and $\beta_{(L,U)}$, respectively, is given by $\sqrt{0.9 \cdot 0.1 / 10{,}000}$ = 0.003. This yields approximate 95% confidence intervals for the true values of $\beta_L$, $\beta_U$ and $\beta_{(L,U)}$ associated with the tabulated sample sizes n whose length is less than 0.012, guaranteeing high precision of those pivotal estimates. In some cases, the entries in Tables 1 and 2 could even be based on 40,000 repetitions per Monte Carlo experiment, which reduces this bound to 0.0015 and the upper bound to the length of an approximate 95% confidence interval for $\beta_L$, $\beta_U$, and $\beta_{(L,U)}$ to 0.006.
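The arithmetic behind these bounds is the usual binomial standard error of a simulated proportion, evaluated at the smallest admissible confidence probability of 0.90. A short check, with the function name chosen here purely for illustration:

```python
from math import sqrt

def mc_standard_error_bound(beta_min, n_rep):
    """Upper bound on the Monte Carlo standard error of a simulated
    probability known to be at least beta_min (with beta_min >= 0.5)."""
    return sqrt(beta_min * (1.0 - beta_min) / n_rep)

bounds = {}
for n_rep in (10_000, 40_000):
    se = mc_standard_error_bound(0.90, n_rep)
    # length of an approximate 95% confidence interval: 2 * 1.96 * se
    bounds[n_rep] = (se, 2 * 1.96 * se)
```

The bound is valid because the function p(1 − p) is decreasing for p ≥ 0.5, so any simulated confidence probability of at least 0.90 has a standard error no larger than the value at 0.90.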
A particularly promising way of exploiting the results obtained in this paper for improving diagnostic decision making in various medical disciplines would be to screen the literature on reference-limit studies analyzed by means of the distribution-free method of constructing age-dependent centiles investigated here for compatibility with the sample-size criteria to be read from the entries in the tables shown above. Wherever it turns out that the sample sizes which were actually made available fell substantially short of the values required according to those criteria, a recommendable step would consist of planning a replication study set to recruit a reference sample of an appropriately increased size.
Acknowledgments
The author gratefully acknowledges the thoughtful and constructive comments of an expert referee on two previous versions of this paper. The hints contained in his reports constituted a valuable guidance for improving the manuscript.
References
- 1. Royston P. Constructing time-specific reference ranges. Stat Med. 1991;10(5):675–90. pmid:2068420
- 2. Altman DG. Construction of age-related reference centiles using absolute residuals. Stat Med. 1993;12(10):917–24. pmid:8337548
- 3. Wellek S, Merz E. Age-related reference ranges for growth parameters. Methods Inf Med. 1995;34(5):523–8. pmid:8713769
- 4. Harris E, Boyd JC. Statistical bases of reference values in laboratory medicine. New York: Marcel Dekker; 1995.
- 5. Wellek S, Lackner KJ, Jennen-Steinmetz C, Reinhard I, Hoffmann I, Blettner M. Determination of reference limits: statistical concepts and tools for sample size calculation. Clin Chem Lab Med. 2014;52(12):1685–94. pmid:25029084
- 6. Wellek S, Jennen-Steinmetz C. Reference ranges: Why tolerance intervals should not be used. Stat Methods Med Res. 2022;31(11):2255–6.
- 7. Jennen-Steinmetz C, Wellek S. A new approach to sample size calculation for reference interval studies. Stat Med. 2005;24(20):3199–212. pmid:16189809
- 8. Cole TJ. Fitting smoothed centile curves to reference data. J R Statist Soc A. 1988;151:385–418.
- 9. Cole TJ, Green PJ. Smoothing reference centile curves: the LMS method and penalized likelihood. Stat Med. 1992;11(10):1305–19. pmid:1518992
- 10. Royston P, Wright EM. How to construct “normal ranges” for fetal variables. Ultrasound Obstet Gynecol. 1998;11(1):30–8. pmid:9511193
- 11. Wright EM, Royston P. Calculating reference intervals for laboratory measurements. Stat Methods Med Res. 1999;8(2):93–112. pmid:10501648
- 12. Cole TJ. Sample size and sample composition for constructing growth reference centiles. Stat Methods Med Res. 2021;30(2):488–507. pmid:33043801
- 13. Merz E, Pashaj S, Wellek S. Normal fetal growth profile at 10-41 weeks of gestation - an update based on 10225 normal singleton pregnancies and measurement of the fetal parameters using 3D ultrasound. Ultraschall Med. 2023;44(2):179–87. pmid:36587624
- 14. Merz E, Wellek S. Das normale fetale Wachstumsprofil - ein einheitliches Modell zur Berechnung altersabhängiger Referenzwerte für die gängigen Kopf- und Extremitätenmaße. Ultraschall Med. 1996;17:153–62.
- 15. Merz E, Pashaj S. Anomalies of the fetal face. Donald School Journal of Ultrasound in Obstetrics and Gynecology. 2019;13(1):34–40.
- 16. Jennen-Steinmetz C. Sample size determination for studies designed to estimate covariate-dependent reference quantile curves. Stat Med. 2014;33(8):1336–48. pmid:24307204
- 17. Koenker R. Quantile regression. Cambridge University Press; 2005.