Abstract
The presence of outlying data points can have a significant impact on statistical modeling and significance testing. In the specific context of one-sample t-tests, prior studies have shown (primarily through simulations) that outliers make it more likely for t-tests to fail to reject the null hypothesis. In this study, we investigate the opposite scenario: when an outlier can cause the rejection of the null hypothesis. While it may seem intuitive that outliers aligned with the direction of an effect strengthen that effect, prior studies have shown that this is not always the case. Towards this end, we introduce mathematical bounds on how large outliers can be while still increasing the t-statistic in a given sample. These bounds are validated and supported using Monte-Carlo simulations and a survey of available data sets. From these results, we find that although it is not impossible for outliers to cause significant results in paired or one-sample t-tests, it can only occur under rather narrow circumstances. Specifically, it requires a concordant outlier, a minimal sample size (greater than fourteen under a 2-σ outlier definition), and a sufficiently small effect size (roughly half a standard deviation or less). Based on these findings, we argue that the risk of isolated outliers causing type I errors is low in many practical situations, especially when sample sizes are small.
Citation: Wisler A (2026) Outliers (typically) cannot cause type I errors in one-sample/paired t-tests. PLoS One 21(2): e0341720. https://doi.org/10.1371/journal.pone.0341720
Editor: Abhik Ghosh, Indian Statistical Institute, INDIA
Received: October 6, 2025; Accepted: January 12, 2026; Published: February 17, 2026
Copyright: © 2026 Alan Wisler. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: No real data is used in this manuscript (simulation code available at https://osf.io/yfju9/).
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
The challenge of dealing with outliers is a pervasive methodological problem spanning a range of scientific disciplines [1–5]. Although somewhat nebulous, the term outlier broadly refers to an observation that markedly differs from other members of its group. Despite often being cast as bad data, outliers are not necessarily the result of experimental error and instead can simply be “an extreme manifestation of the random variability inherent in the data” [6, p. 1]. Regardless of the cause, outliers can disproportionately influence statistical estimates and bias statistical tests, and thus should be handled thoughtfully [7,8]. The best methods for handling outliers are a topic of significant debate across scientific disciplines. A common approach is to simply remove extreme values [9,10], though the criteria for identifying outliers can vary depending on the nature of the data. For a standard homogeneous sample, simple rules such as removing observations more than two or three standard deviations away from the mean can be employed [11]. However, in data sampled from different populations (or experimental conditions), more complicated questions arise as to whether the identification of outliers should be performed within each condition or across the entire dataset. For example, while [12] points out several issues with identifying outliers separately within conditions, [13] shows that identifying outliers across the entire dataset can also introduce problems. Note that all of these considerations become significantly more complicated in settings where data is multivariate [14,15] or the product of complex dynamic processes [16]. Because of the challenges associated with identifying outliers, and the risk they pose to many common tests, a range of statistical methods have been proposed with the aim of being robust to their presence [17–22].
The aim of this work is not to determine the best method of handling outliers, but to examine a very common statistical test (the paired/one-sample t-test) which is known to be somewhat sensitive to outliers, and to form a better understanding of the effect they can have on it.
In the standard null hypothesis significance testing framework, there are two types of errors that can occur. Type I errors refer to rejecting the null hypothesis when it is true (also known as false positives or false discoveries), and type II errors refer to failing to reject the null hypothesis when it is false (also called false negatives). A number of studies have examined the effects of outliers on common statistical tests and consistently found that their presence increases the risk of type II errors, whereas excluding outliers tends to increase the rate of type I errors [12,13,23,24]. In the case of the basic one-sample or paired Student’s t-test, the reason for this is quite clear: the presence of observations that deviate dramatically from the mean increases the sample standard deviation. Since the test statistic is calculated as the ratio of the mean to its associated standard error, an increase in variability inflates the denominator, thereby reducing the t-value and making it less likely that the test will reject the null hypothesis. Nevertheless, this seemingly obvious reasoning bears further scrutiny in certain contexts.
Imagine a study in which ten participants are each tested under two separate conditions, A and B. Suppose the first nine participants all perform slightly better in condition B than in condition A. If the tenth participant were found to perform dramatically better in condition B than in condition A, it would be natural to assume that this new data point would only strengthen our confidence in the superiority of condition B over condition A. However, in practice, a paradox arises: if this tenth observation deviates too much from the rest, a paired t-test may be less likely to reject the null hypothesis (and suggest condition B is superior) with the outlier than without it. This counterintuitive effect (concordant outliers reducing the likelihood of rejecting the null hypothesis, even when they appear to support the main effect) has already been documented [25]. In a paired t-test, the test statistic is calculated using the mean of the within-subject difference scores, divided by the standard error of those differences. A concordant outlier increases both the difference in means across conditions and the standard deviation of the difference scores, and the resulting impact on the t-value depends on how these two changes interact. In some cases, the inflated variability dominates, shrinking the test statistic and reducing statistical power; in others, the increase in the mean difference dominates and the t-statistic increases. However, the precise conditions under which outliers increase or decrease the t-statistic, and the corresponding likelihood of rejecting the null hypothesis, remain poorly understood.
Whereas much of the previous literature highlights the risk of outliers inflating type-II errors in t-tests, this work focuses on the risk they pose for type-I errors (or false positives). Specifically, we aim to develop a clearer understanding of the exact conditions under which outlying values can cause a paired t-test to reject the null hypothesis. For our purposes, we define an outlier as causing the rejection of the null if: 1) the null hypothesis is rejected when the outlier is included, and 2) the null hypothesis would not be rejected if the outlier were removed or replaced with any non-outlying value. As the definition for outlier is highly subjective, we explore the introduction of points between 0-10 standard deviations away from the mean and leave it to readers to decide which points meet their definition, although we will highlight some specific implications for 2-σ and 3-σ outlier thresholds. Within this framework, we use both mathematical derivations and empirical simulations to illustrate how different magnitudes of “outlier” can influence the t-statistic and corresponding statistical inferences. Finally, although the framework used in this paper is defined in terms of a generic one-sample t-test, we can easily think of these samples as differences and generalize these findings to paired t-tests as well. Thus, we will largely use the terms “one-sample” and “paired” t-test interchangeably.
The organization of this paper is laid out as follows. Sect 2 introduces the mathematical framework for this paper and presents two upper bounds on how large an outlier can be while still increasing the t-statistic. Sect 3 then presents three experiments designed to validate these bounds and develop a more concrete understanding of the scenarios under which concordant outliers can and cannot cause type I errors. The first two experiments use simulations to assess the effects of concordant outliers under different sample sizes and effect sizes. The third experiment looks at a set of fifty real-world datasets to see how frequently concordant outliers are present and whether or not they are driving any of the significant results in these data. Finally, Sect 4 presents a discussion of our findings to synthesize the main takeaways and relate these findings to the existing literature on this topic.
2 Framework and derivations
The methods for this paper are laid out as follows. We first introduce the basic framework and notation used to describe a scenario in which an arbitrary set of data is influenced by the introduction of a new observation that is greater than the mean of the existing data. In general, we observe that when the new observation is only slightly greater than the mean its introduction will increase the t-statistic. However, when the new observation is several standard deviations above the mean it will often decrease the t-statistic. Thus, the second part of our methods derives closed-form expressions for how many standard deviations above the mean the new observation can be in order to 1) increase the test statistic and 2) maximally increase the test statistic.
2.1 Problem description
Let us start by defining a set of data x = {x₁, x₂, …, xₙ} with sample mean and variance:

x̄ = (1/n)·Σ xᵢ,  s² = (1/(n − 1))·Σ (xᵢ − x̄)².

Now we extend the dataset with a new observation xₙ₊₁, where

xₙ₊₁ = x̄ + Δ, Δ > 0. (1)
For our analysis, the new observation xₙ₊₁ serves the role of a concordant outlying data point that is added to an existing sample. Note that we will consider a wide range of Δ values in our analysis, and describing xₙ₊₁ as an outlier when Δ is small might be misleading. To avoid drawing an arbitrary line on what is and is not considered an outlier in our analysis, we keep this definition deliberately broad and leave the task of disambiguating which cases accurately describe outliers to our discussion.
The fundamental question we seek to address in this paper is whether or not it is possible for the introduction of an outlying point into the data to result in false discoveries for one-sample Student t-tests. More specifically, we ask: under what conditions of sample size (n), sample mean (x̄), and perturbation magnitude (Δ) does the introduction of observation xₙ₊₁ increase the t-statistic? Without loss of generality, we assume that:

s² = 1 and x̄ ≥ 0.

While the first condition appears restrictive, it merely assumes that the data has been scaled to unit variance prior to conducting the analysis. For example, suppose we have a sample y with mean ȳ and variance s_y². We could then define xᵢ = yᵢ/s_y, and our analysis would then hold for the scaled data. Importantly, this scaling has no effect on the t-statistic since it affects the numerator and denominator equally. This condition also makes interpretation of our results significantly easier, since parameters like x̄ and Δ are now measured in numbers of standard deviations, rather than along the context-specific scale of the original data. So, if our criterion for outliers were defined as any point more than three standard deviations away from the mean, that would reflect cases where Δ ≥ 3. Additionally, because s = 1, we can think of the sample mean as representing the effect size in our data (since Cohen's d = x̄/s = x̄), and will frequently refer to it as such going forward.
Regarding the second condition, since our results will be based on the relative magnitude of the t-statistic for the modified data, flipping the sign of both x̄ and Δ would not affect our findings (since the t-statistic depends on the magnitude of the effect, not its direction). Thus, this condition only limits generality in the sense that it requires the added outlying value to be in the same direction as the observed effect. The omitted case is rather uninteresting, since it is generally assumed that outliers in the opposite direction of the observed effect will not result in false discoveries.
Using (1), we can derive the sample mean and variance of the modified data as (full derivation in S1 File):

x̄′ = x̄ + Δ/(n + 1),  s′² = [(n − 1)·s² + n·Δ²/(n + 1)] / n,

and from this the t-statistics of the original and modified data are

t = x̄/(s/√n) = √n·x̄ (recalling that s = 1)

and

t′ = x̄′/(s′/√(n + 1)) = √(n + 1)·(x̄ + Δ/(n + 1))/s′.
One interesting observation we can make from these equations is that both t and t′ depend on only three parameters: the sample mean of the original data (x̄), the sample standard deviation of the original data (s), and the number of standard deviations the new sample is placed above the mean (Δ). This means that the effect of an outlier (or any new observation) on the t-statistic will be the same for any two datasets of equal size, mean, and variance regardless of how they are distributed. This is an important point which will (in the subsequent analysis) allow us to make highly general claims about how outliers affect the t-statistic without relying on distributional assumptions.
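This parameter dependence is easy to check numerically. The sketch below is in Python rather than the R used for the paper's simulations, and the function names are our own; it implements the closed-form expression for t′ above and verifies it against a direct computation on a rescaled sample:

```python
import math
import random
import statistics

def t_stat(xs):
    """Ordinary one-sample t-statistic: mean / (sd / sqrt(n))."""
    n = len(xs)
    return statistics.mean(xs) / (statistics.stdev(xs) / math.sqrt(n))

def t_prime(xbar, n, delta):
    """Closed-form t-statistic after adding x_{n+1} = xbar + delta to a
    sample of size n with mean xbar and unit sample variance (s^2 = 1)."""
    m_new = xbar + delta / (n + 1)
    var_new = ((n - 1) + n * delta**2 / (n + 1)) / n
    return m_new * math.sqrt(n + 1) / math.sqrt(var_new)

# Verify the formula against a direct computation on a rescaled sample.
random.seed(1)
n, delta = 10, 2.0
raw = [random.gauss(0.5, 1.0) for _ in range(n)]
s_raw = statistics.stdev(raw)
x = [v / s_raw for v in raw]           # rescale so the sample variance is 1
xbar = statistics.mean(x)
direct = t_stat(x + [xbar + delta])    # t' computed from the data
closed = t_prime(xbar, n, delta)       # t' from the closed form
print(abs(direct - closed) < 1e-9)     # True
```

Because the closed form depends only on x̄, n, and Δ, the same check passes for any sample with the same summary statistics, regardless of its distribution.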
2.2 Deriving bounds on outlier magnitude
Here we introduce theoretical limits on Δ that provide guarantees on how the new observation will affect the results of a one-sample t-test. The first observation we would like to highlight is that as the value of the new observation gets arbitrarily large (Δ → ∞), the test statistic goes to one (t′ → 1). For a two-tailed t-test, this yields a p-value greater than 0.31 regardless of sample size. Thus, given a sufficiently large outlier, a t-test will always yield a null result regardless of the strength of evidence in the original (outlier-free) data. Note that these asymptotic characteristics have already been shown in [25]. The second observation is that there is an upper bound on Δ (which we will call Δ_max), defined in terms of only x̄ and n, beyond which the introduction of xₙ₊₁ will reduce the t-statistic of the original data. This upper bound is expressed formally in Theorem 1 below.
Theorem 1. If x̄ > 1/√n and Δ ≤ Δ_max, where

Δ_max = x̄·[(n + 1) + √((n + 1)(2n·x̄² + n − 1))] / (n·x̄² − 1),

then t′ ≥ t.
The proof for Theorem 1 is provided in S2 File. The third and final observation is that there is a second upper bound on Δ (which we will call Δ_opt), defined in terms of only x̄ and n, beyond which perturbing xₙ₊₁ any further from the mean will reduce t′. This upper bound is expressed formally in Theorem 2 below.
Theorem 2. If Δ < Δ_opt, where

Δ_opt = (n − 1)/(n·x̄),

then dt′/dΔ > 0 (and, conversely, for Δ > Δ_opt, dt′/dΔ < 0).
The proof for Theorem 2 is provided in S3 File. It is also fairly straightforward to show that Δ_opt ≤ Δ_max under the predefined conditions (also included in S3 File). These two theorems provide general insights into the maximum values a new outlier can take while still exerting positive effects on the t-statistic. Relating this back to our theoretical framework, Δ describes the number of standard deviations above the mean where the new observation (xₙ₊₁) is located. The value of Δ_opt depicts the exact location where the new observation will maximally increase the t-statistic. Thus, if we imagine we can control the placement of this new observation, the further we move it away from that location the less it will increase the t-statistic. And if we move it far enough away from this point in the positive direction, Δ_max describes the point at which its inclusion begins to lower the t-statistic (relative to its absence). Taken together with the asymptotic characteristics of Δ, this will eventually lead to failure to reject the null hypothesis. Based on the provided definitions for when an outlier causes rejection of the null hypothesis, Δ_opt provides a relatively clear indication of when this can occur. For any sample size (n) and sample mean (x̄) we consider, if the value of Δ_opt would not meet our definition of outlier then no outlier can cause the rejection of the null, since no outlying observation can increase the t-statistic more than an observation placed at x̄ + Δ_opt. For example, if n = 10 and x̄ = 1, then Δ_opt = 0.9, and any new observation added to the data cannot increase the t-statistic more than one placed exactly 0.9 standard deviations above the mean. Thus, while the addition of some outlying points might increase the t-statistic in this scenario, their influence would necessarily be less than that of some alternative non-outlying observation.
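The two bounds derived above (the placement that maximally increases the t-statistic, and the largest placement that still increases it) are straightforward to evaluate numerically. The following Python sketch (the paper's own code is in R; the function names here are ours) checks them on the n = 10, x̄ = 1 example:

```python
import math

def delta_opt(xbar, n):
    """Placement that maximally increases the t-statistic (Theorem 2)."""
    return (n - 1) / (n * xbar)

def delta_max(xbar, n):
    """Largest placement that still increases the t-statistic (Theorem 1);
    requires xbar > 1/sqrt(n), i.e. an original t-statistic above one."""
    return xbar * ((n + 1) + math.sqrt((n + 1) * (2 * n * xbar**2 + n - 1))) / (n * xbar**2 - 1)

def t_prime(xbar, n, delta):
    """Closed-form modified t-statistic (unit-variance original sample)."""
    m_new = xbar + delta / (n + 1)
    var_new = ((n - 1) + n * delta**2 / (n + 1)) / n
    return m_new * math.sqrt(n + 1) / math.sqrt(var_new)

n, xbar = 10, 1.0
t = math.sqrt(n) * xbar                      # original t-statistic (s = 1)
print(delta_opt(xbar, n))                    # 0.9, matching the example above
print(abs(t_prime(xbar, n, delta_max(xbar, n)) - t) < 1e-9)  # True: t' = t at the upper bound
```

The second check confirms that the upper bound is exactly the point where the modified t-statistic falls back to the original one.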
To better understand what these results tell us about the relationship between the placement of the new data point xₙ₊₁ and t′, Fig 1 displays the curves of (a) the new t-statistic (t′) and (b) the empirical influence function for the t-statistic across Δ for four different sample means. These plots also display where Δ_opt and Δ_max fall on each of these curves. Going in order, the first plot shows t′, which starts at different values for the four means, but in each case ascends to a particular peak then begins to decline. The dotted purple line represents the (Δ_opt, t′(Δ_opt)) pairs for a continuous range of x̄ values. As expected, this line intersects with the t′ curves at the peaks of each t′-curve, and these intersections occur at higher Δ values for curves with lower x̄. The second plot displays the empirical influence function (EIF) of the t-statistic, which we define as:

EIF(Δ) = (n + 1)·(t′(Δ) − t).
The left plot displays the modified t-statistic (t′) and the right plot displays the empirical influence function (EIF) of the t-statistic (which is a scaled measure of the difference between the modified and original t-statistics). Where appropriate, the derived bounds Δ_opt and Δ_max are displayed to show how they intersect with these curves.
This plot displays how much the t-statistic changes as a result of the newly added sample. Note that while the EIF is more typically defined in terms of the location of the added sample, we define it in terms of Δ (the location of the added sample relative to the mean) for consistency with the other visualizations. Similar to the first plot, the purple line indicates the peak in the EIF for each curve, which occurs at the same Δ values as before. For this plot we also see a marker for Δ_max at EIF(Δ) = 0, indicating the transition point between positive and negative EIF values, or alternatively the point at which the new t-statistic (t′) falls below the original t-statistic (t).
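Taking the EIF as the scaled difference (n + 1)(t′ − t) (our reading of the definition above), the zero crossing at the upper bound can be confirmed directly; a Python sketch with our own helper names:

```python
import math

def t_prime(xbar, n, delta):
    """Closed-form modified t-statistic (unit-variance original sample)."""
    m_new = xbar + delta / (n + 1)
    var_new = ((n - 1) + n * delta**2 / (n + 1)) / n
    return m_new * math.sqrt(n + 1) / math.sqrt(var_new)

def eif(xbar, n, delta):
    """Empirical influence function: scaled change in the t-statistic."""
    t = math.sqrt(n) * xbar
    return (n + 1) * (t_prime(xbar, n, delta) - t)

n, xbar = 10, 1.0
# Upper bound on delta (the positive root of t' = t)
d_max = xbar * ((n + 1) + math.sqrt((n + 1) * (2 * n * xbar**2 + n - 1))) / (n * xbar**2 - 1)
print(eif(xbar, n, d_max - 0.1) > 0)    # True: still below the bound
print(abs(eif(xbar, n, d_max)) < 1e-9)  # True: EIF crosses zero at the bound
print(eif(xbar, n, d_max + 0.1) < 0)    # True: past the bound
```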
3 Experiments and results
In this section, we introduce three experiments to validate and support the previous derivations. The first experiment will use a Monte-Carlo simulation to empirically verify the accuracy of the bounds introduced in Sect 2, and to quantify which sample means (x̄) and outlier magnitudes (Δ) result in changes to t-test results for three different sample sizes (n ∈ {10, 25, 100}). The second experiment mirrors the first, but considers all sample sizes between n = 2 and n = 100 to determine the maximum value of Δ_opt that occurs in cases where the result becomes significant only after the introduction of the new data point. As Δ_opt describes the location of the new observation that maximally increases the t-statistic, this value provides an approximate upper bound on the outlier criteria under which an outlier can possibly cause a type I error at a given sample size. Finally, experiment 3 applies this methodological framework to a survey of paired datasets to examine whether any cases of outliers causing type I errors can be identified.
3.1 Experiment 1: Comparison of bounds
To validate the previously described results we conduct a Monte-Carlo simulation. The simulation begins by randomly generating the mean of the original data and the magnitude of the outlier according to uniform distributions, with Δ drawn between 0 and 10 standard deviations (the range considered throughout this paper) and x̄ drawn over the range displayed in Fig 2. From this, a raw version of the original dataset (x_raw) is generated by sampling n values from a Normal distribution with mean x̄ and standard deviation 1:

x_raw,i ∼ N(x̄, 1), i = 1, …, n.

This raw data is then normed to create

x = x_raw / s_raw,

where s_raw is the sample standard deviation of the raw data. This ensures that s² = 1 as previously assumed. Note that while the data in this simulation is generated from a normal distribution, normality of the data is not required for our analysis to hold. Since t and t′ can be expressed solely in terms of x̄, n, and Δ, the process generating the data has no effect on the results beyond its relation to these parameters. Thus, this experiment could be repeated using any non-normal distribution (with the same sample mean and variance) to achieve the same results. Alternative versions of this experiment demonstrating near-identical results for non-normal cases are displayed in S4 File. From here, the modified dataset is defined just as in our model:

x′ = {x₁, …, xₙ, xₙ₊₁},

where xₙ₊₁ = x̄ + Δ. As we are interested in not only how the introduction of xₙ₊₁ affects statistical testing, but also how further perturbation of xₙ₊₁ would influence statistical outcomes, we also generate a third dataset

x″ = {x₁, …, xₙ, xₙ₊₁ + ε},

where ε is a very small perturbation constant meant to simulate calculation of a local derivative. Comparing t″ with t′ provides a simple way of determining whether pushing xₙ₊₁ further from the mean will increase or decrease the t-statistic. For each of the three datasets (x, x′, and x″), we calculate the t-statistics and associated p-values, which we will call t, t′, t″ and p, p′, p″ respectively.
From here, within each run of the Monte-Carlo simulation, we will investigate several cases:
- Case 0 (t′ < t and t″ < t′): Adding xₙ₊₁ reduced the t-statistic
- Case 1 (t′ ≥ t and t″ < t′): Adding xₙ₊₁ increased the t-statistic, but less than if xₙ₊₁ were closer to the mean
- Case 2 (t′ ≥ t and t″ ≥ t′): Adding xₙ₊₁ increased the t-statistic more than any observation closer to the mean would have
If the results in Theorems 1 and 2 are valid and their conditions are met (x̄ > 1/√n), it should be the case that any observations with Δ > Δ_max fall into Case 0, any observations with Δ_opt < Δ ≤ Δ_max fall into Case 1, and any observations with Δ ≤ Δ_opt fall into Case 2. The first goal of the simulation is to validate whether the theoretical bounds correctly categorize the empirical t-statistics. The second objective is to examine cases where either the original or modified dataset meets the standard statistical significance threshold (α = 0.05).
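A minimal Python sketch of this case classification (the published simulations are in R; `classify` and the fixed toy sample here are our own, and the expected labels follow from Theorems 1 and 2):

```python
import math
import statistics

def t_stat(xs):
    n = len(xs)
    return statistics.mean(xs) / (statistics.stdev(xs) / math.sqrt(n))

def classify(x, delta, eps=1e-6):
    """Label the effect of adding x_{n+1} = xbar + delta to sample x."""
    xbar = statistics.mean(x)
    t = t_stat(x)
    t1 = t_stat(x + [xbar + delta])        # t'
    t2 = t_stat(x + [xbar + delta + eps])  # t'': nudged further out
    if t1 < t:
        return 0   # the new point reduced the t-statistic
    if t2 < t1:
        return 1   # increased it, but past the peak
    return 2       # increased it, and a further nudge still helps

# Fixed toy sample, rescaled to unit sample variance
raw = [0.2, 0.5, 0.9, 0.4, 0.7, 1.1, 0.3, 0.6, 0.8, 0.5]
s = statistics.stdev(raw)
x = [v / s for v in raw]
n, xbar = len(x), statistics.mean(x)
d_opt = (n - 1) / (n * xbar)
d_max = xbar * ((n + 1) + math.sqrt((n + 1) * (2 * n * xbar**2 + n - 1))) / (n * xbar**2 - 1)
print(classify(x, 0.5 * d_opt))            # 2: below the optimal placement
print(classify(x, 0.5 * (d_opt + d_max)))  # 1: between the two bounds
print(classify(x, d_max + 1.0))            # 0: beyond the upper bound
```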
Based on this setup, we run a 10,000-iteration Monte-Carlo simulation across three different sample sizes (n ∈ {10, 25, 100}). Fig 2 displays the results of these simulations via a scatter plot of Δ vs. x̄ with lines for the two bounds Δ_opt and Δ_max. Plots in the top row display individual iterations of the simulation color coded by the previously described cases. These plots clearly illustrate that the bounds accurately delineate these cases for all three sample sizes. As would be expected from the formula, Δ_opt is not noticeably affected by the sample size, while the Δ_max line generally shifts leftwards as n gets larger. Plots in the bottom row of Fig 2 display the same results, but this time color coded based on whether p and p′ are greater than or less than the significance threshold 0.05. Since the value of p is determined entirely by the values of x̄ and n (because s² is fixed), there is a clear separation between points for which p < 0.05 and p ≥ 0.05, determined by a fixed value of x̄ for each n. We also see that this x̄-threshold moves left as the sample size increases, and with this leftward shift we see increases in both Δ_opt and Δ_max. Therefore, for smaller sample sizes there is a lower upper limit on the magnitude of outliers capable of producing false discoveries. In the n = 10 case, for example, we observe Δ values as high as 4.39 yielding a significant result in the modified data, despite no significance in the original data. However, it is important to point out that in these instances, the added data point could have been much smaller and still produced a significant p-value. If we examine the maximum value of Δ_opt in those trials, we see that it never exceeds 1.58. Since these experiments are conducted on normalized data, this means that a new observation need not be more than 1.58 standard deviations above the mean to maximize the t-statistic (and thus minimize the corresponding p-value). In the larger-sample simulations (n = 25 and n = 100), the maximum observed values for Δ_opt in trials where p ≥ 0.05 and p′ < 0.05 were 2.77 and 5.83, respectively. Therefore, in larger sample sizes it is possible for outlying samples to meet our definition for causing the rejection of the null hypothesis, although these cases represent a relatively narrow proportion of the observed trials. The following subsection will further explore the relationship between sample size and the maximally influential outlier value Δ_opt at the significance threshold.
Each plot is displayed as outlier magnitude (Δ) vs. sample mean (x̄), with lines depicting the two bounds (Δ_opt, Δ_max). Points are colored by either case (top row), which indicates how the new observation affects the t-statistic for that (x̄, Δ) pair, or p-values (bottom row), which indicate whether the p-value falls below the 0.05 threshold both with and without the new observation. Note that cases in the top row are perfectly separated by the Δ_opt and Δ_max lines, supporting the validity of the proposed bounds.
3.2 Experiment 2: Maximum outlier magnitude vs. sample size
In the previous simulation we observed that the values of Δ likely to change the outcome of the significance test increased for larger sample sizes. This was partially a result of the dependency of Δ_opt and Δ_max on n, but primarily stems from the inverse relationship between Δ_opt, Δ_max, and x̄, combined with the fact that the effect size (x̄) needed for statistical significance decreases for larger n. Our goal in this simulation is to develop a better understanding of the maximum value an outlier can take while still causing a statistically significant result in the modified sample. Remember, per our definitions, an outlier causing a significant result requires that 1) its inclusion leads to a significant result and 2) replacing it with any non-outlying value would not lead to a significant result. Per these definitions, it is only possible for outliers to cause false discoveries when Δ_opt meets our outlier criteria, as otherwise any outlying sample that yields a significant result could be replaced with a non-outlying sample at Δ_opt to achieve the same result. Thus the goal of this simulation is to estimate the maximum value of Δ_opt in cases where p ≥ 0.05 and p′ < 0.05 for a given sample size. To achieve this, we repeat the same 10,000-iteration Monte-Carlo simulation from before at every sample size 2 ≤ n ≤ 100, and for each sample size measure the maximum value of Δ_opt corresponding to trials where p ≥ 0.05 and p′ < 0.05. The results of this simulation are presented in Fig 3.
Sample sizes where this value crosses notable 2-σ and 3-σ outlier thresholds are highlighted to indicate the approximate minimum sample size under which outliers can cause type I errors under each definition.
These results offer additional insights into the potential for concordant outliers to cause false discoveries across various sample sizes. If we define outliers as any point more than two standard deviations away from the mean, the empirical results suggest that it is only possible for outliers to cause rejection of the null hypothesis at sample sizes greater than fourteen. If we impose a stricter definition and require outliers to be more than three standard deviations from the mean, the minimum sample size before concordant outliers can cause the null hypothesis to be rejected increases further (Fig 3). Also note that these sample size requirements are necessary for concordant outliers to cause false positives, but not sufficient by themselves. Using the n = 100 simulation in Fig 2 as an example, we see that cases where p ≥ 0.05 and p′ < 0.05 still only occur for a very narrow range of x̄-values.
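The shape of this curve can also be approximated analytically. Substituting the optimal placement Δ_opt = (n − 1)/(n·x̄) into the expression for t′ gives the largest modified t-statistic attainable for a given x̄ as √(1 + n(n + 1)x̄²/(n − 1)); setting this equal to the two-tailed critical value for n degrees of freedom (the modified sample has n + 1 observations) and solving for x̄ yields a closed-form ceiling on Δ_opt. A Python sketch using standard t-table critical values (the helper name and the selected sample sizes are ours):

```python
import math

# Two-tailed 5% critical values of Student's t for selected degrees of
# freedom (standard t-table values).
T_CRIT = {9: 2.262, 14: 2.145, 15: 2.131, 24: 2.064, 99: 1.984}

def max_delta_opt(n):
    """Approximate ceiling on delta_opt among trials significant only WITH
    the added point: solving sqrt(1 + n(n+1)xbar^2/(n-1)) = t_crit for xbar
    and substituting gives delta_opt = sqrt((n^2 - 1) / (n (t_crit^2 - 1)))."""
    tc = T_CRIT[n]
    return math.sqrt((n**2 - 1) / (n * (tc**2 - 1)))

for n in sorted(T_CRIT):
    print(n, round(max_delta_opt(n), 2))
# prints:
# 9 1.47
# 14 1.97
# 15 2.05
# 24 2.71
# 99 5.81
```

Under this approximation the 2-σ line is first crossed between n = 14 and n = 15, consistent with the minimum sample size of fourteen reported above, and the values at n = 24 and n = 99 (2.71 and 5.81) sit just below the empirical maxima reported in experiment 1 for n = 25 and n = 100 (2.77 and 5.83).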
3.3 Experiment 3: Examination of outlier effects in real paired data
In the final experiment, we will look at some real-world datasets containing paired observations to see how frequently concordant outliers occur in these datasets and how they influence paired t-test results. This experiment also provides the opportunity to illustrate how this approach can be applied to real-world datasets to characterize the influence of existing data points. To obtain a set of datasets containing paired observations, we draw from two existing R packages with built-in repositories of paired datasets: the PairedData package [26] and the BSDA package [27]. Together these libraries provide a total of 43 datasets. As some of these datasets contain more than one set of paired observations, we have 50 sets of paired samples for this analysis.
For each dataset, we go through the following procedure. Let us call the paired observation vectors loaded from the data a and b. We can calculate the original difference pairs as d = a − b. If this doesn't result in a non-negative mean (d̄ ≥ 0), then instead assign d = b − a. Now, we identify a candidate concordant outlier as the maximum value in these difference pairs and remove it from the sample:

d₋ = d \ {d*}, where d* = max(d).

The original data matching our framework (x) is then calculated as the rescaled version of this data,

x = d₋ / s₋,

where s₋ is the sample standard deviation of d₋, and the modified data (x′) is created by re-adding the scaled version of the candidate outlier back into this set:

x′ = x ∪ {x*}, where x* = d*/s₋.

Using this data, we calculate the t-statistics both with and without the candidate outlier (t and t′ respectively) and their associated p-values (p and p′). We can also calculate Δ = x* − x̄, along with the bounds Δ_opt and Δ_max.
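The per-dataset procedure can be sketched as follows (in Python rather than the R used for the actual analysis; `analyze_pairs` is a hypothetical helper and the paired values are invented for illustration):

```python
import math
import statistics

def analyze_pairs(a, b):
    """Remove the largest difference as a candidate concordant outlier and
    report the t-statistics with and without it (unit-variance scaling)."""
    d = [ai - bi for ai, bi in zip(a, b)]
    if statistics.mean(d) < 0:
        d = [-di for di in d]              # orient so the mean effect is positive
    d_star = max(d)                        # candidate concordant outlier
    d_minus = list(d)
    d_minus.remove(d_star)
    s = statistics.stdev(d_minus)
    x = [v / s for v in d_minus]           # original data x, unit variance
    x_mod = x + [d_star / s]               # modified data with outlier re-added
    n = len(x)
    xbar = statistics.mean(x)
    t = xbar / (statistics.stdev(x) / math.sqrt(n))
    t1 = statistics.mean(x_mod) / (statistics.stdev(x_mod) / math.sqrt(n + 1))
    delta = d_star / s - xbar              # outlier location in SDs above the mean
    d_opt = (n - 1) / (n * xbar)           # placement that maximizes t'
    return t, t1, delta, d_opt

# Invented paired values, for illustration only
a = [12.1, 11.4, 13.0, 12.7, 11.9, 12.3, 14.8, 12.0, 11.7, 12.5]
b = [11.8, 11.1, 12.6, 12.4, 11.6, 11.9, 12.1, 11.8, 11.5, 12.2]
t, t1, delta, d_opt = analyze_pairs(a, b)
print(delta > d_opt)   # True: a smaller observation would raise t' further
print(t1 < t)          # True: here the extreme value actually lowers t
```

For this toy sample the extreme difference lies far beyond the optimal placement, so the strong t-statistic of the reduced sample is weakened, not strengthened, by re-adding it.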
After going through this process, four of the 50 samples showed negative sample means after the candidate outlier was removed and were excluded from subsequent analysis. The results of this analysis are displayed comprehensively in Table 1. The primary goal of this analysis is to determine how frequently concordant outliers are present in these data and whether instances of them causing rejection of the null hypothesis can be identified. Of the 46 paired samples, 23 contained a concordant outlier more than 2 standard deviations above the mean. Of these 23 cases there was only one dataset (the Vocab data) where the results of the test were altered by the concordant outlier. Although this dataset clearly meets the outlier criteria (Δ > 2) and the outlier affects the results of the test, this still does not qualify as causing the rejection, since Δ_opt < 2 in this instance (meaning a non-outlying observation could have produced the same significant result). Now, if we instead examine cases where the largest concordant observation did not meet the outlier threshold, we see that of those 23 cases in which the maximum observation (x*) fell below the 2-σ threshold, eight datasets showed a significant result only when the candidate outlier was included. Consequently, in these datasets it is far more common for a non-outlying observation to yield a rejection that didn't exist before than for an outlying one.
Note that as Theorem 1 requires that x̄ > 1/√n, Δ_max is not reported for datasets where this assumption is violated.
Visualizations of these results are displayed in Fig 4, where each dot represents the results for a single dataset. The first scatter plot displays the updated t-statistic with the maximum concordant observation included, relative to the t-statistic without it. Although this plot shows that the t-statistic increases in the vast majority of cases (37 of 46 according to Table 1), we know from the preceding discussion that in most of these cases the concordant observation increasing the t-statistic does not meet the 2-σ outlier threshold. In contrast, in all nine of the cases where the t-statistic decreases, the concordant observation exceeds this threshold. Note that, by definition, cases where the concordant observation decreases the t-statistic are cases where the observation lies beyond the break-even bound derived earlier. Looking at these nine cases in Table 1, we see that this tends to occur for datasets with larger effect sizes. This makes sense, as the break-even bound is inversely related to the effect size. Interestingly, for three of these datasets (Blink, Grain2, and Oxytocin) this bound is less than two, meaning that new observations in these samples do not have to be much greater than the mean in order to negatively affect the t-statistic. The second set of plots displays the point at which the concordant observation maximally increases the t-statistic, relative to the maximum observation (Δ). There are a few interesting observations we can make from this plot. First, in the majority of datasets the maximizing point falls below the 2-σ threshold, meaning that for these datasets no outlying point can increase the t-statistic more than a point within two standard deviations of the mean. We also observe that in the majority of these cases the maximizing point falls below Δ, which indicates that the test statistics in these data would be greater if the maximum observation were actually smaller. Thus, while some of these data show statistically significant test results and contain concordant outliers, the strength of the t-statistic generally comes in spite of the magnitude of these outliers, not as a product of it. In contrast, if we look at the ten cases where the maximizing point exceeds the 2-σ threshold, we see that none of these data show statistically significant test results either with or without the concordant observation present. This is not surprising, since (by definition) the maximizing point is only large when the effect size is small. So, in summary, of the 50 datasets analyzed in this experiment, we could not find a single case where a statistically significant result could be attributed to an outlying sample by the definitions proposed in this paper.
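The rise-then-fall behavior underlying these results is easy to reproduce numerically. The sketch below (Python rather than the paper's R, using a synthetic sample standardized to an effect size of exactly d = 0.3) appends a single new observation Δ standard deviations above the clean-sample mean for a grid of Δ values and locates the Δ that maximizes the one-sample t-statistic:

```python
import numpy as np
from scipy import stats

# synthetic clean sample standardized to an exact effect size of d = 0.3
rng = np.random.default_rng(1)
z = rng.normal(size=30)
x = (z - z.mean()) / z.std(ddof=1) + 0.3

m, s = x.mean(), x.std(ddof=1)          # measured WITHOUT the new point
deltas = np.linspace(0.0, 20.0, 2001)   # candidate location in sigma units
tvals = [stats.ttest_1samp(np.append(x, m + d * s), 0.0).statistic
         for d in deltas]

best = deltas[int(np.argmax(tvals))]
print(f"t is maximized by a new point {best:.2f} sd above the mean")
print(f"t at delta=2: {tvals[200]:.3f}, t at delta=20: {tvals[-1]:.3f}")
```

On this sample the grid search should land a little above 3 standard deviations, and the t-statistic at Δ = 20 is well below the maximum, consistent with the observation that sufficiently extreme concordant points reduce rather than increase the statistic.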
Each dot represents one dataset and dots are color coded based on the p-value with and without the maximum concordant observation.
3.4 Code availability
Code for the simulations in this paper can be found on the Open Science Framework at https://osf.io/yfju9/. Simulations were conducted in R version 4.4.0 [28] and used dplyr for data manipulation [29], ggplot2 and patchwork for visualization [30,31], and BSDA and PairedData for data examples [26,27].
4 Discussion
The question we sought to investigate in this manuscript is whether it is possible for outliers, specifically concordant ones, to cause false discoveries in one-sample or paired t-tests. While it is undoubtedly true under the definitions provided that outliers can lead to false discoveries, the conditions under which this occurs are surprisingly limited. Because the number of standard deviations above the mean at which an outlying point maximally increases the t-statistic is shown to be inversely proportional to the effect size, outliers can only increase the t-statistic more than non-outlying values when the effect size is sufficiently small relative to the outlier threshold. As smaller sample sizes require larger effect sizes to achieve statistical significance, and the ability of single observations to increase the t-statistic is limited, we can generally rule out the possibility of outliers causing type I errors in small samples (n < 15 for a 2-σ outlier threshold). As the initial effect size approaches zero, the number of standard deviations away from the mean at which a point maximally increases the t-statistic approaches infinity. However, in practice, the extent to which such points can increase the t-statistic is limited and frequently insufficient to result in rejection of the null hypothesis. Using a 2-σ outlier threshold, the highest observed p-value for which an outlier caused a significant result (i.e., the test was significant with the outlier but not without it) was p = 0.096. Therefore, even in cases where outliers meet our criteria for causing a rejection of the null hypothesis, some initial indication of an effect must already be present in the original data.
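The inverse relationship between the effect size and the location at which a new point maximally increases the t-statistic can be checked numerically. In the hypothetical Python sketch below (the `best_delta` helper and grid bounds are our own, not from the paper's code), the same standardized noise is shifted to three exact effect sizes. Under the standard mean and variance update formulas for adding one observation, the maximizer works out to roughly (n-1)/(n·d) standard deviations above the mean; this is our own re-derivation, not an expression quoted from the paper, but the grid search should agree with it.

```python
import numpy as np
from scipy import stats

def best_delta(x, grid=np.linspace(0.0, 50.0, 5001)):
    """Location (in sd units above the clean-sample mean) of a single new
    observation that maximizes the one-sample t-statistic (grid search)."""
    m, s = x.mean(), x.std(ddof=1)
    t = [stats.ttest_1samp(np.append(x, m + d * s), 0.0).statistic
         for d in grid]
    return grid[int(np.argmax(t))]

rng = np.random.default_rng(0)
z = rng.normal(size=40)
noise = (z - z.mean()) / z.std(ddof=1)   # mean 0, sd 1 exactly

for d in (0.1, 0.3, 0.6):
    # sample with effect size exactly d; the maximizer shrinks as d grows
    print(d, round(best_delta(noise + d), 2))
```

For n = 40 the re-derived maximizer (n-1)/(n·d) predicts roughly 9.75, 3.25, and 1.63 standard deviations for d = 0.1, 0.3, and 0.6, shrinking steadily as the effect strengthens.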
To summarize, the following criteria must all be met for an outlier to cause a false discovery. First, the observed effect in the original data must be on the border of statistical significance. Based on our empirical findings, this alone probably occurs less than 5% of the time when the null hypothesis is true. Second, the dataset must contain a sufficient number of samples (n > 15) in order for effects near the significance threshold to be small enough to be positively affected by outliers. Finally, if the previous two criteria are met, the outlier still needs to be both: 1) concordant with the observed effect, and 2) not so great in magnitude that it starts to reduce the t-statistic. The narrowness of these conditions explains why the analysis of sample datasets did not find a single instance in which a significant effect could be attributed to an outlying sample, and justifies the titular claim that outliers typically cannot cause type I errors. These conditions also contrast starkly with the relatively broad conditions necessary for outliers to cause type II errors, which require only that: 1) the original data exhibit a significant effect, and 2) the outlier be sufficiently large (in either direction) to nullify that effect. One part of this result that feels counterintuitive is the conclusion that outliers pose a greater risk of causing false discoveries in large samples than in small samples. Since a single corrupt observation constitutes a greater proportion of a small sample than of a large one, it is natural to assume it would carry greater impact. This idea does not conflict with our findings. Examining Fig 2 shows that the number of cases where the p-value falls below the significance threshold as a result of the new observation (red dots) appears to be at least as high in the smaller sample sizes. However, in the smaller sample sizes these cases tend to occur more frequently for smaller Δ values. Thus it is not the case that isolated observations cannot inflate the t-statistic in smaller samples, just that the ones doing so tend not to be far enough from the mean to be generally considered outliers.
Relating these findings to prior research, this is not the first study to show that outliers pose a greater risk for type II errors than type I errors in one-sample or paired t-tests [12,13,23,24]. However, few studies have focused specifically on the effects of concordant outliers. Our work builds naturally on the work of [25], who showed through simulation that increasing the magnitude of a concordant outlier will eventually result in increasing p-values and decreases to the null-hypothesis rejection rate. Building on this work, this study provides closed-form expressions for: 1) the point at which further increases to the outlier will begin to decrease the t-statistic, and 2) the point beyond which an outlier's introduction decreases the t-statistic. In doing this, we have also established a clearer understanding of how the effect of concordant outliers on the t-statistic depends on the relationship between the outlier's magnitude, the sample size, and the effect size of the original data. To our knowledge, this is the first study to propose that outliers pose no risk of causing false discoveries in one-sample or paired t-tests with small-sample designs.
This work also aligns well with existing theory in the robust statistics literature, which has made efforts to quantify the sensitivity of the t-statistic (and other measures) to corrupt data. One common measure of robustness closely related to the work in this paper is the breakdown point. The breakdown point of an estimator refers to the maximum percentage of data that can be corrupted before the estimator becomes unreliable (or exhibits arbitrarily large error) [32]. In the context of statistical inference, the breakdown point can be separated into the level breakdown point (robustness of validity), which measures the level of contamination needed to guarantee rejection of the null hypothesis, and the power breakdown point (robustness of efficiency), which measures the level of contamination needed to guarantee failure to reject the null hypothesis. Work that has examined these properties for the t-statistic has found that it exhibits some robustness of validity, but little robustness of efficiency [17,33,34]. For the breakdown point specifically, the t-statistic has a power breakdown point of zero and a level breakdown point that is generally non-zero [33].
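The power breakdown point of zero can be seen directly: a single corrupt observation, made arbitrarily large, drags the one-sample t-statistic toward 1 no matter how strong the effect in the clean data was. A minimal Python illustration of this limit (our own sketch, assuming normally distributed clean data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(1.0, 1.0, size=20)             # strong, clearly real effect
t_clean = stats.ttest_1samp(x, 0.0).statistic
print(f"clean t = {t_clean:.2f}")

for big in (1e2, 1e4, 1e6):
    t = stats.ttest_1samp(np.append(x, big), 0.0).statistic
    print(f"with one value at {big:g}: t = {t:.3f}")
# as the corrupt value grows without bound, t tends to 1 (far from any
# usual significance threshold): one bad point suffices to destroy power
```

Because both the mean and the standard deviation are inflated by the corrupt value, and at the same rate in the limit, the statistic converges to 1 rather than diverging, which is exactly why a single observation can erase power but not force a rejection.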
One important point to note about this analysis is that it makes no assumptions about the distribution from which the data are drawn. This means that this analysis, and the associated conclusions, are applicable regardless of whether the data are normally distributed or whether violations of other standard assumptions, such as heteroskedasticity or non-independence, are present. As a result, the conclusions we draw about whether or not an outlier is capable of causing a significant result for a given sample size, outlier magnitude, and effect size do not depend on any of these distributional assumptions. So, for example, the claim that outliers cannot cause significant t-test results when n < 15 is true even when any or all of these assumptions are violated. Violation of these assumptions may yield inflated type I or type II error rates in the statistical test, but it does not affect our conclusions about the ability of isolated outlying values to disproportionately push a t-test towards a significant result. Note that the lack of distributional assumptions in our analysis is a double-edged sword. While it means that the conclusions in this paper do not rely on properties of the underlying distribution, it also means that we cannot make claims about type I and type II error rates.
It is important to emphasize that these findings in no way diminish the concern that outliers, or more broadly corrupt data and the violation of distributional assumptions, may pose in one-sample or paired t-tests. While we make the case that isolated outliers cannot cause type I errors in small samples, an alternative interpretation of these results is that the erroneous data most likely to cause type I errors cannot be detected based on their deviation from the mean. Rather than diminishing the concern outliers present to one-sample t-tests, this work reinforces the well-established understanding that the primary concern outliers present in this context is diminished efficiency and reduced statistical power. Thus, in cases of known contamination or non-normality, alternative robust or non-parametric methods may still be appropriate. There is a range of robust methods that attempt to maintain efficiency close to that of parametric methods while reducing sensitivity to violations of model assumptions. These include strategies based on trimmed means [18], bootstrapping methods [19,20], and M-estimators [18,21,22]. Non-parametric methods can also provide a compelling option in some cases, but are not advisable for every application [35–37]. This work is not meant to discourage the use of such alternatives, but to facilitate proper interpretation of the t-test in cases where it is used with outliers present.
Perhaps the most restrictive limitation imposed by our analysis is that only isolated outliers are considered. Some study of the effects of multiple concordant outliers can be found in the analysis presented in [34], which examines the effects of asymmetric contamination and finds that large contamination rates (15%) can lead to significantly inflated type I error rates. To better understand the effects of multiple outliers, this framework could be used to test the iterative addition of optimally placed points and assess the minimal number of corrupt observations necessary to produce a significant result in different scenarios. However, considering the joint impact of multiple concordant outliers remains a topic for future work. One final point to note regarding the application of these ideas to other settings is that the framework used here is an inversion of how outliers are usually characterized. Very often outliers are considered part of the original sample and evaluated for removal, whereas our framework considers the addition of a possibly outlying data point to a clean original sample. So when we say, for example, that a new observation maximally increases the test statistic at a given number of standard deviations above the mean, the mean and standard deviation referred to here are both measured without the new observation. As a result, these formulas do not apply exactly in the context of outlier removal, but can easily be applied in that context by re-measuring these terms sans outlier, as shown in Experiment 3. As a final note, it is worth mentioning that the σ-based outlier criteria used in this paper were chosen for convenience and are generally not recommended as a method for identifying outliers [11,38].
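The iterative extension suggested above is straightforward to prototype. The sketch below (hypothetical Python, not from the paper's OSF code; the `min_corrupt_points` helper and its grid bounds are our own) greedily appends each new observation wherever a grid search says it most increases the t-statistic, and counts how many such optimally placed points are needed to reach significance:

```python
import numpy as np
from scipy import stats

def min_corrupt_points(x, alpha=0.05, max_add=50):
    """Greedily append observations at the location (in sd units above the
    current mean) that most increases t; return how many are needed before
    the one-sample t-test becomes significant."""
    x = np.asarray(x, dtype=float)
    for added in range(max_add + 1):
        if stats.ttest_1samp(x, 0.0).pvalue < alpha:
            return added
        m, s = x.mean(), x.std(ddof=1)
        candidates = m + np.linspace(0.0, 10.0, 1001) * s
        t = [stats.ttest_1samp(np.append(x, c), 0.0).statistic
             for c in candidates]
        x = np.append(x, candidates[int(np.argmax(t))])
    return None  # still not significant after max_add additions

rng = np.random.default_rng(3)
weak = rng.normal(0.2, 1.0, size=25)   # weak effect, plausibly non-significant
print(min_corrupt_points(weak))
```

Note that the mean and standard deviation used to place each new point are recomputed from the current (already contaminated) sample at every step, mirroring the framework's convention that these terms are measured without the newly added observation.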
5 Conclusion
This study investigates how outlying values influence the results of one-sample or paired t-tests, specifically in cases where the outlier is concordant with the direction of the effect observed in the rest of the data. We show that, even in these cases, outliers frequently decrease rather than increase the test statistic. To formalize this, we introduced bounds that provide upper limits on how large an outlier can be before it begins to reduce the test statistic. These bounds show that for small sample sizes (n < 15), adding an outlier more than two standard deviations above the mean cannot produce a significant result that could not already be achieved by a non-outlier. In larger samples, outliers can cause significant results that non-outlying values cannot, but only under relatively constrained circumstances.
References
- 1. Ben-Gal I. Outlier detection. In: Data mining and knowledge discovery handbook. 2005. p. 131–46.
- 2. Orr JM, Sackett PR, Dubois CLZ. Outlier detection and treatment in I/O psychology: a survey of researcher beliefs and an empirical illustration. Personnel Psychology. 1991;44(3):473–86.
- 3. Benhadi-Marín J. A conceptual framework to deal with outliers in ecology. Biodivers Conserv. 2018;27(12):3295–300.
- 4. Heritier S, Cantoni E, Copt S, Victoria-Feser MP. Robust methods in biostatistics. Wiley; 2009.
- 5. Gress TW, Denvir J, Shapiro JI. Effect of removing outliers on statistical inference: implications to interpretation of experimental data in medical research. Marshall J Med. 2018;4(2):9. pmid:32923665
- 6. Grubbs FE. Procedures for detecting outlying observations in samples. Technometrics. 1969;11(1):1–21.
- 7. Sullivan JH, Warkentin M, Wallace L. So many ways for assessing outliers: what really works and does it matter? Journal of Business Research. 2021;132:530–43.
- 8. Yu C, Yao W. Robust linear regression: a review and comparison. Communications in Statistics - Simulation and Computation. 2016;46(8):6261–82.
- 9. Bakker M, Wicherts JM. Outlier removal and the relation with reporting errors and quality of psychological research. PLoS One. 2014;9(7):e103360. pmid:25072606
- 10. Aguinis H, Gottfredson RK, Joo H. Best-practice recommendations for defining, identifying, and handling outliers. Organizational Research Methods. 2013;16(2):270–301.
- 11. Bakker M, Wicherts JM. Outlier removal, sum scores, and the inflation of the Type I error rate in independent samples t tests: the power of alternatives and recommendations. Psychol Methods. 2014;19(3):409–27. pmid:24773354
- 12. André Q. Outlier exclusion procedures must be blind to the researcher’s hypothesis. J Exp Psychol Gen. 2022;151(1):213–23. pmid:34060886
- 13. Karch JD. Outliers may not be automatically removed. J Exp Psychol Gen. 2023;152(6):1735–53. pmid:37104797
- 14. Filzmoser P, Nordhausen K. Robust linear regression for high-dimensional data: an overview. Wiley Interdisciplinary Reviews: Computational Statistics. 2021;13(4):e1524.
- 15. Aggarwal CC, Yu PS. Outlier detection for high dimensional data. In: Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data. 2001. p. 37–46. https://doi.org/10.1145/375663.375668
- 16. Morato MM, Stojanovic V. A robust identification method for stochastic nonlinear parameter varying systems. MMC. 2021;1(1):35–51.
- 17. Hampel FR, Ronchetti E, Rousseeuw PJ, Stahel WA. Robust statistics: the approach based on influence functions. Wiley; 1986.
- 18. Mair P, Wilcox R. Robust statistical methods in R using the WRS2 package. Behav Res Methods. 2020;52(2):464–88. pmid:31152384
- 19. Konietschke F, Pauly M. Bootstrapping and permuting paired t-test type statistics. Stat Comput. 2013;24(3):283–96.
- 20. Zhao S, Yang Z, Musa SS, Ran J, Chong MKC, Javanbakht M, et al. Attach importance of the bootstrap t test against Student’s t test in clinical epidemiology: a demonstrative comparison using COVID-19 as an example. Epidemiol Infect. 2021;149:e107. pmid:33928887
- 21. Lucas A. Robustness of the student t based M-estimator. Communications in Statistics - Theory and Methods. 1997;26(5):1165–82.
- 22. Zhou W-X, Bose K, Fan J, Liu H. A new perspective on robust m-estimation: finite sample theory and applications to dependence-adjusted multiple testing. Ann Stat. 2018;46(5):1904–31. pmid:30220745
- 23. Zimmerman DW. A note on the influence of outliers on parametric and nonparametric tests. The Journal of General Psychology. 1994;121(4):391–401.
- 24. Cousineau D, Chartier S. Outliers detection and treatment: a review. Int j psychol res. 2010;3(1):58–67.
- 25. Derrick B, Broad A, Toher D, White P. The impact of an extreme observation in a paired samples design. Adv Meth Stat. 2024;14(2):1–17.
- 26. Champely S. PairedData: Paired Data Analysis; 2018. https://CRAN.R-project.org/package=PairedData
- 27. Arnholt AT, Evans B. BSDA: Basic Statistics and Data Analysis; 2023. https://CRAN.R-project.org/package=BSDA
- 28. R Core Team. R: A Language and Environment for Statistical Computing; 2024. https://www.R-project.org/
- 29. Wickham H, Francois R, Henry L, Müller K, Vaughan D. dplyr: A Grammar of Data Manipulation; 2023. https://CRAN.R-project.org/package=dplyr
- 30. Wickham H. ggplot2: elegant graphics for data analysis. New York: Springer-Verlag; 2016.
- 31. Pedersen TL. patchwork: The Composer of Plots; 2025. https://CRAN.R-project.org/package=patchwork
- 32. Huber PJ. Robust statistics. In: International encyclopedia of statistical science. Springer; 2011. p. 1248–51.
- 33. He X, Simpson DG, Portnoy SL. Breakdown robustness of tests. Journal of the American Statistical Association. 1990;85(410):446–52.
- 34. Jennings MJ, Zumbo BD, Joula JF. The robustness of validity and efficiency of the related samples t-test in the presence of outliers. Psicologica. 2002;23(2).
- 35. Fagerland MW. t-Tests, non-parametric tests, and large studies – a paradox of statistical practice? BMC Med Res Methodol. 2012;12:78. pmid:22697476
- 36. Le Cessie S, Goeman JJ, Dekkers OM. Who is afraid of non-normal data? Choosing between parametric and non-parametric tests. 2020.
- 37. Harwell MR. Choosing between parametric and nonparametric tests. Journal of Counseling and Development. 1988;67(1):35–8.
- 38. Rousseeuw PJ, Hubert M. Robust statistics for outlier detection. WIREs Data Min & Knowl. 2011;1(1):73–9.