by even
Last Updated June 14, 2017 07:19 AM

Long ago I learnt that a normal distribution was necessary to use a two-sample t-test. Today a colleague told me that she learnt that for N > 50 normality is not necessary. Is that true?

If true is that because of the central limit theorem?

**Normality assumption of a t-test**

Consider a large population from which you could take many different samples of a particular size. (In a particular study, you generally collect just one of these samples.)

The t-test assumes that the means of the different samples are normally distributed; it does not assume that the population is normally distributed.

By the central limit theorem, means of samples from a population with finite variance approach a normal distribution regardless of the distribution of the population. Rules of thumb say that the sample means are basically normally distributed as long as the sample size is at least 20 or 30. For a t-test to be valid on a sample of smaller size, the population distribution would have to be approximately normal.

The t-test is invalid for small samples from non-normal distributions, but it is valid for large samples from non-normal distributions.

**Small samples from non-normal distributions**

As Michael notes below, the sample size needed for the distribution of means to approximate normality depends on the degree of non-normality of the population. For an approximately normal population, you won't need as large a sample as for a very non-normal one.

Here are some simulations you can run in R to get a feel for this. First, here are a couple of population distributions.

```
curve(dnorm,xlim=c(-4,4)) #Normal
curve(dchisq(x,df=1),xlim=c(0,30)) #Chi-square with 1 degree of freedom
```

Next are some simulations of samples from the population distributions. In each of these lines, "10" is the sample size, "100" is the number of samples and the function after that specifies the population distribution. They produce histograms of the sample means.

```
hist(colMeans(sapply(rep(10,100),rnorm)),xlab='Sample mean',main='')
hist(colMeans(sapply(rep(10,100),rchisq,df=1)),xlab='Sample mean',main='')
```

For a t-test to be valid, these histograms should look normal. Normal QQ plots make that easier to judge:

```
require(car)
qqp(colMeans(sapply(rep(10,100),rnorm)),xlab='Sample mean',main='')
qqp(colMeans(sapply(rep(10,100),rchisq,df=1)),xlab='Sample mean',main='')
```

**Utility of a t-test**

I have to note that all of the knowledge I just imparted is somewhat obsolete; now that we have computers, we can do better than t-tests. As Frank notes, you probably want to use Wilcoxon tests anywhere you were taught to run a t-test.

In my experience with just the one-sample t-test, I have found that the *skew* of the distribution matters more than the kurtosis, say. For non-skewed but fat-tailed distributions (a t with 5 degrees of freedom, a Tukey h-distribution with $h=0.24999$, etc.), I have found that a sample size of 40 has always been sufficient to get an empirical type I rate near the nominal one. When the distribution is very skewed, however, you may need a much larger sample.
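As a rough check of the fat-tailed symmetric case, here is a small simulation. It is my own sketch in Python (numpy/scipy) rather than R, and the population choice ($t$ with 5 df, true mean 0) is just one example of a non-skewed, fat-tailed distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps, alpha = 40, 4000, 0.05

# Fat-tailed but symmetric population: Student t with 5 df, true mean 0.
# Count how often the one-sample t-test rejects a true null at level 0.05.
rejections = sum(
    stats.ttest_1samp(rng.standard_t(df=5, size=n), popmean=0).pvalue < alpha
    for _ in range(reps)
)
print(rejections / reps)  # empirical type I rate, near the nominal 0.05
```

With a skewed population in place of `standard_t`, the same loop shows the empirical rate drifting away from 0.05, as the lottery example below illustrates.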

For example, suppose you were playing the lottery. With probability $p = 10^{-4}$ you will win 100 thousand dollars, and with probability $1-p$ you will lose one dollar. If you perform a t-test for the null that the mean return is ~~zero~~ based on a sample of one thousand draws of this process, I don't think you are going to achieve the nominal type I rate.

**edit**: duh, per @whuber's catch in the comment, the example I gave did not have mean zero, so testing for mean zero has nothing to do with the type I rate.
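(For the record, the expected return per draw in that example is

$$E[X] = 10^{-4}\cdot 10^{5} - (1-10^{-4})\cdot 1 = 10 - 0.9999 = 9.0001,$$

nowhere near zero.)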

Because the lottery example often has a sample standard deviation of zero, the t-test chokes. So instead, I give a code example using Goerg's Lambert W x Gaussian distribution. The distribution I use here has a skew of around 1355.

```
#hey look! I'm learning R!
library(LambertW)
Gauss_input = create_LambertW_input("normal", beta=c(0,1))
params = list(delta = c(0), gamma = c(2), alpha = 1)
LW.Gauss = create_LambertW_output(input = Gauss_input, theta = params)
#get the moments of this distribution
moms <- mLambertW(beta=c(0,1),distname=c("normal"),delta = 0,gamma = 2, alpha = 1)
test_ttest <- function(sampsize) {
samp <- LW.Gauss$rY(params)(n=sampsize)
tval <- t.test(samp, mu = moms$mean)
return(tval$p.value)
}
#to replicate randomness
set.seed(1)
pvals <- replicate(1024,test_ttest(50))
#how many rejects at the 0.05 level?
print(sum(pvals < 0.05) / length(pvals))
pvals <- replicate(1024,test_ttest(250))
#how many rejects at the 0.05 level?
print(sum(pvals < 0.05) / length(pvals))
pvals <- replicate(1024,test_ttest(1000))
#how many rejects at the 0.05 level?
print(sum(pvals < 0.05) / length(pvals))
pvals <- replicate(1024,test_ttest(2000))
#how many rejects at the 0.05 level?
print(sum(pvals < 0.05) / length(pvals))
```

This code gives the empirical reject rate at the nominal 0.05 level for different sample sizes. For sample of size 50, the empirical rate is 0.40 (!); for sample size 250, 0.29; for sample size 1000, 0.21; for sample size 2000, 0.18. Clearly the one-sample t-test suffers from skew.

See my previous answer to a question on the robustness of the t-test.

In particular, I recommend playing around with the onlinestatsbook applet.

The image below is based on the following scenario:

- null hypothesis is true
- fairly severe skewness
- same distribution in both groups
- same variance in both groups
- sample size per group 5 (i.e., much less than 50 as per your question)
- I pressed the 10,000 simulations button about 100 times to get up to over one million simulations.

The simulation suggests that instead of the nominal 5% Type I error rate, I was getting only about 4.5% Type I errors.

Whether you consider this robust depends on your perspective.

The central limit theorem establishes (under the required conditions) that the numerator of the t-statistic is asymptotically normal. The t-statistic also has a denominator. To have a t-distribution, you'd need the denominator to be independent of the numerator and to be the square root of a chi-square divided by its degrees of freedom.

And we *know* it won't be independent (independence of the sample mean and variance characterizes the normal!).

Slutsky's theorem combined with the CLT would give you that the t-statistic is asymptotically normal (but not necessarily at a very useful rate).
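In symbols: write the one-sample statistic as

$$T = \frac{\bar X - \mu}{S/\sqrt{n}} = \frac{\sqrt{n}\,(\bar X - \mu)/\sigma}{S/\sigma}.$$

The CLT makes the numerator asymptotically $N(0,1)$, the law of large numbers gives $S/\sigma \to 1$ in probability, and Slutsky's theorem combines the two to make $T$ asymptotically $N(0,1)$; but none of this produces a $t$-distribution at finite $n$.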

What theorem would establish that the t-statistic is approximately t-distributed when there's non-normality, and how fast it comes in?

The central limit theorem is less useful than one might think in this context. First, as someone pointed out already, one does not know if the current sample size is "large enough". Secondly, the CLT is more about achieving the desired type I error than about type II error. In other words, the t-test can be uncompetitive power-wise. That's why the Wilcoxon test is so popular. If normality holds, it is 95% as efficient as the t-test. If normality does not hold it can be arbitrarily more efficient than the t-test.
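To illustrate the power point, here is a quick simulation; it is my own sketch in Python (numpy/scipy), not from the original answer, comparing the two-sample t-test against the Wilcoxon rank-sum (Mann-Whitney) test when a heavily skewed distribution is shifted in location:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps, alpha, shift = 25, 2000, 0.05, 0.5

t_rej = w_rej = 0
for _ in range(reps):
    x = rng.lognormal(size=n)          # heavily right-skewed population
    y = rng.lognormal(size=n) + shift  # same shape, shifted location
    t_rej += stats.ttest_ind(x, y).pvalue < alpha
    w_rej += stats.mannwhitneyu(x, y).pvalue < alpha

# Empirical power of each test at the 0.05 level
print(t_rej / reps, w_rej / reps)
```

Under this kind of skew, the rank-based test rejects the false null far more often than the t-test, consistent with the efficiency claims above.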

Yes, the Central Limit Theorem tells us this is true. So long as you avoid extremely heavy-tailed distributions, non-Normality presents no problems in moderate-to-large samples.

Here's a helpful review paper:

http://www.annualreviews.org/doi/pdf/10.1146/annurev.publhealth.23.100901.140546

The Wilcoxon test (mentioned by others) can have terrible power when the alternative is not a location shift of the original distribution. Furthermore, the way it measures differences between distributions is not transitive.

Regarding the use of the Wilcoxon-Mann-Whitney test as an alternative, I recommend the paper "The Wilcoxon-Mann-Whitney test under scrutiny".
