Intro to Statistics: Part 9: The Central Limit Theorem
In the previous article we introduced the concept of a sampling distribution. A sampling distribution is defined as the distribution of a summary statistic across multiple samples drawn from an underlying distribution. For example, if you take multiple samples from a random variable distribution, and compute a summary statistic for each sample, such as the sample mean, the collection of means you end up with (one mean from each sample) makes up the sampling distribution of the mean.
We looked at an example of a sampling distribution of the mean, where we generated multiple samples by repeatedly rolling a six-sided die. We observed that the sampling distribution of the mean resembled a normal distribution, centered around the expected value of the underlying distribution (where the underlying distribution is that of the six-sided die roll - a uniform distribution of discrete integers 1 - 6). We also observed that the variance of the sampling distribution got smaller as the sample size got bigger.
These results are consistent with what we'd intuitively expect, if we take a moment to think about it. Each sample drawn from the underlying distribution roughly approximates the shape and characteristics of the underlying distribution itself. The mean of each sample therefore roughly approximates the expected value of the underlying distribution. So it makes sense that the sampling distribution of the mean -- the collection of means calculated from the collection of samples -- would be clustered around the expected value of the underlying distribution.
It also makes sense that the variance of the sampling distribution would get smaller as the sample size gets bigger. As the sample size gets bigger, we'd expect the sample mean(s) to more closely approximate the expected value of the underlying distribution. As the sample mean(s) more closely approximate the expected value, the sampling distribution becomes more tightly clustered around the expected value, which is the same as saying the variance of the sampling distribution gets smaller.
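To make this intuition concrete, here is a minimal simulation sketch in base R. The helper name `simulate.sample.means`, the number of samples (10,000), and the seed are our own choices for illustration and are not taken from the previous article's code.

```r
# Simulate the sampling distribution of the mean for fair six-sided die rolls.
set.seed(42)  # arbitrary seed, used only to make the run reproducible

# Hypothetical helper: draw `n.samples` samples of size `sample.size`
# and return one sample mean per sample.
simulate.sample.means <- function(sample.size, n.samples = 10000) {
  replicate(n.samples, mean(sample(1:6, size = sample.size, replace = TRUE)))
}

means.n10 <- simulate.sample.means(10)
means.n30 <- simulate.sample.means(30)

# Both collections of means should be centered near E[X] = 3.5 ...
centers <- c(N10 = mean(means.n10), N30 = mean(means.n30))
# ... while the spread should be smaller for the larger sample size.
spreads <- c(N10 = var(means.n10), N30 = var(means.n30))

print(centers)
print(spreads)
```

If the reasoning above holds, both centers should land close to 3.5, and the variance for N=30 should come out at roughly a third of the variance for N=10.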
Enter the Central Limit Theorem
The Central Limit Theorem can essentially be summarized as follows. It states that, for practically any underlying distribution (strictly speaking, any distribution with a finite variance) and for independent samples of size N:
- The sampling distribution of the mean is approximately normally distributed, and the approximation gets better as the sample size N gets bigger
- The mean of the sampling distribution of the mean is equal to the expected value of the underlying distribution
- The variance of the sampling distribution of the mean is equal to the variance of the underlying distribution divided by the sample size
- The standard deviation of the sampling distribution of the mean is known as the standard error of the mean.
Let's re-state the CLT using typical symbolic notation. For a given underlying random variable X, the sample mean (which itself is a random variable) is typically denoted by an X with a bar over it, as shown in #1 below.
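Written out in the usual notation, the restatement looks roughly like this. This is a sketch using conventional symbols: μ and σ² for the expected value and variance of X, N for the sample size, and X_1, ..., X_N for the N independent draws that make up one sample. These symbols are assumptions on our part, chosen to match common textbook usage.

```latex
\begin{align}
\bar{X} &= \frac{1}{N}\sum_{i=1}^{N} X_i \tag{1}\\
E[\bar{X}] &= E[X] = \mu \tag{2}\\
\mathrm{Var}(\bar{X}) &= \frac{\mathrm{Var}(X)}{N} = \frac{\sigma^2}{N} \tag{3}\\
\bar{X} &\;\overset{\text{approx.}}{\sim}\; \mathcal{N}\!\left(\mu,\ \frac{\sigma^2}{N}\right) \quad \text{for large } N \tag{4}
\end{align}
```

Taking the square root of (3) gives the standard error of the mean, σ / sqrt(N).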
Applying the Central Limit Theorem to our example sampling distribution
Let's revisit the sampling distributions from our die-roll example. This time we'll overlay the density histogram of the sampling distribution with the theoretically expected normal distribution curve. According to the CLT, the normal distribution curve has mean = E[X] and variance = Var(X) / N, where X is the underlying random variable (the six-sided die roll) and N is the sample size (N=10 in the first chart, N=30 in the second). Recall that the variance of a discrete uniform distribution over n consecutive integers is given by Var(X) = (n^2 - 1) / 12, where n is the number of equally likely outcomes (not to be confused with the sample size N). For the six-sided die, n = 6, so Var(X) = 35/12, or roughly 2.92.
    # Note: See the previous article on sampling distributions for the code
    # that generates the example sampling distribution(s) used below
    library(ggplot2)

    N <- 30  # sample size used to generate sample.means

    # On ggplot2 >= 3.4, after_stat(density) and linewidth are the preferred
    # spellings of ..density.. and size
    ggplot(data=NULL, mapping=aes(x=sample.means)) +
        geom_histogram(aes(y=..density..),
                       binwidth=0.1, fill="yellow", colour="lightgray") +
        scale_x_continuous(breaks=1:6, limits=c(0.5,6.5)) +
        ggtitle("Density histogram of the sampling distribution of the mean,\nsample size N=30") +
        # dashed line at the expected value of a six-sided die roll
        geom_vline(xintercept=3.5, linetype="dashed", size=1, colour="blue") +
        xlab("Sample means") +
        ylab("Probability Density") +
        # normal curve predicted by the CLT: mean = E[X], sd = sqrt(Var(X) / N)
        stat_function(fun=dnorm,
                      args=list(mean=3.5, sd=sqrt((6^2-1)/12/N)),
                      size=1, colour="red")
As you can see, the normal distribution curve(s) fit almost perfectly over the sampling distribution(s).
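We can also check the fit numerically rather than by eye. The sketch below re-simulates the sample means in base R (the variable `sample.means.sim`, the 10,000 samples, and the seed are our own assumptions, not the previous article's code). It then compares the simulated spread, and the share of means within one and two standard errors of 3.5, against what a normal distribution predicts.

```r
set.seed(1)  # arbitrary seed for reproducibility
N <- 30      # sample size

# One sample mean per simulated sample of N die rolls
sample.means.sim <- replicate(10000, mean(sample(1:6, size = N, replace = TRUE)))

mu <- 3.5                        # E[X] for a fair six-sided die
se <- sqrt((6^2 - 1) / 12 / N)   # theoretical standard error, sqrt(Var(X) / N)

# Share of simulated means within 1 and 2 standard errors of mu
within.1se <- mean(abs(sample.means.sim - mu) <= se)
within.2se <- mean(abs(sample.means.sim - mu) <= 2 * se)

# What an exactly normal distribution would give
normal.1se <- pnorm(1) - pnorm(-1)
normal.2se <- pnorm(2) - pnorm(-2)

print(c(simulated.sd = sd(sample.means.sim), theoretical.se = se))
print(rbind(simulated = c(within.1se, within.2se),
            normal    = c(normal.1se, normal.2se)))
```

Because the sample mean of die rolls can only take values in steps of 1/N, the simulated shares will not match the normal values exactly, but they should be close.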
It's important to note that the Central Limit Theorem applies to (practically) any random variable with any type of distribution, as long as that distribution has a finite variance. In this example the underlying random variable is uniformly distributed, but we would get the same result (i.e. an approximately normal sampling distribution of the mean) if the underlying random variable were Poisson distributed or normally distributed or whatever. For example, check out my paper on applying the CLT to an exponentially distributed random variable that I wrote for my statistical inference class on Coursera.
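As a quick illustration with a non-uniform distribution, here is a minimal base R sketch using an exponential random variable. The rate (0.2), sample size (40), and number of samples are values we picked for the example; they are not taken from the paper mentioned above.

```r
set.seed(7)   # arbitrary seed for reproducibility
rate <- 0.2   # assumed rate: E[X] = 1/rate = 5, Var(X) = 1/rate^2 = 25
N <- 40       # assumed sample size

# One sample mean per simulated sample of N exponential draws
exp.means <- replicate(10000, mean(rexp(N, rate = rate)))

# CLT prediction versus what the simulation produced
theory   <- c(mean = 1 / rate, var = (1 / rate)^2 / N)
observed <- c(mean = mean(exp.means), var = var(exp.means))

print(rbind(theory, observed))
```

The exponential distribution is strongly skewed, so at moderate sample sizes the sampling distribution of the mean still carries a slight right skew. The mean and variance should nonetheless sit close to E[X] and Var(X) / N.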
Recap
- A random variable is described by the characteristics of its distribution
- The expected value, E[X], of a distribution is the weighted average of all outcomes, where each outcome is weighted by its probability
- The variance, Var(X), is the "measure of spread" of a distribution. It's calculated by taking the weighted average of the squared differences between each outcome and the expected value.
- The standard deviation of a distribution is the square root of its variance
- A probability density function for continuous random variables takes an outcome value as input and returns the probability density for the given outcome
- The probability of observing an outcome within a given range can be determined by computing the area under the curve of the probability density function within the given range.
- A probability mass function for discrete random variables takes an outcome value as input and returns the actual probability for the given outcome
- A sample is a subset of a population. Statistical methods and principles are applied to the sample's distribution in order to make inferences about the true distribution -- i.e. the distribution across the population as a whole
- A summary statistic is a value that summarizes sample data, e.g. the mean or the variance
- A sampling distribution is the distribution of a summary statistic (e.g. the mean) calculated from multiple samples drawn from an underlying random variable distribution
- The Central Limit Theorem states that, regardless of the underlying distribution (provided its variance is finite), the sampling distribution of the mean is approximately normally distributed for sufficiently large sample sizes, with mean equal to the underlying population mean and variance equal to the underlying population variance divided by the sample size
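To tie the recap back to the die-roll example, here is a small base R sketch that computes the expected value, variance, and standard deviation straight from their weighted-average definitions, followed by the standard error of the mean for N=30. The variable names are our own.

```r
outcomes <- 1:6            # faces of a fair six-sided die
probs <- rep(1 / 6, 6)     # probability mass function: each face is equally likely

# E[X]: weighted average of the outcomes
expected.value <- sum(outcomes * probs)                  # 3.5

# Var(X): weighted average of squared differences from E[X]
variance <- sum((outcomes - expected.value)^2 * probs)   # 35/12

# Standard deviation: square root of the variance
std.dev <- sqrt(variance)

# Standard error of the mean for samples of size N = 30
standard.error.n30 <- sqrt(variance / 30)

print(c(expected.value = expected.value, variance = variance,
        std.dev = std.dev, standard.error.n30 = standard.error.n30))
```

The variance computed this way agrees with the closed-form expression (6^2 - 1) / 12 for a discrete uniform distribution over six consecutive integers.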