A very important and useful concept in statistics is the Central Limit Theorem. There are essentially three things we want to learn about any distribution: 1) the location of its center; 2) its width; and 3) how it is distributed. The central limit theorem helps us approximate all three.
Central Limit Theorem: As sample size increases, the sampling distribution of sample means approaches a normal distribution with a mean the same as the population's and a standard deviation equal to the standard deviation of the population divided by the square root of n (the sample size).
Stated another way, if you draw simple random samples (SRS) of size n from any population whatsoever with mean μ and finite standard deviation σ, then when n is large, the sampling distribution of the sample means is close to a normal distribution with mean μ and standard deviation σ/√n. This normal distribution is often denoted by: N(μ, σ/√n).
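The theorem can be seen empirically with a short simulation. The sketch below (the population, sample size, and number of samples are all invented for illustration) draws many simple random samples from a decidedly non-normal population and compares the spread of the sample means against the σ/√n the theorem predicts:

```python
import random
import statistics

def sampling_distribution_of_mean(population, n, num_samples, seed=0):
    """Draw num_samples simple random samples of size n (without
    replacement) and return the list of their sample means."""
    rng = random.Random(seed)
    return [statistics.mean(rng.sample(population, n))
            for _ in range(num_samples)]

# A flat, non-normal population: the digits 0-9, each repeated 1000 times.
population = list(range(10)) * 1000
mu = statistics.mean(population)       # population mean, 4.5
sigma = statistics.pstdev(population)  # population standard deviation

means = sampling_distribution_of_mean(population, n=50, num_samples=2000)

# CLT prediction: the means center on mu and spread like sigma / sqrt(n).
print(round(statistics.mean(means), 2))
print(round(statistics.stdev(means), 3), round(sigma / 50 ** 0.5, 3))
```

Even though the population is uniform rather than normal, the distribution of the 2000 sample means is tightly centered on μ with a spread very close to σ/√n.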
Confidence Intervals/Margin of Error
The value σx̄ = σ/√n is often termed the standard error of the mean. It is used extensively to calculate the margin of error, which in turn is used to calculate confidence intervals.
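In code the standard error is a one-line computation; the function name and the IQ figures below are just illustrative:

```python
def standard_error(sigma, n):
    """Standard error of the mean: population standard deviation
    divided by the square root of the sample size."""
    return sigma / n ** 0.5

# IQ example: sigma = 15, sample size n = 25
print(standard_error(15, 25))  # → 3.0
```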
Remember, if we sample enough times, we will obtain a very reasonable estimate of both the population mean and population standard deviation. This is true whether or not the population is normally distributed. However, normally distributed populations are very common. Populations which are not normal are often "heap-shaped" or "mound-shaped". Some skewness might be involved (mean left or right of median due to a "tail") or those dreaded outliers may be present. It is good practice to check these concerns before trying to infer anything about your population from your sample.
Since 95.0% of a normally distributed population is within 1.96 (95% is within about 2) standard deviations of the mean, we can often calculate an interval around the statistic of interest which 95% of the time would contain the population parameter of interest. We will assume for sake of discussion that this parameter is the mean.
The margin of error is the standard error of the mean, σ/√n, multiplied by the appropriate z-score (1.96 for 95%).
A 95% confidence interval is formed as: estimate ± margin of error.
We can say we are 95% confident that the unknown population parameter lies within our given range. This is to say, the method we use will generate an interval containing the parameter of interest 95% of the time. For life-and-death situations, 99% or higher confidence intervals may quite appropriately be chosen.
Example: Assume the population is the U.S. population with a mean IQ of 100 and a standard deviation of 15. Assume further that we draw a sample of n=5 with the following values: 100, 100, 100, 100, 150. The sample mean is then 110, and the sample standard deviation is easily calculated as √((10² + 10² + 10² + 10² + 40²)/(5−1)) = √500, or approximately 22.4. The standard error of the mean is √500/√5 = √100 = 10. Our 95% confidence interval is then formed with z = ±1.96. Thus, based on this sample, we can be 95% confident that the population mean lies between 110−19.6 and 110+19.6, or in (90.4, 129.6). Suppose, however, that you did not know the population standard deviation. Then, since this is also a small sample, you would use the t-statistic. The t-value of 2.776 would give you a margin of error of 27.8 and a corresponding confidence interval of (82.2, 137.8).
Finite Population Correction Factor
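The arithmetic in the example can be checked with a short script (the t critical value 2.776 for 4 degrees of freedom is taken directly from the example):

```python
import math

sample = [100, 100, 100, 100, 150]
n = len(sample)
xbar = sum(sample) / n                          # sample mean: 110
dev_sq = [(x - xbar) ** 2 for x in sample]      # squared deviations
s = math.sqrt(sum(dev_sq) / (n - 1))            # sqrt(500) ≈ 22.4
se = s / math.sqrt(n)                           # standard error: 10

# z-interval at 95% confidence:
z = 1.96
z_low, z_high = xbar - z * se, xbar + z * se    # (90.4, 129.6)

# t-interval, sigma unknown, df = n - 1 = 4, critical value 2.776:
t = 2.776
t_low, t_high = xbar - t * se, xbar + t * se    # (82.24, 137.76)

print(round(z_low, 1), round(z_high, 1))
print(round(t_low, 1), round(t_high, 1))
```

Note the t-based interval is noticeably wider, reflecting the extra uncertainty from estimating the standard deviation out of only five observations.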
The finite population correction factor is: √((N−n)/(N−1)).
If you are sampling without replacement and your sample size is more than, say, 5% of the finite population (N), you need to adjust (reduce) the standard error by multiplying it by the finite population correction factor as specified above. If we can assume that the population is infinite or that our sample size does not exceed 5% of the population size (or we are sampling with replacement), then there is no need to apply this correction factor.
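The rule above can be sketched as follows (the function names and the 5% cutoff implemented here follow the rule of thumb in the text; the numbers in the usage line are invented):

```python
import math

def fpc(N, n):
    """Finite population correction factor: sqrt((N - n) / (N - 1))."""
    return math.sqrt((N - n) / (N - 1))

def corrected_standard_error(sigma, n, N):
    """Standard error of the mean, reduced by the FPC when the
    sample (drawn without replacement) exceeds 5% of the population."""
    se = sigma / math.sqrt(n)
    if n / N > 0.05:        # the 5% rule of thumb from the text
        se *= fpc(N, n)
    return se

# Sampling 100 from a population of 500 (20%, well over 5%),
# so the correction applies and shrinks the standard error:
print(round(corrected_standard_error(15, 100, 500), 3))
```

For very large N relative to n the factor approaches 1, which is why it can be ignored when the population is effectively infinite.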