## When do we use "N" and when "N-1" when we calculate the variance of a sample with size N?

I could not understand the following:

We have the whole population of people. We sample N=100 people and calculate the sample mean.

1. When we calculate the variance of the sample, do we divide the sum of squares by N or by N-1? If we use N-1, why do we use it?
2. Is the variance of the sample the actual or the estimated variance, and is there a difference between them?
3. We want to estimate the 95% confidence interval for the true population mean using the information from the sample. In the denominator we use sqrt(N). But when we calculate the standard deviation (the numerator), do we divide the sum of squares by N or by N-1?

You can find a lengthy discussion of the small-sample N/(N-1) multiplier in Correction Factor. Briefly, we use it when estimating the population variance from the sample. If we want just the sample variance, or if we are computing the population variance from the population itself, there is no need for the correction factor.

When speaking of the variance of the sample, we compute the average squared deviation from the sample mean. It is not an estimate. (Unless, of course, we are theoreticians proving theorems about estimates of the sample variance! :-) In this course, if we speak of estimates they are usually estimates of population parameters. However, sometimes Sebastian will ask you to estimate something about the sample statistics rather than the population statistics.

For N>30, it makes little difference whether we do a correction or not, which is why Sebastian is bypassing discussion of it in Unit 24. To be rigorous, use N-1 in computing the variance estimate if you are averaging squared deviations from the sample mean; use N if using the true population mean. Those are unbiased estimators of the population variance. (The sample standard deviation, however, is not an unbiased estimator of the population standard deviation. See Unbiased estimation of standard deviation if you really want to get into it.)
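As a minimal sketch of the two divisors (the data values here are just illustrative), Python's statistics module exposes both conventions: pvariance divides by N and also accepts a known population mean, while variance divides by N-1.

```python
import statistics

data = [2.5, 3.1, 2.8, 3.6, 2.9]   # hypothetical sample values

biased = statistics.pvariance(data)    # sum of squares / N
unbiased = statistics.variance(data)   # sum of squares / (N - 1)

# If the true population mean is known, pass it in and keep the N divisor,
# matching the "use N with the true population mean" rule above:
with_known_mean = statistics.pvariance(data, mu=3.0)

assert unbiased > biased   # the N/(N-1) factor broadens the estimate
```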

Kenneth I. L...

Thanks for the answer! But then it is best to always use the unbiased estimator, right? I mean, if we are interested in the real population parameter.

(24 Jul '12, 17:13)

I just saw the second answer and it answered my question when to use the maximum likelihood estimator and when the unbiased estimator, so disregard my previous post. Thanks, guys!

(24 Jul '12, 17:16)

@Ken quote: "To be rigorous, use N-1 in computing the variance estimate if you are averaging squared deviations from the sample mean; use N if using the true population mean. Those are maximum likelihood estimators of the population variance."

I believe this is not entirely correct. See my earlier post. With N-1 you do have an unbiased estimator, but it is not the maximum likelihood estimator. Using N instead gives you the maximum likelihood estimator, which however is (slightly) biased.

(24 Jul '12, 17:22)

Again, in practice the difference is negligible.

Theoretical perspective: if you use a maximum likelihood estimator, you have the best chance of coming up with the right parameter. But if you use an unbiased estimator and do many estimates, you're right on average.

If you draw a graph of the distribution of the estimator, the maximum likelihood estimate sits at its highest point, while an unbiased estimator gives you its mean value.

(24 Jul '12, 17:33)

Hi, MrBB. You caught me! I posted my answer, saw yours that was posted minutes before, and changed my ML to "unbiased," all within about thirty seconds.

I agree that N gives a maximum likelihood but biased estimate if using the sample mean to estimate the population variance. It is biased because the sample clusters more closely around its own mean than it does around the population mean, so we have to "give up one degree of freedom" and broaden our estimate by N/(N-1) to compensate for the bias.

If we know and use the true population mean, however -- which is what I said in my post -- the computed sample variance already includes that extra spread. We don't have to "give up a degree of freedom" and don't have to use a correction factor. I haven't verified that this is the MLE solution, but it seems correct -- and the N-1 version seems definitely incorrect for this case.

For non-statisticians trying to follow this: Remember that the maximum likelihood parameter estimates are those that [jointly] give the greatest probability of drawing the observed sample values, over the space of any parameter values that we could have chosen. This is similar to a Bayes' Rule computation: we start with observed data, but we are estimating the parameter that best explains how that data might have been generated.

Unfortunately, the MLE is not always an unbiased estimator. (It unbiasedly estimates whatever it estimates, of course, but that may differ slightly on average from the mean or variance of a Gaussian or other source population.) Sometimes we prefer the MLE, sometimes an unbiased estimator, sometimes some other kind of estimator. Often it makes little difference; in other cases we choose one that is easy to compute; but for very difficult problems it may be necessary to choose just the right theoretical constructs.
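As a numerical illustration of that definition (a sketch, not from the thread; the sample and the crude grid search are my own): for Gaussian data with the mean fixed at the sample mean, the log-likelihood as a function of the variance peaks at the divide-by-N value, not the divide-by-(N-1) value.

```python
import math

sample = [2, 1, 1, 3, 1]                    # hypothetical sample
n = len(sample)
xbar = sum(sample) / n
ss = sum((x - xbar) ** 2 for x in sample)   # sum of squared deviations

def log_likelihood(var):
    # Gaussian log-likelihood of the sample, mean fixed at xbar
    return -0.5 * n * math.log(2 * math.pi * var) - ss / (2 * var)

# Crude grid search over candidate variances 0.01 .. 3.00
grid = [k / 100 for k in range(1, 301)]
mle = max(grid, key=log_likelihood)
# mle lands on ss / n (= 0.64 here), not on ss / (n - 1) = 0.8
```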

(24 Jul '12, 18:38)

Hi, Radoslav! As you suggest, it's usually best to use the unbiased estimator -- except for theoretical work in which you wish to maintain the MLE property throughout a series of computations. However, note that you don't use the N/(N-1) correction if computing the variance of the population itself. (Nor would you use it if reporting the sample variance, independent of estimating the population variance.)

Software developers and calculator designers have sometimes built in the small-sample correction and sometimes left it out. Usually they leave it out, unless they can provide both. (The correction causes a division by zero for samples of size 1, which is rather inconvenient.) In any case, it's something we need to be aware of when dealing with other people's work. One more rabbit hole to trip into.

(24 Jul '12, 18:56)

So guys, let me try to get it right:

We have a population with size N = 30. In this population there are only 3 values - 1, 2, 3. Each of these values has equal representation in the population. Therefore, from these 30 points, 10 have the value of 1, 10 have the value of 2, and 10 have the value of 3.
Obviously, the true population mean mu is equal to 2.
Therefore, the true population variance is equal to (10*((1 - 2)^2 + (2 - 2)^2 + (3 - 2)^2))/30 = (1 + 0 + 1)/3 = 2/3.

Assume that we do not know anything about the population. We want to find the true population variance. We take a sample of 5 data points and it turns out to be 2,1,1,3,1.
Sample average xbar = (2+1+1+3+1)/5 = 8/5
Sample variance (biased): ((2-8/5)^2 + (1-8/5)^2 + (1-8/5)^2 + (3-8/5)^2 + (1-8/5)^2)/5 = 0.64
Sample variance (unbiased): ((2-8/5)^2 + (1-8/5)^2 + (1-8/5)^2 + (3-8/5)^2 + (1-8/5)^2)/(5-1) = 0.8

See that in this case the biased variance is closer to the true population variance than the unbiased.

1. If by any chance we know the true population mean mu=2, we can calculate an estimate of the true population variance sigma^2.

sigma^2 = ((2-2)^2 + (1-2)^2 + (1-2)^2 + (3-2)^2 + (1-2)^2)/5 = 4/5

Now, we see that the estimate of 4/5 is NOT equal to the true population variance of 2/3, but is close.

QUESTION 1: When we know the true population mean mu and we have a sample, is it best to disregard the sample mean and variance and just use the squared differences between the sample data points and the population mean?

2. If we do NOT know the true population mean mu=2, we have the option to use the BIASED or UNBIASED sample variance.

However, imagine that we can take only ONE sample. If it turns out that the sample variance is already close to the true population variance, multiplying by n/(n-1) will make the estimate EVEN LARGER than the true population variance, as in the example above. If we have only two data points, using n-1 doubles the estimate!

QUESTION 2: When we estimate the true population variance using the UNBIASED variance of a SINGLE sample, don't we run a large risk of getting an unrealistically large estimate?

QUESTION 3: However, if we use the BIASED variance of a SINGLE sample, I realize that the sample variance is generally going to be lower than the true population variance. For what sample sizes is it better to use the biased and for what sample sizes the unbiased? If you have a sample size of 5, which estimator would you prefer to use - the biased or the unbiased? Why?
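For what it's worth, the arithmetic in the example above can be checked with a short script (population and sample exactly as described):

```python
# Population: ten 1's, ten 2's, ten 3's
population = [1] * 10 + [2] * 10 + [3] * 10
mu = sum(population) / len(population)                   # 2.0
sigma2 = sum((x - mu) ** 2 for x in population) / 30     # 2/3

sample = [2, 1, 1, 3, 1]
xbar = sum(sample) / 5                                   # 8/5
ss = sum((x - xbar) ** 2 for x in sample)
biased = ss / 5        # 0.64
unbiased = ss / 4      # 0.8

# Using the known population mean with divisor N:
known_mean_estimate = sum((x - mu) ** 2 for x in sample) / 5   # 4/5
```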

(25 Jul '12, 08:13)

I think in your first example you just showed that if you had used the true population mean as the pivot to determine your sample variance you get the unbiased sample variance, or something close to it. (Note that what you are computing in your first example is actually the 'variance' of your sample w.r.t. the true population mean. It's not the true population variance which remains at 2/3.)

When your sample is small relative to the number of distinct values the population takes (here just 3), there is no guarantee of what results you'll get! At the very least, your sample size should be large enough to be somewhat capable of being representative. But yes, you do not know this beforehand.

Moral being that with restricted samples you can pretty much blow your confidence intervals out of the water. Anything is possible then.

(25 Jul '12, 10:20)

I think everything you do is correct (though I didn't check each calculation in detail).

Q1: you can best disregard the sample mean and indeed use mu instead, with N=5. Now you have an estimator that is maximum likelihood and unbiased at the same time: best of both worlds.
Q2: it is inherent to estimating that your estimate can be off. You can estimate too high or too low (if you e.g. sample five 3's, you will estimate mu too high and sigma too low). By using the unbiased estimator, you are guaranteed that the EXPECTATION of your estimator is right. Still, in practice you will always be off, too low or too high, and only if you're lucky are you spot on. By choosing the number of samples (the N) large, you make the chance of being off by more than a given amount small. If you use N = sample size, you will "on average" (actually in expectation) underestimate sigma. So the answer is yes, but not because you use N-1 instead of N.
Q3: there is no golden rule, I think. Using the unbiased estimator gives you an unbiased estimate. The biased one gives you the maximum likelihood estimate (so you have the biggest chance to be right, but on average you estimate too low). In general, with N=5 your estimate is lousy anyway (large confidence interval) because of the very small number of data points.

Last point, but I guess that speaks for itself: from one specific sample outcome (like the one you use in your calculations), you cannot derive general conclusions. There are samples where the ML estimator gives you the best result and samples where the unbiased estimator gives the best result. In statistics it is all about averages and expectations....
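That "averages and expectations" point can be checked with a quick simulation (a sketch with an assumed setup: samples of 5 drawn with replacement from the 1/2/3 population). Averaging the N-1 estimator over many samples approaches the true variance 2/3, while the N estimator settles below it.

```python
import random

random.seed(0)
population = [1] * 10 + [2] * 10 + [3] * 10   # true variance = 2/3
trials = 100_000
total_biased = total_unbiased = 0.0
for _ in range(trials):
    sample = [random.choice(population) for _ in range(5)]
    m = sum(sample) / 5
    ss = sum((x - m) ** 2 for x in sample)
    total_biased += ss / 5      # divide by N
    total_unbiased += ss / 4    # divide by N - 1

# total_unbiased / trials tends toward 2/3
# total_biased / trials tends toward (4/5) * (2/3), about 0.533
```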

(25 Jul '12, 10:35)

I'll leave the discussion to younger, clearer heads. We are getting into graduate level issues, best addressed with formulas and proofs.

In discussing single samples, note that "sample" is ambiguous. It can refer to either a single draw from a random population or to a set of such draws collected together. To reduce the ambiguity, we can speak of "a single draw" or "a sample of one." "Single sample" then refers to a collection of draws, although I wouldn't trust the phrase.

If you use the sample mean and compute the N/(N-1) unbiased population variance estimate for a sample of one, you get a division by zero. In other words, it can't be done. A statistician might say that you have no degree of freedom with which to compute the variance estimate. If you have a sample of two, you can estimate the population variance but it will be a lousy estimate. At these extremely small sample sizes, we get into the Hiawatha problem of estimators missing the target completely.
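The division-by-zero point shows up directly in Python's statistics module, which refuses the N-1 estimator for a sample of one:

```python
import statistics

statistics.pvariance([5.0])       # 0.0 -- the divisor N = 1 still works
try:
    statistics.variance([5.0])    # divisor N - 1 = 0: no degrees of freedom
except statistics.StatisticsError:
    print("variance requires at least two data points")
```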

Roshan is correct that any observed sample is likely to be misleading if it is too small to represent the variability in the population -- though I wouldn't suggest that samples from continuous distributions need to contain an infinite number of draws. For known distributions, it is possible to calculate the sample size needed to achieve specific confidence levels. For general work, statisticians have rules of thumb about sample sizes they would recommend. Typically a sample of 20 is considered acceptable, and typically a statistician -- or at least a statistics student -- will fudge that down to about 5 if necessary.

(25 Jul '12, 12:47)

This forum is extremely cool :) Thanks to all for the great input!

(25 Jul '12, 15:29)


There is a famous tale in free verse concerning Hiawatha shooting arrows. The rather overstated lesson is that one may have to choose between a biased process that generally hits to one side of a bulls-eye and one that is perfectly centered and unbiased but so widely scattered that arrows seldom hit the target board at all. See Hiawatha Designs an Experiment by English statistician Sir Maurice George Kendall (1907 - 1983).

Kenneth I. L...

nice analogy! great story too

(19 Aug '12, 20:28)

1. First of all, with N=100, using N or N-1 makes only a small difference. Why would you use N? It gives you the maximum likelihood estimator (Sebastian explained what that means). Why would you use N-1? It gives you an unbiased estimator. Unbiased means that the expectation of the estimator is equal to the parameter (sigma^2) you are estimating. Unfortunately you can't have both at the same time, but fortunately in most practical cases it is just a theoretical difference.
2. It is the actual variance of the sample (strictly, only when you use N; with N-1 it is not exactly the variance of the sample), which is probably a good (either maximum likelihood or unbiased; see above) estimator of the variance of the population.
3. See 1. Both are possible. Both are valid estimators and neither is perfect. For large N you shouldn't worry, but it is good to understand the pros of either method.
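A sketch of the question-3 recipe, assuming the large-N normal approximation (z = 1.96) and the N-1 standard deviation in the numerator (the data values are hypothetical):

```python
import math

def mean_ci_95(data):
    n = len(data)
    xbar = sum(data) / n
    # N-1 in the standard deviation (numerator)
    s = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))
    half_width = 1.96 * s / math.sqrt(n)   # s / sqrt(N) in the denominator
    return xbar - half_width, xbar + half_width

low, high = mean_ci_95([2, 1, 1, 3, 1])
# the interval is centered on the sample mean 1.6
```

For a sample this small, a Student-t multiplier (2.776 for 4 degrees of freedom) would be more appropriate than 1.96; the course's large-N recipe is shown here.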

mrBB-4
