## Lecture #1 errata

However, suppose that, from a sample of *n* objects, we
want to predict the standard deviation of the population from which
they come. Then, the above formula is biased. (That much I got
right.)

The easiest way to see the problem is to consider the case
*n*=1. In that case the formula amounts to the square root of
0/1, which is 0. That can't be right!

The way to fix the problem is to divide by *n*-1 rather
than by *n*. Then, when *n*=1, you get the
indeterminate form 0/0, which is reasonable in that it correctly
represents the fact that a sample of only one element gives no
information whatever as to the standard deviation of a population.
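
As a rough check of the *n* versus *n*-1 point, here is a minimal Python sketch (not from the lecture). It assumes a hypothetical toy population of just two values, 0 and 1, and averages each estimator over every equally likely sample of size 2 drawn with replacement. The comparison is done on the variance (the square of the standard deviation), where the effect of the divisor is easiest to check exactly. The standard-library functions `statistics.pvariance` and `statistics.variance` divide by *n* and by *n*-1 respectively; the helper names are my own.

```python
from itertools import product
from statistics import StatisticsError, mean, pvariance, variance

# Hypothetical toy population, chosen only for illustration.
population = [0, 1]
true_variance = pvariance(population)  # variance of the whole population


def average_over_all_samples(estimator, n):
    """Average an estimator over every equally likely sample of size n
    (drawn with replacement from the toy population)."""
    return mean(estimator(sample) for sample in product(population, repeat=n))


n = 2
avg_divide_by_n = average_over_all_samples(pvariance, n)          # divisor n
avg_divide_by_n_minus_1 = average_over_all_samples(variance, n)   # divisor n-1


def one_element_sample_is_undefined():
    """With divisor n-1, a single observation gives 0/0; the library refuses it."""
    try:
        variance([7])
    except StatisticsError:
        return True
    return False
```

Under these assumptions the divide-by-*n* version averages 0.125, half the true value of 0.25, while the divide-by-(*n*-1) version averages 0.25. For a one-element sample, `pvariance([7])` returns 0, whereas `variance([7])` raises an error rather than report a number.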

I incorrectly gave *n*+1 instead of *n*-1 in the
heat of the lecture, which is nonsense. (Did I also omit the
symbol?)

Suppose that, at a particular locus, nature provides two alleles,
*A*_{1} and *A*_{2}, with frequencies
*p* and *q*=1-*p*, which, unknown to us, are
½. We assume Hardy-Weinberg equilibrium (HWE) for the locus in
question, so the rate of heterozygosity, *h*, is also ½, although
we don't know that number either.

We propose to estimate *h* = the rate of heterozygosity, by
a sampling experiment. The experiment consists of examining a sample
of size *n*. Now, if we were to calculate the proportion of
heterozygotes in the sample (by counting them and dividing by
*n*), we would get an estimate of the population
heterozygosity that would be sometimes too large and sometimes too
small, but on the average would be just right. If we did that, there
would be no story.

What we decide to do instead is to take advantage of our
assumption of HWE and to estimate the population heterozygosity based
on the allele frequencies in the sample. (In principle this should
give a more accurate estimate of *h*.)

However, this idea will give a biased answer if we apply it in the
obvious way. To see the problem, let's consider as an example the
case of *n*=1: a sample of only one person.

The sample will have one of three genotypes:

- Case *A*_{1}*A*_{1} – In this case, which occurs 1/4 of the time, we would estimate the gene frequencies as *p*=1 and *q*=0, hence *h*=2*pq*=0. An underestimate.

- Case *A*_{1}*A*_{2} – In this case, we would estimate the gene frequencies correctly, and hence would estimate *h* correctly as well.

- Case *A*_{2}*A*_{2} – This case would work out the same as the first case – underestimate *h*.
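
To put a number on the three cases: assuming the HWE genotype frequencies 1/4, 1/2, 1/4 that follow from *p*=*q*=½, and writing $\hat{h}$ (my notation, not the lecture's) for the allele-based estimate, the average over the three cases works out as

$$
E[\hat{h}] \;=\; \tfrac{1}{4}\cdot 0 \;+\; \tfrac{1}{2}\cdot\tfrac{1}{2} \;+\; \tfrac{1}{4}\cdot 0 \;=\; \tfrac{1}{4} \;<\; \tfrac{1}{2} \;=\; h .
$$

So with a sample of one person, the estimate averages only half the true heterozygosity.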

That proves nothing, because I have only examined the case
*n*=1. However, I hope it does give some insight into what
might happen (and what I claim *does* happen) for larger
sample sizes and for other values of *p*. When the sample size
is larger, there will be some samples that result in an overestimate
of *h*. But the more typical situation is the one illustrated
here: on average, *h* is underestimated.
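
The claim for larger samples can be checked exactly with a short Python sketch. This is only an illustration under stated assumptions: the 2*n* alleles in a sample of *n* people are treated as independent draws, each *A*_{1} with probability *p* (HWE plus random sampling), and the function names are hypothetical, not anything from the lecture. For comparison it also computes the average of the direct heterozygote count.

```python
from fractions import Fraction
from math import comb


def expected_allele_based_estimate(n, p):
    """Exact average of 2*p_hat*q_hat over all samples of n people.

    Assumption: the 2n sampled alleles are independent, each A1 with
    probability p, so the A1 count is binomial with 2n trials.
    """
    p = Fraction(p)
    q = 1 - p
    total = Fraction(0)
    for k in range(2 * n + 1):  # k = number of A1 alleles in the sample
        prob = comb(2 * n, k) * p**k * q ** (2 * n - k)
        p_hat = Fraction(k, 2 * n)
        total += prob * 2 * p_hat * (1 - p_hat)
    return total


def expected_direct_count_estimate(n, h):
    """Exact average of (heterozygotes counted)/n, assuming each person
    is independently heterozygous with probability h."""
    h = Fraction(h)
    return sum(
        comb(n, k) * h**k * (1 - h) ** (n - k) * Fraction(k, n)
        for k in range(n + 1)
    )


half = Fraction(1, 2)
true_h = 2 * half * half  # 2pq with p = q = 1/2
allele_based = {n: expected_allele_based_estimate(n, half) for n in (1, 2, 5, 50)}
direct_count = {n: expected_direct_count_estimate(n, true_h) for n in (1, 2, 5, 50)}
```

With *p*=½ the allele-based averages come out as 1/4, 3/8, 9/20 and 99/200 for *n* = 1, 2, 5 and 50: always below ½, with the shortfall shrinking as *n* grows. Under the independence assumption these values agree with 2*pq*(1 - 1/(2*n*)). The direct-count averages are exactly ½ for every *n*.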
