## The littlest database

### The problem:

Suppose you want to estimate allele frequencies for some DNA locus. How big should the database be? Sometimes N=100 individuals (200 alleles) is suggested as a practical size. But surely N=99 will do almost as well. And if that is so, why not N=98? And so on. Naturally the utility gradually diminishes as N becomes smaller. But for what value of N does the utility disappear completely? What is the absolute smallest database that is of any use at all? And what use is it?

N=0 can be useful. Suppose that analysis of a crime stain reveals two alleles, PQ. If a PQ suspect turns up, there is a definite amount of evidence against him, even with no information about frequencies at all. Reason: the alleles P and Q have some (unknown) frequencies in the population; call them p and q. Assuming Hardy-Weinberg proportions, a fraction 2pq of the population is PQ. Now,
(i) $p = \tfrac12 + (p - \tfrac12)$, and since $p + q \le 1$, (ii) $q \le 1 - p = \tfrac12 - (p - \tfrac12)$. Multiplying (i) and (ii) together gives

$$2pq \le 2\left(\tfrac14 - (p - \tfrac12)^2\right) \le \tfrac12,$$

i.e. at most ½ the population is PQ. If we can get the same result at each of 10 independent loci, then the suspect is narrowed down to at most 1 person in 1024 (that is, ½ multiplied by itself 10 times) who matches the stain. Not bad for no database at all!
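
The bound is easy to check numerically. The following is only a sketch, not part of the original argument: it scans a grid of allele frequencies with p + q ≤ 1 and confirms that 2pq never exceeds ½, then multiplies the bound across loci. The function names `max_heterozygote_frequency` and `match_bound` are hypothetical, chosen for illustration, and the multiplication across loci assumes the loci are independent.

```python
from fractions import Fraction


def max_heterozygote_frequency(steps: int = 100) -> Fraction:
    """Largest value of 2pq over a grid of frequencies with p + q <= 1.

    Exact rational arithmetic is used so the maximum is not blurred
    by floating-point rounding.
    """
    best = Fraction(0)
    for i in range(steps + 1):
        p = Fraction(i, steps)
        # q may take any grid value that keeps p + q <= 1.
        for j in range(steps + 1 - i):
            q = Fraction(j, steps)
            best = max(best, 2 * p * q)
    return best


def match_bound(n_loci: int) -> Fraction:
    """Upper bound on the matching fraction of the population when the
    stain is heterozygous at n_loci loci (assumed independent)."""
    return Fraction(1, 2) ** n_loci


print(max_heterozygote_frequency())  # 1/2, reached at p = q = 1/2
print(match_bound(10))               # 1/1024
```

The grid maximum sits at p = q = ½, which is exactly the case where the squared term in the inequality vanishes and no third allele exists.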