Littlest database poser

The littlest database

The problem:

Suppose you want to estimate allele frequencies for some DNA locus. How big should the database be? Sometimes N=100 individuals (200 alleles) is suggested as a practical size. But surely N=99 will do almost as well. And if that is so, why not N=98? And so on. Naturally the utility gradually diminishes as N becomes smaller. But for what value of N does the utility disappear completely? What is the absolute smallest database that is of any use at all? And what use is it?

N=0 can be useful. Suppose that analysis of a crime stain reveals two alleles, PQ. If a PQ suspect turns up, there is a definite amount of evidence against him, even with no information about frequencies at all. Reason: The alleles P and Q have some (unknown) frequencies in the population; call them p and q. Now,

(i) p = ½ + (p − ½) and

p + q ≤ 1, so

(ii) q ≤ 1 − p = ½ − (p − ½); hence, multiplying together (i) and (ii),

2pq ≤ 2(¼ − (p − ½)²) ≤ ½,

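i.e. at most ½ the population is PQ (taking the frequency of the PQ genotype to be 2pq, as it is under Hardy-Weinberg proportions). If we can get the same result in 10 loci, and the loci are independent, then the suspect is narrowed down to at most 1 person in 2¹⁰ = 1024 who matches the stain. Not bad for no databases!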
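For readers who like to check such things numerically, here is a minimal sketch in Python. It is not part of the original argument. It assumes that the PQ genotype frequency is 2pq (Hardy-Weinberg proportions) and that the loci are independent; the function names `max_het_freq` and `combined_bound` are made up for this illustration.

```python
from fractions import Fraction


def max_het_freq(steps=100):
    """Grid-search the largest 2*p*q subject to p, q >= 0 and p + q <= 1.

    Assumption: the PQ genotype frequency is 2pq (Hardy-Weinberg proportions).
    """
    best = Fraction(0)
    for i in range(steps + 1):
        p = Fraction(i, steps)
        # q may run from 0 up to 1 - p, so that p + q <= 1
        for j in range(steps + 1 - i):
            q = Fraction(j, steps)
            best = max(best, 2 * p * q)
    return best


def combined_bound(n_loci, per_locus=Fraction(1, 2)):
    """Upper bound on the matching fraction over n_loci loci.

    Assumption: the loci are independent, so the per-locus bounds multiply.
    """
    return per_locus ** n_loci


print(max_het_freq())      # 1/2, reached at p = q = 1/2
print(combined_bound(10))  # 1/2 to the 10th power = 1/1024
```

The grid search only confirms the bound on a finite grid of frequencies; the algebra above is what proves it for every p and q.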

Comments? Questions? Disputes?

