Rueful comment

Obviously I no longer believe this page has any forensic relevance — haven't since at least 1998 — but I leave it up anyway because
  1. The beautiful mathematical expression for frequency uncertainty as an exponential of a square root is my own discovery, and no less than Jim Crow complimented me on it;
  2. I enjoy documentation of my past mistakes. It's not that it's a great thing to make mistakes, but I like the clear record of development of my thinking. This page is evidence that I was once exploring the consequences of frequency uncertainty in a pointless direction (as others in the forensic area are doing today). But exploration isn't completely pointless if thinking about a situation eventually causes you to confront deeper questions, which turn into puzzles, which in turn lead you to back up and start thinking over again from the beginning.

DNA Frequency Uncertainty

Sampling variation

DNA profile frequency estimates are based on population samples of limited size (not the whole population). Since a different population study would probably lead to a different frequency estimate, the estimate has an uncertainty known as sampling variation or sampling error. This is in addition to any other errors there might be, such as bias, population substructure, clerical errors, bad methodology, etc.

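Uncertainty factor = exp(sqrt(1/k + 1/l + 1/m + 1/n + ...)), where k, l, m, n, ... are the numbers of times each allele of the profile was observed in the database.

Note that the formula doesn't involve the database size or the allele frequency; just the count.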



Example

locus             TH01              VWA
allele            6        9        12       14
frequency         31/226   28/226   5/226    48/226
times observed    k=31     l=28     m=5      n=48

Uncertainty = exp(sqrt(1/31 + 1/28 + 1/5 + 1/48)) = 1.7, approximately. The profile frequency is divided by this figure and multiplied by it to get the two ends of the plausible range.

The profile frequency is 2•(31/226)•(28/226)•2•(5/226)•(48/226) which comes to one person in 3131.

Taking the uncertainty factor of 1.7 (one standard deviation confidence) into account, it is reasonable to think that something like one person in 2000 to 5000 would have the profile.
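
The arithmetic of the example can be checked with a few lines of Python. This is only a minimal sketch: the function names are made up for illustration, and it assumes both loci are heterozygous, as in the table above.

```python
import math

def uncertainty_factor(counts):
    """exp(sqrt(1/k + 1/l + ...)) over the observed allele counts."""
    return math.exp(math.sqrt(sum(1.0 / c for c in counts)))

def heterozygous_profile_frequency(loci, database_size):
    """Product of 2*p*q over loci, with p and q estimated as count/database_size."""
    freq = 1.0
    for k, l in loci:
        freq *= 2 * (k / database_size) * (l / database_size)
    return freq

# Counts from the example: TH01 alleles 6 and 9, VWA alleles 12 and 14.
loci = [(31, 28), (5, 48)]
N = 226

counts = [c for locus in loci for c in locus]
factor = uncertainty_factor(counts)
one_in = 1.0 / heterozygous_profile_frequency(loci, N)

# Divide and multiply by the factor to get the two ends of the range.
low, high = one_in / factor, one_in * factor
print(f"factor {factor:.2f}; one person in {one_in:.0f}; range {low:.0f} to {high:.0f}")
```

Run as written, this gives a factor of about 1.7 and a range of roughly one in 1800 to one in 5400, which rounds to the "2000 to 5000" quoted above.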

Moral

The calculated profile frequency does not have even one decimal digit of accuracy.

Approximate Mathematics

Uncertainty being necessarily vague, it offers a great chance for some approximate mathematics. Therefore don't expect any right answers on this page; just simple ones. And close enough.

NRC formulas

The NRC II report gives more complicated and less approximate formulas than my simple recommendation.

Their formulas give smaller confidence intervals. For example, with six rare alleles each of frequency 5/200, I would say that the profile is most probably rarer than one person per hundred million; the NRC can say "per three hundred million." For a common profile, six alleles each with a frequency of 50/200, I say once per 450 people while they give the stronger information of once per 500 people.

I think the difference is slight and the extra accuracy is not worth the trouble considering the loss in simplicity and intuitiveness.

The NRC formulas also look very different, but that's an illusion. Just replace each term like 1/k in my formula with (1/k) - (2/N), where N is the database size (226 in the example above), and you have a formula equivalent to the NRC formula for heterozygous loci. The correction term 2/N accounts for two separate corrections, each in the amount of 1/N:

  1. When an allele is observed k times out of N, I use the simplifying approximation that the variance in k is k. This is obviously not exactly right, because if the allele is seen k times, then it must go unseen N-k times, so these two quantities must have the same variance. Statisticians therefore use the more nearly correct symmetrical formula k(N-k)/N.
  2. My formula ignores the covariance between the frequency estimates for two different alleles at the same locus. Covariance here means that the sample counts of the two alleles are related: if one of the alleles is accidentally underrepresented in the database sample, the other tends not to be underrepresented as well. Hence by ignoring covariance I am ignoring part of the tendency for errors to cancel, so I slightly overestimate the size of the likely error.
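
To see how small the two corrections are in practice, here is a minimal Python sketch that applies the substitution of (1/k) - (2/N) for 1/k to the TH01/VWA example. It assumes all loci are heterozygous, the function names are illustrative, and it is not a transcription of the NRC formulas themselves.

```python
import math

def simple_factor(counts):
    """The simple formula: exp(sqrt(sum of 1/k))."""
    return math.exp(math.sqrt(sum(1.0 / k for k in counts)))

def corrected_factor(counts, database_size):
    """Each term 1/k replaced by 1/k - 2/N (heterozygous loci only)."""
    total = sum(1.0 / k - 2.0 / database_size for k in counts)
    return math.exp(math.sqrt(total))

counts = [31, 28, 5, 48]  # allele counts from the TH01/VWA example
N = 226                   # database size from the same example

simple = simple_factor(counts)
corrected = corrected_factor(counts, N)
print(f"simple {simple:.2f}, corrected {corrected:.2f}")
```

For these counts the factor drops only from about 1.71 to about 1.65, so the corrected interval is slightly narrower.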

The NRC book has not one formula but several, because at homozygous loci there is no covariance to consider.


Related subject: Why bother? What good is it anyway?
Return to home page of Charles H. Brenner