Rueful comment
Obviously I no longer believe this page has any forensic relevance
— haven't since at least 1998 — but I leave it up anyway because
- The beautiful mathematical expression for frequency uncertainty as an exponential of
a square root is my own discovery, and
no less than Jim Crow
complimented me on it;
- I enjoy documentation of my past mistakes. It's not that it's a great thing to make
mistakes, but I like the clear record of development of my thinking. This page is evidence
that I was once exploring the consequences of frequency uncertainty in a pointless
direction (as others in the forensic area are doing today). But exploration isn't completely
pointless if thinking about a situation causes you eventually to confront deeper questions,
which turn into puzzles, which in turn lead you to back up and start thinking over again
from the beginning.
DNA Frequency Uncertainty
Sampling variation
DNA profile frequency estimates are based on population samples of limited size (not the whole
population). Since a different population study would probably lead to a different frequency
estimate, the estimate has an uncertainty known as sampling variation or sampling error. This is
in addition to any other errors there might be, such as bias, population substructure, clerical
errors, bad methodology, etc.
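Sampling variation can be made concrete with a small simulation. The sketch below is only an illustration: it assumes a hypothetical "true" allele frequency of 31/226 (a figure borrowed from the example on this page), draws many imaginary databases of 226 alleles each, and records how often the allele turns up in each one. The function name and parameters are made up for the illustration.

```python
import random
import statistics

def simulate_counts(true_frequency, database_size, trials, seed=0):
    """Count how often an allele appears in each of `trials` simulated databases.

    Assumption: every allele in a database is an independent draw that shows
    the allele of interest with probability `true_frequency`.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    counts = []
    for _ in range(trials):
        k = sum(1 for _ in range(database_size) if rng.random() < true_frequency)
        counts.append(k)
    return counts

counts = simulate_counts(true_frequency=31 / 226, database_size=226, trials=2000)
print("mean count:", statistics.mean(counts))
print("standard deviation of count:", statistics.stdev(counts))
print("smallest and largest count:", min(counts), max(counts))
```

The counts scatter around 31 with a standard deviation of roughly 5, so two equally careful population studies could easily report noticeably different frequencies for the same allele.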
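The uncertainty factor is exp(sqrt(1/k + 1/l + 1/m + 1/n)), where k, l, m and n are the
numbers of times the alleles of the profile were observed in the database.
Note that the formula doesn't involve the database size or the allele frequency; just the
count.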
Example
locus | TH01 | TH01 | VWA | VWA
allele | 6 | 9 | 12 | 14
frequency | 31/226 | 28/226 | 5/226 | 48/226
times observed | k=31 | l=28 | m=5 | n=48
Uncertainty = exp(sqrt(1/31 + 1/28 + 1/5 + 1/48)) = 1.7.
The profile frequency is both divided by and multiplied by this figure to get the two ends
of its likely range.
The profile frequency is
2(31/226)(28/226) × 2(5/226)(48/226),
which comes to one person in 3131.
Taking the uncertainty factor of 1.7 (one standard deviation
confidence) into account, it is reasonable to think that something
like one person in 2000 to 5000 would have the profile.
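The arithmetic of the example can be checked with a few lines of Python. This is a minimal sketch, assuming both loci are heterozygous (as in the example) so that each contributes a factor 2pq; the function names are invented for the illustration.

```python
import math

def uncertainty_factor(counts):
    """exp(sqrt(1/k + 1/l + ...)) over the observed allele counts."""
    return math.exp(math.sqrt(sum(1.0 / c for c in counts)))

def heterozygous_profile_frequency(loci, database_size):
    """Product over loci of 2*p*q, with p and q estimated as count/database_size."""
    freq = 1.0
    for count_a, count_b in loci:
        freq *= 2 * (count_a / database_size) * (count_b / database_size)
    return freq

# Allele counts from the example: TH01 alleles 6 and 9, VWA alleles 12 and 14.
loci = [(31, 28), (5, 48)]
N = 226  # database size

factor = uncertainty_factor([c for locus in loci for c in locus])
freq = heterozygous_profile_frequency(loci, N)

print(f"uncertainty factor: {factor:.2f}")
print(f"profile frequency: 1 in {1 / freq:.0f}")
print(f"range: 1 in {1 / (freq * factor):.0f} to 1 in {1 / (freq / factor):.0f}")
```

The unrounded range is about one in 1800 to one in 5400, which is the "2000 to 5000" above stated with more digits than the uncertainty deserves.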
Moral
The calculated profile frequency does not have even
one decimal digit of accuracy.
Approximate Mathematics
Uncertainty being necessarily vague, it offers a great chance for some
approximate mathematics. Therefore don't expect any right answers on
this page; just simple ones. And close enough.
NRC formulas
The NRC II report gives more
complicated and less approximate formulas than my simple
recommendation.
Their formulas give smaller confidence intervals. For example,
with six rare alleles each of frequency 5/200, I would say that the
profile is most probably rarer than one person per hundred million;
the NRC can say "per three hundred million." For a common profile,
six alleles each with a frequency of 50/200, I say once per 450
people while they give the stronger information of once per 500
people.
I think the difference is slight and the extra accuracy is not
worth the trouble considering the loss in simplicity and
intuitiveness.
The NRC formulas also look very different, but that's an illusion.
Just replace each term like 1/k in my formula with (1/k) - (2/N), and you have a formula
equivalent to the NRC formula for heterozygous loci. The correction term 2/N accounts for
two separate corrections, each in the amount of 1/N:
- When an allele is observed k times out of N, I use the simplifying approximation that the
variance in k is k. This is obviously not exactly right, because if the allele is seen k times, then it
is not seen N-k times, and these two counts must have the same variance. Statisticians
therefore use the more nearly correct symmetrical formula k(N-k)/N.
- My formula ignores the covariance between the frequency estimates for two different alleles at
the same locus. Covariance is the effect that sample counts of the two alleles are related in this
way: If one of the alleles is accidentally underrepresented in the database sample, the tendency
is that the other one will not also be underrepresented. Hence by ignoring covariance I am
ignoring part of the tendency for errors to cancel, so I slightly overestimate the size of the
likely error.
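How the two corrections add up to 2/N per allele can be sketched with first-order approximations. This is a reconstruction, not a quotation from the NRC report; it assumes multinomial sampling of N alleles, the approximation Var(ln k) ≈ Var(k)/k², and the standard multinomial covariance Cov(k, l) = -kl/N for the counts k and l of two different alleles at one locus.

```latex
\begin{align*}
\operatorname{Var}(\ln k) &\approx \frac{\operatorname{Var}(k)}{k^{2}}
  = \frac{k(N-k)/N}{k^{2}} = \frac{1}{k} - \frac{1}{N}, \\
\operatorname{Cov}(\ln k, \ln l) &\approx \frac{\operatorname{Cov}(k, l)}{k\,l}
  = \frac{-k\,l/N}{k\,l} = -\frac{1}{N}, \\
\operatorname{Var}(\ln k + \ln l) &\approx
  \left(\frac{1}{k} - \frac{1}{N}\right) + \left(\frac{1}{l} - \frac{1}{N}\right)
  + 2\left(-\frac{1}{N}\right)
  = \left(\frac{1}{k} - \frac{2}{N}\right) + \left(\frac{1}{l} - \frac{2}{N}\right).
\end{align*}
```

The first line supplies one 1/N per allele (the symmetrical variance), and the covariance term, shared between the two alleles of the locus, supplies the other.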
The NRC book has not one formula but several, because at homozygous loci there is no
covariance to consider.
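For heterozygous loci, the substitution of (1/k) - (2/N) for 1/k can be tried on the TH01/VWA example. The sketch below is an illustration of that substitution only, not the NRC's published formulas; the function names are made up, and it assumes every allele count is well under half the database size so that each corrected term stays positive.

```python
import math

def simple_factor(counts):
    """The simple formula: exp(sqrt(sum of 1/k))."""
    return math.exp(math.sqrt(sum(1.0 / k for k in counts)))

def corrected_factor(counts, database_size):
    """Same formula with each 1/k replaced by 1/k - 2/N (heterozygous loci only)."""
    total = sum(1.0 / k - 2.0 / database_size for k in counts)
    return math.exp(math.sqrt(total))

counts = [31, 28, 5, 48]  # allele counts from the TH01/VWA example
N = 226                   # database size from the same example

simple = simple_factor(counts)
corrected = corrected_factor(counts, N)
print(f"simple formula:    {simple:.2f}")
print(f"with 2/N subtracted: {corrected:.2f}")
```

For this example the factor shrinks from about 1.71 to about 1.65, a narrower interval but hardly a different conclusion.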
Return to home page of Charles H. Brenner