DNA Frequency Uncertainty: Why Bother?
Sampling variation
DNA profile frequency estimates are based on population samples of limited size (not the whole
population). Since a different population study would probably lead to a different frequency
estimate, the estimate has an uncertainty known as
sampling variation.
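To make sampling variation concrete, here is a minimal simulation sketch. It is not taken from any real database: the true frequency (0.10), the database size (200) and the helper name estimate_frequency are all hypothetical, chosen only to show that repeated samples from the same population give different estimates.

```python
import random


def estimate_frequency(true_freq, sample_size, rng):
    """Return the observed fraction of hits in one simulated database."""
    hits = sum(1 for _ in range(sample_size) if rng.random() < true_freq)
    return hits / sample_size


rng = random.Random(1)   # fixed seed so the run is repeatable
TRUE_FREQ = 0.10         # hypothetical true allele frequency
SAMPLE_SIZE = 200        # hypothetical database size

# Five independent "population studies" of the same population.
estimates = [estimate_frequency(TRUE_FREQ, SAMPLE_SIZE, rng) for _ in range(5)]
for e in estimates:
    print(f"{e:.3f}")
```

Each printed value is one study's estimate of the same underlying frequency; the spread among them is the sampling variation.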
Benefit of sampling variation
Suppose you plan to drive to some point in the desert and must
carry enough fuel for the round trip. Your best estimate is that
ten gallons will be enough, but you know that this estimate carries some
uncertainty, and there is, let us say, a 1% chance that you really will
need 15 gallons. So 15 gallons is the "98% (or maybe 99%) upper
confidence estimate", and you may well judge it prudent to carry
this amount of gas, rather than the "point estimate" of 10 gallons.
Stupidity of sampling variation
In dealing with DNA matching frequencies, the
NRC II report discusses (but does not recommend!) an
analogous approach. Let's suppose we have a DNA profile, shared
between the suspect and the crime scene. We have some (necessarily
limited) databases from which to estimate the prevalence of this
profile in the general population, and our best guess, the "point
estimate", is that the profile is shared by 1/5000 of the general
population. Then NRC II shows how to compute a "95% lower
confidence" number, which is, let us suppose, 1/3000.
My question about the 1/3000 number is: What damn good is it?
For the sake of argument, let's imagine that the point estimate,
1/5000, if accepted by the jury, would result in a conviction, whereas
the jury would feel that if the weaker number, 1/3000, is the true
chance for a person to match by chance, then they will not convict.
If the jury understands the matter correctly, what will they do?
Roughly speaking, we can imagine for simplicity that the facts are
these (see Note 1):
1. Were we to go back and collect databases again, possibly the
   new databases would result in a matching estimate of 1/3000.
   We can suppose that this more suspect-friendly estimate would
   occur with 5% of the re-collected databases.
2. However, equally the new databases might, another 5% of
   the time, give a matching estimate that is rarer than 1/5000,
   say 1/15000.
3. The possibilities 1. and 2. cancel each other out.
Based on all the evidence before us, we can say that the
point estimate is correct (see Note 2):
There is 1 chance in 5000 that a randomly selected person
will match the stain.
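The cancellation claimed in the list above can be checked with exact arithmetic. This is only a sketch of the simplified picture: it assumes the two alternative estimates, 1/3000 and 1/15000, are the only departures from the point estimate and that they are equally likely (5% each). The variable names are mine.

```python
from fractions import Fraction

point_estimate = Fraction(1, 5000)
suspect_friendly = Fraction(1, 3000)    # more common than the point estimate
suspect_hostile = Fraction(1, 15000)    # rarer than the point estimate

# Equal weights (5% each), so the two outcomes simply average.
average = (suspect_friendly + suspect_hostile) / 2
print(average)  # 1/5000
```

With these particular numbers the two possibilities average exactly to the point estimate, which is the sense in which they cancel.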
As far as I can see, the only use the jury can make of the
"distribution" information of the confidence limits is
somehow to distill it into a single number along the lines
I have indicated above. Finally, they will act on the single
number. And that single number is the point estimate. So why
give them more than the single number in the first place?
Digression: a silly bet
Now, I can imagine a situation where the confidence interval
would be useful: Suppose you offer to bet me that the number of
matching people, in a city of 1 million, is between 1/4000 of the
people and 1/6000 of the people. In deciding whether to take the
bet I of course would like to know the confidence interval around
the 1/5000 point estimate.
But that is nothing like the question that the jury has to
answer. So why burden them with an extra number, a useless
number?
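For the bet, the width of the uncertainty really does matter, and a rough sketch shows how one might price it. Everything distributional here is my assumption, not something the essay or NRC II states: the uncertainty about the true frequency is taken to be normal on the log scale, centered on the 1/5000 point estimate, with 1/3000 sitting at the one-sided 95% point. The sketch also ignores the further chance variation in the actual head count of the city.

```python
from math import log
from statistics import NormalDist

point = 1 / 5000   # point estimate (assumed to be the median)
bound = 1 / 3000   # 95% confidence number from the example

# Assumed spread on the log scale, fixed by the 95% point.
sigma = log(bound / point) / NormalDist().inv_cdf(0.95)
belief = NormalDist(mu=log(point), sigma=sigma)

# Chance that the true frequency lies between 1/6000 and 1/4000.
p_win = belief.cdf(log(1 / 4000)) - belief.cdf(log(1 / 6000))
print(f"{p_win:.2f}")
```

Under these assumptions the bet is close to even money. A narrower interval would make it favorable and a wider one unfavorable, which is exactly the kind of question the jury is not being asked.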
A challenge
Will someone tell me, please, what rational difference it ever can
make to know the confidence limits in addition to knowing the best
point estimate? Specifically, can you give premises under which,
for a fixed point estimate,
the decision to convict or not to convict would depend on the
size of the confidence interval?
email to
noconfidence@dna-view.com
Note 1: This rendition might offend
a statistical "frequentist" purist. However, have I offended in a
material way? Has my unsophisticated rendering disguised the reason
for reporting the confidence limit? Or is there nothing to disguise,
and I am just being unsophisticated?
Note 2: Actually the weighted average is
slightly more (more common) than the point estimate. Nobody seems to
be aware of this, and it will complicate but not fundamentally affect
the argument, so let's ignore it.
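One way to see why the weighted average can exceed the point estimate is a small sketch under an assumption of my own (it is not the calculation behind Note 2): suppose the uncertainty is symmetric on the log scale, with the point estimate 1/5000 as the median and 1/3000 as the one-sided 95% point. The mean of such a distribution lies above its median.

```python
from math import exp, log
from statistics import NormalDist

point = 1 / 5000   # point estimate, taken here as the median
bound = 1 / 3000   # 95% confidence number from the example

# Log-scale standard deviation implied by the assumed 95% point.
z95 = NormalDist().inv_cdf(0.95)
sigma = log(bound / point) / z95

# Mean of a lognormal distribution: median * exp(sigma^2 / 2).
mean = point * exp(sigma ** 2 / 2)
print(f"1 in {1 / mean:.0f}")
```

Under this assumption the average comes out a few percent more common than 1 in 5000, a shift small enough to leave the argument unchanged.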
Return to home page of Charles H. Brenner