DNA Frequency Uncertainty – Why Bother?

Sampling variation

DNA profile frequency estimates are based on population samples of limited size (not the whole population). Since a different population study would probably lead to a different frequency estimate, the estimate has an uncertainty known as sampling variation.
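
To make the idea concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the four "true" allele frequencies (chosen so that their product is 1/5000), the database size, and the simplification that a profile frequency is just the product of one allele frequency per locus. It repeatedly collects an imaginary database and shows how far the resulting estimates spread.

```python
import random

# Hypothetical true allele frequencies at four loci (assumed for illustration);
# their product is 0.1 * 0.1 * 0.2 * 0.1 = 1/5000.
TRUE_ALLELE_FREQS = [0.1, 0.1, 0.2, 0.1]
ALLELES_PER_LOCUS = 400  # e.g. a database of 200 people, two alleles each


def estimate_profile_frequency(rng):
    """Collect one simulated database and return its profile frequency estimate."""
    estimate = 1.0
    for p in TRUE_ALLELE_FREQS:
        # Count how many sampled alleles at this locus are the allele of interest.
        count = sum(rng.random() < p for _ in range(ALLELES_PER_LOCUS))
        estimate *= count / ALLELES_PER_LOCUS
    return estimate


def simulate(n_databases=1000, seed=1):
    """Return the sorted estimates from n_databases independent databases."""
    rng = random.Random(seed)
    return sorted(estimate_profile_frequency(rng) for _ in range(n_databases))


estimates = simulate()
low = estimates[len(estimates) * 5 // 100]    # 5th percentile
high = estimates[len(estimates) * 95 // 100]  # 95th percentile
print(f"5th percentile estimate:  1 in {1 / low:,.0f}")
print(f"95th percentile estimate: 1 in {1 / high:,.0f}")
```

The gap between the two printed figures is the sampling variation: the population is the same every time, and only the luck of the database draw changes.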

Benefit of sampling variation

Suppose you plan to drive to some point in the desert and must carry enough fuel for the round trip. Your best estimate is that ten gallons will be enough, but you know that this estimate carries some uncertainty, and there is, let us say, a 1% chance that you really will need as much as 15 gallons. So 15 gallons is the "98% (or maybe 99%) upper confidence estimate", and you may well judge it prudent to carry this amount of gas, rather than the "point estimate" of 10 gallons.

Stupidity of sampling variation

In dealing with DNA matching frequencies, the NRC II report discusses (but does not recommend!) an analogous approach. Let's suppose we have a DNA profile, shared between the suspect and the crime scene. We have some (necessarily limited) databases from which to estimate the prevalence of this profile in the general population, and our best guess – the "point estimate" – is that the profile is shared by 1/5000 of the general population. Then NRC II shows how to compute a "95% lower confidence" number, which is, let us suppose, 1/3000.

My question about the 1/3000 number is

What damn good is it?

For the sake of argument, let's imagine that the point estimate, 1/5000, if accepted by the jury, would result in a conviction, whereas if the jury believed that the weaker number, 1/3000, is the true probability of a coincidental match, then they would not convict.

If the jury understands the matter correctly, what will they do?

Roughly speaking, we can imagine for simplicity that the facts are these (see Note 1):

  1. Were we to go back and collect databases again, possibly the new databases would result in a matching estimate of 1/3000. We can suppose that this more suspect-friendly estimate would occur with 5% of the re-collected databases.
  2. However, equally the new databases might – another 5% of the time – give a matching estimate that is rarer than 1/5000 – say 1/15000.
  3. The possibilities 1. and 2. cancel each other out. Based on all the evidence before us, we can say that the point estimate is correct (see Note 2): There is 1 chance in 5000 that a randomly selected person will match the stain.

As far as I can see, the only use the jury can make of the "distribution" information – of the confidence limits – is somehow to distill it into a single number along the lines I have indicated above. Finally, they will act on the single number. And that single number is the point estimate. So why give them more than the single number in the first place?
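
The "distilling" can be written out as a probability-weighted average. The sketch below uses the two 5% outcomes from the list above. Putting the remaining 90% of the weight exactly on the point estimate is an assumption made only to keep the picture to three outcomes, and a realistic distribution would be smoother (compare Note 2).

```python
from fractions import Fraction

# (probability, match frequency) for each imagined outcome of re-collecting
# the databases. The two 5% outcomes come from the text; placing the other
# 90% exactly on the point estimate is an assumption for illustration.
scenarios = [
    (Fraction(5, 100), Fraction(1, 3000)),    # suspect-friendly databases
    (Fraction(5, 100), Fraction(1, 15000)),   # suspect-unfriendly databases
    (Fraction(90, 100), Fraction(1, 5000)),   # everything else
]

# The probabilities must account for every outcome.
assert sum(p for p, _ in scenarios) == 1

# Chance that a random person matches, averaged over all the outcomes.
expected_match_probability = sum(p * f for p, f in scenarios)
print(expected_match_probability)  # 1/5000
```

With these particular numbers the two tails cancel exactly, and the single number the jury is left with is the point estimate.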

Digression – a silly bet

Now, I can imagine a situation where the confidence interval would be useful: Suppose you offer to bet me that the number of matching people, in a city of 1 million, is between 1/6000 and 1/4000 of the people (that is, between about 167 and 250 people). In deciding whether to take the bet I of course would like to know the confidence interval around the 1/5000 point estimate.

But that is nothing like the question that the jury has to answer. So why burden them with an extra number – a useless number?
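
For the bet, by contrast, the width of the interval matters, and a rough sketch shows why. The model here is purely an assumption for illustration: uncertainty about the logarithm of the frequency is treated as normal, centred on the point estimate, with its 95th percentile at a stated limit (which does not reproduce the 1/15000 figure used earlier). The sketch also ignores the extra chance variation in how many matching people the city actually contains. The function name `chance_in_range` is hypothetical.

```python
from math import log
from statistics import NormalDist

POINT_ESTIMATE = 1 / 5000


def chance_in_range(upper_limit_95, low=1 / 6000, high=1 / 4000):
    """Chance that the true frequency lies between low and high.

    Assumes (for illustration only) that uncertainty about the log of the
    frequency is normal, centred on the point estimate, with its 95th
    percentile at upper_limit_95.
    """
    z95 = NormalDist().inv_cdf(0.95)
    sigma = (log(upper_limit_95) - log(POINT_ESTIMATE)) / z95
    belief = NormalDist(mu=log(POINT_ESTIMATE), sigma=sigma)
    return belief.cdf(log(high)) - belief.cdf(log(low))


# Same point estimate, two different (hypothetical) 95% limits.
wide = chance_in_range(upper_limit_95=1 / 3000)
narrow = chance_in_range(upper_limit_95=1 / 4500)
print(f"chance the frequency is in range, wide interval:   {wide:.2f}")
print(f"chance the frequency is in range, narrow interval: {narrow:.2f}")
```

The point estimate is 1/5000 in both calls, yet the odds on the bet differ considerably. That is the sense in which the interval is informative for the bettor, whatever it may or may not do for the jury.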

A challenge

Will someone tell me, please, what rational difference it ever can make to know the confidence limits in addition to knowing the best point estimate? Specifically, can you give premises under which, for a fixed point estimate, the decision to convict or not to convict would depend on the size of the confidence interval?

email to noconfidence@dna-view.com


Note 1: This rendition might offend a statistical "frequentist" purist. However, have I offended in a material way – has my unsophisticated rendering disguised the reason for reporting the confidence limit? Or is there nothing to disguise and I am just being unsophisticated?

Note 2: Actually the weighted average is slightly more (more common) than the point estimate. Nobody seems to be aware of this, and it will complicate but not fundamentally affect the argument, so let's ignore it.


Return to home page of Charles H. Brenner