
A typical DNA case involves the comparison of two samples – an
unknown or *evidence* sample, such as semen from a rape, and a
known or *reference* sample, such as a blood sample from a
suspect.

If the DNA profiles obtained from the two samples are indistinguishable (they "match"), that is evidence for the court that the samples have a common source; in this case, that the suspect contributed the semen.

How strong is the evidence? If the DNA profile consists of a combination of traits that figure to be extremely rare, the evidence is very strong that the suspect is the contributor. To the extent that the DNA profile is not so rare, it is easier to imagine that the suspect might be unrelated to the crime and that he matches only by chance.

| Locus | Allele | Times allele observed in database | Size of database | Allele frequency | Genotype frequency formula | Genotype frequency |
|---|---|---|---|---|---|---|
| CSF1PO | 10 | 109 | 432 | p = 0.25 | 2pq | 0.16 |
| | 11 | 134 | | q = 0.31 | | |
| TPOX | 8 | 229 | 432 | p = 0.53 | p^{2} | 0.28 |
| | 8 | | | | | |
| THO1 | 6 | 102 | 428 | p = 0.24 | 2pq | 0.07 |
| | 7 | 64 | | q = 0.15 | | |
| vWA | 16 | 91 | 428 | p = 0.21 | p^{2} | 0.05 |
| | 16 | | | | | |
| profile frequency = | | | | | | 0.00014 |

The allele 10 at the locus CSF1PO was observed 109 times in a
population sample of 432 alleles (216 people). Therefore it is
reasonable to estimate that there is a chance *p*=0.25 that
any particular CSF1PO allele, selected at random, would be a 10.
Similarly, the chance is about *q*=0.31 for a random CSF1PO
allele to be 11. Prior to typing the suspect, if we assume that he is
not the donor of the evidence then we can think of him as someone who
received a CSF1PO allele at random from each of his parents. The
chance to receive 10 from his mother and 11 from his father is
therefore *pq*, and to receive 11 from mother and 10 from
father is another *pq*, so the probability to be 10,11 by
chance is 2*pq*. Hence about 16% of people have the 10,11
genotype at the CSF1PO locus.
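
The arithmetic behind this estimate, carried out on the unrounded counts, can be written out as follows (the symbols *p* and *q* are those used above):

$$
p = \frac{109}{432} \approx 0.252, \qquad
q = \frac{134}{432} \approx 0.310, \qquad
2pq = 2 \times \frac{109}{432} \times \frac{134}{432} \approx 0.157 \approx 16\%
$$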

At the TPOX locus, since both alleles are the same there is only
one term – *pp* or *p*^{2}, which
represents the combined probability of inheriting the allele 8 from
each parent. Hence about 28% of people have the same TPOX genotype
as does the evidence. It is to be expected that the proportion of
TPOX 8,8 people is still 28% even if attention is restricted only to
people who have a particular CSF1PO genotype such as 10,11. Therefore
the chance for a person to have the combined genotype in the two loci
is 28% of 16% – about 4%.

The calculations for the THO1 and vWA loci are similar, and taking them into account whittles the overall chance for a random person to have the combined genotype from 4% down to about 1/7000.
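
The whole calculation can be sketched in a few lines of Python. This is only an illustration of the product rule as described above: the allele counts and database sizes are those quoted in the table, while the data layout and the function names `genotype_frequency` and `profile_frequency` are assumptions made for this example.

```python
# Allele counts and database sizes are those quoted in the table;
# the data structure and function names are illustrative assumptions.
PROFILE = {
    # locus: (database size in alleles, [(allele, times observed), ...])
    "CSF1PO": (432, [("10", 109), ("11", 134)]),
    "TPOX": (432, [("8", 229), ("8", 229)]),
    "THO1": (428, [("6", 102), ("7", 64)]),
    "vWA": (428, [("16", 91), ("16", 91)]),
}


def genotype_frequency(db_size, alleles):
    """Return p^2 for a homozygote or 2pq for a heterozygote."""
    (a1, n1), (a2, n2) = alleles
    p = n1 / db_size  # estimated frequency of the first allele
    q = n2 / db_size  # estimated frequency of the second allele
    return p * p if a1 == a2 else 2 * p * q


def profile_frequency(profile):
    """Multiply the per-locus genotype frequencies (the product rule)."""
    freq = 1.0
    for db_size, alleles in profile.values():
        freq *= genotype_frequency(db_size, alleles)
    return freq


if __name__ == "__main__":
    for locus, (db_size, alleles) in PROFILE.items():
        print(f"{locus}: {genotype_frequency(db_size, alleles):.2f}")
    freq = profile_frequency(PROFILE)
    print(f"profile frequency = {freq:.5f} (about 1 in {1 / freq:,.0f})")
```

Multiplying the unrounded per-locus values gives a result close to 0.00014, in line with the "about 1/7000" figure; multiplying the rounded two-decimal values instead would give a slightly different number.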

The profile frequency is sometimes referred to as the *random
match probability*, or the chance of a random match.

A match therefore means that either the suspect contributed the evidence, or an unlikely coincidence happened – the once-in-7000 coincidence that an unrelated person would by chance have the same DNA profile as that obtained from the evidence.

A shorter summary is "common source, or unlikely coincidence."

It seems logical therefore that DNA evidence alone cannot be a
proof – some additional information is necessary. However, the
amount of additional information that is necessary might be a very
small amount. For example, add to the DNA matching evidence (of 7000
to one) the mere knowledge that the suspect was arrested
*before* his DNA type was known, and you have something like a
proof.
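
One way to make this reasoning concrete is the odds form of Bayes' rule, in which the odds that the suspect is the source, judged from the other information, are multiplied by the strength of the match (7000, taking a match as certain if he is the source). The following is a sketch only: the function name and both sets of prior odds are hypothetical values chosen for illustration, not figures from any case.

```python
def posterior_probability(prior_odds, likelihood_ratio):
    """Combine prior odds with the match evidence; return a probability."""
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


# Strength of the match: 1 / (random match probability of 1 in 7000).
LIKELIHOOD_RATIO = 7000

# HYPOTHETICAL prior odds, for illustration only.
SCENARIOS = [
    ("no other information (odds of 1 to 1,000,000)", 1 / 1_000_000),
    ("arrested on other grounds (odds of 1 to 10)", 1 / 10),
]

if __name__ == "__main__":
    for label, prior_odds in SCENARIOS:
        prob = posterior_probability(prior_odds, LIKELIHOOD_RATIO)
        print(f"{label}: probability suspect is the source = {prob:.4f}")
```

Under these assumed numbers the match alone leaves the suspect an unlikely source, while even modest independent grounds for suspicion bring the probability very close to 1.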

Besides "common source" and "unlikely coincidence", a third possible explanation for a match between suspect and evidence is error. The chance of an error that would cause a spurious match (mishandling the evidence, PCR contamination), although unquantifiable, is probably very small. Nonetheless, it seems likely that the chance of error is often much larger than the extremely small random match chances (such as 1 in 10^{8}) that occur, so it may be more realistic and more fair in such cases to say "same source, or (unlikely) error" rather than to say "same source, or unlikely coincidence."
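
A small sketch shows why the error term dominates when the random match chance is tiny. It assumes that error and coincidence are independent, and it uses a purely hypothetical error chance of 1 in 10,000; as noted above, the real chance is unquantifiable, so this number is for illustration only. The function name is likewise an assumption.

```python
def false_match_probability(random_match_prob, error_prob):
    """Chance of a reported match when the suspect is NOT the source:
    either an error occurs, or no error occurs and the profiles
    coincide by chance (assumed independent events)."""
    return error_prob + (1 - error_prob) * random_match_prob


RANDOM_MATCH_PROB = 1e-8  # "1 in 10^8", as in the text
ERROR_PROB = 1e-4         # HYPOTHETICAL value, not a measured rate

if __name__ == "__main__":
    total = false_match_probability(RANDOM_MATCH_PROB, ERROR_PROB)
    share = ERROR_PROB / total
    print(f"chance of a false match: {total:.3e}")
    print(f"share of that chance due to error: {share:.2%}")
```

With these assumed values almost all of the chance of a false match comes from error, which is the sense in which "same source, or (unlikely) error" is the more realistic summary.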

Sometimes the defense points out that there are sequence variations in most alleles, so the suspect's allele 10 and the evidence allele 10, which were reported by the analyst as matching, may in reality be different. That is true but irrelevant, since the difference is undetectable by the analysis methods used. The analysis and statistics are consistent in treating "match" merely to mean "same category," so the statistical conclusion of "either common source, or once-in-7000 coincidence" is still correct.

The method of calculation described above makes several assumptions, some of which may be false in a particular case, so it is important to be aware of them. There is a more thorough discussion of all these issues in "The Evaluation of Forensic DNA Evidence".

The analysis above assumes that if the suspect is not the donor, he is unrelated to the donor. But common sense shows immediately that if the suspect can make a case that a relative of his, especially his brother, is the donor, then that goes a long way towards explaining away the coincidental similarity between the suspect and the evidence. The defense always needs to be aware of this possibility. There are other computations that can be made to deal with situations where relatives of the suspect (even distant relatives) may be worth considering.
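
As an illustration of such a computation, the sketch below estimates the chance that a full brother of the suspect would share the four-locus profile from the table. It is not taken from the text above: it relies on the standard result that full siblings share both, one, or neither allele by descent at a locus with probabilities 1/4, 1/2 and 1/4, and it keeps the same equilibrium assumptions as the product rule. The function names and data layout are assumptions for this example.

```python
# Allele counts and database sizes as quoted in the table.
PROFILE = {
    "CSF1PO": (432, [("10", 109), ("11", 134)]),
    "TPOX": (432, [("8", 229), ("8", 229)]),
    "THO1": (428, [("6", 102), ("7", 64)]),
    "vWA": (428, [("16", 91), ("16", 91)]),
}


def sibling_match_probability(db_size, alleles):
    """Chance that a full sibling has the same genotype at one locus."""
    (a1, n1), (a2, n2) = alleles
    p = n1 / db_size
    q = n2 / db_size
    if a1 == a2:
        one_shared = p           # the unshared allele must also be a1
        none_shared = p * p      # ordinary homozygote frequency
    else:
        one_shared = (p + q) / 2  # unshared allele must be the other one
        none_shared = 2 * p * q   # ordinary heterozygote frequency
    # Weights 1/4, 1/2, 1/4: both, one, or no alleles shared by descent.
    return 0.25 * 1.0 + 0.5 * one_shared + 0.25 * none_shared


def sibling_profile_probability(profile):
    """Multiply the per-locus sibling match probabilities."""
    prob = 1.0
    for db_size, alleles in profile.values():
        prob *= sibling_match_probability(db_size, alleles)
    return prob


if __name__ == "__main__":
    prob = sibling_profile_probability(PROFILE)
    print(f"chance a brother matches: {prob:.4f}")
```

Under these assumptions the chance that a brother matches is a few percent, far larger than the roughly 1 in 7000 chance for an unrelated person.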

The application of the product rule presumes that the relevant loci and population are in Hardy-Weinberg equilibrium and linkage equilibrium. These population genetic concepts have been found to hold to a reasonable degree of accuracy for major populations and typical forensically used loci. For mixed populations and inbred populations the product rule is not as accurate. To the extent that the product rule is inaccurate, the error usually works against the suspect, unfairly exaggerating the strength of the evidence.

It is beyond the ambition of this section to discuss computations for DNA identification when the evidence consists of a mixture ("Interpreting DNA mixtures") from several people, or how to analyze when the suspect is found through a database search, or how to analyze relationship cases including paternity and missing bodies.
