Forensic mathematics of DNA matching

Charles H. Brenner, Ph.D.¹

A typical DNA case involves the comparison of two samples – an unknown or evidence sample, such as semen from a rape, and a known or reference sample, such as a blood sample from a suspect.

If the DNA profiles obtained from the two samples are indistinguishable (they "match"), that of course is evidence for the court that the samples have a common source – in this case, that the suspect contributed the semen.

How strong is the evidence? If the DNA profile consists of a combination of traits that figure to be extremely rare, the evidence is very strong that the suspect is the contributor. To the extent that the DNA profile is not so rare, it is easier to imagine that the suspect might be unrelated to the crime and that he matches only by chance.

DNA profile probability

Therefore it is essential to have some idea as to the probability that a match would occur by chance. It is easiest to illustrate by example how the probability is determined:

Locus	Allele	Times allele observed	Size of database	Allele frequency	Genotype formula	Genotype frequency
CSF1PO	10	109	432	p = 0.25	2pq	0.16
	11	134		q = 0.31
TPOX	8	229	432	p = 0.53	p²	0.28
	8
THO1	6	102	428	p = 0.24	2pq	0.07
	7	64		q = 0.15
vWA	16	91	428	p = 0.21	p²	0.05
	16
Profile frequency						0.00014

The allele 10 at the locus CSF1PO was observed 109 times in a population sample of 432 alleles (216 people). Therefore it is reasonable to estimate that there is a chance p=0.25 that any particular CSF1PO allele, selected at random, would be a 10. Similarly, the chance is about q=0.31 for a random CSF1PO allele to be 11. Prior to typing the suspect, if we assume that he is not the donor of the evidence then we can think of him as someone who received a CSF1PO allele at random from each of his parents. The chance to receive 10 from his mother and 11 from his father is therefore pq, and to receive 11 from mother and 10 from father is another pq, so the probability to be 10,11 by chance is 2pq. Hence about 16% of people have the 10,11 genotype at the CSF1PO locus.
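The CSF1PO arithmetic can be reproduced in a few lines. This is only a sketch of the calculation described above, using the counts from the table; the variable names are illustrative.

```python
# Counts from the CSF1PO row of the table.
observed_10 = 109      # times allele 10 was seen in the database
observed_11 = 134      # times allele 11 was seen in the database
database_size = 432    # alleles sampled (216 people)

p = observed_10 / database_size   # estimated frequency of allele 10
q = observed_11 / database_size   # estimated frequency of allele 11

# Two ways to be 10,11: 10 from mother and 11 from father, or the reverse.
genotype_freq = p * q + q * p     # = 2pq

print(f"p = {p:.2f}, q = {q:.2f}, 2pq = {genotype_freq:.2f}")
# p = 0.25, q = 0.31, 2pq = 0.16
```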

At the TPOX locus, since both alleles are the same there is only one term – pp or p², which represents the combined probability of inheriting the allele 8 from each parent. Hence about 28% of people have the same TPOX genotype as does the evidence. It is to be expected that the proportion of TPOX 8,8 people is still 28% even if attention is restricted only to people who have a particular CSF1PO genotype such as 10,11. Therefore the chance for a person to have the combined genotype in the two loci is 28% of 16% – about 4%.

The calculations for the THO1 and vWA loci are similar, and taking them into account whittles the overall chance for a random person to have the combined genotype from 4% down to about 1/7000.

product rule

In summary, the probability of a particular multiple-locus genotype is obtained by multiplication – by multiplying together the frequencies of the per-locus genotypes, which is to say, by multiplying together the frequencies of all the individual alleles and including in addition a factor of 2 for each heterozygous locus. This way to obtain the frequency of a DNA profile is called the product rule.
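As a rough sketch, the product rule applied to the four-locus example looks like this in Python. The data layout and the helper name genotype_frequency are illustrative choices, not part of the original method description; the counts are those in the table.

```python
from math import prod

# (locus, allele counts for the profile, database size) from the table.
# A homozygous locus lists its single allele count once.
profile = [
    ("CSF1PO", (109, 134), 432),  # 10,11 heterozygote
    ("TPOX",   (229,),     432),  # 8,8 homozygote
    ("THO1",   (102, 64),  428),  # 6,7 heterozygote
    ("vWA",    (91,),      428),  # 16,16 homozygote
]

def genotype_frequency(counts, database_size):
    """p squared for a homozygote, 2pq for a heterozygote."""
    freqs = [count / database_size for count in counts]
    if len(freqs) == 1:
        return freqs[0] ** 2
    return 2 * freqs[0] * freqs[1]

# Multiply the per-locus genotype frequencies together.
profile_frequency = prod(genotype_frequency(c, n) for _, c, n in profile)

print(f"{profile_frequency:.5f}")  # 0.00014
```

The result corresponds to roughly one person in 7000, as in the table.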

The profile frequency is sometimes referred to as the random match probability, or the chance of a random match.

verbal explanation

In the example case, the overall profile frequency is 0.00014 or about 1/7000. Therefore, a summary of the evidence is that

either the suspect contributed the evidence, or an unlikely coincidence happened – the once-in-7000 coincidence that an unrelated person would by chance have the same DNA profile as that obtained from the evidence.

A shorter summary is "common source, or unlikely coincidence."

Fallacies

"Prosecutor's fallacy"

correct statement vs. prosecutor's fallacy
Correct statement	Prosecutor's fallacy
The chance is 1/7000 that some (particular) person other than the suspect would leave a stain like the actual stain.	The chance is 1/7000 that someone (anyone) other than the suspect left the stain.
The two statements are obviously different when shown side by side, but there is some similarity. For example, both statements might carelessly be paraphrased by the ambiguous statement "The chance is 1/7000 for someone other than the suspect to produce the observed evidence." Maybe this is how the "prosecutor's fallacy" got started.

Newspapers almost always write, incorrectly, that this means there is only 1 chance in 7000 that a person other than the suspect left the semen. (Why? See box.) To make such a statement is to commit the prosecutor's fallacy. It is a fallacy because it pretends that the probability that the suspect might be the donor can be computed from the DNA evidence alone, which implies illogically that other evidence in the case (even if the "suspect" is a dead woman, or even if the suspect was filmed in the act) makes no difference at all.

It seems logical therefore that DNA evidence alone cannot be a proof – some additional information is necessary. However, the amount of additional information that is necessary might be a very small amount. For example, add to the DNA matching evidence (of 7000 to one) the mere knowledge that the suspect was arrested before his DNA type was known, and you have something like a proof.
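One way to make this concrete is the odds form of Bayes' theorem: posterior odds = prior odds × likelihood ratio. The sketch below assumes the likelihood ratio is simply 1 divided by the random match probability (that is, a true donor always matches, and error and relatives are ignored). The prior probabilities are invented purely for illustration, and the function name is hypothetical.

```python
def posterior_probability(prior_probability, random_match_probability):
    """Probability the suspect is the donor, given a match.

    Assumes a true donor always matches; ignores error and relatives.
    prior_probability must be less than 1.
    """
    prior_odds = prior_probability / (1 - prior_probability)
    likelihood_ratio = 1 / random_match_probability
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

rmp = 1 / 7000

# Hypothetical priors, standing in for the non-DNA evidence in the case.
for prior in (0.0, 1e-6, 0.5):
    print(prior, posterior_probability(prior, rmp))
```

With a prior of zero (the suspect could not possibly be the donor) the posterior stays at zero however good the match; with even odds beforehand it rises above 0.9998. The same DNA evidence gives very different conclusions, which is why the 1/7000 figure cannot by itself be the probability that someone else left the stain.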

"Defense attorney's fallacy"

Sometimes the defense tries to minimize the impact of 7000 to one matching odds by saying, "Since that means that there are hundreds of men in this city with the same profile, there is only one chance in several hundred that my client is the donor of the semen." That would be good logic if the other evidence suggests that every man in the city had the same access to the crime scene as did the suspect; not otherwise.
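The defense argument and its hidden assumption can be sketched numerically. The city size below is a made-up figure chosen so that "hundreds of men" match, and prior_weight is a hypothetical way of expressing how much more strongly the other evidence points at the suspect than at an average man in the city.

```python
random_match_probability = 1 / 7000
men_in_city = 2_000_000  # hypothetical city size, not from the text

# Expected number of other men who would match by chance.
expected_other_matches = (men_in_city - 1) * random_match_probability

def probability_suspect_is_donor(prior_weight):
    """prior_weight = 1 means the suspect is, before the DNA result,
    no more likely to be the donor than any other man in the city."""
    return prior_weight / (prior_weight + expected_other_matches)

print(round(expected_other_matches))        # 286
print(probability_suspect_is_donor(1))      # about 1 in 287
print(probability_suspect_is_donor(1000))   # about 0.78
```

With equal weight for every man the defense figure of "one chance in several hundred" is reproduced; as soon as the other evidence singles out the suspect, the figure changes substantially.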

Laboratory error

Besides "common source", and "unlikely coincidence", a third possible explanation for a match between suspect and evidence is error. The chance of an error that would cause a spurious match – mishandling the evidence, PCR contamination – although unquantifiable, is probably very small. Nonetheless, it seems likely that the chance of error is often much larger than the extremely small random match chances (such as 1 in 10⁸) that occur, so it may be more realistic and more fair in such cases to say "same source, or (unlikely) error" rather than to say "same source, or unlikely coincidence."
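A small numerical sketch shows why the error term dominates. The error probability used here is purely hypothetical, since the text notes that the true figure is unquantifiable; the function name is likewise an illustrative choice.

```python
def false_match_probability(random_match_probability, error_probability):
    """Chance that a non-donor is reported as matching: either an error
    produces a spurious match, or there is no such error and the
    profiles coincide by chance."""
    return error_probability + (1 - error_probability) * random_match_probability

rmp = 1e-8                 # random match chance of 1 in 10^8, as in the text
hypothetical_error = 1e-4  # assumed value for illustration only

combined = false_match_probability(rmp, hypothetical_error)
print(combined)
```

Under this assumption the combined figure is essentially equal to the error probability, and the 1 in 10⁸ coincidence term contributes almost nothing.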

Microvariants

Sometimes the defense points out that there are sequence variations in most alleles, so the suspect's allele 10 and the evidence allele 10, which were reported by the analyst as matching, may in reality be different. That's true, but irrelevant since the difference is undetectable to the analysis methods used. The analysis and statistics are consistent in treating "match" merely to mean "same category," so the statistical conclusion of "either common source, or once-in-7000 coincidence" is still correct.

Limitations

The method of calculation described above makes several assumptions, and in some cases some of those assumptions may be false so it is important to be aware of them. There is a more thorough discussion of all these issues in "The Evaluation of Forensic DNA Evidence".

relatives

The analysis above assumes that if the suspect is not the donor, he is unrelated to the donor. But common sense shows immediately that if the suspect can make a case that a relative of his, especially his brother, is the donor, then that goes a long way towards explaining away the coincidental similarity between the suspect and the evidence. The defense always needs to be aware of this possibility. There are other computations that can be made to deal with situations where relatives of the suspect (even distant relatives) may be worth considering.
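As one illustration of such a computation, the sketch below uses the standard population-genetics formulas for the chance that a full sibling shares a given genotype: (1 + p)²/4 at a homozygous locus and (1 + p + q + 2pq)/4 at a heterozygous locus. These formulas are not stated in this article; they are brought in here as an assumption, applied to the rounded allele frequencies from the example table.

```python
from math import prod

def sibling_match_probability(p, q=None):
    """Chance that a full sibling has the same single-locus genotype.

    Homozygote (q is None): (1 + p)**2 / 4
    Heterozygote:           (1 + p + q + 2*p*q) / 4
    """
    if q is None:
        return (1 + p) ** 2 / 4
    return (1 + p + q + 2 * p * q) / 4

# Rounded allele frequencies from the example table.
loci = {
    "CSF1PO": (0.25, 0.31),
    "TPOX":   (0.53, None),
    "THO1":   (0.24, 0.15),
    "vWA":    (0.21, None),
}

brother_match = prod(sibling_match_probability(p, q) for p, q in loci.values())
print(f"{brother_match:.3f}")
```

Under these assumptions the chance that a brother matches the four-locus profile comes out near 0.03, roughly one in thirty, compared with about one in 7000 for an unrelated man.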

heterogeneous population

The application of the product rule presumes that the relevant loci and population are in Hardy-Weinberg equilibrium and linkage equilibrium. These population genetic concepts have been found to hold to a reasonable degree of accuracy for major populations and typical forensically used loci. For mixed populations and inbred populations the product rule is not as accurate. To the extent that the product rule is inaccurate, the error usually works against the suspect, unfairly exaggerating the strength of the evidence.

Omitted topics

It is beyond the ambition of this section to discuss computations for DNA identification when the evidence consists of a mixture from several people (see "Interpreting DNA mixtures"), or how to analyze when the suspect is found through a database search, or how to analyze relationship cases including paternity and missing bodies.

1.	Contact: forensic mathematics
