  1. "punishing" a database of size 1
  2. A simple model
     1. Consequence of the model
  3. The data
  4. More realistic models
  5. Modelling to the hilt
  6. Concluding remarks

(Matching) probability isn't (population) frequency

  1. "punishing" a database of size 1

    The subtitle is based on a communication from John Buckleton. John knows that I say confidence intervals have no place in reporting DNA matching probabilities, and he wonders how I then cater to the lesser information available from a smaller database:

    What would you do with a database of size 1? Do you think we should “punish” small databases and reward large ones?

    Good question. Let's interpret the question as follows:

    I claim we can never get anywhere computing matching probability unless we have some modelling assumptions, some kind of theory about where the data comes from and what produces it. The stipulation above that the reference sample is relevant is a start, but we need more, at least some information about the mechanism that produces the data. Suppose for example that we knew somehow that R is almost invariably lethal and that only a handful of R individuals ever survived. If we believed that strongly enough, then we'd strongly discount the R (or even several R's) in our reference database as mere coincidence. So at a minimum we need the modelling assumption that that particular fact about R is not true. This extreme example makes the point that you can't get anywhere without some modelling assumptions.

    One kind of model that is plausible, quite likely to be supported by data and theory, is something about a prior probability distribution for allele frequencies. That is, perhaps nature favors rare alleles over common ones or at least doesn't do the opposite. For example, a high mutation rate tends to discourage common types (but positive selection for example could have the contrary effect).

  2. A simple model

    Therefore, as a simplified example of a model adequate for illustrating and explaining the mathematics of using a database of size 1, we can assume the following discrete frequency spectrum as a model: the locus has three allelic types, one common type of population frequency ½ and two rarer types of population frequency ¼ each.

    That may seem like an arbitrary and strange model, but actually it is somewhat realistic. It is a simplified version ("toy") of what I call Brenner's Law, a reasonable description of the real frequency spectrum for forensic STR loci. I've simplified the continuous frequency spectrum down to a discrete one so that we can examine the mathematics in an elementary way instead of coping with integrals or scholarly facts about the β distribution, and to make it maximally simple and yet adequate to explain the point.

    1. Consequence of the model

      Suppose an allele is selected at random. According to the model it is equally likely to be the common allele carried by half the population, or to be one of the two less common alleles which together are carried by half the population. Writing

        r(q) for the probability that the allele of an individual chosen at random from the population belongs to an allelic type of population frequency q,

      we therefore have r(¼) = r(½) = ½.

  3. The data

    For purposes of analysis it's most general and best to think of the database as part of the evidence. Hence the evidence E consists of three observations: the crime scene allele, the contents of the database, and the suspect's allele.

    We're going to need Pr(E | H) for several different hypotheses H:

    Evidence E: crime scene=Q, database={R}, suspect=Q.
    Likelihood L = Pr(E | H); H1: suspect is donor; H0: suspect is random.

    frequency q   prior Pr( q=Pr(Q) )   L1(q) = q(1−q)   L0(q) = q²(1−q)
    ¼             r(¼) = ½              12/64            3/64
    ½             r(½) = ½              16/64            8/64

    L1 = Σ r·L1(q) = 28/128,   L0 = Σ r·L0(q) = 11/128,   LR = L1/L0 = 28/11

    Note the idea of the computation. The insight of treating the database as part of the evidence leads to defining

    E={crime scene=Q, database={R}, suspect=Q}.

    We need to calculate likelihood expressions such as Pr(E|H1). The trick to do so is to partition E according to the set of mutually exclusive and exhaustive events {q=½, q=¼}. Each of them has prior probability r=½, so

    L1 = Pr(E|H1) = Pr(q=¼ & E | H1) + Pr(q=½ & E | H1)
       = Pr(q=¼)·Pr(E | H1 & q=¼) + Pr(q=½)·Pr(E | H1 & q=½)    (**)
       = ½·(q(1−q))|q=¼ + ½·(q(1−q))|q=½        ("|q=¼" means "evaluated at q=¼")
       = ½·(12/64) + ½·(16/64)
       = 28/128.

    As shown in the table, a similar computation leads to L0 = 11/128 and therefore to a likelihood ratio LR = L1/L0 = 28/11 ≈ 2.5 supporting the suspect as the donor of the crime scene allele.
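    As a check, the table's totals can be reproduced with a few lines of Python, using exact rational arithmetic and exactly the model and per-q likelihood formulas stated above:

```python
# Verify the toy-model computation for evidence {crime scene=Q, database={R}, suspect=Q}.
# Model assumption from the text: r(1/4) = r(1/2) = 1/2.
from fractions import Fraction as F

spectrum = {F(1, 4): F(1, 2), F(1, 2): F(1, 2)}  # q -> prior r(q)

L1 = lambda q: q * (1 - q)        # H1: suspect is donor
L0 = lambda q: q ** 2 * (1 - q)   # H0: suspect is random

L1_total = sum(r * L1(q) for q, r in spectrum.items())  # 7/32 = 28/128
L0_total = sum(r * L0(q) for q, r in spectrum.items())  # 11/128
print(L1_total, L0_total, L1_total / L0_total)  # 7/32 11/128 28/11
```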

    If the database were very large and practically definitively supported q=¼ then the LR would be very nearly 4 and the smaller value we have calculated for a database of size n=1 has indeed "punished" the database. On the other hand if (somewhat less likely) the larger database makes clear that q=½, then LR=2, smaller than we have calculated. Hence the result LR=28/11 represents a compromise — a mathematically correct and exact compromise based on the evidence — between the two frequency possibilities of ¼ and ½.

    To see if in general the 1-allele database is fair, we need to make a similar comparison for the other database possibility then do some sort of averaging. Consider then the case that E={crime scene=Q, database={Q}, suspect=Q}.

    Evidence E: crime scene=Q, database={Q}, suspect=Q.
    Likelihood L = Pr(E | H); H1: suspect is donor; H0: suspect is random.

    frequency q   prior Pr( q=Pr(Q) )   L1(q) = q²   L0(q) = q³
    ¼             r(¼) = ½              4/64         1/64
    ½             r(½) = ½              16/64        8/64

    L1 = Σ r·L1(q) = 20/128,   L0 = Σ r·L0(q) = 9/128,   LR = L1/L0 = 20/9

    To evaluate the above two numbers we should compare the average (the geometric average is appropriate) LR from an n=1 database with the average LR from an infinite database. Doing the arithmetic exactly is a bit tedious but unnecessary; it's easy to see how it will turn out. On the 50% of occasions when q=¼ for the evidential, crime scene allele, the average LR is between 20/9 and 28/11, so roughly LR(n=1, q=¼)=2.5, notably "conservative" compared to LR(n=∞, q=¼)=4. On the other 50% of occasions, when q=½ for the evidential allele, LR(n=1, q=½) is a shade smaller than before but still around 2.5, which this time is only slightly "anti-conservative" compared to LR(n=∞, q=½)=2. Hence on average the small database is "punished" in that on average we impute less evidence from it against a matching suspect.
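    One way to carry out the averaging just described, on the assumption that for a given true frequency q of the crime-scene allele the single database allele matches Q with probability q (giving LR = 20/9) and mismatches with probability 1−q (giving LR = 28/11):

```python
# Probability-weighted geometric average of the n=1 likelihood ratios.
import math

LR_match, LR_mismatch = 20 / 9, 28 / 11  # from the two tables above

def geometric_avg_LR(q):
    # weight log-LRs by the chance of each database outcome, then exponentiate
    return math.exp(q * math.log(LR_match) + (1 - q) * math.log(LR_mismatch))

print(round(geometric_avg_LR(0.25), 2))  # ~2.46, vs LR(n=infinity) = 4
print(round(geometric_avg_LR(0.5), 2))   # ~2.38, vs LR(n=infinity) = 2
```

    The numbers bear out the text: conservative when q=¼, slightly anti-conservative when q=½.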

  4. More realistic models

    It would not be difficult to extend the above analysis to more complicated models.

    First step, extend the discrete two-frequency model to a discrete model with, say, 100 possible frequencies qi=0.01i, i∈{1, 2, ..., 100}, and allow arbitrary weighting factors ri. To accommodate this model just change the formula (**) to

    Pr( E | H1) = Σi ri Pr(E | H1 & q=qi)
    and write a similar formula for Pr( E | H0).
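    A sketch of this first step in Python. The likelihood exponents generalize the two tables above (a factor q per observed Q, a factor 1−q per non-matching database allele); the uniform weights ri in the final example are purely hypothetical, chosen only to exercise the function:

```python
def likelihood_ratio(spectrum, n_db, k_match):
    """LR for evidence {crime scene=Q, database with k_match Q's among
    n_db alleles, suspect=Q}, given a discrete spectrum {q_i: r_i}."""
    def weighted(power):
        # Sum_i r_i * Pr(E | H & q=q_i), as in formula (**)
        return sum(r * q ** power * (1 - q) ** (n_db - k_match)
                   for q, r in spectrum.items())
    L1 = weighted(k_match + 1)  # H1: factors of q for crime scene and matches
    L0 = weighted(k_match + 2)  # H0: one extra factor of q for the suspect
    return L1 / L0

# The toy two-frequency spectrum reproduces the tables:
toy = {0.25: 0.5, 0.5: 0.5}
print(likelihood_ratio(toy, 1, 0))  # 28/11, about 2.545
print(likelihood_ratio(toy, 1, 1))  # 20/9, about 2.222

# A 100-frequency model with hypothetical uniform weights r_i = 1/100:
uniform = {0.01 * i: 1 / 100 for i in range(1, 101)}
print(round(likelihood_ratio(uniform, 1, 0), 3))
```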

    Second step, we can handle a continuous frequency model such as Ewens' frequency spectrum for the "infinite alleles" model by changing the summation to integration.
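    The second step can be sketched numerically by replacing the sum in (**) with a midpoint-rule integral. As an illustrative density I assume the size-biased form of the Ewens infinite-alleles spectrum, f(q) = θ(1−q)^(θ−1), i.e. the density of the population frequency of the type to which a randomly drawn allele belongs; any other density on (0,1) could be substituted:

```python
# Continuous-spectrum likelihood ratio by simple midpoint integration.
def lr_continuous(density, n_db, k_match, steps=100_000):
    dq = 1.0 / steps
    L1 = L0 = 0.0
    for j in range(steps):
        q = (j + 0.5) * dq                             # midpoint of subinterval j
        w = density(q) * (1 - q) ** (n_db - k_match) * dq
        L1 += w * q ** (k_match + 1)                   # H1, integral form of (**)
        L0 += w * q ** (k_match + 2)                   # H0
    return L1 / L0

theta = 2.0
ewens = lambda q: theta * (1 - q) ** (theta - 1)       # assumed size-biased density
print(round(lr_continuous(ewens, 1, 0), 3))            # 2.5 for theta = 2
```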

    In every case the resultant LR can be derived rigorously. There is no philosophy involved in the sense of decisions based on taste or judgment. Note that confidence intervals never entered the analysis, nor did even sample frequency (which is what a confidence interval is typically attached to). Numbers, mathematics, probability. That's all there is.

  5. Modelling to the hilt

    Exercise for the reader: With the stated model, what is the LR if the database size is n=0?

  6. Concluding remarks

    There is, though, room for philosophy in reflecting upon the theory described above. On what assumptions does my analysis rest? I assumed an explicit and quantitative model; I assumed that one knows what frequency spectrum nature tends to provide. In real life that's an exaggeration. We don't know the model, not exactly, and therefore I would not claim that the kind of calculation illustrated above gives precise results. If the model is to be used in court to influence grave matters, then we should have some good understanding and reasons to trust the model, and to know how far to trust it. I would like to be able to quantify the limitations of the model, and while I may be able to do that in particular cases, I admit that from a theoretical perspective I don't know how to do it in general. It is true that having a large database affords some protection against uncertainty about the model: inaccuracy in choosing the weighting function ri tends to be papered over by a large n.

    And large n of course also correlates with small confidence intervals. But it does not follow, and is not true, that confidence intervals, computations of sampling variation, quantify model uncertainty. The example above proves the point: the one-allele database exemplifies enormous "sampling uncertainty", but if the model is correct the result is exact. How uncertain the model is depends on a lot of things, but sampling variation isn't one of them.