# (Matching) probability isn't (population) frequency

## "punishing" a database of size 1

The subtitle is based on a communication from John Buckleton. John knows that I say confidence intervals have no place in reporting DNA matching probabilities and he wonders how then I cater to the lesser information available from a smaller database:

What would you do with a database of size 1? Do you think we should “punish” small databases and reward large ones?

Good question. Let's interpret the question as follows:

• For simplicity let's say we're dealing with a haploid population.
• A single allele Q is found as crime scene evidence.
• The "database of size 1" means a single allele sampled from some reference person randomly selected from a suitably chosen "population of plausible suspects."
• Assume for example the database allele is R, some allele other than Q.
• A suspect is tested who has the allele Q.
• What is the likelihood ratio supporting the suspect as the crime scene donor, as opposed to being a random individual?

I claim we can never get anywhere computing matching probability unless we have some modelling assumptions, some kind of theory about where the data comes from and what produces it. The stipulation above that the reference sample is relevant is a start, but we need more, at least some information about the mechanism that produces the data. Suppose for example that we knew somehow that R is almost invariably lethal and that only a handful of R individuals ever survived. If we believed that strongly enough, then we'd strongly discount the R (or even several R's) in our reference database as mere coincidence. So at a minimum we need the modelling assumption that that particular fact about R is not true. This extreme example makes the point that you can't get anywhere without some modelling assumptions.

One kind of model that is plausible, quite likely to be supported by data and theory, is something about a prior probability distribution for allele frequencies. That is, perhaps nature favors rare alleles over common ones or at least doesn't do the opposite. For example, a high mutation rate tends to discourage common types (but positive selection for example could have the contrary effect).

## A simple model

Therefore as a simplified example of a model adequate for illustrating and explaining the mathematics for using a database of size 1, we can assume the following discrete frequency spectrum as a model:

• There are three allelic types.
• One of them has population frequency 1/2
• The other two have population frequency 1/4
• We don't know which is which. There is no information in the allele name.

That may seem like an arbitrary and strange model, but actually it is somewhat realistic. It is a simplified version ("toy") of what I call Brenner's Law, a reasonable description of the real frequency spectrum for forensic STR loci. I've simplified the continuous frequency spectrum down to a discrete one so that we can examine the mathematics in an elementary way instead of coping with integrals or scholarly facts about the β distribution, and to make it maximally simple and yet adequate to explain the point.

### Consequence of the model

Suppose an allele is selected at random. According to the model it is equally likely to be the common allele carried by half the population, or to be one of the two less common alleles which are together carried by half the population. Writing

r(q) for the probability that the allele of an individual chosen at random from the population belongs to an allelic type of population frequency q,
r(¼)=r(½)=½.
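The value of r(q) can be checked with a few lines of Python. This is only a sketch of the toy model: the names `type_frequencies` and `r` are chosen here for illustration, and the key step is that a type of frequency f is itself drawn with probability f.

```python
from fractions import Fraction

# Toy frequency spectrum from the text: one allelic type at 1/2, two at 1/4.
type_frequencies = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]

def r(q):
    """Probability that a randomly drawn allele belongs to a type of frequency q."""
    # A type with frequency f is drawn with probability f,
    # so add up f over all types whose frequency equals q.
    return sum(f for f in type_frequencies if f == q)

print(r(Fraction(1, 4)), r(Fraction(1, 2)))  # 1/2 1/2
```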

## The data

For purposes of analysis it's most general and best to think of the database as part of the evidence. Hence the evidence E consists of

• Allele Q is found at the crime scene.
• The reference database has one allele, which let us suppose isn't Q.
• The suspect is Q.

We're going to need Pr(E | H) for several different hypotheses H:

• the suspect is or is not the donor — H1 or H0
• the allele frequency is ¼ or ½
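
In the table below the evidence is E = (crime scene, database, suspect), the likelihoods are L = Pr(E | H), H1 is "suspect is donor" and H0 is "suspect is random".

| crime scene | database | suspect | frequency q | prior r(q) = Pr(q = Pr(Q)) | L1(·) | L1(q) | Σ r·L1 | L0(·) | L0(q) | Σ r·L0 | likelihood ratio L1/L0 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Q | R | Q | q=¼ | r(¼)=½ | q(1−q) | 12/64 | 28/128 | q²(1−q) | 3/64 | 11/128 | 28/11 |
| | | | q=½ | r(½)=½ | | 16/64 | | | 8/64 | | |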

Note the idea of the computation. The insight of treating the database as part of the evidence leads to defining

E={crime scene=Q, database={R}, suspect=Q}.

We need to calculate likelihood expressions such as Pr(E|H1). The trick to do so is to partition E according to the set of mutually exclusive and exhaustive events {q=½, q=¼}. Each of them has prior probability r=½, so

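$$
\begin{aligned}
L_1 = \Pr(E \mid H_1) &= \Pr(q=\tfrac14 \ \&\ E \mid H_1) + \Pr(q=\tfrac12 \ \&\ E \mid H_1) \\
&= \Pr(q=\tfrac14)\,\Pr(E \mid H_1 \ \&\ q=\tfrac14) + \Pr(q=\tfrac12)\,\Pr(E \mid H_1 \ \&\ q=\tfrac12) \qquad (**) \\
&= \tfrac12\,\bigl(q(1-q)\bigr)\big|_{q=1/4} + \tfrac12\,\bigl(q(1-q)\bigr)\big|_{q=1/2} \\
&= \tfrac12\cdot\tfrac{12}{64} + \tfrac12\cdot\tfrac{16}{64} = \tfrac{28}{128}.
\end{aligned}
$$

Here |q=¼ means "evaluated at q=¼".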

As shown in the table, a similar computation leads to L0=11/128 and therefore to a likelihood ratio LR=L1/L0 = 28/11 ≈ 2.55 supporting the suspect as the donor of the crime scene allele.
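The same arithmetic can be reproduced with exact fractions. The sketch below simply follows the per-frequency expressions in the table (q(1−q) under H1, q²(1−q) under H0) and weights them by r(q); the function name `likelihoods` and its `db_matches` flag are illustrative choices, not part of the original analysis.

```python
from fractions import Fraction

# Prior weights r(q) from the toy model: r(1/4) = r(1/2) = 1/2.
prior = {Fraction(1, 4): Fraction(1, 2), Fraction(1, 2): Fraction(1, 2)}

def likelihoods(db_matches):
    """Return (L1, L0) for a one-allele database, following the table's formulas.

    db_matches is True if the database allele is Q, False if it is some other allele.
    """
    L1 = L0 = Fraction(0)
    for q, r in prior.items():
        db = q if db_matches else 1 - q  # probability of the observed database allele
        L1 += r * q * db                 # H1: crime scene and suspect share one draw of Q
        L0 += r * q * q * db             # H0: the suspect is a second, independent draw of Q
    return L1, L0

L1, L0 = likelihoods(db_matches=False)
print(L1, L0, L1 / L0)  # 7/32 (= 28/128), 11/128, 28/11
print(float(L1 / L0))   # about 2.545
```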

If the database were very large and practically definitively supported q=¼ then the LR would be very nearly 4 and the smaller value we have calculated for a database of size n=1 has indeed "punished" the database. On the other hand if (somewhat less likely) the larger database makes clear that q=½, then LR=2, smaller than we have calculated. Hence the result LR=28/11 represents a compromise — a mathematically correct and exact compromise based on the evidence — between the two frequency possibilities of ¼ and ½.

To see if in general the 1-allele database is fair, we need to make a similar comparison for the other database possibility then do some sort of averaging. Consider then the case that E={crime scene=Q, database={Q}, suspect=Q}.

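As before, E = (crime scene, database, suspect), L = Pr(E | H), H1 is "suspect is donor" and H0 is "suspect is random".

| crime scene | database | suspect | frequency q | prior r(q) = Pr(q = Pr(Q)) | L1(·) | L1(q) | Σ r·L1 | L0(·) | L0(q) | Σ r·L0 | likelihood ratio L1/L0 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Q | Q | Q | q=¼ | r(¼)=½ | q² | 4/64 | 20/128 | q³ | 1/64 | 9/128 | 20/9 |
| | | | q=½ | r(½)=½ | | 16/64 | | | 8/64 | | |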

To evaluate the above two numbers we could compare the average LR from an n=1 database (a geometric average is appropriate) with the LR from an infinite database. Doing the arithmetic exactly is a bit tedious and also unnecessary; it's easy to see how it will turn out. On the 50% of occasions when q=¼ for the evidential, crime scene allele, the average LR is between 20/9 and 28/11 so roughly LR(n=1, q=¼)=2.5, notably "conservative" compared to LR(n=∞, q=¼)=4. On the other 50% of occasions when q=½ for the evidential allele, LR(n=1, q=½) is a shade smaller than before but still around 2.5, which this time is only slightly "anti-conservative" compared to LR(n=∞, q=½)=2. Hence on average the small database is "punished" in that on average we impute less evidence from it against a matching suspect.
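For readers who want the averages spelled out, here is a minimal sketch. It assumes one particular reading of "average": when the true frequency of Q is q, the single database allele is Q with probability q (giving LR=20/9) and something else with probability 1−q (giving LR=28/11), and the geometric mean is weighted accordingly. The function name `geometric_mean_lr` is hypothetical.

```python
import math
from fractions import Fraction

LR_DB_Q = Fraction(20, 9)        # LR when the database allele is Q
LR_DB_NOT_Q = Fraction(28, 11)   # LR when the database allele is not Q

def geometric_mean_lr(q):
    """Geometric mean of the n=1 LR when the true frequency of Q is q.

    Assumption: the database allele is Q with probability q, otherwise not Q.
    """
    log_mean = q * math.log(LR_DB_Q) + (1 - q) * math.log(LR_DB_NOT_Q)
    return math.exp(log_mean)

for q in (0.25, 0.5):
    # Compare with the large-database limit LR = 1/q.
    print(q, round(geometric_mean_lr(q), 2), 1 / q)
# Approximately 2.46 against 4 for q=1/4, and 2.38 against 2 for q=1/2.
```

Under this reading the n=1 average falls well short of 4 when q=¼ and only modestly exceeds 2 when q=½, which matches the qualitative conclusion in the text.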

## More realistic models

It would not be difficult to extend the above analysis to more complicated models.

First step, extend the discrete two-frequency model to a discrete model with say 100 possible frequencies qi=0.01i, i∈{1, 2, ..., 100} and allow arbitrary weighting factors ri. To accommodate this model just change the formula (**) to

Pr( E | H1) = Σi ri Pr(E | H1 & q=qi)
and write a similar formula for Pr( E | H0).
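As a rough sketch of this generalization, the sum over i can be coded directly. The per-frequency terms are the ones used in the tables above for a one-allele database; the function name `likelihood_ratio` is illustrative, and the uniform weights on the 100-point grid are purely a placeholder, not a claim about any real frequency spectrum.

```python
from fractions import Fraction

def likelihood_ratio(freqs, weights, db_matches):
    """LR for a one-allele database under a discrete prior (freqs[i], weights[i]).

    db_matches is True if the database allele equals the crime scene allele Q.
    """
    def db_prob(q):
        return q if db_matches else 1 - q

    # Pr(E | H1) = sum_i r_i * Pr(E | H1 & q = q_i), and similarly for H0.
    L1 = sum(r * q * db_prob(q) for q, r in zip(freqs, weights))
    L0 = sum(r * q * q * db_prob(q) for q, r in zip(freqs, weights))
    return L1 / L0

# Sanity check: the two-frequency toy model reproduces 28/11.
toy = likelihood_ratio([Fraction(1, 4), Fraction(1, 2)],
                       [Fraction(1, 2), Fraction(1, 2)], db_matches=False)
print(toy)  # 28/11

# 100-point grid q_i = 0.01 i with placeholder (uniform) weights r_i.
grid = [Fraction(i, 100) for i in range(1, 101)]
uniform = [Fraction(1, 100)] * 100
print(float(likelihood_ratio(grid, uniform, db_matches=False)))
```

Using `Fraction` keeps the result exact, so any change in the LR comes from the choice of weights rather than from rounding.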

Second step, we can handle a continuous frequency model such as Ewens's frequency spectrum for the "infinite alleles" model by changing the summation to integration.

In every case the resultant LR can be derived rigorously. There is no philosophy involved in the sense of decisions based on taste or judgment. Note that confidence intervals never entered the analysis, nor did even sample frequency (which is what a confidence interval is typically attached to). Numbers, mathematics, probability. That's all there is.

## Modelling to the hilt

Exercise for the reader: With the stated model, what is the LR if the database size n=0?

## Concluding remarks

There is though room for philosophy in reflecting upon the theory described above. On what assumptions does my analysis rest? I did assume an explicit and quantitative model; I assumed that one knows what frequency spectrum nature tends to provide. In real life that's an exaggeration. We don't know the model, not exactly, and therefore I would not claim that calculations of the kind illustrated above give precise results. If the model is to be used in court to influence grave matters, then we should have some good understanding and reasons to trust the model and to know how far to trust it. I would like to be able to quantify limitations of the model and while I may be able to do that in particular cases, I admit that from a theoretical perspective I don't know how to do it in general. It is true that having a large database affords some protection against uncertainty about the model: inaccuracy in choosing the weighting function ri tends to be papered over by a large n.

And large n of course also correlates with small confidence intervals. But it does not follow and is not true that confidence intervals, computation of sampling variation, quantify model uncertainty. The example above proves that point: The one-allele database exemplifies enormous "sampling uncertainty" but if the model is correct the result is exact. How uncertain the model is depends on a lot of things but sampling variation isn't one of them. 