Allele Probabilities

		Allele probability – the (x+1)/(N+1) rule
		Charles Brenner, et al June 22, 2008


The (x+1)/(N+1) rule

Introduction
Discussion

Introduction

Context

ISFG: Recommendations on Biostatistics in Paternity Testing

et al

R2 Population Genetics

R2.1 Allele probabilities

The probability of observing an allele, i, can be estimated as:

(x_i+1)/(N+1)

where x_i is the number of i alleles and N is the total number of alleles in the existing database.

Guidance: The relevant probability of observing an allele is its conditional probability given observation among tested individuals. The database sample frequency of x_i/N, ignoring a new observation in a tested trio, is regularly biased toward paternity. Extending the database with one extra observation is a simple and nearly accurate procedure to overcome the bias.
[The above is the part I want to discuss here.]

[The rest of the recommendation reads as follows:]
In particular, occasionally, a new allele not present in a reference database is observed in routine testing. Then the formula reduces to 1/(N+1) since the marker went unobserved among the N previous alleles in the database. Additionally, laboratories may choose to follow a minimum count policy, such as the NRC II recommendation of a minimum numerator of 5.
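As a minimal sketch of the quoted estimate, the function below computes (x_i+1)/(N+1) exactly. The function name and the database counts are hypothetical, chosen only for illustration. The minimum count policy is left out, because the quoted text does not say how it combines with the +1 terms.

```python
from fractions import Fraction


def allele_probability(x_i: int, n: int) -> Fraction:
    """Return (x_i + 1) / (N + 1) for an allele seen x_i times among N database alleles."""
    if n < 0 or not 0 <= x_i <= n:
        raise ValueError("need 0 <= x_i <= N")
    return Fraction(x_i + 1, n + 1)


# Illustrative counts only (not from any real database).
common = allele_probability(30, 200)  # 31/201
novel = allele_probability(0, 200)    # 1/201, an allele never seen in the database
print(common, novel)
```

With x_i = 0 the result is 1/(N+1), matching the case of a new allele described above.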

Reaction

For example, I’ve been asked why not +2 in the denominator, i.e. a formula of

(x_i+1)/(N+2).

Logic

A mother, her child, and an alleged father present for paternity testing.
Put the man aside for the moment. Don’t observe his type.
Determine the genetic (i.e. DNA) types of mother and child at some locus. Suppose the child type is PQ and from comparison with the mother we can see that Q is the paternal type.
At this moment we ask the question: Suppose a non-father is tested for paternity. What is the probability that a (particular) allele of his will be a Q?
Before this case, our experience was a population study of size N of which x were type Q. However, the additional observation of Q just noticed in the child is as good as any other, so we should toss it into the population study. That is, our experience to date is having observed x+1 Q’s out of a total of N+1 observations.
... or maybe out of a total of N+2 observations if we also notice the other, non-Q in the child, the maternal P.

So I agree that the N+2 formula isn’t illogical. However it seems to me an unnecessary complication for negligible benefit.
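The counting argument above can be sketched as adding the child's alleles to the database counts and then taking the sample frequency of Q. The function name, the allele labels, and the counts are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction
from typing import Optional


def probability_after_case(database: Counter, paternal: str,
                           maternal: Optional[str] = None) -> Fraction:
    """Sample frequency of the paternal allele after adding the child's allele(s)."""
    counts = Counter(database)   # copy, so the caller's database is untouched
    counts[paternal] += 1        # the Q just noticed in the child
    if maternal is not None:
        counts[maternal] += 1    # optionally also count the maternal P
    return Fraction(counts[paternal], sum(counts.values()))


# Hypothetical database with N = 200 alleles, of which x = 4 are Q.
db = Counter({"P": 50, "Q": 4, "R": 146})
print(probability_after_case(db, "Q"))       # 5/201, the (x+1)/(N+1) rule
print(probability_after_case(db, "Q", "P"))  # 5/202, the (x+1)/(N+2) variant
```

The two results differ only in whether the maternal allele is counted as an extra observation.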

Discussion

recommendation

formula

Simple

For the very simplest situation – the thought experiment of one chromosome found at a crime scene – the formula is appropriate. (Still not 100% accurate – see below – but the +1 terms are correct.)
Admittedly it ignores part of the data. It ignores the other alleles observed in the reference mother and child.
As we advance in complexity from that haploid identification problem through paternity attribution to very complicated problems such as kinship and mixed stains, there is no limit to the complexities one could introduce in trying to account for all the genetic evidence in detail.
Therefore where is one to stop? In a spirit of practicality, I am inclined to stop at the earliest step as long as it is adequately accurate. Otherwise, we can spend endless hours on trivialities.

Reasonably accurate

The information from the other alleles of the reference mother and child, which the recommended formula ignores, is on average unbiased information.
The difference between the recommended formula and the N+2 formula is about 1/N², a very small number (in the direction that the recommended formula is slightly "conservative" – favoring non-paternity).
Before worrying about so small a numerical discrepancy, it would be well instead to consider the very basis, the mathematical philosophy, of inferring a probability (or of estimating a population frequency if you prefer to think of it that way) from a sample frequency. Bridging the gap requires introducing another modelling step, making an assumption about the prior expected distribution of allele frequencies in the population (the expected "frequency spectrum"). The idea that sample frequency is an unbiased estimate for population frequency is not built into the universe. Rather, natural and obvious-seeming as this assumption is, it is at best only approximately true and examples could be given where it is far from true. It is probably a reasonably close estimate for individual DNA STR loci, but the error is surely greater than 1/N².
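The size of the gap between the recommended formula and the N+2 formula can be checked with a line of algebra. The numbers in the example (x = 0, N = 200) are illustrative assumptions.

```latex
\[
\frac{x+1}{N+1} - \frac{x+1}{N+2}
  = \frac{(x+1)\,\bigl[(N+2)-(N+1)\bigr]}{(N+1)(N+2)}
  = \frac{x+1}{(N+1)(N+2)}
  \approx \frac{x+1}{N^{2}}
\]
% Example: x = 0, N = 200 gives 1/(201 \cdot 202) = 1/40602, roughly 2.5 \times 10^{-5}.
```

The difference is positive, so the recommended formula gives the slightly larger allele probability, and for a rare allele (small x) it is on the order of 1/N².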

Endless complications

Suppose we don’t stop at the earliest step, and try to incorporate some complications. Here are some examples of the tangled morass we can enjoy.

The N+2 formula

N+2

x_q

the N+2 formula above

Mother PR, Child PQ

(x_q+1)/(N+3)?

Mother PQ, Child QQ

two

(x_q+2)/(N+3).

Mother PQ, Child PQ

Don’t give up! There must be a probabilistic answer. Let’s put

Pr(P | Mother, Child types) = (x_p+j) / (N+3)
Pr(Q | Mother, Child types) = (x_q+k) / (N+3)

j+k=3.

x_p

x_q,

Pr(Q | Mother, Child types) = x_q[1+3/(x_q+x_p)] / (N+3).
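One reading that reproduces the last formula, offered as an assumption rather than as the stated derivation: the three extra observations are shared between P and Q in proportion to their database counts, so that j : k = x_p : x_q with j + k = 3.

```latex
\[
k = \frac{3\,x_q}{x_p + x_q}, \qquad
j = \frac{3\,x_p}{x_p + x_q}, \qquad
j + k = 3
\]
\[
\Pr(Q \mid \text{Mother, Child types})
  = \frac{x_q + k}{N+3}
  = \frac{x_q\left[1 + \dfrac{3}{x_q + x_p}\right]}{N+3}
\]
```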

Body identification using three siblings who are PQ, PQ, and PR

Where do you stop?

The formula given is simple and reasonably accurate. Nothing is completely accurate.

The simpler formula x/N is too simple; it’s biased against the alleged father and the bias is significant for small x. (The formula (x+1)/N may seem simpler, but I don’t like it for two reasons:

It looks odd because it doesn’t correspond to any model.
It fails badly when N is 0 or 1.

To be fair, no one suggested it.)
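A small sketch of the second objection, comparing (x+1)/N with the recommended rule for a tiny database. The function names are hypothetical.

```python
def rule_recommended(x: int, n: int) -> float:
    """The (x+1)/(N+1) rule."""
    return (x + 1) / (n + 1)


def rule_x_plus_1_over_n(x: int, n: int) -> float:
    """The (x+1)/N variant discussed in the text."""
    return (x + 1) / n


# N = 1 with the single database allele being Q (x = 1).
print(rule_x_plus_1_over_n(1, 1))  # 2.0, not a valid probability
print(rule_recommended(1, 1))      # 1.0

# N = 0: the variant divides by zero, the recommended rule still returns a value.
try:
    rule_x_plus_1_over_n(0, 0)
except ZeroDivisionError:
    print("(x+1)/N is undefined for N = 0")
print(rule_recommended(0, 0))      # 1.0
```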

In the pursuit of greater accuracy you generally run into endless complication with zero or negligible benefit. Only in the homozygous child QQ case might it be justified to accept the complication of +2 rather than +1. It does correct a situation where the recommendation is anti-conservative. However, I am reluctant to recommend it because:

It sacrifices the simplicity of always using the same number for the probability of the same allele.
The error that it remedies is significant only in the very rare situation of homozygosity for a rare allele.
The only natural stopping point is the one taken by the recommendation.
