# Allele probability – the (x+1)/(N+1) rule

Charles Brenner, et al
June 22, 2008

## Introduction

### Context

In the 2007 committee publication *ISFG: Recommendations on Biostatistics in Paternity Testing*, we (Gjertson, Brenner, et al.) included this recommendation:

### R2 Population Genetics

#### R2.1 Allele probabilities

The probability of observing an allele, i, can be estimated as:
(xi+1)/(N+1)
where xi is the number of i alleles and N is the total number of alleles in the existing database.

Guidance: The relevant probability of observing an allele is its conditional probability given observation among tested individuals. The database sample frequency of xi/N, ignoring a new observation in a tested trio, is regularly biased toward paternity. Extending the database with one extra observation is a simple and nearly accurate procedure to overcome the bias.
[The above is the part I want to discuss here.]

[The rest of the recommendation reads as follows:]
In particular, occasionally, a new allele not present in a reference database is observed in routine testing. Then the formula reduces to 1/(N+1) since the marker went unobserved among the N previous alleles in the database. Additionally, laboratories may choose to follow a minimum count policy, such as the NRC II recommendation of a minimum numerator of 5.
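
As a minimal sketch, the recommended estimate can be written as a short function. The name `allele_probability` and the database size of 400 alleles are illustrative assumptions, not part of the recommendation, and the optional minimum-count policy is left out.

```python
from fractions import Fraction


def allele_probability(x: int, n: int) -> Fraction:
    """Recommended estimate (x + 1) / (N + 1).

    x: number of copies of the allele in the reference database
    n: total number of alleles in the database (N in the text)
    """
    if x < 0 or n < 0 or x > n:
        raise ValueError("need 0 <= x <= n")
    return Fraction(x + 1, n + 1)


# Hypothetical database of 400 alleles.
common = allele_probability(20, 400)  # allele seen 20 times before
novel = allele_probability(0, 400)    # allele never seen before: 1/(N+1)
print(common, novel)
```

Exact fractions are used only so the arithmetic can be checked by hand; floating point would serve equally well in practice.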

### Reaction

One reaction: I’ve been asked why not +2 in the denominator, i.e. a formula of

(xi+1)/(N+2).
I partially understand the sense of that. If the +1 in the recommendation is to condition the probability on the observed instance of the paternal allele in the child, then wouldn’t it be more accurate to condition also on the non-instance of that allele as the child’s other allele?

### Logic

Let’s lay that out more explicitly. Here’s a way to look at the paternity question:
1. A mother, her child, and an alleged father present for paternity testing.
2. Put the man aside for the moment. Don’t observe his type.
3. Determine the genetic (i.e. DNA) types of mother and child at some locus. Suppose the child type is PQ and from comparison with the mother we can see that Q is the paternal allele.
4. At this moment we ask the question: Suppose a non-father is tested for paternity. What is the probability that a (particular) allele of his will be a Q?
5. Before this case, our experience was a population study of size N of which x were type Q. However, the additional observation of Q just noticed in the child is as good as any other, so we should toss it into the population study. That is, our experience to date is having observed x+1 Q’s out of a total of N+1 observations.
6. ... or maybe out of a total of N+2 observations if we also notice the other, non-Q in the child, the maternal P.

So I agree that the N+2 formula isn’t illogical. However it seems to me an unnecessary complication for negligible benefit.

## Discussion

As we stated in the recommendation, the formula is "simple and reasonably accurate".

### Simple

• For the very simplest situation – the thought experiment of one chromosome found at a crime scene – the formula is appropriate. (Still not 100% accurate – see below – but the +1 terms are correct.)
• Admittedly it ignores part of the data. It ignores the other alleles observed in the reference mother and child.
• As we advance in complexity from that haploid identification problem through paternity attribution to very complicated problems such as kinship and mixed stains, there is no limit to the complexities one could introduce in trying to account for all the genetic evidence in detail.

Therefore where is one to stop? In a spirit of practicality, I am inclined to stop at the earliest step as long as it is adequately accurate. Otherwise, we can spend endless hours on trivialities.

### Reasonably accurate

• The information from the other alleles of the reference mother and child, which is ignored in arriving at the recommended formula, is on average unbiased information.
• The difference between the recommended formula and the N+2 formula is about 1/N², a very small number (in the direction that the recommended formula is slightly "conservative" – favoring non-paternity).
• Before worrying about so small a numerical discrepancy, it would be well instead to consider the very basis, the mathematical philosophy, of inferring a probability (or of estimating a population frequency if you prefer to think of it that way) from a sample frequency. Bridging the gap requires introducing another modelling step, making an assumption about the prior expected distribution of allele frequencies in the population (the expected "frequency spectrum"). The idea that sample frequency is an unbiased estimate for population frequency is not built into the universe. Rather, natural and obvious-seeming as this assumption is, it is at best only approximately true and examples could be given where it is far from true. It is probably a reasonably close estimate for individual DNA STR loci, but the error is surely greater than 1/N².
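
The size of the gap between the two formulas can be checked directly. The sketch below uses hypothetical function names and a hypothetical database of 400 alleles. Algebraically the gap is (x+1)/((N+1)(N+2)), which is close to 1/N² for an allele not yet in the database, and relative to the recommended value it is 1/(N+2) whatever x is.

```python
from fractions import Fraction


def recommended(x: int, n: int) -> Fraction:
    """The recommended formula (x + 1) / (N + 1)."""
    return Fraction(x + 1, n + 1)


def n_plus_2(x: int, n: int) -> Fraction:
    """The alternative formula (x + 1) / (N + 2)."""
    return Fraction(x + 1, n + 2)


def gap(x: int, n: int) -> Fraction:
    """How much larger the recommended estimate is than the N+2 estimate."""
    return recommended(x, n) - n_plus_2(x, n)


# Hypothetical database of 400 alleles, allele not previously observed.
n = 400
print(float(gap(0, n)))        # absolute gap, compare with 1/N**2
print(float(1 / Fraction(n * n)))
# Relative gap for an allele seen 20 times.
print(float(gap(20, n) / recommended(20, n)))
```

The recommended value is always the larger of the two, which is the "conservative" direction described in the list above.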

### Endless complications

Suppose we don’t stop at the earliest step, and try to incorporate some complications. Here are some examples of the tangled morass we can enjoy.

#### The N+2 formula

From the observation that the child type is PQ, we have in total N+2 observations of which xq+1 are of allele Q. Hence the sample frequency is given by the N+2 formula above.

#### Mother PR, Child PQ

But wait – we have observations of the mother type. Why not
(xq+1)/(N+3)?

#### Mother PQ, Child QQ

Now there are two instances of Q in the family. No problem, you may say; just put +2 in the numerator:
(xq+2)/(N+3).

#### Mother PQ, Child PQ

That was fine, but how about this case where maternal and paternal alleles are not known? We’ll need probability estimates for both P and Q. Which gets +1 and which gets +2?

Don’t give up! There must be a probabilistic answer. Let’s put

Pr(P | Mother, Child types) = (xp+j) / (N+3)
Pr(Q | Mother, Child types) = (xq+k) / (N+3)
with
j+k=3.
Maybe it is reasonable to take j and k to be proportional to xp and xq, giving rise to formulas like
Pr(Q | Mother, Child types) = xq[1+3/(xq+xp)] / (N+3).
Is that attractive?
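
To see what the proportional split produces, here is a minimal sketch. The function name `split_probabilities` and the counts used are hypothetical, and taking j and k proportional to xp and xq is only the tentative choice floated above, not an established method.

```python
from fractions import Fraction


def split_probabilities(xp: int, xq: int, n: int) -> tuple[Fraction, Fraction]:
    """Share the 3 extra observations between P and Q in proportion to
    their database counts, then divide by N + 3.

    Returns (Pr(P | types), Pr(Q | types)).
    """
    total = xp + xq
    if total == 0:
        raise ValueError("proportional split is undefined when xp + xq == 0")
    j = Fraction(3 * xp, total)  # extra count credited to P
    k = Fraction(3 * xq, total)  # extra count credited to Q; j + k == 3
    return (xp + j) / (n + 3), (xq + k) / (n + 3)


# Hypothetical counts: P seen 10 times, Q seen 30 times, among 400 alleles.
pr_p, pr_q = split_probabilities(10, 30, 400)
print(pr_p, pr_q)
```

Note that j and k are generally not whole numbers, and the split is undefined when neither allele is in the database, which illustrates how quickly the bookkeeping grows.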

#### Body identification using three siblings who are PQ, PQ, and PR

Try this as an exercise. I think it will involve square roots.

### Where do you stop?

The formula given is simple and reasonably accurate. Nothing is completely accurate.

The simpler formula x/N is too simple; it’s biased against the alleged father (that is, toward paternity), and the bias is significant for small x. (The formula (x+1)/N may seem simpler, but I don’t like it for two reasons:

1. It looks odd because it doesn’t correspond to any model.
2. It fails badly when N is 0 or 1.
To be fair, no one suggested it.)
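
The behavior of the three candidates at small counts can be sketched as follows. The function names are hypothetical labels for the formulas discussed above.

```python
from fractions import Fraction


def naive(x: int, n: int) -> Fraction:
    """Plain sample frequency x / N."""
    return Fraction(x, n)


def plus_one_numerator(x: int, n: int) -> Fraction:
    """The variant (x + 1) / N."""
    return Fraction(x + 1, n)


def recommended(x: int, n: int) -> Fraction:
    """The recommended formula (x + 1) / (N + 1)."""
    return Fraction(x + 1, n + 1)


# A previously unseen allele gets probability 0 from the sample frequency,
# although it has just been observed in the case at hand.
print(naive(0, 400), recommended(0, 400))

# With N = 1 and x = 1 the (x+1)/N variant exceeds 1, so it is not a probability.
print(plus_one_numerator(1, 1), recommended(1, 1))

# With N = 0 the (x+1)/N variant cannot be evaluated at all.
try:
    plus_one_numerator(0, 0)
except ZeroDivisionError:
    print("(x+1)/N is undefined for N = 0")
print(recommended(0, 0))
```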

In the pursuit of greater accuracy you generally run into endless complication with zero or negligible benefit. Only in the homozygous child QQ case might it be justified to accept the complication of +2, rather than +1. It does correct a situation where the recommendation is anti-conservative. However, I am reluctant to recommend it because:

• It sacrifices the simplicity of always using the same number for the probability of the same allele.
• The error that it remedies is significant only in the very rare situation of homozygosity for a rare allele.
• The only natural stopping point is the one taken by the recommendation.