### Table of contents

The Problem
1. Summary of solution
2. Autosomal analysis
    1. Deriving the formulas
3. Y-chromosome analysis
4. Approach to "frequencies"


# Paper Challenge

Here are some suggestions for approaching the ESWG ISFG 2004 paper challenge.

The problem is to evaluate given DNA data for paternity attribution in a case where the alleged father is not tested. Instead, DNA profiles are available for his sister (SI) and his brother (BR).

There is both autosomal (Identifiler) data and a Y-haplotype.

## Summary

There's no argument that I know of against combining the autosomal and Y-haplotype results, so overall we have:
| evidence | likelihood ratio |
| --- | --- |
| autosomal loci | 78200 |
| Y-haplotype | 1.15 |
| **combined** | **90000** |
Obviously this is powerful evidence in favor of paternity. Assuming a prior probability of 50%, the posterior probability of paternity is around 99.999%.
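The posterior figure follows from Bayes' theorem in odds form; a minimal sketch using the numbers quoted above:

```python
# Posterior probability of paternity via Bayes' theorem in odds form,
# using the combined LR of 90000 and the 50% prior quoted above.
LR = 90000
prior_odds = 0.5 / (1 - 0.5)            # 50% prior -> prior odds of 1:1
posterior_odds = LR * prior_odds
posterior = posterior_odds / (1 + posterior_odds)
print(f"{posterior:.5%}")               # 99.99889%, i.e. around 99.999%
```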

## The autosomal analysis

Naturally I use the DNA•VIEW Kinship module to work out the autosomal part. The problem is stated to kinship as:
```
C : M + Fa/?      ;; C's father untyped Fa, or unknown ?
Fa, U, A : ? + ?  ;; Fa, U(ncle), A(unt) are siblings
                  ;; (U and A correspond to BR and SI
                  ;;  in the official statement of the problem)
```

and the following is the result –
Cumulative LR for the autosomal loci: **78200**. PI is the likelihood ratio at the locus; the letters in the formulas are explained below (meaning of the letters). A single entry in a genotype column (e.g. 14) denotes an apparent homozygote.

| Locus | PI | formula at this locus | allele frequencies used | M (Mother) | C (Child) | A (SI) | U (BR) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| D8S1179 | 1.45 | (1+2p+r) / (4r+8pr) | p=0.143, r=0.199 | 14 | 14 | 12,13 | 12,14 |
| D21S11 | 4.37 | (2+7q+r) / (4q+4r+8qq+8qr) | q=0.112, r=0.0193 | 31,32 | 31,32 | 30,31 | 31,32 |
| D7S820 | 1.43 | (1+2p+s+4t+ps+pt+5st+tt) / (4p+4t+4ps+4pt+4st+4tt+8pst+8stt) | p=0.137, s=0.215, t=0.146 | 8,12 | 8,12 | 11,12 | 11,12 |
| CSF1PO | 1.17 | (2+p+7q) / (4p+4q+8pq+8qq) | p=0.262, q=0.328 | 10,11 | 10,11 | 11,12 | 10,11 |
| D3S1358 | 1.33 | (1+5p) / (4p+8pp) | p=0.291 | 17,18 | 15,17 | 15,18 | 15,16 |
| TH01 | 2.29 | (1+a+r) / (4r+4ar) | a=0.349, r=0.119 | 6,7 | 6,8 | 8,9.3 | 9.3 |
| D13S317 | 1.36 | (1+5q) / (4q+8qq) | q=0.284 | 12 | 12 | 11,12 | 12,13 |
| D16S539 | 18.9 | (1+4p+q+pp+5pq) / (4p+4pp+4pq+8ppq) | p=0.0138, q=0.148 | 11 | 8,11 | 8,9 | 8,9 |
| D2S1338 | 0.433 | (2+r+s) / (4+4r+4s+8rs) | r=0.113, s=0.164 | 19,20 | 17,20 | 19,20 | 19,20 |
| D19S433 | 3.94 | 1 / 4p | p=0.0634 | 13,16 | 12,13 | 13,14 | 12,15 |
| VWA | 3.26 | (1+5q) / (4q+8qq) | q=0.0952 | 16,17 | 15,16 | 14,15 | 15,17 |
| TPOX | 1.17 | (1+p+s) / (4s+4ps) | p=0.575, s=0.247 | 8 | 8,11 | 8,11 | 8 |
| D18S51 | 30.2 | 1 / 4z | z=0.00828 | 16,17 | 17,22 | 16,18 | 12,22 |
| D5S818 | 0.221 | 1 / (4+8r) | r=0.0662 | 9,11 | 11 | 8,10 | 9,10 |
| FGA | 2.18 | (1+5p) / (4p+8pp) | p=0.156 | 23 | 21,23 | 21,22 | 21,23 |
Meaning of the letters – The letter p denotes the expected frequency of the smallest allele that appears at a locus; the subsequent letters q, r, ... correspond to alleles 1 step, 2 steps, etc. larger. The sequence of consecutive letters is broken – e.g. a for the 9.3 allele in TH01 – to indicate an allele that is remote, mutationally speaking.
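As a quick arithmetic check, the cumulative figure is (up to rounding) the product of the per-locus PIs; a minimal sketch using the rounded values displayed in the table:

```python
from math import prod

# Per-locus paternity indices as displayed (already rounded) in the table.
PI = [1.45, 4.37, 1.43, 1.17, 1.33, 2.29, 1.36, 18.9, 0.433, 3.94,
      3.26, 1.17, 30.2, 0.221, 2.18]
combined = prod(PI)
print(round(combined))   # close to the table's 78200; the small discrepancy
                         # comes from rounding the displayed factors
```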

### Deriving the formulas

The general method of deriving the formulas involves listing every genotype combination for all the relevant people.

For the problem here, to compute the numerator X of the likelihood ratio (for example), the relevant people would be Ma, Child, SI, BR, the father, and the paternal grandparents Gma and Gpa. If we consider the locus D8S1179 as an example, then the genotype combinations to consider for each person are at most

• for Ma, 14
• for Child, 14
• for SI, 12,13
• for BR, 12,14
• for Gma, all 10 combinations of 12, 13, 14, or z, where z=all others
• for Gpa, all 10 combinations of 12, 13, 14, or z,
• for Father, all 10 combinations of 12, 13, 14, or z,
which means the total number of combinations to consider is potentially
1 × 1 × 1 × 1 × 10 × 10 × 10 = 1000. It's not really as bad as that sounds, though, because for various reasons many of the combinations are irrelevant.
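The brute-force recipe above can be sketched in a few lines. This is only an illustration, not DNA·VIEW's actual code; the frequency q of allele 13 is not given in the table, and an arbitrary plausible value is used for it because q cancels in the ratio:

```python
from itertools import combinations_with_replacement, product

# Brute-force enumeration for D8S1179.  p and r are the table's
# frequencies for alleles 12 and 14; q (allele 13) is an assumed value.
freq = {'12': 0.143, '13': 0.10, '14': 0.199}
freq['z'] = 1 - sum(freq.values())        # 'z' lumps all other alleles

genotypes = list(combinations_with_replacement(sorted(freq), 2))  # 10 pairs

def p_geno(g):
    """Hardy-Weinberg probability of an unordered genotype."""
    a, b = g
    return freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]

def p_pass(g, allele):
    """Chance that a parent of genotype g transmits the given allele."""
    return g.count(allele) / 2

def p_child(gma, gpa, child):
    """Chance that a child of gma x gpa has the given genotype."""
    a, b = child
    if a == b:
        return p_pass(gma, a) * p_pass(gpa, a)
    return p_pass(gma, a) * p_pass(gpa, b) + p_pass(gma, b) * p_pass(gpa, a)

num = den = 0.0
for gma, gpa in product(genotypes, repeat=2):
    w = p_geno(gma) * p_geno(gpa)         # prior on the grandparents
    w *= p_child(gma, gpa, ('12', '13'))  # SI is 12,13
    w *= p_child(gma, gpa, ('12', '14'))  # BR is 12,14
    # X: the father is a third sibling; the mother (14,14) supplies one 14,
    # so he must transmit the child's paternal 14
    num += w * (p_pass(gma, '14') + p_pass(gpa, '14')) / 2
    # Y: a random man transmits 14 with probability r
    den += w * freq['14']

print(round(num / den, 2))   # 1.45, the table's PI for D8S1179
```

Changing the assumed q leaves the result unchanged, which is why the published formula for this locus involves only p and r.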

However, there is an essential complication in this problem compared with the case of only one aunt or uncle, as in last year's paper challenge. Suppose, for example, that the type of BR were not given, only that of SI. Then two "slots" among the four grandparental alleles would be known to be 12 and 13, and the other two slots would be known to be unknown. From this, the probability that Father would pass a 14 allele is easily seen to be the probability that he receives, and in turn transmits, one of the two empty slots, times the chance that that slot is a 14.

When BR's type is also known, the above reasoning breaks down, because we don't know whether the 12 in BR and the 12 in SI are the same 12 or two different 12's. The number of slots accounted for in the grandparents is somewhere between 3 and 4 (probabilistically speaking). Last year's shortcut doesn't work.

## The Y-haplotype analysis

The tested child C and alleged uncle BR (=U) would inherit the same Y-haplotype, barring mutation, if they are related. The data shows that they have the same haplotype except for a one-step difference at DYS390:
| person | C | U (i.e. BR) |
| --- | --- | --- |
| DYS390 | 24 | 25 |
| other loci | 16, 11, ... | 16, 11, ... |
| # matching observations in database (N=170) | k=1 | k=0 |
| name for the haplotype | YC | YU |
| notation for probability to see the type in an unrelated person | c | u |

DYS390 mutation frequency: μ = 0.009.

### mutation in the Y haplotype

Obviously, mutation cannot be ignored in this case. μ is the probability of any mutation; but since nearly all (90–95%) STR mutations are one-step, and expansion and contraction are about equally common, to a reasonable approximation the probability of mutating in a given direction – from YC to YU, or the reverse – is μ/2.

There are several possible approaches. We use the notation LR for the likelihood ratio, and
LR = X/Y, where
X = Prob(observed haplotypes | BR an uncle of C) and
Y = Prob(observed haplotypes | BR unrelated to C).

Y = cu. X is more difficult.

#### child-centric approach

The child has YC, inherited from his father, who inherited it from his father, who passed it to the child's uncle. At each of these three transmissions a mutation between YC and YU may have occurred, with probability μ/2 each time. Therefore, given that the child is type YC, the probability is approximately 3μ/2 that his uncle is type YU.

Hence
X = c•3μ/2 and
LR = X/Y = X/(cu) = 3μ/(2u).

It remains to estimate u.

#### uncle-centric approach

In a symmetrical way we could begin with the alleged uncle, and obtain instead the formula
LR = 3μ/(2c).

#### grandfather-centric approach

The child's grandfather had some Y-haplotype, which for simplicity let's assume was either YC or YU, with respective probabilities c and u. In the former case, the observed types occur if the mutation occurs from grandfather to uncle. In the latter case, the observed types occur if a mutation occurs at either of two transmissions between grandfather and child.

Hence
X = c•μ/2 + u•2μ/2, so
LR = X/Y = X/(cu) = μ(1/(2u) + 1/c).

### Which approach is right? How to estimate c and/or u?

Deep questions. What is right depends on such things as what you think the population database represents – grandfather's generation? the child's? If the population were in drift and mutation equilibrium, then I suppose all methods would give the same answer.

Postscript June 2008. Thanks to Steve Myers & John Planz for noting that the grandfather-centric approach is the logical one. Also, on reflection I'm not worried about whether the present population data is appropriate for previous generations. Of course the population frequencies have changed, but frequency was never the issue anyway; it's a question of probability. And data about the present, if that's all you've got, is a valid and I think unbiased indication of the past state just as much as of the present.

#### Pragmatic estimate of the Y-haplotype evidence

Note that all the formulas are equivalent when c = u. Therefore, to be conservative, let's take the uncle-centric view and take c = 2/171.

Hence LR = 3×0.009 / (2×(2/171)) ≈ 1.15.
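As a cross-check – a minimal sketch using the values above – all three formulas do agree once c = u:

```python
# Evaluate the three candidate LR formulas with c = u = 2/171 and
# mu = 0.009, the values used in the pragmatic estimate above.
mu = 0.009
c = u = 2 / 171

child_centric = 3 * mu / (2 * u)
uncle_centric = 3 * mu / (2 * c)
grandfather_centric = mu * (1 / (2 * u) + 1 / c)

for lr in (child_centric, uncle_centric, grandfather_centric):
    print(round(lr, 2))   # 1.15 each time
```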

The meaning of this neutral result is that the chance to see so rare a haplotype by mutation is about the same as the chance to see it at random in an unrelated individual.

## Approach to "frequencies"

### (k+1)/(N+1) rule

As a general rule, if an allele arises in a case and it has been observed k times in a database of N observations (i.e. of N/2 people), then I like the estimate (k+1)/(N+1) for the probability that it will be the next allele that I see.

The justification is that I consider the case occurrence of the allele in an unknown sample (or a child) as a (k+1)st observation of the allele out of now N+1 total observations. In other words, I (temporarily) toss it into the official database, producing what I call an "extended database."

At that point, (conceptually) before examining the suspect's (or the father's) allele, we ask what is the probability that an innocent suspect will match. Assuming the extended database is representative of the universe, the answer is (k+1)/(N+1).
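The rule is a one-liner; a sketch, using the D18S51 numbers that appear later on this page:

```python
def allele_prob(k, n):
    """(k+1)/(N+1): the case allele is (temporarily) tossed into the
    database of N observations, making an 'extended database' of N+1
    observations containing k+1 copies of the allele."""
    return (k + 1) / (n + 1)

# D18S51 allele 22: seen k=5 times among N=724 observations
print(round(allele_prob(5, 724), 5))   # 0.00828
```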

Postscript June 2008: see Allele probability – the (x+1)/(N+1) rule for further discussion.

#### Controversy

It's a simple rule, having the merit that it gives the same answer every time for a given allele. Some people (e.g. in the FSS) suggest extending the database with all the alleles in the case. While this is not wrong, it is extremely complicated because
1. it gives a different estimate for the same allele depending on the case
2. it is nearly impossible to assess how many times an allele has been "seen" when considering a group of relatives
(Example: Suppose mother=8, three children are 8,9, 8,9, 8,9. How many 9's are accounted for? If 9 is a common allele quite likely the father is 9,9, so toss nearly two 9's into the database. But if 9 is rare, toss in barely over one 9.)
In most cases where we differ, my rule is slightly more conservative.
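The parenthetical counting example can be made precise. A sketch under the stated setup – a Hardy-Weinberg prior on the untyped father's genotype; the frequency values 0.30 and 0.01 are illustrative assumptions, not from the text:

```python
# Expected number of 9 alleles carried by the untyped father, given
# mother 8,8 and three children each 8,9 -- the father transmitted 9 thrice.
# Prior on the father's genotype is Hardy-Weinberg with 9-frequency q;
# the likelihood of transmitting 9 three times is 1 for 9,9, (1/2)**3 for 9,x.
def expected_nines(q):
    post_99 = q * q                          # P(9,9) x likelihood 1
    post_9x = 2 * q * (1 - q) * 0.5 ** 3     # P(9,x) x likelihood 1/8
    return (2 * post_99 + 1 * post_9x) / (post_99 + post_9x)

print(round(expected_nines(0.30), 2))   # 1.63 -- common 9: toss in nearly two
print(round(expected_nines(0.01), 2))   # 1.04 -- rare 9: barely over one
```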

Others have claimed (Stockmarr, maybe Balding) that you should add the allele twice – once for the stain and once for the suspect. I think this is illogical because you should (in concept) evaluate the match probability before you know if the suspect matches. If you evaluate it afterwards, when you know if there is a match, isn't the probability either 0 or 1? Besides, Dawid and Mortera worked it out mathematically and their formula has +1.

#### Rare haplotypes

The situation with rare traits – such as Y-haplotypes – is somewhat different, mainly because the extended database is not so representative. Rather, it stands to reason that in a database with many singly-represented traits, most of them are over-represented compared to the population.

So the (k+1)/(N+1) rule is rather conservative for rare haplotype systems.

#### Conclusion

For better or worse, the evaluations on this page use the (k+1)/(N+1) rule.

For example, the 22 allele in D18S51 was observed 5 times among the 724 alleles in the database. Hence z = 6/725 = 0.00828.

### Why the quotes on "frequency"?

It's slang, that's why. What we really want, in a casework situation, is the probability to see a given allele at random in the population. A probability is by definition a summary of available information. Were the population frequency of an allele known, then it would likely equate to the desired probability, but normally only a sample frequency is known.

Another way to see the distinction clearly is to consider the so-called "frequency" of a full DNA profile. Typically the matching probability comes to far less than 1/world population, whereas a frequency would by definition need to be some integer out of the world (or whichever) population. It might help to realize that the probability experiment implied by saying that a matching probability is 1 in a trillion is not to consider trillions of different people and count how many match the given profile; it is to consider trillions of repetitions of the circumstances of this case (e.g. in parallel universes or over a long period of time).

### Types not seen

In the present case, suppose we collect the child data and at that point try to calculate the LR values corresponding to the various combinations that might occur in the other people, including of course the one that actually happens to turn up. In making that prospective evaluation, what would we reckon the probability to see the Y-haplotype YU, which is neither in the database nor, as yet, observed in the present case? I suppose we could say that we can probabilistically infer it μ/2 = 0.0045 times in the child's father. Hence we should add it that many times to the database, which gives a sample frequency of 0.0045/170.0045 = 1/38000 in the "extended database" based on the thinking above. I won't use this estimate though!
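The fractional-count arithmetic in that paragraph, sketched:

```python
# The unseen haplotype YU is probabilistically inferred mu/2 times in the
# child's father; add that fractional count to the N = 170 observations.
mu = 0.009
count = mu / 2                          # 0.0045 inferred observations
sample_freq = count / (170 + count)     # "extended database" frequency
print(round(1 / sample_freq))           # 37779, i.e. about 1 in 38000
```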
