Summary of solution
Autosomal analysis
1. Deriving the formulas
Y-chromosome analysis
1. mutation in the Y-haplotype
2. Which is right?
  1. Pragmatic estimate
Approach to "frequencies"

Forensic mathematics home page
Comments are welcome (see home page for email)

Paper Challenge

Here are some suggestions for approaching the ESWG ISFG 2004 paper challenge.

The problem is to evaluate given DNA data for paternity attribution in a case where the alleged father is not tested. Instead, DNA profiles are available for his sister (SI) and his brother (BR).

Both autosomal (Identifiler) data and a Y-haplotype are available.

Summary

Combined likelihood ratio	90000
Autosomal (Identifiler)	78200
Y-haplotype	1.15
Probability of paternity, assuming a prior probability of 50%	99.999%
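The arithmetic behind the summary can be sketched as follows. This is a minimal illustration, assuming the two likelihood ratios are simply multiplied and that the 99.999% figure comes from the usual odds form of Bayes' rule; the function name `posterior_probability` is made up for the example.

```python
def posterior_probability(lr, prior=0.5):
    """Posterior probability from a likelihood ratio and a prior probability."""
    prior_odds = prior / (1.0 - prior)   # a 50% prior gives odds of 1
    posterior_odds = lr * prior_odds     # Bayes' rule in odds form
    return posterior_odds / (1.0 + posterior_odds)

# Autosomal LR times Y-haplotype LR, values taken from the summary.
combined_lr = 78200 * 1.15
print(round(combined_lr))                      # about 89930, quoted as 90000

print(f"{posterior_probability(90000):.3%}")   # 99.999%
```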

The autosomal analysis

Kinship

C : M+ Fa/? ;; C's father untyped Fa, or unknown ?
Fa, U, A : ? +? ;; Fa, U(ncle), A(unt) are siblings
;; (U and A correspond to BR and SI
;; in the official statement of the problem)

	likelihood ratio (PI)	formula at this locus	Allele frequencies used	Mother	Child	SI	BR
cumulative LR for autosomal loci	78200	(meaning of the letters)		M	C	A	U
D8S1179	1.45	(1+2p+r) / (4r+8pr)	p=0.143 r=0.199	14	14	12 13	12 14
D21S11	4.37	(2+7q+r) / (4q+4r+8qq+8qr)	q=0.112 r=0.0193	31 32	31 32	30 31	31 32
D7S820	1.43	(1+2p+s+4t+ps+pt+5st+tt) / (4p+4t+4ps+4pt+4st+4tt+8pst+8stt)	p=0.137 s=0.215 t=0.146	8 12	8 12	11 12	11 12
CSF1PO	1.17	(2+p+7q) / (4p+4q+8pq+8qq)	p=0.262 q=0.328	10 11	10 11	11 12	10 11
D3S1358	1.33	(1+5p) / (4p+8pp)	p=0.291	17 18	15 17	15 18	15 16
TH01	2.29	(1+a+r) / (4r+4ar)	a=0.349 r=0.119	6 7	6 8	8 9.3	9.3
D13S317	1.36	(1+5q) / (4q+8qq)	q=0.284	12	12	11 12	12 13
D16S539	18.9	(1+4p+q+pp+5pq) / (4p+4pp+4pq+8ppq)	p=0.0138 q=0.148	11	8 11	8 9	8 9
D2S1338	0.433	(2+r+s) / (4+4r+4s+8rs)	r=0.113 s=0.164	19 20	17 20	19 20	19 20
D19S433	3.94	1 / 4p	p=0.0634	13 16	12 13	13 14	12 15
VWA	3.26	(1+5q) / (4q+8qq)	q=0.0952	16 17	15 16	14 15	15 17
TPOX	1.17	(1+p+s) / (4s+4ps)	p=0.575 s=0.247	8	8 11	8 11	8
D18S51	30.2	1 / 4z	z=0.00828	16 17	17 22	16 18	12 22
D5S818	0.221	1 / (4+8r)	r=0.0662	9 11	11	8 10	9 10
FGA	2.18	(1+5p) / (4p+8pp)	p=0.156	23	21 23	21 22	21 23

Meaning of the letters – The letter p is for the expected frequency of the smallest allele that appears in a locus, then the subsequent few letters q, r, ... correspond to sizes 1 step, 2 steps, etc. larger. The sequence of consecutive letters is broken – e.g. a for the 9.3 allele in TH01 – to indicate an allele that is remote, mutationally speaking.
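As a check on the table, the per-locus formulas can be evaluated directly from the listed allele frequencies. The sketch below copies the formulas and frequencies from the table; the names `LOCI` and `locus_lrs` are invented for the example. Because the listed frequencies are rounded to three figures, the recomputed values agree with the table only to within rounding, and the product comes out slightly below the quoted 78200.

```python
import math

# locus -> (formula from the table, frequencies from the table, LR quoted in the table)
LOCI = {
    "D8S1179": (lambda p, r: (1 + 2*p + r) / (4*r + 8*p*r), dict(p=0.143, r=0.199), 1.45),
    "D21S11": (lambda q, r: (2 + 7*q + r) / (4*q + 4*r + 8*q*q + 8*q*r),
               dict(q=0.112, r=0.0193), 4.37),
    "D7S820": (lambda p, s, t: (1 + 2*p + s + 4*t + p*s + p*t + 5*s*t + t*t)
               / (4*p + 4*t + 4*p*s + 4*p*t + 4*s*t + 4*t*t + 8*p*s*t + 8*s*t*t),
               dict(p=0.137, s=0.215, t=0.146), 1.43),
    "CSF1PO": (lambda p, q: (2 + p + 7*q) / (4*p + 4*q + 8*p*q + 8*q*q),
               dict(p=0.262, q=0.328), 1.17),
    "D3S1358": (lambda p: (1 + 5*p) / (4*p + 8*p*p), dict(p=0.291), 1.33),
    "TH01": (lambda a, r: (1 + a + r) / (4*r + 4*a*r), dict(a=0.349, r=0.119), 2.29),
    "D13S317": (lambda q: (1 + 5*q) / (4*q + 8*q*q), dict(q=0.284), 1.36),
    "D16S539": (lambda p, q: (1 + 4*p + q + p*p + 5*p*q) / (4*p + 4*p*p + 4*p*q + 8*p*p*q),
                dict(p=0.0138, q=0.148), 18.9),
    "D2S1338": (lambda r, s: (2 + r + s) / (4 + 4*r + 4*s + 8*r*s),
                dict(r=0.113, s=0.164), 0.433),
    "D19S433": (lambda p: 1 / (4*p), dict(p=0.0634), 3.94),
    "VWA": (lambda q: (1 + 5*q) / (4*q + 8*q*q), dict(q=0.0952), 3.26),
    "TPOX": (lambda p, s: (1 + p + s) / (4*s + 4*p*s), dict(p=0.575, s=0.247), 1.17),
    "D18S51": (lambda z: 1 / (4*z), dict(z=0.00828), 30.2),
    "D5S818": (lambda r: 1 / (4 + 8*r), dict(r=0.0662), 0.221),
    "FGA": (lambda p: (1 + 5*p) / (4*p + 8*p*p), dict(p=0.156), 2.18),
}

def locus_lrs():
    """Evaluate each locus formula at its listed allele frequencies."""
    return {name: formula(**freqs) for name, (formula, freqs, _) in LOCI.items()}

lrs = locus_lrs()
cumulative = math.prod(lrs.values())   # product over the 15 autosomal loci

for name, value in lrs.items():
    print(f"{name:8s} recomputed {value:8.3f}   table {LOCI[name][2]}")
print(f"cumulative LR, recomputed: {cumulative:.0f}")
```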

Deriving the formulas

For the problem here, to compute X (for example), the relevant people are Ma, Child, SI, BR, the father, and the paternal grandparents Gma and Gpa. Taking the locus D8S1179 as an example, the genotype combinations to consider for each person are at most:

for Ma, 14
for Child, 14
for SI, 12,13
for BR, 12,14
for Gma, all 10 combinations of 12, 13, 14, or z, where z=all others
for Gpa, all 10 combinations of 12, 13, 14, or z,
for Father, all 10 combinations of 12, 13, 14, or z,

However, there is an essential complication in this problem compared with a case with only one aunt or uncle, as in last year's paper challenge. Suppose, for example, that only the type of SI were given, not that of BR. Then two "slots" among the four grandparental alleles would be known to be 12 and 13, and the other two slots would be unknown. From this, the probability that Father would pass a 14 allele is easily seen to be the probability that he receives and in turn transmits one of the two empty slots, times the chance that that slot is a 14.

When BR's type is also known, the above reasoning breaks down because we don't know whether the 12 in BR and the 12 in SI are the same 12 or two different 12's. The number of slots accounted for in the grandparents is somewhere between 3 and 4 (probabilistically speaking). Last year's shortcut doesn't work.
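One way to sidestep the slot-counting difficulty is brute force: enumerate every grandparental allele assignment and every transmission, and sum the probabilities. The sketch below does this for D8S1179 and is only an illustration under stated assumptions: no mutation, independent grandparental alleles, and a made-up frequency `q` for allele 13 (the table does not list it, and it cancels out of the result). The function name `lr_d8s1179` is invented for the example.

```python
from itertools import product

def lr_d8s1179(p, q, r, use_br=True):
    """LR at D8S1179 by enumeration; mother 14, child 14, SI 12 13, BR 12 14."""
    freq = {"12": p, "13": q, "14": r, "z": 1.0 - p - q - r}  # z = all other alleles
    si = frozenset({"12", "13"})
    br = frozenset({"12", "14"})
    num = 0.0   # P(relatives' types AND father transmits a 14)
    den = 0.0   # P(relatives' types)
    for gma1, gma2, gpa1, gpa2 in product(freq, repeat=4):
        w = freq[gma1] * freq[gma2] * freq[gpa1] * freq[gpa2]
        gma, gpa = (gma1, gma2), (gpa1, gpa2)
        # Seven coin flips: which allele SI, BR and Father each receive from
        # Gma and from Gpa, and which of his two alleles Father transmits.
        for s_m, s_p, b_m, b_p, f_m, f_p, f_t in product((0, 1), repeat=7):
            if frozenset({gma[s_m], gpa[s_p]}) != si:
                continue
            if use_br and frozenset({gma[b_m], gpa[b_p]}) != br:
                continue
            weight = w / 2**7
            den += weight
            transmitted = gma[f_m] if f_t == 0 else gpa[f_p]
            if transmitted == "14":
                num += weight
    # The child's paternal allele must be a 14; an unrelated man passes it with chance r.
    return (num / den) / r

p, r = 0.143, 0.199    # frequencies of alleles 12 and 14 from the table
q = 0.30               # hypothetical frequency for allele 13 (assumption)
print(lr_d8s1179(p, q, r))                      # enumeration with SI and BR
print((1 + 2*p + r) / (4*r + 8*p*r))            # formula from the table
print(lr_d8s1179(p, q, r, use_br=False))        # SI only: the one-relative shortcut
```

With only SI typed, the enumeration reproduces the shortcut: Father passes a 14 with probability r/2, so the likelihood ratio is 1/2. With BR added, it agrees with the tabulated formula (1+2p+r) / (4r+8pr).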

The Y-haplotype analysis

person	C	U (i.e. BR)
DYS390	24	25
other loci	16, 11, ...	16, 11, ...
# observations in database	N=170
# matching observations	k=1	k=0
name for the haplotype	Y_C	Y_U
notation for probability to see the type in unrelated person	c	u
DYS390 mutation frequency	μ=0.009

mutation in the Y haplotype

[Figure: haplotypes Y_C and Y_U]

There are several possible approaches. We use the notation LR for the likelihood ratio, and
LR = X/Y, where
X = Prob(observed haplotypes | BR an uncle of C) and
Y = Prob(observed haplotypes | BR unrelated to C).

Y = cu. X is more difficult.

child-centric approach

[Figure: child-centric view of haplotypes Y_C and Y_U]

Hence
X = c•3μ/2 and
LR = X/Y = X/(cu) = 3μ/(2u).

It remains to estimate u.

uncle-centric approach

By the same reasoning with the roles of C and U exchanged, X = u•3μ/2 and LR = X/(cu) = 3μ/(2c).

grandfather-centric approach

[Figure: grandfather-centric view of haplotypes Y_C and Y_U]

Hence
X = cμ/2 + u•2μ/2, so
LR = X/Y = X/(cu) = μ(1/(2u) + 1/c).

Which approach is right? How to estimate c and/or u?

Postscript June 2008.

Thanks to Steve Myers & John Planz for noting that the grandfather-centric approach is the logical one.

Also, on reflection I'm not worried as to whether the present population data is appropriate for previous generations. Of course the population frequencies have changed, but frequency was never the issue anyway; it's a question of probability. And data about the present, if that's all you've got, is a valid and I think unbiased indication of the past state just as much as of the present.

Pragmatic estimate of the Y-haplotype evidence

Note that all formulas are equivalent if c = u. Therefore to be conservative let's take the uncle-centric view and take c=2/171.

Hence LR = 3•0.009 / (2•(2/171)) = 1.15.

The meaning of this neutral result is that the chance to see so rare a haplotype by mutation is about the same as the chance to see it at random in an unrelated individual.
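The three candidate formulas can be compared numerically. The sketch below is an illustration only: it takes μ and N from the table, uses c = 2/171 as in the text, and assumes u = 1/171 by applying the same (k+1)/(N+1) estimate to k=0. Reading the factor 3μ/2 as three meioses, each mutating with chance μ and half of those in the needed direction, is an interpretation rather than something the text states. The function names are invented for the example.

```python
MU = 0.009   # DYS390 mutation frequency, from the table
N = 170      # number of observations in the database, from the table

def extended(k, n):
    """(k+1)/(N+1) estimate for a type seen k times among n observations."""
    return (k + 1) / (n + 1)

c = extended(1, N)   # Y_C, seen once: 2/171
u = extended(0, N)   # Y_U, never seen: 1/171 (assumption: same rule applied)

def lr_child_centric(mu, u):
    return 3 * mu / (2 * u)

def lr_uncle_centric(mu, c):
    return 3 * mu / (2 * c)

def lr_grandfather_centric(mu, c, u):
    return mu * (1 / (2 * u) + 1 / c)

print(f"child-centric       {lr_child_centric(MU, u):.2f}")
print(f"uncle-centric       {lr_uncle_centric(MU, c):.2f}")
print(f"grandfather-centric {lr_grandfather_centric(MU, c, u):.2f}")
```

Under these assumptions the uncle-centric value is the smallest of the three, which is the sense in which it is the conservative choice, and all three coincide whenever c = u.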

Approach to "frequencies"

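(k+1)/(N+1) rule

For an allele (or haplotype) observed k times among N observations in the database, the probability used is (k+1)/(N+1).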

The justification is that I consider the case occurrence of the allele in an unknown sample (or a child) as a (k+1)st observation of the allele out of what are now N+1 total observations. In other words, I (temporarily) toss it into the official database, producing what I call an "extended database."

At that point, (conceptually) before examining the suspect's (or the father's) allele, we ask what is the probability that an innocent suspect will match. Assuming the extended database is representative of the universe, the answer is (k+1)/(N+1).

Postscript June 2008: see Allele probability – the (x+1)/(N+1) rule for further discussion.

Controversy

Objections to the rule include:

it gives a different estimate for the same allele depending on the case
it is nearly impossible to assess how many times an allele has been "seen" when considering a group of relatives

Others have claimed (Stockmarr, maybe Balding) that you should add the allele twice – once for the stain and once for the suspect. I think this is illogical because you should (in concept) evaluate the match probability before you know if the suspect matches. If you evaluate it afterwards, when you know if there is a match, isn't the probability either 0 or 1? Besides, Dawid and Mortera worked it out mathematically and their formula has +1.

Rare haplotypes

So the (k+1)/(N+1) rule is rather conservative for rare haplotype systems.

Conclusion

The allele probabilities used here follow the (k+1)/(N+1) rule.

For example, the 22 allele in D18S51 was observed 5 times out of 724 observations in the database. Hence z=6/725=0.00828.
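The rule is a one-line computation. The sketch below reproduces the D18S51 figure and the Y-haplotype value c = 2/171 quoted earlier; the function name `extended_db_probability` is made up for the example, and exact fractions are used only to make the arithmetic easy to compare with the text.

```python
from fractions import Fraction

def extended_db_probability(k, n):
    """(k+1)/(N+1): add the case observation to the database before counting."""
    return Fraction(k + 1, n + 1)

z = extended_db_probability(5, 724)    # allele 22 at D18S51
print(z, round(float(z), 5))           # 6/725 0.00828

c = extended_db_probability(1, 170)    # haplotype Y_C
print(c)                               # 2/171
```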

Why the quotes on "frequency"?

What is needed is a probability, which is a summary of available information, rather than a frequency.

Another way to see the distinction clearly is to consider the so-called "frequency" of a full DNA profile. Typically the matching probability is far less than 1/world population, whereas a frequency would by definition need to be some integer out of the world (or whichever) population. It might help to realize that the probability experiment implied by saying that a matching probability is 1 in a trillion is not to consider trillions of different people and count how many match the given profile; it is to consider trillions of repetitions of the circumstances of this case (e.g. in parallel universes or over a long period of time).