Fundamental problem of forensic mathematics – The Evidential Value of a Rare Haplotype

Table of contents

  1. Introduction
    1. mtDNA & Y-chromosomes
    2. The problem
    3. Guide to this paper
  2. Analysis
    1. Extended database
    2. A simplified approach – counting
    3. Analysis of singletons
    4. Fundamental question – simple argument
  3. Results and validation
    1. Inflation factor
      1. Pathological inflation
    2. Accuracy – sample overrepresentation
      1. A diverse artificial population
      2. Very popular haplotypes
      3. Probable populations
    3. Real and simulated validation data
      1. Data Population simulations
      2. Neutrality test
    4. Observations from the simulation models
      1. Preliminary observations
      2. Evaluation of the κ model
  4. Discussion
    1. Matching probabilities for non-singletons
      1. Matching LR for multiply observed type
      2. Average matching LR
    2. Comparison with other methods
      1. Haplotype survey
      2. The "FBI method"
    3. Comment – Probability vs. Frequency
    4. Conclusion
  5. Appendix
    1. Notation
    2. Statistical estimates
      1. Maximum likelihood estimate of single trait probability
      2. Comparing the t model with the κ model
      3. Effective number of types
      4. Matching LR for new type from me
    3. Population genetic estimates
      1. Effective population size and effective number of alleles
      2. Expected value of κ
      3. Value of θ
    4. Condition of equal overrepresentation
  6. Acknowledgments

Brenner CH
Forensic Science International: Genetics 2010 4 281–291
This is an author's version of the paper, reformatted for the web.


Y-chromosomal and mitochondrial haplotyping offer special advantages for criminal (and other) identification. For different reasons, each of them is sometimes detectable in a crime stain for which autosomal typing fails. But they also present special problems, including a fundamental mathematical one: When a rare haplotype is shared between suspect and crime scene, how strong is the evidence linking the two? Assume a reference population sample is available which contains n−1 haplotypes. The most interesting situation as well as the most common one is that the crime scene haplotype was never observed in the population sample. The traditional tools of product rule and sample frequency are not useful when there are no components to multiply and the sample frequency is zero. A useful statistic is the fraction κ of the population sample that consists of "singletons" – of once-observed types. A simple argument shows that the probability for a random innocent suspect to match a previously unobserved crime scene type is (1−κ)/n – distinctly less than 1/n, likely ten times less. The robust validity of this model is confirmed by testing it against a range of population models.

This paper hinges above all on one key insight: probability is not frequency. The common but erroneous "frequency" approach adopts population frequency as a surrogate for matching probability and attempts the intractible problem of guessing how many instances exist of the specific haplotype at a certain crime. Probability, by contrast, depends by definition only on the available data. Hence if different haplotypes but with the same data occur in two different crimes, although the frequencies are different (and are hopelessly elusive), the matching probabilities are the same, and are not hard to find.

Keywords: haplotype; stain matching; mtDNA; Y-haplotype; forensic mathematics; likelihood ratio; matching probability

  1. Introduction
    1. mtDNA & Y-chromosomes
    2. In recent years there has been increasing interest in using Y-chromosomal haplotypes [Prinz, Mayntz-Press, Malsom, Kayser, Jobling] or mtDNA for forensic identification. These haplotype systems are also much used for body identification, especially for old graves [Ivanov, Keyser-Tracqui]. The advantages for some kinds of problems are considerable. Both methods are desirable for attacking kinship problems involving remote relatives because sex-linked traits are not diluted 50% each generation. Both methods are useful for some scant crime stains – mtDNA because of the high copy number, the Y-chromosome because it can be detected and amplified unambiguously even when it is a minor component compared to the female victim in a rape sample. However, an mtDNA or a Y-chromosomal haplotype must be treated mathematically as a single indivisible ("atomic") trait; so unlike those traditional DNA methods which examine several traits that are approximately independent of one another, no multiplication of probabilities is possible. Therefore it is vital to have a sound fundamental understanding of atomic trait matching probabilities in order to make a reasonable assessment of the strength of identification evidence.

    3. The problem
    4. Evidence – such as a Y-haplotype – is identical between a crime scene and a suspected donor. How strong is the evidence that the suspect is the donor? In particular, this paper discusses the critical matching probability question, "What is the probability that a random non-donor would by chance match a newly observed type?"

      Such interesting questions and complications as possible dependence among traits and suitability of the reference population sample are outside the domain of this paper, essential though they may be to the comprehensive evaluation of DNA evidence. Moreover, to keep the discussion focused and simple, I ignore complications such as mitochondrial heteroplasmy, or adjustments appropriate because of multiple samples or preliminary phenotypic analysis. I take the DNA sequence of a haplotype as being no more than an arbitrary name; in this respect my model differs from that of Krawczak [Roewer].

      The matching probability will be derived by analysis of a population sample. For the purpose of this paper, assume that the population sample is appropriate, that it is randomly representative of possible innocent suspects. It may well be true that geographical clustering or other population genetic phenomena make it difficult to obtain a perfectly representative reference sample (subset, or "database") but for the purpose of this paper put such difficulties aside.

      The problem attacked here is therefore a modest one. But it is also a fundamental one, for without understanding the proper analysis of an individual indivisible trait under the simplest assumptions it is impossible to give a proper analysis of DNA evidence in any situation. I therefore consider this the fundamental question of forensic mathematics.

      I shall focus on the case – typical for mtDNA and Y-haplotypes – that the observed crime scene type is not found in the population sample.

      Evidentiary strength of a matching type is given by a likelihood ratio (LR), called here the matching LR:

      LR=x/y, where x and y are likelihoods
      x=Pr(suspect matches crime scene type | suspect is the donor)
      y=Pr(suspect matches crime scene type | suspect unconnected to the crime).

      The simple view is to assume x=1. That isn't strictly true – one could model typing fallibility (laboratory error), heteroplasmy (in the case of mitochondrial evidence), and mutation (when identification is via relatives). But to keep to the point, settle for x=1. That being the case, we have simply LR=1/y, where y is the matching probability.

      The formulation of y here states "suspect unconnected to the crime." I prefer this to saying "suspect unrelated to the donor" since after all everyone is genetically related, and when dealing with a Y-chromosomal or mtDNA match to pretend otherwise is particularly artificial. As an assumption that is both realistic enough to be useful and ideal enough to permit analysis, think of the suspect, if innocent, as being randomly selected, hence having the same random chance as anyone else in the population to have any particular relationship to the donor. (Similarly, while a reference population sample should be gathered without a bias toward related people it also should not be purged of relatives. It should contain people who are related randomly.)

    5. Guide to this paper
    6. For the practical forensic scientist, the essential sections of this paper, amounting to about one third of the total bulk, are The problem, Analysis which derives the main result, the likelihood ratio for matching a previously unobserved trait, Conclusions and Probability vs. Frequency. It is also useful to be aware of Appendix Notation which gathers together the symbols encountered in the paper.

      The remainder of the paper includes validation and consideration of non-new traits in §III Results, some theoretical context in Appendix §V.C, and §IV.B comments on rival approaches. The mathematics in the paper does not go beyond straightforward algebra and probabilties, and for the most part is relegated to appendices.

  2. Analysis
    1. Extended database
    2. Denote by S0 the crime scene haplotype, and let D- ("database") be a population sample of size n−1. Since D- is stipulated to be an appropriate and representative population sample, there's no harm in thinking of D- as a collection of crime scene haplotypes from other similar but unrelated crimes. Assume that the type S0 does not appear in D-. There is some collection of types that occur exactly once in D-. Call them singletons (Figure 1), and let α1−1 be the number of singletons. To avoid double subscripts I shall generally omit the subscript 1 and write α for α1. Label the singleton types S1, S2, ..., Sα−1. These and S0 are all by definition distinct. We shall eventually
      eventually determine... The point of view here is prospective, contrary to the classical preference of some mathematical writers who evaluate the likelihood ratio in retrospect after S0 and (matching) suspect type T are both observed. I prefer my approach because I think it is slightly less formal, but the mathematics end up the same either way.
      determine the haplotype T of some innocent suspect, a person unrelated to the crime scene donor, or more precisely at most randomly related like any two people in D-. The situation of concern is when T=S0, when our innocent suspect is unluckily inculpated. Therefore the matching probability y=Pr(T=S0) is of vital interest.

      DYS390DYS391 DYS392
      249 11*
      Figure 1 Example population sample D with n=10 and α=4 singletons (starred). Hence κ=0.4.

      The first key observation is that the probabilities Pr(T=Si) are all the same. Probability is a summary of information and the information about the all the Si is exactly the same: all have been observed exactly once and all lack any connection to the innocent suspect. Consequently it is convenient and appropriate to think of the extended database D=D-∪{S0} obtained by tossing S0 into D-. D is a population sample of size n, with α singletons: S0, S1, S2, ..., Sα−1. The matching probability to a crime sample of an new type and with database D- is equivalent to the matching probability to a singleton in database D when there has been no crime.

    3. A simplified approach – counting
    4. Imagine comparing T with the n types in D. Since T by stipulation has no connection to the crime or to D-, even if we assume that T matches some type among the n, the probability y is at most 1/n that T match any particular unique type – call it a singleton – such as S0. Therefore if LR is the likelihood ratio that a matching suspect is the source of S0 we can say confidently that

      LR > LRc = n (1)
      where the subscript c indicates reference to what we may refer to as the "counting model." Note though that the count includes S0.

      It is surely possible to do even better. A major task of this paper is deriving (§II.D) and justifying an inflation factor(§III.A) by which (1) can be improved if we additionally take into account the considerable chance T doesn't match any of the n previously observed types.

      It could be that even a further improvement is possible by arguing that, even when T matches some type among the n, singletons such as S0 have less than their 1/n fair share of probability to be matched. This possibility is examined in §III.B.2. However, while there may be such an effect if some haplotypes are associated with a fitness advantage, it is not very large and is hard to model in a convincing way.

    5. Analysis of singletons
    6. Let κ=α/n be the proportion of singletons among a sample of n haplotypes. Obviously κ depends on the testing system. All other things being equal, testing 17 loci on the Y-chromosome produces a larger proportion of singletons than does testing 7 loci. Also, κ depends on the sample size n. If n is small it may be that κ=1, but we concentrate on κ<1.
      kappa change with sample growth Figure 2 Change in κ(n), the expected fraction of once-observed types, with increasing sample size n for Y-filer haplotypes (ABI Caucasian). Each κ(n) is estimated by averaging the κ values for several hundred random n-subsets of the original 1272 haplotypes. The best fit to a curve of the form κ(n)=θ/(n+θ) is with θ=10930.
      Conversely if n is a substantial fraction of the world population most sampled men will have a son, father, or brother also sampled, so there will be relatively few singleton Y-haplotypes. In the limit κ is about twice the per-generation mutation rate. Certainly κ changes only slowly with n (see Figure 2). Appendix §V.C indicates why it is natural to expect that κ(n) behaves approximately as θ/(n+θ) for some constant θ, and interprets θ.

      What is the probability that the next haplotype observed will be new? Answer: Since κ grows only slowly, it should be about the same as the probability that the last one was, which is to say κ. This is close to a theorem of H. Robbins [Robbins]. It follows as an immediate corollary that about κ of the types in the population are not represented in D, have not been seen. That's how typical it is that a crime scene type is previously unseen.

    7. Fundamental question – simple argument
    8. Now we can evaluate y=Pr(S0=T). The analysis proceeds in three simple steps. If S0=T, then each of the following must be true:

      A. (Observed) T matches something in D. Pr(Observed) means the probability that a haplotype matching T is found in D.
      Since T is a new type with probability κ, Pr(Observed)=1−κ. (2)
      B. (Singleton) T matches some singleton in D. (3) Pr(Singleton) would be the probability for T to match a haplotype among the singletons of D.
      Pr(Singleton | Observed) is the probability for T to match a singleton given that it matches something (given "Observed"). The "obvious" idea is to assume that a match will be to a singleton or to a non-singleton pro-rata according to the singleton/non-singleton ratio in D
      An exact answer may be difficult, but an obvious claim is


      which is considered in more detail in §III.B.
      C. (Match) T=S0. There are α singletons, so given a match to a singleton it is 1/α to match any particular one.
      Given that T matches some singleton Si in D, the subscript i is equally likely to have one value as another, hence

      Pr(Match|Observed & Singleton)=1/α.


      Ploddingly putting the above together –

      y = Pr(Match)
      = Pr(Match & Singleton & Observed)
      = Pr(Match|Singleton & Observed)Pr(Singleton|Observed)Pr(Observed)
      ≈ (1−κ)κ/α
      = (1−κ)/n.(5)
      Consequently the evidential strength for matching a previously unseen haplotype
      LR ≈ LRκ = n/(1−κ) (6)

      where the subscript κ corresponds to the above analysis according the "kappa model."

  3. Results and validation
    1. Inflation factor
    2. It has occasionally been said that the matching LR can be no larger than n−1, the size of the database. Formula (6) gives matching LR that are larger than n by a factor of 1/(1−κ), which I therefore call the inflation factor in Table 1.

      sample size singletons singleton proportion inflation factor matching LR
      for new type (1/y) for any type (me =1/q)
      κ model t model empirical - per pairwise match rate
      n-1 α-1 κ=α/n 1/(1-κ) LRκ=n/(1-κ) t=n/ln κ
      mtDNA Mitotyping Technologies2000~1200~60% 2.5~50003917
      mtDNA AFDIL
      all races786749410.63 2.72115716919
      Caucasians12197500.62 2.631742514
      7 Y loci Charité Caucasians 243811010.451.8 44493070
      Y-Plex (11 loci) Reliagene
      African American160512050.75 4.0644856071410
      Caucasian12426910.562.3 28042122285
      Hispanic4523480.774.4 19731737465
      Native American104770.743.9 408353153
      Yfiler 17-plex; ABI
      African American9779170.94 16.3159411544714020
      African59390.673.0 18014890
      Asian3273090.9518.2 597758114100
      Caucasian127211480.9010.3 13069124218883
      Filipino1051010.9626.5 280927562730
      SE Hispanic5975430.91 11.1662263195233
      Native American106106≥0.99*107* 11500*11400*
      * To avoid κ=1, instead of α=n, analyzed on the basis that α>n–1.
      Vietnamese103970.9417.3 180317501751
      17 Y loci Gaeda Portuguese3132950.94 17.4547853192713
      10 Y loci Hammer Poland/Hungary Ashkenazi Jews 4961760.361.6 77248197
      Table 1 some haplotype populations sample statistics.

      1. Pathological inflation
      2. For small enough samples or exceedingly polymorphic traits it may happen that κ=1 in which case formula (6) gives an infinite LR and infinite inflation factor, which obviously cannot be accurate. Possible remedies include a careful and refined statistical treatment, or simply avoiding κ too close to 1. For the present study I prefer the latter approach.

    3. Accuracy – sample overrepresentation
    4. Let's consider more carefully question B, Pr(Singleton|Observed). If there are any types in the population that don't appear in the sample, they are a fortiori underrepresented in the sample. Therefore to compensate, types that do occur in the sample tend to be overrepresented [Morton] This would be most obviously true for represented rare types. The intuition behind (3) is that singleton types in D are on average at least as overrepresented, compared to their population frequency, as are non-singleton types. That is, let f=1 denote the (normally unknown) combined population frequency of the types {Si} which are singletons in D, and f>1 denote the combined population frequency of the types that occur multiply in D. The corresponding sample frequencies are κ and 1−κ. Then κ/f=1 is the overrepresentation rate for singletons and (1−κ)/f>1 is the overrepresentation rate for non-singletons. In claiming (3) we are assuming that (see Appendix, Condition of equal overrepresentation)

      κ/f=1 ≈ (1−κ)/f>1, (7)
      though for the courtroom context we will normally also be quite happy if the inequality
      κ/f=1 ≥ (1−κ)/f>1 (8)

      holds as it is "conservative" in the sense that it implies that the LR given by (6) at worst understates the evidence connecting suspect to crime.

      Is (8) justified? That depends on the frequency spectrum – the distribution of frequencies of types – that the evolutionary mechanisms such as mutation, drift, and selection tend to produce. First consider two unrealistic extreme frequency spectra, then reality.

      1. A diverse artificial population
      2. At one extreme, the worst case for (8), imagine the sample D of size n to be drawn from an artificial population Ωt consisting of some large unknown number t, approximately in the range n < t < n2/2, of equally rare types. Then D, in addition to α1 singletons we may have some α2>0 doubletons and even some tripletons, i.e. duplicated types, which are overrepresented two and three times more than the singletons are overrepresented. In short, the singletons are overrepresented the least and (8) fails; the recommendation (6) would be anti-conservative. Formula (3) is therefore not as obvious as it looks. However, even if we erroneously apply the κ model formula (6) to the Ωt model, the expected error in using LRκ instead of the correct LRt=t would be small (Appendix, §V.B.1).

      3. Very popular haplotypes
      4. At the opposite extreme consider a population Ωz,t with a common type Z of substantial frequency and a large number of very rare types of frequency 1/t. As an example, Ω0.2, 10000 has a 20% type and 8000 rare types. A sample D from Ω0.2, 10000 of size n=100 rates to include nearly 80 singletons (thus κ≈0.8 and f=180/10000), about 20 copies of Z and perhaps a few doubletons of rare types (f>1=0.2+). Hence κ/f=1=100 while (1−κ)/f>1≈1, so (8) holds with a factor of 100 to spare.

      5. Probable populations
      6. The frequency spectrum of a real haplotype systems lies in between. Constant t is unrealistic.
        Constant t is unrealistic providing we except extremes such as sequencing the entire Y chromosome which might realize the model ΩN – a different type for everyone.
        Different lineages will have differing numbers of descendants so by random drift Ωt is not stable. Significantly common types Z as in the Ωz,t model tend to crumble out of existence because of mutation, but this depends on the rate of mutation, which in turn is proportional to the number of variable sites in the haplotype. For example if we momentarily limit our attention to the 9-locus "minimal haplotype" [Roewer] projection of 1272 17-locus ABI Caucasians (Table 1), one type occurs 66 times (5%) and another 36 times (3%) – common types like Z. Considering all 17 loci breaks these two large groups into 86 groups, mostly singletons of course, and the largest clump is 4 men (the second most popular 17-locus type; the most popular occurs 7 times). Consequently, the two sides of (7) are not far from equal and therefore there is not a lot of scope for improving (6) when the mutation rate is high.

    5. Real and simulated validation data
      1. Data
      2. The application and applicability of the methods developed here are evaluated partly by consideration and analysis of real data. Available population studies include Y haplotype and mtDNA datasets as listed in Table 1. The Y-haplotype databases are ABI data obtained per [Applied Biosystems], [Roewer], ancient Mongolian [Keyser-Tracqui] Reliagene provided by S Sinha, Portuguese from H Gaeda, Macedonian from Z Jakovski, Krakow from M Sanak and from T Kupiec, Malaysian from P Krishnan, and the Ashkenazy data is M Hammer's passed to me via AFDIL. The information about mtDNA databases is from M Coble (AFDIL) and T Melton (Mitotyping Technologies).

        For the purpose of this paper a haplotype database is sufficiently described by its spectrum – the numbers α1, α2, ... of traits seen once, twice, etc. The spectrum was determined for each Y haplotype database after eliminating the handful of irregular observations (profiles with a missing, off-ladder, or extra allele). See Table 2.

        population n κ=α/n #loci spectrum
        α α2 α3 α4 α5 α6 α7 α8 α9 α10 α11 α12 p>12
        Black(US) Reliagene22390.61 1113731874428 195631 111α15=1
        Caucasian Reliagene18260.55 11100511344 1911416 2211 α131415161819262942=1
        Hispanic Reliagene4540.7711 34817622 -2--1- 1
        Native Am Reliagene1040.7411 778-1- -1
        Portuguese3150.9417 29751-1
        Malaysian (Malaysia)3320.9617 31951
        Chinese (Malaysia)3310.8917 29518
        Indian (Malaysia)3050.9417 2879
        Krakow (IES)5230.9317 48519
        Ashkenazy Jewish4960.3510 176281975 1-1--- 1α13=1α14=2α151618=1 α19=2
        Macedonian1000.8317 8371
        Krakow (Jagellonian)1890.9517 1795
        ancient Mongolian270.6710 1831
        Black (US) ABI9850.9417 92528
        African ABI590.6617 3961-1
        Asian ABI3300.9517 3127-1
        Caucasian ABI12760.9017 11524951- -1
        Filipino ABI1050.9617 1012
        SE Hispanic ABI5970.9117 5432221
        Native Am ABI1061.0017 106
        Vietnamese ABI1030.9417 973
        Table 2 spectra and singleton proportion (κ) of some Y haplotype population samples

      3. Population simulations
      4. The conclusions of this paper are partly checked and justified by computer simulation experiments. Each experiment consists of modeling the evolution of a population of haplotypes. Thirty-nine different simulated populations Ω1, ..., Ω39 were thus generated under a variety of modeling conditions of mutation rate, mutation model, population size, and rate of population growth.

        All populations were generated following a Wright-Fisher approach, simulating a generation at a time. Each new generation is obtained as a sample with replacement from the previous generation with some of the haplotypes then modified by a mutation rule. Some populations are grown, starting with a single haplotype and growing at a chosen rate per generation until the target size is reached. Other populations are obtained from a grown population by stabilizing for several hundred generations without growth. Mutation rules include: infinitely many alleles (every mutation produces a brand new haplotype [Crow]), and modified stepwise with a specified number of loci (95% of mutations add or subtract one step at one locus, 5% two steps). Analysis of the first dozen (preliminary) models suggested creating 27 systematically designed population models by choosing all combinations of three values of each of three parameters as follows:

        • population growth of 0, 3% [Slatkin, 2001], or 10%/generation,
        • population size N of 30K, 100K or 300K men,
        • mutation rate (μ) slow, medium, or fast. "Medium" is μ=17/300 per generation, imitating the roughly average 1/300 per locus per generation mutation rate typical of current human identification STR loci including those of the 17-locus Y-filer system. "Slow" is 1/3 of the medium rate, simulated by 17 slower-mutating loci. "Fast" is triple the medium rate, modeled by triple the number of loci. Analysis of the early models showed that the stepwise model isn't significantly different from infinite alleles.

        Within each population Ωm we can then assess the matching probability formula that I suggest. Assessment is done by inspecting samples from the simulated populations. From each Ωm, and for each sample size n=100, 300, 1000, 3000, and 10000, 100 n-samples were drawn and analyzed. Each sample D=D(m,n,j), j=1,...,100, has some collection S0, ..., Si, ..., Sα−1, of α(m,n,j) (0≤αn) distinct haplotypes each of which occur once in D. For any particular type SiD the population frequency fSi=(1/NT∈Ωm I(T=Si) – here I(•) is the indicator function – can be determined by peeking into Ωm (as would not be possible in real life). The formula for matching probability (6) – based on observable data – suggested in this paper is assessed by comparison with fSi.

        A matching probability formula is valid to the extent that it has the same expected value (or in some contexts a larger, "conservative" value is valid) as the expected value of fS. The appropriate interpretation of the term "expected" is

        • since likelihood ratios are multiplied, evaluated in the logarithmic domain, in this case log(1−κ);
        • averaged over indistinguishable cases, i.e. for fixed m, n, all singletons i, and arguably all samples j, are indistinguishable.

        Hence the formula is valid if, for a realistic population model Ω and sample size n, the probability it gives is a good estimate of the expected population frequency for singletons in the sample.

      5. Neutrality test
      6. mtDNA haplotypes or a Y-haplotype system consisting of a large number of STR loci are logically equivalent to a locus with a very high mutation rate, a large number of possible mutational changes and a very large number of potential alleles. As such, it is plausible that these haplotype systems might approximate the infinitely many selectively neutral alleles model [Kimura]. Slatkin's test [Slatkin, 1994] is a Monte Carlo procedure which evaluates a population sample by comparing the distribution of very rare and not-so-rare traits with model expectations. The model predicts a proliferation of rare types (since many are created and many go extinct at each generation), fewer slightly more common types, etc. If the population sample spectrum is relatively unexpected, that is evidence against the model. For the simulated populations which are stabilized, the Slatkin test is used to decide when equilibrium has been reached.

    6. Observations from the simulation models
    7. To assess the behavior of the formula for realistic frequency spectra, populations were simulated by computer under various evolutionary models and numerous samples ("databases") drawn from each one (§III.C.2). In this section the population simulations are used as a proving ground to evaluate the κ model.

      1. Preliminary observations
      2. Following are a few preliminary observations based on examination of the simulations.

        • There are no common types. This is ensured by the high rate of mutation. Even if selection – which was not modeled – favored some mitochondria or Y chromosomes, common types per §III.B.2 are impossible (also see Table 2) so LRκ cannot be highly conservative.

        All else being equal,

        • A smaller population has commoner haplotypes. A larger population supports a proportionally larger number of types, hence the commonness of haplotypes (whether measured by the commonest ones or by the rarer ones) tends to be inversely proportional to population size.
        • Slow mutation means commoner haplotypes. For fixed population size and growth rate haplotype frequencies vary inversely with mutation rate – about proportionally if no growth, more than proportionally if the population is growing.
        • Smaller samples means larger κ. This is obvious from Figure 2.
        • The number of loci across which the mutation is distributed doesn't matter as long as it is large. 17 loci is about the same as infinite alleles.

      3. Evaluation of the κ model
      4. For a given population Ωm and fixed sample size n, there are samples D(m,n,j)=Dj each with some fraction κ(j) of singletons. For each such singleton SiDj there is a population frequency fSi. If (6) is a good formula then on average the matching LR=n/(1−κ(j))≈1/fSi, i.e. the inflation factor 1/(1−κ(j))≈1/nfSi. It turns out, a little surprisingly, that the average value of the population frequencies is independent of κ(j): if D1, by accident of sampling, has more singletons than D2, nonetheless the singletons in those two sets are on average equally rare. Therefore we can estimate the expected population frequency for a singleton by the average f of all the fSi taken over i and j, and correspondingly define the ideal or effective kappa κe(m, n) as that value of κe which satisfies 1/(1−κe)=1/nf.

        The test of the κ model (6) then consists in comparing the average of the calculated inflation factor values 1/(1−κ(j)) with the ideal value 1/(1−κe). Compute κ such that 1/(1−κ)=Average (1/1−κ(j)).
        Average in the sense of geometric mean
        To compare this with the effective κ consider, for each m, n,

        relative κ error = (1−κe)/(1−κ) − 1.

        Error>0 means the κ model exaggerates the strength of the evidence, error<0 means it understates.

        0%/generation population growth 3%/generation population growth 10%/generation population growth
        Figure 3 Relative overstatement (if >0) of recommended LR formula (6) compared to "ideal" LR if the average population frequencies were known. Assessment of the performance of the formula under 27 (=3x3x3) models corresponding to various growth rates, N and μ per §III.C.2. Each bar represents an average computed over 100 samples drawn from one of the 27 simulated populations. Bars with a negative error indicate conservative performance. Omission of sample size n=100 and sporadic very negative errors when n=300 are discussed in the text.

        Figure 3 shows the suitability of the κ model for the 27 simulated populations which systematically investigate the three population parameters mutation rate, growth rate, and population size, as well as sample size. The effects of these parameters are

        • Increasing mutation increases error. At the extreme "fast" mutation rate (17%/generation, more than triple the mutation rate of Y-filer or other haplotype systems in current forensic use) the κ model is moderately anti-conservative – i.e. LR 30% too high – under the population simulation conditions that were used.
        • Increasing growth rate decreases error. At zero growth the κ model is close to accurate. Appendix §V.C gives some indication why this is theoretically expected. Rapid population growth implies negative error, i.e. "conservative" LRs.
        • Larger population size slightly decreases error. It is fortunate that population size isn't a significant factor, because it is by no means clear what population size is appropriate for real populations.
        • Sample size has little effect on error, except for sporadic anomalous very negative errors when the sample size is small.

        Those anomalous negative errors are artifactual. The artifact comes about when κ=1 for a significant proportion of the 100 samples of size n and consequently, to avoid infinite inflation factors (see §III.A.1), the inflation factor is artificially limited to be at most n (which corresponds to α=n−1). This procedure happens to be conservative for the population models examined, as seen by the the sporadic very negative bars for n=300 such as for the models with N=300,000 and fast (17%/generation) mutation. For the same reason most of the statistics for n=100 are meaningless and are therefore omitted from the graphs.

        Those anomalous data points excepted, the graphs visually exhibit a systematic and nearly regular pattern of change in relative error with change in one or another model parameter. Small irregularities therefore indicate the sampling variation that one would see under repeated simulation of the same population model.

        There are various qualities that are likely in a real population that were not modeled. The story of the Genghis Khan Y-haplotype [Zerjal] supports the plausible thesis that some Y chromosomes – or mitochondria – may be associated with superior fitness. If some haplotypes are preferentially selected to reproduce , as is likely through hitchhiking, the consequent excess of multiply-occurring types would supply a safety margin, i.e. tend to make the κ model more conservative.

  4. Discussion
    1. Matching probabilities for non-singletons
      1. Matching LR for multiply observed type
      2. The sample popularity p of an allele is the number of times it is observed in a particular population sample. We are sometimes interested in the matching probability

        yp = Pr(innocent suspect matches crime scene type S | S observed p times in D)

        for p>1. Remember D has been augmented by the crime scene type; yp generalizes y=y1. From preliminary analysis of the simulated populations Ω, the simple rule

        yppy = p(1−κ)/n. (9)

        is, empirically, about right for small p. It is intuitively implausible for large p (for common traits sample frequency must be a good estimate for population frequency meaning that yp p/n; i.e. the inflation factor collapses to 1 for large enough p), but absent selection and rapid population growth all types are rare. That's a fact which contradicts some people's intuition (myself for a long time for example, and [Veldman]) and impression from older data ("minimal" Y-haplotypes of 7 loci). It is predicted by the neutral theory that there should be no common alleles [Ewens, 1972]. Of course, the Y-chromosome as a whole is not neutral. No doubt it includes genes with considerable fitness variation to which the loci of the identification haplotype are hitchhikers par excellence [Zerjal]. The association is never broken by recombination, but nonetheless because of the high mutation rates for the haplotype systems here under consideration common types are not to be expected and are not found.

      3. Average matching LR
      4. We have (5):

        y = Pr(innocent suspect matches crime scene type S | S not observed in D-) = (1−κ)/n

        and would like to estimate

        q = Pr(innocent suspect matches allelic type S)

        in general, without regard to the specific type S. Therefore 1/q=me, the effective number of types (§V.B.3).

        The most straightforward way to estimate me is empirical: count the pairwise matches in a population sample – formula (13). Alternatively there are several theoretical approaches. Assuming the infinitely many neutral alleles model, menκ/(1−κ) (formula (18), Appendix "Value of θ"). Therefore qy. Note that the empirical formula uses all the αp but the theoretical formula seems to use only α. If they tend to agree it must be because the infinite alleles model implies a relationship among the αp and the infinite alleles model approximately holds. Another way is to assume a relationship among the αp; suppose (9). Then q=1/me can be written as a weighted average of the yp. Appendix §V.B.4 carries out this calculation and the consequence, (15), is 1/yme+n−1. This formula has a simple intuitive meaning. Suppose that somehow – by magic or from memory – we know me but lack any reference sample D-. Then the evidentiary significance of a suspect type T matching the crime scene type S is the matching likelihood ratio 1/y=me which indeed corresponds to formula (15) with an empty database D-, whose size n−1=0. If we then sequentially observe reference types each of which does not match S, the effect of each observation is to increment the likelihood ratio 1/y by one.

    2. Comparison with other methods
    3. Other methods for presenting rare haplotype evidence have been suggested.

      1. Haplotype survey
      2. "Haplotype surveying" is an idea proposed in [Roewer] to estimate evidentiary significance for Y-haplotypes. In contrast to my approach, the name (sequence) of a haplotype is assumed to contain information. The model assumption is that haplotypes which are near neighbors in mutations steps are likely to have similar population frequencies. A common allele begets numerous copies of its neighbors at every generation, so the neighbors are also common. However for haplotype systems nearly all types are rare and certainly the types of interest are the rare ones. The frequencies of rare haplotypes are relatively far more affected by drift than by mutation because the effect of drift – essentially sampling variation – is to change the number of instances c of a trait by something like √c, which is a large percentage change when c is small. Therefore a haplotype surveying approach does not seem founded on a plausible model. It may be roughly equivalent to guessing y=1/me±random variation.

      3. The "FBI method"
      4. If the population frequency of a trait is more than 3/n, then the probability that it will be unobserved in a sample of size n is about 5%. Sometimes this inference is illogically inverted to claim that when the sample frequency is 0/n the population frequency is 95% to be in the range [0/n, 3/n]. Treating 3/n as the matching probability amounts to offering LRF=n/3 as the matching likelihood ratio.

        Compared to the κ model (6), this FBI method understates the evidence by a factor of LRκ/LRF=3/(1−κ) or roughly from 7.5 to 30 or more for current systems and data. It has no logical underpinning – it doesn't correspond to any model. If a very conservative approach is deemed desirable, then formula (1) has the advantage of being logical and simple, and incidentally squanders three times less evidence than the FBI method.

    4. Comment – Probability vs. Frequency
    5. "Probability" means the long-run success rate of some conceptually repeatable experiment ("trial"). The starting conditions are the same for every trial, and consist of fixing whatever information is known to the experimenter. Probability is therefore a summary of the data that is available to the experimenter. Population frequency on the other hand is unknown so certainly cannot be the probability of anything.

      Traditionally the forensic community answers the matching probability question by consulting a population sample, typically of several hundred or thousand people. The sample frequency is then taken as an estimate for population frequency which in turn is used as a surrogate for probability. As a result it has become habit in the forensic community to conflate frequency and probability. There is an institutionalized misconception that population frequency is matching probability. For common traits such as an allele at an STR locus the frequency approximation is simple and is reasonable enough. But frequency isn't the goal, probability is, and if matching probability can be decided without the distraction and detour of considering population frequency, so much the better – especially for rare traits for which population frequency can't be accurately estimated.

      Reviewing the logic for step C (4) of the κ model derivation is the clearest way to see that population frequency is irrelevant to the matching probability question. We assume, at point C in the logic, that the innocent suspect has type T and that T coincidentally matches one of the α singletons S0, ..., Si, ..., Sα−1. Which one? The available data is identical for the various Si. They all come from crimes or circumstances with which the suspect is equally unconnected, and they all have been observed only once. Given our stipulation that names don't matter the subscripts could as well be permuted at random. Can it be more obvious that the probability for T to match each of the Si is the same – therefore 1/α? Of course the singletons have different population frequencies but that, and confidence intervals, would only be relevant if we planned to bet on what the population frequency of S0 is, which we do not. We plan to bet on T matching S0 (given that T matches some singleton), and the chance to win that bet is exactly 1/α, neither more nor less.

    6. Conclusion
    7. The fundamental question to decide the evidentiary significance of a trait linking suspect to crime is not one of frequency but of probability: What is the probability for such a match to happen by coincidence when the suspect is innocent? Of JS Mill's description that probability of an event "means the degree of expectation of its occurrence, which we are warranted in entertaining by our present evidence" [Mill] the last two words are worth particular emphasis. It is not relevant what our probability estimate would be if we had different population data than we have (which is the motive for confidence intervals), any more than makes sense to speculate about alleles in loci that have not been tested. The evidentiary strength of the match is a summary of present evidence.

      A second key insight is that from the perspective of an innocent suspect (the key perspective), the crime scene trait as a datum stands on equal footing with the reference types in the "database" or population sample. Hence it should be considered as part of the sample. In the language of statistics, you must condition on the crime scene observation.

      Third is the fact that, contrary to ordinary experience, sample frequency may not be a good indicator of matching probability (or of population frequency for that matter). In particular, it's necessary to consider the entire sample, not just the one matching trait, and if the sample has predominately unique types then sample frequency overstates matching probability – the "inflation factor".

      inflation factors, various models Figure 4 comparison of matching LR from four different formulas by showing the inflation factors

      Four versions of matching likelihood ratio for matching to a previously unseen trait have been developed in this paper, three of which may be useful depending on circumstance and practicalities. The best estimate is the κ model (6), that LRκ=n/(1−κ). The counting formula (1) – LRc=n – is very robust and even easier to explain because it doesn't take advantage of the inflation factor, and can be used if the highly conservative value it supplies is sufficient for a particular forensic case. An intermediate possibility, robust and slightly conservative would be to use LR=me (12) for new haplotypes. It is certainly the right number to use absent knowing p, which can happen. For large κ it is only a little smaller than LRκ (Figure4) and may be easier to explain. Finally although I can't justify the t model (Appendix §V.B.1) mathematically, it is included in the graph for interest and comparison.

      For the less frequent situation p>1, (9) is an idea but not a recommendation. All we can say for now is that n/p < LR < n/(1−κ)p.

      Finally, it is worth noting that the κ model analysis presented here applies perfectly well to traditional STR systems. If a handful of common alleles make up 99% of the allele population, then κ is near zero so take the inflation factor as 1 and the κ model degenerates to the counting model. For an allele observed p−1 times in a reference database D- of size n−1, the matching probability is therefore p/n.

  5. Appendix
    1. Notation
    2. Si a once-observed haplotype. S0 is the haplotype from the crime scene
      T haplotype of an innocent suspect
      y probability that T matches S0, given that S0 has been observed once
      yp probability that T matches a p-times observed type. (y1=y)
      N population size
      Ω denotes a population of haplotypes. Ω1, ..., Ωm, ... are various simulated populations.
      Ωt population of t equally rare traits
      Ωz,t population of a common trait of frequency z and rare traits of frequency 1/t
      n sample size
      D a sample drawn from some Ω. D- is a sample before including the crime stain.
      p popularity of an allele – number of times it occurs in a sample
      α number of once-observed haplotypes (singletons) in a sample
      α1, α2, ..., αp, ... the number of singletons, doubletons, ... haplotypes of popularity p, in a sample. (α1)
      κ sample proportion α/n of singletons in a sample
      fS population frequency of type S
      f=1, f>1 total population proportion of the haplotypes that are singletons, non-singletons, in some sample
      Ne effective population size, a population parameter
      q probability two randomly selected types match (absent knowing p)
      me effective number of types – reciprocal of the probability two randomly selected types match
      μ mutation rate per generation for a haplotype system

    3. Statistical estimates
      1. Maximum likelihood estimate of single trait probability
      2. Suppose each individual in a certain population Ωt possesses one or another of t distinguishable and equally rare traits. Select a random sample D of n individuals. Let κ be the proportion of singletons, that is, of traits that occur exactly once in D. The expected value of κ, E(κ), is given by

        E(κ) = (1−1/t)n−1, or E(κ)≈ e−(n−1)/t if t is not small.

        Proof: Let P(i) be the probability that the i-th element in the sample is a singleton:


        Over a large number of n-samples, P(i) is also the expected number of singletons at the i-th elements, per sample. These expectancies are additive (regardless that the events being summed are not independent), so

        E(α) = Σ i=1, 2, ..., n P(i) = nP(i), and
        E(κ) = E(α/n) = E(α)/n = P(i) = (1−1/t)n−1 (10)

        as claimed. Using the fact that (1−1/t)t→e−1 (e=2.71828...) as t→∞, (1−1/t)n−1 = (1−1/t)t(n−1)/t ≈ e−(n−1)/t, so from (10),

        E(κ) ≈ e−(n−1)/t (11)


        The value of κ can be observed in a sample. If we assume that the observed value is close to the expected value, then t can be estimated from formula (11).

        κ ≈ E(κ) ≈ e−(n−1)/t ≈ en/t, whence
        t ≈ −n/ln κ. (12)

      3. Comparing the t model with the κ model
      4. Suppose that true state of nature is Ωt. Then the correct likelihood ratio for a match would be LRt = t.

        Perhaps from ignorance of the true state of nature we instead estimate the likelihood ratio using the κ model (6). What error would we commit?

        LRκ/LRt = n/(1−κ)t
        ≈ − (ln κ)/(1−κ) (by (12))
        = [(1−κ)+(1−κ)2/2 + ...]/(1−κ) (expanding ln κ in Taylor series around κ=1),
        LRκ ≈ LRt{1 + (1−κ)/2}.

        Although the above is well short of rigorous it does seem in the domain of interest – κ close to 1 – that LRκ is not far different from LRt. See Figure 4. Hence even in the pathological t-model, where the assumption (7) underlying the κ model is violated to the maximum extent possible, the formula (6) is nonetheless not far off.

      5. Effective number of types
      6. The effective number of types, me, is the likelihood ratio supporting identity when two randomly selected types match – i.e. it is the number of types which, if they were equally frequent, would provide diversity equal to the actual diversity.

        The effective number of types can easily be estimated directly from the population sample D. Simply count the number of ways that a matching pair can be drawn from the sample and compare it to the number Cn2 of ways to draw any pair. A matching pair is obtained only when both members have the same popularity p and belong to the same one of the αp identity cohort groups of that popularity. Hence (summation is always over p)

        me = Cn2/Σαp Cp2. (13)

        Thus 1/me is the probability for the type T of an innocent suspect to match a database type S absent knowing the sample popularity of S.

      7. Matching LR for new type from me
      8. When the sample popularity of S is given as p, then the matching probability is some different amount yp, 1≤p, where 1/me is a weighted average of the yp, bigger than y1=y and smaller than yp for large p. Specifically there are pαp objects S in D of popularity peach of which has probability yp to match T, so

        1/me = Σyppαp/n. (14)
        Assume the condition yppy of equal overrepresentation (9). Then
        Σyppαp/n yΣp2αp/n
        = yΣ(pαp +2Cp2αp)/n (now use npαp and (13))
        = y[n+2Cn2/me]/n
        = y(1+(n−1)/me).
        Substituting this into (14) gives the elegant result
        1/y me+n−1. (15)

        Alternatively, (15) follows from the neutral theory (§V.C). The neutral theory imples that β(0,θ) is the prior probability distribution for haplotype frequencies ([Ewens, 2003], page 116), from which (15) follows ([Dawid] "Use of databases") by standard properties of the beta function.

        Apparently the neutral model and equal overrepresentation are related assumptions.

    4. Population genetic estimates
      1. Assume the infinitely many neutral alleles model [Kimura] or "neutral theory" for short.

      2. Effective population size and effective number of alleles
      3. A classical formula from population genetic theory is me≈1+2Neμ (for haploid populations), Ne being the effective population size – the size of an ideal equilibrium population with equivalent me, so roughly Neme/2μ. This formula is the motivation for the example population sizes chosen for simulations. From for example the ABI U.S. Caucasian Yfiler database, me≈8883 (Table 1) and μ≈17/300, hence Ne≈80,000.

      4. Expected value of κ
      5. Under the neutral theory [Ewens, 2003] (p 94), gives

        Pr(nth sampled allele is previously unobserved) = θ/(θ+n),

        where we can define θ = me −1 ≈ 2Neμ. Equating with Robbins' result, we have

        κ≈ θ/(θ+n) ≈ 1/[1 + n / 2Neμ ]. (16)

        Hence κ will be smaller for larger samples, and larger for larger populations or larger mutation rates. Note that the mutation rate μ for a Y-haplotype is the combined mutation rate for all of the included loci, hence μ≈17/300 per generation for the 17-locus Y-filer haplotype.

      6. Value of θ
      7. We can solve (16) for θ:

        θ nκ/(1−κ), (17)
        or, perhaps more robustly, we could use the same relation to infer θ from κ values for a range of n using the bootstrap and curve-fitting method of Figure 2.

        Since for rare haplotypes me >>1, we can say me ≈θ, so for situations of interest for this paper

        me nκ/(1−κ). (18)

    5. Condition of equal overrepresentation
    6. I claim that the condition (3), Pr(Singleton|Observed)=κ, is equivalent to the condition of equal overrepresentation (7) that κ/f=1 = (1-κ)/f>1. These claims are relative to a fixed sample D and a haplotype T randomly selected from the general population. By way of proof, note that f=1=Pr(Singleton)=Pr(Singleton & Observed) while f>1=Pr(~Singleton & Observed). So (7) is

      κ/Pr(Singleton & Observed) = (1-κ)/Pr(~Singleton & Observed)
      orκ/Pr(Observed)Pr(Singleton|Observed) = (1-κ)/Pr(Observed)Pr(~Singleton|Observed)
      orκ/Pr(Singleton|Observed) = (1-κ)/[1-Pr(Singleton|Observed)],

      which is true if and only if (3), Q.E.D.

  6. Acknowledgments
  7. I am indebted to Terry Speed whose broad hint in pointing me to several key references [Robbins, Ewens, 1972, Dawid] years ago suggest that he anticipated the main idea of this paper. Thanks to Tom Parsons, Monty Slatkin, Keiji Tamaki, Jim Crow, Bruce Weir and Steve Lee for encouragement and discussions. This work was supported in part by the DNA·VIEW User's Group.


Prinz, M., Advantages and Disadvantages of Y-Short Tandem Repeat Testing in Forensic Casework. Forensic Sci Rev, 2003. 15 p. 189-196.

Mayntz-Press, K.A., et al., abstract Y-STR Profiling in Extended Interval (> or =3 days) Postcoital Cervicovaginal Samples. J For Sci, 2008. 53(2): p. 342-348.

Malsom, S., et al., abstract The prevalence of mixed DNA profiles in fingernail samples taken from couples who co-habit using autosomal and Y-STRs. For Sci Int: Genetics, 2009. 3: p. 57-62.

Kayser, M., et al., abstract Evaluation of Y-chromosomal STR's: a multi-center study. Int. J Leg Med, 1997. 110: p. 125-133.

Jobling, M., A. Pandya, and C. Tyler-Smith, abstract The Y chromosome in forensic analysis and paternity testing. Int J Legal Med 1997. 110: p. 118-124.

Ivanov, P., et al., abstract Mitochondrial DNA sequence heteroplasmy in the Grand Duke of Russia Georgij Romanov establishes the authenticity of the remains of Tsar Nicholas II. Nature Genetics, 1996. 12: p. 417-420

Keyser-Tracqui, C., E. Crubézy, and B. Ludes, abstract Nuclear and Mitochondrial DNA Analysis of a 2,000-Year-Old Necropolis in the Egyin Gol Valley of Mongolia. Am J Hum Genet, 2003. 73: p. 247-260.

Roewer, L., et al., abstract A new method for the evaluation of matches in non-recombining genomes: Application to Y-chromosomal short tandem repeat (STR) haplotypes in European males. Forensic Science International 2000. 114: p. 31-43.

Robbins, H., full text Estimating the total probability of the unobserved outcomes of an experiment. The Annals of Mathematical Statistics, 1968.

Morton, N., full text Genetic structure of forensic populations. Proc Natl Acad Sci USA, 1992. 89: p. 2556-2560.

Applied_Biosystems, Yfiler haplotype database.

Crow, J., full text Twenty-five Years Ago in Genetics: The Infinite Allele Model. Genetics, 1989. 121: p. 631-634.

Slatkin, M. and G. Bertorelle, full text The use of intraallelic variability for testing neutrality and estimating population growth rate. Genetics, 2001. 158(2): p. 865-74.

Kimura, M. and J. Crow, full text The number of alleles that can be maintained in a finite population. Genetics, 1964. 49: p. 725-738.

Slatkin, M., abstract An exact test for neutrality based on the Ewens sampling distribution. Genetical Research, 1994. 64(1): p. 71-74.

Zerjal, T., et al., full text The genetic legacy of the Mongols. Am J Hum Genet., 2003. 72(3): p. 717-721.

Veldman, A., full text Evidential strength of Y-STR haplotype matches in forensic DNA casework. 2007, Universiteit Leiden.

Ewens, W., commentary by Ewens The sampling theory of selectively neutral alleles. Theoretical population biology, 1972. 3: p. 87-112.

Mill, J., A System of Logic: Ratiocinative and Inductive; Being a Connected View of the Principles of Evidence and the Methods of Scientific Investigation. Vol. 2, Ch 18. 1868, see page 62 London: Longmans, Green, Reader, and Dyer.

Ewens, W., Mathematical Population Genetics: I. Theoretical introduction, 2nd edition. 2003: Springer.

Dawid, A. and J. Mortera, abstract & first page Coherent Analysis of Forensic Identification Evidence. J.R.Statist.Soc B, 1996. 58(2): p. 425-443.