
Y-chromosomal and mitochondrial haplotyping offer special advantages for criminal (and other) identification. For different reasons, each of them is sometimes detectable in a crime stain for which autosomal typing fails. But they also present special problems, including a fundamental mathematical one: When a rare haplotype is shared between suspect and crime scene, how strong is the evidence linking the two? Assume a reference population sample is available which contains n−1 haplotypes. The most interesting situation as well as the most common one is that the crime scene haplotype was never observed in the population sample. The traditional tools of product rule and sample frequency are not useful when there are no components to multiply and the sample frequency is zero. A useful statistic is the fraction κ of the population sample that consists of "singletons" – of once-observed types. A simple argument shows that the probability for a random innocent suspect to match a previously unobserved crime scene type is (1−κ)/n – distinctly less than 1/n, likely ten times less. The robust validity of this model is confirmed by testing it against a range of population models.
This paper hinges above all on one key insight: probability is not frequency. The common but erroneous "frequency" approach adopts population frequency as a surrogate for matching probability and attempts the intractable problem of guessing how many instances exist of the specific haplotype at a certain crime. Probability, by contrast, depends by definition only on the available data. Hence if different haplotypes but with the same data occur in two different crimes, although the frequencies are different (and are hopelessly elusive), the matching probabilities are the same, and are not hard to find.
Keywords: haplotype; stain matching; mtDNA; Y-haplotype; forensic mathematics; likelihood ratio; matching probability
In recent years there has been increasing interest in using Y-chromosomal haplotypes [Prinz, Mayntz-Press, Malsom, Kayser, Jobling] or mtDNA for forensic identification. These haplotype systems are also much used for body identification, especially for old graves [Ivanov, Keyser-Tracqui]. The advantages for some kinds of problems are considerable. Both methods are desirable for attacking kinship problems involving remote relatives because sex-linked traits are not diluted 50% each generation. Both methods are useful for some scant crime stains – mtDNA because of the high copy number, the Y-chromosome because it can be detected and amplified unambiguously even when it is a minor component compared to the female victim in a rape sample. However, an mtDNA or a Y-chromosomal haplotype must be treated mathematically as a single indivisible ("atomic") trait; so unlike those traditional DNA methods which examine several traits that are approximately independent of one another, no multiplication of probabilities is possible. Therefore it is vital to have a sound fundamental understanding of atomic trait matching probabilities in order to make a reasonable assessment of the strength of identification evidence.
Evidence – such as a Y-haplotype – is identical between a crime scene and a suspected donor. How strong is the evidence that the suspect is the donor? In particular, this paper discusses the critical matching probability question, "What is the probability that a random non-donor would by chance match a newly observed type?"
Such interesting questions and complications as possible dependence among traits and suitability of the reference population sample are outside the domain of this paper, essential though they may be to the comprehensive evaluation of DNA evidence. Moreover, to keep the discussion focused and simple, I ignore complications such as mitochondrial heteroplasmy, or adjustments appropriate because of multiple samples or preliminary phenotypic analysis. I take the DNA sequence of a haplotype as being no more than an arbitrary name; in this respect my model differs from that of Krawczak [Roewer].
The matching probability will be derived by analysis of a population sample. For the purpose of this paper, assume that the population sample is appropriate, that it is randomly representative of possible innocent suspects. It may well be true that geographical clustering or other population genetic phenomena make it difficult to obtain a perfectly representative reference sample (subset, or "database") but for the purpose of this paper put such difficulties aside.
The problem attacked here is therefore a modest one. But it is also a fundamental one, for without understanding the proper analysis of an individual indivisible trait under the simplest assumptions it is impossible to give a proper analysis of DNA evidence in any situation. I therefore consider this the fundamental question of forensic mathematics.
I shall focus on the case – typical for mtDNA and Y-haplotypes – that the observed crime scene type is not found in the population sample.
Evidentiary strength of a matching type is given by a likelihood ratio (LR), called here the matching LR:
LR = x/y, where x and y are the likelihoods
x = Pr(suspect matches crime scene type | suspect is the donor)
y = Pr(suspect matches crime scene type | suspect unconnected to the crime).
The simple view is to assume x=1. That isn't strictly true – one could model typing fallibility (laboratory error), heteroplasmy (in the case of mitochondrial evidence), and mutation (when identification is via relatives). But to keep to the point, settle for x=1. That being the case, we have simply LR=1/y, where y is the matching probability.
The formulation of y here states "suspect unconnected to the crime." I prefer this to saying "suspect unrelated to the donor" since after all everyone is genetically related, and when dealing with a Y-chromosomal or mtDNA match to pretend otherwise is particularly artificial. As an assumption that is both realistic enough to be useful and ideal enough to permit analysis, think of the suspect, if innocent, as being randomly selected, hence having the same random chance as anyone else in the population to have any particular relationship to the donor. (Similarly, while a reference population sample should be gathered without a bias toward related people it also should not be purged of relatives. It should contain people who are related randomly.)
For the practical forensic scientist, the essential sections of this paper, amounting to about one third of the total bulk, are The problem, Analysis which derives the main result, the likelihood ratio for matching a previously unobserved trait, Conclusions and Probability vs. Frequency. It is also useful to be aware of Appendix Notation which gathers together the symbols encountered in the paper.
The remainder of the paper includes validation and consideration of non-new traits in §III Results, some theoretical context in Appendix §V.C, and §IV.B comments on rival approaches. The mathematics in the paper does not go beyond straightforward algebra and probabilities, and for the most part is relegated to appendices.
Denote by S_{0} the crime scene haplotype, and let D^{} ("database") be a population sample of size n−1. Since D^{} is stipulated to be an appropriate and representative population sample, there's no harm in thinking of D^{} as a collection of crime scene haplotypes from other similar but unrelated crimes. Assume that the type S_{0} does not appear in D^{}. There is some collection of types that occur exactly once in D^{}. Call them singletons (Figure 1), and let α_{1}−1 be the number of singletons. To avoid double subscripts I shall generally omit the subscript 1 and write α for α_{1}. Label the singleton types S_{1}, S_{2}, ..., S_{α−1}. These and S_{0} are all by definition distinct. We shall eventually

DYS390  DYS391  DYS392  Singleton
22      10      11      *
23      10      11
23      10      11
23      10      11
23      10      11
23      11      12      *
24      9       11      *
24      11      11      *
25      10      11
25      10      11
Figure 1 Example population sample D with n=10 and α=4 singletons (starred). Hence κ=0.4.
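The counts in Figure 1 can be reproduced mechanically. The following is a minimal sketch, not part of the original analysis; the names `FIGURE_1_SAMPLE` and `singleton_fraction` are hypothetical, and each haplotype is represented simply as a tuple of its three allele values.

```python
from collections import Counter

# The ten haplotypes of Figure 1 as (DYS390, DYS391, DYS392) tuples.
FIGURE_1_SAMPLE = [
    (22, 10, 11),
    (23, 10, 11), (23, 10, 11), (23, 10, 11), (23, 10, 11),
    (23, 11, 12),
    (24, 9, 11),
    (24, 11, 11),
    (25, 10, 11), (25, 10, 11),
]


def singleton_fraction(sample):
    """Return (n, alpha, kappa): sample size, number of once-observed
    types, and the fraction of the sample made up of singletons."""
    counts = Counter(sample)
    n = len(sample)
    alpha = sum(1 for c in counts.values() if c == 1)
    return n, alpha, alpha / n


n, alpha, kappa = singleton_fraction(FIGURE_1_SAMPLE)
print(n, alpha, kappa)  # 10 4 0.4
```

Only the multiplicity of each type is used; the allele values themselves serve as arbitrary names, in keeping with the stipulation that the sequence of a haplotype carries no information.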
The first key observation is that the probabilities Pr(T=S_{i}), where T denotes the haplotype of an innocent suspect, are all the same. Probability is a summary of information and the information about all the S_{i} is exactly the same: all have been observed exactly once and all lack any connection to the innocent suspect. Consequently it is convenient and appropriate to think of the extended database D=D^{}∪{S_{0}} obtained by tossing S_{0} into D^{}. D is a population sample of size n, with α singletons: S_{0}, S_{1}, S_{2}, ..., S_{α−1}. The matching probability to a crime sample of a new type and with database D^{} is equivalent to the matching probability to a singleton in database D when there has been no crime.
Imagine comparing T with the n types in D. Since T by stipulation has no connection to the crime or to D^{}, even if we assume that T matches some type among the n, the probability y is at most 1/n that T match any particular unique type – call it a singleton – such as S_{0}. Therefore if LR is the likelihood ratio that a matching suspect is the source of S_{0} we can say confidently that
LR > LR_{c} = n  (1) 
It is surely possible to do even better. A major task of this paper is deriving (§II.D) and justifying (§III.A) an inflation factor by which (1) can be improved if we additionally take into account the considerable chance T doesn't match any of the n previously observed types.
It could be that even a further improvement is possible by arguing that, even when T matches some type among the n, singletons such as S_{0} have less than their 1/n fair share of probability to be matched. This possibility is examined in §III.B.2. However, while there may be such an effect if some haplotypes are associated with a fitness advantage, it is not very large and is hard to model in a convincing way.
What is the probability that the next haplotype observed will be new? Answer: Since κ grows only slowly, it should be about the same as the probability that the last one was, which is to say κ. This is close to a theorem of H. Robbins [Robbins]. It follows as an immediate corollary that about κ of the types in the population are not represented in D, have not been seen. That's how typical it is that a crime scene type is previously unseen.
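The claim that the next haplotype is new with probability about κ can be checked numerically. The sketch below is an illustration only: the population (1000 types with weights proportional to 1/(k+1)), the sample size and the number of trials are all arbitrary assumptions, and `robbins_check` is a hypothetical name.

```python
import random
from collections import Counter
from itertools import accumulate


def robbins_check(n=50, n_types=1000, trials=4000, seed=1):
    """Compare the average singleton fraction kappa of n-samples with the
    observed rate at which one further draw is a previously unseen type."""
    rng = random.Random(seed)
    # Assumed skewed population: type k has weight 1/(k+1).
    cum_weights = list(accumulate(1.0 / (k + 1) for k in range(n_types)))
    types = range(n_types)
    kappa_sum = 0.0
    new_count = 0
    for _ in range(trials):
        draws = rng.choices(types, cum_weights=cum_weights, k=n + 1)
        sample, next_type = draws[:n], draws[n]
        counts = Counter(sample)
        kappa_sum += sum(1 for c in counts.values() if c == 1) / n
        new_count += next_type not in counts
    return kappa_sum / trials, new_count / trials


mean_kappa, new_rate = robbins_check()
print(mean_kappa, new_rate)  # the two values should be close
```

Under these assumptions the two averages should agree to within sampling noise, which is the content of the observation attributed to Robbins.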
Now we can evaluate y=Pr(S_{0}=T). The analysis proceeds in three simple steps. If S_{0}=T, then each of the following must be true:
A. (Observed) T matches something in D. Pr(Observed) means the probability that a haplotype matching T is found in D. Since T is a new type with probability κ,
Pr(Observed) = 1−κ.  (2)
B. (Singleton) The type in D that T matches is a singleton. Given that T matches something in D, the chance that the matched type is a singleton is taken to be the sample fraction of singletons:
Pr(Singleton | Observed) ≈ κ.  (3)
C. (Match) T=S_{0}. There are α singletons, so given a match to a singleton it is 1/α to match any particular one. Given that T matches some singleton S_{i} in D, the subscript i is equally likely to have one value as another, hence
Pr(Match | Observed & Singleton) = 1/α.  (4)
Ploddingly putting the above together –
y = Pr(Match)
  = Pr(Match & Singleton & Observed)
  = Pr(Match | Singleton & Observed) Pr(Singleton | Observed) Pr(Observed)
  ≈ (1−κ)κ/α
  = (1−κ)/n.  (5)
Consequently the evidential strength for matching a previously unseen haplotype is
LR ≈ LR_{κ} = n/(1−κ)  (6)
where the subscript κ corresponds to the above analysis according to the "kappa model."
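As a minimal sketch of formula (6), assuming n and α are counted in the extended sample D (crime scene type included), the matching probability and likelihood ratio follow directly from κ=α/n. The function name `kappa_model_lr` is hypothetical, and refusing to evaluate the case κ=1 is a choice made for this sketch, not a prescription.

```python
def kappa_model_lr(n, alpha):
    """Return (kappa, y, LR) for a previously unseen crime scene type.

    n     -- size of the extended sample D (crime scene type included)
    alpha -- number of singletons in D, also counting the crime scene type
    """
    if not 1 <= alpha <= n:
        raise ValueError("need 1 <= alpha <= n")
    kappa = alpha / n
    if kappa >= 1.0:
        # Formula (6) would give an infinite LR; this sketch declines to answer.
        raise ValueError("kappa = 1: formula (6) is not usable")
    y = (1.0 - kappa) / n   # matching probability, formula (5)
    return kappa, y, 1.0 / y  # LR = n / (1 - kappa), formula (6)


# Figure 1 sample: n = 10, alpha = 4.
kappa, y, lr = kappa_model_lr(10, 4)
print(kappa, y, lr)  # LR is about 16.7, versus the counting value n = 10
```

The ratio of the returned LR to n is the factor 1/(1−κ) by which the κ model exceeds the counting formula (1).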
It has occasionally been said that the matching LR can be no larger than n−1, the size of the database. Formula (6) gives matching LR that are larger than n by a factor of 1/(1−κ), which I therefore call the inflation factor in Table 1.

For small enough samples or exceedingly polymorphic traits it may happen that κ=1 in which case formula (6) gives an infinite LR and infinite inflation factor, which obviously cannot be accurate. Possible remedies include a careful and refined statistical treatment, or simply avoiding κ too close to 1. For the present study I prefer the latter approach.
Let's consider more carefully question B, Pr(Singleton | Observed). If there are any types in the population that don't appear in the sample, they are a fortiori underrepresented in the sample. Therefore to compensate, types that do occur in the sample tend to be overrepresented [Morton]. This would be most obviously true for represented rare types. The intuition behind (3) is that singleton types in D are on average at least as overrepresented, compared to their population frequency, as are non-singleton types. That is, let f_{=1} denote the (normally unknown) combined population frequency of the types {S_{i}} which are singletons in D, and f_{>1} denote the combined population frequency of the types that occur multiply in D. The corresponding sample frequencies are κ and 1−κ. Then κ/f_{=1} is the overrepresentation rate for singletons and (1−κ)/f_{>1} is the overrepresentation rate for non-singletons. In claiming (3) we are assuming that (see Appendix, Condition of equal overrepresentation)
κ/f_{=1} ≈ (1−κ)/f_{>1},  (7) 
though for the courtroom context we will normally also be quite happy if the inequality  
κ/f_{=1} ≥ (1−κ)/f_{>1}  (8) 
holds as it is "conservative" in the sense that it implies that the LR given by (6) at worst understates the evidence connecting suspect to crime.
Is (8) justified? That depends on the frequency spectrum – the distribution of frequencies of types – that the evolutionary mechanisms such as mutation, drift, and selection tend to produce. First consider two unrealistic extreme frequency spectra, then reality.
At one extreme, the worst case for (8), imagine the sample D of size n to be drawn from an artificial population Ω^{t} consisting of some large unknown number t, approximately in the range n < t < n^{2}/2, of equally rare types. Then in D, in addition to the α=α_{1} singletons, we may have some α_{2}>0 doubletons and even some tripletons, i.e. duplicated types, which are overrepresented two and three times more than the singletons are overrepresented. In short, the singletons are overrepresented the least and (8) fails; the recommendation (6) would be anti-conservative. Formula (3) is therefore not as obvious as it looks. However, even if we erroneously apply the κ model formula (6) to the Ω^{t} model, the expected error in using LR_{κ} instead of the correct LR_{t}=t would be small (Appendix, §V.B.1).
At the opposite extreme consider a population Ω^{z,t} with a common type Z of substantial frequency and a large number of very rare types of frequency 1/t. As an example, Ω^{0.2, 10000} has a 20% type and 8000 rare types. A sample D from Ω^{0.2, 10000} of size n=100 can be expected to include nearly 80 singletons (thus κ≈0.8 and f_{=1}≈80/10000), about 20 copies of Z and perhaps a few doubletons of rare types (f_{>1}=0.2+). Hence κ/f_{=1}≈100 while (1−κ)/f_{>1}≈1, so (8) holds with a factor of 100 to spare.
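The arithmetic for the Ω^{0.2, 10000} example can be illustrated by drawing one sample and computing both overrepresentation rates. This is a sketch under stated assumptions: the function name `overrepresentation`, the seed, and the labelling of rare types by integers are all illustrative choices.

```python
import random
from collections import Counter


def overrepresentation(z=0.2, t=10000, n=100, seed=0):
    """Draw one n-sample from a population with a common type "Z" of
    frequency z and rare types of frequency 1/t each; return the
    overrepresentation rates (kappa / f_=1, (1 - kappa) / f_>1)."""
    rng = random.Random(seed)
    n_rare = round((1 - z) * t)  # 8000 rare types in the example
    sample = [
        "Z" if rng.random() < z else rng.randrange(n_rare)
        for _ in range(n)
    ]
    counts = Counter(sample)

    def pop_freq(h):
        return z if h == "Z" else 1.0 / t

    singles = [h for h, c in counts.items() if c == 1]
    multis = [h for h, c in counts.items() if c > 1]
    kappa = len(singles) / n
    f_single = sum(pop_freq(h) for h in singles)  # combined frequency f_=1
    f_multi = sum(pop_freq(h) for h in multis)    # combined frequency f_>1
    return kappa / f_single, (1 - kappa) / f_multi


rate_single, rate_multi = overrepresentation()
print(rate_single, rate_multi)  # inequality (8) asks rate_single >= rate_multi
```

Whenever every singleton in the sample is a rare type, the first rate equals t/n regardless of how many singletons were drawn, which is why the margin in (8) is so large for this population.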
The frequency spectrum of a real haplotype system lies in between. Constant t is unrealistic.

The application and applicability of the methods developed here are evaluated partly by consideration and analysis of real data. Available population studies include Y haplotype and mtDNA datasets as listed in Table 1. The Y-haplotype databases are ABI data obtained per [Applied Biosystems], [Roewer], ancient Mongolian [Keyser-Tracqui], Reliagene provided by S Sinha, Portuguese from H Gaeda, Macedonian from Z Jakovski, Krakow from M Sanak and from T Kupiec, Malaysian from P Krishnan, and the Ashkenazy data is M Hammer's passed to me via AFDIL. The information about mtDNA databases is from M Coble (AFDIL) and T Melton (Mitotyping Technologies).
For the purpose of this paper a haplotype database is sufficiently described by its spectrum – the numbers α_{1}, α_{2}, ... of traits seen once, twice, etc. The spectrum was determined for each Y haplotype database after eliminating the handful of irregular observations (profiles with a missing, off-ladder, or extra allele). See Table 2.

The conclusions of this paper are partly checked and justified by computer simulation experiments. Each experiment consists of modeling the evolution of a population of haplotypes. Thirty-nine different simulated populations Ω_{1}, ..., Ω_{39} were thus generated under a variety of modeling conditions of mutation rate, mutation model, population size, and rate of population growth.
All populations were generated following a Wright-Fisher approach, simulating a generation at a time. Each new generation is obtained as a sample with replacement from the previous generation with some of the haplotypes then modified by a mutation rule. Some populations are grown, starting with a single haplotype and growing at a chosen rate per generation until the target size is reached. Other populations are obtained from a grown population by stabilizing for several hundred generations without growth. Mutation rules include: infinitely many alleles (every mutation produces a brand new haplotype [Crow]), and modified stepwise with a specified number of loci (95% of mutations add or subtract one step at one locus, 5% two steps). Analysis of the first dozen (preliminary) models suggested creating 27 systematically designed population models by choosing all combinations of three values of each of three parameters as follows:
Within each population Ω_{m} we can then assess the matching probability formula that I suggest. Assessment is done by inspecting samples from the simulated populations. From each Ω_{m}, and for each sample size n=100, 300, 1000, 3000, and 10000, 100 samples of size n were drawn and analyzed. Each sample D=D(m,n,j), j=1,...,100, has some collection S_{0}, ..., S_{i}, ..., S_{α−1} of α(m,n,j) (0≤α≤n) distinct haplotypes each of which occurs once in D. For any particular type S_{i}∈D the population frequency f_{Si}=(1/N)Σ_{T∈Ω_{m}} I(T=S_{i}) – here I(•) is the indicator function – can be determined by peeking into Ω_{m} (as would not be possible in real life). The formula for matching probability (6) – based on observable data – suggested in this paper is assessed by comparison with f_{Si}.
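The simulate-then-peek procedure can be sketched in a few lines. What follows is a deliberately simplified illustration, not the simulation code used for this paper: it assumes a constant-size population started from a single haplotype, the infinitely-many-alleles mutation rule only, and arbitrary parameter values; `wright_fisher` and `assess_sample` are hypothetical names.

```python
import random
from collections import Counter
from itertools import count


def wright_fisher(pop_size, generations, mu, seed=0):
    """Constant-size Wright-Fisher population, infinitely many alleles."""
    rng = random.Random(seed)
    fresh = count(1)        # supply of brand new haplotype labels
    pop = [0] * pop_size    # start from a single haplotype
    for _ in range(generations):
        # Each child copies a random parent (sampling with replacement) ...
        pop = rng.choices(pop, k=pop_size)
        # ... and mutates to a never-seen type with probability mu.
        pop = [next(fresh) if rng.random() < mu else h for h in pop]
    return pop, rng


def assess_sample(pop, n, rng):
    """Draw an n-sample; return kappa, the kappa-model matching
    probability (1 - kappa)/n, and the mean true population frequency of
    the sample's singletons (None if the sample has no singletons)."""
    pop_counts = Counter(pop)  # "peeking" into the population
    sample_counts = Counter(rng.sample(pop, n))
    singletons = [h for h, c in sample_counts.items() if c == 1]
    kappa = len(singletons) / n
    y_model = (1.0 - kappa) / n
    if not singletons:
        return kappa, y_model, None
    mean_f = sum(pop_counts[h] for h in singletons) / (len(singletons) * len(pop))
    return kappa, y_model, mean_f


pop, rng = wright_fisher(pop_size=2000, generations=300, mu=0.01)
print(assess_sample(pop, n=100, rng=rng))
```

Comparing the second and third returned values over many samples is the kind of check described above; a single sample, as printed here, is only one noisy data point.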
A matching probability formula is valid to the extent that it has the same expected value (or in some contexts a larger, "conservative" value is valid) as the expected value of f_{S}. The appropriate interpretation of the term "expected" is
Hence the formula is valid if, for a realistic population model Ω and sample size n, the probability it gives is a good estimate of the expected population frequency for singletons in the sample.
mtDNA haplotypes or a Y-haplotype system consisting of a large number of STR loci are logically equivalent to a locus with a very high mutation rate, a large number of possible mutational changes and a very large number of potential alleles. As such, it is plausible that these haplotype systems might approximate the infinitely many selectively neutral alleles model [Kimura]. Slatkin's test [Slatkin, 1994] is a Monte Carlo procedure which evaluates a population sample by comparing the distribution of very rare and not-so-rare traits with model expectations. The model predicts a proliferation of rare types (since many are created and many go extinct at each generation), fewer slightly more common types, etc. If the population sample spectrum is relatively unexpected, that is evidence against the model. For the simulated populations which are stabilized, the Slatkin test is used to decide when equilibrium has been reached.
To assess the behavior of the formula for realistic frequency spectra, populations were simulated by computer under various evolutionary models and numerous samples ("databases") drawn from each one (§III.C.2). In this section the population simulations are used as a proving ground to evaluate the κ model.
Following are a few preliminary observations based on examination of the simulations.
All else being equal,
For a given population Ω_{m} and fixed sample size n, there are samples D(m,n,j)=D_{j} each with some fraction κ(j) of singletons. For each such singleton S_{i}∈D_{j} there is a population frequency f_{Si}. If (6) is a good formula then on average the matching LR=n/(1−κ(j))≈1/f_{Si}, i.e. the inflation factor 1/(1−κ(j))≈1/(nf_{Si}). It turns out, a little surprisingly, that the average value of the population frequencies is independent of κ(j): if D_{1}, by accident of sampling, has more singletons than D_{2}, nonetheless the singletons in those two sets are on average equally rare. Therefore we can estimate the expected population frequency for a singleton by the average f of all the f_{Si} taken over i and j, and correspondingly define the ideal or effective kappa κ_{e}(m, n) as that value of κ_{e} which satisfies 1/(1−κ_{e})=1/(nf).
The test of the κ model (6) then consists in comparing the average of the calculated inflation factor values 1/(1−κ(j)) with the ideal value 1/(1−κ_{e}). Compute κ such that 1/(1−κ)=Average(1/(1−κ(j))), where the average is taken in the sense of the geometric mean. Then
relative κ error = (1−κ_{e})/(1−κ) − 1.
Error>0 means the κ model exaggerates the strength of the evidence, error<0 means it understates.
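The error statistic can be written out as a short function. This is a sketch under stated assumptions: `relative_kappa_error` is a hypothetical name, the inputs are taken to be the per-sample κ(j) values and the pooled true singleton frequencies, and the cap of each inflation factor at n follows the treatment of κ=1 described in the text.

```python
import math


def relative_kappa_error(kappas, singleton_freqs, n):
    """Relative kappa error = (1 - kappa_e) / (1 - kappa) - 1.

    kappas          -- singleton fraction kappa(j) of each n-sample
    singleton_freqs -- true population frequencies of the singletons,
                       pooled over all samples
    n               -- common sample size
    """
    # Inflation factor per sample, limited to at most n (covers kappa = 1).
    inflations = [n if k >= 1.0 else min(n, 1.0 / (1.0 - k)) for k in kappas]
    # Geometric mean of the inflation factors gives 1 / (1 - kappa).
    model = math.exp(sum(math.log(x) for x in inflations) / len(inflations))
    # Ideal inflation factor 1 / (1 - kappa_e) = 1 / (n * mean frequency).
    f_bar = sum(singleton_freqs) / len(singleton_freqs)
    ideal = 1.0 / (n * f_bar)
    return model / ideal - 1.0


# Singletons exactly as rare as the model implies: error is 0.
print(relative_kappa_error([0.5, 0.5], [0.005, 0.005], n=100))
```

A positive return value means the model's average inflation factor exceeds the ideal one, that is, the evidence is overstated.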
Figure 3 Relative overstatement (if >0) of recommended LR formula (6) compared to "ideal" LR if the average population frequencies were known. Assessment of the performance of the formula under 27 (=3x3x3) models corresponding to various growth rates, N and μ per §III.C.2. Each bar represents an average computed over 100 samples drawn from one of the 27 simulated populations. Bars with a negative error indicate conservative performance. Omission of sample size n=100 and sporadic very negative errors when n=300 are discussed in the text. 
Figure 3 shows the suitability of the κ model for the 27 simulated populations which systematically investigate the three population parameters mutation rate, growth rate, and population size, as well as sample size. The effects of these parameters are
Those anomalous negative errors are artifactual. The artifact comes about when κ=1 for a significant proportion of the 100 samples of size n and consequently, to avoid infinite inflation factors (see §III.A.1), the inflation factor is artificially limited to be at most n (which corresponds to α=n−1). This procedure happens to be conservative for the population models examined, as seen by the sporadic very negative bars for n=300 such as for the models with N=300,000 and fast (17%/generation) mutation. For the same reason most of the statistics for n=100 are meaningless and are therefore omitted from the graphs.
Those anomalous data points excepted, the graphs visually exhibit a systematic and nearly regular pattern of change in relative error with change in one or another model parameter. Small irregularities therefore indicate the sampling variation that one would see under repeated simulation of the same population model.
There are various qualities that are likely in a real population that were not modeled. The story of the Genghis Khan Y-haplotype [Zerjal] supports the plausible thesis that some Y chromosomes – or mitochondria – may be associated with superior fitness. If some haplotypes are preferentially selected to reproduce, as is likely through hitchhiking, the consequent excess of multiply-occurring types would supply a safety margin, i.e. tend to make the κ model more conservative.
The sample popularity p of an allele is the number of times it is observed in a particular population sample. We are sometimes interested in the matching probability
y_{p} = Pr(innocent suspect matches crime scene type S | S observed p times in D)
for p>1. Remember D has been augmented by the crime scene type; y_{p} generalizes y=y_{1}. From preliminary analysis of the simulated populations Ω, the simple rule
y_{p} ≈ py = p(1−κ)/n.  (9) 
is, empirically, about right for small p. It is intuitively implausible for large p (for common traits sample frequency must be a good estimate for population frequency, meaning that y_{p}≈p/n; i.e. the inflation factor collapses to 1 for large enough p), but absent selection and rapid population growth all types are rare. That's a fact which contradicts some people's intuition (my own for a long time, for example, and [Veldman]) and the impression from older data ("minimal" Y-haplotypes of 7 loci). It is predicted by the neutral theory that there should be no common alleles [Ewens, 1972]. Of course, the Y-chromosome as a whole is not neutral. No doubt it includes genes with considerable fitness variation to which the loci of the identification haplotype are hitchhikers par excellence [Zerjal]. The association is never broken by recombination, but nonetheless because of the high mutation rates for the haplotype systems here under consideration common types are not to be expected and are not found.
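For concreteness, rule (9) can be tabulated against the plain sample frequency p/n. The sketch below is illustrative only; the function name `y_p` and the values n=1000 and κ=0.9 are assumptions chosen for the example, and the rule itself is offered in the text only as roughly right for small p.

```python
def y_p(p, n, kappa):
    """Rule (9): matching probability for a type observed p times in the
    extended sample D of size n, approximately p * (1 - kappa) / n."""
    if not 1 <= p <= n:
        raise ValueError("need 1 <= p <= n")
    return p * (1.0 - kappa) / n


n, kappa = 1000, 0.9  # hypothetical sample size and singleton fraction
for p in (1, 2, 3):
    # Columns: p, rule (9), sample frequency p/n
    print(p, y_p(p, n, kappa), p / n)
```

With κ=0.9 each value from rule (9) is one tenth of the corresponding sample frequency, which is the inflation factor of 10 carried over unchanged from the p=1 case.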
We have (5):
y = Pr(innocent suspect matches crime scene type S | S not observed in D^{}) = (1−κ)/n
and would like to estimate
q = Pr(innocent suspect matches allelic type S)
in general, without regard to the specific type S. Therefore 1/q=m_{e}, the effective number of types (§V.B.3).
The most straightforward way to estimate m_{e} is empirical: count the pairwise matches in a population sample – formula (13). Alternatively there are several theoretical approaches. Assuming the infinitely many neutral alleles model, m_{e}≈nκ/(1−κ) (formula (18), Appendix "Value of θ"). Therefore q≈y/κ. Note that the empirical formula uses all the α_{p} but the theoretical formula seems to use only α. If they tend to agree it must be because the infinite alleles model implies a relationship among the α_{p} and the infinite alleles model approximately holds. Another way is to assume a relationship among the α_{p}; suppose (9). Then q=1/m_{e} can be written as a weighted average of the y_{p}. Appendix §V.B.4 carries out this calculation and the consequence, (15), is 1/y ≈ m_{e}+n−1. This formula has a simple intuitive meaning. Suppose that somehow – by magic or from memory – we know m_{e} but lack any reference sample D^{}. Then the evidentiary significance of a suspect type T matching the crime scene type S is the matching likelihood ratio 1/y=m_{e} which indeed corresponds to formula (15) with an empty database D^{}, whose size n−1=0. If we then sequentially observe reference types each of which does not match S, the effect of each observation is to increment the likelihood ratio 1/y by one.
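The two routes to m_{e} can be compared on the small sample of Figure 1. This is a sketch only: formula (13) is not reproduced in this section, so the empirical estimate below assumes it amounts to the fraction of matching pairs among all pairs in the sample; both function names are hypothetical.

```python
import math
from collections import Counter


def effective_types_empirical(sample):
    """m_e = 1/q, with q estimated as (matching pairs) / (all pairs).
    Assumed reading of the pairwise-match count, formula (13)."""
    n = len(sample)
    matching = sum(c * (c - 1) // 2 for c in Counter(sample).values())
    pairs = n * (n - 1) // 2
    return math.inf if matching == 0 else pairs / matching


def effective_types_theoretical(n, kappa):
    """m_e ~= n * kappa / (1 - kappa) under infinitely many neutral alleles."""
    return n * kappa / (1.0 - kappa)


# Figure 1 sample, types labelled arbitrarily: one type seen 4 times,
# one seen twice, four singletons.
sample = ["a"] * 4 + ["b"] * 2 + ["c", "d", "e", "f"]
print(effective_types_empirical(sample))     # 45 / 7, about 6.4
print(effective_types_theoretical(10, 0.4))  # about 6.7
```

That the two values land close together for this toy sample should not be over-read; with n=10 either estimate is very noisy.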
Other methods for presenting rare haplotype evidence have been suggested.
"Haplotype surveying" is an idea proposed in [Roewer] to estimate evidentiary significance for Y-haplotypes. In contrast to my approach, the name (sequence) of a haplotype is assumed to contain information. The model assumption is that haplotypes which are near neighbors in mutational steps are likely to have similar population frequencies. A common allele begets numerous copies of its neighbors at every generation, so the neighbors are also common. However for haplotype systems nearly all types are rare and certainly the types of interest are the rare ones. The frequencies of rare haplotypes are relatively far more affected by drift than by mutation because the effect of drift – essentially sampling variation – is to change the number of instances c of a trait by something like √c, which is a large percentage change when c is small. Therefore a haplotype surveying approach does not seem founded on a plausible model. It may be roughly equivalent to guessing y=1/m_{e}±random variation.
If the population frequency of a trait is more than 3/n, then the probability that it will be unobserved in a sample of size n is less than about 5%. Sometimes this inference is illogically inverted to claim that when the sample frequency is 0/n the population frequency is 95% likely to be in the range [0/n, 3/n]. Treating 3/n as the matching probability amounts to offering LR_{F}=n/3 as the matching likelihood ratio.
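The 5% figure and the resulting likelihood ratio are easy to check numerically. The following sketch uses hypothetical function names and illustrative values (n=1000, κ=0.8); it assumes independent draws and sets the 3/n value beside the κ model of formula (6).

```python
import math


def prob_unobserved(freq, n):
    """Chance that a type of population frequency freq is absent from a
    sample of n independent draws."""
    return (1.0 - freq) ** n


def lr_three_over_n(n):
    """LR implied by treating 3/n as the matching probability."""
    return n / 3.0


def lr_kappa(n, kappa):
    """LR from the kappa model, formula (6)."""
    return n / (1.0 - kappa)


n = 1000
print(prob_unobserved(3 / n, n))  # about 0.0496
print(math.exp(-3))               # limiting value, about 0.0498
print(lr_kappa(n, 0.8) / lr_three_over_n(n))  # ratio of the two LRs for kappa = 0.8
```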
Compared to the κ model (6), this FBI method understates the evidence by a factor of LR_{κ}/LR_{F}=3/(1−κ) or roughly from 7.5 to 30 or more for current systems and data. It has no logical underpinning – it doesn't correspond to any model. If a very conservative approach is deemed desirable, then formula (1) has the advantage of being logical and simple, and incidentally squanders three times less evidence than the FBI method.
"Probability" means the long-run success rate of some conceptually repeatable experiment ("trial"). The starting conditions are the same for every trial, and consist of fixing whatever information is known to the experimenter. Probability is therefore a summary of the data that is available to the experimenter. Population frequency on the other hand is unknown so certainly cannot be the probability of anything.
Traditionally the forensic community answers the matching probability question by consulting a population sample, typically of several hundred or thousand people. The sample frequency is then taken as an estimate for population frequency which in turn is used as a surrogate for probability. As a result it has become habit in the forensic community to conflate frequency and probability. There is an institutionalized misconception that population frequency is matching probability. For common traits such as an allele at an STR locus the frequency approximation is simple and is reasonable enough. But frequency isn't the goal, probability is, and if matching probability can be decided without the distraction and detour of considering population frequency, so much the better – especially for rare traits for which population frequency can't be accurately estimated.
Reviewing the logic for step C (4) of the κ model derivation is the clearest way to see that population frequency is irrelevant to the matching probability question. We assume, at point C in the logic, that the innocent suspect has type T and that T coincidentally matches one of the α singletons S_{0}, ..., S_{i}, ..., S_{α−1}. Which one? The available data is identical for the various S_{i}. They all come from crimes or circumstances with which the suspect is equally unconnected, and they all have been observed only once. Given our stipulation that names don't matter the subscripts could as well be permuted at random. Can it be more obvious that the probability for T to match each of the S_{i} is the same – therefore 1/α? Of course the singletons have different population frequencies but that, and confidence intervals, would only be relevant if we planned to bet on what the population frequency of S_{0} is, which we do not. We plan to bet on T matching S_{0} (given that T matches some singleton), and the chance to win that bet is exactly 1/α, neither more nor less.
The fundamental question to decide the evidentiary significance of a trait linking suspect to crime is not one of frequency but of probability: What is the probability for such a match to happen by coincidence when the suspect is innocent? Of JS Mill's description that probability of an event "means the degree of expectation of its occurrence, which we are warranted in entertaining by our present evidence" [Mill] the last two words are worth particular emphasis. It is not relevant what our probability estimate would be if we had different population data than we have (which is the motive for confidence intervals), any more than it makes sense to speculate about alleles in loci that have not been tested. The evidentiary strength of the match is a summary of present evidence.
A second key insight is that from the perspective of an innocent suspect (the key perspective), the crime scene trait as a datum stands on equal footing with the reference types in the "database" or population sample. Hence it should be considered as part of the sample. In the language of statistics, you must condition on the crime scene observation.
Third is the fact that, contrary to ordinary experience, sample frequency may not be a good indicator of matching probability (or of population frequency for that matter). In particular, it's necessary to consider the entire sample, not just the one matching trait, and if the sample has predominately unique types then sample frequency overstates matching probability – the "inflation factor".
Figure 4. Comparison of the matching LR from four different formulas, shown as inflation factors.
Four versions of matching likelihood ratio for matching to a previously unseen trait have been developed in this paper, three of which may be useful depending on circumstance and practicalities. The best estimate is the κ model (6), that LR_{κ}=n/(1−κ). The counting formula (1) – LR_{c}=n – is very robust and even easier to explain because it doesn't take advantage of the inflation factor, and can be used if the highly conservative value it supplies is sufficient for a particular forensic case. An intermediate possibility, robust and slightly conservative, would be to use LR=m_{e} (12) for new haplotypes. It is certainly the right number to use absent knowing p, which can happen. For large κ it is only a little smaller than LR_{κ} (Figure 4) and may be easier to explain. Finally, although I can't justify the t model (Appendix §V.B.1) mathematically, it is included in the graph for interest and comparison.
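As a rough numerical illustration of the comparison, the sketch below tabulates the inflation factor LR/n for each of the four formulas as a function of κ. It assumes the appendix approximations m_{e} ≈ nκ/(1−κ) and t ≈ −n/ln κ; the function name `inflation_factors` is hypothetical and the κ values are arbitrary examples, not data from the paper.

```python
import math

def inflation_factors(kappa):
    """Return LR/n for each formula, given the singleton proportion kappa."""
    if not 0.0 < kappa < 1.0:
        raise ValueError("kappa must lie strictly between 0 and 1")
    return {
        "counting": 1.0,                    # LR_c = n
        "kappa": 1.0 / (1.0 - kappa),       # LR_kappa = n/(1 - kappa)
        "m_e": kappa / (1.0 - kappa),       # assumes m_e ~ n*kappa/(1 - kappa)
        "t": -1.0 / math.log(kappa),        # assumes t ~ -n/ln(kappa)
    }

# Arbitrary example values of kappa
for kappa in (0.5, 0.8, 0.9, 0.95):
    factors = inflation_factors(kappa)
    print(kappa, {name: round(value, 2) for name, value in factors.items()})
```

Under these assumptions the m_{e} and t values always fall below the κ model value, and all three approach one another as κ approaches 1.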
For the less frequent situation p>1, (9) is an idea but not a recommendation. All we can say for now is that n/p < LR < n/[(1−κ)p].
Finally, it is worth noting that the κ model analysis presented here applies perfectly well to traditional STR systems. If a handful of common alleles make up 99% of the allele population, then κ is near zero so take the inflation factor as 1 and the κ model degenerates to the counting model. For an allele observed p−1 times in a reference database D^{} of size n−1, the matching probability is therefore p/n.
S_{i}  a once-observed haplotype. S_{0} is the haplotype from the crime scene
T  haplotype of an innocent suspect
y  probability that T matches S_{0}, given that S_{0} has been observed once
y_{p}  probability that T matches a p-times-observed type. (y_{1}=y)
N  population size
Ω  denotes a population of haplotypes. Ω_{1}, ..., Ω_{m}, ... are various simulated populations.
Ω^{t}  population of t equally rare traits
Ω^{z,t}  population of a common trait of frequency z and rare traits of frequency 1/t
n  sample size
D  a sample drawn from some Ω. D^{} is a sample before including the crime stain.
p  popularity of an allele – number of times it occurs in a sample
α  number of once-observed haplotypes (singletons) in a sample
α_{1}, α_{2}, ..., α_{p}, ...  the number of singletons, doubletons, ..., haplotypes of popularity p, in a sample. (α_{1}=α)
κ  sample proportion α/n of singletons in a sample
f_{S}  population frequency of type S
f_{=1}, f_{>1}  total population proportion of the haplotypes that are singletons, non-singletons, in some sample
N_{e}  effective population size, a population parameter
q  probability two randomly selected types match (absent knowing p)
m_{e}  effective number of types – reciprocal of the probability two randomly selected types match
μ  mutation rate per generation for a haplotype system
Suppose each individual in a certain population Ω^{t} possesses one or another of t distinguishable and equally rare traits. Select a random sample D of n individuals. Let κ be the proportion of singletons, that is, of traits that occur exactly once in D. The expected value of κ, E(κ), is given by
E(κ) = (1−1/t)^{n−1}, or E(κ) ≈ e^{−(n−1)/t} if t is not small.
Proof: Let P(i) be the probability that the i-th element in the sample is a singleton. It is a singleton exactly when each of the other n−1 elements differs from it, which happens with probability 1−1/t for each, so
P(i) = (1−1/t)^{n−1}.
Over a large number of samples of size n, P(i) is also the expected number of singletons at the i-th element, per sample. These expectations are additive (regardless that the events being summed are not independent), so
E(α) = Σ_{i=1, 2, ..., n} P(i) = nP(i), and
E(κ) = E(α/n) = E(α)/n = P(i) = (1−1/t)^{n−1}  (10)
as claimed. Using the fact that (1−1/t)^{t}→e^{−1} (e=2.71828...) as t→∞, (1−1/t)^{n−1} = [(1−1/t)^{t}]^{(n−1)/t} ≈ e^{−(n−1)/t}, so from (10),
E(κ) ≈ e^{−(n−1)/t}  (11)
Q.E.D.
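The result can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original analysis: it draws repeated samples of size n from t equally likely types and compares the average singleton proportion with (10) and (11). The function names, the values n=50 and t=200, the number of trials and the seed are all illustrative assumptions.

```python
import math
import random
from collections import Counter

def expected_kappa(n, t):
    """Exact E(kappa) from (10) for n draws from t equally rare types."""
    return (1.0 - 1.0 / t) ** (n - 1)

def simulated_kappa(n, t, trials=2000, seed=1):
    """Average singleton proportion over `trials` simulated samples."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0.0
    for _ in range(trials):
        counts = Counter(rng.randrange(t) for _ in range(n))
        singletons = sum(1 for c in counts.values() if c == 1)
        total += singletons / n
    return total / trials

n, t = 50, 200  # illustrative values only
print("exact (10):        ", expected_kappa(n, t))
print("approximation (11):", math.exp(-(n - 1) / t))
print("simulated:         ", simulated_kappa(n, t))
```

The three printed values should agree closely; the simulated one fluctuates with the seed and the number of trials.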
The value of κ can be observed in a sample. If we assume that the observed value is close to the expected value, then t can be estimated from formula (11).
κ ≈ E(κ) ≈ e^{−(n−1)/t} ≈ e^{−n/t}, whence  
t ≈ −n/ln κ.  (12) 
Suppose that the true state of nature is Ω^{t}. Then the correct likelihood ratio for a match would be LR_{t} = t.
Perhaps from ignorance of the true state of nature we instead estimate the likelihood ratio using the κ model (6). What error would we commit?
LR_{κ}/LR_{t} = n/[(1−κ)t]
≈ −(ln κ)/(1−κ)  (by (12))
= [(1−κ) + (1−κ)^{2}/2 + ...]/(1−κ)  (expanding ln κ in Taylor series around κ=1),
i.e.
LR_{κ} ≈ LR_{t}[1 + (1−κ)/2].
Although the above is well short of rigorous it does seem in the domain of interest – κ close to 1 – that LR_{κ} is not far different from LR_{t}. See Figure 4. Hence even in the pathological t model, where the assumption (7) underlying the κ model is violated to the maximum extent possible, the formula (6) is nonetheless not far off.
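To get a feel for the size of the discrepancy, the sketch below evaluates the ratio −(ln κ)/(1−κ) and its first-order approximation 1+(1−κ)/2 for a few values of κ. The function names and the κ values are illustrative assumptions, and the ratio itself rests on the estimate t ≈ −n/ln κ from (12).

```python
import math

def lr_ratio(kappa):
    """LR_kappa / LR_t, assuming t ~ -n/ln(kappa) as in (12)."""
    return -math.log(kappa) / (1.0 - kappa)

def lr_ratio_first_order(kappa):
    """First two terms of the Taylor expansion around kappa = 1."""
    return 1.0 + (1.0 - kappa) / 2.0

# Arbitrary example values of kappa in the domain of interest
for kappa in (0.7, 0.8, 0.9, 0.95):
    print(kappa, round(lr_ratio(kappa), 4), round(lr_ratio_first_order(kappa), 4))
```

Because the omitted Taylor terms are all positive, the first-order value slightly understates the ratio.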
The effective number of types, m_{e}, is the likelihood ratio supporting identity when two randomly selected types match – i.e. it is the number of types which, if they were equally frequent, would provide diversity equal to the actual diversity.
The effective number of types can easily be estimated directly from the population sample D. Simply count the number of ways that a matching pair can be drawn from the sample and compare it to the number C^{n}_{2} of ways to draw any pair. A matching pair is obtained only when both members have the same popularity p and belong to the same one of the α_{p} identity cohort groups of that popularity. Hence (summation is always over p)
m_{e} = C^{n}_{2}/[Σ α_{p}C^{p}_{2}].  (13)
Thus 1/m_{e} is the probability for the type T of an innocent suspect to match a database type S absent knowing the sample popularity of S.
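The pair-counting estimate (13) is straightforward to compute. Here is a minimal sketch, assuming the sample is given as a list of haplotype labels; the function names and the toy sample are hypothetical. A brute-force enumeration of all pairs is included only as a cross-check of the counting argument.

```python
from collections import Counter
from itertools import combinations
from math import comb

def effective_number_of_types(sample):
    """Estimate m_e as C(n,2) divided by the number of matching pairs, per (13)."""
    n = len(sample)
    # Summing C(p,2) over distinct types equals summing alpha_p * C(p,2) over p
    matching_pairs = sum(comb(p, 2) for p in Counter(sample).values())
    if matching_pairs == 0:
        return float("inf")  # no repeated type: the sample gives no finite estimate
    return comb(n, 2) / matching_pairs

def match_fraction_brute_force(sample):
    """Fraction of all unordered pairs that match, by direct enumeration."""
    pairs = list(combinations(sample, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

# Hypothetical toy sample: one tripleton, one doubleton, three singletons
sample = ["a", "a", "a", "b", "b", "c", "d", "e"]
print(effective_number_of_types(sample))
print(1.0 / match_fraction_brute_force(sample))
```

Both lines print the same value, since 1/m_{e} is by construction the fraction of pairs that match.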
When the sample popularity of S is given as p, then the matching probability is some different amount y_{p}, p≥1, where 1/m_{e} is a weighted average of the y_{p}, bigger than y_{1}=y and smaller than y_{p} for large p. Specifically, there are pα_{p} objects S in D of popularity p, each of which has probability y_{p} to match T, so
1/m_{e} = Σ y_{p}pα_{p}/n.  (14)
Assume the condition y_{p}≈py of equal overrepresentation (9). Then
Σ y_{p}pα_{p}/n ≈ yΣ p^{2}α_{p}/n
= yΣ(pα_{p} + 2C^{p}_{2}α_{p})/n  (now use n=Σpα_{p} and (13))
= y[n + 2C^{n}_{2}/m_{e}]/n
= y(1 + (n−1)/m_{e}).
Substituting this into (14) gives the elegant result
1/y ≈ m_{e} + n − 1.  (15)
Alternatively, (15) follows from the neutral theory (§V.C). The neutral theory implies that β(0,θ) is the prior probability distribution for haplotype frequencies ([Ewens, 2003], page 116), from which (15) follows ([Dawid] "Use of databases") by standard properties of the beta function.
Apparently the neutral model and equal overrepresentation are related assumptions.
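The algebra leading to (15) can be traced on a concrete sample. The sketch below, under the same equal-overrepresentation assumption y_{p}≈py, tabulates the α_{p}, checks the intermediate identity Σp^{2}α_{p} = n + 2C^{n}_{2}/m_{e}, and returns the resulting y. The function name and the toy sample are hypothetical.

```python
from collections import Counter
from math import comb

def singleton_match_probability(sample):
    """Return (m_e, y), with y taken from 1/y ~ m_e + n - 1, formula (15)."""
    n = len(sample)
    # alpha[p] = number of distinct haplotypes seen exactly p times
    alpha = Counter(Counter(sample).values())
    matching_pairs = sum(a * comb(p, 2) for p, a in alpha.items())
    if matching_pairs == 0:
        raise ValueError("no repeated types: m_e cannot be estimated from (13)")
    m_e = comb(n, 2) / matching_pairs
    # Intermediate step of the derivation: sum p^2 alpha_p = n + 2*C(n,2)/m_e
    lhs = sum(p * p * a for p, a in alpha.items())
    rhs = n + 2 * comb(n, 2) / m_e
    assert abs(lhs - rhs) < 1e-9
    return m_e, 1.0 / (m_e + n - 1)

# Hypothetical toy sample: one doubleton and three singletons
m_e, y = singleton_match_probability(["a", "a", "b", "c", "d"])
print(m_e, y)
```

A sample this small is only meant to make the bookkeeping visible; it says nothing about realistic values of y.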
Assume the infinitely many neutral alleles model [Kimura] or "neutral theory" for short.
A classical formula from population genetic theory is m_{e}≈1+2N_{e}μ (for haploid populations), N_{e} being the effective population size – the size of an ideal equilibrium population with equivalent m_{e}, so roughly N_{e}≈m_{e}/(2μ). This formula is the motivation for the example population sizes chosen for simulations. For example, from the ABI U.S. Caucasian Yfiler database, m_{e}≈8883 (Table 1) and μ≈17/300, hence N_{e}≈80,000.
Under the neutral theory, [Ewens, 2003] (p. 94) gives
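The arithmetic behind the quoted figure can be reproduced in a couple of lines. This sketch uses the rough form N_{e}≈m_{e}/(2μ) with the values stated in the text; the function name is hypothetical.

```python
def effective_population_size(m_e, mu):
    """Rough N_e from m_e ~ 1 + 2*N_e*mu, dropping the 1 because m_e >> 1."""
    return m_e / (2.0 * mu)

m_e = 8883          # value quoted in the text (Table 1)
mu = 17 / 300       # combined per-generation mutation rate quoted in the text
print(round(effective_population_size(m_e, mu)))
```

The result is a little under 80,000, which is the rounded value quoted in the text.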
Pr(n^{th} sampled allele is previously unobserved) = θ/(θ+n),
where we can define θ = m_{e} −1 ≈ 2N_{e}μ. Equating with Robbins' result, we have
κ ≈ θ/(θ+n) ≈ 1/[1 + n/(2N_{e}μ)].  (16)
Hence κ will be smaller for larger samples, and larger for larger populations or larger mutation rates. Note that the mutation rate μ for a Y-haplotype is the combined mutation rate for all of the included loci, hence μ≈17/300 per generation for the 17-locus Yfiler haplotype.
We can solve (16) for θ:
θ  ≈ nκ/(1−κ),  (17) 
Since for rare haplotypes m_{e} >>1, we can say m_{e} ≈θ, so for situations of interest for this paper
m_{e}  ≈ nκ/(1−κ).  (18) 
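Relations (16) and (17) are inverses of one another, which the following sketch checks numerically. The function names are hypothetical and the sample size n=1000 is an arbitrary assumption, not the size of any database mentioned in the text.

```python
def kappa_from_theta(theta, n):
    """Expected singleton proportion under the neutral theory, formula (16)."""
    return theta / (theta + n)

def theta_from_kappa(kappa, n):
    """Invert (16) to recover theta from an observed kappa, formula (17)."""
    if not 0.0 <= kappa < 1.0:
        raise ValueError("kappa must satisfy 0 <= kappa < 1")
    return n * kappa / (1.0 - kappa)

n = 1000                 # arbitrary illustrative sample size
theta = 8883 - 1         # theta = m_e - 1, using the m_e quoted in the text
kappa = kappa_from_theta(theta, n)
print(kappa, theta_from_kappa(kappa, n))
# With theta given by (17), theta + n equals n/(1 - kappa)
print(theta + n, n / (1.0 - kappa))
```

The last line illustrates why the two routes agree: with θ = m_{e}−1, the quantity m_{e}+n−1 in (15) is θ+n, which equals n/(1−κ) when θ is given by (17).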
I claim that the condition (3), Pr(Singleton | Observed)=κ, is equivalent to the condition of equal overrepresentation (7) that κ/f_{=1} = (1−κ)/f_{>1}. These claims are relative to a fixed sample D and a haplotype T randomly selected from the general population. By way of proof, note that f_{=1}=Pr(Singleton)=Pr(Singleton & Observed) while f_{>1}=Pr(~Singleton & Observed). So (7) is
κ/Pr(Singleton & Observed) = (1−κ)/Pr(~Singleton & Observed)
or  κ/[Pr(Observed)Pr(Singleton | Observed)] = (1−κ)/[Pr(Observed)Pr(~Singleton | Observed)]
or  κ/Pr(Singleton | Observed) = (1−κ)/[1−Pr(Singleton | Observed)],
which is true if and only if (3), Q.E.D.
I am indebted to Terry Speed, whose broad hint in pointing me to several key references [Robbins; Ewens, 1972; Dawid] years ago suggests that he anticipated the main idea of this paper. Thanks to Tom Parsons, Monty Slatkin, Keiji Tamaki, Jim Crow, Bruce Weir and Steve Lee for encouragement and discussions. This work was supported in part by the DNA·VIEW User's Group.
Prinz, M., Advantages and Disadvantages of Y-Short Tandem Repeat Testing in Forensic Casework. Forensic Sci Rev, 2003. 15: p. 189-196.
Mayntz-Press, K.A., et al., Y-STR Profiling in Extended Interval (> or =3 days) Postcoital Cervicovaginal Samples. J For Sci, 2008. 53(2): p. 342-348.
Malsom, S., et al., The prevalence of mixed DNA profiles in fingernail samples taken from couples who cohabit using autosomal and Y-STRs. For Sci Int: Genetics, 2009. 3: p. 57-62.
Kayser, M., et al., Evaluation of Y-chromosomal STR's: a multicenter study. Int J Leg Med, 1997. 110: p. 125-133.
Jobling, M., A. Pandya, and C. Tyler-Smith, The Y chromosome in forensic analysis and paternity testing. Int J Legal Med, 1997. 110: p. 118-124.
Ivanov, P., et al., Mitochondrial DNA sequence heteroplasmy in the Grand Duke of Russia Georgij Romanov establishes the authenticity of the remains of Tsar Nicholas II. Nature Genetics, 1996. 12: p. 417-420.
Keyser-Tracqui, C., E. Crubézy, and B. Ludes, Nuclear and Mitochondrial DNA Analysis of a 2,000-Year-Old Necropolis in the Egyin Gol Valley of Mongolia. Am J Hum Genet, 2003. 73: p. 247-260.
Roewer, L., et al., A new method for the evaluation of matches in non-recombining genomes: Application to Y-chromosomal short tandem repeat (STR) haplotypes in European males. Forensic Science International, 2000. 114: p. 31-43.
Robbins, H., Estimating the total probability of the unobserved outcomes of an experiment. The Annals of Mathematical Statistics, 1968.
Morton, N., Genetic structure of forensic populations. Proc Natl Acad Sci USA, 1992. 89: p. 2556-2560.
Applied_Biosystems, Yfiler haplotype database.
Crow, J., Twenty-five Years Ago in Genetics: The Infinite Allele Model. Genetics, 1989. 121: p. 631-634.
Slatkin, M. and G. Bertorelle, The use of intra-allelic variability for testing neutrality and estimating population growth rate. Genetics, 2001. 158(2): p. 865-74.
Kimura, M. and J. Crow, The number of alleles that can be maintained in a finite population. Genetics, 1964. 49: p. 725-738.
Slatkin, M., An exact test for neutrality based on the Ewens sampling distribution. Genetical Research, 1994. 64(1): p. 71-74.
Zerjal, T., et al., The genetic legacy of the Mongols. Am J Hum Genet, 2003. 72(3): p. 717-721.
Veldman, A., Evidential strength of Y-STR haplotype matches in forensic DNA casework. 2007, Universiteit Leiden.
Ewens, W., The sampling theory of selectively neutral alleles. Theoretical Population Biology, 1972. 3: p. 87-112.
Mill, J., A System of Logic: Ratiocinative and Inductive; Being a Connected View of the Principles of Evidence and the Methods of Scientific Investigation. Vol. 2, Ch. 18, p. 62. 1868, London: Longmans, Green, Reader, and Dyer.
Ewens, W., Mathematical Population Genetics: I. Theoretical introduction, 2nd edition. 2003: Springer.
Dawid, A. and J. Mortera, Coherent Analysis of Forensic Identification Evidence. J R Statist Soc B, 1996. 58(2): p. 425-443.