The Power of SNP's Even Without Population Data
Poster presentation at the 10th Promega Symposium on Human Identification,
Orlando, Florida, 29 September 1999
CH Brenner, Consulting in forensic mathematics,
1999: Department of Genetics, Univ of Leicester, UK
2000: Berkeley, California, USA
Feel free to link to this page, but please do not reproduce this material
without permission of the author.
How many SNP's equal one STR? one RFLP?
Single nucleotide polymorphisms (SNP's) have the potential to be just
as discriminating as loci of high polymorphism such as VNTR-RFLP or
MVR systems; you just need more of them. This is obviously true for
forensic identification. It is also true for more complex problems
such as deciding sibship or drawing inferences from mixed stains.
So there is a tradeoff possible:
- a handful of highly polymorphic systems, or
- a larger number of less-polymorphic systems.
What is the rate of tradeoff? How many SNP's per STR? per VNTR-RFLP?
The rate of tradeoff is not simple.
- The tradeoff is different for different kinds of
casework, i.e. stain matching, paternity, mixed stain, kinship.
- The tradeoff is sometimes different for
proving than for disproving.
- The tradeoff depends somewhat on one's
choice of criteria. (The "typical likelihood ratio" is not always
a possible choice.)
SNP's can be valuable in casework even
without population data.
Methods and ground rules
The method of analysis is exact calculation, using an idealized model
of loci, to wit:
- Every allele at a locus is assumed to be equally frequent.
- A "k-locus" means a locus with k equally-frequent alleles, so
  STR is roughly equivalent to a 4-locus or a
  VNTR-RFLP is modeled by 12<k<100 roughly. k=40
Likelihood ratios are used of course for
the analysis. The "typical" likelihood ratio
is defined as a geometric mean value, i.e. not the arithmetic
mean (which is not sensible) but an average taken in the
logarithmic domain.
From the graph (triangle heights) we have tradeoff rates
SNP : STR : VNTR-RFLP
  1 : 2.6 : 6.4
for the stain matching ("forensic") problem. It takes 6.4 SNP's
(or 6.4/2.6=2.5 STR's) to equal the power of one VNTR-RFLP.
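The forensic ratios can be approximately reproduced from the idealized k-locus model. The sketch below is a reconstruction, not necessarily the author's exact calculation: it assumes Hardy-Weinberg genotype proportions and takes the stain-matching likelihood ratio to be 1/(genotype frequency). The function names are made up for illustration. The "typical" (geometric mean) likelihood ratio is handled as an expected log2 LR, and the tradeoff rate is the ratio of that quantity to its value for a 2-locus (a SNP).

```python
import math

def typical_log2_lr(k: int) -> float:
    """Expected log2 LR for a stain match at a locus with k equally frequent alleles.

    Assumptions of this sketch: Hardy-Weinberg proportions, and
    LR = 1 / (frequency of the matching genotype).
    """
    p = 1.0 / k
    hom_freq = p * p              # frequency of each homozygous genotype
    het_freq = 2 * p * p          # frequency of each heterozygous genotype
    n_hom = k                     # number of distinct homozygous genotypes
    n_het = k * (k - 1) // 2      # number of distinct heterozygous genotypes
    return (n_hom * hom_freq * math.log2(1 / hom_freq)
            + n_het * het_freq * math.log2(1 / het_freq))

def snps_per_locus(k: int) -> float:
    """How many 2-loci (SNP's) carry the same typical log LR as one k-locus."""
    return typical_log2_lr(k) / typical_log2_lr(2)

for k in (2, 4, 5, 40):
    print(k, round(snps_per_locus(k), 2))
```

Under these assumptions k=40 gives a rate of about 6.4 and k=5 gives about 2.6, in line with the ratios quoted above; k=4 gives a somewhat lower value of about 2.2.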
Paternity casework gives a different set of ratios:
SNP : STR : VNTR-RFLP
  1 : 4 : 12
To replace three VNTR-RFLP's, you need more than 30 SNP's.
Suppose a laboratory does both kinds of casework, at present using a
battery of non-SNP markers. They consider switching to a battery of
SNP's. They determine that the SNP battery will be equal to what they
now get for forensic work. Then, they will be losing performance on
their paternity work. Paternity is "harder" than forensic work for
SNP's.
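The laboratory scenario can be put in numbers using only the two sets of tradeoff ratios quoted above. This is a rough sketch: the battery of three VNTR-RFLP loci is a hypothetical example, the names are invented for illustration, and treating the ratios as simply additive across loci is an assumption.

```python
import math

# SNP-equivalents per marker, copied from the two sets of ratios above.
FORENSIC = {"SNP": 1.0, "STR": 2.6, "VNTR-RFLP": 6.4}
PATERNITY = {"SNP": 1.0, "STR": 4.0, "VNTR-RFLP": 12.0}

def snp_equivalents(battery: dict, rates: dict) -> float:
    """Number of SNP's matching the battery's power for one kind of casework."""
    return sum(count * rates[marker] for marker, count in battery.items())

battery = {"VNTR-RFLP": 3}  # hypothetical current battery

# Size the SNP panel so that forensic performance is unchanged.
panel_size = math.ceil(snp_equivalents(battery, FORENSIC))

# What the same battery is worth, in SNP's, for paternity work.
paternity_need = snp_equivalents(battery, PATERNITY)

print("SNP's for forensic parity:", panel_size)
print("SNP's for paternity parity:", paternity_need)
print("paternity shortfall:", paternity_need - panel_size)
```

With these inputs the forensic-parity panel is 20 SNP's, while paternity parity would take 36, so the switch leaves the paternity work 16 SNP's short.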
I made calculations and comparisons for the following set of problems:
- simple stain matching
- mixture of victim and suspect
- mixture of suspect and an unknown
- true paternity situation: mother, child, and father
- true paternity case with mother not tested
- full sibs present; distinguish them from half-sibs
- half-sibs present; distinguish them from full-sibs
Absolute difficulty of various problems;
loci needed for likelihood ratio=1000
Different question, different order
[Figure. Axes: expected # of loci needed for LR=1000; number of (equi-frequent) alleles per locus.]
Summary: Power of SNP's
- The tradeoff (amount of polymorphism vs. number of loci)
depends on several things, especially
the type of problem (forensic, paternity, sibship, mixture).
- Mixed stain problems, and to a lesser extent motherless
paternity, are "relatively hard" for SNP's: it takes a lot of SNP's per
STR for equivalent performance. But no problem is impossible. Simple
stain matching is not the "relatively easiest" problem for SNP's;
disproving sibship ("non-sib") is easier.
- The problems that are "relatively hard" are not necessarily the
ones that are "hard" in the sense of requiring many loci. Paternity
and sibling problems are "relatively easier" than mixed stain problems,
but they are "harder" in the sense of requiring more loci for equivalent
Without Population Data
Why is a database of size N=0 big enough?
Because you can afford to have a lot of SNP's.
Suppose we have a panel of 100 SNP's for stain matching. Genotypes
are AA, AB, and BB. Suppose a crime is committed in a possibly
highly inbred population for which there are
no population statistics.
Nonetheless, it is reasonable to hope (not assume!) that the
people are not genetic clones, and that the allele frequencies will
generally be in the 30-70% range since the loci were presumably
screened for polymorphism in some population.
Moreover, we assume that the loci have been confirmed to be
selectively neutral, and unlinked.
A 100-locus genotype, matching between suspect and crime stain
Ignore all but the heterozygous loci.
Every heterozygous locus contributes a likelihood ratio > 2
even in a substructured population.
If there are 24 heterozygous loci among the 100, the matching odds
will be > 2^24 or >10 million, practically definitive. Even if only
10 loci are heterozygous, the matching odds are certainly >1000.
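The rule "ignore all but the heterozygous loci" amounts to a one-line bound. A minimal sketch, assuming genotypes are written as two-letter strings such as "AA", "AB", "BB" (a notation chosen here for illustration) and that each heterozygous locus is credited with a factor of exactly 2, the minimum claimed above:

```python
def count_heterozygous(genotypes):
    """Count loci whose two alleles differ, e.g. "AB" but not "AA" or "BB"."""
    return sum(1 for g in genotypes if g[0] != g[1])

def match_odds_lower_bound(genotypes):
    """Conservative matching odds: a factor of 2 per heterozygous locus,
    and no credit at all for homozygous loci."""
    return 2 ** count_heterozygous(genotypes)

# Hypothetical 100-locus matching profile with 24 heterozygous loci.
profile = ["AB"] * 24 + ["AA"] * 40 + ["BB"] * 36
print(match_odds_lower_bound(profile))  # 16777216
```

Homozygous loci are deliberately discarded: without population data their genotype frequency could be close to 1, so they cannot safely be credited with anything.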
Database of size N=0: some details
[I originally discussed this idea in 1997 under the title
"The littlest database".]
(1) Every heterozygous locus contributes a likelihood ratio > 2
even in a substructured population. Proof: Let p and
q=1-p be the allele frequencies. The proportion
H of heterozygotes is 2pq if genes flow freely in
the population, otherwise even less. So H < 2pq =
2[(0.5)^2-(0.5-p)^2] < 0.5, thus
1/H > 2, Q.E.D.
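The algebraic step in the proof, spelled out as a worked derivation. Nothing here goes beyond q = 1-p and completing the square. The inequalities are written non-strictly because that is all the algebra gives in the boundary case.

```latex
\begin{align*}
2pq &= 2p(1-p) = 2\left(p - p^{2}\right) \\
    &= 2\left[\left(\tfrac12\right)^{2} - \left(\tfrac12 - p\right)^{2}\right]
     \le 2\left(\tfrac12\right)^{2} = \tfrac12 , \\
H &\le 2pq \le \tfrac12
  \quad\Longrightarrow\quad \frac{1}{H} \ge 2 .
\end{align*}
```

Equality holds only when p = 1/2 exactly and there is no heterozygote deficit. In every other case the bound is strict, so each heterozygous locus is worth more than a factor of 2.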
(2) 24 heterozygous loci
If all loci have allele frequencies ¼ and ¾, in 99.9% of
cases there will be >23 heterozygous loci out of 100. Even if the
allele frequencies are 0.1 and 0.9, there is a 99% chance of at least 10
heterozygous loci.
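These percentages can be checked with a binomial tail sum. This is a sketch under two assumptions that are not necessarily the author's: heterozygosity is 2pq at every locus (Hardy-Weinberg proportions) and the 100 loci are independent. The function names are illustrative.

```python
from math import comb

def het_rate(p: float) -> float:
    """Probability that a locus is heterozygous under Hardy-Weinberg: 2pq."""
    return 2 * p * (1 - p)

def prob_at_least(n: int, h: float, m: int) -> float:
    """P(X >= m) for X ~ Binomial(n, h): at least m heterozygous loci out of n."""
    return sum(comb(n, x) * h**x * (1 - h)**(n - x) for x in range(m, n + 1))

# Allele frequencies 1/4 and 3/4: chance of more than 23 heterozygous loci.
print(prob_at_least(100, het_rate(0.25), 24))

# Allele frequencies 0.1 and 0.9: chance of at least 10 heterozygous loci.
print(prob_at_least(100, het_rate(0.1), 10))
```

Both tail probabilities should come out close to the figures quoted in the text. In a population with a heterozygote deficit the per-locus rate would be lower than 2pq, so these numbers are optimistic to that extent.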
This work was partially supported by table legs and a chair.